In the last three decades a large number of compiler transformations for optimizing programs have been implemented. Most optimizations for uniprocessors reduce the number of instructions executed by the program using transformations based on the analysis of scalar quantities and data-flow techniques. In contrast, optimizations for high-performance superscalar, vector, and parallel processors maximize parallelism and memory locality with transformations that rely on tracking the properties of arrays using loop dependence analysis.

This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages such as C and Fortran. Transformations for both sequential and various types of parallel architectures are covered in depth. We describe the purpose of each transformation, explain how to determine if it is legal, and give an example of its application.

Programmers wishing to enhance the performance of their code can use this survey to improve their understanding of the optimizations that compilers can perform, or as a reference for techniques to be applied manually. Students can obtain an overview of optimizing compiler technology. Compiler writers can use this survey as a reference for most of the important optimizations developed to date, and as a bibliographic reference for the details of each optimization. Readers are expected to be familiar with modern computer architecture and basic program compilation techniques.
INTRODUCTION
Optimizing compilers have become an essential component of modern high-performance computer systems. In addition to translating the input program into machine language, they analyze it and apply various transformations to reduce its running time or its size.

As optimizing compilers become more effective, programmers can become less concerned about the details of the underlying machine architecture and can employ higher-level, more succinct, and more intuitive programming constructs and program organizations. Simultaneously, hardware designers are able to
This research has been sponsored in part by the Defense Advanced Research Projects Agency (DARPA) under contract DABT63-92-C-0026, by NSF grant CDA-8722788, and by an IBM Resident Study Program Fellowship to David Bacon.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
© 1994 ACM 0360-0300/94/1200-0345 $03.50
to Fortran 90, with minor variations and extensions; most examples use only features found in Fortran 77. We have chosen to use Fortran because it is the de facto standard of the high-performance engineering and scientific computing community. Fortran has also been the input language for a number of research projects studying parallelization [Allen et al. 1988a; Balasundaram et al. 1989; Polychronopoulos et al. 1989]. It was chosen by these projects not only because of its ubiquity among the user community, but also because its lack of pointers and its static memory model make it more amenable to analysis. It is not yet clear what effect the recent introduction of pointers into Fortran 90 will have on optimization.

The optimizations we have presented are not specific to Fortran—in fact, many commercial compilers use the same intermediate language and optimizer for both C and Fortran. The presence of unrestricted pointers in C can reduce opportunities for optimization because it is impossible to determine which variables may be referenced by a pointer. The process of determining which references may point to the same storage locations is called alias analysis [Banning 1979; Choi et al. 1993; Cooper and Kennedy 1989; Landi et al. 1993].

The only changes to Fortran in our examples are that array subscripting is denoted by square brackets to distinguish it from function calls, and we use do all loops to indicate textually that all the iterations of a loop may be executed concurrently. To make the structure of the computation more explicit, we will generally express loops as iterations rather than in Fortran 90 array notation.

We follow the Fortran convention that arrays are stored contiguously in memory in column-major form. If the array is traversed so as to visit consecutive locations in the linear memory, the first (that is, leftmost) subscript varies the fastest. In traversing the two-dimensional array declared as a[n, m], the following array locations are contiguous: a[n-1, 3], a[n, 3], a[1, 4], a[2, 4]. Programmers unfamiliar with Fortran should also note that all arguments are passed by reference.

When describing compilation for vector machines, we sometimes use array notation when the mapping to hardware registers is clear. For instance, the loop

    do all i = 1, 64
        a[i] = a[i] + c
    end do all

could be implemented with a scalar-vector add instruction, assuming a machine with vector registers of length 64. This would be written in Fortran 90 array notation as

    a[1:64] = a[1:64] + c

or in vector machine assembly language as

    LF    F2, c(R30)    ;load c into register F2
    ADDI  R8, R30, #a   ;load addr. of a into R8
    LV    V1, R8        ;load vector a[1:64] into V1
    ADDSV V1, F2, V1    ;add scalar to vector
    SV    V1, R8        ;store vector in a[1:64]

Array notation will not be used when loop bounds are unknown, because there is no longer an obvious correspondence between the source code and the fixed-length vector operations that perform the computation. To use vector operations, the compiler must perform the transformation called strip-mining, which is discussed in Section 6.2.4.

2. TRANSFORMATION ISSUES

For a compiler to apply an optimization to a program, it must do three things:

(1) Decide upon a part of the program to optimize and a particular transformation to apply to it.

(2) Verify that the transformation either does not change the meaning of the program or changes it in a restricted way that is acceptable to the user.

(3) Transform the program.

In this survey we concentrate on the last step: transformation of the program.
However, in Section 5 we introduce dependence analysis techniques that are used for deciding upon and verifying many loop transformations.

Step (1) is the most difficult and poorly understood; consequently compiler design is in many ways a black art. Because analysis is expensive, engineering issues often constrain the optimization strategies that the compiler can perform. Even for relatively straightforward uniprocessor target architectures, it is possible for a sequence of optimizations to slow a program down. For example, an attempt to reduce the number of instructions executed may actually degrade performance by making less efficient use of the cache.

As processor architectures become more complex, (1) the number of dimensions in which optimization is possible increases and (2) the decision process is greatly complicated. Some progress has been made in systematizing the application of certain families of transformations, as discussed in Section 8. However, all optimizing compilers embody a set of heuristic decisions as to the transformation orderings likely to work best for the target machine(s).

…tions in the two executions produces the same result.

Nondeterminism can be introduced by language constructs (such as Ada's select statement), or by calls to system or library routines that return information about the state external to the program (such as Unix time() or read()).

In some cases it is straightforward to cause executions to be identical, for instance by entering the same inputs. In other cases there may be support for determinism in the programming environment—for instance, a compilation option that forces select statements to evaluate their guards in a deterministic order. As a last resort, the programmer may have to replace nondeterministic system calls temporarily with deterministic operations in order to compare two versions.

To illustrate many of the common problems encountered when trying to optimize a program and yet maintain correctness, Figure 1(a) shows a subroutine, and Figure 1(b) is a transformed version of it. The new version may seem to have the same semantics as the original, but it violates Definition 2.1.1 in the following ways:
    subroutine tricky(a, b, n, m, k)
    integer n, m, k
    real a[m], b[m]

    do i = 1, n
        a[i] = b[k] + a[i] + 100000.0
    end do
    return

    (a) original program

    subroutine tricky(a, b, n, m, k)
    integer n, m, k
    real a[m], b[m], C

    C = b[k] + 100000.0
    do i = n, 1, -1
        a[i] = a[i] + C
    end do
    return

    (b) transformed program

    k = m+1
    n = 0
    call tricky(a, b, n, m, k)

    (c) possible call to tricky

● Memory fault. If k > m but n < 1, the reference to b[k] is illegal. The reference would not be evaluated in the original program because the loop body is never executed, but it would occur in the call shown in Figure 1(c) to the transformed code in Figure 1(b).

● Different results. a and b may be completely or partially aliased to one another, changing the values assigned to a in the transformed program. Figure 1(d) shows how this might occur: If the call is to the original subroutine, b[k] is changed when i = 1, since k = n and b[n] is aliased to a[1]. In the transformed version, the old value of b[k] is used for all i, since it is read before the loop. Even if the reference to b[k] were moved back inside the loop, the transformed version would still give different results because the loop traversal order has been reversed.

As a result of these problems, slightly different definitions are used in practice. When bitwise identical results are desired, the following definition is used:
For languages and standards that define a specific semantics for exceptions, meeting Definition 2.1.2 tightly constrains the permissible transformations. If the compiler may assume that exceptions are only generated when the program is semantically incorrect, a transformed program can produce different results when an exception occurs and still be legal. When exceptions have a well-defined semantics, as in the IEEE floating-point standard [American National Standards Institute 1987], many transformations will not be legal under this definition.

However, demanding bitwise identical results may not be necessary. An alternative correctness criterion is:

Definition 2.1.3 A transformation is legal if, for all semantically correct executions of the original program, the original and the transformed programs perform equivalent operations for identical executions. All permutations of semicommutative operations are considered equivalent.

Since we cannot predict the degree to which transformations of semicommutative operations change the output, we must use an operational rather than an observational definition of equivalence. Generally, in practice, programmers observe whether the numeric results differ by more than a certain tolerance, and if they do, force the compiler to employ Definition 2.1.2.

2.2 Scope

Optimizations can be applied to a program at different levels of granularity. As the scope of the transformation is enlarged, the cost of analysis generally increases. Some useful gradations of complexity are:

● Statement. Arithmetic expressions are the main source of potential optimization within a statement.

● Basic block (straight-line code). The advantage for analysis is that there is only one entry point, so control transfer need not be considered in tracking the behavior of the code.

● Innermost loop. To target high-performance architectures effectively, compilers need to focus on loops. Most of the transformations discussed in this survey are based on loop manipulation. Many of them have been studied or widely applied only in the context of the innermost loop.

● Perfect loop nest. A loop nest is a set of loops one inside the next. The nest is called a perfect nest if the body of every loop other than the innermost consists of only the next loop in the nest. Because a perfect nest is more easily summarized and reorganized, several transformations apply only to perfect nests.

● General loop nest. Any loop nesting, perfect or not.

● Procedure. Some optimizations, memory access transformations in particular, yield better improvements if they are applied to an entire procedure at once. The compiler must be able to manage the interactions of all the basic blocks and control transfers within the procedure. The standard and rather confusing term for procedure-level optimization in the literature is global optimization.

● Interprocedural. Considering several procedures together often exposes more opportunities for optimization; in particular, procedure call overhead is often significant, and can sometimes be reduced or eliminated with interprocedural analysis.

We will generally confine our attention to optimizations beyond the basic-block level.
tectures, which for our purposes are superscalar, vector, SIMD, shared-memory multiprocessor, and distributed-memory multiprocessor machines. These architectures have in common that they all use parallelism in some form to improve performance.

The structure of an architecture dictates how the compiler must optimize along a number of different (and sometimes competing) axes. The compiler must attempt to:

● maximize use of computational resources (processors, functional units, vector units),
● minimize the number of operations performed,
● minimize use of memory bandwidth (register, cache, network), and
● minimize the size of total memory required.

While optimization for scalar CPUs has concentrated on minimizing the dynamic instruction count (or more precisely, the number of machine cycles required), CPU-intensive applications are often at least as dependent upon the performance of the memory system as they are on the performance of the functional units.

In particular, the distance in memory between consecutively accessed elements of an array can have a major performance impact. This distance is called the stride. If a loop is accessing every fourth element of an array, it is a stride-4 loop. If every element is accessed in order, it is a stride-1 loop. Stride-1 access is desirable because it maximizes memory locality and therefore the efficiency of the cache, translation lookaside buffer (TLB), and paging systems; it also eliminates bank conflicts on vector machines.

Another key to achieving peak performance is the paired use of multiply and add operations in which the result from the multiplier can be fed into the adder. For instance, the IBM RS/6000 has a multiply-add instruction that uses the multiplier and adder in a pipelined fashion; one multiply-add can be issued each cycle. The Cray Y-MP C90 does not have a single instruction; instead the hardware detects that the result of a vector multiply is used by a vector add, and uses a strategy called chaining in which the results from the multiplier's pipeline are fed directly into the adder. Other compound operations are sometimes implemented as well.

Use of such compound operations may allow an application to run up to twice as fast. It is therefore important to organize the code so as to use multiply-adds wherever possible.

3.1 Characterizing Performance

There are a number of variables that we have used to help describe the performance of the compiled code quantitatively:

● S is the hardware speed of a single processor in operations per second. Typically, speed is measured either in millions of instructions per second (MIPS) or in millions of floating-point operations per second (megaflops).

● P is the number of processors.

● F is the number of operations executed by a program.

● T is the time in seconds to run a program.

● U = F/ST is the utilization of the machine by a program; a utilization of 1 is ideal, but real programs typically have significantly lower utilizations. A superscalar machine can issue several instructions per cycle, but if the program does not contain a mixture of operations that matches the mixture of functional units, utilization will be lower. Similarly, a vector machine cannot be utilized fully unless the sizes of all of the arrays in the program are an exact multiple of the vector length.

● Q measures reuse of operands that are stored in memory. It is the ratio of the number of times the operand is referenced during computation to the number of times it is loaded into a register. As the value of Q rises, more reuse is being achieved, and less memory bandwidth is consumed. Values of Q below
1 indicate that redundant loads are occurring.

● Qc is an analogous quantity that measures reuse of a cache line in memory. It is the ratio of the number of words that are read out of the cache from that particular line to the number of times the cache fetches that line from memory.

3.2 Model Architectures

In the Appendix we present a series of model architectures that we will use throughout this survey to demonstrate the effect of various compiler transformations. The architectures include a superscalar CPU (S-DLX), a vector CPU (V-DLX), a shared-memory multiprocessor (sMX), and a distributed-memory multiprocessor (dMX). We assume that the reader is familiar with basic principles of modern computer architecture, including RISC design, pipelining, caching, and instruction-level parallelism. Our generic architectures are based on DLX, an idealized RISC architecture introduced by Hennessy and Patterson [1990].

4. COMPILER ORGANIZATION

Figure 2 shows the design of a hypothetical compiler for a superscalar or vector machine. It includes most of the general-purpose transformations covered in this survey. Because compilers for parallel machines are still a very active area of research, we have excluded these machines from the design.

The purpose of this compiler design is to give the reader (1) an idea of how the various types of transformations fit together and (2) some of the ordering issues that arise. This organization is by no means definitive: different architectures dictate different designs, and opinions differ on how best to order the transformations.

Optimization takes place in three distinct phases, corresponding to three different representations of the program: high-level intermediate language, low-level intermediate language, and object code. First, optimizations are applied to a high-level intermediate language (HIL) that is semantically very close to the original source program. Because all the semantic information of the source program is available, higher-level transformations are easier to apply. For instance, array references are clearly distinguishable, instead of being a sequence of low-level address calculations.

Next, the program is translated to low-level intermediate form (LIL), essentially an abstract machine language, and optimized at the LIL level. Address computations provide a major source of potential optimization at this level. For instance, two references to a[5, 3] and a[7, 3] both require the computation of the address of column 3 of array a.

Finally, the program is translated to object code, and machine-specific optimizations are performed. Some optimizations, like code collocation, rely on profile information obtained by running the program. These optimizations are often implemented as binary-to-binary translations.

The high-level optimization phase begins by doing all of the large-scale restructuring that will significantly change the organization of the program. This includes various procedure-restructuring optimizations, as well as scalarization, which converts data-parallel constructs into loops. Doing this restructuring at the beginning allows the rest of the compiler to work on individual procedures in relative isolation.

Next, high-level data-flow optimization, partial evaluation, and redundancy elimination are performed. These optimizations simplify the program as much as possible, and remove extraneous code that could reduce the effectiveness of subsequent analysis.

The main focus in compilers for high-performance architectures is loop optimization. A sequence of transformations is performed to convert the loops into a form that is more amenable to optimization: where possible, perfect loop nests are created; subscript expressions are
Figure 2. Compiler organization (figure). The original diagram traces the program from Source Program to Final Optimized Code. The high-level phase comprises high-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, loop unswitching, strength reduction), partial evaluation (algebraic simplification, short-circuiting), redundancy elimination (unreachable-code elimination, useless-code elimination, dead-variable elimination), loop preparation (loop distribution, loop normalization, loop peeling, scalar expansion, forward substitution, array padding, array alignment, idiom and reduction recognition, loop collapsing), and loop reordering (loop interchange, loop skewing, loop reversal, loop tiling, loop coalescing, strip mining, loop unrolling, cycle shrinking), together with tail-recursion elimination; perfect loop nests, induction variables, and dependence vectors appear as supporting analyses. Low-level IL generation into the low-level intermediate language (LIL) is followed by low-level dataflow optimization (constant propagation and folding, copy propagation, common-subexpression elimination, loop-invariant code motion, strength reduction of induction-variable expressions, reassociation, induction-variable elimination), low-level redundancy elimination, cross-call register allocation, assembly-language generation, and micro-optimization (peephole optimization, superoptimization), yielding the executable program.
    do i1 = l1, u1
        ...
        do id = ld, ud
            ...
        end do
        ...
    end do

For a dependence I → J, we define the direction vector W = (w1, ..., wd) where

    wp = "<"  if ip < jp
         "="  if ip = jp
         ">"  if ip > jp.

…are found, the compiler tries to describe them with a direction or distance vector. If the subscript expressions are too complex to analyze, the compiler assumes the statements are fully dependent on one another such that no change in execution order is permitted.
…a loop, but frees the register used by the induction variable. Figure 8(d) shows the result of applying induction variable elimination to the strength-reduced code from the previous section.

    do i = 1, n
        a[i] = a[i] + c
    end do

    (a) original code

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    LI    R9, #1         ;set i (R9) to 1
    ADDI  R12, R30, #a   ;R12=address(a[1])
Loop: MULTI R10, R9, #4  ;R10=i*4
    ADDI  R10, R12, R10  ;R10=address(a[i+1])
    LF    F5, -4(R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    -4(R10), F5    ;store new a[i]
    SLT   R11, R9, R8    ;R11 = i<n?
    ADDI  R9, R9, #1     ;i=i+1
    BNEZ  R11, Loop      ;if i<n, goto Loop

    (b) initial compiled loop

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    LI    R9, #1         ;set i (R9) to 1
    ADDI  R10, R30, #a   ;R10=address(a[1])
Loop: LF    F5, (R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    (R10), F5      ;store new a[i]
    ADDI  R9, R9, #1     ;i=i+1
    ADDI  R10, R10, #4   ;R10=address(a[i+1])
    SLT   R11, R9, R8    ;R11 = i<n?
    BNEZ  R11, Loop      ;if i<n, goto Loop

    (c) after strength reduction—R10 is a running sum instead of being recomputed from R9 and R12

    LF    F4, c(R30)     ;load c into F4
    LW    R8, n(R30)     ;load n into R8
    ADDI  R10, R30, #a   ;R10=address(a[1])
    MULTI R8, R8, #4     ;R8=n*4
    ADDI  R8, R10, R8    ;R8=address(a[n+1])
Loop: LF    F5, (R10)    ;load a[i] into F5
    ADDF  F5, F5, F4     ;a[i]:=a[i]+c
    SF    (R10), F5      ;store new a[i]
    ADDI  R10, R10, #4   ;R10=address(a[i+1])
    SLT   R11, R10, R8   ;R11 = R10<R8?
    BNEZ  R11, Loop      ;if R11, goto Loop

    (d) after elimination of induction variable (R9)

Figure 8. Induction variable optimizations.

6.1.3 Loop-Invariant Code Motion

When a computation appears inside a loop, but its result does not change between iterations, the compiler can move that computation outside the loop [Aho et al. 1986; Cocke and Schwartz 1970].

Code motion can be applied at a high level to expressions in the source code, or at a low level to address computations. The latter is particularly relevant when indexing multidimensional arrays or dereferencing pointers, as when the inner loop of a C program contains an expression like a.b + c.d[i].

    do i = 1, n
        a[i] = a[i] + sqrt(x)
    end do

    (a) original loop

    if (n > 0) C = sqrt(x)
    do i = 1, n
        a[i] = a[i] + C
    end do

    (b) after code motion

Figure 9. Loop-invariant code motion.

Figure 9(a) shows an example in which an expensive transcendental function call is moved outside of the inner loop. The test in the transformed code in Figure 9(b) ensures that if the loop is never executed, the moved code is not executed either, lest it raise an exception.

The precomputed value is generally assigned to a register. If registers are scarce, and the expression moved is inexpensive to compute, code motion may actually deoptimize the code, since register spills will be introduced in the loop.

Although code motion is sometimes referred to as code hoisting, hoisting is a
Loop interchange exchanges the position of two loops in a perfect loop nest, generally moving one of the outer loops to the innermost position [Allen and Kennedy 1984; 1987; Wolfe 1989b]. Interchange is one of the most powerful transformations and can improve performance in many ways.

Loop interchange may be performed to:

● enable vectorization by interchanging an inner, dependent loop with an outer, independent loop;
● improve vectorization by moving the independent loop with the largest range into the innermost position;
● improve parallel performance by moving an independent loop outward in a

Care must be taken that these benefits do not cancel each other out. For instance, an interchange that improves register reuse may change a stride-1 access pattern to a stride-n access pattern with much lower overall performance due to increased cache misses. Figure 12 demonstrates the dramatic effect of different strides on an IBM RS/6000.

In Figure 13(a), the inner loop accesses array a with stride n (recall that we are assuming the Fortran convention of column-major storage order). By interchanging the loops, we convert the inner loop to stride-1 access, as shown in Figure 13(b).

For a large array in which less than one column fits in the cache, this opti-
    do i = 1, n
        do j = 1, n
            total[i] = total[i] + a[i, j]
        end do
    end do

    (a) original loop nest

    do j = 1, n
        do i = 1, n
            total[i] = total[i] + a[i, j]
        end do
    end do

    (b) interchanged loop nest

    do i = 2, n
        do j = 1, n-1
            a[i, j] = a[i-1, j+1]
        end do
    end do

    (a) (shown in the original alongside iteration-space diagrams)

    do i = 1, n
        do j = 1, n
            a[i, j] = b[j, i]
        end do
    end do

    (a) original loop

    do TI = 1, n, 64
        do TJ = 1, n, 64
            do i = TI, min(TI+63, n)
                do j = TJ, min(TJ+63, n)
                    a[i, j] = b[j, i]
                end do
            end do
        end do
    end do

    (b) tiled loop

    do i = 1, n
        a[i] = a[i] + c
        x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do

    (a) original loop

    do all i = 1, n
        a[i] = a[i] + c
    end do all
    do i = 1, n
        x[i+1] = x[i]*7 + x[i+1] + a[i]
    end do

    (b) after loop distribution

Figure 20. Loop distribution.
    do i = 1, n
        a[i] = a[i] + c
    end do

    (a) the initial loop

    1  L: LF   F5, (R10)     ADDI R10, R10, #4   ;load a[i] into F5     ;R10=address(a[i+1])
    3     ADDF F5, F5, F4    SLT  R11, R10, R8   ;a[i]:=a[i]+c          ;R11 = R10<R8?
    6     SF   -4(R10), F5   BNEZ R11, L         ;store new a[i]        ;if R11, goto L

    (b) the compiled loop body for S-DLX. This is the loop from Figure 8(d) after instruction scheduling

    1  L: SF   (R10), F6     ADDF F6, F5, F4     ;store new a[i]        ;a[i+2]:=a[i+2]+c
    2     SF   4(R10), F8    ADDF F8, F7, F4     ;store new a[i+1]      ;a[i+3]:=a[i+3]+c
    3     LF   F5, 16(R10)   ADDI R10, R10, #8   ;load a[i+4] into F5   ;R10=address(a[i+2])
    4     LF   F7, 12(R10)   SLT  R11, R10, R8   ;load a[i+5] into F7   ;R11 = R10<R8?
    5     BNEZ R11, L                            ;if R11, goto L

    (e) after unrolling 2 times and then software pipelining (loop prologue and epilogue omitted)
Figure 24. Loop unrolling vs. software pipelining (diagram: panel (a) depicts loop unrolling and panel (b) software pipelining, each plotting overlapped iterations against time).

    do all i = 1, n
        do all j = 1, m
            a[i, j] = a[i, j] + c
        end do all
    end do all

    (a) original loop

    do all T = 1, n*m
        i = ((T-1) / m) + 1
        j = MOD(T-1, m) + 1
        a[i, j] = a[i, j] + c
    end do all

    (b) coalesced loop

    real TA[n*m]
    equivalence (TA, a)
    do all T = 1, n*m
        TA[T] = TA[T] + c
    end do all

    (c) collapsed loop

Figure 25. Loop coalescing vs. collapsing.

Finally, Figure 23(e) shows the result of combining software pipelining (s = 3) with unrolling (u = 2). The loop takes 5 cycles per iteration, or 2 1/2 cycles per result. Unrolling alone achieves 2 2/3 cycles per result. If software pipelining is combined with unrolling by u = 3, the resulting loop would take 6 cycles per iteration or two cycles per result, which is optimal because only one memory operation can be initiated per cycle.

Figure 24 illustrates the difference between unrolling and software pipelining: unrolling reduces overhead, while pipelining reduces the startup cost of each iteration.

Perfect pipelining combines unrolling by loop quantization with software pipelining [Aiken and Nicolau 1988a; 1988b]. A special case of software pipelining is predictive commoning, which is applied to memory operations. If an array element written in iteration i - 1 is read in iteration i, then the first element is loaded outside of the loop, and each iteration contains one load and one store. The RS/6000 XL C/Fortran compiler [O'Brien et al. 1990] performs this optimization.

If there are no loop-carried dependences, the length of the pipeline is the length of the dependence chain. If there are multiple, independent dependence chains, they can be scheduled together subject to resource availability, or loop distribution can be applied to put each chain into its own loop. The scheduling constraints in the presence of loop-carried dependences and conditionals are more complex; the details are discussed in Aiken and Nicolau [1988a] and Lam [1988].

6.3.3 Loop Coalescing

Coalescing combines a loop nest into a single loop, with the original indices computed from the resulting single induction variable [Polychronopoulos 1987b; 1988]. Coalescing can improve the scheduling of the loop on a parallel machine and may also reduce loop overhead.

In Figure 25(a), for example, if n and m are slightly larger than the number of processors P, then neither of the loops schedules as well as the outer parallel loop, since executing the last n - P iterations will take the same time as the first P. Coalescing the two loops ensures that P iterations can be executed every time except during the last (nm mod P) iterations, as shown in Figure 25(b).

Coalescing itself is always legal since it does not change the iteration order of the loop. The iterations of the coalesced loop may be parallelized if all the original loops were parallelizable. A criterion for
doi=l, n do i = 1, n/2
a[i] = a[i] + c 1 a[i+l] = a[i+l] + a[i]
end do end do
do i = 1, n-3
do i = 2, n+l 2 b[i+l] = b[i+l] + b[i] + a[i+3]
b[i] = a[i-i] * b[i] end do
end do
(a) original loops
(a) original loops
do i = 1, n/2
doi=l, n COBEGIN
a[i] = a[i] + c a[i+l] = a[i+i] + a[i]
end do if (i J 3) then
b[i-2] = b[i-2]+b[i-3]+a[i]
doi=l, n end if
b[i+l] = a[i] * b[i+l] COEND
end do end do
(b) after normalization, the two loops can be do i = n/2-3, n-3
fused b[i+l] = b[i+l] + b[i] + a[i+3]
end do
Figure 27. Loop normalization.
(b) after spreading
dence from S2 to S1 in the fused loop, due to the write to a in S1 and the read of a in S2. By executing the statement S2 three iterations later within the new loop, it is possible to execute the two statements in parallel, which we have indicated textually with the COBEGIN/COEND compound statement in Figure 28(b).
The number of iterations by which the body of the second loop must be delayed is the maximum dependence distance between any statement in the second loop and any statement in the first loop, plus 1. Adding one ensures that there are no dependences within an iteration. For this reason, there must not be any scalar dependences between the two loop bodies.
Unless the loop bodies are large, spreading is primarily beneficial for exposing instruction-level parallelism. Depending on the amount of instruction parallelism achieved, the introduction of a conditional may negate the gain due to spreading. The conditional can be removed by peeling the first few (in this case 3) iterations from the first loop, at the cost of additional loop overhead.

This section describes loop transformations that operate on whole loops and completely alter their structure.

6.4.1 Reduction Recognition

A reduction is an operation that computes a scalar value from an array. Common reductions include computing either the sum or the maximum value of the elements in an array. In Figure 29(a), the sum of the elements of a is accumulated in the scalar s. The dependence vector for the loop is (1), or (<). While a loop with direction vector (<) must normally be executed serially, reductions can be parallelized if the operation performed is associative. Commutativity provides additional opportunities for reordering.
In Figure 29(b), the reduction has been vectorized by using vector adds (the inner doall loop) to compute TS; the final result is computed from TS using a scalar loop. For semicommutative and semiassociative operators like floating-point multiplication, the validity of the transformation depends on the language se-
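The associative-reduction idea can be sketched outside Fortran as well. The following Python fragment (illustrative names, not from the survey) splits a sum reduction into per-strip partial sums, which are mutually independent and could run in parallel, followed by a short serial combine in the style of Figure 29(b):

```python
# Sketch: parallelizing a sum reduction with partial sums.
# Each strip's partial sum is independent of the others, so the strips
# could be computed on different processors; only the final combining
# loop is serial. Correctness relies on + being associative.

def strip_sums(a, strip):
    # Partial sums over contiguous strips of the array (parallelizable).
    return [sum(a[lo:lo + strip]) for lo in range(0, len(a), strip)]

def reduce_sum(a, strip=4):
    ts = strip_sums(a, strip)
    total = 0
    for t in ts:              # short serial combine
        total += t
    return total

a = list(range(1, 17))
print(reduce_sum(a), sum(a))
```

For floating-point operands, this reordering may change the rounded result, which is exactly the language-semantics caveat the text raises.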
6.5.1 Array Padding

Padding is a transformation whereby unused data locations are inserted between the columns of an array or between arrays. Padding is used to ameliorate a particular

    real a[9,512]
    do i = 1, 512
      a[1,i] = a[1,i] + c
    end do

(b) padding a eliminates memory bank conflicts

Figure 31. Array padding.

cache miss jamming in detail and present a unified framework for inter- and intraarray padding to handle set conflicts and jamming.
The disadvantages of padding are that it increases memory consumption and makes the subscript calculations for operations over the whole array more complex, since the array has "holes." In particular, padding reduces the benefits of loop collapsing (see Section 6.3.4).

6.5.2 Scalar Expansion

Loops often contain variables that are used as temporaries within the loop body. Such variables will create an antidependence from S2 to S1 from one iteration to the next, and will have no other loop-carried dependences. Allocating one temporary for each iteration removes the dependence and makes the loop a candidate for parallelization [Padua et al. 1980; Wolfe 1989b], as shown in Figure 32. If the final value of c is used after the loop, c must be assigned the value of T[n].

(b) after scalar expansion

Figure 32. Scalar expansion.

Scalar expansion is a fundamental technique for vectorizing compilers, and was performed by the Burroughs Scientific Processor [Kuck and Stokes 1982] and Cray-1 [Russell 1978] compilers. An alternative for parallel machines is to use private variables, where each processor has its own instance of the variable; these may be introduced by the compiler (see Section 7.1.3) or, if the language supports private variables, by the programmer.
If the compiler vectorizes or parallelizes a loop, scalar expansion must be performed for any compiler-generated temporaries in a loop. To avoid creating unnecessarily large temporary arrays, a vectorizing compiler can perform scalar expansion after strip mining, expanding the temporary to the size of the vector strip.
Scalar expansion can also increase instruction-level parallelism by removing dependences.

6.5.3 Array Contraction

After transformation of a loop nest, it may be possible to contract scalars or arrays that have previously been expanded. It may also be possible to contract other arrays due to interchange or the use of redundant storage allocation by the programmer [Wolfe 1989b].
If the iteration variable of the pth loop in a loop nest is being used to index the kth dimension of an array x, then dimension k may be removed from x if (1) loop p is not parallel, (2) all distance vectors V involving x have Vp = 0, and (3) x is not used subsequently (that is, x is dead after the loop). The latter two conditions are true for compiler-expanded variables unless the loop structure of the program was changed after expansion. In particular, loop distribution can inhibit array contraction by causing the second condition to be violated.

    doall j = 1, n
      T[i, j] = a[i, j]*3
    end do
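As a cross-language sketch of scalar expansion (hypothetical helper names, not from the survey), the loop temporary below is replaced by a per-iteration array, which removes the loop-carried antidependence and leaves every iteration independent:

```python
# Sketch: scalar expansion. The temporary c creates an antidependence
# between iterations (each iteration must read c before the next
# iteration overwrites it); expanding c into an array T removes it.

def original(a, b):
    out = [0] * len(a)
    for i in range(len(a)):
        c = a[i] + 1          # scalar temporary reused every iteration
        out[i] = c * b[i]
    return out

def expanded(a, b):
    n = len(a)
    T = [0] * n               # one temporary per iteration
    out = [0] * n
    for i in range(n):        # iterations are now independent
        T[i] = a[i] + 1
        out[i] = T[i] * b[i]
    # if c's final value were used after the loop, set c = T[n-1] here
    return out

a, b = [1, 2, 3], [4, 5, 6]
print(original(a, b) == expanded(a, b))
```

Array contraction would later shrink T back to a scalar if the parallel structure no longer needs it, per the three conditions above.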
cedures (or procedure groups) with the largest number of calls between them. Within a procedure, basic blocks can be grouped in the same way (although the direction of the control flow must be taken into account), or a top-down algorithm can be used that starts from the procedure entry node. Basic blocks with a frequency estimate of zero can be moved to a separate page to increase locality further. However, accessing that page may require long displacement jumps to be introduced (see the next subsection), creating the potential for performance loss if the basic blocks in question are actually executed.
Procedure inlining (see Section 6.8.5) can also affect code locality, and has been studied both in conjunction with [Hwu and Chang 1989] and independent of [McFarling 1991] code positioning. Inlining often improves performance by reducing overhead and increasing locality, but if a procedure is called more than once in a loop, inlining will often increase the number of cache misses because the procedure body will be loaded more than once.

6.5.6 Displacement Minimization

The target of a branch or a jump is usually specified relative to the current value of the program counter (PC). The largest offset that can be specified varies among architectures; it can be as few as 4 bits. If control is transferred to a location outside of the range of the offset, a multi-instruction sequence or long-format instruction is required to perform the jump. For instance, the S-DLX instruction BEQZ R4, error is only legal if error is within 2^15 bytes. Otherwise, the instruction must be replaced with:

    BNEZ R4, cont       ; reversed test
    LI   R8, error      ; get low bits
    LUI  R8, error>>16  ; get high bits
    JR   R8             ; jump to target
    cont:

This sequence requires three extra instructions. Given the cost of long displacement jumps, the code should be organized to keep related sections close together in memory, in particular those sections executed most frequently [Szymanski 1978].
Displacement minimization can also be applied to data. For instance, a base register may be allocated for a Fortran common block or group of blocks:

    common /big/ q, r, x[20000], y, z

If the array x contains word-sized elements, the common block is larger than the amount of memory indexable by the offset field in the load instruction (2^16 bytes on S-DLX). To address y and z, multiple-instruction sequences must be used in a manner analogous to the long-jump sequences above. The problem is avoided if the layout of big is:

    common /big/ q, r, y, z, x[20000]

6.6 Partial Evaluation

Partial evaluation refers to the general technique of performing part of a computation at compile time. Most of the classical optimizations based on data-flow analysis are either a form of partial evaluation or of redundancy elimination (described in Section 6.7). Loop-based data-flow optimizations are described in Section 6.1.

6.6.1 Constant Propagation

Constant propagation [Callahan et al. 1986; Kildall 1973; Wegman and Zadeck 1991] is one of the most important optimizations that a compiler can perform, and a good optimizing compiler will apply it aggressively. Typically programs contain many constants; by propagating them through the program, the compiler can do a significant amount of precomputation. More important, the propagation reveals many opportunities for other optimizations. In addition to obvious possibilities such as dead-code elimination, loop optimizations are affected because constants often appear in their induction ranges. Knowing the range of the loop, the compiler can be much more accurate

    n = 64
    c = 3
    do i = 1, n
      a[i] = a[i] + c
    end do

(a) original code

    do i = 1, 64
      a[i] = a[i] + 3
    end do

(b) after constant propagation

    t = i*4
    s = t
    print *, a[s]
    r = t
    a[r] = a[r] + c

(a) original code

    t = i*4
    print *, a[t]
    a[t] = a[t] + c

(b) after copy propagation
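A toy version of this propagation step (illustrative code, not the survey's algorithm) can be written as a single forward pass over straight-line statements, substituting known constant values into expressions and folding the results:

```python
# Sketch: constant propagation over straight-line code.
# Statements are (target, expr) pairs; an expression is a variable
# name, an integer literal, or an (op, lhs, rhs) tuple.

def fold(expr, env):
    # Substitute known constants and fold constant subexpressions.
    if isinstance(expr, int):
        return expr
    if isinstance(expr, str):
        return env.get(expr, expr)
    op, l, r = expr
    l, r = fold(l, env), fold(r, env)
    if isinstance(l, int) and isinstance(r, int):
        return l + r if op == "+" else l * r
    return (op, l, r)

def propagate(stmts):
    env, out = {}, []
    for target, expr in stmts:
        val = fold(expr, env)
        if isinstance(val, int):
            env[target] = val        # target is now a known constant
        else:
            env.pop(target, None)    # target is no longer constant
        out.append((target, val))
    return out

prog = [("n", 64), ("c", 3), ("x", ("+", "n", "c"))]
print(propagate(prog))
```

A real implementation works over a control-flow graph and must meet values at join points, but the substitute-and-fold core is the same.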
6.6.5 Reassociation

of polynomials, using the identity

    a_n x^n + a_{n-1} x^{n-1} + ... + a_1 x + a_0
      = (a_n x^{n-1} + a_{n-1} x^{n-2} + ... + a_1) x + a_0

6.6.8 I/O Format Compilation

Most languages provide fairly elaborate facilities for formatting input and output. The formatting specifications are in effect a formatting sublanguage that is generally "interpreted" at run time, with a correspondingly high cost for character input and output.
Formatted writes can be converted almost directly into calls to the run-time routines that implement the various format styles. These calls are then likely candidates for inline substitution. Figure 39 shows two I/O statements and their compiled equivalents. Idiom recognition has been performed to convert the implied do loop into an aggregate operation.

    call putchar(c[i], 6)
    call fgets(d, 100, 7)

(b) after format compilation

Figure 39. Format compilation.

Note that in Fortran, a format statement is analogous to a procedure definition, which may be invoked by any number of read or write statements. The same trade-off as with procedure inlining applies: the formatted I/O can be expanded inline for higher efficiency, or encapsulated as a procedure for code compactness (inlining is described in Section 6.8.5).
Format compilation is done by the VAX Fortran compiler [Harris and Hobbs 1994] and by the Gnu C compiler [Free Software Foundation 1992].

6.6.9 Superoptimizing

A superoptimizer [Massalin 1987] represents the extreme of optimization, seeking to replace a sequence of instructions with the optimal alternative. It does an exhaustive search, beginning with a single instruction. If all single-instruction sequences fail, two-instruction sequences are searched, and so on.
A randomly generated instruction sequence is checked by executing it with a small number of test inputs that were run through the original sequence. If it passes these tests, a thorough verification procedure is applied.
Superoptimization at compile time is practical only for short sequences (on the order of a dozen instructions). It can also be used by the compiler designer to find optimal sequences for commonly occurring idioms. It is particularly useful for eliminating conditional branches in short instruction sequences, where the pipeline stall penalty may be larger than the cost of executing the operations themselves [Granlund and Kenner 1992].

6.7 Redundancy Elimination

There are many optimizations that improve performance by identifying redundant computations and removing them [Morel and Renvoise 1979]. We have already covered one such transformation, loop-invariant code motion, in Section 6.1.3. There a computation was being performed repeatedly when it could be done once.
Redundancy-eliminating transformations remove two other kinds of computations: those that are unreachable and those that are useless. A computation is unreachable if it is never executed; removing it from the program will have no semantic effect on the instructions executed. Unreachable code is created by programmers (most frequently with conditional debugging code), or by transformations.
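The shortest-first exhaustive search is small enough to sketch directly. The following toy superoptimizer (illustrative, far simpler than Massalin's, over an invented three-instruction set) enumerates length-1, then length-2 sequences until one matches a goal function on a set of test inputs:

```python
# Sketch: a toy superoptimizer. Enumerate instruction sequences in
# order of increasing length; return the first sequence that matches
# the goal function on all test inputs. A real superoptimizer would
# then run a thorough verification pass on the candidate.
from itertools import product

OPS = {
    "neg": lambda x: -x,
    "inc": lambda x: x + 1,
    "dbl": lambda x: 2 * x,
}

def run(seq, x):
    for op in seq:
        x = OPS[op](x)
    return x

def superopt(goal, tests, max_len=3):
    for length in range(1, max_len + 1):       # shortest sequences first
        for seq in product(OPS, repeat=length):
            if all(run(seq, t) == goal(t) for t in tests):
                return seq                     # candidate found
    return None

# search for a short sequence computing 2x + 2
print(superopt(lambda x: 2 * x + 2, [0, 1, 5, -3]))
```

The cost of the search grows exponentially with sequence length, which is why the technique is practical only for roughly a dozen instructions.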
Both unreachable and useless code are often created by constant propagation, described in Section 6.6.1. In Figure 40(a), the variable debug is a constant. When its value is propagated, the conditional expression becomes if (0 > 1). This expression is always false, so the body of the conditional is never executed and can be eliminated, as shown in Figure 40(b). Similarly, the body of the do loop is never

    end do

(b) after constant propagation and unreachable code elimination

    integer c, n, debug
    debug = 0
    n = 0
    a = b+7
    call foo(a)

(c) after redundant control elimination
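A minimal sketch of the Figure 40 pipeline (over an invented toy IR, not the survey's representation) shows how a propagated constant folds a branch condition and lets the unreachable branch body be deleted:

```python
# Sketch: unreachable-code elimination after constant propagation.
# Statements: ("assign", var, const) or ("if_gt", var, k, body).
# Once a guard variable's value is known, the branch either folds
# away entirely (condition false) or is replaced by its body.

def eliminate_unreachable(stmts):
    env, out = {}, []
    for s in stmts:
        if s[0] == "assign":
            env[s[1]] = s[2]
            out.append(s)
        elif s[0] == "if_gt":
            _, var, k, body = s
            if var in env:               # condition is a known constant
                if env[var] > k:
                    out.extend(body)     # always taken: keep only body
                # always false: drop the branch and its body
            else:
                out.append(s)            # condition unknown: keep as-is
    return out

prog = [("assign", "debug", 0),
        ("if_gt", "debug", 1, [("assign", "c", 99)])]
print(eliminate_unreachable(prog))
```

As in the text, the `if (0 > 1)` guard is decided at compile time, so its body never reaches the output program.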
ment is not necessary, it can remove the code. This can be done for any nonglobal variable that is not live immediately after the defining statement. Live-variable analysis is a well-known data-flow problem [Aho et al. 1986]. In Figure 40(c), the values computed by the assignment statements are no longer used; they have been eliminated in (d).

6.7.3 Dead-Variable Elimination

After a series of transformations, particularly loop optimizations, there are often variables whose value is never used. The unnecessary variables are called dead variables; eliminating them is a common optimization [Aho et al. 1986].
In Figure 40(d), the variables c, n, and debug are no longer used and can be removed; (e) shows the code after the variables are pruned.

the first operand [Arden et al. 1962]. For example, the control expression in

    if ((a = 1) and (b = 2)) then
      c = 5
    end if

is known to be false if a does not equal 1, regardless of the value of b. Short circuiting would convert this expression to:

    if (not (a = 1)) goto 10
    if (not (b = 2)) goto 10
    c = 5
 10 continue

Note that if any of the operands in the boolean expression have side-effects, short circuiting can change the results of the evaluation. The alteration may or may not be legal, depending on the language semantics. Some language definitions, including C and many dialects of Lisp, address this problem by requiring short circuiting of boolean expressions.
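The side-effect caveat can be observed directly in a language whose `and` already short-circuits. In this Python sketch (illustrative), the second operand records a side effect; it runs under eager evaluation but is skipped when the first operand already decides the result:

```python
# Sketch: short circuiting vs. eager evaluation of a boolean AND.
# If the second operand has a side effect, short circuiting changes
# observable behavior whenever the first operand is false.

calls = []

def b_equals_2(b):
    calls.append("evaluated b")     # side effect: record the call
    return b == 2

def eager_and(x, y):
    return bool(x) and bool(y)      # both operands already evaluated

a, b = 0, 2                         # first operand (a == 1) is false

# eager: both operands are computed before the AND is applied
eager = eager_and(a == 1, b_equals_2(b))

# short circuit: the second operand is never evaluated
short = (a == 1) and b_equals_2(b)

print(eager, short, calls)          # same result, different side effects
```

Both forms yield the same boolean here, but the side-effect log differs, which is exactly why the legality of introducing short circuiting depends on the language semantics.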
6.8 Procedure Call Transformations
Number      Usage
R0          Always zero; writes are ignored
R1          Return value when returning from a procedure call
R2..R7      The first six words of the arguments to the procedure call
R8..R15     8 caller-save registers, used as temporary registers by callee
R16..R25    10 callee-save registers; these must be preserved across a call
R26..R29    Reserved for use by the operating system
R30         Stack pointer
R31         Return address during a procedure call
F0..F3      The first four floating-point arguments to the procedure call
F4..F17     14 caller-save floating-point registers
F18..F31    14 callee-save floating-point registers
The value R30 + s serves as a virtual frame pointer that points to the base of the stack frame, avoiding the use of a second dedicated register. For languages that cannot predict the amount of stack space used during execution of a procedure, an additional general-purpose register can be used as a frame pointer, or the size of the frame can be stored at the top of the stack (R30 + 0).
On entering a procedure, the return address is in R31. The first six words of the procedure arguments appear in registers R2-R7, and the rest of the argument data is on the stack. Figure 41 shows the layout of the stack frame for a procedure invocation. A similar convention is followed for floating-point registers, except that only four are reserved for arguments.
Execution of a procedure consists of six steps:

(1) Space is allocated on the stack for the procedure invocation.
(2) The values of registers that will be modified during procedure execution (and that must be preserved across the call) are saved on the stack. If the procedure makes any procedure calls itself, the saved registers should include the return address, R31.
(3) The procedure body is executed.
(4) The return value (if any) is stored in R1, and the registers that were saved in step 2 are restored.
(5) The frame is removed from the stack.
(6) Control is transferred to the return address.

Calling a procedure is a four-step process:

(1) The values of any of the registers R1-R15 that contain live values are saved. If the values of any global variables that might be used by the callee are in a register and have been modified, the copy of those variables in memory is updated.
(2) The arguments are stored in the designated registers and, if necessary, on the stack.
(3) A linked jump is made to the target procedure; the CPU leaves the address of the next instruction in R31.
(4) Upon return, the saved registers are restored, and the registers holding global variables are reloaded.

To demonstrate the structure of a procedure and the calling convention, Figure 42 shows a simple function and its compiled code. The function (foo) and the function that it calls (max) each take two integer arguments, so they do not need to pass arguments on the stack. The stack frame for foo is three words, which are used to save the return address (R31) and register R16 during the call to max, and to hold the local variable d. R31
Figure 41. Stack frame layout: high addresses hold the incoming arguments (arg n ... arg 1), followed by locals and temporaries and then the saved registers; the frame size spans down to low addresses, where R30 points.
    integer c, b
    d = c+b
    e = max(b, d)
    return
    end

(a) source code for function foo

    integer x, y
    max = x
    else
    end if
    return
    end

(a) source code for function max

similar effect, and does not require inter-

    recursive logical function Inarray(a, x, i, n)
Serial Decomposition

Serial decomposition is the degenerate case, where an entire dimension is allocated to a single processor. Figure 52(a) shows a simple loop nest that adds c to each element of a two-dimensional array, the decomposition of the array onto processors (b), and the loop transformed into code for sMX (c). Each column is mapped serially, meaning that all the data in that column will be placed on a single processor. As a result, the loop that iterates over the columns (the inner loop) is unchanged.

Block Decomposition

A block decomposition divides the elements into one group of adjacent elements per processor. Using the previous example, the rows of array a are divided between the four processors, as shown in Figure 52(b). Because the rows have been divided across the processors, the outer loop in Figure 52(c) has been modified so that each processor iterates only over the local portion of each row. Figure 52(d) shows the decomposition if block scheduling is applied across both the horizontal and vertical dimensions.

(b) (block, serial) decomposition of a on four processors

    call FORK(Pnum)
    do j = (n/Pnum)*Pid+1, min((n/Pnum)*(Pid+1), n)
      do i = 1, n
        a[i, j] = a[i, j] + c
      end do
    end do
    call JOIN()

(c) corresponding (block, serial) scheduling of the loop, using block size n/Pnum

(d) (block, block) decomposition of a on four processors

Figure 52. Serial and block decompositions of an array on the shared-memory multiprocessor sMX.

A major difference between compilation for shared- and distributed-memory machines is that shared-memory machines provide a global name space, so that any particular array element can be named by any processor using the same subscripts as in the sequential code. On the other hand, on a distributed-memory machine, arrays must usually be divided across the processors, each of which has its own private address space.
As a result, naming a nonlocal array element requires a mapping function from the original, global name space to a processor number and local array indices within that processor. The examples shown here are all for shared memory, and therefore sidestep these complexities, which are discussed in Section 7.3. However, the same decomposition principles are used for distributed memory.
    do j = 1, n
      do i = 1, j
        total = total + a[i, j]
      end do
    end do

(a) loop

(b) elements of a read by the loop

    call FORK(Pnum)
    do j = Pid+1, n, Pnum
      do i = 1, n
        a[i, j] = a[i, j] + c
      end do
    end do
    call JOIN()
7.2 Exposing Coarse-Grained Parallelism

Transferring data can be an expensive operation on a machine with asynchronously executing processors. One way to avoid the inefficiency of frequent communication is to divide the program into subcomputations that will execute for an extended period of time without generating message traffic. The following optimizations are designed to identify such computations or to help the compiler create them.

7.2.1 Procedure Call Parallelization

One potential source of coarse-grained parallelism in imperative languages is the procedure call. In order to determine whether a call can be performed as an independent task, the compiler must use interprocedural analysis to examine its behavior.
Because of its importance to all forms of optimization, interprocedural analysis has received a great deal of attention from compiler researchers. One approach is to propagate data-flow and dependence information through call sites [Banning 1979; Burke and Cytron 1986; Callahan et al. 1986; Myers 1981]. Another is to summarize the array access behavior of a procedure and use that summary to evaluate whether a call to it can be executed in parallel [Balasundaram 1990; Callahan and Kennedy 1988a; Li and Yew 1988; Triolet et al. 1986].

7.2.2 Split

A more comprehensive approach to program decomposition summarizes the data usage behavior of an arbitrary block of code in a symbolic descriptor [Graham et al. 1993]. As in the previous section, the summary identifies independence between computations: between procedure calls or between different loop nests. Additionally, the descriptor is used in applying the split transformation.
Split is unusual in that it is not applied in isolation to a computation like a loop nest. Instead, split considers the relationship of pairs of computations. Two loop nests, for example, may have dependences between them that prevent the compiler from executing both simultaneously on different processors. However, it is often the case that only some of the iterations conflict. Split uses memory summarization to identify the iterations in one loop nest that are independent of the other one, placing those iterations in a separate loop. Applying split to an application exposes additional sources of concurrency and pipelining.

7.2.3 Graph Partitioning

Data-flow languages [Ackerman 1982; McGraw 1985; Nikhil 1988] expose parallelism explicitly. The program is converted into a graph that represents basic computations as nodes and the movement of data as arcs between nodes. Data-flow graphs can also be used as an internal representation to express the parallel structure of programs written in imperative languages.
One of the major difficulties with data-flow languages is that they expose parallelism at the level of individual arithmetic operations. As it is impractical to use software to schedule such a small amount of work, early projects focused on developing architectures that embed data-flow execution policies into hardware [Arvind and Culler 1986; Arvind et al. 1980; Dennis 1980]. Such machines have not proven to be successful commercially, so researchers began to develop techniques for compiling data-flow languages on conventional architectures. The most common approach is to interpret the data-flow graph dynamically, executing a node representing a computation when all of its operands are available. To reduce scheduling overhead, the data-flow graph is generally transformed by gathering many simple computations into larger blocks that are executed atomically [Anderson and Hudak 1990; Hudak and Goldberg 1985; Mirchandaney et al. 1988; Sarkar 1989].

7.3 Computation Partitioning

In addition to decomposing data as discussed in Section 7.1, parallel compilers
ber of processors, and the Pid of the processor.
We have made a number of simplifying assumptions: the value of n is an even multiple of Pnum, and the arrays are already distributed across the processors with a block decomposition. We have also chosen an example that does not require interprocessor communication. To parallelize the loops encountered in practice,

(c) after guard combination

    LB = (n/Pnum)*Pid + 1
    UB = (n/Pnum)*(Pid + 1)
    do i = LB, UB
      a[i] = a[i] + c
      b[i] = b[i] + c
    end do

(d) after bounds reduction
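The effect of replacing a per-iteration guard with reduced loop bounds can be sketched as follows (illustrative Python, under the same block-decomposition assumptions as the text):

```python
# Sketch: bounds reduction. The guarded loop tests every iteration;
# because the iterations a processor owns form one contiguous range,
# the guard can be folded into the loop bounds instead.

def guarded(a, c, pid, pnum):
    size = len(a) // pnum
    for i in range(len(a)):                        # full range, guard inside
        if size * pid <= i < size * (pid + 1):
            a[i] += c

def bounds_reduced(a, c, pid, pnum):
    size = len(a) // pnum
    for i in range(size * pid, size * (pid + 1)):  # guard removed
        a[i] += c

x = [0] * 8
y = [0] * 8
for p in range(4):
    guarded(x, 1, p, 4)
    bounds_reduced(y, 1, p, 4)
print(x == y)
```

The bounds-reduced version does the same updates while executing only the iterations it owns, eliminating both the wasted iterations and the per-iteration test.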
by hoisting them to the earliest point where they can be correctly computed. Hoisting often reveals that identical guards have been introduced and that all but one can be removed. In Figure 58(b), the two guard expressions are identical because the arrays have the same decomposition. When the guards are hoisted to the beginning of the loop, the compiler can eliminate the redundant guard. The second pair of bound computations becomes useless and is also removed. The final result is shown in Figure 58(c).

7.3.3 Bounds Reduction

In Figure 58(c), the guards control which iterations of the loop perform computation. Since the desired set of iterations is a contiguous range, the compiler can achieve the same effect by changing the induction expressions to reduce the loop bounds [Koelbel 1990]. Figure 58(d) shows the result of the transformation.

7.4 Communication Optimization

An important part of compilation for distributed-memory machines is the analysis of an application's communication needs and introduction of explicit message-passing operations into it. Before a statement is executed on a given processor, any data it relies on that is not available locally must be sent. A simple-minded approach is to generate a message for each nonlocal data item referenced by the statement, but the resulting overhead will usually be unacceptably high. Compiler designers have developed a set of optimizations that reduce the time required for communication.
The same fundamental issues arise in communication optimization as in optimization for memory access (described in Section 6.5): maximizing reuse, minimizing the working set, and making use of available parallelism in the communication system. However, the problems are magnified because while the ratio of one memory access to one computation on S-DLX is 16:1, the ratio of the number of cycles required to send one word to the number of cycles required for a single operation on dMX is about 500:1.
Like vectors on vector processors, communication operations on a distributed-memory machine are characterized by a startup time, ts, and a per-element cost, tb, the time required to send one byte once the message has been initiated. On dMX, ts = 10 μs and tb = 100 ns, so sending one 10-byte message costs 11 μs while sending two 5-byte messages costs 21 μs. To avoid paying the startup cost unnecessarily, the optimizations in this section combine data from multiple messages and send them in a single operation.

7.4.1 Message Vectorization

Analysis can often determine the set of data items transferred in a loop. Rather than sending each element of an array in an individual message, the compiler can group many of them together and send them in a single block transfer. Because this is analogous to the way a vector processor interacts with memory, the optimization is called message vectorization [Balasundaram et al. 1990; Gerndt 1990; Zima et al. 1988].
Figure 59(a) shows a sample loop that multiplies each element of array a by the mirror element of array b. Figure 59(b) is an inefficient parallel version for processors 0 and 1 on dMX. To simplify the code, we assume that the arrays have already been allocated to the processors using a block decomposition: the lower half of each array is on processor 0 and the upper half on processor 1.
Each processor begins by computing the lower and upper bounds of the range that it is responsible for. During each iteration, it sends the element of b that the other processor will need and waits to receive the corresponding message. Note that Fortran's call-by-reference semantics convert the array reference implicitly into the address of the corresponding element, which is then used by the low-level communication routine to extract the bytes to be sent or received.
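The startup-versus-bandwidth trade-off quoted for dMX follows from a two-parameter linear cost model (constants from the text; the helper name is illustrative):

```python
# Sketch: linear communication cost model t = ts + n*tb per message.
# With dMX's ts = 10 us startup and tb = 0.1 us/byte, one 10-byte
# message is far cheaper than two 5-byte messages, which is the
# arithmetic motivating message vectorization (batching elements
# into a single block transfer).

TS = 10.0    # startup cost per message, microseconds
TB = 0.1     # per-byte cost, microseconds (100 ns)

def cost(num_messages, bytes_each):
    return num_messages * (TS + bytes_each * TB)

one_big = cost(1, 10)     # one vectorized 10-byte transfer
two_small = cost(2, 5)    # two element-by-element transfers
print(one_big, two_small)
```

Because startup dominates for small messages, the savings from vectorizing grow with the number of elements batched per transfer.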
Wholey [1992] begins by computing a detailed cost function for the computation. The function is the input to a hill-climbing algorithm that searches the space of possible allocations of data to processors. It begins with two processors, then four, and so forth; each time the number of processors is doubled, the algorithm uses the most efficient allocation from the previous stage as a starting point.
In addition to general-purpose mapping algorithms, the compiler can use specific transformations like loop flattening [von Hanxleden and Kennedy 1992] to avoid idle time due to nonuniform computations.

7.6 VLIW Transformations

The Very Long Instruction Word (VLIW) architecture is another strategy for executing instructions in parallel. A VLIW processor [Colwell et al. 1988; Fisher et al. 1984; Floating Point Systems 1979; Rau et al. 1989] exposes its multiple functional units explicitly; each machine instruction is very large (on the order of 512 bits) and controls many independent functional units. Typically the instruction is divided into 32-bit pieces, each of which is routed to one of the functional units.
With traditional processors, there is a clear distinction between low-level scheduling done via instruction selection and high-level scheduling between processors. This distinction is blurred in VLIW architectures. They rely on the compiler to expose the parallel operations explicitly and schedule them at compile time. Thus the task of generating object code incorporates high-level resource allocation and scheduling decisions that occur only between processors on a more conventional parallel architecture.
Trace scheduling [Ellis 1986; Fisher 1981; Fisher et al. 1984] is an analysis strategy that was developed for VLIW architectures. Because VLIW machines require more parallelism than is typically available within a basic block, compilers for these machines must look through branches to find additional instructions to execute. The multiple paths of control require the introduction of speculative execution: computing some results that may turn out to be unused (ideally at no additional cost, by functional units that would otherwise have gone unused).
The speculation can be handled dynamically by the hardware [Sohi and Vajapeyam 1990], but that complicates the design significantly. The trace-scheduling alternative is to have the compiler do it by identifying paths (or traces) through the control-flow graph that might be taken. The compiler guesses which way the branches are likely to go and hoists code above them accordingly. Superblock scheduling [Hwu et al. 1993] is an extension of trace scheduling that relies on program profile information to choose the traces.

8. TRANSFORMATION FRAMEWORKS

Given the many transformations that compiler writers have available, they face a daunting task in determining which ones to apply and in what order. There is no single best order of application; one transformation can permit or prevent a second from being applied, or it can change the effectiveness of subsequent changes. Current compilers generally use a combination of heuristics and a partially fixed order of applying transformations.
There are two basic approaches to this problem: unifying the transformations in a single mechanism, and applying search techniques to the transformation space. In fact there is often a degree of overlap between these two approaches.

8.1 Unified Transformation

A promising strategy is to encode both the characteristics of the code being transformed and the effect of each transformation; then the compiler can quickly search the space of possible sets of transformations to find an efficient solution.
One unified transformation framework is based on unimodular matrix theory [Banerjee 1991; Wolf and Lam 1991]. It is applicable to any loop nest whose dependences can be described with a distance vector; a subset of the loops that require a direction vector can also be handled. The transformations that a unimodular matrix can describe are interchange, reversal, and skew.

The basic principle is to encode each transformation of the loop in a matrix and apply it to the dependence vectors of the loop. The effect on the dependence pattern of applying the transformation can be determined by multiplying the matrix and the vector. The form of the product vector reveals whether the transformation is valid. To model the effect of applying a sequence of transformations, the corresponding matrices are simply multiplied.

Figure 60 shows a loop nest that can be transformed with unimodular matrices. The distance vector that describes the loop is D = (1, 0), representing the dependence of iteration i on i-1 in the outer loop.

do i = 1, n
   do j = 1, n
      a[i, j] = a[i-1, j] + a[i, j]
   end do
end do

Figure 60. Unimodular transformation example.

Because of the dependence, it is not legal to reverse the outer loop of the nest. The reversal transformation is

    R = [ -1  0 ]
        [  0  1 ]

The product is

    RD = [ -1 ]
         [  0 ]

Because the product vector is no longer lexicographically positive, the transformation is not legal.

Any loop nest whose dependences are all representable by a distance vector can be transformed via skewing into a canonical form called a fully permutable loop nest. In this form, any two loops in the nest can be interchanged without changing the loop semantics. Once in this canonical form, the compiler can decompose the loop into the granularity that matches the target architecture [Wolf and Lam 1991].

Sarkar and Thekkath [1992] describe a framework for transforming perfect loop nests that includes unimodular transformations, tiling, coalescing, and parallel loop execution. The transformations are encoded in an ordered sequence. Rules are provided for mapping the dependence vectors and loop bounds, and the transformation to the loop body is described by a template.

Pugh [1991] describes a more ambitious (and time-consuming) technique that can transform imperfectly nested loops and can do most of the transformations possible through a combination of statement reordering, interchange, fusion, skewing, reversal, distribution, and parallelization. It views the transformation problem as that of finding the best schedule for a set of operations in a loop nest. A method is given for generating
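The mechanics of the unimodular legality test described above can be sketched in a few lines. The following is an illustration added for this survey, not code from any of the frameworks being discussed; the matrices and the distance vector follow the text, and a transformation is legal only if every transformed distance vector remains lexicographically positive.

```python
# Sketch of the unimodular legality test: apply a transformation
# matrix T to each loop-carried distance vector; the transformed
# nest is legal only if every product vector is still
# lexicographically positive.

def transform(T, d):
    """Multiply transformation matrix T by distance vector d."""
    return tuple(sum(row[k] * d[k] for k in range(len(d))) for row in T)

def lex_positive(v):
    """True if the first nonzero entry of v is positive."""
    for x in v:
        if x != 0:
            return x > 0
    return False

def legal(T, distances):
    return all(lex_positive(transform(T, d)) for d in distances)

# The loop nest of Figure 60 has the single distance vector D = (1, 0).
D = [(1, 0)]

REVERSAL    = [[-1, 0], [0, 1]]   # reverse the outer loop
INTERCHANGE = [[0, 1], [1, 0]]    # swap the two loops
SKEW        = [[1, 0], [1, 1]]    # skew the inner loop by the outer index

print(legal(REVERSAL, D))      # prints False: reversing the outer loop is illegal
print(legal(INTERCHANGE, D))   # prints True: interchange preserves the dependence
```

Composing a sequence of transformations corresponds, as the text notes, to multiplying their matrices before applying the test.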
terconnect. The heuristics are organized hierarchically. The main functions are: description of parallelism in the program and in the machine; matching of program parallelism to machine parallelism; and control of restructuring.

9. COMPILER EVALUATION

Researchers are still trying to find a good way to evaluate the effectiveness of compilers. There is no generally agreed upon way to determine the best possible performance of a particular program on a particular machine, so it is difficult to determine how well a compiler is doing. Since some applications are better structured than others for a given architecture or a given compiler, measurements for a particular application or group of applications will not necessarily predict how well another application will fare.

Nevertheless, a wide variety of measurement studies do exist that seek to evaluate how applications behave, how well they are being compiled, and how well they could be compiled. We have divided these studies into several groups.

9.1 Benchmarks

Benchmarks have received by far the most attention since they measure delivered performance, yielding results that are used to market machines. They were originally developed to measure machine speed, not compiler effectiveness. The Livermore Loops [McMahon 1986] is one of the early benchmark suites; it sought to compare the performance of supercomputers. The suite consists of a set of small loops based on the most time-consuming inner loops of scientific codes.

The SPEC benchmark suite [Dixit 1992] includes both scientific and general-purpose applications intended to be representative of an engineering/scientific workload. The SPEC benchmarks are widely used as indicators of machine performance, but are essentially uniprocessor benchmarks.

As architectures have become more complex, it has become obvious that the benchmarks measure the combined effectiveness of the compiler and the target machine. Thus two compilers that target the same machine can be compared by using the SPEC ratings of the generated code.

For parallel machines, the contribution of the compiler is even more important, since the difference between naive and optimized code can be many orders of magnitude. Parallel benchmarks include SPLASH (Stanford Parallel Applications for Shared Memory) [Singh et al. 1992] and the NASA Numerical Aerodynamic Simulation (NAS) benchmarks [Bailey et al. 1991]. The Perfect Club [Berry et al. 1989] is a benchmark suite of computationally intensive programs that is intended to help evaluate serial and parallel machines.

9.2 Code Characteristics

A number of studies have focused on the applications themselves, examining their source code for programmer idioms or profiling the behavior of the compiled executable.

Knuth [1971] carried out an early and influential study of Fortran programs. He studied 440 programs comprising 250,000 lines (punched cards). The most important effect of this study was to dramatize the fact that the majority of the execution time of a program is usually spent in a very small proportion of the code. Other interesting statistics are that 95% of all the do loops incremented their index variable by 1, and 40% of all do loops contained only one statement.

Shen et al. [1990] examined the subscript expressions that appeared in a set of Fortran mathematical libraries and scientific applications. They applied various dependence tests to these expressions. The results demonstrate how the tests compare to one another, showing that one of the biggest problems was unknown variables. These variables were caused by procedure calls and by the use of an array element as an index value into another array. Coupled subscripts also caused problems for tests that examine a single array dimension at a time (coupled subscripts are discussed in Section 5.4).

A study by Petersen and Padua [1993] evaluates approximate dependence tests against the omega test [Pugh 1992], finding that the latter does not expose substantially more parallelism in the Perfect Club benchmarks.

9.3 Compiler Effectiveness

As we mentioned above, researchers find it difficult to evaluate how well a compiler is doing. They have come up with four approaches to the problem:

● examine the compiler output by hand to evaluate its ability to transform code;
● measure the performance of one compiler against another;
● compare the performance of executables compiled with full optimization against little or no optimization; and
● compare an application compiled for parallel execution against the sequential version running on one processor.

Nobayashi and Eoyang [1989] compare the performance of supercomputer compilers from Cray, Fujitsu, Alliant, Ardent, and NEC. The compilers were applied to various loops from the Livermore and Argonne test suites that required restructuring before they could be computed with vector operations.

Carr [1993] applies a set of loop transformations (unroll-and-jam, loop interchange, tiling, and scalar replacement) to various scientific applications to find a strategy that optimizes cache utilization. Wolf [1992] uses inner loops from the Perfect Club and from one of the SPEC benchmarks to compare the performance of hand-optimized to compiler-optimized code. The focus is on improving locality to reduce traffic between memory and the cache.

Relatively few studies have been performed to test the effectiveness of real compilers in parallelizing real programs. However, the results of those that have are not encouraging. One study of four Perfect benchmark programs compiled on the Alliant FX/8 produced speedups between 0.9 (that is, a slowdown) and 2.36 out of a potential 32; when the applications were tuned by hand, the speedups ranged from 5.1 to 13.2 [Eigenmann et al. 1991]. In another study of 12 Perfect benchmarks compiled with the KAP compiler for a simulated machine with unlimited parallelism, 7 applications had speedups of 1.4 or less; two applications had speedups of 2-4; and the rest were sped up 10.3, 66, and 77. All but three of these applications could have been sped up by a factor of 10 or more [Padua and Petersen 1992].

Lee et al. [1985] study the ability of the Parafrase compiler to parallelize a mixture of small programs written in Fortran. Before compilation, while loops were converted to do loops, and code for handling error conditions was removed. With 32 processors available, 4 out of 15 applications achieved 30% efficiency, and 2 achieved 10% efficiency; the other 9 out of 15 achieved less than 10% efficiency. Out of 89 loops, 59 were parallelized, most of them loops that initialized array data structures. Some coding idioms that are amenable to improved analysis were identified.

Some preliminary results have also been reported for the Fortran D compiler [Hiranandani et al. 1993] on the Intel iPSC/860 and Thinking Machines CM-5 architectures.

9.4 Instruction-Level Parallelism

In order to evaluate the potential gain from instruction-level parallelism, researchers have engaged in a number of studies to measure an upper bound on how much parallelism is available when an unlimited number of functional units is assumed. Some of these studies are discussed and evaluated in detail by Rau and Fisher [1993].

Early studies [Tjaden and Flynn 1970] were pessimistic in their findings, measuring a maximum level of parallelism on the order of two or three—a result that was called the Flynn bottleneck.
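The limit studies discussed in this section share a common measurement: with unlimited functional units, the available parallelism of an instruction trace is the number of operations divided by the length of the longest dependence chain (the depth of an ideal schedule). The following sketch is illustrative only and is not drawn from any of the studies cited; the trace format and operation names are hypothetical.

```python
# Hedged sketch of how ILP limit studies estimate available
# parallelism: with unlimited functional units and perfect branch
# knowledge, every operation issues as soon as its inputs are ready,
# so the ideal schedule depth equals the longest dependence chain.

def available_parallelism(trace):
    """trace: list of (op_id, [ids of operations this op depends on]),
    given in program order."""
    level = {}  # earliest cycle in which each operation can issue
    for op, deps in trace:
        level[op] = 1 + max((level[d] for d in deps), default=0)
    depth = max(level.values())        # critical-path length
    return len(trace) / depth

# A tiny hypothetical trace: four independent multiplies feeding
# a two-level sum tree.
trace = [
    ("m0", []), ("m1", []), ("m2", []), ("m3", []),
    ("a0", ["m0", "m1"]), ("a1", ["m2", "m3"]),
    ("a2", ["a0", "a1"]),
]
print(available_parallelism(trace))  # 7 operations over a critical path of 3
```

Restricting the window to a single basic block amounts to cutting the trace at branches before taking this ratio, which is why the early studies reported such low numbers.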
The main reason for the low numbers was that these studies did not look beyond basic blocks.

Parallelism can be exploited across basic block boundaries, however, on machines that use speculative execution. Instead of waiting until the outcome of a conditional branch is known, these architectures begin executing the instructions at either or both potential branch targets; when the conditional is evaluated, any computations that are rendered invalid or useless must be discarded.

When the basic block restriction is relaxed, there is much more parallelism available. Riseman and Foster [1972] assumed replicated hardware to support speculative execution. They found that there was significant additional parallelism available, but that exploiting it would require a great deal of hardware. For the programs studied, on the order of 2^16 arithmetic units would be necessary to expose 10-way parallelism.

A subsequent study by Nicolau and Fisher [1984] sought to measure the parallelism that could be exploited by a VLIW architecture using aggressive compiler analysis. The strategy was to compile a set of scientific programs into an intermediate form, simulate execution, and apply a scheduling algorithm retroactively that used branch and address information revealed by the simulation. The study revealed significant amounts of parallelism—a factor of tens or hundreds.

Wall [1991] took trace data from a number of applications and measured the amount of parallelism under a variety of models. His study considered the effect of architectural features such as varying numbers of registers, branch prediction, and jump prediction. It also considered the effects of alias analysis in the compiler. The results varied widely, yielding as much as 60-way parallelism under the most optimistic assumptions. The average parallelism on an ambitious combination of architecture and compiler support was 7.

Butler et al. [1991] used an optimizing compiler on a group of programs from the SPEC89 suite. The resulting code was executed on a variety of simulated machines. The range includes unrealistically omniscient architectures that combine data-flow strategies for scheduling with perfect branch prediction, as well as restricted models with limited hardware. Under the most optimistic assumptions, they were able to execute 17 instructions per cycle; more practical models yielded between 2.0 and 5.8 instructions per cycle. Without looking past branches, few applications could execute more than 2 instructions per cycle.

Lam and Wilson [1992] argue that one major reason why studies of instruction-level parallelism differ markedly in their results is branch prediction. The studies that perform speculative execution based on prediction yield much less parallelism than the ones that have an oracle with perfect knowledge about branches. The authors speculate that executing branches locally will not yield large degrees of parallelism; they conduct experiments on a number of applications and find similar results to other studies (between 4.2 and 9.2). They argue that compiler optimizations that consider the global flow of control are essential.

Finally, Larus [1993] takes a different approach from the others, focusing strictly on the parallelism available in loops. He examines six codes (three integer programs from SPEC89 and three scientific applications), collects a trace annotated with semantic information, and runs it through a simulator that assumes a uniform shared-memory architecture with infinite processors. Two of the scientific codes, the ones that were primarily array manipulators, showed a potential speedup in the range of 65-250. The potential speedup of the other codes was 1.7-3, suggesting that the techniques developed for parallelizing scientific codes will not work well in parallelizing other types of programs.

CONCLUSION

We have described a large number of transformations that can be used to im-
[Figure: The S-DLX processor architecture: integer, branch, load/store, and floating-point units; 32 32-bit integer registers and 32 floating-point registers; and a 64KB, 4-way set-associative cache (1024 lines of 64 bytes each) with address translation, connected to RAM memory.]

field is 16 bits. The address for load and store instructions is computed by adding the 16-bit immediate operand to the register. To create a full 32-bit constant, the low 16 bits must first be set with a load immediate (LI) instruction that clears the high 16 bits; then the high 16 bits must be set with a load upper immediate (LUI) instruction.

The program counter is the special register PC. Jumps and branches are relative to PC + 4; jumps have a 26-bit signed
offset, and branches have a 16-bit signed offset. Integer branches test the GPRs for zero; floating-point branches test a special floating-point condition register (FPCR).

There is no branch delay slot. Because the number of instructions executed per cycle varies on a superscalar machine, it does not make sense to have a fixed number of delay slots. The number of delay slots is also heavily dependent on the pipeline depth, which may vary from one chip generation to the next.

Instead, static branch prediction is used: forward branches are always predicted as not taken, and backward branches are predicted as taken. If the prediction is wrong, there is a one-cycle delay.

Floating-point divides take 14 cycles and all other floating-point operations take four cycles, except when the result is used by a store operation, in which case they take three cycles. A load that hits in the cache takes two cycles; if it misses in the cache it takes 16 cycles. Integer multiplication takes two cycles. All other instructions take one cycle. When a load or store is followed by an integer instruction that modifies the address register, the instructions may be issued in the same cycle. If a floating-point store is followed by an operation that modifies the register being stored, the instructions may also be issued in the same cycle.

A.2 Vector DLX

For vectorizing transformations, we will use a version of DLX extended to include vector support. This new architecture, V-DLX, has eight vector registers, each of which holds a vector consisting of up to 64 floating-point numbers. The vector functional units perform all their operations on data in the vector and scalar registers.

We discuss only the functional units that perform floating-point addition, multiplication, and division, though vector machines typically have units to perform integer and logical operations as well. V-DLX issues only one scalar or one vector instruction per cycle, but nonconflicting scalar and vector instructions can overlap each other.

A special register, the vector length register (VLR), controls the number of quantities that are loaded, operated upon, or stored in any vector instruction. By software convention, the VLR is normally set to 64, except when handling the last few iterations of a loop.

The vector operations are described in Table A.2. They include arithmetic operations on vector registers and load/store operations between vector registers and memory. Vector operations take place either between two vector registers or between a vector register and a scalar register. In the latter case, the scalar value is extended across the entire vector. All vector computations have a vector register as the destination.

The speed of a vector operation depends on the depth of the pipeline in its implementation. The first result appears after some number of cycles (called the startup time). After the pipeline is full, one result is computed per clock cycle. In
the meantime, the processor can continue to execute other instructions. Table A.3 gives the startup times for the vector operations in V-DLX. These times should not be compared directly to the cycle times for operations on S-DLX because vector machines typically have higher clock speeds than microprocessors, although this gap is closing.

Table A.3. Startup Times in Cycles on V-DLX

    Operation      Start-Up Time
    Vector load         12
    [remaining rows illegible in the source]

A large factor in the speed of vector architectures is their memory system. V-DLX has eight memory banks. After the load latency, data is supplied at the rate of one word per clock cycle, provided that the stride with which the data is accessed does not cause bank conflicts (see Section 6.5.1).

Current examples of vector machines are the Cray C-90 [Oed 1992] and IBM ES 9000 Model 900 VF [Gibson and Rao 1992].

A.3 Multiprocessors

When multiple processors are employed to execute a program, many additional issues arise. The most obvious issue is how much of the underlying machine architecture to expose to the programmer. At one extreme, the programmer can make explicit use of hardware-supported operations by programming with locks, fork/join primitives, barrier synchronizations, and message send and receive. These operations are typically provided as system calls or library routines. System call semantics are usually not defined by the language, making it difficult for the compiler to optimize them automatically.

High-level languages for large-scale parallel processing provide primitives for expressing parallelism in one of two ways: control parallel or data parallel. Fortran 90 array section expressions are examples of explicitly data parallel operations. APL [Iverson 1962] also contains a wide variety of data-parallel array operators. Examples of control-parallel operations are cobegin/coend blocks and doacross loops [Cytron 1986].

For both shared- and distributed-memory multiprocessors, we assume that the programming environment initializes the variables Pnum to be the total number of processors and Pid to be the number of the local processor. Processors are numbered starting with 0.

A.3.1 Shared-Memory DLX Multiprocessor

sMX is our prototypical shared-memory parallel architecture. It consists of 16 S-DLX processors and 512MB of shared main memory. The processors and memory system are connected together by a bus. Each processor has an intelligent cache controller that monitors the bus (a snoopy cache). The caches are the same as on S-DLX, except that they contain 256KB of data. The bandwidth of the bus is 128 megabytes/second.

The processors share the memory units; a program running on a processor can access any memory element, and the system ensures that the values are maintained consistently across the machine. Without caching, consistency is easy to maintain; every memory reference is handled by the main memory controller. However, performance would be very poor because memory latencies are already too high on sequential machines to run well without a cache; having many processors share a common bus would make the problem much worse by increasing memory access latency and introducing bandwidth restrictions. The solution is to give each processor a cache that is smart enough to resolve reference conflicts.

A snoopy cache implements a sharing protocol that maintains consistency while still minimizing bus traffic. A processor that modifies a cache line invalidates all other copies and is said to own that cache line. There are a variety of cache coherency protocols [Stenstrom 1990;
Eggers and Katz 1989], but the details are not relevant to this survey. From the compiler writer's perspective, the key issue is the time it takes to make a memory reference. Table A.4 summarizes the latency of each kind of memory reference in sMX.

Table A.4. Memory Reference Latency in sMX

    Type of Memory Reference                 Cycles
    Read value available in local cache         1
    Read value owned by other processor     [illegible]
    Read value nobody owns                     20
    Write value owned locally                   1
    Write value owned by other processor       18
    Write value nobody owns                    22

Typically, shared-memory multiprocessors implement a number of synchronization operations. FORK(n) starts identical copies of the current program running on n processors; because it must copy the stack of the forking processor to a private memory for each of the n - 1 other processors, it is a very expensive operation. JOIN() desynchronizes with the previous FORK, and makes the processor available for other FORK operations (except for the original forking processor, which proceeds serially).

BARRIER() performs a barrier synchronization, which is a synchronization point in the program where each processor waits until all processors have arrived at that point.

A.3.2 Distributed-Memory DLX Multiprocessor

dMX is our hypothetical distributed-memory multiprocessor. The machine consists of 64 S-DLX processors (numbered 0 through 63) connected in an 8 x 8 mesh. The network bandwidth of each link in the mesh is 10MB per second. Each processor has an associated network processor that manages communication; the network processor has its own pool of memory and can communicate independently of its CPU. Messages that pass through a processor en route to some other destination in the mesh are handled by the network processor without interrupting the CPU.

The latency of a message transfer is ten microseconds plus 1 microsecond per 10 bytes of message (we have simplified the machine by declaring that all remote processors have the same latency). Communication is supported by a message library that provides the following calls:

● SEND(buffer, nbytes, target)
  Send nbytes from buffer to the processor target. If the message fits in the network processor's memory, the call returns after the copy (1 microsecond/10 bytes). If not, the call blocks until the message is sent. Note that this raises the potential for deadlock, but we will not concern ourselves with that in this simplified model.

● BROADCAST(buffer, nbytes)
  Send the message to every other processor in the machine. The communication library uses a fast broadcasting algorithm, so the maximum latency is roughly twice that of sending a message to the furthest edge of the mesh.

● RECEIVE(buffer, nbytes)
  Wait for a message to arrive; when it does, up to nbytes of it will be put in the buffer. The call returns the total number of bytes in the incoming message; if that value is greater than nbytes, RECEIVE guarantees that subsequent calls will return the rest of that message first.

ACKNOWLEDGMENTS

We thank Michael Burke, Manish Gupta, Monica Lam, and Vivek Sarkar for their assistance with various parts of this survey. We thank John Boyland, James Demmel, Alex Dupuy, John Hauser, James Larus, Dick Muntz, Ken Stanley, the members of the IBM Advanced Development Technology Institute, and the anonymous referees for their valuable comments on earlier drafts.
BALASUNDARAM, V. 1990. A mechanism for keeping BLELLOCH, G, 1989, Scans as primitive parallel op-
useful internal information in parallel pro- erations. IEEE Trans. Comput. C-38, 11 (Nov.),
gramming tools: The Data Access Descriptor, 1526-1538.
J. Parall. DtstrLb. Comput. 9, 2 (June), BROMLEY, M., HELLER, S., MCNERNEY, T., AND
154-170. STEELE, G. L., JR. 1991. Fortran at ten gi
BALASUNDARAM, V., Fox, G., KENNEDY, K., AND gaflops: The Connection Machine convolution
KREMER, U. 1990. An interactive environment compiler. In Proceedings of the SIGPLAN Con-
for data partitioning and distribution. In Pro- ference on Programmmg Language Destgn and
ceedings of the 5th Dlstnbuted Memory Corn- Implementation (Toronto, Ontano, June). SIG-
puter Conference (Charleston, South Carolina, PLAN Not. 26, 6, 145-156.
Apr.). IEEE Computer Society Press, Los BURKE, M. AND CYTRON, R. 1986. Interprocedural
Alamitos, Calif. dependence analysis and parallelization. In
BALASUNDARAM, V., KENNEDY, K., KREMER, U., Proceedings of the SIGPLAN Symposium on
MCKINLEY, K., AND SUBHLOK, J. 1989. The Compiler Construction (Palo Alto, Calif., June).
ParaScope editor: An interactive parallel pro- SZGPLANNot. 21, 7 (July), 162-175. Extended
gramming tool. In Proceedings of Superconz- version available as IBM Thomas J. Watson
putmg ’89 (Reno, Nev., Nov.). ACM Press, New Research Center Tech. Rep. RC 11794.
York, 540-550. BURNETT, G. J. AND COFFMAN, E. G., JR. 1970. A
BALL, J. E. 1979. Predicting the effects of optimiza- study of interleaved memory systems. In Pro-
tion on a procedure body. In Proceedings of the ceedings of the Spring Joint AFIPS Computer
SIGPLAN Symposium on Compiler Construc- Conference. vol. 36. AFIPS, Montvale, N.J.,
tion (Denver, Color., Aug.). SIGPLAN Not. 14, 467-474.
8, 214-220. BURSTALL, R. M. AND DARLINGTON, J. 1977. A trans-
formation system for developing recursive pro-
BANERJEE, U. 1991. Unimodular transformations of
grams. J. ACM 24, 1 (Jan.), 44-67.
double loops. In Aduances in Languages and
Compilers for Parallel Processing, A. Nicolau, BUTLER) M., YEH, T., PATT, Y., ALsuP, M., SCALES,
Ed. Research Monographs in Parallel and Dis- H., AND SHEBANOW, M. 1991, Single instruction
tributed Computing, MIT Press, Cambridge, stream parallelism is greater than two. In Pro-
Mass., Chap. 10. ceedings of the 18th Annual International Sym-
posium on Computer Architecture (Toronto,
BANERJEE, U. 1988a. Dependence Analysis for Su-
Ontario, May). SIGARCH Comput. Arch. Neuw
percomputing. Kluwer Academic Publishers,
19, 3, 276-286.
Boston, Mass.
CALLAHAN, D. AND KENNEDY, K. 1988a. Analysis of
BANEKJEE, U. 1988b. An introduction to a formal
interprocedural side-effects in a parallel pro-
theory of dependence analysis. J. Supercom-
gramming environment J. Parall. Distrib
put. 2, 2 (Oct.), 133-149.
Cornput. 5, 5 (Oct.), 517-550.
BANERJEE, U. 1979. Speedup of ordinary programs, CALLAHAN, D. AND KENNEDY, K. 1988b. Compihng
Ph.D. thesis, Tech. Rep. 79-989, Computer Sci- programs for distributed-memory multiproces-
ence Dept., Univ. of Illinois at Urbana- sors. J. Supercompzd. 2, 2 (Oct.), 151–169.
Champaign.
CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990.
BANERJEE, U., CHEN, S. C., KUCK, D. J., AND TOWLE, Improving register allocation for subscripted
R. A. 1979. Time and parallel processor bounds variables. In Proceedings of the SIGPLAN Con-
for Fortran-like loops. IEEE Trans. Comput. ference on Programming Language Design and
C-28, 9 (Sept.), 660-670. Implementation (White Plains, N.Y., June).
BANNING, J. P. 1979. An efficient way to find the SIGPLAN Not. 25, 6, 53-65.
side-effects of procedure calls and the aliases of CALLAHAN,
D , COCKE, J., AND KENNEDY, K. 1988.
variables. In Conference Record of the 6th ACM Estimating interlock and improving balance for
Symposium on Principles of Programming Lan - pipelined architectures J. Parall. Distrib.
guages (San Antonio, Tex., Jan.). ACM Press, Comput. 5, 4 (Aug.), 334-358.
29-41. CALLAHAN, D., COOPER, K., KENNEDY, K., AND
BERNSTEIN, R. 1986. Multiplication by integer con- TORCZON, L. 1986. Interprocedural constant
stants. Softw. Pratt. Exper. 16, 7 (July), propagation. In Proceedings of the SIGPLAN
641-652. Sympos~um on CompJer Construction (Palo
Alto, Cahf., June). SIGPLAN Not. 21, 7 (July),
BERRY, M., CHEN, D., Koss, P., KUCK, D., Lo, S.,
152-161.
PANG, Y., POINTER, L., ROLOFF, R., SAMEH, A.,
CLEMENTI, E , CHIN. S , SCHNEI~ER, D , Fox, G , CARR, S. 1993. Memory-hierarchy management.
ME SSINA, P., WALKER, D., HSIUNG, C., Ph.D. thesis, R]ce Umversity, Houston, Tex.
SCHWARZMEIER, J., LUE, K., ORSZAG, S., S~IDL, CHAITIN, G. J. 1982. Register allocation and spilling
F., JOHNSON, O., GOODRUM, R., AND MARTIN, J. via graph coloring. In Proceedings of the SIG-
1989. The Perfect Club benchmarks: Effective PLAN Symposium on Compiler Construction
performance evaluation of supercomputers. Znt. (Boston, Mass., June). SIGPLAN Not 17, 6,
J. Supercomput. Appl. 3, 3 (Fall), 5-40. 98-105.
CFIMTIN, G. J,, AUSLANDER, M. A., CHANDRA, A. K., 221-228. Also available as IBM Tech Rep. RC
COCKE, J., HOPKINS, M. E., AND MARKSTEIN, 8111, Feb. 1980.
P. W. 1981. Register allocation via coloring. COCKE, J. AND SCHWARTZ, J. T. 1970. Programming
Comput. Lang. 6, 1 (Jan.), 47-57. languages and their compders (preliminary
CHAMBERS, C. .MXD UNGAR, D. 1989. Customization: notes). 2nd ed. Courant Inst. of Mathemalzcal
Optimizing compiler technology for SELF, a Sciences, New York Umverslty, New York.
dynamically-typed object-oriented program- COLWELL, R. P,, NIX, R, P,, ODONNEL, J. J.,
ming language. In proceedings of the SIG- PAPWORTH, D. B., AND RODMAN, P, K. 1988. A
PLAN Conference on Programming Language VLIW architecture for a trace scheduling com-
Design and Implementation (Portland, Ore., piler. IEEE Trans. Comput. C-37, 8 (Aug.),
June). SIGPLAN Not. 24, 7 (July), 146-160. 967-979.
CEATTERJEE, S., GILBERT, J. R., AND SCHREIBER, R. COOPER, K, D, AND KENNEDY, K, 1989, Fast inter-
1993a. The alignment-distribution graph. In procedural ahas analysie, In Conference Record
Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. 234-252.
CHATTERJEE, S., GILBERT, J. R., SCHREIBER, R., AND TENG, S.-H. 1993b. Automatic array alignment in data-parallel programs. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 16-28.
CHEN, S. C. AND KUCK, D. J. 1975. Time and parallel processor bounds for linear recurrence systems. IEEE Trans. Comput. C-24, 7 (July), 701-717.
CHOI, J.-D., BURKE, M., AND CARINI, P. 1993. Efficient flow-sensitive interprocedural computation of pointer-induced aliases and side effects. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 232-245.
CHOW, F. C. 1988. Minimizing register usage penalty at procedure calls. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 85-94.
CHOW, F. C. AND HENNESSY, J. L. 1990. The priority-based coloring approach to register allocation. ACM Trans. Program. Lang. Syst. 12, 4 (Oct.), 501-536.
CLACK, C. AND PEYTON-JONES, S. L. 1985. Strictness analysis—a practical approach. In Functional Programming Languages and Computer Architecture. Lecture Notes in Computer Science, vol. 201. Springer-Verlag, Berlin, 35-49.
CLINGER, W. D. 1990. How to read floating point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 92-101.
COCKE, J. 1970. Global common subexpression elimination. In Proceedings of the ACM Symposium on Compiler Optimization (July). SIGPLAN Not. 5, 7, 20-24.
COCKE, J. AND MARKSTEIN, P. 1980. Measurement of program improvement algorithms. In Proceedings of the IFIP Congress (Tokyo, Japan, Oct.). North-Holland, Amsterdam, Netherlands.
COOPER, K. D. AND KENNEDY, K. 1989. Fast interprocedural alias analysis. In Conference Record of the 16th ACM Symposium on Principles of Programming Languages (Austin, Tex., Jan.). ACM Press, New York, 49-59.
COOPER, K. D., HALL, M. W., AND KENNEDY, K. 1993. A methodology for procedure cloning. Comput. Lang. 19, 2 (Apr.), 105-117.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1992. Unexpected side effects of inline substitution: A case study. ACM Lett. Program. Lang. Syst. 1, 1 (Mar.), 22-32.
COOPER, K. D., HALL, M. W., AND TORCZON, L. 1991. An experiment with inline substitution. Softw. Pract. Exper. 21, 6 (June), 581-601.
CRAY RESEARCH. 1988. CFT77 Reference Manual. Publication SR-0018-C. Cray Research, Inc., Eagan, Minn.
CYTRON, R. 1986. Doacross: Beyond vectorization for multiprocessors. In Proceedings of the International Conference on Parallel Processing (St. Charles, Ill., Aug.). IEEE Computer Society, Washington, D.C., 836-844.
CYTRON, R. AND FERRANTE, J. 1987. What's in a name? -or- the value of renaming for parallelism detection and storage allocation. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 19-27.
DANTZIG, G. B. AND EAVES, B. C. 1974. Fourier-Motzkin elimination and its dual with application to integer programming. In Combinatorial Programming: Methods and Applications (Versailles, France, Sept.). B. Roy, Ed. D. Reidel, Boston, Mass., 93-102.
DENNIS, J. B. 1980. Data flow supercomputers. Computer 13, 11 (Nov.), 48-56.
DIXIT, K. M. 1992. New CPU benchmarks from SPEC. In Digest of Papers, Spring COMPCON 1992, 37th IEEE Computer Society International Conference (San Francisco, Calif., Feb.). IEEE Computer Society Press, Los Alamitos, Calif., 305-310.
DONGARRA, J. AND HINDS, A. R. 1979. Unrolling loops in Fortran. Softw. Pract. Exper. 9, 3 (Mar.), 219-226.
EGGERS, S. J. AND KATZ, R. H. 1989. Evaluating the performance of four snooping cache coherency protocols. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May).
FREE SOFTWARE FOUNDATION. 1992. gcc 2.x Reference Manual. Free Software Foundation, Cambridge, Mass.
FREUDENBERGER, S. M., SCHWARTZ, J. T., AND SHARIR, M. 1983. Experience with the SETL optimizer. ACM Trans. Program. Lang. Syst. 5, 1 (Jan.), 26-45.
GANNON, D., JALBY, W., AND GALLIVAN, K. 1988. Strategies for cache and local memory management by global program transformation. J. Parall. Distrib. Comput. 5, 5 (Oct.), 587-616.
GERNDT, M. 1990. Updating distributed variables in local computations. Concurrency Pract. Exper. 2, 3 (Sept.), 171-193.
GIBSON, D. H. AND RAO, G. S. 1992. Design of the IBM System/390 computer family for numerically intensive applications: An overview for engineers and scientists. IBM J. Res. Dev. 36, 4 (July), 695-711.
GIRKAR, M. AND POLYCHRONOPOULOS, C. D. 1988. Compiling issues for supercomputers. In Proceedings of Supercomputing '88 (Orlando, Fla., Nov.).
HATFIELD, D. J. AND GERALD, J. 1971. Program restructuring for virtual memory. IBM Syst. J. 10, 3, 168-192.
HENNESSY, J. L. AND PATTERSON, D. A. 1990. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, San Mateo, Calif.
HEWLETT-PACKARD. 1992. PA-RISC 1.1 Architecture and Instruction Manual. 2nd ed. Part Number 09740-90039. Hewlett-Packard, Palo Alto, Calif.
HIGH PERFORMANCE FORTRAN FORUM. 1993. High Performance Fortran language specification, version 1.0. Tech. Rep. CRPC-TR92225, Rice University, Houston, Tex.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1993. Preliminary experiences with the Fortran D compiler. In Proceedings of Supercomputing '93 (Portland, Ore., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 338-350.
HIRANANDANI, S., KENNEDY, K., AND TSENG, C.-W. 1992. Compiling Fortran D for MIMD distributed-memory machines. Commun. ACM 35, 8 (Aug.), 66-80.
HUDAK, P. AND GOLDBERG, B. 1985. Distributed execution of functional programs using serial combinators. IEEE Trans. Comput. C-34, 10 (Oct.), 881-890.
HWU, W. W. AND CHANG, P. P. 1989. Achieving high instruction cache performance with an optimizing compiler. In Proceedings of the 16th Annual International Symposium on Computer Architecture (Jerusalem, Israel, May). SIGARCH Comput. Arch. News 17, 3 (June), 242-251.
HWU, W. W., MAHLKE, S. A., CHEN, W. Y., CHANG, P. P., WARTER, N. J., BRINGMANN, R. A., OUELLETTE, R. G., HANK, R. E., KIYOHARA, T., HAAB, G. E., HOLM, J. G., AND LAVERY, D. M. 1993. The superblock: An effective technique for VLIW and superscalar compilation. J. Supercomput. 7, 1/2 (May), 229-248.
IBM. 1992. Optimization and Tuning Guide for the XL Fortran and XL C Compilers. 1st ed. Publication SC09-1545-00. IBM, Armonk, N.Y.
IBM. 1991. IBM RISC System/6000 NIC Tuning Guide for Fortran and C. Publication GG24-3611-01. IBM, Armonk, N.Y.
IRIGOIN, F. AND TRIOLET, R. 1988. Supernode partitioning. In Conference Record of the 15th ACM Symposium on Principles of Programming Languages (San Diego, Calif., Jan.). ACM Press, New York, 319-329.
IVERSON, K. E. 1962. A Programming Language. John Wiley and Sons, New York.
KENNEDY, K. AND MCKINLEY, K. S. 1990. Loop distribution with arbitrary control flow. In Proceedings of Supercomputing '90 (New York, N.Y., Nov.). IEEE Computer Society Press, Los Alamitos, Calif., 407-416.
KENNEDY, K., MCKINLEY, K. S., AND TSENG, C.-W. 1993. Analysis and transformation in an interactive parallel programming tool. Concurrency Pract. Exper. 5, 7 (Oct.), 575-602.
KILDALL, G. 1973. A unified approach to global program optimization. In Conference Record of the ACM Symposium on Principles of Programming Languages (Boston, Mass., Oct.). ACM Press, New York, 194-206.
KNOBE, K. AND NATARAJAN, V. 1993. Automatic data allocation to minimize communication on SIMD machines. J. Supercomput. 7, 4 (Dec.), 387-415.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1990. Data optimization: Allocation of arrays to reduce communication on SIMD machines. J. Parall. Distrib. Comput. 8, 2 (Feb.), 102-118.
KNOBE, K., LUKAS, J. D., AND STEELE, G. L., JR. 1988. Massively parallel data optimization. In The 2nd Symposium on the Frontiers of Massively Parallel Computation (Fairfax, Va., Oct.). IEEE, Piscataway, N.J., 551-558.
KNUTH, D. E. 1971. An empirical study of Fortran programs. Softw. Pract. Exper. 1, 2 (Apr.-June), 105-133.
KOELBEL, C. 1990. Compiling programs for nonshared memory machines. Ph.D. thesis, Purdue University, West Lafayette, Ind.
KOELBEL, C. AND MEHROTRA, P. 1991. Compiling global name-space parallel loops for distributed execution. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 440-451.
KRANZ, D., KELSEY, R., REES, J., HUDAK, P., PHILBIN, J., AND ADAMS, N. 1986. ORBIT: An optimizing compiler for Scheme. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 219-233.
KUCK, D. J. 1978. The Structure of Computers and Computations. Vol. 1. John Wiley and Sons, New York.
KUCK, D. J. 1977. A survey of parallel machine organization and programming. ACM Comput. Surv. 9, 1 (Mar.), 29-59.
KUCK, D. J. AND STOKES, R. 1982. The Burroughs Scientific Processor (BSP). IEEE Trans. Comput. C-31, 5 (May), 363-376.
KUCK, D. J., KUHN, R. H., PADUA, D., LEASURE, B., AND WOLFE, M. 1981. Dependence graphs and compiler optimizations. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 207-218.
LAM, M. S. 1988. Software pipelining: An effective scheduling technique for VLIW machines. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Atlanta, Ga., June). SIGPLAN Not. 23, 7 (July), 318-328.
LAM, M. S. AND WILSON, R. P. 1992. Limits of control flow on parallelism. In Proceedings of the 19th Annual International Symposium on Computer Architecture (Gold Coast, Australia, May). SIGARCH Comput. Arch. News 20, 2, 46-57.
LAM, M. S., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance and optimization of blocked algorithms. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 63-74.
LAMPORT, L. 1974. The parallel execution of DO loops. Commun. ACM 17, 2 (Feb.), 83-93.
LANDI, W., RYDER, B. G., AND ZHANG, S. 1993. Interprocedural modification side effect analysis with pointer aliasing. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Albuquerque, New Mexico, June). SIGPLAN Not. 28, 6, 56-67.
LARUS, J. R. 1993. Loop-level parallelism in numeric and symbolic programs. IEEE Trans. Parall. Distrib. Syst. 4, 7 (July), 812-826.
LEE, G., KRUSKAL, C. P., AND KUCK, D. J. 1985. An empirical study of automatic restructuring of nonnumerical programs for parallel processors. IEEE Trans. Comput. C-34, 10 (Oct.), 927-933.
LI, Z. 1992. Array privatization for parallel execution of loops. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 313-322.
LI, J. AND CHEN, M. 1991. Compiling communication-efficient programs for massively parallel machines. IEEE Trans. Parall. Distrib. Syst. 2, 3 (July), 361-376.
LI, J. AND CHEN, M. 1990. Index domain alignment: Minimizing cost of cross-referencing between distributed arrays. In The 3rd Symposium on the Frontiers of Massively Parallel Computation, J. JaJa, Ed. IEEE Computer Society Press, Los Alamitos, Calif., 424-433.
LI, Z. AND YEW, P. 1988. Interprocedural analysis for parallel computing. In Proceedings of the International Conference on Parallel Processing. F. A. Briggs, Ed. Vol. 2. Pennsylvania State University Press, University Park, Pa., 221-228.
LI, Z., YEW, P., AND ZHU, C. 1990. Data dependence analysis on multi-dimensional array references. IEEE Trans. Parall. Distrib. Syst. 1, 1 (Jan.), 26-34.
LOVEMAN, D. B. 1977. Program improvement by source-to-source transformation. J. ACM 24, 1 (Jan.), 121-145.
LUCCO, S. 1992. A dynamic scheduling method for irregular parallel programs. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 200-211.
MACE, M. E. 1987. Memory Storage Patterns in Parallel Processing. Kluwer Academic Publishers, Norwell, Mass.
MACE, M. E. AND WAGNER, R. A. 1985. Globally optimum selection of memory storage patterns. In Proceedings of the International Conference on Parallel Processing, D. DeGroot, Ed. IEEE Computer Society, Washington, D.C., 264-271.
MARKSTEIN, P., MARKSTEIN, V., AND ZADECK, K. 1994. Strength reduction. In Optimization in Compilers. ACM Press, New York, Chap. 9. To be published.
MASSALIN, H. 1987. Superoptimizer: A look at the smallest program. In Proceedings of the 2nd International Conference on Architectural Support for Programming Languages and Operating Systems (Palo Alto, Calif., Oct.). SIGPLAN Not. 22, 10, 122-126.
MAYDAN, D. E., AMARASINGHE, S. P., AND LAM, M. S. 1993. Array data flow analysis and its use in array privatization. In Conference Record of the 20th ACM Symposium on Principles of Programming Languages (Charleston, S. Carolina, Jan.). ACM Press, New York, 1-15.
MAYDAN, D. E., HENNESSY, J. L., AND LAM, M. S. 1991. Efficient and exact data dependence analysis. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 1-14.
MCFARLING, S. 1991. Procedure merging with instruction caches. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (Toronto, Ontario, June). SIGPLAN Not. 26, 6, 71-79.
MCGRAW, J. R. 1985. SISAL: Streams and iteration in a single assignment language. Tech. Rep. M-146, Lawrence Livermore National Laboratory, Livermore, Calif.
MCMAHON, F. H. 1986. The Livermore Fortran kernels: A computer test of numerical performance range. Tech. Rep. UCRL-53745, Lawrence Livermore National Laboratory, Livermore, Calif.
MICHIE, D. 1968. "Memo" functions and machine learning. Nature 218, 19-22.
MILLSTEIN, R. E. AND MUNTZ, C. A. 1975. The ILLIAC-IV Fortran compiler. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 1-8.
MIRCHANDANEY, R., SALTZ, J. H., SMITH, R. M., NICOL, D. M., AND CROWLEY, K. 1988. Principles of runtime support for parallel processors. In Proceedings of the ACM International Conference on Supercomputing (St. Malo, France, July). ACM Press, New York, 140-152.
MOREL, E. AND RENVOISE, C. 1979. Global optimization by suppression of partial redundancies. Commun. ACM 22, 2 (Feb.), 96-103.
MUCHNICK, S. S. AND JONES, N., (EDS.) 1981. Program Flow Analysis. Prentice-Hall, Englewood Cliffs, N.J.
MURAOKA, Y. 1971. Parallelism exposure and exploitation in programs. Ph.D. thesis, Tech. Rep. 71-424, Univ. of Illinois at Urbana-Champaign.
MYERS, E. W. 1981. A precise interprocedural data flow algorithm. In Conference Record of the 8th ACM Symposium on Principles of Programming Languages (Williamsburg, Va., Jan.). ACM Press, New York, 219-230.
NICOLAU, A. 1988. Loop quantization: A generalized loop unwinding technique. J. Parall. Distrib. Comput. 5, 5 (Oct.), 568-586.
NICOLAU, A. AND FISHER, J. A. 1984. Measuring the parallelism available for very long instruction word architectures. IEEE Trans. Comput. C-33, 11 (Nov.), 968-976.
NIKHIL, R. S. 1988. ID reference manual, version 88.0. Tech. Rep. 284, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass.
NOBAYASHI, H. AND EOYANG, C. 1989. A comparison study of automatically vectorizing Fortran compilers. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 820-825.
O'BRIEN, K., HAY, B., MINISH, J., SCHAFFER, H., SCHLOSS, B., SHEPHERD, A., AND ZALESKI, M. 1990. Advanced compiler technology for the RISC System/6000 architecture. In IBM RISC System/6000 Technology. Publication SA23-2619. IBM Corporation, Mechanicsburg, Penn.
OED, W. 1992. Cray Y-MP C90: System features and early benchmark results. Parall. Comput. 18, 8 (Aug.), 947-954.
OEHLER, R. R. AND BLASGEN, M. W. 1991. IBM RISC System/6000: Architecture and performance. IEEE Micro 11, 3 (June), 14-24.
PADUA, D. A. AND PETERSEN, P. M. 1992. Evaluation of parallelizing compilers. In Parallel Computing and Transputer Applications (Barcelona, Spain, Sept.). CIMNE, 1505-1514. Also available as Center for Supercomputing Research and Development Tech. Rep. 1173.
PADUA, D. A. AND WOLFE, M. J. 1986. Advanced compiler optimizations for supercomputers. Commun. ACM 29, 12 (Dec.), 1184-1201.
PADUA, D. A., KUCK, D. J., AND LAWRIE, D. 1980. High-speed multiprocessors and compilation techniques. IEEE Trans. Comput. C-29, 9 (Sept.), 763-776.
PETERSEN, P. M. AND PADUA, D. A. 1993. Static and dynamic evaluation of data dependence analysis. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York, 107-116.
PETTIS, K. AND HANSEN, R. C. 1990. Profile guided code positioning. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 16-27.
POLYCHRONOPOULOS, C. D. 1988. Parallel Programming and Compilers. Kluwer Academic Publishers, Boston, Mass.
POLYCHRONOPOULOS, C. D. 1987a. Advanced loop optimizations for parallel computers. In Proceedings of the 1st International Conference on Supercomputing. Lecture Notes in Computer Science, vol. 297. Springer-Verlag, Berlin, 255-277.
POLYCHRONOPOULOS, C. D. 1987b. Loop coalescing: A compiler transformation for parallel machines. In Proceedings of the International Conference on Parallel Processing (University Park, Penn., Aug.). Pennsylvania State University Press, University Park, Pa., 235-242.
POLYCHRONOPOULOS, C. D. AND KUCK, D. J. 1987. Guided self-scheduling: A practical scheduling scheme for parallel supercomputers. IEEE Trans. Comput. C-36, 12 (Dec.), 1425-1439.
POLYCHRONOPOULOS, C. D., GIRKAR, M., HAGHIGHAT, M. R., LEE, C. L., LEUNG, B., AND SCHOUTEN, D. 1989. Parafrase-2: An environment for parallelizing, partitioning, synchronizing, and scheduling programs on multiprocessors. In Proceedings of the International Conference on Parallel Processing. Vol. 2. Pennsylvania State University Press, University Park, Pa., 39-48.
PRESBERG, D. L. AND JOHNSON, N. W. 1975. The Paralyzer: IVTRAN's parallelism analyzer and synthesizer. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 9-16.
PUGH, W. 1992. A practical algorithm for exact array dependence analysis. Commun. ACM 35, 8 (Aug.), 102-115.
PUGH, W. 1991. Uniform techniques for loop optimization. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York.
RAU, B. AND FISHER, J. A. 1993. Instruction-level parallel processing: History, overview, and perspective. J. Supercomput. 7, 1/2 (May), 9-50.
RAU, B., YEN, D. W. L., YEN, W., AND TOWLE, R. A. 1989. The Cydra 5 departmental supercomputer: Design philosophies, decisions, and trade-offs. Computer 22, 1 (Jan.), 12-34.
REES, J., CLINGER, W., ET AL. 1986. Revised3 report on the algorithmic language Scheme. SIGPLAN Not. 21, 12 (Dec.), 37-79.
RISEMAN, E. M. AND FOSTER, C. C. 1972. The inhibition of potential parallelism by conditional jumps. IEEE Trans. Comput. C-21, 12 (Dec.), 1405-1411.
ROGERS, A. M. 1991. Compiling for locality of reference. Ph.D. thesis, Tech. Rep. TR 91-1195, Dept. of Computer Science, Cornell University.
RUSSELL, R. M. 1978. The CRAY-1 computer system. Commun. ACM 21, 1 (Jan.), 63-72.
SABOT, G. AND WHOLEY, S. 1993. CMAX: A Fortran translator for the Connection Machine system. In Proceedings of the ACM International Conference on Supercomputing (Tokyo, Japan, July). ACM Press, New York.
SARKAR, V. 1989. Partitioning and Scheduling Parallel Programs for Multiprocessors. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.
SARKAR, V. AND THEKKATH, R. 1992. A general framework for iteration-reordering transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 175-187.
SCHEIFLER, R. W. 1977. An analysis of inline substitution for a structured programming language. Commun. ACM 20, 9 (Sept.), 647-654.
SHEN, Z., LI, Z., AND YEW, P.-C. 1990. An empirical study of Fortran programs for parallelizing compilers. IEEE Trans. Parall. Distrib. Syst. 1, 3 (July), 356-364.
SINGH, J. P. AND HENNESSY, J. L. 1991. An empirical investigation of the effectiveness and limitations of automatic parallelization. In Proceedings of the International Symposium on Shared Memory Multiprocessing (Apr.), 213-240.
SINGH, J. P., WEBER, W.-D., AND GUPTA, A. 1992. SPLASH: Stanford parallel applications for shared memory. SIGARCH Comput. Arch. News 20, 1 (Mar.), 5-44. Also available as Stanford Univ. Tech. Rep. CSL-TR-92-526.
SITES, R. L., (ED.) 1992. Alpha Architecture Reference Manual. Digital Press, Bedford, Mass.
SOHI, G. S. AND VAJAPEYAM, S. 1990. Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers. IEEE Trans. Comput. C-39, 3 (Mar.), 349-359.
STEELE, G. L., JR. 1977. Arithmetic shifting considered harmful. SIGPLAN Not. 12, 11 (Nov.), 61-69.
STEELE, G. L., JR. AND WHITE, J. L. 1990. How to print floating-point numbers accurately. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (White Plains, N.Y., June). SIGPLAN Not. 25, 6, 112-126.
STENSTROM, P. 1990. A survey of cache coherence schemes for multiprocessors. Computer 23, 6 (June), 12-24.
SUN MICROSYSTEMS. 1991. SPARC Architecture Manual, Version 8. Part No. 800-1399-08. Sun Microsystems, Mountain View, Calif.
SZYMANSKI, T. G. 1978. Assembling code for machines with span-dependent instructions. Commun. ACM 21, 4 (Apr.), 300-308.
TANG, P. AND YEW, P. 1990. Dynamic processor self-scheduling for general parallel nested loops. IEEE Trans. Comput. C-39, 7 (July), 919-929.
THINKING MACHINES. 1989. Connection Machine, Model CM-2 Technical Summary. Thinking Machines Corp., Cambridge, Mass.
TJADEN, G. S. AND FLYNN, M. J. 1970. Detection and parallel execution of independent instructions. IEEE Trans. Comput. C-19, 10 (Oct.), 889-895.
TORRELLAS, J., LAM, M. S., AND HENNESSY, J. L. 1994. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput. 43, 6 (June), 651-663.
TOWLE, R. A. 1976. Control and data dependence for program transformations. Ph.D. thesis, Tech. Rep. 76-788, Computer Science Dept., Univ. of Illinois at Urbana-Champaign.
TRIOLET, R., IRIGOIN, F., AND FEAUTRIER, P. 1986. Direct parallelization of call statements. In Proceedings of the SIGPLAN Symposium on Compiler Construction (Palo Alto, Calif., June). SIGPLAN Not. 21, 7 (July), 176-185.
TSENG, C.-W. 1993. An optimizing Fortran D compiler for MIMD distributed-memory machines. Ph.D. thesis, Tech. Rep. Rice COMP TR93-199, Computer Science Dept., Rice University, Houston, Tex.
TU, P. AND PADUA, D. A. 1993. Automatic array privatization. In Proceedings of the 6th International Workshop on Languages and Compilers for Parallel Computing. Lecture Notes in Computer Science, vol. 768. Springer-Verlag, Berlin, 500-521.
VON HANXLEDEN, R. AND KENNEDY, K. 1992. Relaxing SIMD control flow constraints using loop transformations. In Proceedings of the SIGPLAN Conference on Programming Language Design and Implementation (San Francisco, Calif., June). SIGPLAN Not. 27, 7 (July), 188-199.
WALL, D. W. 1991. Limits of instruction-level parallelism. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (Santa Clara, Calif., Apr.). SIGPLAN Not. 26, 4, 176-188.
WANG, K. 1991. Intelligent program optimization and parallelization for parallel computers. Ph.D. thesis, Tech. Rep. CSD-TR-91-030, Purdue University, West Lafayette, Ind.
WANG, K. AND GANNON, D. 1989. Applying AI techniques to program optimization for parallel computers. In Parallel Processing for Supercomputers and Artificial Intelligence, K. Hwang and D. DeGroot, Eds. McGraw-Hill, New York, Chap. 12.
WEDEL, D. 1975. Fortran for the Texas Instruments ASC system. In Programming Languages and Compilers for Parallel and Vector Machines (New York, N.Y., Mar.). SIGPLAN Not. 10, 3, 119-132.
WEGMAN, M. N. AND ZADECK, F. K. 1991. Constant propagation with conditional branches. ACM Trans. Program. Lang. Syst. 13, 2 (Apr.), 181-210.
WEISS, M. 1991. Strip mining on SIMD architectures. In Proceedings of the ACM International Conference on Supercomputing (Cologne, Germany, June). ACM Press, New York, 234-243.
WHOLEY, S. 1992. Automatic data mapping for distributed-memory parallel computers. In Proceedings of the ACM International Conference on Supercomputing (Washington, D.C., July). ACM Press, New York, 25-34.
WOLF, M. E. 1992. Improving locality and parallelism in nested loops. Ph.D. thesis, Computer Science Dept., Stanford University, Stanford, Calif.
WOLFE, M. J. 1989a. More iteration space tiling. In Proceedings of Supercomputing '89 (Reno, Nev., Nov.). ACM Press, New York, 655-664.
WOLFE, M. J. 1989b. Optimizing Supercompilers for Supercomputers. Research Monographs in Parallel and Distributed Computing. MIT Press, Cambridge, Mass.
WOLF, M. E. AND LAM, M. S. 1991. A loop transformation theory and an algorithm to maximize parallelism. IEEE Trans. Parall. Distrib. Syst. 2, 4 (Oct.), 452-471.
WOLFE, M. J. AND TSENG, C. 1992. The Power Test for data dependence. IEEE Trans. Parall. Distrib. Syst. 3, 5 (Sept.), 591-601.
ZIMA, H. P., BAST, H.-J., AND GERNDT, M. 1988. SUPERB: A tool for semi-automatic SIMD/MIMD parallelization. Parall. Comput. 6, 1 (Jan.), 1-18.