
RETROSPECTIVE:

Automatic Loop Interchange

Randy Allen
Catalytic Compilers
1900 Embarcadero Rd, #206
Palo Alto, CA 94043
jra@catacomp.com

Ken Kennedy
Department of Computer Science
Rice University
Houston, TX 77251
ken@rice.edu

Retrospectives provide a rare and interesting opportunity to reflect upon the past and recall (or, more accurately after a couple of decades, speculate upon) our state of mind and understanding at earlier times. Automatic Loop Interchange was published almost 20 years ago, at a midpoint in research on data dependence and program transformations. The paper pulled together the work of many predecessors [9, 10, 11, 15] into a simple, clean theory, providing a checkpoint on earlier predictions about the power and applicability of data dependence. At the same time, the paper was published just as the field of data dependence entered a golden age. It is our hope that this paper helped catalyze the subsequent research into data dependence theory and applications.

Foundation
Although this paper is entitled Automatic Loop Interchange, it is far broader in scope. As an introduction to interchange, the paper also covers a wide spectrum of dependence-based theory and transformations. This work is built on the efforts of many others, and we would be remiss if we did not acknowledge at least some of those efforts; acknowledging all of them would quickly blow our page limits. The earliest papers on dependence-based program transformations include papers by Lamport [10, 11] and Kuck [9]. Lamport developed a form of loop interchange for use in vectorization, as well as the wavefront method for parallelization, an early form of what came to be called loop skewing.

A 1978-79 sabbatical at IBM catalyzed our initial interest in automatic vectorization. We started development with a version of the Parafrase system built at the University of Illinois by Dave Kuck and his group (including Michael Wolfe) [9, 15], but we began work on an entirely new system in 1981 to provide a better platform for our research on multilevel code generation [8] (eventually published in 1987 [3]). The new system became known as the Parallel Fortran Converter (PFC).

As indicated earlier, we also had access to the Parafrase system and the associated body of research. In particular, Michael Wolfe's Master's thesis focused on loop interchange [15], a topic that he developed further in later works [16, 17].
Our own work on the subject began with our multilevel code generation strategy [8, 1, 2, 3], which we implemented in the summer of 1981. Real implementations often provide incredible insights into the weaknesses of theoretical approaches; this was definitely true in the case of PFC. The code generation strategy proved extremely effective in practice and was far more efficient in compile time than we had anticipated. However, the implementation quickly showed us the key missing piece: while PFC performed well in terms of the vectorization it detected, loop interchange was clearly the key incremental transformation. The practical strategy presented in the paper (moving loops that carried no dependence to the innermost position, and testing loops that carried dependences for interchange only to the next deeper position) evolved out of discussions with Randy Scarborough, Joe Warren, and others in the PFC project.

At the time we started the PFC project, few programmers had access to vectorizing compilers that used data dependence. Vector units and vectorizing compilers were employed exclusively on expensive high-end machines or specialized array processors, which were available to only a small percentage of the general programming public. Despite this limited access, vectorizing compilers had already earned the informal nicknames "paralyzers" and "terrorizers" due to their long compile times and often less-than-optimal output.
We began our effort with modest expectations. PFC was deliberately structured as a source-to-source translator, primarily because we believed the algorithms that we wanted to employ would require more compile time than could be justified in a production compiler. We also doubted the power of data dependence, and expected that we would need to employ techniques from artificial intelligence to achieve satisfactory results.

The strategy reported in this paper was implemented in the PFC system. Although we reported no experimental results in the paper, a later study reviewed in our book [4] showed that PFC was able to do extremely well on the Callahan, Dongarra, and Levine vectorization tests [7].

This paper marked a point in PFC's development where our early assumptions had been proved wrong. What PFC and this paper had shown was that a fairly simple set of program transformations based on a unified underlying theory could provide effective restructuring without requiring unacceptable compile times.

At that time, we had to pay for computer time by the CPU-minute. The first time that we tried a large test case (roughly 1000 lines of code), Ken insisted that we limit the CPU time to 10 minutes (which was still several thousand dollars of computer time) to avoid blowing our research budget. We didn't expect the test case to complete in the time limit; when it took only 40 seconds, we assumed that PFC had crashed processing the input. It took us a day of wading through the output to verify that it had in fact completely and correctly processed the test.

20 Years of the ACM/SIGPLAN Conference on Programming Language Design and Implementation (1979-1999): A Selection, 2003. Copyright 2003 ACM 1-58113-623-4 $5.00. ACM SIGPLAN, Best of PLDI 1979-1999.

Impact
The approaches to dependence and loop interchange presented in this paper were soon incorporated into a number of commercial compilers. We are directly aware of the implementations in the IBM compiler for the 3090 Vector Feature [13] and the Convex vectorizing compiler, and were involved in the implementation of the Ardent restructuring compilers.

Bibliography
1. J. R. Allen. Dependence analysis for subscripted variables and its application to program transformations. Ph.D. dissertation, Department of Mathematical Sciences, Rice University, May 1983.
2. J. R. Allen and K. Kennedy. PFC: a program to convert Fortran to parallel form. In Supercomputers: Design and Applications, K. Hwang, editor, pages 186-203. IEEE Computer Society Press, August 1984.
3. J. R. Allen and K. Kennedy. Automatic translation of Fortran programs to vector form. ACM Transactions on Programming Languages and Systems, 9(4):491-542, October 1987.
4. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.
5. R. Allen. Unifying vectorization, parallelization, and optimization: the Ardent compiler. In Proceedings of the Third International Conference on Supercomputing, 1988.
6. D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In PLDI '90 (also included in this volume).
7. D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: A test suite and results. In Proceedings of Supercomputing '88, Orlando, FL, 1988.
8. K. Kennedy. Automatic translation of Fortran programs to vector form. Rice Technical Report 476-029-4, Department of Mathematical Sciences, Rice University, 1980.
9. D. Kuck, R. Kuhn, D. Padua, B. Leasure, and M. J. Wolfe. Dependence graphs and compiler optimizations. In Conference Record of the Eighth Annual ACM Symposium on the Principles of Programming Languages, Williamsburg, VA, January 1981.
10. L. Lamport. The parallel execution of DO loops. Communications of the ACM, 17(2):83-93, February 1974.
11. L. Lamport. The coordinate method for the parallel execution of iterative DO loops. Technical Report CA-7608-0221, SRI, Menlo Park, CA, August 1976; revised October 1981.
12. D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, December 1986.
13. R. G. Scarborough and H. G. Kolsky. A vectorizing FORTRAN compiler. IBM Journal of Research and Development, March 1986.
14. M. E. Wolf and M. Lam. A data locality optimizing algorithm. In PLDI '91 (also included in this volume).
15. M. J. Wolfe. Techniques for improving the inherent parallelism in programs. Master's thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, July 1978.
16. M. J. Wolfe. Advanced loop interchanging. In Proceedings of the 1986 International Conference on Parallel Processing, St. Charles, IL, August 1986.
17. M. J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, Redwood City, CA, 1996.

Beyond the immediate practical impact, Automatic Loop Interchange also established interchange as a fundamental transformation in all advanced optimizing compilers: vectorizing, parallelizing, and even scalar. While many previous papers had focused on dependences as execution constraints that limit reordering, this paper (in a section devoted to other applications of dependence) also pointed out the dual aspect: dependences represent reused memory locations. Accordingly, dependence provided a basis for optimizing for memory hierarchies by moving the most frequently accessed memory locations into the fastest elements of the hierarchy. Later research would prove loop interchange to be as important for moving dependences into inner loops (thereby optimizing memory reuse) as it had proven to be for moving dependences out of inner loops (as was necessary for vectorizing loops). Particularly important exemplars of this research are the papers by Callahan, Carr, and Kennedy on register optimization [6] and by Wolf and Lam on cache blocking [14]. Both papers are included in this volume. Practical implementations that included this aspect of dependence include the Ardent compiler [5].

Future Applications
Looking back over the past 18 years, we doubt that we would have predicted the impact of loop interchange on the compiler literature. Although our own work and the work of others went on to more powerful transformation strategies based on direction and distance matrices [4, 14, 16, 17], this work was one of the first to establish that powerful and effective program transformations could be implemented in practical compiler systems.

Of course, one reason for the growth in importance of this work is the increased use of parallelism in computer architecture and the increasing disparity between CPU and memory speeds. Looking to the future, we believe these factors are only going to increase in the design of computer systems, making these compiler techniques even more relevant. Memory hierarchies in particular increasingly dominate computation times, and automatic loop interchange is a key transformation for exploiting the hierarchy.

While loop interchange has been thoroughly explored in the context of restructuring compilers, there are other contexts that have not been explored so thoroughly. For instance, given the intimate relationship between dependence and loop iterations, it is natural to expect that dependence and loop interchange should play as important a role in the design of pipelined architectures as they do in exploiting them.
Acknowledgements
As was the case when the paper was published, this work has progressed over the years only through the efforts and collaborations of others far too numerous to list here. However, we would be remiss if we did not acknowledge the contributions of Randy Scarborough, Joe Warren, Horace Flatt, and all the graduate students who worked on PFC.
