Code Transformations
Exploiting Fully Permutable Loops:
Exploiting fully permutable loops is a technique for improving
program performance by exposing parallelism. A group of loops is
fully permutable when the loops can be reordered in any way
without violating a data dependence, and such a nest can then be
pipelined, wavefronted, or blocked.
A loop nest with k outermost fully permutable loops is created
from k independent solutions to the time-partition constraints by
making the kth solution the kth row of the new affine transform.
Once the transform is created, a code-generation algorithm
produces the transformed loop nest.
• The solutions found in the SOR (Successive Over-Relaxation)
example were [1 0] and [1 1]. Making the first solution the first
row and the second solution the second row creates the transform
[1 0; 1 1] (rows separated by semicolons). Making the second
solution the first row instead creates the transform [1 1; 1 0].
• This technique is useful because it allows the program to take
advantage of the parallelism present in the loop nest, which can
lead to a significant increase in performance.
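The effect of the two SOR transforms can be sketched in C. This is an illustration under stated assumptions, not the code a compiler would necessarily emit: the loop body is an assumed one-dimensional relaxation update (the section does not show the SOR body), and the names relax, sor_original, sor_transformed, and sor_permuted are hypothetical. With rows [1 0] and [1 1] the new indices are p = i and q = i + j; swapping the rows makes q the outer loop.

```c
/* Assumed SOR-style update; x must hold at least n + 3 elements. */
static void relax(double *x, int j) {
    x[j + 1] = (x[j] + x[j + 1] + x[j + 2]) / 3.0;
}

/* Original nest: i outer, j inner. */
void sor_original(double *x, int m, int n) {
    for (int i = 0; i <= m; i++)
        for (int j = 0; j <= n; j++)
            relax(x, j);
}

/* Transform [1 0; 1 1]: p = i (outer), q = i + j (inner). */
void sor_transformed(double *x, int m, int n) {
    for (int p = 0; p <= m; p++)
        for (int q = p; q <= p + n; q++)
            relax(x, q - p);            /* j = q - p */
}

/* Transform [1 1; 1 0]: q = i + j (outer), p = i (inner). */
void sor_permuted(double *x, int m, int n) {
    for (int q = 0; q <= m + n; q++) {
        int lo = (q - n > 0) ? q - n : 0;   /* p >= max(0, q - n) */
        int hi = (q < m) ? q : m;           /* p <= min(m, q)     */
        for (int p = lo; p <= hi; p++)
            relax(x, q - p);
    }
}
```

All three functions perform the same updates in an order that respects every dependence of this assumed body, so they should leave x in the same final state. That both the (p, q) and the (q, p) orders are legal is what makes the transformed nest fully permutable.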
Wavefronting:
• It is also easy to generate k - 1 inner parallelizable loops from a loop nest with k
outermost fully permutable loops. Although pipelining is preferable, we include
this information here for completeness.
• We partition the computation of a loop with k outermost fully permutable loops
using a new index variable i’, where i’ is defined to be some combination of all the
indices in the k permutable loop nest.
• We create an outermost sequential loop that iterates through the i’ partitions in
increasing order; the computation nested within each partition is ordered as
before. The first k - 1 loops within each partition are guaranteed to be parallelizable.
Intuitively, if given a two-dimensional iteration space, this transform groups
iterations along 135° diagonals as an execution of the outermost loop. This
strategy guarantees that iterations within each iteration of the outermost loop
have no data dependence.
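Wavefronting a two-dimensional nest can be sketched as follows. The stencil, the size N, and the function names are assumptions chosen for illustration. The nest has dependence distances (1, 0) and (0, 1), so it is fully permutable, and the new index w = i + j plays the role of i’. The inner loop deliberately runs in descending order to show that iterations on one diagonal do not depend on each other.

```c
#define N 6   /* assumed problem size for the sketch */

/* Sequential reference: each cell depends on its upper and left neighbor. */
void stencil_seq(long a[N][N]) {
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            a[i][j] = a[i - 1][j] + a[i][j - 1];
}

/* Wavefront version: sequential outer loop over diagonals w = i + j. */
void stencil_wavefront(long a[N][N]) {
    for (int w = 2; w <= 2 * (N - 1); w++) {
        int lo = (w - (N - 1) > 1) ? w - (N - 1) : 1;  /* keeps j <= N - 1 */
        int hi = (w - 1 < N - 1) ? w - 1 : N - 1;      /* keeps j >= 1     */
        /* Every iteration of this loop reads only cells on diagonal w - 1,
           so the iterations could run in parallel or in any order. */
        for (int i = hi; i >= lo; i--) {
            int j = w - i;
            a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
    }
}
```

In a parallel setting the inner loop is the one that would be distributed across processors, with a barrier between successive values of w.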
Blocking:
• A k-deep, fully permutable loop nest can be blocked in k dimensions.
Instead of assigning the iterations to processors based on the value of the
outer or inner loop indexes, we can aggregate blocks of iterations into one
unit. Blocking is useful for enhancing data locality as well as for minimizing
the overhead of pipelining.
A simple loop nest (before):
    for (i = 0; i < n; i++)
        for (j = 1; j < n; j++) {
            <S>
        }
Blocked version of this loop nest (after), with block size b:
    for (ii = 0; ii < n; ii += b)
        for (jj = 1; jj < n; jj += b)
            for (i = ii; i < min(ii + b, n); i++)
                for (j = jj; j < min(jj + b, n); j++) {
                    <S>
                }
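A small runnable check of the blocking idea, written as a sketch: the statement <S> is replaced by a hypothetical counter update, and the names visit_simple, visit_blocked, and min_int are assumptions. It verifies that a blocked nest executes exactly the same set of (i, j) iterations as the simple nest, even when n is not a multiple of the block size b.

```c
static int min_int(int a, int b) { return a < b ? a : b; }

/* Simple nest: stand-in for <S> counts each visited iteration. */
void visit_simple(int n, int *count) {
    for (int i = 0; i < n; i++)
        for (int j = 1; j < n; j++)
            count[i * n + j]++;
}

/* Blocked nest: ii and jj step over b-by-b blocks, i and j walk one block. */
void visit_blocked(int n, int b, int *count) {
    for (int ii = 0; ii < n; ii += b)
        for (int jj = 1; jj < n; jj += b)
            for (int i = ii; i < min_int(ii + b, n); i++)
                for (int j = jj; j < min_int(jj + b, n); j++)
                    count[i * n + j]++;
}
```

The min_int bounds clip the last block in each dimension, which is what keeps partial blocks at the edge of the iteration space from running past n.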