
“Politehnica” University of Bucharest

Computer Science and Engineering Department

Compiler Optimizations for the PowerPC Architecture

- Master Thesis -

Scientific Coordinator: Prof. PhD. Eng. Irina Athanasiu

Graduate: Eng. Oana Florescu

July 2003

Table of Contents

Table of Contents

Abstract

1 Introduction

2 Previous Work

3 PowerPC Architecture
   3.1 Superscalar Architectures
   3.2 PowerPC Architecture Overview

4 PowerPC Compiler Optimizations
   4.1 Code Generation Issues
   4.2 Register Pressure Sensitive Instruction Scheduling
   4.3 Scheduler-Sensitive Register Allocation

5 Results

6 Conclusions

7 References

Abstract
This paper presents optimizations brought to the compiler for the PowerPC
architecture, concerning both register allocation and instruction scheduling. The
interference graph built by the register allocator has been modified to contain
information that prevents the allocator from introducing false dependencies. In turn,
the scheduler is aware of the limit on the number of available registers and schedules
instructions according to that limit. The method presented in this paper increases the
possibility of obtaining the maximum instruction-level parallelism for an application.

1 Introduction
Supercomputers nowadays use pipelined, superscalar, or VLIW machines for their node
processors, allowing applications to execute faster if the code generated by the
compiler exploits parallelism at both the multi-processor and the instruction level. These
machines achieve increased performance by overlapping the execution of low-level
machine instructions such as memory loads and stores, and integer and floating-point
operations. In order to exploit the instruction-level parallelism available in programs and
utilize the full potential of the node processors, the compiler includes an instruction
scheduling phase, which rearranges the code sequence to hide latencies and reduce
possible run-time delays. This stage of compiling is often constrained by a register
allocation phase that may restrict the possibility of scheduling many instructions in
parallel.

The goal of this work is to provide optimizations for the current implementation of the
compiler for the PowerPC architecture, in order to increase the possibility of exploiting
the instruction-level parallelism available in a program. PowerPC is a family of
microprocessors with a superscalar architecture. The optimizations of the compiler
take into consideration both register allocation and instruction scheduling. The
scheduling takes place in two stages, one before and one after register allocation, and the
heuristics implemented ensure that the register allocation does not decrease the
parallelism embedded in the program. The effects of the approach proposed here are
presented on the EEMBC benchmark suite.

Section 2 presents an overview of previous work in this research area, and Section 3
provides a brief description of the PowerPC architecture. Section 4 then presents the
optimizations brought to the compiler. Experimental results and conclusions are
presented in Sections 5 and 6, respectively.

2 Previous Work
Many compilers for RISCs have performed register allocation and instruction scheduling
separately, and some still do, with each phase ignorant of the requirements of the
other. In reality each can impose constraints on the other, sometimes producing
inefficient code.

Code inefficiencies are caused by an excess of memory references or pipeline delays.


When register allocation is performed before instruction scheduling (the Postpass
Scheduling strategy), the same physical register may be assigned to independent
expression temporaries, reducing the potential instruction-level parallelism. This, in turn,
decreases the scheduler's ability to mask pipeline delays. When scheduling precedes
register allocation, the number of simultaneously live values may be increased, creating
more spills to memory.

Several strategies have been proposed to resolve this issue. Goodman and Hsu in [7]
developed an approach called Integrated Prepass Scheduling (IPS). The instruction
scheduler is executed before the local register allocation, but it is constrained by a limit
on the number of registers it has available and thus oscillates in its heuristics for
scheduling based on whether the current number of live variables has reached the register
limit.

Bradlee, Eggers and Henry in [3] developed a slightly modified version of IPS in which
they calculate the register limit in a different way, replace the local allocator with a global
allocator, and invoke the scheduler again after allocation in order to better schedule spill
code. They also developed a more integrated approach called Register Allocation with
Schedule Estimates (RASE). A prescheduling phase is run in which the scheduler is
invoked twice in order to compute cost estimates for guiding register allocation. A global
register allocator then uses the cost estimates and spill costs to obtain an allocation and to
determine a limit on the number of local registers for each block. A final scheduler is run
using the register limit from allocation and inserting spill code as it schedules; this final
pass is complicated by having to perform local register allocation simultaneously with
instruction scheduling.

Another strategy, proposed in [10], provides a framework in which considerations of
register allocation and instruction scheduling can be applied uniformly and
simultaneously. Pinter has described a cooperative approach based on building a parallel
interference graph and some heuristics for trading off between scheduling and register
spilling.

The strategy implemented in the PowerPC compiler uses two-phase scheduling (the
same instruction scheduling algorithm is applied twice) with a global register allocation
applied between the two phases. The compiler provides a basic-block scheduler, which
reorganizes the instructions to reduce pipeline delays. The first phase of scheduling,
performed before register allocation, assumes an infinite number of registers and
rearranges instructions to minimize the execution time. The disadvantage of this step is
that it implies longer register lifetimes, because it does not take into consideration a limit
on the number of live registers. The second step of scheduling, performed after register
allocation, is introduced to ensure that spill code is scheduled as well as possible. The
implementation of the register allocator has to deal with the data dependencies and
register lifetimes introduced by the first step of scheduling. The new instruction order is
used to compute the interferences. Because of the limited number of registers and their
long lifetimes, in many cases the allocator has to spill a node of the interference graph.
Another disadvantage is the false dependencies that the allocator might introduce between
instructions, due to the allocation policy used (first-fit) and to reaching the limit of
free registers.

3 PowerPC Architecture

3.1 Superscalar Architectures

Instruction-level Parallelism (ILP) is a family of processor and compiler design
techniques that speed up execution by causing individual machine operations, such as
memory loads and stores, integer additions and floating-point multiplications, to execute
in parallel. The operations involved are normal RISC-style operations, and the system is
handed a single program written with a sequential processor in mind.

In order to achieve parallel execution of multiple RISC instructions, ILP machines are
composed of multiple, independent, possibly pipelined functional units that communicate
through a local memory space such as a register file. The number and types of
instructions that can be issued for execution are limited by the number and types of
available functional units. In practice, constraints such as inter-instruction data or control
dependencies, resource conflicts, etc., might further restrict the number of choices for
parallel instruction issue.

Unlike most other types of parallel computation, this type of parallel processing is
transparent to the user (the software developer), and hence any advances in processor or
compiler techniques to increase effective parallelism provide automatic benefits to
the user without them even being aware of it. Automatically and transparently improving
application performance is tremendously appealing, and consequently a large amount of
research has gone into ILP processing.

ILP research has led to two main styles of architectures: Very Long Instruction Word
(VLIW) and Superscalar. The code for a VLIW processor is an explicit plan, which
specifies the set of operations to be issued on each machine cycle, which functional units
to use to execute them and which registers to use as sources and sinks for input and
output operands respectively for operations performed on those functional units. A
superscalar processor is one that issues multiple independent instructions into separate
execution units, allowing parallel execution. A typical superscalar processor receives a
sequential stream of instructions representing the program. It examines a window of
instructions from the stream, analyzes the chosen instructions for inter-instruction
dependencies (and also checks for dependencies with instructions that have already been
issued and not completed) and performs scheduling and resource allocation with the aim
of extracting as much ILP as possible while processing those instructions.

For a superscalar architecture, after the front-end processing and basic compiler
optimizations are done, all the ILP processing steps are performed by the processor. The
compiler translates the source code of a program into instruction-level, register-based
intermediate code in which an infinite number of symbolic registers is assumed (one
symbolic register per value). After that, the compiler passes the program through phases
like register allocation, to map the symbolic registers onto physical ones, and instruction
scheduling, to rearrange instructions in the sequential representation of the program in
order to help the processor minimize the execution time. When executing the code, the
superscalar processor issues as many instructions as possible, subject to both
machine-related constraints and instruction dependencies.

3.2 PowerPC Architecture Overview


The PowerPC architecture, developed jointly by Motorola, IBM and Apple Computer,
takes advantage of technological advances in areas such as processor technology, compiler
design and reduced instruction set computing (RISC) microprocessor design to provide
software compatibility across a diverse family of implementations, primarily single-chip
microprocessors, intended for a wide range of systems, including multiprocessing,
microprocessor-based mainframes. To provide a single architecture for such a broad
assortment of processor environments, the PowerPC architecture is both flexible and
scalable ([9]).

Some of the features specified by the PowerPC architecture and interesting for the
allocation and scheduling algorithms are the following:
• Separate 32-entry register files for integer and floating-point instructions. The
general-purpose registers (GPRs) hold source data for integer arithmetic
instructions, and floating-point registers (FPRs) hold source and target data for
floating-point arithmetic instructions

• Instructions for loading and storing data between the memory system and either
the FPRs or GPRs
• Uniform-length instructions to allow simplified instruction pipelining and parallel
processing instruction dispatch mechanisms
• Nondestructive use of registers for arithmetic instructions, in which the second,
third, and sometimes fourth operands typically specify source registers for
calculations whose results are stored in the target register specified by
the first operand
• Floating-point support that includes IEEE-754 floating-point standard
• The ability to perform both single- and double-precision floating-point operations
• Special instructions for speculatively loading data before it is needed, reducing
the effect of memory latency on instruction throughput
• Definition of a memory model that allows weakly-ordered memory accesses. This
allows bus operations to be reordered dynamically, which improves overall
performance and in particular reduces the effect of memory latency on instruction
throughput

Most of the implementations of the PowerPC architecture are pipelined, superscalar
processors with parallel execution units that allow instructions to execute out of order but
complete in order. Pipelining breaks instruction processing into discrete stages, so
multiple instructions in an instruction sequence can occupy the successive stages: as an
instruction completes one stage, it passes to the next stage, leaving the previous stage
available to a subsequent instruction. So, even though it may take multiple cycles for a
single instruction to pass through all of the pipeline stages, once a pipeline is full,
instructions complete at intervals far shorter than the latency of a single instruction. The
rate at which instructions complete is known as the throughput.

The common pipeline stages for all PowerPC implementations are:


• fetch - includes the clock cycles necessary to request an instruction and the time
the memory system takes to respond to the request. Depending on the
implementation, multiple instructions are sent from the instruction cache to the
instruction queue every cycle.
• decode - fully decodes each instruction
• issue - logic associated with the issue stage determines when an instruction can be
dispatched to the appropriate execution unit. At the end of the instruction issue
stage, the instruction and its operands are latched into the appropriate execution
unit’s reservation station.
• execute - the execution unit executes the instruction (perhaps over multiple
cycles). At the end of the execute stage, the execution unit writes the results into
the appropriate temporary buffers, called rename registers, and notifies the
completion stage that the instruction has finished execution. The instruction is
then placed in the completion queue.
• complete - completion logic tracks which instructions have completed execution
and can be retired. The completion stage ends by retiring the instruction from the
completion queue.
• write-back - results from completed instructions are written from the rename
registers to the appropriate architectural registers

The complete and write-back stages maintain the correct architectural machine state and
commit results to the architected registers in the proper order. The number of instructions
that can be retired per clock cycle depends on the implementation.

Each implementation of the PowerPC architecture has at least the following functional
units: a load/store unit (LSU), an integer unit (IU) (most of the recent implementations have
two or more IUs) and a branch-processing unit (BPU). Some PowerPC processors
also implement a floating-point unit (FPU) and/or a vector-processing unit (VPU). The
advantage of this organization of the execution units is that instructions that perform
memory accesses are separated from those that operate only on registers, and
instructions are distributed to different execution units according to the type of operands
that they use (integer, floating-point, vector).

A special case is the branch-processing unit. The BPU receives branch instructions and
executes them early in the pipeline, achieving the effect of a zero-cycle branch in some
cases. The penalties associated with changes of flow control are minimized by
performance features such as static and dynamic branch prediction. Timing for branch
instruction execution is affected by whether the following occur: the branch is taken, the
target instruction stream is in the on-chip cache, the branch is predicted, the prediction is
correct.

The total execution time of an application running on a PowerPC processor depends on
the possibility of running as many instructions as possible in parallel. This possibility is
determined by the number of instructions the processor can issue in a clock cycle, the
number of execution units implemented, the length of the pipeline and the dependencies
among instructions.

4 PowerPC Compiler Optimizations

4.1 Code Generation Issues

The goal of a compiler is to efficiently translate the code of an application into a target
machine language, preserving the semantics of the program. The PowerPC compiler
translates a C/C++ application into PowerPC assembly language. As shown in the
previous section, PowerPC processors are superscalar, thus they allow the issue and
execution of multiple instructions in parallel.

The efficiency of the generated code of an application for a PowerPC processor is a
function of its execution time. To increase the speed of execution of an application, a
register allocator reduces memory references (memory accesses are much slower than
register accesses); on the other hand, an instruction scheduler reduces pipeline
delays by masking the latency of multi-cycle operations. However, reducing one may
increase the other. Register allocation may restrict the possibilities of scheduling
multiple instructions in parallel, due to the limited number of physical registers of the
machine and the false instruction dependencies that might come from the reuse of
registers. Instruction scheduling may force the register allocator to introduce spill
code (saving registers to and restoring them from memory when needed) if it schedules
instructions disregarding the number of registers that need to be live at a given moment.

Increasing the efficiency of the generated code of an application means minimizing its
total execution time, which in turn means increasing the number of instructions
executed in parallel. Due to data dependencies between instructions, most of the
time the number of instructions that the processor can execute in parallel is less than the
maximum number it can support. Therefore the two steps of the compiler - allocation and
scheduling - must exchange some information in order to obtain a
representation of the application as a sequence of instructions that allows the
processor to execute the given program with maximum speed and parallelism.

4.2 Register Pressure Sensitive Instruction Scheduling

Instruction scheduling is the process of reordering instructions so that they can be
dispatched to the different units of the processor in such a way that the total execution
time is minimized. Instructions are moved in a way that preserves the program's
semantics on one hand, and provides a better utilization of the machine on the other hand.

The constraints of the scheduler are represented by a scheduling graph, Gs = (Vs, Es),
constructed as follows:
• Every node v ∈ Vs corresponds to an instruction.
• There exists a directed edge (u, v) ∈ Es, from u to v, if u must be executed before v.

For a local scheduler, u must be executed before v if there is a data dependence from u to
v (i.e. v is data dependent on u), or there is a machine constraint that enforces the
precedence of u over v. The machine-based constraints imposed on the program may
forbid the use of the same functional unit within a short time interval (resource
contention), or the simultaneous access to the same memory address, or similar
constraints. Machine constraints that are not of precedence type are not present in the
graph as edges.

For a global scheduler control dependences might appear too. They exist only between
basic blocks where the corresponding edges are derived from the program’s flow graph
(following if branches etc.).

The scheduler implemented in the PowerPC compiler is a basic-block (local) scheduler.
The dependence graph constructed for a basic block indicates the serialization
requirements of the instructions. Each edge (u, v) of the graph is labeled with the delay
that would result if u and v were issued in sequence. If the data dependence between u
and v is a true dependence, this delay is the latency of u; for anti- and output-dependences,
the delay is usually 0 unless the target machine does not have rename
registers or the resource is a memory location. The general goal of basic-block
scheduling is to construct a topological sort of the directed acyclic graph (DAG), the
scheduling graph, which produces the same results and minimizes the execution time of
the basic block. This is an NP-hard problem, even though it has a very simple formulation,
so an effective heuristic must be sought.
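
As an illustration of such a dependence DAG, the sketch below builds the latency-labeled
edges for one basic block with a simple quadratic scan. It is only a minimal sketch under
invented names and assumptions (Instr, Edge and buildSchedulingDag are not the
compiler's actual types): memory dependences and machine constraints are not modeled,
and latencies come from a per-instruction field rather than a machine description.

#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction record for the sketch: at most one
// defined symbolic register and any number of used ones.
struct Instr {
    int def = -1;              // symbolic register defined (-1 for stores, branches, calls)
    std::vector<int> uses;     // symbolic registers read
    int latency = 1;           // result latency in cycles
};

struct Edge { int from, to, delay; };

// Build the local scheduling graph: an edge u -> v whenever v must follow u
// because of a data dependence inside the basic block. With an unbounded
// supply of symbolic registers the anti/output cases never fire before
// allocation, but the same construction is reused after allocation, where
// physical registers are redefined and those cases do occur.
std::vector<Edge> buildSchedulingDag(const std::vector<Instr>& block) {
    std::vector<Edge> edges;
    for (std::size_t v = 0; v < block.size(); ++v) {
        for (std::size_t u = 0; u < v; ++u) {
            const Instr& a = block[u];
            const Instr& b = block[v];
            bool trueDep = false, antiOrOutput = false;
            for (int r : b.uses)                       // true dependence: v reads a's result
                if (r != -1 && r == a.def) trueDep = true;
            for (int r : a.uses)                       // anti dependence: v redefines what u reads
                if (b.def != -1 && r == b.def) antiOrOutput = true;
            if (a.def != -1 && a.def == b.def)         // output dependence: same target
                antiOrOutput = true;
            if (trueDep)
                edges.push_back({(int)u, (int)v, a.latency});  // delay = latency of u
            else if (antiOrOutput)
                edges.push_back({(int)u, (int)v, 0});          // delay usually 0
        }
    }
    return edges;
}

In the real compiler, memory references and machine-specific constraints add further
restrictions, and the delays come from the processor's timing model.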

The approach chosen for reorganizing the instructions in the DAG is list scheduling. This
method first traverses the scheduling graph from the leaves to the roots, labeling each node
with the maximum possible delay from that node to the end of the block; the maximum
such delay along all paths to the end of the basic block is the critical path delay. Each node
of the graph (i.e. each instruction) is then assigned a deadline - the latest time the
instruction it represents can be issued without causing the schedule to exceed the critical
path delay - and an earliest time. The earliest time is 0 for the initial root nodes; as
instructions are scheduled, the earliest times of their successors are set to the
current time plus the delay along the edge between each instruction and each of its
successors.

By traversing the DAG again from the roots towards the leaves, nodes are selected to be
scheduled, keeping track of the current time. At each clock cycle some instructions may
be available for execution. The conditions that an instruction must fulfill in order to be
ready for scheduling (see the sketch after this list) are:
• the node corresponding to that instruction must be a root - the instruction has no
remaining predecessors, which means that all its predecessors have been scheduled and
executed by the time this instruction gets to be issued
• its earliest time has arrived
• it can be issued in this clock cycle – the functional unit to which this instruction
should be assigned is free
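
A compact sketch of this selection loop is given here, under several simplifying
assumptions: names such as Node and listSchedule are invented, deadlines and earliest
times are taken as precomputed in the way described above, each node is tied to a single
hypothetical functional unit, and ready nodes are ordered by deadline only (the full
grading actually used by the compiler is discussed in the next paragraphs).

#include <algorithm>
#include <cstddef>
#include <vector>

struct Node {
    std::vector<int> succs;      // indices of successor nodes in the DAG
    std::vector<int> delays;     // delay on the edge to the corresponding successor
    int preds = 0;               // number of not-yet-scheduled predecessors
    int earliest = 0;            // earliest cycle at which the node may issue
    int deadline = 0;            // latest issue cycle that keeps the critical path delay
    int unit = 0;                // hypothetical functional unit this node needs
};

// Walk the DAG from the roots keeping the current cycle; in every cycle issue
// the ready nodes (no unscheduled predecessors, earliest time reached, unit
// free), most urgent deadline first. Returns the chosen instruction order.
std::vector<int> listSchedule(std::vector<Node> dag, int numUnits) {
    std::vector<int> order;
    std::vector<bool> done(dag.size(), false);
    for (int cycle = 0; order.size() < dag.size(); ++cycle) {
        std::vector<bool> unitBusy(numUnits, false);
        std::vector<int> ready;
        for (std::size_t i = 0; i < dag.size(); ++i)
            if (!done[i] && dag[i].preds == 0 && dag[i].earliest <= cycle)
                ready.push_back((int)i);
        std::sort(ready.begin(), ready.end(),
                  [&](int a, int b) { return dag[a].deadline < dag[b].deadline; });
        for (int n : ready) {
            if (unitBusy[dag[n].unit]) continue;   // resource contention in this cycle
            unitBusy[dag[n].unit] = true;
            done[n] = true;
            order.push_back(n);
            for (std::size_t k = 0; k < dag[n].succs.size(); ++k) {
                Node& s = dag[dag[n].succs[k]];
                s.preds -= 1;                      // the successor loses one predecessor
                s.earliest = std::max(s.earliest, cycle + dag[n].delays[k]);
            }
        }
    }
    return order;
}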

As mentioned in the previous sections, the PowerPC compiler implements a two-phase
scheduler: one phase is executed before register coloring and the second one after it.
Each of the two phases has a particular purpose, thus they differ in the way the
instructions are chosen for scheduling.

For the first phase of scheduling, each node whose corresponding instruction can be
scheduled at a certain clock cycle is assigned a grade, computed taking into account the
following criteria:
• its deadline – the most critical factor
• the number of nodes it uncovers – a node is uncovered if it has no predecessors
and its earliest time has arrived – the second factor in importance
• the maximum delay till the end of the basic block – the third in importance
• the contribution to the register pressure introduced by the instruction – this factor
becomes the most critical when the limit on the number of available registers is
reached
• the number of units the instruction can be executed on – there are
implementations of the PowerPC architecture that have two or more integer units;
some operations can be executed on any of them, but others, such as multiply
or divide, can be executed on only one of these units.

Each of these factors contributes to the grade given to an instruction through a scaling
factor that reflects its importance. The node with the maximum grade is chosen to be
scheduled and is removed from the DAG. Using this heuristic the scheduler attempts to
restrict the number of concurrently live local symbolic registers, possibly limiting the
amount of instruction-level parallelism that can be exploited. The scheduler selects
instructions to minimize pipeline delays unless the number of live local symbolic
registers is greater than or equal to the limit. As long as the scheduler is at the limit, it
tries to schedule instructions that free registers; if it cannot, it may exceed the limit.
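
As a rough sketch only, with invented weights and names (the real compiler uses its own
scaling factors, which are not documented here), the grade of a ready node might be
combined as follows; the sign convention assumes that instructions with fewer unit
choices are slightly preferred, which is an assumption of this sketch rather than a
documented detail.

// Hypothetical per-candidate data; pressureDelta is positive when the
// instruction starts a new live range and negative when it frees registers.
struct Candidate {
    int deadline;        // smaller = more urgent (most critical factor)
    int uncovered;       // number of nodes this instruction would uncover
    int maxDelay;        // maximum delay to the end of the basic block
    int pressureDelta;   // contribution to register pressure
    int unitChoices;     // number of functional units it can execute on
};

int grade(const Candidate& c, int liveNow, int registerLimit) {
    int g = 0;
    g += 1000 * (-c.deadline);      // urgency dominates in the normal case
    g += 100 * c.uncovered;         // uncovering more nodes keeps the ready list full
    g += 10 * c.maxDelay;           // longer remaining path scheduled earlier
    g += -c.unitChoices;            // assumed: restricted instructions (e.g. multiply) go first
    // Near or at the register limit, freeing registers becomes the dominant factor.
    if (liveNow >= registerLimit)
        g += 10000 * (-c.pressureDelta);
    else
        g += -c.pressureDelta;
    return g;
}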

After this phase of scheduling the new instruction order is used by the register allocator.
After the assignment of physical registers to symbolic ones, another stage of scheduling is
performed; it ensures that the spill code that might have been inserted by the allocator is
scheduled as well as possible.

This phase of scheduling builds the DAG again, with the same information as in the first
phase. The difference between the two phases lies in the criteria used for choosing the
instruction to be scheduled. They are the same, except that the contribution to register
pressure (no longer needed after allocation) and the number of execution units the
instruction can be assigned to are dropped. The nodes no longer receive grades; the
scheduler chooses the most critical one according to the deadline, the number of
successors and the maximum delay, in this exact order of importance.

4.3 Scheduler-Sensitive Register Allocation

A very important stage of a compiler is the process of deciding which values to keep in
registers at each point in the generated code. Register allocators usually construct an
interference graph representing constraints that the allocation must preserve. Using
graph-coloring techniques, they discover a mapping from values in the procedure to
registers in the target machine. Values in registers can be accessed more quickly than
values in memory – on high-performance, microprocessor-based machines the difference
in access time can be an order of magnitude. Thus, register allocation has a strong impact
on the run-time performance of the code that a compiler generates. Because relative
memory latencies are rising while register latencies are not, the impact of allocation on
performance is increasing. With the development of processors that offer instruction level
parallelism an optimal coloring of the interference graph does not necessarily correlate
with good machine utilization. Features like superscalar instruction issue increase a
program’s absolute demand for registers – if the machine issues two instructions in a
single cycle, it must have two sets of operands ready in place at the start of the cycle.
This naturally increases the demand for registers.

A register allocator determines the contents of the limited number of hardware registers
during the course of program execution. Register allocation is usually performed at the
end of global optimization, when the final structure of the code to be emitted has been
determined and all potential register usages are exposed. The register allocator attempts
to map the registers in such a way as to minimize the number of memory references.
Register allocation may lower the instruction count and may reduce the execution time
per instruction, by changing memory operands to register operands. These improvements
reduce code size, which may lead to other, secondary improvements.

Register allocation plays a significant role in determining the effectiveness of other
optimizations. Many optimizations create temporaries, and the improvement from the
optimization depends on the cost of access to the temporaries. For example, in common
sub-expression elimination, if the sub-expression needs to be stored in memory and
retrieved, the optimization may buy little. In some cases, it may actually slow the
program down.

All previous research has shown that global register allocation corresponds closely to the
graph-coloring problem. A coloring of a graph is an assignment of a color to each node of
the graph in such a manner that any two nodes connected by an edge do not have the same
color. For register allocation the graph, called the interference or conflict graph, is
constructed from the program. Each node in the interference graph represents a live range
of a program data value that is a candidate to reside in a register. Informally, the live range
is a collection of basic blocks where a particular definition of that variable is live. Two
nodes in the graph are connected if the two data values corresponding to those nodes
interfere with each other in such a way that they must not reside in the same register. In
coloring the interference graph, the number of colors used corresponds to the
number of registers available for register allocation. A register allocator wants to
find an assignment of the program variables to registers that minimizes the total
execution time.

The interference graph is defined as Gr = (Vr, Er) where:
• Every node v ∈ Vr represents a live range of a symbolic register. A live range is
the union of the regions in the program in which the symbolic register is live.
Typically it is a connected component that begins at definition (assignment)
points and terminates at last uses.
• There exists an (undirected) edge (u, v) ∈ Er if and only if live ranges u and v
interfere with each other; that is, a definition of one range occurs at a point where
the other range is live.

In the PowerPC compiler an end point of the live range of a symbolic register (i.e. the
statement corresponding to its last use) is not considered part of the range; this enables the
reuse (redefinition) of a register in the same statement that last uses it (e.g. the increment
of a register).
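
To make this live-range convention concrete, the following is a minimal sketch (not the
compiler's implementation) of how interference edges could be collected for a single
basic block by a backward walk over the instructions; Instr, buildInterferences and the
live-out parameter are names invented for the sketch, and live ranges are approximated
exactly as above, with the last use excluded from the range.

#include <algorithm>
#include <set>
#include <utility>
#include <vector>

// Hypothetical, simplified instruction record used only for this sketch.
struct Instr {
    int def = -1;            // symbolic register defined (-1 if none)
    std::vector<int> uses;   // symbolic registers read
};

// Backward walk keeping the set of live symbolic registers: a register becomes
// live at its uses and dies at its definition. Because the defined register is
// removed from the live set before edges are recorded, a value defined in the
// same instruction that last uses another one does not interfere with it,
// matching the convention described in the text.
std::set<std::pair<int, int>> buildInterferences(const std::vector<Instr>& block,
                                                 std::set<int> liveOut) {
    std::set<std::pair<int, int>> edges;
    std::set<int> live = std::move(liveOut);   // live-out symbols of the block
    for (auto it = block.rbegin(); it != block.rend(); ++it) {
        if (it->def != -1) {
            live.erase(it->def);               // not live above its definition
            for (int r : live)                 // interferes with everything live here
                edges.insert({std::min(r, it->def), std::max(r, it->def)});
        }
        for (int r : it->uses)                 // uses extend liveness upwards
            if (r != -1)
                live.insert(r);
    }
    return edges;
}

For the example in Listing 1 below, with live-out {s1, s4}, this walk produces the edges
(s1, s2), (s1, s3) and (s1, s4), which is consistent with the interference graph of Figure 5.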

Here is a simple example to demonstrate the ideas implemented in the compiler in order
to improve the register coloring algorithm. Consider the execution of the code in Listing
1 on a PowerPC processor with a load/store unit (LSU), an integer unit to execute simple
arithmetic operations (IU1) and an integer unit for the execution of complex operations
like multiply and division (IU2), where si’s represent symbolic registers.

alive in symbols = ∅
......
I1: lwz s1, x -> LSU
I2: lwz s2, y -> LSU
I3: add s3, s1, s2 -> IU1
I4: stw s3, z -> LSU
I5: mullw s4, s1, s1 -> IU2
......
alive out symbols = {s1, s4}

Listing 1 A simple example

Figure 1 Scheduling Graph (diagram with nodes I1-I5; not reproduced)

The program's scheduling graph is shown in Figure 1. Let Gs be the scheduling graph.
The edges of this graph signify dependencies between instructions. Each node of Gs
corresponds to an instruction (all instructions, except for stores, branches and calls,
introduce a symbolic register definition). At this point we have all the precedence-type
constraints.

Starting from Gs another graph is generated, comprising the false dependencies in the
program. Consider that Gs is the scheduling graph for a basic block before register
allocation. Gs = (Vs, Es) contains the instructions expressed with symbolic registers, and
all the precedence-based constraints are drawn. The false dependence undirected graph
Gf = (Vf, Ef) is defined as follows:

• Vf = Vs
• Compute the set of edges in the transitive closure of Gs and define Et to be this set
after removing the directions of the edges. To Et are added all the non-precedence-based
constraints that describe the restrictions on the processor capabilities. Ef then
contains the pairs (u, v) such that u, v ∈ Vf, u ≠ v, (u, v) ∉ Et (a small code sketch of
this construction is given below).
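
The sketch below follows this definition under some illustrative assumptions: the
scheduling graph is given as adjacency lists, the non-precedence machine constraints are
supplied as an explicit set of undirected pairs, and the names EdgeSet, machinePairs and
falseDependenceEdges are invented for the example.

#include <algorithm>
#include <set>
#include <utility>
#include <vector>

using EdgeSet = std::set<std::pair<int, int>>;   // undirected edges stored as (min, max)

static void addUndirected(EdgeSet& s, int a, int b) {
    if (a != b) s.insert({std::min(a, b), std::max(a, b)});
}

// 'succs' is the scheduling graph Gs as adjacency lists over instruction
// indices; 'machinePairs' holds the non-precedence machine constraints
// (e.g. two loads competing for the single LSU), also as (min, max) pairs.
EdgeSet falseDependenceEdges(const std::vector<std::vector<int>>& succs,
                             const EdgeSet& machinePairs) {
    int n = (int)succs.size();
    // Transitive closure of Gs (simple O(n^3) propagation).
    std::vector<std::vector<bool>> reach(n, std::vector<bool>(n, false));
    for (int u = 0; u < n; ++u)
        for (int v : succs[u]) reach[u][v] = true;
    for (int k = 0; k < n; ++k)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
                if (reach[i][k] && reach[k][j]) reach[i][j] = true;
    // Et = undirected closure edges plus the machine-related constraints.
    EdgeSet et = machinePairs;
    for (int u = 0; u < n; ++u)
        for (int v = 0; v < n; ++v)
            if (reach[u][v]) addUndirected(et, u, v);
    // Ef = complement of Et over all pairs of distinct nodes.
    EdgeSet ef;
    for (int u = 0; u < n; ++u)
        for (int v = u + 1; v < n; ++v)
            if (!et.count({u, v})) ef.insert({u, v});
    return ef;
}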

For the example in Listing 1, the transitive closure of the scheduling graph is presented in
Figure 2.

Figure 2 Transitive closure of the scheduling graph (diagram with nodes I1-I5; not reproduced)

After removing the directions of the edges, an augmentation of this graph is generated
by adding edges between instructions that cannot be executed in parallel
due to machine constraints (e.g. there is only one LSU, so two load instructions cannot
run in parallel). These are machine-related constraints that are not of a precedence type.
The undirected augmented graph is presented in Figure 3.

Figure 3 Augmented scheduling graph (diagram with nodes I1-I5; not reproduced)

Neither I1 and I2, nor I2 and I4, nor I1 and I4 can be executed simultaneously, so these
constraints are represented by the edges (I1, I2), (I2, I4) and (I1, I4), the last
one already present in the graph because of the data dependence resulting from the
transitive closure. In the scheduling algorithm edges like these are not present in the
graph; instead the algorithm itself takes them into consideration, for example by trying to
schedule an operation on the integer unit while an operation is scheduled on the
load/store unit.

Note that the more edges are present in this graph, the better the results will be; this is
because only the edges in the complement of the constructed graph are actually used.
The edges in the complement graph (which is the false dependence graph) represent
the actual parallelism available to the machine for the given program (see
Figure 4).

Consider Gs' the scheduling graph after register allocation. An edge (u, v) in Gs' is a false
dependence if there was a potential to schedule u and v simultaneously prior to register
allocation (there were no edges between u and v in Gs or in the augmented Gs). The edges in
the false dependence graph will be used in the generation of the graph that helps
register allocation produce an optimal coloring for a superscalar machine.

Figure 4 False dependence graph (diagram with nodes I1-I5; not reproduced)

The scheduling graph Gs of a basic block has no output or anti-data dependence edges
(because with symbolic registers no register is redefined during the execution of the basic
block). Due to this observation, the set Et defined previously contains exactly the real
constraints on the scheduler.

Output and anti-data dependencies are generated when a memory location (or a register)
is redefined. When working with an unbounded number of symbolic registers, the
compiler does not introduce any redefinition, so these kinds of data dependencies do
not restrict parallelism. This observation leads to the idea that if two definitions
corresponding to a false dependence edge are kept in two different registers, then that
dependence will not occur anymore (the edge will not be present in Gs', the scheduling
graph after allocation).

In the interference graph of a basic block (or more generally of a program) each node
represents a definition (assignment of a value). It marks the beginning of a live range and
also potentially defines a new symbolic register. A node actually corresponds to an
instruction (except for branches and calls); thus each node in the interference graph
corresponds to a node in the scheduling graph.

For the interference graph Gr = (Vr, Er) the following remarks hold:
• Vr ⊆ Vs
• In general there may be edges in Es whose undirected counterparts are not in Er,
and vice versa. For example, an interference edge may be present in Er because of
the sequential ordering of the code and yet not be present in Es, because the
instructions do not depend on one another.

Figure 5 Interference graph (diagram with nodes s1-s4; not reproduced)

Live ranges are approximated with def-use chains that are computed for a given
sequential ordering of the code, the one obtained after the first phase of scheduling and
just before register coloring. A data flow dependence edge (u, v) ∈ Es may not appear in
Er if it corresponds to the last use of u, which is not considered part of the live range of the
symbolic register (e.g. (s2, s3), which belongs to the scheduling graph in Figure 1 but not
to the interference graph in Figure 5). On the other hand, machine-related constraints
might also not be present in Er.

It is clear from the above that the interference graph depends on the order of instructions
in the input code. The problem with classical register allocation, which considers only the
interference graph, is that on a superscalar processor it may restrict the possibility of
parallel execution for two instructions that are not connected in the scheduling graph, by
introducing a false dependence between them due to the allocation policy (which in this
case is first-fit) and to reaching the limit on the number of available physical registers.
This happens when the first instruction contains the last use of a symbolic register and the
second instruction defines a symbolic register that is assigned the same physical register
as that last-used one, even though the two instructions do not depend on each other. The
challenge was to avoid this situation by adding more constraints when choosing the color
for a symbolic register, based on the edges in the false dependence graph. This translates
into an augmentation of the interference graph with edges (u, v) that signify that u and v
should not have the same color (should not be assigned to the same physical register). If
they receive different colors, the scheduler can still schedule them to be executed in
parallel if that is possible (e.g. within the number of instructions issued per cycle).

Figure 6 Augmented interference graph (diagram with nodes s1-s4; not reproduced)

For each last use of a symbolic register the algorithm searches for the definitions that
follow it and connects with edges in the interference graph those nodes that should not
have the same color (in order to not introduce a false dependence) (see Figure 6).

In the actual implementation of this idea in the PowerPC compiler, the interference graph
is not actually modified. The register allocator records, for each particular symbolic
register, which colors should not be assigned to it. The allocator also keeps information
about the color that best fits each node: if an instruction contains both a last use and a
definition, then it is preferred for the newly defined symbolic register to receive the
physical register of the last use. This idea is helpful because the policy implemented for
choosing a color from a set is first-fit – it always chooses the first free register. In this
case a register can be safely reused, because there is no danger of introducing a
dependence.

For each node taken from the coloring stack, the color (the physical register) is chosen by
first checking whether there is a preferred color (the case of safely reusing a register), and
then by searching in the complement of the list of colors that, for parallelism reasons, had
better be avoided (so as not to introduce false dependences). If there are too many
constraints, so that no such color can be assigned, the coloring algorithm chooses one of
the available colors, disregarding the possibility of introducing false dependences.

This method was preferred in order to avoid useless spilling. It is possible that, according
to both the list of forbidden colors (registers still alive) and the list of colors that should
be avoided (due to false dependences), a symbolic register has no free color. In this case
one of the colors in the complement of the forbidden list must be chosen, to avoid the
insertion of spill code (see the sketch below).
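
A minimal sketch of this color choice is shown next; chooseColor and its parameters are
illustrative names only, with forbidden marking colors already taken by interfering live
ranges, avoid marking colors that would introduce a false dependence, and preferred
holding the register of a last use in the defining instruction (or -1 if there is none).

#include <vector>

// First-fit color choice with a preference and an avoidance list, following
// the three steps described in the text above.
int chooseColor(int numColors,
                const std::vector<bool>& forbidden,
                const std::vector<bool>& avoid,
                int preferred) {
    // 1. Safe reuse: a last use and a definition in the same instruction.
    if (preferred >= 0 && !forbidden[preferred])
        return preferred;
    // 2. First fit among colors that are neither forbidden nor to be avoided.
    for (int c = 0; c < numColors; ++c)
        if (!forbidden[c] && !avoid[c])
            return c;
    // 3. Too many constraints: ignore the avoidance list rather than spill.
    for (int c = 0; c < numColors; ++c)
        if (!forbidden[c])
            return c;
    return -1;   // no free color at all: the allocator must spill
}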

According to the modifications made to the interference graph in Figure 6, for the
example in Listing 1 a possible register allocation using three colors is given in Listing 2.

......
I1: lwz r1, x
I2: lwz r2, y
I3: add r2, r1, r2 -> reuse of the color of s2 which became dead
I4: stw r2, z -> s3 becomes dead, its color could be reused
I5: mullw r3, r1, r1 -> r3 is preferred as color for s4 in order to
make possible the scheduling in parallel of I4 and I5
......

Listing 2 Possible register allocation

The modifications brought to the compiler ensure a better coloring of the interference
graph, provide an improved register allocation and have the property that false
dependencies are avoided. Thus all the options for parallelism are kept for the scheduler
to handle.

5 Results
In order to test the efficiency of the optimizations presented in the previous section and
implemented in the compiler for the PowerPC architecture, the EEMBC benchmark suite
was chosen for the tests ([5]).

Table 1 presents the speedup obtained for some applications from EEMBC, compiled
with the PowerPC compiler having only the instruction scheduler modified so as to be
register pressure sensitive. The gain obtained for the execution time is not very high.
Actually, this optimization alone helps only for pieces of code for which the initial
scheduler assumed a large number of live registers in order to run in minimum time.
With the initial implementation of the scheduler, a lot of spill code was inserted for such
parts of an application, so the actual execution time was longer than estimated when
scheduling without regard to the number of available registers.

As can be seen from Table 1, the Response to Remote Request benchmark contains
parts of this kind of code. The modification of the instruction scheduler brings an
improvement of 1.03% for the execution time of this application, because less spill code
is inserted into the generated code by the register allocation algorithm.

Benchmark                     Original (Iterations/Sec)   Scheduler Heuristic (Iterations/Sec)   Improvement (%)
Response to Remote Request    6791171.48                  6861063.46                             1.03%
Consumer RGB to YIQ           446.23                      447.23                                 0.22%
OSPF Benchmark                11783.19                    11783.19                               0.00%
Packet Flow Benchmark         1930.87                     1934.98                                0.21%
Viterbi Decoder Benchmark     4909.98                     4909.98                                0.00%

Table 1 Improvement obtained with register pressure sensitive instruction scheduling

The other applications in Table 1 show little or no execution speed increase for this
improvement of the compiler, because there were no cases in which the initial scheduler
was free to schedule many instructions and exceed the limit of available registers.

When the same applications are compiled with another version of the PowerPC compiler,
this time embedding the new approach for register allocation but leaving the
instruction scheduler in its original implementation, the results are those in
Table 2. This test was made in order to evaluate the impact of the new register allocation
algorithm alone.

Benchmark                     Original (Iterations/Sec)   Register Allocation (Iterations/Sec)   Improvement (%)
Response to Remote Request    6791171.48                  6837606.84                             0.68%
Consumer RGB to YIQ           446.23                      470.59                                 5.46%
OSPF Benchmark                11783.19                    11933.17                               1.27%
Packet Flow Benchmark         1930.87                     1984.13                                2.76%
Viterbi Decoder Benchmark     4909.98                     4966.89                                1.16%

Table 2 Improvement obtained with scheduler sensitive register allocation

The results show a great improvement for most of the applications, except for the
Response to Remote Request benchmark.

The greatest improvement was obtained for the Consumer RGB to YIQ benchmark. This
shows that for this application a lot of time was wasted because instruction scheduling
was limited by the false dependencies introduced while allocating physical registers.

As far as the Response to Remote Request benchmark is concerned, its problem is clearly
the spill code inserted because of bad scheduling, so the improvement of the allocation
algorithm alone could not help very much.

When both scheduler-sensitive register allocation and register-pressure-sensitive
instruction scheduling are enabled in the PowerPC compiler and tested on the EEMBC
benchmark suite, the results are those in Table 3.

The best results are obtained again for the Consumer RGB to YIQ benchmark. The speedup
is a result of the contribution of both the scheduler and the allocator to code generation. For
the other applications it is easy to see that only one of the two algorithms had the stronger
impact.

Benchmark                     Original (Iterations/Sec)   Final (Iterations/Sec)   Improvement (%)
Response to Remote Request    6791171.48                  6861063.46               1.03%
Consumer RGB to YIQ           446.23                      473.48                   6.11%
OSPF Benchmark                11783.19                    11933.17                 1.27%
Packet Flow Benchmark         1930.87                     1984.13                  2.76%
Viterbi Decoder Benchmark     4909.98                     4966.89                  1.16%

Table 3 Overall improvement

Figure 7 Response to Remote Request benchmark evolution (chart not reproduced; iterations/sec for the Original, Scheduler, Register Allocation and Final compiler versions)



Two charts are provided (Figure 7 and Figure 8) in order to illustrate the statistics
presented in the tables above. The evolutions of the two benchmarks discussed above
were chosen for these charts.

Figure 8 Consumer RGB to YIQ benchmark evolution (chart not reproduced; iterations/sec for the Original, Scheduler, Register Allocation and Final compiler versions)

The importance of each of the optimizations brought to the compiler is evident.
Depending on the nature of the code of an application, at least one of the algorithms
causes some speedup.

6 Conclusions
The desire to allocate a small number of registers to be used at any point during the
execution of a program is driven both by the limited number of registers and by the fact
that temporarily saving register values to memory is time consuming. With many
functional units competing for registers (even more so with deep pipelining, as in some
implementations of the PowerPC architecture) and with optimizing techniques like loop
unrolling, careful allocation of registers is still a central issue for current technology.

The possibilities of allocating registers are driven by the scheduling and allocation
algorithms, but also by other optimizations that have an impact in this matter. The present
work evaluates the consequences of some modifications brought to instruction scheduling
and register allocation in order to exchange information between them. The results show
that by making these two stages of the compiler aware of each other, the efficiency of the
generated code is improved. Depending on the nature of the code of an application, the
improvement obtained for the generated code is more or less significant.

7 References
1. Prof. PhD. Eng. Irina Athanasiu – Compiler Design Course, 2001.

2. D.G. Bradlee. "Retargetable Instruction Scheduling for Pipelined Processors". PhD
Thesis, University of Washington, 1991.

3. D.G. Bradlee, S.J. Eggers, R.R. Henry. "Integrated Register Allocation and
Instruction Scheduling for RISCs". In The 4th International Conference on
Architectural Support for Programming Languages and Operating Systems, pages
122-131, April 1991.

4. F.C. Chow, J.L. Hennessy. "The Priority-Based Coloring Approach to Register
Allocation". SIGPLAN Notices, 19(6), 1984.

5. EEMBC Benchmark Suite – www.eembc.org.

6. P.B. Gibbons, S.S. Muchnick. "Efficient Instruction Scheduling for a Pipelined
Architecture". In Proceedings of the SIGPLAN '86 Symposium on Compiler
Construction, volume 21, pages 11-16, July 1986.

7. J.R. Goodman, W. Hsu. "Code Scheduling and Register Allocation in Large Basic
Blocks". In International Conference on Supercomputing, pages 442-452, ACM,
July 1988.

8. S.S. Muchnick. "Advanced Compiler Design and Implementation". Morgan
Kaufmann Publishers, 1997.

9. PowerPC Microprocessor Family: "The Programming Environments for 32-Bit
Microprocessors". Motorola Inc., 1997.

10. S.S. Pinter. "Register Allocation with Instruction Scheduling: a New Approach".
In SIGPLAN Conference on Programming Language Design and Implementation,
1993.

11. S. Talla. "Adaptive Explicitly Parallel Instruction Computing". PhD Thesis, New
York University, 2000.
