
UNIT : III

SYSTEM
PROGRAMMING
II SEMESTER (MCSE 201)
PREPARED BY ARUN PRATAP SINGH

STRUCTURE OF PARALLELIZING COMPILER :
Two approaches to compilation:
- Parallelize a program manually (sequential code converted to parallel code by hand)
- Develop a parallelizing compiler
Main steps in a parallelizing compiler:
- Intermediate form
- Partitioning (block based or loop based)
- Placement
- Routing
Parallel compilers help accelerate the software development process by compiling code quickly and
efficiently. The compiler for Warp, a systolic array, is a cross compiler: it executes on
workstations and produces optimized code for each of the processing elements of the systolic
array. The target of the compiler is a multi-computer; each of the processors in the array
contains local program and data memories. Quite frequently an application program for the
Warp array contains different programs for different processing elements. Due to its high
communication bandwidth, Warp is a good host for pipelined computations where different
phases of the computation are mapped onto different processors. This usage model
encourages the design of algorithms which result in different programs for the individual
processors, and this observation initiated the investigation into parallel compilation, since the
program for each processor can be compiled independently. One of the goals is to assess the
practical speedup.
Source programs :
The structure of the programming language for Warp reflects the underlying architecture of the
machine: A Warp program is a module that consists of one or more section programs which
describe the computation for a section (group) of processing elements. Section programs
contain one or more functions and execute in parallel. Because section programs execute
independently, optimizations performed within one section program are independent of those
performed within another.
Architecture of the parallel compiler :
The compiler is implemented in Common Lisp, and the sequential compiler runs as a Common
Lisp process on a single SUN workstation under the UNIX operating system. The compiler
consists of four phases:
Phase 1: parsing and semantic checking

Phase 2: construction of the flow graph, local optimization, and computation of global
dependencies
Phase 3: software pipelining and code generation
Phase 4: generation of I/O driver code, assembly and post-processing (linking, format
conversion for download modules, etc.)
Both the first and the last phases are cheap compared to the optimization and code generation
phases.
Furthermore, for the current implementation, these phases require global information that
depends on all functions in a section. For example, to discover a type mismatch between a
function return value and its use at a call site, the semantic checker has to process the complete
section program. We therefore decided to only parallelize optimization and code generation.
Parsing, semantic checking and assembly are performed sequentially. The structure of the
parallel compiler then reflects the structure of the source programs. There exists a hierarchy of
processes; the levels of the hierarchy are master level, section level and function level. The
tasks of each level are as follows:
Master level : The master level consists of exactly one process, the master, which controls the
entire compilation. The master process is a C process; it invokes a Common Lisp process that
parses the Warp program to obtain enough information to set up the parallel compilation. Thus,
the master knows the structure of the program and therefore the total number of processes
involved in one compilation. As a side effect, if there are any syntax or semantic errors in the
program, they are discovered at this time and the compilation is aborted. Once the master has
finished, the compilation is complete. The master is invoked by the user.
Section level : Processes on the section level are called section masters, and there is one
section master for each section in the source program. Section masters are C processes.
Section masters are invoked by the master, which also supplies parse information so that the
section masters know how many functions are in the section. A section master controls the
processes that perform code generation for the functions within its section. When code has
been generated for each function of the section, the section master combines the results so
that the parallel compiler produces the same input for the assembly phase as the sequential
compiler. Furthermore, the section master process is responsible for combining the diagnostic
output that was generated during the compilation of the functions. After this is done, the section
master terminates.
Function level : The number of processes on the function level, called function masters, is
equal to the total number of functions in the program. Function masters are Common Lisp
processes. The task of a function master is to implement phases 2 and 3 of the compiler, that
is to optimize and generate code for one function.
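The hierarchy can be pictured with a small Python sketch (purely illustrative: the real Warp compiler uses C and Common Lisp processes, and parse_program and compile_function below are hypothetical stand-ins, not part of any real compiler):

from multiprocessing import Pool

def parse_program(source):
    # Stand-in for phase 1 (parsing and semantic checking):
    # returns {section name: [function names]} describing the program structure.
    return {"section0": ["f0", "f1"], "section1": ["g0"]}

def compile_function(job):
    section, function = job
    # Stand-in for phases 2 and 3: optimize and generate code for one function.
    return f"<code for {section}.{function}>"

def section_master(section, functions):
    # One section master per section: one function master per function,
    # then combine the results so they look like sequential-compiler output.
    with Pool(processes=len(functions)) as pool:
        parts = pool.map(compile_function, [(section, f) for f in functions])
    return "\n".join(parts)

def master(source):
    sections = parse_program(source)            # phase 1, sequential
    objects = {s: section_master(s, fns) for s, fns in sections.items()}
    # Phase 4 (assembly, linking, download formatting) would follow, sequentially.
    return objects

if __name__ == "__main__":
    print(master("dummy Warp source"))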

Importance of parallelizing :
With the rapid development of multi-core processors, parallelized programs can take advantage
of the extra cores to run much faster than serial programs.
Compilers created to convert serial programs to run in parallel are called parallelizing compilers.
Role of parallelizing compilers :
They attempt to relieve the programmer from dealing with the machine details.
They allow the programmer to concentrate on solving the problem at hand, while the
compiler concerns itself with the complexities of the machine.
Much more sophisticated analysis is required of the compiler to generate efficient machine
code for different types of machines.
How to Parallelize a Program :
Find the dependences in the program.
Try to avoid or eliminate the dependences.
Reduce overhead cost.
Compiler parallelization analysis :-
The compiler usually conducts two passes of analysis before actual parallelization in order to
determine the following:
Is it safe to parallelize the loop? Answering this question needs accurate dependence
analysis and alias analysis
Is it worthwhile to parallelize it? This answer requires a reliable estimation (modeling) of the
program workload and the capacity of the parallel system.
The first pass of the compiler performs a data dependence analysis of the loop to determine whether
each iteration of the loop can be executed independently of the others. Data dependence can
sometimes be dealt with, but it may incur additional overhead in the form of message passing,
synchronization of shared memory, or some other method of processor communication.
The second pass attempts to justify the parallelization effort by comparing the theoretical execution
time of the code after parallelization to the code's sequential execution time. Somewhat
counterintuitively, code does not always benefit from parallel execution. The extra overhead that can
be associated with using multiple processors can eat into the potential speedup of parallelized code.
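As a concrete illustration of the kind of test used in the first pass, the sketch below applies the classic GCD test to two affine references to the same array inside one loop; it is a conservative textbook test, not the full dependence and alias analysis a production compiler performs:

from math import gcd

def gcd_test(a1, c1, a2, c2):
    # Write:  A[a1*i + c1]    Read:  A[a2*i + c2]
    # The equation a1*i1 + c1 == a2*i2 + c2 can have an integer solution only if
    # gcd(a1, a2) divides (c2 - c1). True means a dependence MAY exist
    # (the test is conservative); False means it cannot.
    g = gcd(a1, a2)
    return (c2 - c1) % g == 0

# A[2*i] = A[2*i + 1] + ...  ->  gcd(2, 2) = 2 does not divide 1
print(gcd_test(2, 0, 2, 1))    # False: iterations are independent
# A[i] = A[i - 1] + ...      ->  gcd(1, 1) = 1 divides -1
print(gcd_test(1, 0, 1, -1))   # True: a loop-carried dependence may exist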



Compiler representation :


Parallel Compilation :
Consider loop partitioning
Create small local compilation
Consider static routing between tiles
Short neighbor-to-neighbor communication
Compiler orchestrated
Flow Compilation :


PARALLELISM DETECTION :

DATA DEPENDENCE :-
A data dependency in computer science is a situation in which a program statement (instruction)
refers to the data of a preceding statement. In compiler theory, the technique used to discover
data dependencies among statements (or instructions) is called dependence analysis.
There are three types of dependencies: data, name, and control.

Assuming statements S1 and S2, S2 depends on S1 if:
[I(S1) ∩ O(S2)] ∪ [O(S1) ∩ I(S2)] ∪ [O(S1) ∩ O(S2)] ≠ ∅
where:
I(Si) is the set of memory locations read by Si,
O(Sj) is the set of memory locations written by Sj,
and there is a feasible run-time execution path from S1 to S2.
This condition is called the Bernstein condition, named after A. J. Bernstein.
Three cases exist:
Flow (data) dependence: O(S1) ∩ I(S2) ≠ ∅, S1 → S2, and S1 writes something read by S2
Anti-dependence: I(S1) ∩ O(S2) ≠ ∅, S1 → S2, and S1 reads something before S2 overwrites it
Output dependence: O(S1) ∩ O(S2) ≠ ∅, S1 → S2, and both write the same memory location
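A small sketch of these three checks, assuming each statement is summarized only by its read set I and write set O (a toy helper, not tied to any particular compiler IR):

def classify_dependence(I1, O1, I2, O2):
    # Bernstein conditions for statements S1 (executed first) and S2.
    deps = []
    if O1 & I2:
        deps.append("flow (RAW)")     # S1 writes something S2 reads
    if I1 & O2:
        deps.append("anti (WAR)")     # S1 reads something S2 later overwrites
    if O1 & O2:
        deps.append("output (WAW)")   # both write the same location
    return deps                       # empty list => S1 and S2 may run in parallel

# S1: a = 3        S2: b = a
print(classify_dependence(I1=set(), O1={"a"}, I2={"a"}, O2={"b"}))   # ['flow (RAW)']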

Flow dependency :
A Flow dependency, also known as a data dependency or true dependency or read-after-write
(RAW), occurs when an instruction depends on the result of a previous instruction:
1. A = 3
2. B = A
3. C = B
Instruction 3 is truly dependent on instruction 2, as the final value of C depends on the instruction
updating B. Instruction 2 is truly dependent on instruction 1, as the final value of B depends on
the instruction updating A. Since instruction 3 is truly dependent upon instruction 2 and instruction
2 is truly dependent on instruction 1, instruction 3 is also truly dependent on instruction
1. Instruction level parallelism is therefore not an option in this example.

Anti-dependency :
An anti-dependency, also known as write-after-read (WAR), occurs when an instruction requires
a value that is later updated. In the following example, instruction 2 anti-depends on instruction 3:
the ordering of these instructions cannot be changed, nor can they be executed in parallel
(possibly changing the instruction ordering), as this would affect the final value of A.
1. B = 3
2. A = B + 1
3. B = 7
An anti-dependency is an example of a name dependency. That is, renaming of variables could
remove the dependency, as in the next example:
1. B = 3
N. B2 = B
2. A = B2 + 1
3. B = 7
A new variable, B2, has been declared as a copy of B in a new instruction, instruction N. The anti-
dependency between 2 and 3 has been removed, meaning that these instructions may now be
executed in parallel. However, the modification has introduced a new dependency: instruction 2
is now truly dependent on instruction N, which is truly dependent upon instruction 1. Being flow
dependencies, these new dependencies cannot be safely removed.

Output dependency :
An output dependency, also known as write-after-write (WAW), occurs when the ordering of
instructions will affect the final output value of a variable. In the example below, there is an output
dependency between instructions 3 and 1: changing the ordering of instructions in this example
will change the final value of B, so these instructions cannot be executed in parallel.
1. B = 3
2. A = B + 1
3. B = 7

As with anti-dependencies, output dependencies are name dependencies. That is, they may be
removed through renaming of variables, as in the below modification of the above example:
1. B2 = 3
2. A = B2 + 1
3. B = 7
A commonly used naming convention for data dependencies is the following: Read-after-Write or
RAW (flow dependency), Write-after-Write or WAW (output dependency), and Write-After-Read
or WAR (anti-dependency).

Control Dependency :
An instruction is control dependent on a preceding instruction if the outcome of the preceding
instruction determines whether it should be executed or not. In the following example, instruction
S2 is control dependent on instruction S1. However, S3 is not control dependent upon S1, because
S3 is always executed irrespective of the outcome of S1.
S1. if (a == b)
S2. a = a + b
S3. b = a + b



Intuitively, there is control dependence between two statements S1 and S2 if
S1 could be possibly executed before S2
The outcome of S1 execution will determine whether S2 will be executed.
A typical example is that there is control dependence between if statement's condition part and
the statements in the corresponding true/false bodies.
A formal definition of control dependence can be presented as follows:
A statement S2 is said to be control dependent on another statement S1 iff
there exists a path P from S1 to S2 such that every statement Si ≠ S1 within P will be followed
by S2 in each possible path to the end of the program, and
S1 will not necessarily be followed by S2, i.e. there is an execution path from S1 to the end
of the program that does not go through S2.
Expressed with the help of (post-)dominance the two conditions are equivalent to
S2 post-dominates all Si
S2 does not post-dominate S1
Example 2: control independent:
for (i=0;i<n;i++)
{
a[i] = c[i];
if (a[i] < 0) a[i] = 1;
}
control dependent:
for (i=1;i<n;i++)
{
if (a[i-1] < 0) a[i] = 1;
}

Implications :
Conventional programs are written assuming the sequential execution model. Under this model,
instructions execute one after the other, atomically (i.e., at any given point of time only one
instruction is executed) and in the order specified by the program.

However, dependencies among statements or instructions may hinder parallelism, i.e. the parallel
execution of multiple instructions, either by a parallelizing compiler or by a processor
exploiting instruction-level parallelism. Recklessly executing multiple instructions without
considering related dependences may produce wrong results; such situations are called hazards.

DIRECTION VECTOR :
Dependence Vectors -
To classify data dependences, compilers use two important vectors: the distance vector (d),
which indicates the distance between the source iteration fn and the target iteration hn, and the
direction vector (D), which indicates the corresponding direction, basically the sign of the distance.
The distance vector is defined as d = (d1, ..., dk), where dn = hn - fn.
The direction vector is defined as D = (D1, ..., Dk), where Dn is:
(<) if dn > 0, i.e. fn < hn
(=) if dn = 0, i.e. fn = hn
(>) if dn < 0, i.e. fn > hn
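A minimal sketch of how these vectors can be read off a single loop whose write is A[i + w] and whose read is A[i + r] (illustrative only; real compilers solve general systems of dependence equations):

def dependence_vectors(w, r):
    # Write: A[i + w]   Read: A[i + r]   inside  for i in range(n): ...
    # A dependence exists between iterations f and h when f + w == h + r,
    # i.e. the distance is d = h - f = w - r.
    d = w - r
    direction = "<" if d > 0 else ("=" if d == 0 else ">")
    return d, direction

# A[i] = A[i-1] + 1  ->  write A[i+0], read A[i-1]
print(dependence_vectors(0, -1))   # (1, '<') : carried by the loop, distance 1
# A[i] = A[i] * 2    ->  write A[i+0], read A[i+0]
print(dependence_vectors(0, 0))    # (0, '=') : loop-independent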








LOOP CARRIED DEPENDENCE :
A dependence is said to be carried by a loop if the corresponding direction vector element for that
loop has a directional component other than '=' (that is, <, >, or *). Loop-carried dependence is an
important concept for discovering when the iterations of a loop can be executed concurrently. If
there are no loop-carried dependences, all iterations of that loop can be executed in parallel without
synchronization.
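Two tiny loops, written in Python purely for illustration, make the distinction concrete: the first carries a flow dependence between consecutive iterations and must run serially, while the second has no loop-carried dependence and its iterations could run in parallel.

n = 8
a, b, c = [0] * n, list(range(n)), list(range(n))

# Loop-carried flow dependence: iteration i reads a[i-1], written by iteration i-1,
# so the direction vector element for this loop is '<' and the loop is serial.
for i in range(1, n):
    a[i] = a[i - 1] + b[i]

# No loop-carried dependence: each iteration touches only its own elements,
# so all iterations can execute concurrently without synchronization.
for i in range(n):
    a[i] = b[i] * c[i]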

LOOP INDEPENDENT DEPENDENCES :


Loop independence is important since loops that are independent can be parallelized.
Independent loops can be parallelized in a number of ways, from the coarse-grained
parallelism of OpenMP to the fine-grained instruction-level parallelism (ILP) of vectorization
and software pipelining.
Loops are considered independent when the computation of iteration Y of a loop can be done
independently of the computation of iteration X. In other words, if iteration 2 of a loop can be
computed simultaneously with iteration 1, without using any result from iteration 1, then the
loop iterations are independent.
Occasionally, you can determine whether a loop is independent by comparing results from the
output of the loop with results from the same loop written with a decrementing index counter.
For example, the first loop below might be independent if the second loop generates the same
result.


Example
subroutine loop_independent_A (a, b, MAX)
  implicit none
  integer :: MAX
  integer :: j, a(MAX), b(MAX)
  ! increasing index
  do j = 1, MAX
    a(j) = b(j)
  end do
end subroutine loop_independent_A

subroutine loop_independent_B (a, b, MAX)
  implicit none
  integer :: MAX
  integer :: j, a(MAX), b(MAX)
  ! the same loop with a decrementing index counter
  do j = MAX, 1, -1
    a(j) = b(j)
  end do
end subroutine loop_independent_B




COMPILATION FOR DISTRIBUTED MACHINES- DATA PARTITIONING :


INSTRUCTION SCHEDULING :
In computer science, instruction scheduling is a compiler optimization used to improve
instruction-level parallelism, which improves performance on machines with instruction pipelines.
Put more simply, without changing the meaning of the code, it tries to
Avoid pipeline stalls by rearranging the order of instructions.
Avoid illegal or semantically ambiguous operations (typically involving subtle instruction
pipeline timing issues or non-interlocked resources).
The pipeline stalls can be caused by structural hazards (processor resource limit), data hazards
(output of one instruction needed by another instruction) and control hazards (branching).
Instruction scheduling is typically done on a single basic block. In order to determine whether
rearranging the block's instructions in a certain way preserves the behavior of that block, we need
the concept of a data dependency. There are three types of dependencies, which also happen to
be the three data hazards:
1. Read after Write (RAW or "True"): Instruction 1 writes a value used later by Instruction 2.
Instruction 1 must come first, or Instruction 2 will read the old value instead of the new.
2. Write after Read (WAR or "Anti"): Instruction 1 reads a location that is later overwritten by
Instruction 2. Instruction 1 must come first, or it will read the new value instead of the old.
3. Write after Write (WAW or "Output"): Two instructions both write the same location. They
must occur in their original order.
Technically, there is a fourth type, Read after Read (RAR or "Input"): Both instructions read the
same location. Input dependence does not constrain the execution order of two statements, but
it is useful in scalar replacement of array elements.
To make sure we respect the three types of dependencies, we construct a dependency graph,
which is a directed graph where each vertex is an instruction and there is an edge from I1 to I2 if
I1 must come before I2 due to a dependency. If loop-carried dependencies are left out, the
dependency graph is a directed acyclic graph. Then, any topological sort of this graph is a valid
instruction schedule. The edges of the graph are usually labelled with the latency of the
dependence. This is the number of clock cycles that needs to elapse before the pipeline can
proceed with the target instruction without stalling.
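A minimal sketch of building such a dependency graph for a basic block of three-address instructions, applying the RAW/WAR/WAW rules above (the instruction encoding is an illustrative assumption):

from collections import defaultdict

def build_dependency_graph(block):
    # block: list of (dest, sources) tuples, in program order.
    edges = defaultdict(set)      # i -> set of j meaning "i must come before j"
    for j, (dest_j, srcs_j) in enumerate(block):
        for i in range(j):
            dest_i, srcs_i = block[i]
            raw = dest_i in srcs_j            # j reads what i wrote
            war = dest_j in srcs_i            # j overwrites what i read
            waw = dest_i == dest_j            # both write the same location
            if raw or war or waw:
                edges[i].add(j)
    return edges

block = [("r1", ["a"]),          # r1 = load a
         ("r2", ["r1", "b"]),    # r2 = r1 + b      (RAW on r1)
         ("r1", ["c"])]          # r1 = load c      (WAR with 1, WAW with 0)
print(dict(build_dependency_graph(block)))   # {0: {1, 2}, 1: {2}}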


Types of Instruction Scheduling :
There are several types of instruction scheduling:
1. Local (Basic Block) Scheduling: instructions can't move across basic block boundaries.
2. Global scheduling: instructions can move across basic block boundaries.
3. Modulo Scheduling: another name for software pipelining, which is a form of instruction
scheduling that interleaves different iterations of a loop.

4. Trace scheduling: the first practical approach for global scheduling, trace scheduling
tries to optimize the control flow path that is executed most often.
5. Superblock scheduling: a simplified form of trace scheduling which does not attempt to
merge control flow paths at trace "side entrances". Instead, code can be implemented
by more than one schedule, vastly simplifying the code generator.


Algorithms :
The simplest algorithm to find a topological sort is frequently used and is known as list scheduling.
Conceptually, it repeatedly selects a source of the dependency graph, appends it to the current
instruction schedule and removes it from the graph. This may cause other vertices to be sources,
which will then also be considered for scheduling. The algorithm terminates if the graph is empty.
To arrive at a good schedule, stalls should be prevented. This is determined by the choice of the
next instruction to be scheduled. A number of heuristics are in common use:
The processor resources used by the already scheduled instructions are recorded. If a
candidate uses a resource that is occupied, its priority will drop.
If a candidate would be scheduled closer to its predecessors than the associated latency allows,
its priority will drop.

If a candidate lies on the critical path of the graph, its priority will rise. This heuristic provides
some form of look-ahead in an otherwise local decision process.
If choosing a candidate will create many new sources, its priority will rise. This heuristic tends
to generate more freedom for the scheduler.
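A compact sketch of latency-aware list scheduling that uses the critical-path heuristic as its priority, reusing the edge and latency representation from the sketch above; it models a single-issue pipeline and ignores other processor resources:

def list_schedule(n, edges, latency):
    # n: number of instructions; edges[i]: successors of i;
    # latency[(i, j)]: cycles that must elapse between issuing i and j.
    preds = {j: {i for i in range(n) if j in edges.get(i, set())} for j in range(n)}

    # Priority = longest latency path to any leaf (critical path). Works when the
    # instructions are numbered in program order, so edges go low -> high.
    prio = [0] * n
    for i in reversed(range(n)):
        for j in edges.get(i, set()):
            prio[i] = max(prio[i], latency[(i, j)] + prio[j])

    ready_time = [0] * n
    scheduled, cycle, order = set(), 0, []
    while len(order) < n:
        ready = [i for i in range(n) if i not in scheduled
                 and preds[i] <= scheduled and ready_time[i] <= cycle]
        if not ready:
            cycle += 1                          # stall: nothing can issue this cycle
            continue
        i = max(ready, key=lambda k: prio[k])   # pick the highest-priority ready op
        order.append(i)
        scheduled.add(i)
        for j in edges.get(i, set()):
            ready_time[j] = max(ready_time[j], cycle + latency[(i, j)])
        cycle += 1
    return order

edges = {0: {1, 2}, 1: {2}}
latency = {(0, 1): 2, (0, 2): 1, (1, 2): 1}
print(list_schedule(3, edges, latency))         # [0, 1, 2]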

The phase order of Instruction Scheduling :
Instruction scheduling may be done either before or after register allocation or both before and
after it. The advantage of doing it before register allocation is that this results in maximum
parallelism. The disadvantage of doing it before register allocation is that this can result in the
register allocator needing to use a number of registers exceeding those available. This will cause
spill/fill code to be introduced which will reduce the performance of the section of code in question.
If the architecture being scheduled has instruction sequences that have potentially illegal
combinations (due to a lack of instruction interlocks) the instructions must be scheduled after
register allocation. This second scheduling pass will also improve the placement of the spill/fill
code.
If scheduling is only done after register allocation then there will be false dependencies introduced
by the register allocation that will limit the amount of instruction motion possible by the scheduler.




REGISTER ALLOCATION :
In compiler optimization, register allocation is the process of assigning a large number of target
program variables onto a small number of CPU registers. Register allocation can happen over
a basic block (local register allocation), over a whole function/procedure (global register
allocation), or across function boundaries traversed via call-graph (interprocedural register
allocation). When done per function/procedure the calling convention may require insertion of
save/restore around each call-site.
In many programming languages, the programmer has the illusion of allocating arbitrarily many
variables. However, during compilation, the compiler must decide how to allocate these variables
to a small, finite set of registers. Not all variables are in use (or "live") at the same time, so some
registers may be assigned to more than one variable. However, two variables in use at the same
time cannot be assigned to the same register without corrupting its value. Variables which cannot
be assigned to some register must be kept in RAM and loaded in/out for every read/write, a
process called spilling. Accessing RAM is significantly slower than accessing registers and slows
down the execution speed of the compiled program, so an optimizing compiler aims to assign as
many variables to registers as possible. Register pressure is the term used when there are fewer
hardware registers available than would have been optimal; higher pressure usually means that
more spills and reloads are needed.
In addition, programs can be further optimized by assigning the same register to a source and
destination of a move instruction whenever possible. This is especially important if the compiler
is using other optimizations such as SSA analysis, which artificially generates
additional move instructions in the intermediate code.



Register allocation is the process of determining which values should be placed into which
registers and at what times during the execution of the program. Note that register allocation is
not concerned specifically with variables, only values: distinct uses of the same variable can be
assigned to different registers without affecting the logic of the program.
A number of techniques have been developed to perform register allocation at a variety of
different levels: local register allocation refers to allocation within a very small piece of code,
typically a basic block; global register allocation assigns registers within an entire function; and
interprocedural register allocation works across function calls and module boundaries. The first
two techniques are commonplace, with global register allocation implemented in virtually every
production compiler, while interprocedural register allocation is rarely performed by
today's mainstream compilers.



Figure 1 Register allocation by graph coloring.
Spilling :
In most register allocators, each variable is assigned to either a CPU register or to main memory.
The advantage of using a register is speed. Computers have a limited number of registers, so not
all variables can be assigned to registers. A "spilled variable" is a variable in main memory rather
than in a CPU register. The operation of moving a variable from a register to memory is
called spilling, while the reverse operation of moving a variable from memory to a register is
called filling. For example, a 32-bit variable spilled to memory gets 32 bits of stack space allocated
and all references to the variable are then to that memory. Such a variable has a much slower
processing speed than a variable in a register. When deciding which variables to spill, multiple
factors are considered: execution time, code space, data space.
Iterated Register Coalescing :
Register allocators have several types, with Iterated Register Coalescing (IRC) being a more
common one. IRC was invented by Lal George and Andrew Appel in 1996, building on earlier
work by Gregory Chaitin. IRC works based on a few principles. First, if there are any non-move
related vertices in the graph with degree less than K the graph can be simplified by removing
those vertices, since once those vertices are added back in it is guaranteed that a color can be
found for them (simplification). Second, two vertices sharing a preference edge whose adjacency
sets combined have a degree less than K can be combined into a single vertex, by the same
reasoning (coalescing). If neither of the two steps can simplify the graph, simplification can be run
again on move-related vertices (freezing). Finally, if nothing else works, vertices can be marked
for potential spilling and removed from the graph (spill). Since all of these steps reduce the
degrees of vertices in the graph, vertices may transform from being high-degree (degree > K) to
low-degree during the algorithm, enabling them to be simplified or coalesced. Thus, the stages of
the algorithm are iterated to ensure aggressive simplification and coalescing.
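Not the full IRC algorithm (no coalescing or freezing here), but a stripped-down sketch of the simplify/select graph-coloring core it builds on, run on a tiny hypothetical interference graph:

def color_interference_graph(adj, K):
    # adj: dict mapping each variable to the set of variables it interferes with;
    # K: number of available machine registers (colors).
    stack, spilled, work = [], [], {v: set(ns) for v, ns in adj.items()}

    def degree(v):
        return sum(1 for n in work[v] if n in work)

    # Simplify: repeatedly remove a node of degree < K; if none exists, pick a
    # spill candidate (highest degree here) and optimistically push it anyway.
    while work:
        low = [v for v in work if degree(v) < K]
        v = low[0] if low else max(work, key=degree)
        stack.append(v)
        del work[v]

    # Select: pop nodes and assign the lowest color not used by already-colored
    # neighbours; a node with no free color becomes an actual spill.
    colors = {}
    while stack:
        v = stack.pop()
        used = {colors[n] for n in adj[v] if n in colors}
        free = [c for c in range(K) if c not in used]
        if free:
            colors[v] = free[0]
        else:
            spilled.append(v)
    return colors, spilled

adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"c"}}
print(color_interference_graph(adj, K=2))   # a, b, c mutually interfere, so one spills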





MACHINE OPTIMIZATION :
In computing, an optimizing compiler is a compiler that tries to minimize or maximize some
attributes of an executable computer program. The most common requirement is to minimize the
time taken to execute a program; a less common one is to minimize the amount
of memory occupied. The growth of portable computers has created a market for minimizing
the power consumed by a program. Compiler optimization is generally implemented using a
sequence of optimizing transformations, algorithms which take a program and transform it to
produce a semantically equivalent output program that uses fewer resources.
It has been shown that some code optimization problems are NP-complete, or even undecidable.
In practice, factors such as the programmer's willingness to wait for the compiler to complete its
task place upper limits on the optimizations that a compiler implementor might provide.
(Optimization is generally a very CPU- and memory-intensive process.) In the past, computer
memory limitations were also a major factor in limiting which optimizations could be performed.
Because of all these factors, optimization rarely produces "optimal" output in any sense, and in
fact an "optimization" may impede performance in some cases; rather, optimizations are heuristic
methods for improving resource usage in typical programs.

Machine code optimization :

These analyze the executable task image of the program after all of the executable machine code
has been linked. Some of the techniques that can be applied in a more limited scope, such as
macro compression (which saves space by collapsing common sequences of instructions), are
more effective when the entire executable task image is available for analysis.

Machine independent vs machine dependent :

Many optimizations that operate on abstract programming concepts (loops, objects, structures)
are independent of the machine targeted by the compiler, but many of the most effective
optimizations are those that best exploit special features of the target platform. E.g.: Instructions
which do several things at once, such as decrement register and branch if not zero.
The following is an instance of a local machine dependent optimization. To set a register to 0, the
obvious way is to use the constant '0' in an instruction that sets a register value to a constant. A
less obvious way is to XOR a register with itself. It is up to the compiler to know which instruction
variant to use. On many RISC machines, both instructions would be equally appropriate, since
they would both be the same length and take the same time. On many other microprocessors such
as the Intel x86 family, it turns out that the XOR variant is shorter and probably faster, as there
will be no need to decode an immediate operand, nor use the internal "immediate operand
register". (A potential problem with this is that XOR may introduce a data dependency on the
previous value of the register, causing a pipeline stall. However, processors often have XOR of a
register with itself as a special case that doesn't cause stalls.)

Problems with optimization :
Early in the history of compilers, compiler optimizations were not as good as hand-written ones.
As compiler technologies have improved, good compilers can often generate better code than
human programmers, and good post pass optimizers can improve highly hand-optimized code
even further. For RISC CPU architectures, and even more so for VLIW hardware, compiler
optimization is the key for obtaining efficient code, because RISC instruction sets are so compact
that it is hard for a human to manually schedule or combine small instructions to get efficient
results. Indeed, these architectures were designed to rely on compiler writers for adequate
performance.
However, optimizing compilers are by no means perfect. There is no way that a compiler can
guarantee that, for all program source code, the fastest (or smallest) possible equivalent compiled
program is output; such a compiler is fundamentally impossible because it would solve the halting
problem.
This may be proven by considering a call to a function, foo(). This function returns nothing and
does not have side effects (no I/O, does not modify global variables and "live" data structures,
etc.). The fastest possible equivalent program would be simply to eliminate the function call.
However, if the function foo() in fact does not return, then the program with the call to foo() would
be different from the program without the call; the optimizing compiler will then have to determine
this by solving the halting problem.
Additionally, there are a number of other more practical issues with optimizing compiler
technology:
Optimizing compilers focus on relatively shallow constant-factor performance improvements
and do not typically improve the algorithmic complexity of a solution. For example, a compiler
will not change an implementation of bubble sort to use merge sort instead.
Compilers usually have to support a variety of conflicting objectives, such as cost of
implementation, compilation speed and quality of generated code.
A compiler typically only deals with a part of a program at a time, often the code contained
within a single file or module; the result is that it is unable to consider contextual information
that can only be obtained by processing the other files.
The overhead of compiler optimization: Any extra work takes time; whole-program
optimization is time consuming for large programs.
The often complex interaction between optimization phases makes it difficult to find an optimal
sequence in which to execute the different optimization phases.
Work to improve optimization technology continues. One approach is the use of so-called post-
pass optimizers (some commercial versions of which date back to mainframe software of the late
1970s). These tools take the executable output by an "optimizing" compiler and optimize it even
further. Post pass optimizers usually work on the assembly language or machine code level
(contrast with compilers that optimize intermediate representations of programs). The
performance of post-pass optimizers is limited by the fact that much of the information available
in the original source code is not always available to them.

DYNAMIC COMPILATION :
Dynamic compilation is a process used by some programming language implementations to
gain performance during program execution. Although the technique originated in the Self
programming language, the best-known language that uses this technique is Java. Since the
machine code emitted by a dynamic compiler is constructed and optimized at program runtime,
the use of dynamic compilation enables optimizations for efficiency not available to compiled
programs except through code duplication or meta programming.
Runtime environments using dynamic compilation typically have programs run slowly for the first
few minutes, and then after that, most of the compilation and recompilation is done and it runs
quickly. Due to this initial performance lag, dynamic compilation is undesirable in certain cases.
In most implementations of dynamic compilation, some optimizations that could be done at the
initial compile time are delayed until further compilation at run-time, causing further unnecessary
slowdowns. Just-in-time compilation is a form of dynamic compilation.
A closely related technique is incremental compilation. An incremental compiler is used in POP-
2, POP-11, Forth, some versions of Lisp, e.g. Maclisp and at least one version of the ML
programming language (Poplog ML). This requires the compiler for the programming language to
be part of the runtime system. In consequence, source code can be read in at any time, from the
terminal, from a file, or possibly from a data-structure constructed by the running program, and
translated into a machine code block or function (which may replace a previous function of the
same name), which is then immediately available for use by the program. Because of the need
for speed of compilation during interactive development and testing, the compiled code is likely
not to be as heavily optimised as code produced by a standard 'batch compiler', which reads in
source code and produces object files that can subsequently be linked and run. However an
incrementally compiled program will typically run much faster than an interpreted version of the
same program. Incremental compilation thus provides a mixture of the benefits of interpreted and
compiled languages. To aid portability it is generally desirable for the incremental compiler to
operate in two stages, namely first compiling to some intermediate platform-independent
language, and then compiling from that to machine code for the host machine. In this case porting
requires only changing the 'back end' compiler. Unlike dynamic compilation, as defined above,
incremental compilation does not involve further optimizations after the program is first run.
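As a loose illustration of the incremental model (Python compiles to bytecode rather than machine code, but the idea of translating source read at runtime into an immediately callable function is the same):

# The source text could equally come from a terminal, a file, or a data structure
# built by the running program, as described above.
source = """
def area(r):
    return 3.14159 * r * r
"""

namespace = {}
code_obj = compile(source, "<runtime>", "exec")   # translate the new source
exec(code_obj, namespace)                         # define it in the running program
print(namespace["area"](2.0))                     # immediately available for use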






CODE OPTIMIZATION :
In computing, an optimizing compiler is a compiler that tries to minimize or maximize some
attributes of an executable computer program. The most common requirement is to minimize the
time taken to execute a program; a less common one is to minimize the amount
of memory occupied. The growth of portable computers has created a market for minimizing
the power consumed by a program. Compiler optimization is generally implemented using a
sequence of optimizing transformations, algorithms which take a program and transform it to
produce a semantically equivalent output program that uses fewer resources.
It has been shown that some code optimization problems are NP-complete, or even undecidable.
In practice, factors such as the programmer's willingness to wait for the compiler to complete its
task place upper limits on the optimizations that a compiler implementer might provide.
(Optimization is generally a very CPU- and memory-intensive process.) In the past, computer
memory limitations were also a major factor in limiting which optimizations could be performed.
Because of all these factors, optimization rarely produces "optimal" output in any sense, and in
fact an "optimization" may impede performance in some cases; rather, optimizations are heuristic
methods for improving resource usage in typical programs.



DATA FLOW ANALYSIS :
Data-flow analysis is a technique for gathering information about the possible set of values
calculated at various points in a computer program. A program's control flow graph (CFG) is used
to determine those parts of a program to which a particular value assigned to a variable might
propagate. The information gathered is often used by compilers when optimizing a program. A
canonical example of a data-flow analysis is reaching definitions.
A simple way to perform data-flow analysis of programs is to set up data-flow equations for
each node of the control flow graph and solve them by repeatedly calculating the output from the
input locally at each node until the whole system stabilizes, i.e., it reaches a fixpoint. This general
approach was developed by Gary Kildall while teaching at the Naval Postgraduate School.

Data-flow analysis collects information about the flow of data along execution paths, taking into
account loops (where control flows back) and points where control splits and merges; the resulting
data-flow information is the input to many optimizations.
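A minimal sketch of that fixpoint iteration for reaching definitions, assuming each basic block has already been summarized by gen and kill sets; the block names and definition labels below are made up for illustration:

def reaching_definitions(blocks, preds, gen, kill):
    # blocks: list of block names; preds[b]: predecessors of b;
    # gen[b]/kill[b]: definitions generated / killed by block b.
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:                       # iterate until a fixpoint is reached
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b] = new_out
                changed = True
    return IN, OUT

# if b == 4 then a = 5 (d1) else a = 3 (d2) endif; then a is used afterwards
blocks = ["B1", "B2", "B3", "B4"]
preds = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3"]}
gen = {"B1": set(), "B2": {"d1"}, "B3": {"d2"}, "B4": set()}
kill = {"B1": set(), "B2": {"d2"}, "B3": {"d1"}, "B4": set()}
print(reaching_definitions(blocks, preds, gen, kill)[0]["B4"])   # {'d1', 'd2'}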


Four Classical Data-flow Problems :
Reaching definitions (Reach)
Live uses of variables (Live)
Available expressions (Avail)
Very busy expressions (VeryB)
Def-use chains, built from Reach, and the dual use-def chains, built from Live, play a role
in many optimizations
Avail enables global common subexpression elimination
VeryB is used for conservative code motion
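For reference, the standard equations for the first two problems, using the usual gen/kill and use/def block summaries (Avail and VeryB are the analogous forward and backward "must" problems over expressions):

Reach (forward, may):
in[B]  = ∪ out[P] over all predecessors P of B
out[B] = gen[B] ∪ (in[B] - kill[B])

Live (backward, may):
out[B] = ∪ in[S] over all successors S of B
in[B]  = use[B] ∪ (out[B] - def[B])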

The following are examples of properties of computer programs that can be calculated by data-
flow analysis. Note that the properties calculated by data-flow analysis are typically only
approximations of the real properties. This is because data-flow analysis operates on the
syntactical structure of the CFG without simulating the exact control flow of the program. However,
to still be useful in practice, a data-flow analysis algorithm is typically designed to calculate an
upper or a lower approximation of the real program properties.
Forward Analysis
The reaching definition analysis calculates for each program point the set of definitions that may
potentially reach this program point.
1. if b == 4 then
2.     a = 5;
3. else
4.     a = 3;
5. endif
6.
7. if a < 4 then
       ...
The reaching definition of variable "a" at line 7 is the set of
assignments a=5 at line 2 and a=3 at line 4.
Backward Analysis
The live variable analysis calculates for each program point the variables that may be potentially
read afterwards before their next write update. The result is typically used by dead code
elimination to remove statements that assign to a variable whose value is not used afterwards.
The out-state of a block is the set of variables that are live at the end of the block. Its in-state is
the set of variables that are live at the start of it. The out-state is the union of the in-states of the
block's successors. The transfer function of a statement is applied by making the variables that
are written dead, then making the variables that are read live.
Initial Code:
b1: a = 3;
b = 5;
d = 4;
if a > b then
b2: c = a + b;
d = 2;
b3: endif
c = 4;
return b * d + c;
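A small sketch applying those backward equations to the three blocks above; the use/def sets are read off the statements (use = variables read before being written in the block), and the successor structure b1 -> {b2, b3}, b2 -> {b3} is an assumption based on the if/endif shape:

def live_variables(blocks, succs, use, define):
    IN = {b: set() for b in blocks}
    OUT = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in reversed(blocks):                 # backward problem
            OUT[b] = set().union(*(IN[s] for s in succs[b]))
            new_in = use[b] | (OUT[b] - define[b])
            if new_in != IN[b]:
                IN[b] = new_in
                changed = True
    return IN, OUT

blocks = ["b1", "b2", "b3"]
succs = {"b1": ["b2", "b3"], "b2": ["b3"], "b3": []}
use = {"b1": set(), "b2": {"a", "b"}, "b3": {"b", "d"}}            # upward-exposed reads
define = {"b1": {"a", "b", "d"}, "b2": {"c", "d"}, "b3": {"c"}}    # variables written
print(live_variables(blocks, succs, use, define)[1])
# out(b1) = {a, b, d}, out(b2) = {b, d}, out(b3) = empty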

BI-DIRECTIONAL DATA FLOW :



PROGRAM REPRESENTATION FOR OPTIMISATION-SSA FORM :
In compiler design, static single assignment form (often abbreviated as SSA form or
simply SSA) is a property of an intermediate representation (IR), which requires that each variable
is assigned exactly once, and every variable is defined before it is used. Existing variables in the
original IR are split into versions, new variables typically indicated by the original name with a
subscript in textbooks, so that every definition gets its own version. In SSA form, use-def
chains are explicit and each contains a single element.
SSA was developed by Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F.
Kenneth Zadeck, researchers at IBM in the 1980s.
In functional language compilers, such as those for Scheme, ML and Haskell, continuation-
passing style (CPS) is generally used while one might expect to find SSA in a compiler
for Fortran or C. SSA is formally equivalent to a well-behaved subset of CPS (excluding non-local
control flow, which does not occur when CPS is used as intermediate representation), so
optimizations and transformations formulated in terms of one immediately apply to the other.

Benefits :
The primary usefulness of SSA comes from how it simultaneously simplifies and improves the
results of a variety of compiler optimizations, by simplifying the properties of variables. For
example, consider this piece of code:
y := 1
y := 2
x := y
Humans can see that the first assignment is not necessary, and that the value of y being used in
the third line comes from the second assignment of y. A program would have to perform reaching
definition analysis to determine this. But if the program is in SSA form, both of these are
immediate:
y1 := 1
y2 := 2
x1 := y2
Compiler optimization algorithms which are either enabled or strongly enhanced by the use of
SSA include:
constant propagation
value range propagation
sparse conditional constant propagation
dead code elimination
global value numbering
partial redundancy elimination
strength reduction
register allocation

Converting to SSA :
Converting ordinary code into SSA form is primarily a simple matter of replacing the target of each
assignment with a new variable, and replacing each use of a variable with the "version" of the
variable reaching that point. For example, consider the following control flow graph:

Notice that we could change the name on the left side of "x := x - 3", and change the following
uses of x to use that new name, and the program would still do the same thing. We exploit this in
SSA by creating two new variables, x1 and x2, each of which is assigned only once. We likewise
give distinguishing subscripts to all the other variables, and we get this:


We've figured out which definition each use is referring to, except for one thing: the uses of y in
the bottom block could be referring to either y1 or y2, depending on which way the control flow
came from. So how do we know which one to use?
The answer is that we add a special statement, called a Φ (Phi) function, to the beginning of the
last block. This statement will generate a new definition of y, y3, by "choosing" either y1 or y2,
depending on which arrow control arrived from:


Now, the uses of y in the last block can simply use y3, and they'll obtain the correct value either
way. You might ask at this point, do we need to add a Φ function for x too? The answer is no; only
one version of x, namely x2, is reaching this place, so there's no problem.
A more general question along the same lines is, given an arbitrary control flow graph, how can I
tell where to insert Φ functions, and for what variables? This is a difficult question, but one that
has an efficient solution that can be computed using a concept called dominance frontiers.
Note: the Φ functions are not actually implemented; instead, they're just markers for the compiler
to place the value of all the variables, which are grouped together by the Φ function, in the same
location in memory (or same register).
According to Kenny Zadeck, Φ functions were originally known as "phoney" functions while SSA
was being developed at IBM Research in the 1980s. The formal name "Φ function" was only
adopted when the work was first published in an academic paper.
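A condensed sketch of the standard Φ-placement step driven by dominance frontiers; the block names, definition sites and DF sets below are hypothetical, chosen so that y needs a Φ at the join block and x does not, and the renaming pass that introduces the subscripts is omitted:

def place_phi_functions(defsites, df):
    # defsites[v]: blocks that assign to variable v;  df[b]: dominance frontier of b.
    phis = {}                                  # (block, var) pairs that need a Φ
    for v, sites in defsites.items():
        worklist = list(sites)
        while worklist:
            b = worklist.pop()
            for d in df[b]:
                if (d, v) not in phis:
                    phis[(d, v)] = True        # insert "v := Φ(...)" at entry of d
                    if d not in sites:         # the Φ is itself a new definition of v
                        worklist.append(d)
    return sorted(phis)

# entry -> then, else;  then, else -> join
df = {"entry": set(), "then": {"join"}, "else": {"join"}, "join": set()}
defsites = {"y": {"then", "else"}, "x": {"entry"}}
print(place_phi_functions(defsites, df))   # [('join', 'y')] : only y needs a Φ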

EFFICIENT CODE GENERATION FOR EXPRESSIONS :
In computing, code generation is the process by which a compiler's code generator converts
some intermediate representation of source code into a form (e.g., machine code) that can be
readily executed by a machine.
Sophisticated compilers typically perform multiple passes over various intermediate forms. This
multi-stage process is used because many algorithms for code optimization are easier to apply
one at a time, or because the input to one optimization relies on the completed processing
performed by another optimization. This organization also facilitates the creation of a single
compiler that can target multiple architectures, as only the last of the code generation stages
(the backend) needs to change from target to target. (For more information on compiler design,
see Compiler.)
The input to the code generator typically consists of a parse tree or an abstract syntax tree. The
tree is converted into a linear sequence of instructions, usually in an intermediate language such
as three address code. Further stages of compilation may or may not be referred to as "code
generation", depending on whether they involve a significant change in the representation of the
program. (For example, a peephole optimization pass would not likely be called "code
generation", although a code generator might incorporate a peephole optimization pass.)


In addition to the basic conversion from an intermediate representation into a linear sequence of
machine instructions, a typical code generator tries to optimize the generated code in some way.
Tasks which are typically part of a sophisticated compiler's "code generation" phase include:
Instruction selection: which instructions to use.
Instruction scheduling: in which order to put those instructions. Scheduling is a speed
optimization that can have a critical effect on pipelined machines.
Register allocation: the allocation of variables to processor registers
Debug data generation if required so the code can be debugged.
Instruction selection is typically carried out by doing a recursive post order traversal on the
abstract syntax tree, matching particular tree configurations against templates; for example, the
tree W := ADD(X,MUL(Y,Z)) might be transformed into a linear sequence of instructions by
recursively generating the sequences for t1 := X and t2 := MUL(Y,Z), and then emitting the
instruction ADD W, t1, t2.
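A toy sketch of that recursive scheme for expression trees, emitting three-address instructions into fresh temporaries (the node encoding and mnemonics are illustrative, not a real back end):

temp_count = 0

def new_temp():
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(node, out):
    # node: either a variable name (leaf) or a tuple (op, left, right).
    if isinstance(node, str):
        return node                      # leaves are already addressable
    op, left, right = node
    t1 = gen(left, out)                  # post-order: children first,
    t2 = gen(right, out)                 # then the operator itself
    t = new_temp()
    out.append(f"{op} {t}, {t1}, {t2}")
    return t

# W := ADD(X, MUL(Y, Z)) from the text above
code = []
result = gen(("ADD", "X", ("MUL", "Y", "Z")), code)
code.append(f"MOV W, {result}")
print("\n".join(code))
# MUL t1, Y, Z
# ADD t2, X, t1
# MOV W, t2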
In a compiler that uses an intermediate language, there may be two instruction selection stages:
one to convert the parse tree into intermediate code, and a second phase much later to convert
the intermediate code into instructions from the instruction set of the target machine. This second
phase does not require a tree traversal; it can be done linearly, and typically involves a simple
replacement of intermediate-language operations with their corresponding opcodes. However, if
the compiler is actually a language translator (for example, one that converts Eiffel to C), then the
second code-generation phase may involve building a tree from the linear intermediate code.





CODE GENERATION FOR PIPELINING MACHINE :
Pipelining is a pervasive hardware implementation strategy used to increase the execution
speeds of computers. Its success is evident, since it is used on every class of computers, from
microprocessors to supercomputers. A pipelined computer can be likened to an assembly line.
By partitioning the execution of an instruction into M successive phases, the implementation
hardware can be divided into M different stages, each stage executing a different instruction
phase. A pipelined design gives a machine the potential for an M-fold execution speedup.
Although only 1 instruction is started (and only 1 finished) each clock cycle, all stages of the
pipeline can be concurrently executing phases of different instructions. When scheduling an
instruction for execution, it is important that the resources used by the instruction do not conflict
with any of the resources of the M - 1 previously scheduled instructions, since these resources
are either locked by the hardware or in an indeterminate state. It is necessary to avoid these
conflicts, either by building special purpose hardware into the machine (and performing dynamic
checks) or by statically scheduling instructions to avoid conflicts. Some very sophisticated
pipelines are designed to perform dynamic scheduling. While execution time instruction
scheduling and conflict detection provides for a very accurate determination of resource clashes,
the hardware to do this is costly and complex. For programs running on scalar, sequential
computers, instruction scheduling can be done once, when the user's program is compiled,
thereby eliminating the need for any runtime checks. Static scheduling of instructions is difficult.
We show that even for programs which are comprised of a single arithmetic expression having
no shared operands (which could be represented by a tree like graph), finding an optimal schedule
for execution on a sequential machine whose architecture is pipelined can be considered
computationally intractable and probably requires an exponential time algorithm to compute. For
"realistic" programs, however, we show that the problem is similar to scheduling tasks on a multi-
processor system under constraints which enable an optimal schedule to be quickly computed.
We show how to incorporate instruction scheduling (and the required register allocation) into the
code generation phase of program compilation. Our implementation demonstrates that
scheduling instructions to increase pipeline throughput can be routinely incorporated into
optimizing compilers.
