Compiler Optimisations

© 2009 stratusdesign@gmail.com
stratusdesign.blogspot.com
Author details
· No connection to Sun Microsystems/Oracle.
· Worked on low level mainframe and embedded class kernel, and distributed
microkernel, for international hardware/microprocessor companies, as well
as having extensive experience developing a proprietary ANSI C compiler
toolchain for Intel, MIPS and Alpha processors. Also experience with
parallel computers and DFT for signal processing using CUDA/GPU
· Now looking for new contracts, projects (primarily software) in GB, Europe
or US in the Kernel, Compiler, Signal Processing, Parallel area (as of Jun 2011)
· Contact dmctek@gmail.com
Overview
- Introduction
- Legacy optimisation
- Vector SIMD optimisation
- DSP optimisation
- RISC/Superscalar optimisation
- SSA optimisation
- Multicore optimisation
Introduction
- This is intended as an overview of the optimisation process only, as optimisations can be
done in different ways, often with subtle machine specific variations. Broadly speaking,
there are four main classes of optimisation available to the implementor:
- Classic legacy optimisations - these are well understood and the majority are technically
straightforward to implement. They offer a gain of around 10-25% in performance
- Classic Vector optimisations - once the preserve of leviathan mainframe CPUs with brand
new shiny Vector Units attached, but now very commonly found in DSP related
technologies. Technically these optimisations are more difficult than the former but still
not complicated. For the right class of narrow numerical applications, fully and properly
optimised, they can yield gains of 500%-2400% in performance
- RISC based optimisations. Despite their potential speed, scheduling code close to the
theoretical maximum on a RISC has been, and continues to be, problematic. For example, the
Alpha's then-new GEM compilers, when profiled on the machine, only achieved speeds
approaching what the Alpha was capable of about 30% of the time. That meant the
raw power of the Alpha was wasted 70% of the time - in other words all those extra MHz
were just used to heat up your datacentre/office. Performance enhancements are of the
order of at least 150%
- Parallel or Hybrid optimisations. Optimisation in these cases is dominated by the
underlying memory architecture, eg. UMA, NUMA, MIMD or MIMD/SIMD hybrid, so as with
RISC, memory bandwidth is an issue. The other factors are interprocessor utilisation,
interprocessor communication, interprocessor security, interprocessor management and
identifying coroutines to schedule on the parallel system. Another issue is that most
commercial computer languages to date have typically not been very good at allowing
the programmer to express parallelism; this means that the compiler has to infer
parallelism from what is essentially a missing attribute, and this is most difficult to
accomplish with any degree of success. Currently most languages rely on rather
unsophisticated library or system routines (see the sketch below).
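As an illustration, a minimal sketch of how parallelism is typically expressed through a
pragma/library layer (OpenMP here) rather than the language proper; the function and array
names are made up and it assumes compilation with -fopenmp:

    void scale( double *a, const double *b, int n )
    {
        /* without the pragma the compiler must prove the iterations are
           independent; with it the programmer asserts the parallelism */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = 2.0 * b[i];
    }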
Classic legacy optimisation
- Copy propagation
Before
    x = y;
    z = 1 + x;
After
    x = y;
    z = 1 + y;
Before optimisation a data dependency is created where z has to wait for
the value of x to be written.
- Constant propagation
Before
    x = 42;
    z = 1 + x;
After
    x = 42;
    z = 1 + 42;
Classic legacy optimisation
- Constant folding
Before
    x = 512 * 4;
After
    x = 2048;
Can be applied to Constant arguments, Statics and Locals.
- Dead code removal
· Temporary code created by the compiler, e.g. when
doing constant propagation
· Dead variable removal
· Elimination of unreachable code, e.g. in C switch
statements (example below)
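A minimal before/after sketch (illustrative only; unreachable() is just a placeholder name):

Before
    x = y;
    x = z;              /* first assignment to x is dead - never read   */
    if (0)
        unreachable();  /* condition is compile-time false - unreachable */
After
    x = z;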
Classic legacy optimisation
- Algebraic
- Strength Reduction
Before
    x = 10 * ( x + 5 ) / 10;
After
    x += 5;
Before
    (1) x = y ** 2;
    (2) x = y * 2;
After
    x = y * y;
    x = y + y;
Classic legacy optimisation
- Variable renaming
- Common subexpression
elimination
Before
    x = y * z;
    x = u * v;
After
    x = y * z;
    x0 = u * v;
Before
    x = u * ( y + z );
    w = ( y + z ) / 2;
After
    x0 = y + z;
    x = u * x0;
    w = x0 / 2;
Classic legacy optimisation
- Loop invariant code motion
Before
    for (i=0; i<10; i++)
        x[i] += v[i] + a + b;
After
    x0 = a + b;
    for (i=0; i<10; i++)
        x[i] += v[i] + x0;
- Loop induction variable
simplification
Before
    for (i=0; i<10; i++)
        x = i * 2 + v;
After
    x = v - 2;
    for (i=0; i<10; i++)
        x += 2;
Classic legacy optimisation
- Loop unrolling
Before
    for (i=0; i<n; i++)
        x[i] += x[i-1] * x[i+1];
After (unroll by factor of 2; a cleanup iteration is needed when n is odd)
    for (i=0; i<n-1; i+=2)
    {
        x[i]   += x[i-1] * x[i+1];
        x[i+1] += x[i]   * x[i+2];
    }
- Tail recursion elimination
recurs( x, y )
{
    if( !x ) return;
    recurs( x - y, y );
}
All computation is done by the time the recursive call is
made. By simply jumping to the top of the function,
excessive stack frame creation is avoided. May not be
possible in some languages; for example C++ usually
arranges to call destructors at function exit.
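The transformed form would look something like this (a sketch, shown at source level for
clarity - the compiler performs the rewrite on its IR):

After
    recurs( x, y )
    {
    top:
        if( !x ) return;
        x = x - y;
        goto top;        /* reuse the current stack frame instead of calling */
    }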
GVN
- Global value numbering
· Similar to CSE but can target cases that aren't considered by CSE (see below)
· Idea is an extension of Local Value Numbering (within a Basic Block)
a = b + c
d = b
e = d + c
b = V1, c = V2, so a = V1+V2
d = V1
e = V1+V2
Therefore a & e are equivalent
Local value numbering
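A minimal sketch of local value numbering over one basic block, assuming a simple
three-address IR whose operands already carry value numbers (all names are illustrative):

    #include <stdio.h>
    #include <string.h>

    #define MAXVN 128

    /* one entry per distinct "op vn vn" expression seen in the block */
    static char table[MAXVN][32];
    static int  nvn;

    /* return the value number for "op vn1 vn2", reusing an existing
       number when the same expression has already been seen          */
    static int value_number( char op, int vn1, int vn2 )
    {
        char key[32];
        snprintf( key, sizeof key, "%c %d %d", op, vn1, vn2 );
        for (int i = 0; i < nvn; i++)
            if (strcmp( table[i], key ) == 0)
                return i;                 /* redundant computation */
        strcpy( table[nvn], key );
        return nvn++;                     /* fresh value number    */
    }

In the example above b gets V1 and c gets V2; both a and e look up "+ V1 V2" and so receive
the same value number, which is how their equivalence is detected.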
Global value numbering - has to consider the effects of control flow across BBs
    x1 = a1        x2 = b1
        x3 = phi( x1, x2 )
a1 = V1
b1 = V2
x1 = V3 ≡ V1
x2 = V4 ≡ V2
x3 = V5 = phi( V1, V2 ) ≡ V6
NB. Later RHS evaluations ripple through the previously numbered nodes
PRE
- Partial redundancy elimination includes analysis for
· Loop invariant code motion - see previous
· Full redundancy elimination - see previous for CSE
· Partial redundancy elimination - see below - evaluation of x+y is predicated on some condition,
creating a Partial Redundancy
- Some PRE variants are applied to SSA values, not just the expressions, effectively combining
PRE and GVN
CFG (original - x+y is evaluated on only one path into the join, a partial redundancy):
    cond-eval
        a = x + y
    ...
    b = x + y

Elimination of Partial Redundancy (insert the evaluation on the path that lacked it):
    cond-eval
        T = x + y
        a = T
    T = x + y
    ...
    b = T

Elimination of Full Redundancy (ref CSE - x+y is now available on every path):
    T = x + y
    cond-eval
        a = T
    ...
    b = T
Classic legacy optimisation
- Leaf procedure optimisation
- Procedure inlining
A routine which does not call any other routines or require any local
storage can be invoked with a simple JSR/RET.
This technique avoids the overhead of a call/ret by duplicating the
code wherever it is needed. It is best used for small, frequently called
routines (example below).
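An illustrative before/after for inlining (the names are made up):

Before
    static int sq( int v ) { return v * v; }
    ...
    y = sq( x ) + sq( z );

After (call/return overhead removed; later optimisations can now see the bodies)
    y = ( x * x ) + ( z * z );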
Vector SIMD
- These optimisations increase performance by using
deep vector unit pipelines, data locality and data
isolation found when manipulating arrays to
parallelise the computation. They also reduce
conditional branching over potentially large datasets.
- Nowadays SIMD instructions appear most frequently
in DSPs for computing FIR/IIR filters or doing FFTs.
- Most modern microprocessors also have vector
support in their SIMD extensions, eg. SSE and Altivec,
which have traditionally offered cut-down
functionality in their vector units, but future trends
are towards fuller implementations (an intrinsics sketch follows below).
- Some studies have shown that when code can be
vectorised it can improve performance in some
cases by around 500+%.
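As a present-day illustration, a loop like the a[i] = b[i] + 50 example on the next slide can
be written directly with SSE2 intrinsics (a sketch; it assumes n is a multiple of 4 and that
unaligned loads/stores are acceptable):

    #include <emmintrin.h>

    void add50( int *a, const int *b, int n )
    {
        __m128i k = _mm_set1_epi32( 50 );      /* broadcast the constant */
        for (int i = 0; i < n; i += 4)
        {
            __m128i v = _mm_loadu_si128( (const __m128i *)&b[i] );
            _mm_storeu_si128( (__m128i *)&a[i], _mm_add_epi32( v, k ) );
        }
    }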
Vector SIMD
Before
    for( i=0; i<64; i++ )
        a[i] = b[i] + 50;
Before - CISC (scalar) case
    movl  #1, r0
    moval a, r1
    moval b, r2
L$1:
    addl3 #50, (r2)+, (r1)+
    aobleq #64, r0, L$1
After - Classic VP (long vector)
    mtvlr  #64
    vldl   b, v1
    vsaddl #50, v1, v0
    vstl   v0, a
After - Altivec et al., limited to 4x32b parallelism
    vspltisw v0, 50     ; splat the constant (NB. the real vspltisw immediate
                        ; range is small, so 50 would be materialised differently)
    ; r1 = address of a, r2 = address of b
    lvx     v2, 0, r2
    vaddsws v1, v2, v0
    stvx    v1, 0, r1
    ; 4 words added in parallel
    addi    r1, r1, 16
    addi    r2, r2, 16
    lvx     v2, 0, r2
    vaddsws v1, v2, v0
    stvx    v1, 0, r1
    ; 8 words added in parallel
    ; keep going...
NB. Also optimised away another branch
Scalar/Superscalar RISC
- Load delay slot
· The result of a load cannot be used in the following
instruction without stalling the pipeline until the load
completes. Instead of having the machine
stall in this way, some useful code is found that can be
placed between the load of r2 and the add that uses it. If useful
code cannot be found, a nop is inserted instead.
u = v + w;
z = x + y;
before
    ld r1, v
    ld r2, w
    add r3, r1, r2   ; stalls: r2 not yet available
    sw u, r3
    ld r4, x
    etc..
u = v + w;
z = x + y;
after
    ld r1, v
    ld r2, w
    ld r4, x         ; fills the load delay slot
    add r3, r1, r2
    sw u, r3
    etc..
DSP optimisation
- DSPs have some unique hardware design features
which require additional compiler support
· tbd
Scalar/Superscalar RISC
- Branch delay slot
· The result of a branch cannot be resolved without stalling the
pipeline. Instead of having the machine stall in this way, some useful
code is found that can be placed immediately after the branch.
Several strategies can be used: find a useful candidate
instruction before the branch, take one from the branch target and
advance the branch target address by one instruction, or take a
candidate from after the branch. If a candidate cannot be found, a
nop is inserted instead.
z = x + y;
if( x == 0 )
    goto L1;
before
    ld r1, x
    ld r2, y
    add r3, r1, r2
    cmp r1, 0
    beq L1           ; branch resolves late - pipeline stalls
    ...
L1:
    sll r3, 4
after
    ld r1, x
    ld r2, y
    cmp r1, 0
    beq L1
    add r3, r1, r2   ; moved into the branch delay slot
    ...
L1:
    sll r3, 4
Scalar/Superscalar RISC
- Branch reduction
· Loop unrolling is one way to reduce branching; other methods exist
Ex. bitfield setting and rotation
if( x == 0 )
    y++;
...
before
L1:
    ...
    lw r2, x
    cmpwi r2, 0
    bne L2
    addi r3, r3, 1            ; y++
L2:
    ...
after (branch eliminated)
    lw r2, x
    cntlzw r2, r2             ; 32 if x == 0, otherwise < 32
    rlwinm r2, r2, 27, 31, 31 ; r2 = (x == 0) ? 1 : 0
    add r3, r3, r2            ; y += (x == 0)
    ...
Scalar/Superscalar RISC
- Conditional Move
· Another branch reduction technique
if( x == 0 )
    y = 1;
else
    y = 20;
before
    ldq r1, x
    bne r1, L1          ; x != 0: take the else arm
    mov r3, 1           ; y = 1
    br  L2
L1:
    mov r3, 20          ; y = 20
L2:
    ...
after
    ldq r1, x
    mov r2, 1
    mov r3, 20
    cmoveq r1, r2, r3   ; r3 = r2 when x == 0 - no branch needed
Superscalar Scheduling
- This is usually achieved by creating another IR or
extending an existing IR to associate machine instructions
with RISC functional units and in this way a determination
can be made as to current FU utilisation and how best to
reorder code for superscalar multi-issue.
- These IRs are highly guarded and highly proprietary
technologies.
- This is one reason why, for example, the IBM POWER compilers
outperform current GCC implementations
- A simple but innovative example at the time was tracking
register pressure in the WHIRL IR originally used by MIPS
and SGI. A minimal sketch of the general idea follows.
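Purely as an illustration (not any vendor's actual IR), each IR instruction can carry the
functional unit it needs plus its latency, and the scheduler checks per-cycle utilisation
before issuing; all names below are made up:

    #include <stdbool.h>

    enum fu { FU_ALU0, FU_ALU1, FU_LDST, FU_FPU, FU_COUNT };

    struct ir_insn
    {
        int     id;           /* instruction id                         */
        enum fu unit;         /* functional unit this instruction needs */
        int     latency;      /* cycles until its result is usable      */
        int     ready_cycle;  /* earliest cycle its operands are ready  */
    };

    /* one slot per functional unit in the cycle currently being filled */
    static bool fu_busy[FU_COUNT];

    /* can this instruction be multi-issued in the current cycle
       alongside whatever has already been picked?                      */
    static bool can_issue( const struct ir_insn *insn, int cycle )
    {
        return cycle >= insn->ready_cycle && !fu_busy[insn->unit];
    }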
GCC
- GCC is a fairly standard compiler technology. Historically it had one tree form (the Parse
Tree) generated from the front end and a retargetable machine format (RTL) across
which the standard optimisations were done.
- Since 2005 this was expanded and the tree forms now include the Parse Tree, the GENERIC
(language independent) tree and the GIMPLE (supporting SSA form) tree (C and C++ omit a
GENERIC tree). The standard optimisations now occur after an SSA form has been
generated (scalar ops only). SSA construction in GCC versions all variables and inserts
PHI functions at merge points; leaving SSA merges the versions back down.
· This solved the problem that the various front-end parse trees did not use a common IR which
could be used as the basis for thorough optimisation, and that the RTL IR was also unsuitable
because it was at too low a level.
- Compiler passes over the IR are handled via an extendable pass manager which, as of
4.1.1, includes preparation for optimisation and optimisation proper. The passes are
separated across interprocedural, intra-procedural and machine forms (consisting of c.
100 SSA passes, c. 100 GIMPLE passes and c. 60 RTL passes [Novillo06]). The majority of
these passes centre on the intra-procedural and machine forms.
- One criticism I would make of GCC is that in some cases it flagrantly ignores
manufacturer-architected conventions. This leads to a lack of interoperability with the
rest of the manufacturer's system software, for example the manufacturer's cross-
functional software support or the manufacturer's system threading package and libraries.
Another problem for GCC is to stem the flow of machine-dependent RTL-based
optimisations by handling these in a smarter way.
- Corporate involvement is accelerating functional releases (2008-2009: 4 releases in the
last year; current 4.4.1)
GCC Gimple
- Gimple
· Influenced by the McCAT Simple IR (GNU Simple)
· Need for a generic, language independent IR
· Need for an IR that renders complex, deep parse
trees into an IR that is easier to analyse for
optimisation
· A small grammar covers bitwise, logical,
assignment, statement etc.
· Unlike the parse tree, a GIMPLE statement never references more
than 3 variables, meaning at most 2 variable reads
· High GIMPLE and Low GIMPLE -
lowering removes binding scope information and
converts conditional clauses to gotos
· GIMPLE nodes are iterated at tree level (tsi) and on a
doubly linked list at BB level (bsi)
GCC Gimple
- 3 Address format ex.
Generic form
    if ( a > b + c )
        c = b / a + ( b * a )
Gimple form
    T1 = b + c;
    if ( a > T1 )
    {
        T2 = b / a;
        T3 = b * a;
        c = T2 + T3;
    }
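For reference, the GIMPLE (and later SSA and RTL) forms GCC actually produces for a snippet
like the one above can be inspected with its dump flags, e.g. gcc -fdump-tree-gimple
-fdump-tree-ssa -c file.c (flag spellings as of the 4.x series; dump file naming varies by
version).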
GCC SSA
- SSA is another IR form, originally developed to help with dataflow analysis for interpreted
systems
· SSA evolved from Def-Use chains (Reif & Lewis); when annotated with identity assignments
(e.g. vx) these became the basis for SSA
· GCC does scalar SSA using Kildall analysis (not Wegman et al.)
· SSA for - simplification of existing optimisations; for example constant propagation was originally
complex to implement but with SSA it is greatly simplified
· SSA for - classic dataflow analysis - Reaching Definition Analysis, or more intuitively Reaching
Assignment Analysis since it attempts to pair the current variable reference to the most recent
update or write to that variable
· SSA for - significantly faster optimisation during compilation, O(n) versus O(n²) when optimising
using traditional data-flow equations
Generic form
    c = 5;
    if ( a > b + c )
        c = b / c + ( b * a )
Gimple form
    c = 5;
    T1 = b + c;
    if ( a > T1 )
    {
        T2 = b / c;
        T3 = b * a;
        c = T2 + T3;
    }
SSA form
    c_1 = 5;
    T1_1 = b_1 + c_1;
    if ( a_1 > T1_1 )
    {
        T2_1 = b_1 / c_1;
        T3_1 = b_1 * a_1;
        c_2 = T2_1 + T3_1;
    }
    c_3 = phi ( c_1, c_2 );
SSA makes Reaching
Definition Analysis easy
to perform. Here it is
being used to simplify
constant propagation.
Dominators & Φ functions in SSA
Dominators ::=
1. d dominates n if every path from the entry node to n must go through d
· Every node dominates itself
· Nodes also evidently have the property of an Immediate
Dominator
Fig.1 CFG - Basic Blocks contain scalar expressions
    A -> B, A -> C          (split)
    C -> D, C -> E          (split)
    D -> F, E -> F          (merge)
    B -> G, F -> G          (merge)
    (the blocks themselves are straight-line code)
Phi insertion example - c is defined on two different paths into the merge block G:
    c1 = x;   c2 = a / b;
    c3 = Φ( c1, c2 );
Fig.1 Clearly the path to G is either from B or F;
however the paths to B and F stem from A, so every
path to G goes through A, therefore G is
dominated by A.
Fig.1 Likewise the path to F is either from D or E;
however the paths to these stem from C, so every
path to F goes through C, therefore F is dominated
by C.
Using this we can build a Dominator Tree (Fig.2)
and derive Dominator Sets (Fig.3) and a
Dominance Frontier. The DF of the block defining a
variable is used by the compiler to introduce Phi
functions; this produces a maximal Phi insertion which
can be reduced by various methods, eg. variable
liveness (a sketch of the insertion algorithm follows the figures below).
Dominance Frontier of a BB ::=
DF(d) = { n | ∃ p ∈ pred(n) : d dom p and d !sdom n }
· The set of all CFG nodes n for which d dominates a predecessor p of n but does not strictly
dominate n itself. (Intuitively, the earliest point where the definition of a variable is no
longer guaranteed to be unique.)
· This gives maximal insertion of phi nodes and can be optimised several ways, for example by
doing liveness analysis.
Fig.2 Dom Tree:  A -> { B, C, G };  C -> { D, E, F }
Fig.3 Dom Set & Dom Frontier
Block   Dom Set     Immed Dom   Dom Frontier
A       A           -----       -----
B       A, B        A           G
C       A, C        A           G
D       A, C, D     C           F
E       A, C, E     C           F
F       A, C, F     C           G
G       A, G        A           -----
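A sketch of the standard dominance-frontier based phi placement for one variable (after
Cytron et al.); the fixed-size arrays and names are illustrative only:

    #include <stdbool.h>

    #define NBLOCKS 16

    /* df[b][x] is true when block x is in the dominance frontier of block b */
    extern bool df[NBLOCKS][NBLOCKS];

    /* defs[b]    : blocks containing an assignment to the variable
       has_phi[b] : blocks that end up needing a phi for it                  */
    void place_phis( const bool defs[NBLOCKS], bool has_phi[NBLOCKS] )
    {
        bool added[NBLOCKS] = { false };
        int  worklist[NBLOCKS], n = 0;

        for (int b = 0; b < NBLOCKS; b++)
            if (defs[b]) { added[b] = true; worklist[n++] = b; }

        while (n > 0)
        {
            int b = worklist[--n];
            for (int x = 0; x < NBLOCKS; x++)
            {
                if (df[b][x] && !has_phi[x])
                {
                    has_phi[x] = true;       /* the variable needs a phi at x */
                    if (!added[x])           /* the phi is itself a new def   */
                    {
                        added[x] = true;
                        worklist[n++] = x;
                    }
                }
            }
        }
    }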
Multicore optimisation
- Polyhedral representation
· First proposed by Feautrier in 1991 and appears in research
compilers of the time.
· Complex Loop Nest Optimisation and array analysis is difficult to do
with the corresponding AST representation - especially with respect to
strict observance of loop bounds across the nest, which often defeats
standard LNO
· The Loop Nest is reformulated as a set of equations and linear inequalities
(properly, affine constraints), and due to this higher level of
abstraction a deeper level of optimisation (transformation) can be
accomplished by solving the LP system
· Each loop iteration is an integer point in a space whose loop bounds
form a polyhedron. Ex. the first nest level is a point with 2 rays, the
second modifies this to a 4-sided 2D polyhedron, and a third forms a 3D
polyhedron. Problem - how to efficiently implement solving for a large
number of points.
· The literature reports 20-90% improvement using polyhedral LNO
· Such an improvement makes it practical and desirable to distribute
the LN and associated array computation across a set of multicores.
AMD are doing this with a lightweight intercore IPC they call streams
· Polyhedral LNO is available in GCC 4.5 as Graphite and in IBM's Cell
compiler
Multicore optimisation
- The polyhedral model
Typical Loop Nest
    for (i = 2; i <= 2*n; i++)
        z[i] = 0;                    // S1
    for (i = 1; i <= n; i++)
        for (j = 1; j <= n; j++)
            z[i+j] += x[i] * y[j];   // S2
Reformulated as affine constraints - ex. the outer loop (see the sketch below)
Transformation scheduling (optimisation)
Regenerate AST for code generation (ex. below). The separation
will be DS1 - DS2, DS2 - DS1, DS1 ∩ DS2,
giving a worst case of 3np (n = stmts; p = nest
depth)
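For example, assuming the loop nest above, the iteration domains of S1 and S2 can be written
as systems of affine inequalities (a sketch of the reformulation step):

    D_S1 = { i     |  i - 2 >= 0,  2n - i >= 0 }
    D_S2 = { (i,j) |  i - 1 >= 0,  n - i >= 0,
                      j - 1 >= 0,  n - j >= 0 }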
t=2; i=2;
z[i] = 0;
for (t=3; t<=2*n; t++) {
    for (i=max(1,t-n-1); i<=min(t-2,n); i++) {
        j = t-i-1;
        z[i+j] += x[i] * y[j];
    }
    i = t;
    z[i] = 0;
}
t = 2*n+1; i = n; j = n;
z[i+j] += x[i] * y[j];
Steps
1. Define the domain Dn (ref. bounds
of the enclosing loop)
· List access functions, ex.
S1 = z[i] = 0
· Transform (optimise) with
some affine schedule, eg. S1(i) =
(i)
· Generate code using projection
and separation of polyhedra:
DS1 - DS2
DS2 - DS1
DS1 ∩ DS2
Ref [Bastoul06]