
# Compiler Optimisations

© 2009 stratusdesign@gmail.com stratusdesign.blogspot.com

Overview

• Introduction
• Legacy optimisation
• Vector SIMD optimisation
• DSP optimisation
• RISC/Superscalar optimisation
• SSA optimisation
• Multicore optimisation

Introduction

This is intended as an overview of the optimisation process only, as optimisations can be done in different ways, often with subtle machine-specific variations. Broadly speaking, there are four main classes of optimisation available to the implementor:

• Classic legacy optimisations - these are well understood and the majority are technically straightforward to implement. They offer a gain of around 10-25% in performance.
• Classic vector optimisations - once the preserve of leviathan mainframe CPUs with brand new shiny vector units attached, but now very commonly found in DSP-related technologies. Technically these optimisations are more difficult than the former but still not complicated. For the right class of narrow numerical applications, fully and properly optimised, they can yield performance improvements of 500%-2400%.
• RISC-based optimisations - despite the potential speed of these machines, scheduling fast code close to the theoretical maximum on a RISC has been, and continues to be, problematic. For example, when the Alpha's then brand new GEM compilers were profiled on the machine, the code only approached what the Alpha was capable of about 30% of the time. In other words the raw compute power of the Alpha was wasted 70% of the time, and all those extra MHz were just used to heat up your datacentre/office. Performance enhancements are of the order of at least 150%.
• Parallel or hybrid optimisations - optimisation in these cases is dominated by the underlying memory architecture, e.g. UMA, NUMA, MIMD or MIMD/SIMD hybrid, so as with RISC, memory bandwidth is an issue. The other factors are interprocessor utilisation, communication, security and management, and identifying coroutines to schedule on the parallel system.

Another issue is that most commercial computer languages to date have not been very good at allowing the programmer to express parallelism. This means the compiler has to infer parallelism from what is essentially a missing attribute, which is very difficult to accomplish with any degree of success. Currently most languages rely on rather unsophisticated library or system routines.


**Classic legacy optimisation**

• Copy propagation

Before

    x = y;
    z = 1 + x;

After

    x = y;
    z = 1 + y;

Before optimisation there is a data dependency: z has to wait for the value of x to be written. Afterwards z reads y directly, so the two assignments are independent.

• Constant propagation

Before

    x = 42;
    z = 1 + x;

After

    x = 42;
    z = 1 + 42;

**Classic legacy optimisation**

• Constant folding

Before

    x = 512 * 4;

After

    x = 2048;

Can be applied to constant arguments, statics and locals.

**• Dead code removal**

o Temporary code created by the compiler, e.g. when doing constant propagation
o Dead variable removal
o Elimination of unreachable code, e.g. in C switch statements

**Classic legacy optimisation**

• Algebraic

Before

    x = 10 * ( x + 5 ) / 10;

After (assuming the intermediate product does not overflow)

    x += 5;

• Strength Reduction

Before

    (1) x = y ** 2;
    (2) x = y * 2;

After

    (1) x = y * y;
    (2) x = y + y;

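**Classic legacy optimisation**

• Variable renaming

Before

    x = y * z;
    x = u * v;

After

    x = y * z;
    x0 = u * v;

Renaming the second target removes the false (output) dependency between the two assignments, so they can be scheduled independently.

**• Common subexpression elimination**

Before

    x = u * ( y + z );
    w = ( y + z ) / 2;

After

    x0 = y + z;
    x = u * x0;
    w = x0 / 2;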

**Classic legacy optimisation**

• Loop invariant code motion

Before

    for (i=0; i<10; i++)
        x[i] += v[i] + a + b;

After

    x0 = a + b;
    for (i=0; i<10; i++)
        x[i] += v[i] + x0;

**• Loop induction variable simplification**

Before

    for (i=0; i<10; i++)
        x = i * 2 + v;

After (the multiply is replaced by a running addition)

    t = v;
    for (i=0; i<10; i++) {
        x = t;
        t += 2;
    }

**Classic legacy optimisation**

• Loop unrolling

Before

    for (i=0; i<n; i++)
        x[i] += x[i-1] * x[i+1];

After (unroll by factor of 2, with a remainder step for odd n)

    for (i=0; i+1<n; i+=2) {
        x[i] += x[i-1] * x[i+1];
        x[i+1] += x[i] * x[i+2];
    }
    if (i<n)
        x[i] += x[i-1] * x[i+1];
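A minimal sketch of unrolling by two with remainder handling. It deliberately uses a simpler, dependence-free loop body than the slide (the function names and the `v`/`k` parameters are hypothetical) so that the two versions can be compared element by element.

```c
#include <stddef.h>

/* Reference loop: one element and one loop-control branch per iteration. */
void scale_add(int *x, const int *v, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        x[i] += v[i] * k;
}

/* Unrolled by 2: roughly half the loop-control branches,
   plus a remainder step that handles an odd element count. */
void scale_add_unrolled(int *x, const int *v, size_t n, int k)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        x[i]     += v[i]     * k;
        x[i + 1] += v[i + 1] * k;
    }
    if (i < n)                      /* odd n: one element left over */
        x[i] += v[i] * k;
}
```

The bound `i + 1 < n` guarantees both elements touched in the unrolled body exist; the trailing `if` picks up the last element when `n` is odd.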

**• Tail recursion elimination**

    recurs( x, y ) { if( x > 0 ) return recurs( x - y, y ); return x; }

The recursive call is the last thing the function does, so all computation is done by the time it is made. By simply jumping to the top of the function, excessive stack frame creation is avoided. This may not be possible in some languages; for example C++ usually arranges to call destructors at function exit, so the call is no longer the final action.
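As an illustration, here is a tail-recursive function next to the loop a compiler could turn it into. The greatest-common-divisor example and both function names are chosen for this sketch and do not come from the slides.

```c
/* Tail-recursive form: the recursive call is the final action,
   so nothing in the current frame is needed after it. */
unsigned gcd_rec(unsigned a, unsigned b)
{
    if (b == 0)
        return a;
    return gcd_rec(b, a % b);
}

/* What tail-call elimination effectively produces: rebind the
   parameters and jump back to the top instead of pushing a frame. */
unsigned gcd_loop(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned t = a % b;
        a = b;
        b = t;
    }
    return a;
}
```

Whether a given C compiler actually performs this transformation depends on the compiler and optimisation level; the language does not require it.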

GVN

• Global value numbering
o Idea is an extension of Local Value Numbering (within a Basic Block)
o Similar to CSE but can target cases that aren't considered by CSE (see below)

Local value numbering

    a = b + c
    d = b
    e = d + c

With b = V1 and c = V2 this gives a = V1 + V2, d = V1 and e = V1 + V2. Therefore a and e are equivalent.

Global value numbering ~ has to consider effects of control flow across BBs

    x1 = a1               a1 = V1, x1 = V3 → V1
    x2 = b1               b1 = V2, x2 = V4 → V2
    x3 = phi( x1, x2 )    x3 = V5 = phi( V1, V2 ) → V6

Nb. Later rhs evaluations ripple through previous nodes
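The local value numbering example (a = b + c; d = b; e = d + c) can be sketched in C as follows. The `instr` encoding, the fixed-size table and the function name are assumptions made for this sketch; a production pass would use hashing and would also canonicalise commutative operands.

```c
#include <stddef.h>

enum { MAX_EXPR = 64 };

/* One three-address instruction: dst = a op b, or dst = a when op is '='. */
struct instr { char dst, op, a, b; };

/* A previously seen expression and the value number it produced. */
struct expr { char op; int l, r, vn; };

/* Number the values in one basic block. vn_out[i] receives the value
   number assigned to the destination of instruction i. */
void local_value_number(const struct instr *code, size_t n, int *vn_out)
{
    int var_vn[256] = {0};              /* 0 means "not numbered yet" */
    struct expr table[MAX_EXPR];
    size_t n_expr = 0;
    int next = 1;

    for (size_t i = 0; i < n; i++) {
        const struct instr *in = &code[i];
        int vn = 0;
        int l = var_vn[(unsigned char)in->a];
        if (l == 0)
            l = var_vn[(unsigned char)in->a] = next++;

        if (in->op == '=') {
            vn = l;                     /* a copy shares its source's number */
        } else {
            int r = var_vn[(unsigned char)in->b];
            if (r == 0)
                r = var_vn[(unsigned char)in->b] = next++;
            for (size_t k = 0; k < n_expr; k++)
                if (table[k].op == in->op && table[k].l == l && table[k].r == r)
                    vn = table[k].vn;   /* same operator, same operand values */
            if (vn == 0) {
                vn = next++;
                if (n_expr < MAX_EXPR)
                    table[n_expr++] = (struct expr){ in->op, l, r, vn };
            }
        }
        var_vn[(unsigned char)in->dst] = vn;
        vn_out[i] = vn;
    }
}
```

Running it on the three instructions of the example gives a and e the same value number, which is the equivalence that plain textual CSE would miss because e is written in terms of d rather than b.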

PRE

• Partial redundancy elimination includes analysis for
o Loop invariant code motion - see previous
o Full redundancy elimination - see previous for CSE
o Partial redundancy elimination - see below - evaluation of x+y is predicated on some condition, creating a Partial Redundancy
• Some PRE variants are applied to SSA values, not just the expressions, effectively combining PRE and GVN

CFG

    cond-eval
      then:  a = x+y
      else:  …
    b = x+y

**Elimination of Partial Redundancy**

    cond-eval
      then:  T=x+y  a=T
      else:  T=x+y  …
    b=T

**Elimination of Full Redundancy (ref CSE)**

    T=x+y
    cond-eval
      then:  a=T
      else:  …
    b=T
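The partial redundancy case can be written out in C roughly as follows. The function names and the use of an output pointer for `a` are choices made for this sketch only.

```c
/* Before: x + y is evaluated on the taken path and again after the join,
   so the second evaluation is redundant only when cond is true. */
int pre_before(int cond, int x, int y, int *a)
{
    if (cond)
        *a = x + y;
    return x + y;               /* b = x + y */
}

/* After PRE: a copy of x + y is inserted on the path that lacked it,
   which makes the evaluation after the join fully redundant, so it
   can be replaced by the temporary t. */
int pre_after(int cond, int x, int y, int *a)
{
    int t;
    if (cond) {
        t = x + y;
        *a = t;
    } else {
        t = x + y;              /* inserted computation */
    }
    return t;                   /* b = t */
}
```

Each path now evaluates x + y exactly once, whereas the original evaluated it twice whenever the condition held.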

**Classic legacy optimisation**

• Leaf procedure optimisation

A routine which does not call any other routines or require any local storage can be invoked with a simple JSR/RET.

• Procedure inlining

This technique avoids the overhead of a call/ret by duplicating the code wherever it is needed. It is best used for small frequently called routines.

Vector SIMD

• These optimisations increase performance by using deep vector unit pipelines, and the data locality and data isolation found when manipulating arrays, to parallelise the computation. They also reduce conditional branching over potentially large datasets.
• Nowadays SIMD instructions appear most frequently in DSPs for computing FIR/IIR filters or doing FFTs.
• Most modern microprocessors also have vector support in their SIMD extensions, e.g. SSE and Altivec, which have traditionally offered cut-down functionality in their vector units, but future trends are towards fuller implementations.
• Some studies have shown that when code can be vectorised, performance can improve in some cases by around 500% or more.

Vector SIMD

Before

    for( i=0; i<64; i++ ) a[i] = b[i] + 50;

Before, CISC case

    movl #1, r0
    moval a, r1
    moval b, r2
    L$1: addl #50, (b)+
    movl (b), (a)+
    aobleq #64, r0, L$1

After, Classic VP (long vector)

    mtvlr #64
    vldl a, v0
    vldl b, v1
    vvaddl v1, #50
    vstl v0, a

Nb. Also optimised away another branch

After, Altivec et al., limited to 4x32b parallelism

    vspltisw v0, #50
    lw r1, 0(a)
    lw r2, 0(b)
    lvx v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1      ; have 4 words added in parallel
    lw r1 128(a)
    lw r2 128(b)
    lvx v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1      ; have 8 words added in parallel
    ; keep going...
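At source level, the 4-wide case corresponds to processing the array in strips of four independent lanes. The sketch below is plain portable C, not intrinsics; the names `add50_scalar`, `add50_strips`, `N` and `LANES` are hypothetical, and whether a compiler maps each strip to a single vector load, add and store depends on the target and the flags used.

```c
enum { N = 64, LANES = 4 };

/* Scalar reference: 64 adds and 64 loop-control branches. */
void add50_scalar(int *a, const int *b)
{
    for (int i = 0; i < N; i++)
        a[i] = b[i] + 50;
}

/* Strip-mined form: each iteration handles four independent lanes,
   mirroring one 4 x 32-bit vector operation per strip (16 strips). */
void add50_strips(int *a, const int *b)
{
    for (int i = 0; i < N; i += LANES) {
        a[i + 0] = b[i + 0] + 50;
        a[i + 1] = b[i + 1] + 50;
        a[i + 2] = b[i + 2] + 50;
        a[i + 3] = b[i + 3] + 50;
    }
}
```

Because N is a multiple of LANES no remainder loop is needed here; a general version would need one.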

**Scalar/Superscalar RISC**

• Load delay slot

o The result of a load cannot be used by the following instruction without stalling the pipeline until the load completes. Instead of letting the machine stall, some useful code is found that can be placed between the load of r2 and the add that uses r2. If no useful code can be found, a nop is inserted instead.

u = v + w; z = x + y;

before

    ld r1, v
    ld r2, w
    add r3, r1, r2
    sw u, r3
    ld r4, x
    etc..

after

    ld r1, v
    ld r2, w
    ld r4, x
    add r3, r1, r2
    sw u, r3
    etc..

DSP optimisation

• DSPs have some unique hardware design features which require additional compiler support

o tbd

**Scalar/Superscalar RISC**

• Branch delay slot

o The result of a branch cannot be resolved without stalling the pipeline. Instead of letting the machine stall, some useful code is found that can be placed immediately after the branch. Several strategies can be used: find a useful candidate instruction from before the branch, take one from the branch target and advance the branch target address by one instruction, or take a candidate from after the branch. If no candidate can be found, a nop is inserted instead.

z = x + y; if( x == 0 ) goto L1;

before

    ld r1, x
    ld r2, y
    add r3, r1, r2
    cmp r1, 0
    beq L1
    …
    L1: sll r3, 4

after

    ld r1, x
    ld r2, y
    cmp r1, 0
    beq L1
    add r3, r1, r2      ; fills the delay slot
    ...
    L1: sll r3, 4

**Scalar/Superscalar RISC**

• Branch reduction

o Loop unrolling is one way to reduce branching, other methods exist

Ex. Bitfield setting and rotation

    if( x == 0 ) y++;
    ...

before

    L1: ...
    lw r2, x
    cmpi r1, r2, 10
    bne r1, L2
    addi r3, r0, 1
    L2: …

after (branch eliminated)

    lw r2, x
    cmpdi r2, 10
    cntlzw r2, r2
    addic r2, r2, -32
    L2: rlwinm r3, r2, 1, 31, 31
    ...
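The same idea can be expressed in portable C with a bit trick instead of count-leading-zeros. This is a sketch under the assumption of 32-bit unsigned operands; the function names are hypothetical, and a compiler may well generate branch-free code for the first form on its own.

```c
#include <stdint.h>

/* Branchy form of: if (x == 0) y++; */
uint32_t inc_if_zero_branchy(uint32_t x, uint32_t y)
{
    if (x == 0)
        y++;
    return y;
}

/* Branch-free form: (~x & (x - 1)) has its top bit set only when x == 0,
   so shifting that bit down yields 1 for zero and 0 for anything else. */
uint32_t inc_if_zero_branchless(uint32_t x, uint32_t y)
{
    uint32_t is_zero = (~x & (x - 1u)) >> 31;
    return y + is_zero;
}
```

The test value is turned into data (0 or 1) by bit manipulation and then added unconditionally, which is the essence of the rotate-and-mask sequence in the slide.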

**Scalar/Superscalar RISC**

• Conditional Move

o Another branch reduction technique

Ex.

    if( x == 0 ) y = 1; else z = 20;

before

    ldq r1, x
    cmp r1, 0
    beq r1, L1
    mov r3, 1
    ...
    L1: mov r3, 20

after

    ldq r1, x
    ldq r2, 1
    ldq r3, 20
    cmp r1, 0
    cmovez r3, r2, r1
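A conditional move selects between two already-computed values without a jump. The sketch below shows that selection in C using a mask; it assumes both arms assign the same variable and a two's complement `int`, and the function names are hypothetical.

```c
/* Branchy selection. */
int select_branchy(int x)
{
    if (x == 0)
        return 1;
    return 20;
}

/* Branch-free selection: build an all-ones or all-zeros mask from the
   comparison, then blend the two candidate values with it. */
int select_branchless(int x)
{
    int mask = -(x == 0);               /* -1 when x == 0, otherwise 0 */
    return (1 & mask) | (20 & ~mask);
}
```

Both candidates are materialised first and the condition only decides which one survives, which is what the cmov instruction does in hardware.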

Superscalar Scheduling

• This is usually achieved by creating another IR, or extending an existing IR, to associate machine instructions with RISC functional units. In this way a determination can be made as to current FU utilisation and how best to reorder code for superscalar multi-issue.
• These IRs are highly guarded and highly proprietary technologies.
• This is the reason, for example, that the IBM POWER compiler outperforms current GCC implementations.
• A simple but, at the time, innovative example was tracking register pressure in the WHIRL IR originally used by MIPS and SGI.

GCC

• GCC is a fairly standard compiler technology. Historically it had one tree form (the Parse Tree) generated by the front end and a retargetable machine format (RTL) across which the standard optimisations were done.

• Since 2005 this has been expanded, and tree forms now include the Parse Tree, GENERIC (language independent) and GIMPLE (supporting SSA form) trees (C and C++ omit a GENERIC tree). The standard optimisations now occur after an SSA form has been generated (scalar ops only). SSA starts out in GCC by versioning all variables and finishes by merging them back down with PHI functions.

o This solved two problems: the various front-end parse trees did not share a common IR that could be used as the basis for thorough optimisation, and the RTL IR was unsuitable because it was at too low a level.

• Compiler passes over the IR are handled via an extensible Pass manager (as of 4.1.1). They include preparation for optimisation and optimisation proper, and are separated across inter-procedural, intra-procedural and machine forms (SSA c.100 passes, GIMPLE c.100 passes, RTL c.60 passes [Novillo06]). The majority of these passes centre on the intra-procedural and machine forms.

• One criticism I would make of GCC is that in some cases it flagrantly ignores manufacturer-architected conventions. This leads to a lack of interoperability with the rest of the manufacturer's system software, for example the manufacturer's cross-functional software support or the manufacturer's system threading package and libraries.

• Another problem for GCC is to stem the flow of machine-dependent RTL-based optimisations by handling these in a smarter way.

• Corporate involvement is accelerating functional releases (2008-2009: 4 releases in the last year, current 4.4.1).

GCC Gimple

• Gimple

o Influenced by McCAT Simple IR (GNU Simple)
o Need for a generic language independent IR
o Need for an IR that renders complex deep parse trees to an IR that is easier to analyse for optimisation
o A small grammar covers bitwise, logical, assignment, statement etc.
o Unlike the parse tree, Gimple never references more than 3 variables, meaning at most 2 variable reads
o High Gimple and Low Gimple; lowering removes binding scope information and converts conditional clauses to gotos
o Gimple nodes iterated at tree-level (tsi) and on a doubly linked list at bb level (bsi)

GCC Gimple

• 3 Address format ex.

Generic form

    if ( a > b + c )
        c = b / a + ( b * a );

Gimple form

    T1 = b + c;
    if ( a > T1 ) {
        T2 = b / a;
        T3 = b * a;
        c = T2 + T3;
    }

GCC SSA

• SSA is another IR form, originally developed to help with dataflow analysis for interpreted systems

o SSA evolved from Def-Use chains (Reif & Lewis), which when annotated with identity assignments, eg. vx, became the basis for SSA
o GCC does Scalar SSA using Kildall Analysis (not Wegman et al.)
o SSA for ~ Simplification of existing optimisations, for example constant propagation was originally complex to implement but with SSA it is greatly simplified
o SSA for ~ Classic dataflow analysis - Reaching Definition Analysis, or more intuitively Reaching Assignment Analysis, since it attempts to pair the current variable reference with the most recent update or write to that variable
o SSA for ~ significantly faster optimisation during compilation, O(n) versus O(n²) when optimising using traditional data-flow equations

Generic form

    c = 5;
    if ( a > b + c )
        c = b / c + ( b * a );

Gimple form

    c = 5;
    T1 = b + c;
    if ( a > T1 ) {
        T2 = b / c;
        T3 = b * a;
        c = T2 + T3;
    }

**SSA form**

    c_1 = 5;
    T1_1 = b_1 + c_1;
    if ( a_1 > T1_1 ) {
        T2_1 = b_1 / c_1;
        T3_1 = b_1 * a_1;
        c_2 = T2_1 + T3_1;
    }
    c_3 = phi ( c_1, c_2 );

SSA makes Reaching Definition Analysis easy to perform. Here it is being used to simplify constant propagation: c_1 has exactly one definition, so every use of c_1 can be replaced by 5.

Dominators & Φ Fn in SSA

Dominators ::=>
1. d dominates n if every path from the entry node to n must go through d
• Every node dominates itself
• Nodes other than the entry also evidently have the property of an Immediate Dominator

Fig.1 CFG (Basic Blocks contain scalar expressions). Edges as recoverable from the text: A → B, A → C (split); C → D, C → E; D → F, E → F (merge); B → G, F → G. The figure annotates blocks with c1 = x; and c2 = a / b; and the merge block with c3 = phi (c1, c2);

Clearly the path to G is either from B or F, however the paths to B and F stem from A, so every path to G goes through A, therefore G is dominated by A.

Likewise the path to F is either from D or E, however the paths to these stem from C, so every path to F goes through C, therefore F is dominated by C.

Using this we can build a Dominator Tree (Fig.2) and derive Dominator Sets (Fig.3) and a Dominance Frontier.

Fig.2 Dom Tree

    A
      B
      C
        D
        E
        F
      G

Fig.3 Dom Set & Dom Frontier

    Block   Dom Set     Immed Dom   Dom Frontier
    A       A           ----        ----
    B       A, B        A           G
    C       A, C        A           G
    D       A, C, D     C           F
    E       A, C, E     C           F
    F       A, C, F     C           G
    G       A, G        A           ----

Dominance Frontier of a BB variable ::=>

DF(d) = {n | ∃p∈pred(n), d dom p and d !sdom n}

• The set of all CFG nodes n for which d dominates a predecessor p of n but does not strictly dominate n itself. (Intuitively the earliest point where the definition of a variable is not guaranteed to be unique.)
• A DF over a given variable in the BB is used by the compiler to introduce Phi functions. This gives maximal insertion of phi nodes and can be reduced in several ways, for example by doing liveness analysis.
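The dominator sets and dominance frontiers for the seven-block example can be computed with a short iterative sketch. It assumes the CFG edges A→B, A→C, C→D, C→E, D→F, E→F, B→G and F→G inferred from the text, encodes sets as bitmasks, and uses hypothetical names throughout; it is the simple fixed-point formulation, not necessarily what GCC implements.

```c
#include <stdint.h>

enum { A, B, C, D, E, F, G, NBLOCKS };

/* pred[n] is a bitmask of the predecessors of block n (assumed CFG). */
static const uint8_t pred[NBLOCKS] = {
    [A] = 0,
    [B] = 1u << A,
    [C] = 1u << A,
    [D] = 1u << C,
    [E] = 1u << C,
    [F] = (1u << D) | (1u << E),
    [G] = (1u << B) | (1u << F),
};

/* dom[n] = {n} united with the intersection of dom[p] over all
   predecessors p of n, iterated until nothing changes. */
void compute_dominators(uint8_t dom[NBLOCKS])
{
    const uint8_t all = (1u << NBLOCKS) - 1;
    dom[A] = 1u << A;                       /* entry dominates only itself */
    for (int n = 1; n < NBLOCKS; n++)
        dom[n] = all;
    for (int changed = 1; changed; ) {
        changed = 0;
        for (int n = 1; n < NBLOCKS; n++) {
            uint8_t meet = all;
            for (int p = 0; p < NBLOCKS; p++)
                if (pred[n] & (1u << p))
                    meet &= dom[p];
            uint8_t next = (uint8_t)(meet | (1u << n));
            if (next != dom[n]) {
                dom[n] = next;
                changed = 1;
            }
        }
    }
}

/* DF(d) = { n | some p in pred(n) has d dom p, and d does not sdom n }. */
void compute_frontiers(const uint8_t dom[NBLOCKS], uint8_t df[NBLOCKS])
{
    for (int d = 0; d < NBLOCKS; d++)
        df[d] = 0;
    for (int n = 0; n < NBLOCKS; n++)
        for (int p = 0; p < NBLOCKS; p++) {
            if (!(pred[n] & (1u << p)))
                continue;
            for (int d = 0; d < NBLOCKS; d++) {
                int d_dom_p  = (dom[p] >> d) & 1;
                int d_sdom_n = d != n && ((dom[n] >> d) & 1);
                if (d_dom_p && !d_sdom_n)
                    df[d] |= (uint8_t)(1u << n);
            }
        }
}
```

For this CFG the sketch yields DF(B) = DF(C) = DF(F) = {G} and DF(D) = DF(E) = {F}, so a variable defined in D or E needs a phi at F, and one defined in B, C or F needs a phi at G.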

Multicore optimisation

• Polyhedral representation

o First proposed by Feautrier in 1991 and appears in research compilers of the time.
o Complex Loop Nest Optimisation and Array analysis is difficult to do with the corresponding AST representation ~ especially with respect to strict observance of loop bounds across the nest, which often defeats standard LNO
o The Loop Nest is reformulated as a set of equations, Linear Inequalities (properly affine constraints), and due to this higher level of abstraction a deeper level of optimisation (transformation) can be accomplished by solving the LP system
o Each loop iteration is an integer point in an XY space, the loop bounds of which form a polyhedron. Ex. The first nest is a point with 2 rays, the second modifies this as a 4 sided 2D polyhedron, the third forms a 3D polyhedron.
o Problem - how to efficiently implement solving for a large number of points.
o The literature reports 20-90% improvement using polyhedral LNO. Such an improvement makes it practical and desirable to distribute the LN and associated array computation across a set of multicores. AMD are doing this with a lightweight intercore IPC they call streams
o Polyhedral LNO is available in GCC 4.5 as Graphite and in IBM's Cell Compiler

Multicore optimisation

• The polyhedral model Ref [Bastoul06]

❶ Typical Loop Nest

    for(i=2; i<=n; i++)
        z[i] = 0;                        // S1
    for(i=1; i<=n; i++)
        for(j=1; j<=n; j++)
            z[i+j] += x[i] * y[j];       // S2

Steps
1. Define domain Dn (ref bounds of enclosing loop)
2. List Access functions Ex. S1 = Z[i]=0
3. Transform (optimise) with some affine schedule eg. S1(i) = (i)
4. Generate code using projection and separation of polyhedra

❷ Reformulated as affine constraints. Ex. outer loop

❸ Transformation scheduling (optimisation)

❹ Regenerate AST for code generation. Ex. Will be DS1 - DS2 ∧ DS2 - DS1 ∧ DS1 ∩ DS2 giving worst case of 3np (n=stmts; p=nest depth)

    // DS1 - DS2
    t=2; i=2;
    z[i] = 0;
    // DS1 ∩ DS2
    for(t=3; t<=2*n; t++) {
        for(i=max(1,t-n-1); i<=min(t-2,n); i++) {
            j = t-i-1;
            z[i+j] += x[i] * y[j];
        }
        if (t<=n) { i=t; z[i] = 0; }
    }
    // DS2 - DS1
    t=2*n+1; i=n; j=n;
    z[i+j] += x[i] * y[j];
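To check that a rescheduled nest of this shape computes the same result as the original, both can be written as C functions and compared. This sketch assumes the S2 body is a multiply-accumulate `x[i] * y[j]`, that S1 runs at time `t = i` and S2 at time `t = i + j + 1`, that arrays are used 1-based, and that `n >= 2`; the function names and helpers are hypothetical.

```c
#include <assert.h>

static int imin(int a, int b) { return a < b ? a : b; }
static int imax(int a, int b) { return a > b ? a : b; }

/* Original nest. x and y hold n + 1 ints, z holds 2n + 1 ints;
   index 0 is unused. */
void nest_original(int n, int *z, const int *x, const int *y)
{
    for (int i = 2; i <= n; i++)
        z[i] = 0;                               /* S1 */
    for (int i = 1; i <= n; i++)
        for (int j = 1; j <= n; j++)
            z[i + j] += x[i] * y[j];            /* S2 */
}

/* Rescheduled nest: at time t, S2 finishes accumulating z[t - 1]
   while S1 clears z[t], so each z element is cleared just before use. */
void nest_rescheduled(int n, int *z, const int *x, const int *y)
{
    assert(n >= 2);                             /* S1 must be non-empty */
    z[2] = 0;                                   /* t = 2: S1 only */
    for (int t = 3; t <= 2 * n; t++) {
        for (int i = imax(1, t - n - 1); i <= imin(t - 2, n); i++) {
            int j = t - i - 1;
            z[i + j] += x[i] * y[j];            /* S2 at time i + j + 1 */
        }
        if (t <= n)
            z[t] = 0;                           /* S1 at time t */
    }
    z[2 * n] += x[n] * y[n];                    /* t = 2n + 1: S2 only */
}
```

The guard `t <= n` keeps S1 inside its original domain 2..n, so elements z[n+1] to z[2n] keep whatever initial value they had, exactly as in the original nest.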