
**© 2009 stratusdesign@gmail.com**

stratusdesign.blogspot.com

Author details

· No connection to Sun Microsystems/Oracle.

· Worked on low-level Mainframe- and Embedded-class Kernels and a Distributed Microkernel for International Hardware/Microprocessor Companies, as well as having extensive experience developing a proprietary ANSI C compiler toolchain for Intel, MIPS and Alpha processors. Also experience with parallel computers and DFT for signal processing using CUDA/GPU.

· Now looking for new contracts, projects (primarily software) in GB, Europe or US in the Kernel, Compiler, Signal Processing and Parallel areas (as of Jun 2011).

· Contact dmctek@gmail.com

Overview

- Introduction

- Legacy optimisation

- Vector SIMD optimisation

- DSP optimisation

- RISC/Superscalar optimisation

- SSA optimisation

- Multicore optimisation

Introduction

- This is intended as an overview of the optimisation process only, as optimisations can be done in different ways, often with subtle machine-specific variations. Broadly speaking, there are four main classes of optimisation available to the implementor:

- Classic legacy optimisations - these are well understood and the majority are technically straightforward to implement. They offer a gain of around 10-25% in performance.

- Classic Vector optimisations - once the preserve of leviathan mainframe CPUs with brand new shiny Vector Units attached, but now very commonly found in DSP-related technologies. Technically these optimisations are more difficult than the former, but still not complicated. For the right class of narrow numerical applications, fully and properly optimised, they can yield gains of 500%-2400%.

- RISC-based optimisations - despite their potential speed, scheduling code close to the theoretical maximum on a RISC has been, and continues to be, problematic. For example, the Alpha's brand-new GEM compilers, when profiled on the machine, only achieved speeds approaching what the Alpha was capable of about 30% of the time. That meant the raw compute power of the Alpha was wasted 70% of the time; in other words, all those extra MHz were just used to heat up your datacentre/office. Performance enhancements are of the order of at least 150%.

- Parallel or Hybrid optimisations - optimisation in these cases is dominated by the underlying memory architecture, e.g. UMA, NUMA, MIMD or MIMD/SIMD Hybrid, so as with RISC, memory bandwidth is an issue. The other factors are interprocessor utilisation, interprocessor communication, interprocessor security, interprocessor management, and identifying coroutines to schedule on the parallel system. Another issue is that most commercial computer languages to date have not been very good at allowing the programmer to express parallelism; this means the compiler has to infer parallelism from what is essentially a missing attribute, and this is most difficult to accomplish with any degree of success. Currently most languages rely on rather unsophisticated library or system routines.

Classic legacy optimisation

- Copy propagation

Before

x = y;
z = 1 + x;

After

x = y;
z = 1 + y;

Before optimisation a data dependency is created when z has to wait for the value of x to be written.

- Constant propagation

Before

x = 42;
z = 1 + x;

After

x = 42;
z = 1 + 42;

Classic legacy optimisation

- Constant folding

Before

x = 512 * 4;

After

x = 2048;

Can be applied to constant arguments, statics and locals (a minimal sketch of such a folding pass follows below).

- Dead code removal

· Temporary code created by the compiler, e.g. when doing constant propagation

· Dead variable removal

· Elimination of unreachable code, e.g. in C switch statements
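As an illustration of how the constant-folding rule above might be implemented, here is a minimal sketch in C that folds add/mul nodes of a toy expression tree when both operands are literal constants. The Node layout and the fold/mk names are hypothetical, purely for illustration; a production folder also handles overflow, floating point and language-specific semantics.

    #include <stdio.h>
    #include <stdlib.h>

    /* Toy expression tree: a node is a constant or a binary operator. */
    typedef enum { CONST, ADD, MUL } Kind;

    typedef struct Node {
        Kind kind;
        long value;                 /* valid when kind == CONST */
        struct Node *lhs, *rhs;     /* valid for ADD/MUL */
    } Node;

    /* Fold constants bottom-up: if both children fold to constants,
     * replace the operator node with a single CONST node. */
    static Node *fold(Node *n) {
        if (n->kind == CONST) return n;
        n->lhs = fold(n->lhs);
        n->rhs = fold(n->rhs);
        if (n->lhs->kind == CONST && n->rhs->kind == CONST) {
            n->value = (n->kind == ADD) ? n->lhs->value + n->rhs->value
                                        : n->lhs->value * n->rhs->value;
            n->kind = CONST;
        }
        return n;
    }

    static Node *mk(Kind k, long v, Node *l, Node *r) {
        Node *n = malloc(sizeof *n);
        n->kind = k; n->value = v; n->lhs = l; n->rhs = r;
        return n;
    }

    int main(void) {
        /* x = 512 * 4, as in the slide */
        Node *e = mk(MUL, 0, mk(CONST, 512, NULL, NULL),
                             mk(CONST, 4, NULL, NULL));
        e = fold(e);
        printf("folded to %ld\n", e->value);   /* prints 2048 */
        return 0;
    }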

Classic legacy optimisation

- Algebraic simplification

Before

x = 10 * ( x + 5 ) / 10;

After

x += 5;

- Strength reduction

Before

(1) x = y ** 2;
(2) x = y * 2;

After

(1) x = y * y;
(2) x = y + y;

Classic legacy optimisation

- Variable renaming

Before

x = y * z;
x = u * v;

After

x = y * z;
x0 = u * v;

- Common subexpression elimination

Before

x = u * ( y + z );
w = ( y + z ) / 2;

After

x0 = y + z;
x = u * x0;
w = x0 / 2;

Classic legacy optimisation

- Loop invariant code motion

Before

for (i=0; i<10; i++)
    x[i] += v[i] + a + b;

After

x0 = a + b;
for (i=0; i<10; i++)
    x[i] += v[i] + x0;

- Loop induction variable simplification

Before

for (i=0; i<10; i++)
    x = i * 2 + v;

After

x = v;
for (i=0; i<10; i++)
    x += 2;

Classic legacy optimisation

- Loop unrolling

Before

for (i=0; i<n; i++)
    x[i] += x[i-1] * x[i+1];

After (unroll by a factor of 2; a remainder loop handles any leftover iterations)

for (i=0; i<n-2; i+=2)
{
    x[i]   += x[i-1] * x[i+1];
    x[i+1] += x[i]   * x[i+2];
}

- Tail recursion elimination

recurs( x, y )
{
    if( !x ) return;
    recurs( x - y );
}

All computation is done by the time the recursive call is made. By simply jumping to the top of the function, excessive stack frame creation is avoided. May not be possible in some languages; for example, C++ usually arranges to call destructors at function exit. (A sketch of the transformed loop form follows below.)
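To make the transformation concrete, here is the slide's recurs example written out as valid C, before and after tail-call elimination (the slide's recursive call passes one argument; here the second parameter is carried along so the code compiles). The loop form is, in effect, what the compiler emits when it replaces the tail call with a jump back to the function entry.

    /* Before: each tail call pushes a new stack frame. */
    void recurs(int x, int y) {
        if (!x) return;
        recurs(x - y, y);
    }

    /* After: the tail call becomes a jump to the top of the function,
     * so the same stack frame is reused on every iteration. */
    void recurs_loop(int x, int y) {
        for (;;) {
            if (!x) return;
            x = x - y;   /* rebind the argument and loop instead of calling */
        }
    }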

GVN

- Global value numbering

· Similar to CSE but can target cases that aren't considered by CSE (see below)

- The idea is an extension of Local Value Numbering (within a Basic Block)

Local value numbering

a = b + c
d = b
e = d + c

b = V1, c = V2, so a = V1+V2
d = V1
e = V1+V2

Therefore a & e are equivalent (a minimal sketch of this table-driven numbering follows below).

Global value numbering - has to consider the effects of control flow across BBs

x1 = a1        x2 = b1
x3 = phi( x1, x2 )

a1 = V1
b1 = V2
x1 = V3 ≡ V1
x2 = V4 ≡ V2
x3 = V5 = phi( V1, V2 ) ≡ V6

Nb. later right-hand-side evaluations ripple through previous nodes.
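Here is the table-driven local value numbering sketch referred to above, in C: each operand gets a value number, and each (op, vn, vn) triple is looked up in a table so that repeated computations receive the same number. All names and the table layout are hypothetical; a real implementation hashes the triples, canonicalises commutative operands and handles far more operators.

    #include <stdio.h>
    #include <string.h>

    /* One table entry: an expression (op, l, r) assigned value number vn. */
    typedef struct { char op; int l, r, vn; } Expr;

    static Expr table[256];
    static int nexprs = 0, nextvn = 0;

    /* Value number of each single-letter variable; -1 = not yet numbered. */
    static int varvn[26];

    static int vn_of_var(char v) {
        if (varvn[v - 'a'] < 0) varvn[v - 'a'] = nextvn++;
        return varvn[v - 'a'];
    }

    /* Value-number "dst = l op r"; returns dst's value number. */
    static int number(char dst, char l, char op, char r) {
        int lv = vn_of_var(l), rv = vn_of_var(r);
        for (int i = 0; i < nexprs; i++)        /* seen this expression? */
            if (table[i].op == op && table[i].l == lv && table[i].r == rv)
                return varvn[dst - 'a'] = table[i].vn;
        table[nexprs++] = (Expr){ op, lv, rv, nextvn };
        return varvn[dst - 'a'] = nextvn++;
    }

    int main(void) {
        memset(varvn, -1, sizeof varvn);
        int a = number('a', 'b', '+', 'c');     /* a = b + c */
        varvn['d' - 'a'] = vn_of_var('b');      /* d = b (a copy) */
        int e = number('e', 'd', '+', 'c');     /* e = d + c */
        printf("vn(a)=%d vn(e)=%d -> %s\n", a, e,
               a == e ? "equivalent" : "different");
        return 0;
    }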

PRE

- Partial redundancy elimination includes analysis for

· Loop invariant code motion - see previous

· Full redundancy elimination - see previous for CSE

· Partial redundancy elimination - see below - evaluation of x+y is predicated on some condition, creating a Partial Redundancy

- Some PRE variants are applied to SSA values, not just the expressions, effectively combining PRE and GVN

CFG - x+y is partially redundant: it is evaluated under cond-eval and again at b

cond-eval:
    a = x+y
.
b = x+y

Elimination of Partial Redundancy - insert an evaluation of T = x+y on the path where it was missing, making x+y fully redundant at the merge

cond-eval:
    T = x + y
    a = T
else:
    T = x + y
.
b = T

Elimination of Full Redundancy (ref CSE) - once x+y is available on every path, it can be computed once ahead of the condition and reused

T = x + y
cond-eval:
    a = T
.
b = T

Classic legacy optimisation

- Leaf procedure optimisation

A routine which does not call any other routines and requires no local storage can be invoked with a simple JSR/RET.

- Procedure inlining

This technique avoids the overhead of a call/ret by duplicating the code wherever it is needed. It is best used for small, frequently called routines.
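A minimal C illustration of the inlining idea above: marking a small, frequently called leaf routine so the compiler can substitute its body at each call site instead of emitting a call/ret pair. The clamp/brightness names are purely illustrative; static inline is a standard C99 hint, and optimising compilers also inline such routines automatically.

    /* A leaf routine: calls nothing and needs no stack frame of its own. */
    static inline int clamp(int x, int lo, int hi) {
        return x < lo ? lo : (x > hi ? hi : x);
    }

    int brightness(int pixel, int bias) {
        /* After inlining, no call/ret overhead remains here: the
         * comparisons are substituted directly into this function. */
        return clamp(pixel + bias, 0, 255);
    }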

Vector SIMD

- These optimisations increase performance by using deep vector-unit pipelines, and the data locality and data isolation found when manipulating arrays, to parallelise the computation. They also reduce conditional branching over potentially large datasets.

- Nowadays SIMD instructions appear most frequently in DSPs for computing FIR/IIR filters or doing FFTs.

- Most modern microprocessors also have vector support in their SIMD extensions, e.g. SSE and Altivec, which have traditionally offered cut-down functionality in their vector units, but future trends are towards fuller implementations.

- Some studies have shown that when code can be vectorised it can improve performance in some cases by around 500+%.

Vector SIMD

Before

for( i=0; i<64; i++ )
    a[i] = b[i] + 50;

Before - CISC case (scalar)

    movl   #1, r0
    moval  a, r1
    moval  b, r2
L$1:
    addl   #50, (r2)
    movl   (r2)+, (r1)+
    aobleq #64, r0, L$1

After - Classic VP (long vector)

    mtvlr  #64
    vldl   b, v1
    vvaddl v1, #50
    vstl   v1, a

After - Altivec et al., limited to 4x32b parallelism

    vspltisw v0, #50
    lw  r1, 0(a)
    lw  r2, 0(b)
    lvx v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1
    ; have 4 words added in parallel
    lw  r1, 128(a)
    lw  r2, 128(b)
    lvx v2, 0, r2
    vaddsws v1, v2, v0
    stvx v1, 0, r1
    ; have 8 words added in parallel
    ; keep going...

Nb. Also optimised away another branch
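For comparison, here is the same a[i] = b[i] + 50 loop written with x86 SSE2 intrinsics, processing four 32-bit integers per instruction. A minimal sketch assuming 64-element arrays whose length is a multiple of 4; unaligned loads/stores are used so no alignment assumption is needed.

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void add50(int *a, const int *b) {
        __m128i fifty = _mm_set1_epi32(50);         /* splat 50 into 4 lanes */
        for (int i = 0; i < 64; i += 4) {
            __m128i v = _mm_loadu_si128((const __m128i *)&b[i]);
            v = _mm_add_epi32(v, fifty);            /* 4 adds in parallel */
            _mm_storeu_si128((__m128i *)&a[i], v);  /* a[i..i+3] = b[i..i+3]+50 */
        }
    }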

Scalar/Superscalar RISC

- Load delay slot

· The result of a load cannot be used in the following instruction without stalling the pipeline before the add can complete. Instead of having the machine stall in this way, some useful code is found that can be placed between the ld of r2 and the add that uses r2. If no useful code can be found, a nop can be inserted instead.

u = v + w;
z = x + y;

before

ld  r1, v
ld  r2, w
add r3, r1, r2   ; stalls waiting for r2
sw  u, r3
ld  r4, x
etc..

after

ld  r1, v
ld  r2, w
ld  r4, x        ; useful work fills the load delay slot
add r3, r1, r2
sw  u, r3
etc..

DSP optimisation

- DSPs have some unique hardware design features which require additional compiler support

· tbd

Scalar/Superscalar RISC

- Branch delay slot

· The result of a branch cannot be resolved without stalling the pipeline. Instead of having the machine stall in this way, some useful code is found that can be placed immediately after the branch. Several strategies can be used: find a useful candidate instruction from before the branch, take one from the branch target and update the branch target address by one instruction, or take a candidate from after the branch. If a candidate cannot be found, a nop can be inserted instead.

z = x + y;
if( x == 0 )
    goto L1;

before

ld  r1, x
ld  r2, y
add r3, r1, r2
cmp r1, 0
beq L1
.
L1:
sll r3, 4

after

ld  r1, x
ld  r2, y
cmp r1, 0
beq L1
add r3, r1, r2   ; executes in the delay slot
...
L1:
sll r3, 4

Scalar/Superscalar RISC

- Branch reduction

· Loop unrolling is one way to reduce branching; other methods exist

Ex. bitfield setting and rotation

if( x == 10 )
    y++;
...

before

L1:
...
lw   r2, x
cmpi r1, r2, 10
bne  r1, L2
addi r3, r0, 1
L2:
.

after (branch eliminated)

lw     r2, x
cmpdi  r2, 10
cntlzw r2, r2
addic  r2, r2, -32
rlwinm r3, r2, 1, 31, 31
...
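The same branch-reduction idea can be seen at the source level: in C a comparison already yields 0 or 1, so the conditional increment can be written branch-free. A minimal sketch; compilers commonly generate this kind of flag materialisation (as in the PowerPC sequence above) when the branch is poorly predicted.

    /* Branchy form: the conditional increment depends on data. */
    int bump_branchy(int x, int y) {
        if (x == 10)
            y++;
        return y;
    }

    /* Branch-free form: (x == 10) evaluates to 1 or 0 in C,
     * so the increment can be done arithmetically. */
    int bump_branchless(int x, int y) {
        return y + (x == 10);
    }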

Scalar/Superscalar RISC

- Conditional Move

· Another branch reduction technique

Ex. if/else assignment

if( x == 0 )
    y = 1;
else
    z = 20;

before

ldq r1, x
cmp r1, 0
beq r1, L1
mov r3, 1
...
L1:
mov r3, 20

after

ldq r1, x
ldq r2, 1
ldq r3, 20
cmp r1, 0
cmovez r3, r2, r1
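At the source level, the pattern that if-converts most readily is a conditional assignment to a single variable (the slide's example assigns two different variables, y and z, which is harder to if-convert). A minimal sketch: both arms feed the same result, so the compiler can evaluate both candidates and select with a conditional move rather than a branch.

    /* Branchy form of the pattern, with both arms assigning one variable. */
    int select_branchy(int x) {
        int r;
        if (x == 0) r = 1;
        else        r = 20;
        return r;
    }

    /* Form a compiler typically lowers to a conditional move:
     * compute both candidates, select on the predicate. */
    int select_cmov(int x) {
        return (x == 0) ? 1 : 20;
    }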

Superscalar Scheduling

- This is usually achieved by creating another IR, or extending an existing IR, to associate machine instructions with RISC functional units; in this way a determination can be made as to current FU utilisation and how best to reorder code for superscalar multi-issue.

- These IRs are highly guarded and highly proprietary technologies.

- This is the reason, for example, that the IBM POWER compiler outperforms current GCC implementations.

- A simple but innovative example at the time was tracking register pressure in the WHIRL IR originally used by MIPS and SGI. (A toy list-scheduling sketch follows below.)
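To make functional-unit tracking concrete, here is a toy list scheduler in C. Everything in it is hypothetical and greatly simplified - two unit classes, fixed latencies, at most one dependence per instruction - whereas a real scheduler works over a full dependence DAG with detailed resource tables; but the core loop, issuing an instruction only when its operands are ready and a matching unit is free, is the same.

    #include <stdio.h>

    #define NINSNS 5

    typedef enum { ALU, MEM } Unit;

    typedef struct {
        const char *name;
        Unit unit;       /* which functional-unit class it needs */
        int latency;     /* cycles until the result is ready */
        int dep;         /* index of the instruction it depends on, or -1 */
        int done_at;     /* cycle the result becomes available */
        int issued;
    } Insn;

    int main(void) {
        /* Hypothetical instruction stream: two loads feed an add, etc. */
        Insn prog[NINSNS] = {
            { "ld  r1,v",      MEM, 3, -1, 0, 0 },
            { "ld  r2,w",      MEM, 3, -1, 0, 0 },
            { "add r3,r1,r2",  ALU, 1,  1, 0, 0 },
            { "ld  r4,x",      MEM, 3, -1, 0, 0 },
            { "add r5,r4,r3",  ALU, 1,  3, 0, 0 },
        };
        int free_at[2] = { 0, 0 };   /* cycle each unit class is next free */

        for (int cycle = 0, left = NINSNS; left > 0; cycle++) {
            for (int i = 0; i < NINSNS; i++) {
                Insn *in = &prog[i];
                if (in->issued) continue;
                /* operands ready and a matching unit free this cycle? */
                int ready = (in->dep < 0) ||
                            (prog[in->dep].issued &&
                             prog[in->dep].done_at <= cycle);
                if (ready && free_at[in->unit] <= cycle) {
                    in->issued  = 1;
                    in->done_at = cycle + in->latency;
                    free_at[in->unit] = cycle + 1;  /* one issue per unit/cycle */
                    printf("cycle %d: issue %s\n", cycle, in->name);
                    left--;
                }
            }
        }
        return 0;
    }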

GCC

- GCC is a fairly standard compiler technology. Historically it had one tree form (the Parse Tree) generated from the front end and a retargetable machine format (RTL), across which the standard optimisations were done.

- Since 2005 this has been expanded, and the tree forms now include the Parse Tree, GENERIC (language independent) and GIMPLE (supporting SSA form) trees (C and C++ omit a GENERIC tree). The standard optimisations now occur after an SSA form has been generated (scalar ops only). SSA in GCC starts by versioning all variables and finishes by merging them back down with PHI functions.

· This solved the problem that the various front-end parse trees did not use a common IR which could be used as the basis for thorough optimisation, and that the RTL IR was also unsuitable because it was at too low a level.

- Compiler passes over the IR are handled via an extendable Pass Manager which, as of 4.1.1, includes preparation for optimisation and optimisation proper. The passes are separated across interprocedural, intra-procedural and machine forms (c. 100 SSA passes, c. 100 GIMPLE passes and c. 60 RTL passes [Novillo06]). The majority of these passes centre on the intra-procedural and machine forms.

- One criticism I would make of GCC is that in some cases it flagrantly ignores manufacturer-architected conventions. This leads to a lack of interoperability with the rest of the manufacturer's system software, for example the manufacturer's cross-functional software support or the manufacturer's system threading package and libraries. Another problem for GCC is to stem the flow of RTL machine-dependent optimisations by handling these in a smarter way.

- Corporate involvement is accelerating functional releases (2008-2009: 4 releases in the last year; current 4.4.1).

GCC Gimple

- Gimple

· Influenced by the McCAT SIMPLE IR (GNU Simple)

· Need for a generic, language-independent IR

· Need for an IR that renders complex, deep parse trees into a form that is easier to analyse for optimisation

· A small grammar covers bitwise, logical, assignment, statement etc.

· Unlike the parse tree, a GIMPLE statement never references more than 3 variables, meaning at most 2 variable reads per statement

· High GIMPLE and Low GIMPLE: lowering removes binding scope information, and conditional clauses are converted to gotos

· Gimple nodes are iterated at tree level (tsi) and on a doubly linked list at basic-block level (bsi)

GCC Gimple

- 3-address format ex.

Generic form

if ( a > b + c )
    c = b / a + ( b * a );

Gimple form

T1 = b + c;
if ( a > T1 )
{
    T2 = b / a;
    T3 = b * a;
    c = T2 + T3;
}

GCC SSA

- SSA is another IR form, originally developed to help with dataflow analysis for interpreted systems

· SSA evolved from Def-Use chains (Reif & Lewis); def-use chains annotated with identity assignments became the basis for SSA

· GCC does Scalar SSA using Kildall analysis (not Wegman et al.)

· SSA for simplification of existing optimisations: for example, constant propagation was originally complex to implement, but with SSA it is greatly simplified

· SSA for classic dataflow analysis: Reaching Definition Analysis, or more intuitively Reaching Assignment Analysis, since it attempts to pair the current variable reference with the most recent update or write to that variable

· SSA for significantly faster optimisation during compilation: O(n) versus O(n²) when optimising using traditional data-flow equations

Generic form

c = 5;
if ( a > b + c )
    c = b / c + ( b * a );

Gimple form

c = 5;
T1 = b + c;
if ( a > T1 )
{
    T2 = b / c;
    T3 = b * a;
    c = T2 + T3;
}

SSA form

c1 = 5;
T11 = b1 + c1;
if ( a1 > T11 )
{
    T21 = b1 / c1;
    T31 = b1 * a1;
    c2 = T21 + T31;
}
c3 = phi ( c1, c2 );

SSA makes Reaching Definition Analysis easy to perform. Here it is being used to simplify constant propagation.
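As a sketch of why SSA simplifies constant propagation, here is a minimal version in C: because every SSA name is assigned exactly once, one lattice cell per name suffices, and the statement list can simply be re-run to a fixpoint with no kill sets. The statement encoding and the three-statement program are hypothetical; real implementations (e.g. conditional constant propagation) also track block reachability.

    #include <stdio.h>

    /* Lattice per SSA name: not yet known, a known constant, or varying. */
    typedef enum { TOP, CONST_V, BOTTOM } Tag;
    typedef struct { Tag tag; long v; } Lat;

    typedef enum { ASSIGN_CONST, ADD, PHI } Op;
    typedef struct { Op op; int dst, a, b; long imm; } Stmt;

    #define NNAMES 4
    #define NSTMTS 3

    static Lat meet(Lat x, Lat y) {            /* phi combines its inputs */
        if (x.tag == TOP) return y;
        if (y.tag == TOP) return x;
        if (x.tag == CONST_V && y.tag == CONST_V && x.v == y.v) return x;
        return (Lat){ BOTTOM, 0 };
    }

    int main(void) {
        /* names: 0=c1 1=a1 2=c2 3=c3.  Hypothetical program:
         *   c1 = 5;  c2 = c1 + c1;  c3 = phi(c1, c2);        */
        Stmt prog[NSTMTS] = {
            { ASSIGN_CONST, 0, 0, 0, 5 },
            { ADD,          2, 0, 0, 0 },
            { PHI,          3, 0, 2, 0 },
        };
        Lat val[NNAMES] = { {TOP,0}, {TOP,0}, {TOP,0}, {TOP,0} };

        /* Single assignment means re-running the list to a fixpoint works. */
        for (int changed = 1; changed; ) {
            changed = 0;
            for (int i = 0; i < NSTMTS; i++) {
                Stmt *s = &prog[i];
                Lat nv;
                if (s->op == ASSIGN_CONST) nv = (Lat){ CONST_V, s->imm };
                else if (s->op == ADD) {
                    Lat x = val[s->a], y = val[s->b];
                    if (x.tag == CONST_V && y.tag == CONST_V)
                        nv = (Lat){ CONST_V, x.v + y.v };
                    else
                        nv = (Lat){ (x.tag == BOTTOM || y.tag == BOTTOM)
                                    ? BOTTOM : TOP, 0 };
                } else nv = meet(val[s->a], val[s->b]);
                if (nv.tag != val[s->dst].tag || nv.v != val[s->dst].v) {
                    val[s->dst] = nv; changed = 1;
                }
            }
        }
        for (int n = 0; n < NNAMES; n++)
            printf("name %d: %s %ld\n", n,
                   val[n].tag == CONST_V ? "const" :
                   val[n].tag == TOP ? "top" : "varying",
                   val[n].tag == CONST_V ? val[n].v : 0);
        return 0;
    }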

Dominators & O Fn in SSA

ominators ::÷>

1. d dominates n iI every path Irom n must go through d

· Every node dominates itselI

· Nodes also evidently have the property oI an Immediate

Dominator

F

A

B C G

D E

Dom Set Block

Immed Dom

A

B

C

D

E

F

G

A

A. B

A. C. E

A. C. D

A. C

A. C. F

A. G

-----

A

A

A

C

C

C

Fig.1 Basic Blocks contain

scalar expressions

split

merge

straight-line

A

B

C

D E

F

G

c1 ÷ x; c2 ÷ a / b;

c3 ÷ 5 (c1, c2);

Fig.1 Clearly the path to G is either Irom B or F

however the path to B and F stems Irom A so every

path Irom G goes through A thereIore G is

dominated by A

Fig.1 Likewise the path to F is either Irom D or E

however the path to these stems Irom C so every

path Irom F goes through C thereIore F is dominated

by C

Using this we can build a ominator Tree (Fig.2)

and derive ominator Sets (Fig.3) and a

ominance Frontier. A DF over a given variable in

the BB is used by the compiler to introduce Phi

Iunctions this produces a maximal Phi insertion it

can be reduced by various methods eg variable

liveness

ominance Frontier of a BB variable ::÷~

DF(d) ÷ ¦n ' ppred(n), d dom p and d !sdom n}

· Set oI all CFG nodes Ior which x dom a predecessor p oI n but not the n itselI. (Intuitively earliest point

where deIinition oI a variable is not guaranteed to be unique)

· This gives maximal insertion oI phi nodes and can be optimised several ways Ior example by doing

liveness analysis.

Dom

Frontier

Fig.2 Dom Tree

Fig.3 Dom Set & Dom

Frontier

-----

G

G

F

F

G

----
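Here is a minimal sketch in C of the classic iterative dominator-set computation for the Fig.1 CFG above, using one bitmask per block: initialise every set to "all blocks", then repeatedly replace each node's set with the intersection of its predecessors' sets plus itself until nothing changes. Purely illustrative; production compilers typically use Lengauer-Tarjan or the Cooper-Harvey-Kennedy iterative algorithm.

    #include <stdio.h>

    #define N 7   /* blocks A..G = 0..6 */

    int main(void) {
        /* Predecessor lists for the Fig.1 CFG:
         * A->B, A->C, C->D, C->E, D->F, E->F, B->G, F->G */
        int npred[N]   = { 0, 1, 1, 1, 1, 2, 2 };
        int pred[N][2] = { {0,0}, {0,0}, {0,0}, {2,0}, {2,0}, {3,4}, {1,5} };

        unsigned dom[N];
        dom[0] = 1u << 0;                    /* entry dominates only itself */
        for (int i = 1; i < N; i++)
            dom[i] = (1u << N) - 1;          /* start with the full set */

        for (int changed = 1; changed; ) {
            changed = 0;
            for (int n = 1; n < N; n++) {
                unsigned d = (1u << N) - 1;
                for (int p = 0; p < npred[n]; p++)
                    d &= dom[pred[n][p]];    /* intersect over predecessors */
                d |= 1u << n;                /* a node dominates itself */
                if (d != dom[n]) { dom[n] = d; changed = 1; }
            }
        }

        for (int n = 0; n < N; n++) {        /* matches the Fig.3 table */
            printf("dom(%c) = {", 'A' + n);
            for (int d = 0; d < N; d++)
                if (dom[n] & (1u << d)) printf(" %c", 'A' + d);
            printf(" }\n");
        }
        return 0;
    }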

Multicore optimisation

- Polyhedral representation

· First proposed by Feautrier in 1991 and appears in research compilers of the time.

· Complex Loop Nest Optimisation and array analysis is difficult to do with the corresponding AST representation, especially with respect to strict observance of loop bounds across the nest, which often defeats standard LNO.

· The Loop Nest is reformulated as a set of equations and linear inequalities (properly, affine constraints); due to this higher level of abstraction, a deeper level of optimisation (transformation) can be accomplished by solving the LP system.

· Each loop iteration is an integer point in a space whose loop bounds form a polyhedron. Ex.: the first nest level is a point with 2 rays, the second modifies this to a 4-sided 2D polyhedron, the third forms a 3D polyhedron. Problem: how to efficiently implement solving for a large number of points.

· The literature reports 20-90% improvement using polyhedral LNO.

· Such an improvement makes it practical and desirable to distribute loop nests and the associated array computation across a set of multicores. AMD are doing this with a lightweight intercore IPC they call streams.

· Polyhedral LNO is available in GCC 4.5 as Graphite and in IBM's Cell compiler.

Multicore optimisation

- The polyhedral model

Typical Loop Nest

for(i=2; i<=n; i++)
    z[i] = 0;                   // S1

for(i=1; i<=n; i++)
    for(j=1; j<=n; j++)
        z[i+j] += x[i] * y[j];  // S2

Steps (ref [Bastoul06])

1. Define the domain DS from the bounds of the enclosing loops (reformulated as affine constraints), and list the access functions, ex. S1 : z[i] = 0

2. Transform (optimise) with some affine schedule, e.g. θS1(i) = (i)

3. Regenerate an AST for code generation, using projection and separation of polyhedra. The separation will be DS1 - DS2, DS2 - DS1 and DS1 ∩ DS2, giving a worst case of 3np (n = stmts; p = nest depth)

Regenerated AST for code generation, ex.:

t=2; i=2;
z[i] = 0;
for(t=3; t<=2*n; t++)
{
    for(i=max(1, t-n-1); i<=min(t-2, n); i++)
    {
        j = t-i-1;
        z[i+j] += x[i] * y[j];
    }
    i = t;
    z[i] = 0;
}
t = 2*n+1; i = n; j = n;
z[i+j] += x[i] * y[j];
