

Lots of authors

Static Single Assignment Book
Tuesday 27th November, 2012 10:25


Springer

Contents

Part I  Vanilla SSA

1 Introduction — (J. Singer)
  1.1 Definition of SSA
  1.2 Informal Semantics of SSA
  1.3 Comparison with Classical Data Flow Analysis
  1.4 SSA in Context
  1.5 About the Rest of this Book
    1.5.1 Benefits of SSA
    1.5.2 Fallacies about SSA

2 Properties and flavours — (P. Brisk)
  2.1 Preliminaries
  2.2 Def-Use and Use-Def Chains
  2.3 Minimality
  2.4 Strict SSA Form, Dominance Property
  2.5 Pruned SSA Form
  2.6 Conventional and Transformed SSA Form
  2.7 A Stronger Definition of Interference
  2.8 Further readings

3 Standard Construction and Destruction Algorithms — (J. Singer, F. Rastello)
  3.1 Construction
  3.2 Destruction
  3.3 SSA Property Transformations
  3.4 Additional reading

4 Advanced Construction Algorithms for SSA — (D. Das, U. Ramakrishna, V. Sreedhar)
  4.1 Introduction
  4.2 Basic Algorithm
  4.3 Placing φ-functions using DJ graphs
    4.3.1 Key Observation
    4.3.2 Main Algorithm
  4.4 Placing φ-functions using Merge Sets
    4.4.1 Merge Set and Merge Relation
    4.4.2 Iterative Merge Set Computation
  4.5 Computing Iterated Dominance Frontier Using Loop Nesting Forests
    4.5.1 Loop nesting forest
    4.5.2 Main Algorithm
  4.6 Summary
  4.7 Further Reading

5 SSA Reconstruction — (S. Hack)
  5.1 Introduction
    5.1.1 Live-Range Splitting
    5.1.2 Jump Threading
  5.2 General Considerations
  5.3 Reconstruction based on the Dominance Frontier
  5.4 Search-based Reconstruction
  5.5 Conclusions

6 Functional representations of SSA — (L. Beringer)
  6.1 Low-level functional program representations
    6.1.1 Variable assignment versus binding
    6.1.2 Control flow: continuations
    6.1.3 Control flow: direct style
    6.1.4 Let-normal form
  6.2 Functional construction and destruction of SSA
    6.2.1 Initial construction using liveness analysis
    6.2.2 λ-dropping
    6.2.3 Nesting, dominance, loop-closure
    6.2.4 Destruction of SSA
  6.3 Refined block sinking and loop nesting forests
  6.4 Pointers to the literature
  6.5 Concluding remarks

Part II  Analysis

7 Introduction — (A. Ertl)
  7.1 benefits of SSA
  7.2 properties: suffixed representation can use dense vector, vs DF needs (x, PP) needs hashtable
  7.3 phi node construction precomputes DF merge
  7.4 dominance properties imply that simple forward analysis works
  7.5 flow insensitive somewhat improved by SSA
  7.6 building def-use chains: for each SSA variable can get pointer to next var
  7.7 improves SSA DF-propagation

8 Propagating Information using SSA — (F. Brandner, D. Novillo)
  8.1 Overview
  8.2 Preliminaries
    8.2.1 Solving Data Flow Problems
  8.3 Data Flow Propagation under SSA Form
    8.3.1 Program Representation
    8.3.2 Sparse Data Flow Propagation
    8.3.3 Discussion
    8.3.4 Limitations
  8.4 Example – Copy Propagation
  8.5 Further Reading

9 Liveness — (B. Boissinot, F. Rastello)
  9.1 TODO
  9.2 SSA tells nothing about liveness
  9.3 Computing liveness sets
  9.4 Liveness check

10 Alias analysis — (???)
  10.1 TODO

11 Loop tree and induction variables — (S. Pop, A. Cohen)
  11.1 Part of the CFG and Loop Tree can be exposed from the SSA
    11.1.1 An SSA representation without the CFG
    11.1.2 Natural loop structures on the SSA
    11.1.3 Improving the SSA pretty printer for loops
  11.2 Analysis of Induction Variables
    11.2.1 Stride detection
    11.2.2 Translation to chains of recurrences
    11.2.3 Instantiation of symbols and region parameters
    11.2.4 Number of iterations and computation of the end of loop value
  11.3 Further reading

12 Redundancy Elimination — (F. Chow)
  12.1 Introduction
  12.2 Why PRE and SSA are related
  12.3 How SSAPRE Works
    12.3.1 The Safety Criterion
    12.3.2 The Computational Optimality Criterion
    12.3.3 The Lifetime Optimality Criterion
  12.4 Speculative PRE
  12.5 Register Promotion via PRE
    12.5.1 Register Promotion as Placement Optimization
    12.5.2 Load Placement Optimization
    12.5.3 Store Placement Optimization
  12.6 Value-based Redundancy Elimination
    12.6.1 Value Numbering
    12.6.2 Redundancy Elimination under Value Numbering
  12.7 Additional Reading

Part III  Extensions

13 Introduction — (V. Sarkar, F. Rastello)

14 Static Single Information Form — (F. Pereira, F. Rastello)
  14.1 Introduction
  14.2 Static Single Information
    14.2.1 Sparse Analysis
    14.2.2 Partitioned Variable Lattice (PVL) Data-Flow Problems
    14.2.3 The Static Single Information Property
    14.2.4 Special instructions used to split live ranges
    14.2.5 Propagating Information Forwardly and Backwardly
    14.2.6 Examples of sparse data-flow analyses
  14.3 Construction and Destruction of the Intermediate Program Representation
    14.3.1 Splitting strategy
    14.3.2 Splitting live ranges
    14.3.3 Variable Renaming
    14.3.4 Dead and Undefined Code Elimination
    14.3.5 Implementation Details
  14.4 Further Reading

15 Hashed SSA form: HSSA — (M. Mantione, F. Chow)
  15.1 Introduction
  15.2 SSA and aliasing: µ- and χ-functions
  15.3 Introducing "zero versions" to limit the explosion of the number of variables
  15.4 SSA and indirect memory operations: virtual variables
  15.5 GVN and indirect memory operations: HSSA
  15.6 Building HSSA
  15.7 Using HSSA
  15.8 Further readings

16 Array SSA Form — (Vivek Sarkar, Kathleen Knobe, Stephen Fink)
  16.1 Introduction
  16.2 Array SSA Form
  16.3 Sparse Constant Propagation of Array Elements
    16.3.1 Array Lattice for Sparse Constant Propagation
    16.3.2 Beyond Constant Indices
  16.4 Extension to Objects: Redundant Load Elimination
    16.4.1 Analysis Framework
    16.4.2 Definitely-Same and Definitely-Different Analyses for Heap Array Indices
    16.4.3 Scalar Replacement Algorithm
  16.5 Further Reading

17 Psi-SSA Form — (F. de Ferrière)
  17.1 Overview
  17.2 Definition and Construction
  17.3 SSA algorithms
  17.4 Psi-SSA algorithms
  17.5 Psi-SSA destruction
    17.5.1 Psi-normalize
    17.5.2 Psi-web
  17.6 Additional reading

18 Graphs and gating functions — (J. Stanier)
  18.1 Introduction
  18.2 Data Flow Graphs
  18.3 The SSA Graph
    18.3.1 Finding induction variables with the SSA Graph
  18.4 Program Dependence Graph
    18.4.1 Detecting parallelism with the PDG
  18.5 Gating functions and GSA
    18.5.1 Backwards symbolic analysis with GSA
  18.6 Value State Dependence Graph
    18.6.1 Definition of the VSDG
    18.6.2 Nodes
    18.6.3 γ-Nodes
    18.6.4 θ-Nodes
    18.6.5 State Nodes
    18.6.6 Dead node elimination with the VSDG
  18.7 Further readings

Part IV  Machine code generation and optimization

19 SSA form and code generation — (B. Dupont de Dinechin)
  19.1 Issues with the SSA form on machine code
    19.1.1 Representation of instruction semantics
    19.1.2 Operand naming constraints
    19.1.3 Non-kill target operands
  19.2 SSA form in a code generator
    19.2.1 Control tree flavor
    19.2.2 SSA form variant
    19.2.3 Variable liveness and interference
    19.2.4 SSA form construction
    19.2.5 SSA form destruction
  19.3 Code generation phases and the SSA form
    19.3.1 Instruction selection
    19.3.2 If-conversion
    19.3.3 Inner loop optimizations
    19.3.4 Memory addressing and memory packing
    19.3.5 Code cleanups and instruction re-selection
    19.3.6 Pre-pass instruction scheduling
    19.3.7 Register allocation

20 Instruction Code Selection — (D. Ebner, A. Krall, B. Scholz)
  20.1 Introduction
  20.2 Instruction Code Selection for Tree Patterns on SSA-Graphs
    20.2.1 Partitioned Boolean Quadratic Problem
    20.2.2 Instruction Code Selection with PBQP
  20.3 Extensions and Generalizations
    20.3.1 Instruction Code Selection for DAG Patterns
  20.4 Summary and Further Reading

21 If-Conversion — (C. Bruel)
  21.1 Introduction
    21.1.1 Architectural requirements
  21.2 Basic Transformations
    21.2.1 SSA operations on Basic Blocks
    21.2.2 Handling of predicated execution model
  21.3 Global Analysis and Transformations
    21.3.1 SSA Incremental if-conversion algorithm
    21.3.2 Tail Duplication
    21.3.3 Profitability
  21.4 Conclusion
  21.5 Additional reading

22 SSA destruction for machine code — (F. Rastello)
  22.1 Introduction
  22.2 Correctness
  22.3 Code Quality
  22.4 Speed and Memory Footprint
  22.5 Further Readings

23 Register Allocation — (F. Bouchez, S. Hack)
  23.1 Introduction
    23.1.1 Non-SSA Register Allocators
    23.1.2 SSA form to the rescue of register allocation
    23.1.3 Maxlive and colorability of graphs
  23.2 Spilling
    23.2.1 Spilling under the SSA Form
    23.2.2 Finding Spilling Candidates
    23.2.3 Spilling under SSA
  23.3 Coloring and coalescing
    23.3.1 Greedy coloring scheme
    23.3.2 SSA graphs are colorable with a greedy scheme
    23.3.3 A light tree-scan coloring algorithm
    23.3.4 Coalescing under SSA form
  23.4 Practical and advanced discussions
    23.4.1 Handling registers constraints
    23.4.2 Out-of-SSA and edge splitting
    23.4.3 Parallel copy motion

24 Hardware Compilation using SSA — (Pedro C. Diniz, Philip Brisk)
  24.1 Brief History and Overview
  24.2 Why use SSA for Hardware Compilation?
  24.3 Mapping a Control Flow Graph to Hardware
    24.3.1 Basic Block Mapping
    24.3.2 Basic Control-Flow-Graph Mapping
    24.3.3 Control-Flow-Graph Mapping using SSA
    24.3.4 φ-Function and Multiplexer Optimizations
    24.3.5 Implications of using SSA-form in Floor-planing
  24.4 Existing SSA-based Hardware Compilation Efforts

25 Building SSA form in a compiler for PHP — (Paul Biggar and David Gregg, Lero@TCD, Trinity College Dublin)
  25.1 Introduction
  25.2 SSA form in "traditional" languages
  25.3 PHP and aliasing
  25.4 Our whole-program analysis
    25.4.1 The analysis algorithm
    25.4.2 Analysis results
    25.4.3 Building def-use sets
    25.4.4 HSSA
  25.5 Other challenging PHP features
    25.5.1 Run-time symbol tables
    25.5.2 Execution of arbitrary code
    25.5.3 Object handlers
  25.6 Discussion, implications and conclusion

References

Foreword
Author: Zadeck


Preface

TODO: Roadmap — for students, compiler engineers, etc.
TODO: Pre-requisites for reading this book (Tiger book, dragon book, Muchnick, etc.)?


Part I
Vanilla SSA


TODO: (tutorial style, not too heavy)


CHAPTER 1
Introduction
J. Singer

In computer programming, as in real life, names are useful handles for concrete entities. The key message of this book is that having unique names for distinct entities reduces uncertainty and imprecision.

For example, consider overhearing a conversation about "Homer." Without any more contextual clues, you cannot disambiguate between Homer Simpson and Homer the classical Greek poet, or indeed any other people called Homer that you may know. As soon as the conversation mentions Springfield (rather than Smyrna), you are fairly sure that the Simpsons television series (rather than Greek poetry) is the subject. On the other hand, if everyone had a unique name, then there would be no possibility of confusing 20th century American cartoon characters with ancient Greek literary figures.

This book is about the static single assignment form (SSA), which is a naming convention for storage locations (variables) in low-level representations of computer programs. The term static indicates that SSA relates to properties and analysis of program text (code). The term single refers to the uniqueness property of variable names that SSA imposes. As illustrated above, this enables a greater degree of precision. The term assignment means variable definitions. For instance, in the code x = y + 1, the variable x is being assigned the value of expression (y + 1). This is a definition, or assignment statement, for x. A compiler engineer would interpret the above assignment statement to mean that the lvalue of x (i.e., the memory location labelled as x) should be modified to store the value (y + 1).

1.1 Definition of SSA

The simplest, least constrained, definition of SSA can be given using the following informal prose:

    "A program is defined to be in SSA form if each variable is a target of
     exactly one assignment statement in the program text."

However, there are various, more specialized, varieties of SSA, which impose further constraints on programs. Such constraints may relate to graph-theoretic properties of variable definitions and uses, or the encapsulation of specific control-flow or data-flow information. Each distinct SSA variety has specific characteristics. Basic varieties of SSA are discussed in Chapter 2. Part III of this book presents more complex extensions.

One important property that holds for all varieties of SSA, including the simplest definition above, is referential transparency: i.e., since there is only a single definition for each variable in the program text, a variable's value is independent of its position in the program. We may refine our knowledge about a particular variable based on branching conditions, e.g., we know the value of x in the conditionally executed block following an if statement beginning if (x == 0); however, the underlying value of x does not change at this if statement. Programs written in pure functional languages are referentially transparent. Referentially transparent programs are more amenable to formal methods and mathematical reasoning, since the meaning of an expression depends only on the meaning of its subexpressions and not on the order of evaluation or side-effects of other expressions. For a referentially opaque program, consider the following code fragment:

    x = 1;
    y = x + 1;
    x = 2;
    z = x + 1;

A naive (and incorrect) analysis may assume that the values of y and z are equal, since they have identical definitions of (x + 1). However, the value of variable x depends on whether the current code position is before or after the second definition of x, i.e., variable values depend on their context. When a compiler transforms this program fragment to SSA code, it becomes referentially transparent. The translation process involves renaming to eliminate multiple assignment statements for the same variable:

    x1 = 1;
    y = x1 + 1;
    x2 = 2;
    z = x2 + 1;

Now it is apparent that y and z are equal if and only if x1 and x2 are equal.

1.2 Informal Semantics of SSA

In the previous section, we saw how straight-line sequences of code can be transformed to SSA by simple renaming of variable definitions. The target of the definition is the variable being defined, on the left-hand side of the assignment statement. In SSA, each definition target must be a unique variable name. Conversely, variable names can be used multiple times on the right-hand side of any assignment statements, as source variables for definitions. Throughout this book, renaming is generally performed by adding integer subscripts to original variable names. In general this is an unimportant implementation feature, although it can prove useful for compiler debugging purposes.

The φ-function is the most important SSA concept to grasp. It is a special statement, known as a pseudo-assignment function. Some call it a "notational fiction."¹ The purpose of a φ-function is to merge values from different incoming paths, at control-flow merge points.

Consider the following code example and its corresponding control flow graph (CFG) representation:

    x = input();
    if (x == 42)
    then
      y = 1;
    else
      y = x + 2;
    end
    print(y);

(In the CFG, the then branch is basic block A, containing y ← 1, and the else branch is basic block B, containing y ← x + 2; both flow into the block containing print(y).)

There is a distinct definition of y in each branch of the if statement. So multiple definitions of y reach the print statement at the control-flow merge point. When a compiler transforms this program to SSA, the multiple definitions of y are renamed as y1 and y2. However, the print statement could use either variable, dependent on the outcome of the if conditional test. A φ-function introduces a new variable y3, which takes the value of either y1 or y2. Thus the SSA version of the program is:

    x = input();
    if (x == 42)
    then
      y1 = 1;
    else
      y2 = x + 2;
    end
    y3 = φ(A: y1, B: y2);
    print(y3);

¹ Kenneth Zadeck reports that φ-functions were originally known as phoney-functions, during the development of SSA at IBM Research. Although this was an in-house joke, it did serve as the basis for the eventual name.

In terms of their position, φ-functions are generally placed at control-flow merge points, i.e., at the heads of basic blocks that have multiple predecessors in control-flow graphs. A φ-function at block b has n parameters if there are n incoming control-flow paths to b. The behavior of the φ-function is to select dynamically the value of the parameter associated with the actually executed control-flow path into b. This parameter value is assigned to the fresh variable name on the left-hand side of the φ-function. Such pseudo-functions are required to maintain the SSA property of unique variable definitions, in the presence of branching control flow. Hence, in the above example, y3 is set to y1 if control flows from basic block A, and set to y2 if it flows from basic block B. Notice that the representation here adopts a more expressive syntax for φ-functions than the standard one, as it associates predecessor basic block labels Bi with corresponding SSA variable names ai, i.e., a0 = φ(B1: a1, ..., Bn: an). Throughout this book, basic block labels will be omitted from φ-function operands when the omission does not cause ambiguity.

It is important to note that, if there are multiple φ-functions at the head of a basic block, then these are executed in parallel, i.e., simultaneously, not sequentially. This distinction becomes important if the target of a φ-function is the same as the source of another φ-function, perhaps after optimizations such as copy propagation (see Chapter 8). When φ-functions are eliminated in the SSA destruction phase, they are sequentialized using conventional copy operations, as described in Chapters 3 and 22. This subtlety is particularly important in the context of register allocated code (see Chapter 23).

Strictly speaking, φ-functions are not directly executable in software, since the dynamic control-flow path leading to the φ-function is not explicitly encoded as an input to the φ-function. This is tolerable, since φ-functions are generally only used during static analysis of the program; they are removed before any program interpretation or execution takes place. However, there are various executable extensions of φ-functions, such as φif or γ functions (see Chapter 18), which take an extra predicate parameter to replace, by a condition dependence, the implicit control dependence that dictates the argument the φ-function should select. Such extensions are useful for program interpretation (see Chapter 18), if-conversion (see Chapter 21), or hardware synthesis (see Chapter 24).

We present one further example in this section, to illustrate how a loop control-flow structure appears in SSA. Here is the non-SSA version of the program, followed by the SSA version of its control flow graph:

    x = 0;
    y = 0;
    while (x < 10) {
      y = y + x;
      x = x + 1;
    }
    print(y);

    x1 ← 0
    y1 ← 0
    loop header:
      x2 ← φ(x1, x3)
      y2 ← φ(y1, y3)
      (x2 < 10)?
    loop body:
      y3 ← y2 + x2
      x3 ← x2 + 1
    exit:
      print(y2)

The SSA code has two φ-functions in the compound loop header statement that merge incoming definitions: from before the loop for the first iteration, and from the loop body for subsequent iterations.

Full details of the SSA construction algorithm are given in Chapter 3. For now, it is sufficient to see that:

1. A φ-function has been inserted at the appropriate control-flow merge point where multiple reaching definitions of the same variable converged in the original program.
2. Integer subscripts have been used to rename variables x and y from the original program.

It is important to outline that SSA should not be confused with (dynamic) single assignment (DSA, or simply SA) form used in automatic parallelization. Static single assignment does not prevent multiple assignments to a variable during program execution. For instance, in the SSA code fragment above, variables y3 and x3 in the loop body are redefined dynamically with fresh values at each loop iteration.
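Although φ-functions are not executable as ordinary instructions, their "select by incoming edge" behavior is easy to simulate: it suffices to remember which basic block transferred control to the current one. The following sketch is illustrative only (the block encoding and helper names are our own, not something the book prescribes); it runs the if/else example of this section and also makes the parallel-execution rule explicit in code.

    # Illustrative sketch: executing phi-functions by tracking the predecessor block.
    def run_block(env, label, pred, blocks):
        block = blocks[label]
        # phi-functions at a block head execute in parallel: read every incoming
        # value before writing any target (this matters when one phi's target is
        # another phi's source)
        phi_values = [(target, env[args[pred]])
                      for target, args in block.get("phis", [])]
        for target, value in phi_values:
            env[target] = value
        for target, compute in block.get("ops", []):
            env[target] = compute(env)          # ordinary SSA assignments
        return block["next"](env)               # label of the successor block

    blocks = {
        "entry": {"ops": [("x", lambda e: 42)],            # input() fixed to 42 here
                  "next": lambda e: "A" if e["x"] == 42 else "B"},
        "A":     {"ops": [("y1", lambda e: 1)],            "next": lambda e: "join"},
        "B":     {"ops": [("y2", lambda e: e["x"] + 2)],   "next": lambda e: "join"},
        "join":  {"phis": [("y3", {"A": "y1", "B": "y2"})],
                  "next": lambda e: None},
    }

    env, label, pred = {}, "entry", None
    while label is not None:
        pred, label = label, run_block(env, label, pred, blocks)
    print(env["y3"])    # prints 1: control came from block A, so y3 takes y1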

1.3 Comparison with Classical Data Flow Analysis

As we will discover further in Chapters 8 and 14, one of the major advantages of SSA form concerns data flow analysis. Data flow analysis collects information about programs at compile time in order to make optimizing code transformations. During actual program execution, information flows between variables. Static analysis captures this behaviour by propagating abstract information, or data flow facts, using an operational representation of the program such as the control flow graph (CFG). This is the approach used in classical data flow analysis.

Often, data flow information can be propagated more efficiently using a functional, or sparse, representation of the program such as SSA. When a program is translated into SSA form, variables are renamed at definition points. For certain data flow problems (e.g., constant propagation) this is exactly the set of program points where data flow facts may change. Thus it is possible to associate data flow facts directly with variable names, rather than maintaining a vector of data flow facts indexed over all variables at each program point.

Figure 1.1 illustrates this point through an example of non-zero value analysis. For each variable in a program, the aim is to determine statically whether that variable can contain a zero integer value (i.e., null) at runtime. Here 0 represents the fact that the variable is null, 0̸ the fact that it is non-null, and ⊤ the fact that it is maybe-null. With classical dense data flow analysis on the CFG in Figure 1.1(a), we would compute information about variables x and y for each of the entry and exit points of the six basic blocks in the CFG, using suitable data flow equations. Using sparse SSA-based data flow analysis on Figure 1.1(b), we compute information about each variable based on a simple analysis of its definition statement. This gives us six data flow facts, one for each SSA version of variables x and y.

    [Figure 1.1: Example control flow graph for non-zero value analysis, only showing
     relevant definition statements for variables x and y: (a) dense, (b) SSA-based.]

For other data flow problems, properties may change at points that are not variable definitions. These problems can be accommodated in a sparse analysis framework by inserting additional pseudo-definition functions at appropriate points to induce additional variable renaming; see Chapter 14 for one such example. However, this example illustrates some key advantages of the SSA-based analysis.

1. Data flow information propagates directly from definition statements to uses, via the def-use links implicit in the SSA naming scheme. In contrast, the classical data flow framework propagates information throughout the program, including points where the information does not change, or is not relevant.
2. The results of the SSA data flow analysis are more succinct. In the example, there are fewer data flow facts associated with the sparse (SSA) analysis than with the dense (classical) analysis.

Part II of this textbook gives a comprehensive treatment of SSA-based data flow analysis.

1.4 SSA in Context

Historical Context. Throughout the 1980s, as optimizing compiler technology became more mature, various intermediate representations (IRs) were proposed to encapsulate data dependence in a way that enabled fast and accurate data flow analysis. The motivation behind the design of such IRs was the exposure of direct links between variable definitions and uses, known as def-use chains, enabling efficient propagation of data-flow information. Example IRs include the program dependence graph [113] and the program dependence web [202]. Chapter 18 gives further details on dependence graph style IRs. Static single assignment form was one such IR, which was developed at IBM Research and announced publicly in several research papers in the late 1980s [229, 10, 83]. SSA rapidly acquired popularity due to its intuitive nature and straightforward construction algorithm. The SSA property gives a standardized shape for variable def-use chains, which simplifies data flow analysis techniques.

Current Usage. The majority of current commercial and open-source compilers, including GCC, Open64, Sun's HotSpot JVM, IBM's Java Jikes RVM, the Chromium V8 JavaScript engine, Mono, and LLVM, use SSA as a key intermediate representation for program analysis. As optimizations in SSA are fast and powerful, SSA is increasingly used in just-in-time (JIT) compilers that operate on a high-level target-independent program representation such as Java byte-code, CLI byte-code (.NET MSIL), or LLVM bitcode. Initially developed to facilitate the development of high-level program transformations, SSA form has gained much interest due to its favorable properties that often enable the simplification of algorithms and the reduction of computational complexity. Today, SSA form is even adopted for the final code generation phase (see Part IV), i.e., the back-end. Several industrial and academic compilers, static or just-in-time, use SSA in their back-ends, e.g., LLVM, Java HotSpot, LAO, libFirm, and Mono. Many industrial compilers that use SSA form perform SSA elimination before register allocation, including GCC, HotSpot, Jikes, and LLVM. Recent research on register allocation (see Chapter 23) even allows the retention of SSA form until the very end of the code generation process.

SSA for High-Level Languages. So far, we have presented SSA as a useful feature for compiler-based analysis of low-level programs. It is interesting to note that some high-level languages enforce the SSA property. The SISAL language is defined in such a way that programs automatically have referential transparency, since multiple assignments are not permitted to variables. Other languages allow the SSA property to be applied on a per-variable basis, using special annotations like final in Java, or const and readonly in C#. The main motivation for allowing the programmer to enforce SSA in an explicit manner in high-level programs is that immutability simplifies concurrent programming. Read-only data can be shared freely between multiple threads, without any data dependence problems. This is becoming an increasingly important issue, with the trend of multi- and many-core processors. High-level functional languages claim referential transparency as one of the cornerstones of their programming paradigm. Thus functional programming supports the SSA property implicitly. Chapter 6 explains the dualities between SSA and functional programming.

1.5 About the Rest of this Book

In this chapter, we have introduced the notion of SSA. The rest of this book presents various aspects of SSA, from the pragmatic perspective of compiler engineers and code analysts. This section gives pointers to later parts of the book that deal with specific topics. The ultimate goals of this book are:

1. To demonstrate clearly the benefits of SSA-based analysis.
2. To dispel the fallacies that prevent people from using SSA.

1.5.1 Benefits of SSA

SSA imposes a strict discipline on variable naming in programs, so that each variable has a unique definition. Fresh variable names are introduced at assignment statements and at control-flow merge points. This serves to simplify the structure of variable def-use relationships (see Section 2.2) and live ranges (see Section 2.4), which underpin data flow analysis. There are three major advantages that SSA enables:

Compile time benefit. Certain compiler optimizations can be more efficient when operating on SSA programs, since referential transparency means that data flow information can be associated directly with variables, rather than with variables at each program point. We have illustrated this simply with the non-zero value analysis in Section 1.3.

Compiler development benefit. Program analyses and transformations can be easier to express in SSA. This means that compiler engineers can be more productive, in writing new compiler passes and in debugging existing passes. For example, the dead code elimination pass in GCC 4.x, which relies on an underlying SSA-based intermediate representation, takes only 40% as many lines of code as the equivalent pass in GCC 3.x, which does not use SSA. The SSA version of the pass is simpler, since it relies on the general-purpose, factored-out, data-flow propagation engine.

Program runtime benefit. Conceptually, any analysis and optimization that can be done under SSA form can also be done identically out of SSA form. Because of the compiler development benefit mentioned above, several compiler optimizations are shown to be more effective when operating on programs in SSA form. These include the class of control-flow insensitive analyses, e.g., [133].

1.5.2 Fallacies about SSA

Some people believe that SSA is too cumbersome to be an effective program representation. This book aims to convince the reader that such a concern is unnecessary, given the application of suitable techniques. The table below presents some common myths about SSA, and references in this first part of the book containing material to dispel these myths.

    Myth: SSA greatly increases the number of variables.
    Reference: Chapter 2 reviews the main varieties of SSA, some of which introduce far fewer variables than the original SSA formulation.

    Myth: The SSA property is difficult to maintain.
    Reference: Chapters 3 and 5 discuss simple techniques for the repair of SSA invariants that have been broken by optimization rewrites.

    Myth: SSA destruction generates many copy operations.
    Reference: Chapters 3 and 22 present efficient and effective SSA destruction algorithms.


CHAPTER 2
Properties and flavours
P. Brisk

2.1 Preliminaries

Recall from the previous chapter that a procedure is in SSA form if every variable is defined once, and every use of a variable refers to exactly one definition. Many variations, or flavors, of SSA form that satisfy these criteria can be defined, each offering its own considerations. For example, different flavors vary in terms of the number of φ-functions, which affects the size of the intermediate representation; some are more difficult to construct, maintain, and destruct compared to others. This chapter explores these SSA flavors and provides insight into the contexts that favor some over others.

2.2 Def-Use and Use-Def Chains

Under SSA form, each variable is defined once. Def-use chains are data structures that provide, for the single definition of a variable, the set of all its uses. In turn, a use-def chain, which under SSA consists of a single name, uniquely specifies the definition that reaches the use. As we will illustrate further in the book (see Chapter 8), def-use chains are useful for forward data-flow analysis as they provide direct connections that shorten the propagation distance between nodes that generate and use data-flow information.
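As a concrete illustration (the data structures below are an assumed encoding, not something prescribed by this book), use-def and def-use chains under SSA can be kept as one pointer and one list per variable, maintained as operations are built:

    # Illustrative sketch of def-use / use-def chain bookkeeping under SSA.
    class SSAVariable:
        def __init__(self, name):
            self.name = name
            self.definition = None   # use-def chain: the unique defining operation
            self.uses = []           # def-use chain: all operations reading it

    class Operation:
        def __init__(self, target, opcode, sources):
            self.target, self.opcode, self.sources = target, opcode, sources
            target.definition = self          # single static definition
            for src in sources:
                src.uses.append(self)         # maintained for free at creation

    x1, x2, x3 = SSAVariable("x1"), SSAVariable("x2"), SSAVariable("x3")
    y = SSAVariable("y")
    phi = Operation(x3, "phi", [x1, x2])
    add = Operation(y, "add", [x3])
    assert x3.definition is phi and x3.uses == [add]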

Because of its single definition per variable property, SSA form simplifies def-use and use-def chains in several ways. First, SSA form simplifies def-use chains as it combines the information as early as possible: this is illustrated by Figure 2.1, where the def-use chains in the non-SSA program require as many merges as x has uses, while the corresponding SSA form allows an early and more efficient combination. Second, under SSA form, use-def chains can be explicitly represented and maintained almost for free, as it is easy to associate to each variable its single defining operation. As this constitutes the skeleton of the so-called SSA graph (see Chapter 18), when considering a program under SSA form, use-def chains are implicitly considered as a given. The explicit representation of use-def chains simplifies backward propagation, which favors algorithms such as dead-code elimination.

For forward propagation, def-use chains being just the reverse of use-def chains, computing them is also easy, and maintaining them can be done without much effort either. However, even without def-use chains, some lightweight forward propagation algorithms such as copy-folding are possible: using a single pass that processes operations along a topological order traversal of a forward CFG¹, most definitions are processed prior to their uses. When processing an operation, the use-def chain allows the already computed value of an argument to be carried along. A conservative merge is performed when, at some loop headers, a φ-function encounters an unprocessed argument. Such a lightweight propagation engine turns out to be already quite efficient.

    [Figure 2.1: Def-use chains (dashed) for a non-SSA program and its corresponding SSA form.
      (a) non-SSA:  x ← 1;  x ← 2;  y ← x + 1;  z ← x + 2
      (b) SSA form: x1 ← 1; x2 ← 2; x3 ← φ(x1, x2); y ← x3 + 1; z ← x3 + 2]

¹ A forward control flow graph is an acyclic reduction of the CFG obtained by removing back-edges.
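The lightweight propagation engine described above fits in a few lines. The following sketch is illustrative only: the operation encoding is an assumption and real engines rewrite operands in place, but it shows the single topological pass and the conservative treatment of φ-functions whose arguments are not yet processed.

    # Illustrative sketch of lightweight copy-folding over SSA: one pass over
    # operations listed in a topological order of the forward CFG.
    def copy_fold(ops):
        alias, processed = {}, set()
        rep = lambda v: alias.get(v, v)          # shortcut along use-def chains
        for target, opcode, args in ops:
            folded = [rep(a) for a in args]
            if opcode == "copy":
                alias[target] = folded[0]        # target is another name for its arg
            elif opcode == "phi":
                if all(a in processed for a in args) and len(set(folded)) == 1:
                    alias[target] = folded[0]    # all incoming values are the same
                # else: an argument comes along a back-edge and is not yet
                # processed, or the values differ: conservative merge, keep the phi
            processed.add(target)
        return alias

    ops = [("x1", "const", []),
           ("x2", "copy",  ["x1"]),
           ("x3", "copy",  ["x2"]),
           ("x4", "phi",   ["x1", "x3"]),   # merge point: both values fold to x1
           ("y1", "add",   ["x4", "x2"])]
    print(copy_fold(ops))                   # {'x2': 'x1', 'x3': 'x1', 'x4': 'x1'}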

2.3 Minimality

SSA construction is a two-phase process: placement of φ-functions, followed by renaming. The goal of the first phase is to generate code that fulfills the single reaching-definition property defined below. Minimality is an additional property of the code with φ-functions inserted, but prior to renaming; Chapter 3 describes the classical SSA construction algorithm in detail, while this section focuses primarily on describing the minimality property.

A definition D of variable v reaches a point p in the CFG if there exists a path from D to p that does not pass through another definition of v. We say that a code has the single reaching-definition property iff no program point can be reached by two definitions of the same variable. Under the assumption that the single reaching-definition property is fulfilled, the minimality property states the minimality of the number of inserted φ-functions. This property can be characterized using the notion of join sets, which we introduce next.

Let n1 and n2 be distinct basic blocks in a CFG. A basic block n3, which may or may not be distinct from n1 or n2, is a join node of n1 and n2 if there exist at least two non-empty paths, i.e., paths containing at least one CFG edge, from n1 to n3 and from n2 to n3, such that n3 is the only basic block that occurs on both of the paths. In other words, the two paths converge at n3 and at no other CFG node. Given a set S of basic blocks, n3 is a join node of S if it is the join node of at least two basic blocks in S. The set of join nodes of S is denoted J(S).

Intuitively, a join set corresponds to the placement of φ-functions. In other words, if n1 and n2 are basic blocks that both contain a definition of variable v, then we ought to instantiate φ-functions for v at every basic block in J({n1, n2}). Generalizing this statement, if Dv is the set of basic blocks containing definitions of v, then φ-functions should be instantiated in every basic block in J(Dv). As inserted φ-functions are themselves definition points, some new φ-functions should be inserted at J(Dv ∪ J(Dv)). Actually, it turns out that J(S ∪ J(S)) = J(S), so the join set of the set of definition points of a variable in the original program characterizes exactly the minimum set of program points where φ-functions should be inserted. Classical techniques place the φ-functions of a variable v at J(Dv ∪ {r}), with r the entry node of the CFG; there are good reasons for that, as we will explain further.

We are not aware of any optimizations that require a strict enforcement of the minimality property. Moreover, some copy-propagation engines can easily turn a non-minimal SSA code into a minimal one (see Chapter 3). Finally, for reducible flow graphs, placing φ-functions only at the join sets can be done easily using a simple topological traversal of the CFG, as described in Chapter 4.
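In practice, join sets are computed through iterated dominance frontiers, as Chapters 3 and 4 explain. As a rough sketch (assuming the dominance frontier df[n] of each node has already been computed; the helper names are ours, not the book's), the worklist below returns the blocks where φ-functions for a variable must be placed:

    # Rough sketch of phi-placement for one variable v: iterated dominance
    # frontier of the definition blocks, which corresponds to the join set
    # used for minimal SSA.
    def place_phis(defs, df, entry):
        worklist = list(defs) + [entry]
        has_phi, enqueued = set(), set(worklist)
        while worklist:
            n = worklist.pop()
            for m in df[n]:
                if m not in has_phi:
                    has_phi.add(m)               # insert a phi for v at block m
                    if m not in enqueued:        # a phi is itself a definition,
                        enqueued.add(m)          # so its frontier must be added too
                        worklist.append(m)
        return has_phi

    # Tiny diamond CFG: entry -> {A, B} -> join; v is defined in both A and B.
    df = {"entry": set(), "A": {"join"}, "B": {"join"}, "join": set()}
    print(place_phis({"A", "B"}, df, "entry"))   # {'join'}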

2.4 Strict SSA Form, Dominance Property

A procedure is defined to be strict if every variable is defined before it is used along every path from the entry to the exit point; otherwise, it is non-strict. Some languages, such as Java, impose strictness as part of the language definition; others, such as C/C++, impose no such restrictions. The code in Figure 2.2(a) is non-strict, as there exists a path from the entry to the use of a that does not go through the definition. If this path is taken through the CFG during the execution, then a will be used without ever being assigned a value. Although this may be permissible in some cases, it is usually indicative of a programmer error or poor software design. Adding an (undefined) pseudo-definition of each variable to the procedure's entry point ensures strictness; the use of an implicit pseudo-definition in the CFG entry node to enforce strictness does not change the semantics of the program by any means. If the original procedure is non-strict, conversion to minimal SSA will create a strict SSA-based representation. Here, strictness refers solely to the SSA representation: conversion to and from strict SSA form cannot address errors due to uninitialized variables.

In a CFG, basic block n1 dominates basic block n2 if every path in the CFG from the entry point to n2 includes n1. By convention, every basic block in a CFG dominates itself. Basic block n1 strictly dominates n2 if n1 dominates n2 and n1 ≠ n2. We use the symbols n1 dom n2 and n1 sdom n2 to denote dominance and strict dominance respectively. The immediate dominator, or idom, of a node n is the unique node that strictly dominates n but does not strictly dominate any other node that strictly dominates n. All nodes but the entry node have immediate dominators. Because the immediate dominator is unique, this relation forms a dominator tree with the entry node as root, in which each node's children are the nodes it immediately dominates.

Under SSA, because there is only a single (static) definition per variable, strictness is equivalent to the dominance property: each use of a variable is dominated by its definition. The single reaching-definition property discussed previously mandates that each program point be reachable by exactly one definition (or pseudo-definition) of each variable. If a program point U is a use of variable v, then the reaching definition D of v dominates U: if a path from the CFG entry node to U that does not include D existed, then the program would not be in SSA form, and a φ-function would need to be inserted somewhere on it, as in our example of Figure 2.2(b), where ⊥ represents the undefined pseudo-definition. The so-called minimal SSA form is the variant of SSA form that satisfies both the minimality and dominance properties; as shall be seen in Chapter 3, it is obtained by placing the φ-functions of variable v at J(Dv ∪ {r}) using the formalism of dominance frontiers.

    [Figure 2.2: A non-strict code (a) and its corresponding strict SSA form (b).
     The presence of ⊥ indicates a use of an undefined value.
      (a) a ← ... and b ← ... are each defined on one branch of a first diamond and
          used (... ← a, ... ← b) on the corresponding branch of a second diamond.
      (b) a0 ← ...; b0 ← ...; a1 ← φ(a0, ⊥); b1 ← φ(⊥, b0); ... ← a1; ... ← b1]

SSA with the dominance property is useful for many reasons that directly originate from the structural properties of the variable live-ranges. For each variable, its live-range, i.e., the set of program points where it is live, is a sub-tree of the dominator tree. In this context, a variable is considered to be live at a program point if there exists a definition of that variable that can reach this point (reaching-definition analysis), and if there exists a definition-free path to a use (upward-exposed-use analysis). For strict programs, any program point from which a use can be reached without going through a definition is necessarily reachable from a definition. Among the consequences of this property, we can cite the ability to design a fast and efficient method to query whether a variable is live at a point q, or an iteration-free algorithm to compute liveness sets (see Chapter 9). If we suppose that two variables interfere if their live-ranges intersect (see Section 2.7 for further discussion of this hypothesis), this property also allows efficient algorithms to test whether two variables interfere (see Chapter 22). Another elegant consequence is that the intersection graph of live-ranges belongs to a special class of graphs called chordal graphs. Chordal graphs are significant because several problems that are NP-complete on general graphs have efficient linear-time solutions on chordal graphs, including graph coloring. Graph coloring plays an important role in register allocation, as the register assignment problem can be expressed as a coloring problem of the interference graph, i.e., the graph that links two variables with an interference edge if they cannot be assigned the same register (color): two variables with overlapping lifetimes require two distinct storage locations. The underlying chordal property highly simplifies the assignment problem, otherwise considered NP-complete. In particular, a traversal of the dominator tree called a tree scan can color all of the variables in the program, without requiring the explicit construction of an interference graph. The tree-scan algorithm can be used for register allocation, which is discussed in greater detail in Chapter 23.
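To give a flavour of why the dominance property makes such queries cheap, the following sketch tests whether the live-ranges of two variables intersect under strict SSA. It is illustrative only: the helpers dominates and is_live_at are assumed to exist (Chapters 9 and 22 discuss how the dominance property makes both of them efficient to answer).

    # Rough sketch of the live-range intersection test under strict SSA.
    def live_ranges_intersect(a, b, def_point, dominates, is_live_at):
        """Under strict SSA, if the live-ranges of a and b intersect, then the
        definition of one dominates the definition of the other.  It therefore
        suffices to check liveness at the dominated definition point."""
        if dominates(def_point[a], def_point[b]):
            return is_live_at(a, def_point[b])   # is a still live where b is defined?
        if dominates(def_point[b], def_point[a]):
            return is_live_at(b, def_point[a])
        return False                             # neither dominates: no overlap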

. As we have already mentioned. . . . another approach is to construct minimal SSA and then remove the dead φ -functions using dead code elimination. the argument a 1 of the copy represented by a 2 = φ (a 1 . . Brisk) rithm can be used for register allocation. . . . Making a non-strict SSA code. . . . . r ). φ -functions for variable v are placed at the entry points of basic blocks belonging to the set (S . The primary advantage of eliminating those dead φ -functions over minimal SSA form is that it has far fewer φ -functions in most cases. . . this property can be broken by copy-propagation: in our example of Figure 2. . . . including register allocation. . . ⊥) can be propagated and every occurrence of a 2 can be safely replaced by a 1 .2(b). . . 2.3(a) shows an example of minimal non pruned SSA. . . . The new constraint is that every use point for a given variable must be reached by exactly one definition. As we will see in Chapter 3. Still the “strictification” usually concerns only a few variables and a restricted region of the CFG: the incremental update described in Chapter 5 will do the work with less efforts. . . . . are only concerned with the region of a program where a given variable is live. most φ -function placement algorithms are based on the notion of dominance frontier (see chapters 3 and 4) consequently do provide the dominance property. . . . .5 Pruned SSA Form One drawback of Minimal SSA form is that it may place φ -functions for a variable at a point in the control flow graph where the variable was not actually live prior to SSA. strict is somehow as “difficult” as SSA construction (actually we need a pruned version as described below). Under minimal SSA. the now identity φ -function can then be removed obtaining the initial SSA but non strict code. . . . . One possible way to do this is to perform liveness analysis prior to SSA construction. Figure 2. . we suppress the instantiation of a φ -function at the beginning of a basic block if v is not live at the entry point of that block. . . Details can be found in Chapter 3. . . . Under pruned SSA. . . . . . and then use the liveness information to suppress the placement of φ -functions as described above. . as opposed to all program points. It is possible to construct such a form while still maintaining the minimality and dominance properties otherwise. . Many program analyses and optimizations. Pruned SSA form satisfies these properties. which is discussed in greater detail in Chapter 23. The corresponding pruned SSA form would remove the dead φ -function that defines Y3 . . . . .20 2 Properties and flavours — (P .

2.6 Conventional and Transformed SSA Form

In many non-SSA and graph-coloring-based register allocation schemes, register assignment is done at the granularity of webs. In this context, a web is the maximum union of def-use chains that have either a use or a def in common. Intuitively, two variables with overlapping lifetimes require two distinct storage locations, since a write to one would otherwise overwrite the value of the other. As an example, the code of Figure 2.4(a) leads to two separate webs for variable a. The conversion to minimal SSA form replaces each web of a variable v in the pre-SSA program with some variable names vi. In pruned SSA, these variable names partition the live-range of the web: at every point in the procedure where the web is live, exactly one variable vi is also live, and none of the vi are live at any point where the web is not.

    [Figure 2.4: (a) non-SSA code in which variable a forms two separate register webs;
     (b) the corresponding conventional SSA form, with φ-webs {a1} and {a2, a3, a4} and
         a4 ← φ(a2, a3);
     (c) the corresponding transformed SSA form after copy-propagating a1, with
         a4 ← φ(a1, a3).]

Based on this observation, we can partition the variables in a program that has been converted to SSA form into φ-equivalence classes that we will refer to as φ-webs. We say that x and y are φ-related to one another if they are referenced by the same φ-function, i.e., if x and y are either parameters of, or defined by, the φ-function. The transitive closure of this relation defines an equivalence relation that partitions the variables defined locally in the procedure into equivalence classes, the φ-webs. Intuitively, the φ-equivalence class of a resource represents a set of resources "connected" via φ-functions. For any freshly constructed SSA code, the φ-webs exactly correspond to the register webs of the original non-SSA code.

Conventional SSA form (C-SSA) is defined as SSA form for which each φ-web is interference free. Many program optimizations, such as copy propagation, may transform a procedure from conventional to a non-conventional form (T-SSA, for Transformed SSA), in which some variables belonging to the same φ-web interfere with one another. Figure 2.4(c) shows the corresponding transformed SSA form of our previous example: here variable a1 interferes with variables a2, a3, and a4.

The translation out of conventional SSA form is straightforward: each φ-web can be replaced with a single variable, all definitions and uses are renamed to use the new variable, and all φ-functions involving this equivalence class are removed. SSA destruction starting from non-conventional SSA form can be performed through a conversion to conventional SSA form as an intermediate step. This conversion is achieved by inserting copy operations that dissociate interfering variables from the connecting φ-functions. As those copy instructions will have to be inserted at some point to get rid of φ-functions anyway, T-SSA provides an inaccurate view of the resource usage for machine-level transformations such as register allocation or scheduling. Another motivation for sticking to C-SSA is that the names used in the original program might help capture properties otherwise difficult to discover; lexical partial redundancy elimination (PRE), as described in Chapter 12, illustrates this point. Apart from those specific examples, most current compilers choose not to maintain the conventional property. Bringing back the conventional property of a T-SSA code is as "difficult" as destructing SSA (see Chapter 3). Still, we should outline that, as later described in Chapter 22, checking whether a given φ-web is interference free (and, if necessary, turning it back to interference free) can be done in time linear in the size of the φ-web, instead of with the naive quadratic-time algorithm.

2.7 A Stronger Definition of Interference

Throughout this chapter, two variables have been said to interfere if their live-ranges intersect. In particular, this definition has applied to the discussion of interference graphs and the definition of conventional SSA form. Although it suffices for correctness, this is a fairly restrictive definition of interference, based on static considerations: variables whose live ranges overlap one another may not actually interfere. The ultimate notion of interference should decide, for two distinct variables, whether there exists an execution for which they simultaneously hold two different values; that notion is obviously undecidable, because of a reduction to the Halting problem. Several "static" extensions to our simple definition are still possible. We present two examples in which, under very specific conditions, a single storage location can be allocated to variables whose live ranges overlap.

Firstly, consider the double-diamond graph of Figure 2.2(a) again, which, although non-strict, is correct as soon as the two if-conditions are the same. Even if a and b are unique variables with overlapping live ranges, the paths along which a and b are respectively used and defined are mutually exclusive with one another. Since only one of the two paths will ever execute, the program will either pass through the definition of a and the use of a, or the definition of b and the use of b, albeit at different conditional statements in the program. In this case, a and b do not actually interfere with one another, since all statements involved are controlled by the same condition, and a single storage location can be allocated for both variables. This example illustrates the fact that the live-range splitting required to make the code fulfill the dominance property may lead to less accurate analysis results.

Secondly, consider two variables u and v whose live ranges overlap. If we can prove that u and v will always hold the same value at every place where both are live, then they do not actually interfere with one another. Since they always have the same value, it suffices to allocate a single storage location that can be used for u or v, because there is only one unique value between them.

The post-dominance frontier. . . . This relaxed but correct notion of interference would not make a and b of Figure 2. . . The fact that + (S ) = (S ) have been actually discovered later. . In this case. the interference graph of a procedure is no longer chordal. the two notions are strictly equivalent: two live-ranges intersect iff one contains the definition of the other. the difference between the refined notion of interference and non-value based one is significant. . such as one obtained after a naive SSA destruction pass (See Chapter 3). As far as the interference is concerned. . . . .8 Further readings The advantages of def-use and use-def chains provided for almost free under SSA are well illustrated in chapters 8 and 14. . . . . . . and Wolfe gives a simple proof of it in [278]. . . also known as the control dependence graph finds many . a and b do not actually interfere with one another. a single storage location can be allocated for both variables. . . since all statements involved are controlled by the same condition. Still. Of course. . Since only one of the two paths will ever execute. In particular. consider two variables u and v . This refined notion of interference has significant implications if applied to SSA form. Thus. A simple way to refine the interference test is to check if one of the variable is live at the definition point of the other. . whose live ranges overlap. . 2. .24 live-range 2 Properties and flavours — (P .1) can make a fairly good job. .6. which is its symmetric notion. . This example illustrates the fact that live-range splitting required here to make the code fulfill the dominance property may lead to less accurate analysis results. . . in [84]. If we can prove that u and v will always hold the same value at every place where both are live. . . . . . . . .2(b) would still interfere. . . . (n ) = (n . For this purpose they extensively develop the notion of dominance frontier of a node n . . . . a technique such as global value numbering that is straightforward to implement under SSA (See Chapter 12. In that case (See Chapter 22). Since they always have the same value. . . as any edge between two variables whose lifetimes overlap could be eliminated by this property. The notion of minimal SSA and a corresponding efficient algorithm to compute it were introduced by Cytron et al. r ). especially in the presence of a code with many variable-to-variable copies. this new criteria is in general undecidable. More details about the theory on (iterated) dominance frontier can be found in chapters 3 and 4. or the definition of b and the use of b . it suffices to allocate a single storage location that can be used for a or b . because there is only one unique value between them. for a SSA code with dominance property. . albeit at different conditional statements in the program. . . . Secondly. then they do not actually interfere with one another. Brisk) a and b are respectively used and defined are mutually exclusive with one another. . . .2(a) interfere while variables a 1 and b 1 of Figure 2. the program will either pass through the definition of a and the use of a .
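The refinements sketched above combine a liveness check at definition points with a value-equality test. The following fragment is a minimal sketch of that combined test, assuming hypothetical helpers: live_at(p) returns the variables live at program point p, def_point(v) gives the (unique, under SSA) definition point of v, and value_of(v) is a static value number such as one produced by global value numbering. None of these names come from the book; they are illustrative assumptions.

def interfere(u, v, def_point, live_at, value_of):
    # intersection test restricted to definition points: under strict SSA,
    # two live-ranges intersect iff one is live at the other's definition
    overlap = (u in live_at(def_point(v))) or (v in live_at(def_point(u)))
    if not overlap:
        return False
    # even then, the variables do not interfere if they provably carry the
    # same value wherever both are live (value-based refinement)
    return value_of(u) != value_of(v)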

The notions of conventional and transformed SSA were introduced by Sreedhar et al. in their seminal paper [242] on destructing SSA form. The notion of pruned SSA was introduced in [67]. The example of Figure 2.3, used to illustrate the difference between pruned and non-pruned SSA, has been borrowed from [84]. The description of the existing techniques to turn a general SSA into either a minimal, a pruned, a conventional, or a strict SSA is provided in Chapter 3.

Most SSA papers implicitly consider the SSA form to fulfill the dominance property. The first technique that really exploits the structural properties of this strictness is the fast SSA destruction algorithm developed by Budimlic et al. in [55] and revisited in Chapter 22.

The ultimate notion of interference was first discussed by Chaitin in his seminal paper [63] that presents the graph coloring approach to register allocation. His interference test is similar to the refined test presented in this chapter. In the context of SSA destruction, Chapter 22 addresses the issue of taking advantage of the dominance property with this refined notion of interference. Further discussions on the control dependence graph can be found in Chapter 18.

CHAPTER 3
Standard Construction and Destruction Algorithms
J. Singer, F. Rastello

This chapter describes the standard algorithms for construction and destruction of SSA form. SSA construction refers to the process of translating a non-SSA program into one that satisfies the SSA constraints. In general, this transformation occurs as one of the earliest phases in the middle-end of an optimizing compiler, when the program has been converted to three-address intermediate code. SSA destruction is sometimes called out-of-SSA translation. This step generally takes place in an optimizing compiler after all SSA optimizations have been performed, and prior to code generation. However, note that there are specialized code generation techniques that can work directly on SSA-based intermediate representations, such as instruction selection (see Chapter 20), if-conversion (see Chapter 21), and register allocation (see Chapter 23).

The algorithms presented in this chapter are based on material from the seminal research papers on SSA. These original algorithms are straightforward to implement and have acceptable efficiency; therefore such algorithms are widely implemented in current compilers. Note that more efficient, albeit more complex, alternative algorithms have been devised. These are described further in Chapters 4 and 22.

Figure 3.1 shows an example CFG program. The set of nodes is {r, A, B, C, D, E}, and the control flow is denoted by directed edges between the nodes. The set of variables is {x, y, tmp}. Note that the program shows the complete control flow structure; however, it only shows statements that define relevant variables, together with the unique return statement at the exit point of the CFG. All of the program variables are undefined on entry. On certain control flow paths, some variables may be used without being defined, e.g., x on the path r → A → C. We discuss this issue later in the chapter.

We intend to use this program as a running example throughout the chapter, to demonstrate various aspects of SSA construction. As already outlined in Chapter 2, there are different flavors of SSA with distinct properties. In this chapter, we focus on the minimal SSA form.

[Fig. 3.1: Example control flow graph, before SSA construction occurs. r: entry; B: y ← 0, x ← 0; C: tmp ← x, x ← y, y ← tmp; D: x ← f(x, y); E: ret x.]

3.1 Construction

The original construction algorithm for SSA form consists of two distinct phases.

1. φ-function insertion performs live-range splitting to ensure that any use of a given variable v is reached¹ by exactly one definition of v. The resulting live-ranges exhibit the property of having a single definition, which occurs at the beginning of each live-range.
2. Variable renaming assigns a unique variable name to each live-range. This second phase rewrites variable names in program statements such that the program text contains only one definition of each variable, and every use refers to its corresponding unique reaching definition.

¹ A program point p is said to be reachable by a definition of v if there exists a path in the CFG from that definition to p that does not contain any other definition of v.

Join Sets and Dominance Frontiers

In order to explain how φ-function insertions occur, it will be helpful to review the related concepts of join sets and dominance frontiers. Join sets were introduced in Chapter 2. For a given set of nodes S in a CFG, the join set J(S) is the set of join nodes of S, i.e., nodes in the CFG that can be reached by two (or more) distinct elements of S using disjoint paths. Let us consider some join set examples from the program in Figure 3.1: J({B, C}) = {D}, since it is possible to get from B to D and from C to D along different, non-overlapping paths; and J({r, A, B, C, D, E}) = {A, D, E}, since the nodes A, D, and E are the only nodes with multiple predecessors in the program.

The dominance frontier of a node n, DF(n), is the border of the CFG region that is dominated by n. More formally:
1. node a strictly dominates node b if a dominates b and a ≠ b;
2. the set of nodes DF(a) contains all nodes n such that a dominates a predecessor of n but a does not strictly dominate n.
Note that DF is defined over individual nodes, but for simplicity of presentation, we overload it to operate over sets of nodes too, i.e., DF(S) = ∪_{s ∈ S} DF(s). The iterated dominance frontier DF+(S) is the limit DF_{i→∞}(S) of the sequence:

DF_1(S) = DF(S)
DF_{i+1}(S) = DF(S ∪ DF_i(S))

φ-function Insertion

The concept of dominance frontiers naturally leads to a straightforward approach that places φ-functions on a per-variable basis. Construction of minimal SSA requires, for each variable v, the insertion of φ-functions at J(D_v), where D_v is the set of nodes that contain definitions of v. The original construction algorithm for SSA form instead uses the iterated dominance frontier DF+(D_v). This is an over-approximation of the join set, since DF+(S) = J(S ∪ {r}): the original algorithm assumes an implicit definition of every variable at the entry node r. This leads to the construction of SSA form that has the dominance property, where each renamed variable's definition dominates its entire live-range.

Consider again our running example from Figure 3.1. The set of nodes containing definitions of variable x is {B, C, D}, together with the implicit definition at the entry node r. The iterated dominance frontier of this set is {A, D, E}. Hence we need to insert φ-functions for x at the beginning of nodes A, D, and E.

As far as the actual algorithm for φ-function insertion is concerned, we will assume that the dominance frontier of each CFG node is pre-computed and that the iterated dominance frontier is computed just-in-time. The corresponding pseudo-code is given in Algorithm 1. The algorithm works by inserting φ-functions iteratively using a worklist of definition points. The worklist of nodes W is used to record definition points that the algorithm has not yet processed, i.e., for which it has not yet inserted φ-functions at their dominance frontiers. The flags array F is used to avoid repeated insertion of φ-functions at a single node. Because a φ-function is itself a definition, it may require further φ-functions to be inserted; this is the cause of node insertions into the worklist W during iterations of the inner loop in Algorithm 1. Effectively, this is a just-in-time calculation of the iterated dominance frontier. Dominance frontiers of distinct nodes may intersect, e.g., in the example CFG in Figure 3.1, DF(B) and DF(C) both contain D. Once a φ-function for a particular variable has been inserted at a node, there is no need to insert another, since a single φ-function per variable handles all incoming definitions of that variable at that node.

Algorithm 1: Standard algorithm for inserting φ-functions for a variable v
 1: begin
 2:   W ← {} : set of basic blocks
 3:   F ← {} : set of basic blocks
 4:   for v : variable names in original program do
 5:     for d ∈ D_v do
 6:       let B be the basic block containing d
 7:       W ← W ∪ {B}
 8:     while W ≠ {} do
 9:       remove a basic block X from W
10:       for Y : basic block ∈ DF(X) do
11:         if Y ∉ F then
12:           add v ← φ(...) at entry of Y
13:           F ← F ∪ {Y}
14:           if Y ∉ D_v then
15:             W ← W ∪ {Y}

Table 3.1 gives a walk-through example of Algorithm 1. It shows the stages of execution for a single iteration of the outermost for loop, which is inserting φ-functions for variable x. Each row represents a single iteration of the while loop that iterates over the worklist W, and shows the values of X, DF(X), F and W at the start of that iteration. At the beginning, the CFG looks like Figure 3.1 and the worklist holds the nodes containing definitions of x. At the end, when all the φ-functions for x have been placed, the CFG looks like Figure 3.2, which shows the example CFG program with φ-functions for x inserted.
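As a concrete counterpart to Algorithm 1, here is a minimal runnable sketch in Python. It assumes the CFG is represented by a dict DF of per-node dominance frontiers, a map defs from each variable to the set of blocks defining it, and a hypothetical callback insert_phi(block, v) that records "v ← φ(...)" at the entry of block; these names are illustrative, not the book's API.

def place_phi_functions(DF, defs, insert_phi):
    for v, def_blocks in defs.items():
        worklist = set(def_blocks)       # W: definition points still to process
        has_phi = set()                  # F: blocks that already received a phi for v
        while worklist:
            x = worklist.pop()
            for y in DF[x]:
                if y not in has_phi:
                    insert_phi(y, v)     # add "v <- phi(...)" at entry of y
                    has_phi.add(y)
                    # a phi-function is itself a definition of v,
                    # so its block must be processed as well
                    if y not in def_blocks:
                        worklist.add(y)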

[Table 3.1: Walk-through of the placement of φ-functions for variable x in the example CFG; each row gives the values of X, DF(X), F, and W for one iteration of the while loop.]

[Fig. 3.2: Example control flow graph, including inserted φ-functions for variable x at the entry of A, D, and E.]

Providing the dominator tree is given, the computation of the dominance frontier is quite straightforward. As illustrated by Figure 3.3, this can be understood using the DJ-graph notation. The skeleton of the DJ-graph is the dominator tree of the CFG, which makes up the D-edges (dominance edges). This is augmented with J-edges (join edges), which correspond to all edges of the CFG whose source does not strictly dominate its destination. A DF-edge (dominance frontier edge) is an edge whose destination is in the dominance frontier of its source. By definition, there is a DF-edge (a, b) between every pair of CFG nodes a, b such that a dominates a predecessor of b, but does not strictly dominate b. In other words, for each J-edge (a, b), all ancestors of a (including a) that do not strictly dominate b have b in their dominance frontier.

It is conventional to treat each variable use in a φ-function as if it actually occurs on the corresponding incoming edge, or at the end of the corresponding predecessor node. If we follow this convention, then def-use chains are aligned with the CFG dominator tree: the single definition that reaches each use dominates that use.

[Fig. 3.3: An example CFG and its corresponding (a) CFG, (b) DJ-graph (D-edges are highlighted), (c) DF-graph, and (d) DF+-graph.]

Since the iterated dominance frontier is simply the transitive closure of the dominance frontier, we can define the DF+-graph as the transitive closure of the DF-graph. For example, in Figure 3.3, (F, G) is a J-edge, so (F, G) and (E, G) are DF-edges; and since (C, E) and (E, G) are DF-edges, (C, G) is a DF+-edge. Hence a definition of x in C will lead to inserting φ-functions in E and G: DF+(C) = {E, G}. We can compute the iterated dominance frontier for each variable independently, or 'cache' it to avoid repeated computation of the iterated dominance frontier of the same node. This leads to more sophisticated algorithms, detailed in Chapter 4. The pseudo-code for computing the dominance frontier of each node is given in Algorithm 2.

Algorithm 2: Algorithm for computing the dominance frontier of each CFG node
1: begin
2:   for (a, b) ∈ cfgEdges do
3:     while a does not strictly dominate b do
4:       DF(a) ← DF(a) ∪ {b}
5:       a ← iDom(a)

Once φ-functions have been inserted using this algorithm, the program may still contain several definitions per variable; however, there is now a single definition statement in the CFG that reaches each use.
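A small runnable sketch of Algorithm 2 follows. It assumes idom maps each node to its immediate dominator, edges is the list of CFG edges, and strictly_dominates(a, b) answers dominance queries; these names are assumptions made for the example only.

def dominance_frontiers(edges, idom, strictly_dominates):
    df = {}
    for a, b in edges:
        x = a
        # walk up the dominator tree until x strictly dominates b
        while x is not None and not strictly_dominates(x, b):
            df.setdefault(x, set()).add(b)
            x = idom.get(x)
    return df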

Variable Renaming

To obtain the desired property of a static single assignment per variable, it is necessary to perform variable renaming, the second phase in the SSA construction process. The φ-function insertions have the effect of splitting the live-range(s) of each original variable into pieces. The variable renaming phase associates to each individual live-range a new variable name, also called a version. Algorithm 3 presents the pseudo-code for this process. Because of the dominance property outlined above, it is straightforward to rename variables using a depth-first traversal of the dominator tree. During the traversal, for each variable v, it is necessary to remember the version of its unique reaching definition at the current program point p; this corresponds to the closest definition that dominates p. We compute and cache this reaching definition in the per-variable slot v.reachingDef, which is updated as the algorithm traverses the dominator tree. This per-variable slot stores the in-scope 'new' variable name (version) for the equivalent variable at the same point in the un-renamed program.

Algorithm 3: Renaming algorithm for second phase of SSA construction
 1: begin
 2:   foreach v : variable do
 3:     v.reachingDef ← ⊥
 4:   foreach BB : basic block in depth-first search preorder traversal of the dominance tree do
 5:     foreach i : instruction in linear code sequence of BB do
 6:       foreach v : variable used as a source operand on the right-hand side of non-φ-function i do
 7:         updateReachingDef(v, i)
 8:         replace this use of v by v.reachingDef in i
 9:       foreach v : variable defined by i, which may be a φ-function, do
10:         updateReachingDef(v, i)
11:         create fresh variable v'
12:         replace this definition of v by v' in i
13:         v'.reachingDef ← v.reachingDef
14:         v.reachingDef ← v'
15:     foreach f : φ-function in a successor of BB do
16:       foreach v : variable used as a source operand on the right-hand side of f (corresponding to BB) do
17:         updateReachingDef(v, f)
18:         replace this use of v by v.reachingDef in f

The variable renaming algorithm translates our running example from Figure 3.1 into the SSA form of Figure 3.4a. The table in Figure 3.4b gives a walk-through example of Algorithm 3, only considering variable x. The labels l_i mark instructions in the program that mention x.

The table records (1) when x.reachingDef is x_old before a definition statement and x_new afterwards, and (2) when x.reachingDef is updated from x_old into x_new due to a call of updateReachingDef.

[Fig. 3.4: SSA form of the example of Figure 3.1. (a) The CFG after renaming, with φ-functions such as x_1 ← φ(x_5, ⊥) and y_1 ← φ(y_4, ⊥) in A, x_4 ← φ(x_2, x_3) and y_4 ← φ(y_2, y_3) in D, and x_6 ← φ(x_5, x_3) and y_5 ← φ(y_4, y_3) in E, and with the labelled instructions l_1 ... l_8. (b) The walk-through of renaming for variable x, listing for each mention of x its basic block, its label l_i, and the value of x.reachingDef.]

The original presentation of the renaming algorithm uses a per-variable stack that stores all variable versions that dominate the current program point, rather than the slot-based approach outlined above. In the stack-based algorithm, a stack value is pushed when a variable definition is processed (whereas we explicitly update the reachingDef field at this point). The top stack value is peeked when a variable use is encountered (we read from the reachingDef field at this point). Multiple stack values may be popped when moving to a different node in the dominator tree (we always check whether we need to update the reachingDef field before we read from it). While the slot-based algorithm requires more memory, it can take advantage of an existing variable's working field, and be more efficient in practice.
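The following is a compact sketch of the stack-based renaming just described, on a tiny assumed IR: each block has phis (with a dest variable and a per-predecessor argument slot) and instrs (with uses and defs), children[b] gives b's dominator-tree children, succs[b] its CFG successors, and pred_index(s, b) gives b's position among s's predecessors. All of these names are illustrative assumptions, not a fixed API.

from collections import defaultdict

def rename_variables(entry, children, succs, pred_index):
    version = defaultdict(int)
    stack = defaultdict(list)                 # variable -> stack of live versions

    def fresh(v):
        version[v] += 1
        name = (v, version[v])                # e.g. ('x', 3) stands for x3
        stack[v].append(name)
        return name

    def top(v):
        return stack[v][-1] if stack[v] else None   # None models an undefined use

    def walk(b):
        n_pushed = defaultdict(int)
        for phi in b.phis:                    # phi destinations are definitions
            phi.dest_name = fresh(phi.dest)
            n_pushed[phi.dest] += 1
        for ins in b.instrs:
            ins.use_names = [top(v) for v in ins.uses]
            ins.def_names = []
            for v in ins.defs:
                ins.def_names.append(fresh(v))
                n_pushed[v] += 1
        for s in succs[b]:                    # fill phi arguments of successors
            k = pred_index(s, b)
            for phi in s.phis:
                phi.arg_names[k] = top(phi.dest)
        for c in children[b]:                 # recurse over the dominator tree
            walk(c)
        for v, n in n_pushed.items():         # pop what this block pushed
            del stack[v][-n:]

    walk(entry)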

Procedure updateReachingDef(v, i): utility function for SSA renaming
Data: v, a variable from the program; i, an instruction from the program
1: begin
2:   r ← v.reachingDef
3:   while not (r == ⊥ or definition(r) dominates i) do
4:     r ← r.reachingDef
5:   v.reachingDef ← r
/* search through the chain of definitions for v until we find the closest definition that dominates i, then update v.reachingDef in place with this definition */

Summary

Now let us review the flavour of SSA form that this simple construction algorithm produces. We refer back to several SSA properties that were introduced in Chapter 2.

• It is minimal. After the φ-function insertion phase, but before variable renaming, the CFG contains the minimal number of inserted φ-functions needed to achieve the property that exactly one definition of each variable v reaches every point in the graph.
• It is not pruned. Some of the inserted φ-functions may be dead, e.g., y_5 in Figure 3.4a: there is no explicit use of the variable subsequent to the φ-function.
• It is conventional. The transformation that renames all φ-related variables into a unique representative name and then removes all φ-functions is a correct SSA-destruction algorithm.
• Finally, it has the dominance property, i.e., it is strict: each variable use is dominated by its unique definition. This is due to the use of iterated dominance frontiers during the φ-placement phase, rather than join sets. Whenever the iterated dominance frontier of the set of definition points of a variable differs from its join set, at least one program point can be reached both from r (the entry of the CFG) and from one of the definition points. In that case, one of the uses of the inserted φ-function does not have any actual reaching definition that dominates it, as for the φ-function inserted in block A for x in Figure 3.4a. This corresponds to the ⊥ value used to initialize each reachingDef slot in Algorithm 3. Actual implementation code can use a NULL value, create a fake undefined variable at the entry of the CFG, or create on-the-fly undefined pseudo-operations just before the particular use.

3.2 Destruction

SSA form is a sparse representation of program information, which enables simple, efficient code analysis and optimization. Once we have completed SSA-based optimization passes, and certainly before code generation, it is necessary to eliminate φ-functions, since these are not executable machine instructions. This elimination phase is known as SSA destruction.

Since freshly constructed, untransformed SSA code is conventional, its destruction is straightforward: one simply has to rename all φ-related variables (source and destination operands of a single φ-function are related) into a unique representative variable, and then remove all φ-functions. We refer to a set of φ-related variables as a φ-web. We recall from Chapter 2 that conventional SSA is defined as a flavor under which each φ-web is interference free; in particular, if each φ-web's constituent variables have non-overlapping live-ranges, then the SSA form is conventional. The discovery of φ-webs can be performed efficiently using the classical union-find algorithm with a disjoint-set data structure, which keeps track of a set of elements partitioned into a number of disjoint (non-overlapping) subsets. The φ-webs discovery algorithm is presented in Algorithm 4.

Algorithm 4: The φ-webs discovery algorithm, based on the union-find pattern
1: begin
2:   for each variable v do
3:     phiweb(v) ← {v}
4:   for each instruction of the form a_dest = φ(a_1, ..., a_n) do
5:     for each source operand a_i in the instruction do
6:       union(phiweb(a_dest), phiweb(a_i))

While freshly constructed SSA code is conventional, this may not be the case after optimizations such as copy propagation have been performed. Going back to conventional SSA form implies the insertion of copies that dissociate the φ-webs: each φ-function then has syntactically identical names for all its operands, and can be removed to coalesce the related live-ranges. As φ-functions have a parallel semantic, the same holds for the corresponding copies inserted at the end of predecessor basic blocks: a pseudo-instruction called a parallel copy is created to represent a set of copies that have to be executed in parallel.

The simplest (although not the most efficient) way to destroy non-conventional SSA form is to split all critical edges, and then replace φ-functions by copies at the end of predecessor basic blocks. A critical edge is an edge from a node with several successors to a node with several predecessors. The process of splitting an edge, say (b_1, b_2), involves replacing edge (b_1, b_2) by (i) an edge from b_1 to a freshly created basic block and (ii) another edge from this fresh basic block to b_2.
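A small runnable sketch of Algorithm 4 follows, using a union-find (disjoint-set) structure. It assumes phi_instructions is an iterable of (dest, [src_1, ..., src_n]) pairs; this representation of instructions is an assumption made for illustration.

def phi_webs(variables, phi_instructions):
    parent = {v: v for v in variables}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]    # path compression
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for dest, sources in phi_instructions:
        for src in sources:
            union(dest, src)

    webs = {}
    for v in variables:
        webs.setdefault(find(v), set()).add(v)
    return webs                              # representative -> phi-web members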

Algorithm 5 presents the corresponding pseudo-code that makes non-conventional SSA conventional; SSA destruction of such a form is then straightforward.

Algorithm 5: Critical edge splitting algorithm for making non-conventional SSA form conventional
 1: begin
 2:   foreach B : basic block of the CFG do
 3:     let (E_1, ..., E_n) be the list of incoming edges of B
 4:     foreach E_i = (B_i, B) do
 5:       let PC_i be an empty parallel copy instruction
 6:       if B_i has several outgoing edges then
 7:         create fresh empty basic block B_i'
 8:         replace edge E_i by edges B_i → B_i' and B_i' → B
 9:         insert PC_i in B_i'
10:       else
11:         append PC_i at the end of B_i
12:     foreach φ-function at the entry of B of the form a_0 = φ(B_1 : a_1, ..., B_n : a_n) do
13:       foreach a_i (argument of the φ-function corresponding to B_i) do
14:         let a_i' be a freshly created variable
15:         add copy a_i' ← a_i to PC_i
16:         replace a_i by a_i' in the φ-function

Algorithm 5 can also be slightly modified to destruct SSA directly: instead of creating a fresh variable a_i' and rewriting the φ-function, the copy a_0 ← a_i is added to PC_i, and the φ-function is removed once all of its arguments have been processed.

We stress that the above destruction technique has several drawbacks. First, because of specific architectural constraints, region boundaries, or exception handling code, the compiler might not permit the splitting of a given edge. Second, the process of copy insertion itself might take a substantial amount of time and might not be suitable for dynamic compilation. The goal of Chapter 22 is to cope both with non-splittable edges and with the difficulties of SSA destruction at machine code level.

As already mentioned, the replacement of parallel copies by sequences of simple copies is handled later on. Further, the resulting code contains many temporary-to-temporary copy operations. In theory, reducing the frequency of these copies is the role of the coalescing performed during the register allocation phase; a few memory- and time-consuming coalescing heuristics mentioned in Chapter 23 can handle the removal of these copies effectively. Coalescing can also, with less effort, be performed prior to the register allocation phase. As opposed to a (so-called conservative) coalescing during register allocation, this aggressive coalescing would not cope with interference graph colorability; as explained further in Chapter 22, it is nevertheless of interest in the context of resource-constrained compilation. This phase can be performed immediately after SSA destruction, or later on.

Once φ-functions have been replaced by parallel copies, we need to sequentialize the parallel copies, i.e., replace them by sequences of simple copies. It might be useful to postpone the copy sequentialization, perhaps even until after register allocation (see Chapter 23), since it introduces arbitrary interferences between variables. As an example, a_1 ← a_2 ∥ b_1 ← b_2 (where inst1 ∥ inst2 represents two instructions inst1 and inst2 to be executed simultaneously) can be sequentialized into a_1 ← a_2 ; b_1 ← b_2, which would make b_2 interfere with a_1, while the other way round, b_1 ← b_2 ; a_1 ← a_2, would make a_2 interfere with b_1 instead. If we still decide to replace parallel copies by sequences of simple copies immediately after SSA destruction, this can be done as shown in Algorithm 6.

Algorithm 6: Replacement of parallel copies with sequences of sequential copy operations
 1: begin
 2:   let pcopy denote the parallel copy to be sequentialized
 3:   let seq = () denote the sequence of copies
 4:   while ¬(∀ (b ← a) ∈ pcopy, a = b) do
 5:     if ∃ (b ← a) ∈ pcopy s.t. ¬∃ (c ← b) ∈ pcopy then   /* b is not live-in of pcopy */
 6:       append b ← a to seq
 7:       remove copy b ← a from pcopy
 8:     else                                                 /* pcopy is only made up of cycles; break one of them */
 9:       let (b ← a) ∈ pcopy s.t. a ≠ b
10:       let a' be a freshly created variable
11:       append a' ← a to seq
12:       replace in pcopy the copy b ← a by b ← a'

To see that this algorithm converges, one can visualize the parallel copy as a graph where nodes represent resources and edges represent transfers of values: the number of steps is exactly the number of cycles plus the number of non-self edges of this graph. The correctness comes from the invariance of the behavior of seq; pcopy. Of course, this straightforward algorithm has several drawbacks, addressed in Chapter 22, where an optimized implementation is presented.

3.3 SSA Property Transformations

As discussed in Chapter 2, SSA comes in different flavors. This section describes algorithms that transform arbitrary SSA code into the desired flavor. Making SSA conventional corresponds exactly to the first phase of SSA destruction (described in Section 3.2) that splits critical edges and introduces parallel copies (sequentialized later, in bulk or on-demand) around φ-functions. Making SSA strict, i.e., making it fulfill the dominance property, is as 'hard' as constructing SSA. However, a pre-pass through the graph can detect the offending variables that have definitions that do not dominate their uses.
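The following is a minimal runnable sketch of Algorithm 6, assuming a parallel copy is given as a list of (dest, src) pairs and that fresh() returns a variable name not used elsewhere (an assumption made for the example).

def sequentialize(pcopy, fresh):
    pcopy = [(b, a) for (b, a) in pcopy if b != a]   # drop self-copies
    seq = []
    while pcopy:
        live_in = {a for (_, a) in pcopy}
        ready = [(b, a) for (b, a) in pcopy if b not in live_in]
        if ready:
            b, a = ready[0]              # b is not read by any pending copy
            seq.append((b, a))
            pcopy.remove((b, a))
        else:                            # only cycles remain: break one of them
            b, a = pcopy[0]
            t = fresh()
            seq.append((t, a))           # save a's value in a fresh temporary
            pcopy[0] = (b, t)            # the pending copy now reads the temporary
    return seq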

Then there are several possible single-variable φ-function insertion algorithms (see Chapter 4) that can be used to patch up the SSA, by restricting attention to the set of non-conforming variables. As the number of variables requiring repair might be a small proportion of all variables, a costly traversal of the whole program can be avoided by building the def-use chains (for non-conforming variables) during the detection pre-pass. Renaming can then be done on a per-variable basis or, better (if pruned SSA is preferred), the reconstruction algorithm presented in Chapter 5 can be used for both φ-function placement and renaming.

The construction algorithm described above does not build pruned SSA form. If available, liveness information can be used to filter out the insertion of φ-functions wherever the variable is not live: the resulting SSA form is pruned. Alternatively, pruning SSA form is equivalent to a dead-code elimination pass after SSA construction. As use-def chains are implicitly provided by SSA form, dead-φ-function elimination simply relies on marking actual uses (non-φ-function ones) as useful, and propagating usefulness backward through φ-functions. Algorithm 7 presents the relevant pseudo-code for this operation. Here, stack is used to store variables that are useful but not yet processed.

Algorithm 7: φ-function pruning algorithm
 1: begin
 2:   stack ← ()
      /* initial marking phase */
 3:   foreach I : instruction of the CFG in dominance order do
 4:     if I is a φ-function defining variable a then
 5:       mark a as useless
 6:     else                                    /* I is not a φ-function */
 7:       foreach x : source operand of I do
 8:         if x is defined by a φ-function then
 9:           mark x as useful; stack.push(x)
      /* usefulness propagation phase */
10:   while stack not empty do
11:     a ← stack.pop()
12:     let I be the φ-function that defines a
13:     foreach x : source operand of I do
14:       if x is marked as useless then
15:         mark x as useful; stack.push(x)
      /* final pruning phase */
16:   foreach I : φ-function do
17:     if the destination operand of I is marked as useless then delete I

To construct pruned SSA form via dead code elimination, it is generally much faster to first build semi-pruned SSA form rather than minimal SSA form, and then apply dead code elimination. Semi-pruned SSA form is based on the observation that many variables are local, i.e., have a single live-range that is within a single basic block. Consequently, pruned SSA would not instantiate any φ-functions for these variables.
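A small sketch of Algorithm 7 in Python: it assumes phis maps each φ-destination variable to the list of its source operands, and real_uses is the set of variables appearing as operands of non-φ instructions (both names are assumptions made for illustration).

def prune_phi_functions(phis, real_uses):
    useful = set()
    stack = [v for v in real_uses if v in phis]   # phi results with real uses seed the walk
    useful.update(stack)
    while stack:
        a = stack.pop()
        for x in phis[a]:                         # propagate usefulness backwards
            if x in phis and x not in useful:
                useful.add(x)
                stack.append(x)
    # keep only the phi-functions whose destination was marked useful
    return {dest: srcs for dest, srcs in phis.items() if dest in useful}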

Such local variables can be identified by a linear traversal over each basic block of the CFG. All of these variables can be filtered out: minimal SSA form restricted to the remaining variables gives rise to the so-called semi-pruned SSA form.

As already mentioned in Chapter 2, copy propagation can break the dominance property by propagating a variable a_j through φ-functions of the form a_i = φ(a_x1, ..., a_xk) where all the source operands are syntactically equal to either a_j or ⊥. If we want to avoid breaking the dominance property, we simply have to avoid applying copy propagation steps that involve ⊥. Code transformations such as copy propagation, code motion, or other CFG modifications can also make a minimal SSA program become non-minimal: many inserted φ-functions become redundant. The application of copy propagation and rule-based φ-function rewriting, as detailed below, can remove many of these redundant φ-functions. When the CFG is reducible², this simplification produces minimal SSA. The underlying reason is that the def-use chain of a given φ-web is, in this context, isomorphic to a subgraph of the reducible CFG: all nodes except those that can be reached by two different paths (from actual non-φ-function definitions) can be simplified by iterative application of the T1 (removal of a self edge) and T2 (merge of a node with its unique predecessor) graph reductions. Identity φ-functions can be simplified this way from inner to outer loops. To be efficient (otherwise as many iterations as the maximum loop depth might be necessary for copy propagation), variable def-use chains should be precomputed. Section 3.4 gives a fuller discussion of reducibility, with pointers to further reading.

² A CFG is reducible if there are no jumps into the middle of loops from the outside, so the only entry to a loop is through its header.

Pessimistic φ-function Insertion

Construction of minimal SSA, as outlined in Section 3.1, comes at the price of a sophisticated algorithm involving the computation of iterated dominance frontiers. Alternatively, φ-functions may be inserted in a pessimistic fashion: a φ-function is inserted at the start of each CFG node (basic block) for each variable that is live at the start of that node. A less sophisticated, or crude, strategy is to insert a φ-function for each variable at the start of each CFG node; in that case the resulting SSA will not be pruned. When a pessimistic approach is used, many inserted φ-functions are redundant. These φ-functions turn out to be 'identity' operations, where all the source operands become syntactically identical and equivalent to the destination operand; as such, they can be trivially elided from the program. More generally, we can simplify an SSA program as follows.

1. Remove any φ-function a_i = φ(a_x1, ..., a_xk) where all x_n ∈ {i}. This corresponds to a T1 reduction.

2. Remove any φ-function a_i = φ(a_x1, ..., a_xk) where all x_n ∈ {i, j}, and replace all occurrences of a_i by a_j. This corresponds to a T2 reduction, possibly followed by a T1 reduction.

A more interesting rule is the one that propagates a_j through a φ-function of the form a_i = φ(a_x1, ..., a_xk) where all the source operands are syntactically equal to either a_j or ⊥. As noted above, such a propagation can break the dominance property; if we want to preserve it, we simply avoid the rewrites that involve ⊥.

[Fig. 3.5: The T1 (removal of a self edge on a_i) and T2 (merge of a_i with its unique predecessor a_j) rewrite rules for graph reduction, applied to def-use relations between SSA variables.]

This approach can be implemented using a worklist, which stores the candidate nodes for simplification. Using the graph made up of def-use chains (see Chapter 18), the worklist can be initialized with the successors of non-φ-functions; however, for simplicity, we may initialize it with all φ-functions. Of course, if loop nesting forest information is available, the worklist can be avoided by traversing the CFG in a single pass from inner to outer loops, and in a topological order within each loop (header excluded). Since we believe the main motivation for this approach is its simplicity, the pseudo-code shown in Algorithm 8 uses a work queue.

Algorithm 8: Removal of redundant φ-functions using rewriting rules and a work queue
 1: begin
 2:   W ← ()
 3:   foreach I : φ-function in the program, in reverse post-order do
 4:     W.enqueue(I); mark I
 5:   while W not empty do
 6:     I ← W.dequeue(); unmark I
 7:     let I be of the form a_i = φ(a_x1, ..., a_xk)
 8:     if all source operands of the φ-function are ∈ {a_i, a_j} then
 9:       remove I from the program
10:       foreach I' : instruction different from I that uses a_i do
11:         replace a_i by a_j in I'
12:         if I' is a φ-function and I' is not marked then
13:           mark I'; W.enqueue(I')

This algorithm is guaranteed to terminate in a fixed number of steps. At every iteration of the while loop it removes a φ-function from the work queue W, and whenever it adds new φ-functions to W it removes a φ-function from the program. The number of φ-functions in the program is bounded, so the number of insertions into W is bounded. The queue could be replaced by a worklist, with the insertions and removals done at random; the algorithm would be less efficient, but the end result would be the same.
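A compact sketch of the T1/T2 simplification of Algorithm 8 follows. For brevity it only tracks φ-functions, with phis mapping each φ destination to its list of source operands (an illustrative assumption); a full implementation would also rewrite uses of the removed destination in non-φ instructions.

from collections import deque

def remove_redundant_phis(phis):
    queue = deque(phis.keys())
    marked = set(queue)
    while queue:
        a = queue.popleft()
        marked.discard(a)
        if a not in phis:
            continue
        operands = {x for x in phis[a] if x != a}       # T1: ignore self references
        if len(operands) == 0:
            del phis[a]                                  # degenerate: all operands are a itself
            continue
        if len(operands) == 1:                           # T2: the phi is an identity
            (replacement,) = operands
            del phis[a]
            for dest, srcs in phis.items():              # rewrite phi uses of a
                if a in srcs:
                    phis[dest] = [replacement if s == a else s for s in srcs]
                    if dest not in marked:
                        marked.add(dest)
                        queue.append(dest)
    return phis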

3.4 Additional reading

The early literature on SSA form [83, 85] introduces both phases of the construction algorithm we have outlined in this chapter, and discusses algorithmic complexity on common and worst-case inputs. These initial presentations trace the ancestry of SSA form back to early work on data flow representations by Shapiro and Saint [234].

Briggs et al. [47] discuss pragmatic refinements to the original algorithms for SSA construction and destruction, with the aim of reducing execution time. They introduce the notion of semi-pruned form, show how to improve the efficiency of the stack-based renaming algorithm, and describe how copy propagation must be constrained to preserve correct code during SSA destruction.

There are numerous published descriptions of alternative algorithms for SSA construction, in particular for the φ-function insertion phase. The pessimistic approach that first inserts φ-functions at all control flow merge points and then removes unnecessary ones using simple T1/T2 rewrite rules was proposed by Aycock and Horspool [20]. Brandis and Mössenböck [44] describe a simple, syntax-directed approach to SSA construction from well-structured high-level source code. Throughout this textbook, we consider the more general case of SSA construction from arbitrary CFGs. Sreedhar and Gao [239] pioneer linear-time-complexity φ-function insertion algorithms based on DJ-graphs. These approaches have been refined by other researchers; Chapter 4 explores these alternative construction algorithms in depth.

Aho et al. [5] describe the concept of reducibility and trace its history in the early compilers literature. A reducible CFG is one that will collapse to a single node when it is transformed using repeated application of T1/T2 rewrite rules.

Boissinot et al. [39] review the history of SSA destruction approaches, and highlight misunderstandings that led to incorrect destruction algorithms. Blech et al. [33] formalize the semantics of SSA, in order to verify the correctness of SSA destruction algorithms. Chapter 22 presents more details on alternative approaches to SSA destruction.

There are instructive dualisms between concepts in SSA form and functional programs, including construction, dominance, and copy propagation. Chapter 6 explores these issues in more detail.

CHAPTER 4
Advanced Construction Algorithms for SSA
D. Das, U. Ramakrishna, V. C. Sreedhar

4.1 Introduction

The placement of φ-functions is an important step in the construction of Static Single Assignment (SSA) form. In SSA form every variable is assigned only once. At control flow merge points, φ-functions are added to ensure that every use of a variable corresponds to exactly one definition. A φ-function is of the form x_n = φ(x_0, x_1, ..., x_{n-1}), where the x_i (i = 0 ... n-1) are a set of variables with static single assignment. Consider the example control flow graph where we have shown definitions and uses for a variable x. The SSA form for the variable x is illustrated in Figure 4.1(a). Notice that φ-functions have been introduced at certain join (also called merge) nodes in the control flow graph.

Recall that SSA construction falls into the two phases of φ-placement and variable renaming. Here we present different approaches for the first phase, for minimal SSA form. In the rest of this chapter we will present three different approaches for placing φ-functions at appropriate join nodes. We first present Sreedhar and Gao's algorithm for placing φ-functions using the DJ (Dominator-Join) graph representation of a CFG. Then we will discuss the notions of merge set and merge relation, and present Das and Ramakrishna's algorithm for placing φ-functions based on the merge relation, using the DJ graph. Finally, we describe another approach for placing φ-functions based on the loop nesting forest, proposed by Ramalingam.

4.2 Basic Algorithm

Cytron et al.'s algorithm for the placement of φ-functions consists of computing the iterated dominance frontier (IDF) for the set of all definition points (nodes where variables are defined). The dominance frontier DF(x) of a node x is the set of all nodes z such that x dominates a predecessor of z, without strictly dominating z; for example, DF(8) = {6, 8} in Figure 4.1(b). Intuitively, from the point of view of x, the DF nodes are the earliest appearance of paths joining the flow without passing through x. The dominance frontier of a set of nodes is just the union of the DF of each node. Letting N_α be the set of nodes where variable x is defined, we can compute IDF(N_α) as the limit of the following recurrence (where S is initially N_α):

IDF_1(S) = DF(S)
IDF_{i+1}(S) = DF(S ∪ IDF_i(S))

A φ-function is then placed at each join node in the IDF(N_α) set.

Although Cytron et al.'s algorithm works very well in practice, in the worst case the time complexity of the φ-placement algorithm for a single variable is quadratic in the number of nodes in the original control flow graph. A straightforward algorithm for computing DF for each node takes O(N^2) time (where N is the number of nodes in the CFG), since the size of the full DF set in the worst case can be O(N^2).

4.3 Placing φ-functions using DJ graphs

Sreedhar and Gao proposed the first linear-time algorithm for computing the IDF set without the need for explicitly computing the full DF set. Their original algorithm was implemented using the DJ graph representation of a CFG. A DJ graph is a CFG augmented with immediate dominator tree edges. Join (J) edges are edges of the CFG that are not dominator edges; a J edge is written x →_J y. All the other edges are D edges. The DJ graph for the example CFG is also shown in Figure 4.1(b): for example, the CFG edge 4→5, which is not a dominator-tree edge, becomes the J edge 4 →_J 5. Rather than explicitly computing the DF set, Sreedhar and Gao's algorithm uses the DJ graph to compute IDF(N_α) on the fly. Even though the time complexity of Sreedhar and Gao's algorithm is linear in the size of the DJ graph, in practice it sometimes performs worse than the Cytron et al. algorithm. The main reason for this is that the size of the DF set is linear in the number of nodes in the CFG and, in practice, sometimes smaller than the size of the DJ graph.

[Fig. 4.1: A motivating example. (a) The CFG with the definitions and uses of x and the inserted φ-functions, for N_α = {1, 3, 4, 7} and IDF(N_α) = {2, 5, 6}. (b) The corresponding DJ graph, with dominator edges, J (join) edges, and the level of each node (levels 0 to 5).]

4.3.1 Key Observation

Now let us try to understand how to compute the DF of a single node using DJ graphs. Consider the DJ graph shown in Figure 4.1(b), where the level of a node is its depth from the root of the dominator tree (the level of each node is also shown in the figure). To compute DF(8) we simply walk down the dominator (D) tree edges from node 8 and identify all join (J) edges x →_J y such that y.level ≤ 8.level. For our example the J edges that satisfy this condition are 9 →_J 6 and 10 →_J 8. Therefore DF(8) = {6, 8}. To generalize the example, we can compute the DF of a node x using:

DF(x) = { y | ∃ z ∈ SubTree(x) ∧ z →_J y ∧ y.level ≤ x.level }

Now we can extend the above idea to compute the IDF for a set of nodes. This algorithm does not precompute DF. Rather, given a set of initial nodes N_α for which we want to compute the relevant set of φ-functions, a key observation can be made. Let y be an ancestor node of a node x on the dominator tree. If DF(x) has already been computed before the computation of DF(y), then DF(x) need not be recomputed when computing DF(y), because nodes reachable from SubTree(x) are already in DF(x). However, the reverse may not be true, and therefore the order of the computation of DF is crucial.

To illustrate the key observation, consider the example DJ graph in Figure 4.1(b), and let us compute IDF({3, 8}). It is clear from the recursive definition of IDF that we have to compute DF(3) and DF(8) as a first step. Suppose we start with node 3 and compute DF(3); the resulting DF set is DF(3) = {2}. Suppose we next compute the DF set for node 8, and the resulting set is DF(8) = {6, 8}. Notice that we had already visited node 8 and its subtree as part of visiting node 3. We can avoid such duplicate visits by ordering the computation of the DF sets so that we first compute DF(8), and then, during the computation of DF(3), we avoid visiting the sub-tree of node 8 and use the result DF(8) that was previously computed. Thus, to compute DF(x), where x is an ancestor of y in the DJ graph, we need to compute the DF of deeper (based on level) nodes (here, y) before computing the DF of a shallower node (here, x), and we do not need to compute it from scratch, as we can re-use the information computed as part of DF(y), as shown:

DF(x) = { w | w ∈ DF(y) ∧ w.level ≤ x.level } ∪ { w | t ∈ SubTree(x) \ SubTree(y) ∧ t →_J w ∧ w.level ≤ x.level }

4.3.2 Main Algorithm

In this section we present Sreedhar and Gao's algorithm for computing IDF. Let x.level be the depth of the node from the root node (root.level = 0). To ensure that the nodes are processed according to the key observation, we use a simple array of sets, the OrderedBucket, and two functions defined over it: (1) InsertNode(n), which inserts the node n in the set OrderedBucket[n.level], and (2) GetDeepestNode(), which returns a node from the OrderedBucket with the deepest level number. In IDFMain(), we first insert all nodes belonging to N_α in the OrderedBucket. The nodes are then processed in a bottom-up fashion over the DJ graph, from deepest node level to least node level (Line 6), by calling Visit(x). The procedure Visit(x) essentially walks top-down in the DJ graph, avoiding already visited nodes. During this traversal it also peeks at destination nodes of J edges. Whenever it notices that the level number of the destination node of a J edge is less than or equal to the level number of currentRoot, the destination node is added to the IDF set (Line 15), if it is not present in IDF already. Notice that at Line 17 the destination node is also inserted in the OrderedBucket if it was never inserted before. Finally, at Line 25, we continue to process the nodes in the sub-tree by visiting over the D edges. When the algorithm terminates, the set IDF will contain the iterated dominance frontier for the initial N_α. The complete algorithm is shown in Figure 4.2, and some of its phases are depicted in Figure 4.3 for clarity.
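The key observation's per-node DF formula can be illustrated with a small Python sketch, assuming children[x] gives x's dominator-tree children (D edges), jsucc[x] gives the targets of J edges leaving x, and level[x] is the depth of x in the dominator tree; these names are assumptions made for the example.

def df_from_dj(x, children, jsucc, level):
    df = set()
    stack = [x]                      # walk SubTree(x) over D edges
    while stack:
        z = stack.pop()
        for y in jsucc.get(z, ()):   # peek at J edges leaving the subtree
            if level[y] <= level[x]:
                df.add(y)
        stack.extend(children.get(z, ()))
    return df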

Figure 4.2: Sreedhar and Gao's algorithm for computing the IDF set.
Input: a DJ graph representation of a program and N_α. Output: the set IDF(N_α).

Procedure IDFMain(Set N_α)
 1:   IDF = {}
 2:   foreach node x ∈ N_α do
 3:     InsertNode(x)
 4:     x.visited = false
 5:   end for
 6:   while ((z = GetDeepestNode()) ≠ NULL) do     // get node from the deepest level
 7:     currentRoot = z
 8:     z.visited = true
 9:     Visit(z)                                   // find DF(z)
10:   end while

Procedure Visit(x)
11:   foreach node y ∈ Succ(x) do
12:     if (x → y is a J edge) then
13:       if (y.level ≤ currentRoot.level) then
14:         if (y ∉ IDF) then
15:           IDF = IDF ∪ {y}
16:           if (y ∉ N_α) then
17:             InsertNode(y)
18:           end if
19:         end if
20:       end if
21:     else                                        // visit D edges
22:       if (y.visited == false) then
23:         y.visited = true
24:         // if (y.boundary == false)             // see the section on further reading for details
25:           Visit(y)
26:         // end if
27:       end if
28:     end if
29:   end for

The OrderedBucket is populated with the nodes 1, 3, 4 and 7, corresponding to N_α = {1, 3, 4, 7}. The nodes are placed in the buckets corresponding to the levels at which they appear: node 1, which appears at level 0, is in the zero-th bucket, node 3 is in the second bucket, and so on. Since the nodes are processed bottom-up, the first node that is visited is node 7. The successor of node 7 is node 2, and since there exists a J edge 7 →_J 2, the IDF set is updated to hold node 2, according to Line 15 of the Visit procedure. In addition, InsertNode(2) is invoked and node 2 is placed in the second bucket. The next node visited is node 4. The successor of node 4, which is node 5, has an incoming J edge 4 →_J 5, which results in the new IDF = {2, 5}. The final IDF set converges to {2, 5, 6} when node 5 is visited. An interesting case arises when node 3 is visited. Node 3 finally causes nodes 9 and 10 also to be visited. However, when node 10 is visited, its successor node 8, which corresponds to the J edge 10 →_J 8, does not result in an update of the IDF set, as the level of node 8 is deeper than that of node 3. Subsequent visits of other nodes do not add anything to the IDF set.

[Fig. 4.3: Phases of Sreedhar and Gao's algorithm for N_α = {1, 3, 4, 7}, showing the OrderedBucket (levels 0 to 5) and the IDF set after each step: initially IDF = {}; after Visit(7), IDF = {2} and InsertNode(2); after Visit(4), IDF = {2, 5} and InsertNode(5); after Visit(5), IDF = {2, 5, 6} and InsertNode(6); after Visit(6), IDF = {2, 5, 6}.]

4.4 Placing φ-functions using Merge Sets

In the previous section we described a φ-function placement algorithm in terms of iterated dominance frontiers. There is another way to approach the problem of φ-function placement, using the concepts of merge relation and merge set. In this section we will first introduce the notions of merge set and merge relation, and then show how the merge relation is related to the DF relation. We will then show how to compute the M (merge) graph using the DJ graph, and then use the M graph for placing φ-functions.

4.4.1 Merge Set and Merge Relation

First, let us define the notion of a join set J(S) for a given set of nodes S in a control flow graph.¹ Consider two nodes u and v and distinct non-empty paths (denoted by +) u →+ w and v →+ w, where w is some node in the CFG. If the two paths meet only at w, then w is in the join set of the nodes {u, v}. For instance, consider nodes 1 and 9 in Figure 4.1(a): the paths 1→2→3→8 and 9→10→8 meet at 8 for the first time, and so 8 ∈ J({1, 9}).

Now let us define the merge relation as a relation v = M(u) that holds between two nodes u and v whenever v ∈ J({root, u}); that is, we insert a φ-function at v for a variable that is assigned only at root and u. One can show that

J(S) = ∪_{u ∈ S} M(u), where S ⊆ V.

¹ In English, 'join' and 'merge' are synonyms, but unfortunately in the literature, due to lack of better terms, these two synonyms are used to mean distinct but related concepts.

This relationship between dominance and merge can be conveniently encoded using a DJ graph and can be used for placing φ-functions: for any node u ∈ V, v ∈ M(u) if and only if there is a path u →+ v that does not contain idom(v). The merge relation in this form is due to Pingali and Bilardi.

First, we construct the DF (dominance frontier) graph from the DJ graph. For each J edge u →_J v in the DJ graph we insert a new J edge w → v, where w = idom(u) and w does not strictly dominate v; another way to look at this is to insert a new J edge w → v if w = idom(u) and w.level ≥ v.level.² We repeatedly insert new J edges in a bottom-up fashion over the DJ graph until no more J edges can be inserted. The resulting DJ graph is a "fully cached" DJ graph, which we call the DF graph. Next, we compute the M relation from the DF graph: the M graph is the transitive closure of the DF graph, using only its DF edges. Given the M graph of a CFG, we can place φ-functions for a set S ⊆ V at M(S). The DF graph and the M sets for the running example are given in Figure 4.4. For instance, M(4) = {2, 5, 6}, as the nodes 2, 5 and 6 are the only nodes reachable from node 4 following the DF edges. Similarly, M(8) = {2, 5, 6, 8}, due to these nodes being reachable from node 8 via the DF edges.

² Recall that w.level is the depth of w from the root of the dominator tree.

[Fig. 4.4: The DF graph of the running example (dominator edges and DF edges) together with the M set of every node; in particular M(2) = M(3) = M(7) = {2}, M(4) = M(5) = M(6) = {2, 5, 6}, M(8) = M(9) = M(10) = {2, 5, 6, 8}, and M(1) = M(11) = {}.]

The algorithms for placing φ-functions based on the construction of the DF graph or the M relation can be quadratic in the worst case.
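A small sketch of the merge-set computation described above: take the transitive closure of the DF edges. It assumes df_edges maps each node to the targets of its DF edges (for instance as produced by a dominance-frontier computation); the names are illustrative only.

def merge_sets(nodes, df_edges):
    M = {}
    for u in nodes:
        reached = set()
        stack = list(df_edges.get(u, ()))
        while stack:                      # reachability over DF edges only
            v = stack.pop()
            if v not in reached:
                reached.add(v)
                stack.extend(df_edges.get(v, ()))
        M[u] = reached                    # M(u) is the transitive closure from u
    return M

# phi-functions for a variable defined in a set S of nodes can then be placed
# at the union of M(u) for u in S.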

4.4.2 Iterative Merge Set Computation

In this section we describe a method to iteratively compute the merge relation using a dataflow formulation. In the previous section we saw that the merge relation can be computed using a transitive closure of the DF graph, which in turn can be computed from the DJ graph. In the algorithm proposed by Das and Ramakrishna, explicit DF graph construction and transitive closure formation are not necessary; instead, the same result can be achieved by formulating the merge relation as a dataflow problem and solving it iteratively. For several applications, this approach has been found to be a fast and effective method to construct M(x) for each node x and to perform the corresponding φ-function placement using the M(x) sets.

Consider a J edge u →J v. Also consider all nodes w such that w dominates u and w.level ≥ v.level. For every such w we can set up the dataflow equation M(w) = M(w) ∪ M(v) ∪ {v}. The set of dataflow equations for each node n in the DJ graph can be solved iteratively using a top-down pass over the DJ graph. To check whether multiple passes are required over the DJ graph before a fixed point is reached for the dataflow equations, we devise an "inconsistency condition", stated as follows:

Inconsistency Condition: For a J edge u →J v, if u does not satisfy M(u) ⊇ M(v), then the node u is said to be inconsistent.

The algorithm described in the next section is directly based on this method of building up the M(x) sets of the nodes as each J edge is encountered, in an iterative fashion, by traversing the DJ graph top-down. If no node is found to be inconsistent after a single top-down pass, all the nodes are supposed to have reached their fixed-point solutions. If some node is found to be inconsistent, multiple passes are required until a fixed-point solution is reached.

4.4.2.1 TopDownMergeSetComputation

The first and direct variant of the approach laid out above is termed TDMSC-I. This variant works by scanning the DJ graph in a top-down fashion, as shown in Line 7 of Figure 4.5. All M(x) sets are set to the empty set before the initial pass of TDMSC-I; the M(x) sets computed in a previous pass are carried over if a subsequent pass is required. The DJ graph is visited level by level. During this process, for each node n encountered, if there is an unvisited incoming J edge s →J n, as in Line 8, then a separate bottom-up pass starts at Line 14. This bottom-up pass traverses all nodes w such that w dominates s and w.level ≥ n.level, updating the M(w) values using the aforementioned dataflow equation.

Input: A DJ graph representation of a program.
Output: The merge sets for the nodes.

Procedure TDMSCMain {
1: ∀ node x ∈ DJ graph, set M(x) = {}
2: done = false
3: while (!done) do
4:   done = !TDMSC-I(DJ graph)
5: end while
}

Procedure TDMSC-I(DJ graph) {
6: RequireAnotherPass = false
7: while (n = next node in B(readth) F(irst) S(earch) order of DJ graph) do
8:   for (∀ incoming edge to n) do
9:     Let e = s →J n, be an incoming J edge
10:    if (e not visited) then
11:      Mark e as visited
12:      tmp = s
13:      lnode = NULL
14:      while (level(tmp) ≥ level(n)) do
15:        M(tmp) = M(tmp) ∪ M(n) ∪ {n}
16:        lnode = tmp
17:        tmp = parent(tmp)              //dominator tree parent
18:      end while
19:      for (∀ incoming edge to lnode) do
20:        Let e' = s' →J lnode, be an incoming J edge
21:        if (e' visited) then
22:          if (M(s') ⊉ M(lnode)) then   //Check inconsistency
23:            RequireAnotherPass = true
24:          end if
25:        end if
26:      end for
27:    end if
28:  end for
29: end while
30: return RequireAnotherPass
}

Fig. 4.5 Top Down Merge Set Computation algorithm for computing Merge sets. RequireAnotherPass is set to true only if a fixed point is not reached and the inconsistency check succeeds for some node.

There are some subtleties in the algorithm that should be noted. Line 19, which is used for the inconsistency check, visits incoming edges to lnode only, as lnode is at the same level as n, which is the current level of inspection; the incoming edges to lnode's posterity are at a level greater than that of node n and are as yet unvisited.
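The following is a small executable Python rendering of TDMSC-I that mirrors the line structure of Figure 4.5. The input encoding (an idom map, a level map, and a list of J edges) and all identifier names are illustrative assumptions for this sketch, not part of the algorithm as published.

# Sketch of TDMSC-I (Figure 4.5), assuming the DJ graph is given as:
#   nodes  : iterable of node identifiers
#   idom   : dict mapping each node to its dominator-tree parent (None for the root)
#   level  : dict mapping each node to its depth in the dominator tree
#   jedges : list of (s, n) pairs, one per J edge s ->J n
def tdmsc(nodes, idom, level, jedges):
    M = {x: set() for x in nodes}                     # Line 1: all merge sets empty
    order = sorted(nodes, key=lambda x: level[x])     # level-by-level (BFS) visit order
    incoming = {n: [e for e in jedges if e[1] == n] for n in nodes}

    def one_pass():
        require_another_pass = False                  # Line 6
        visited = set()
        for n in order:                               # Line 7
            for e in incoming[n]:                     # Lines 8-9
                if e in visited:
                    continue
                visited.add(e)                        # Lines 10-11
                s, _ = e
                tmp, lnode = s, None                  # Lines 12-13
                while tmp is not None and level[tmp] >= level[n]:   # Line 14
                    M[tmp] |= M[n] | {n}              # Line 15: dataflow equation
                    lnode = tmp                       # Line 16
                    tmp = idom[tmp]                   # Line 17: climb dominator tree
                if lnode is None:                     # guard: nothing was updated
                    continue
                for e2 in incoming[lnode]:            # Line 19
                    s2, _ = e2
                    if e2 in visited and not (M[s2] >= M[lnode]):
                        require_another_pass = True   # Lines 21-23: inconsistency
        return require_another_pass

    while one_pass():                                 # TDMSCMain: iterate to a fixed point
        pass
    return M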

Here, we will briefly walk through TDMSC-I using the DJ graph of Figure 4.1(b). Moving top-down over the graph, the first J edge encountered is 7→J2. As a result, a bottom-up climbing of the nodes happens, starting at node 7 and ending at node 2, and the merge sets of these nodes are updated, so that M(7) = M(6) = M(3) = M(2) = {2}. The next J edge to be visited can be any of 4→J5, 5→J6 or 6→J5, at level = 3. Assume it is 5→J6. This results in M(5) = M(5) ∪ M(6) ∪ {6} = {2, 6}. Now, let 6→J5 be visited. This results in M(6) = M(6) ∪ M(5) ∪ {5} = {2, 5, 6}. At this point, the inconsistency check comes into the picture for the edge 6→J5, as 5→J6 is another J edge that is already visited and is an incoming edge of node 6. Checking for M(5) ⊇ M(6) fails, implying that M(5) needs to be computed again. This will be done in a succeeding pass, as suggested by the RequireAnotherPass value of true.

In a second iterative pass, the J edges are visited in the same order. Now, when 5→J6 is visited, M(5) = M(5) ∪ M(6) ∪ {6} = {2, 5, 6}, as this time M(6) = {2, 5, 6}. On a subsequent visit of 6→J5, M(6) is also set to {2, 5, 6}, as this time M(5) = {2, 5, 6}. The inconsistency does not appear any more and the algorithm proceeds to handle the edges 4→J5, 9→J6 and 10→J8, which have also been visited in the earlier pass. TDMSC-I is invoked repeatedly by a different function which calls it in a loop till RequireAnotherPass is returned as false, as shown in the procedure TDMSCMain.

Once the Merge relation is computed for the entire CFG, placing φ-functions is a straightforward application of the M(x) values for a given Nα, as shown in Figure 4.6.

Input: A CFG with Merge sets computed and Nα.
Output: Blocks augmented by φ-functions for the given Nα.
Procedure φ-placement for Nα {
1: for (∀n ∈ Nα) do
2:   for (∀n' ∈ M(n)) do
3:     Add a φ-function for n in n' if not placed already
4:   end for
5: end for
}

Fig. 4.6 φ-placement using Merge sets
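As a companion to Figure 4.6, here is a minimal Python sketch of φ-placement from precomputed merge sets; the dictionary-based encoding and the function name are illustrative assumptions only.

# Sketch of the phi-placement step of Figure 4.6, assuming M maps each node to its
# merge set and n_alpha is the set of blocks that define the variable of interest.
def place_phis(n_alpha, M):
    phi_blocks = set()                     # blocks that receive a phi-function
    for n in n_alpha:
        for target in M[n]:
            if target not in phi_blocks:   # "if not placed already"
                phi_blocks.add(target)
    return phi_blocks

For the running example, calling place_phis with Nα = {1, 3, 4, 7} and the merge sets of Figure 4.4 should reproduce the IDF {2, 5, 6} computed in Figure 4.3.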

4.5 Computing Iterated Dominance Frontier Using Loop Nesting Forests

This section illustrates the use of loop nesting forests for the construction of the iterated dominance frontier (IDF) of a set of vertices in a CFG, which may contain reducible as well as irreducible loops. The algorithm is based on Ramalingam's work.

4.5.1 Loop nesting forest

A loop nesting forest is a data structure that represents the loops in a CFG and the containment relation between them. In the example shown in Figure 4.7(a), the loops with backedges 11 → 9 and 12 → 2 are both reducible loops. The corresponding loop nesting forest is shown in Figure 4.7(b) and consists of two loops whose header nodes are 2 and 9. The loop with header node 2 contains the loop with header node 9.

Fig. 4.7 An example (a) CFG and (b) Loop Nesting Forest (figure: a CFG with an entry node and definitions of v in nodes 4, 5, 7 and 12, and its loop nesting forest distinguishing loop header nodes, loop internal nodes and plain CFG nodes).

4.5.2 Main Algorithm

Let us say that a definition node d "reaches" another node u if there is a path in the graph from d to u which does not contain any definition node other than d (for a variable v). If at least two definitions reach a node u, then u belongs to IDF(X), where X consists of these definition nodes. The acyclic version of the graph G, termed G_ac, is formed by dropping the backedges 11 → 9 and 12 → 2. Also, entry(G) = {entry}, where entry is a specially designated node that is the root of the CFG.

For a given Nα, we can compute IDF(Nα) as follows:
• Initialize IDF to the empty set;
• Using a topological order, compute the subset of IDF(Nα) that can reach a node using forward dataflow analysis;
• Add a node to IDF if it is reachable from multiple nodes.

This suggests the algorithm in Figure 4.8.

Input: An acyclic CFG G_ac, X: subset of V(G).
Output: The set IDF(X).
Procedure ComputeIteratedDominanceFrontier(G_ac: Acyclic graph, X: subset of V(G)) {
1: IDF = {}
2: ∀ node of V(G), UniqueReachingDefs(node) = {entry}
3: for (∀u of V(G) − entry(G) in topological sort order) do
4:   ReachingDefs = {}
5:   for (∀ predecessor v of u) do
6:     if (v ∈ (IDF ∪ X ∪ entry(G))) then
7:       ReachingDefs = ReachingDefs ∪ {v}
8:     else
9:       ReachingDefs = ReachingDefs ∪ UniqueReachingDefs(v)
10:    end if
11:  end for
12:  if (|ReachingDefs| == 1) then
13:    UniqueReachingDefs(u) = ReachingDefs
14:  else
15:    IDF = IDF ∪ {u}
16:  end if
17: end for
18: return IDF
}

Fig. 4.8 Ramalingam's algorithm for computing the IDF of an acyclic graph.

We will walk through some of the steps of this algorithm for computing IDF({4, 5, 7, 12}), i.e., for the definitions of v in nodes 4, 5, 7 and 12 in Figure 4.7. According to the algorithm, the subsequent nodes reached by multiple definitions turn out to be 6 and 8. We will start at vertex 4, for which predecessors = {3}. But as 3 ∉ IDF ∪ X ∪ {entry}, ReachingDefs = UniqueReachingDefs(3) = {entry}, according to Line 9. As |ReachingDefs| = 1, UniqueReachingDefs(4) = {entry}. Following similar arguments, UniqueReachingDefs(5) = UniqueReachingDefs(7) = {entry}. Node 6 can be reached by any one of two definitions, in nodes 4 or 5. For vertex 6, predecessors = {4, 5}. Since both vertices 4 and 5 belong to X, ReachingDefs = {4, 5} according to Line 7, and so, according to Line 15, IDF = {6}. For node 8, the reaching definition can be either the one from node 7 or one of those from nodes 4 or 5; hence, when node 8 is visited, IDF = {6, 8}. When the rest of the vertices are visited, the IDF set remains unchanged. Note that the backedges do not exist in the acyclic graph, and hence node 2 is not part of the IDF set. We will see later how the IDF set for the entire graph is computed by considering the contribution of the backedges.
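The following is a small Python rendering of Figure 4.8; the graph encoding (predecessor lists plus a topological order starting with entry) and the function name are illustrative assumptions, not part of the original presentation.

# Sketch of Ramalingam's acyclic-IDF procedure (Figure 4.8), assuming:
#   topo_order : nodes of the acyclic CFG in topological order, entry first
#   preds      : dict mapping each node to the list of its predecessors
#   X          : the set of definition nodes
#   entry      : the designated root node of the CFG
def compute_idf_acyclic(topo_order, preds, X, entry):
    idf = set()
    unique_reaching_def = {node: {entry} for node in topo_order}   # Line 2
    for u in topo_order:                                           # Line 3
        if u == entry:
            continue
        reaching_defs = set()                                      # Line 4
        for v in preds[u]:                                         # Lines 5-10
            if v in idf or v in X or v == entry:
                reaching_defs |= {v}
            else:
                reaching_defs |= unique_reaching_def[v]
        if len(reaching_defs) == 1:                                # Lines 12-16
            unique_reaching_def[u] = reaching_defs
        else:
            idf.add(u)
    return idf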

The contribution of backedges to the iterated dominance frontier can be identified by using the loop nesting forest. If a vertex u is contained in a loop, then IDF(u) will contain the loop header. For any vertex u, let HLC(u) denote the set of loop headers of the loops containing u. How can the algorithm to find the IDF for acyclic graphs be extended to handle reducible graphs? A reducible graph can be decomposed into an acyclic graph and a set of backedges. Given a set of vertices X, it turns out that

    IDF(X) = HLC(X) ∪ IDF_ac(X ∪ HLC(X))

where IDF(X) and IDF_ac(X) denote the IDF over the original graph G and over the acyclic graph G_ac, respectively.

Reverting back to Figure 4.7, we see that in order to find the IDF for the nodes where the variable v is defined, we need to evaluate IDF({4, 5, 7, 12}). HLC({4, 5, 7, 12}) = {2}, as all these nodes are contained in a single loop with header 2. Hence, IDF({4, 5, 7, 12}) = HLC({4, 5, 7, 12}) ∪ IDF_ac({4, 5, 7, 12} ∪ HLC({4, 5, 7, 12})), so we would need to evaluate IDF_ac({2, 4, 5, 7, 12}), which turns out to be the set {6, 8}. Hence, IDF({4, 5, 7, 12}) = {2} ∪ {6, 8} = {2, 6, 8}.
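A minimal sketch of how the formula above could be wired up, reusing the hypothetical compute_idf_acyclic from the previous sketch; the loop-nesting-forest encoding (a map from each node to the headers of its enclosing loops) is an assumption made purely for illustration.

# Sketch of IDF(X) = HLC(X) ∪ IDF_ac(X ∪ HLC(X)) for a reducible CFG, assuming:
#   hlc : dict mapping each node to the set of headers of the loops containing it
#   topo_order, preds, entry : the acyclic graph G_ac obtained by dropping backedges
def compute_idf_reducible(X, hlc, topo_order, preds, entry):
    hlc_x = set()
    for u in X:
        hlc_x |= hlc.get(u, set())                    # HLC(X)
    return hlc_x | compute_idf_acyclic(topo_order, preds, set(X) | hlc_x, entry)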

Now that we have shown how the IDF can be computed for graphs with reducible loops using loop nesting forests, we will briefly touch upon how graphs containing irreducible loops can be handled. The insight behind the implementation is to transform the irreducible loop in such a way that an acyclic graph is created from the loop without changing the dominance properties of the nodes. Referring to Figure 4.9, we see that the irreducible loop comprising nodes v and w in (a) can be transformed to the acyclic graph in (c) by removing the edges between nodes v and w that create the irreducible loop. We can now create a new dummy node θ and add edges from the predecessors of all the header nodes of the cycles of the irreducible graph to θ, as well as edges from θ to all the header nodes. This results in additional edges to θ from u and s, which are the predecessors of the header nodes v and w respectively; extra edges are added from θ to the header nodes v and w. The transformed graph is shown in (d). Since this graph is acyclic, computing the IDF for the nodes in the original irreducible graph translates to computing the IDF for the transformed graph. Following this transformation, the algorithm depicted in Figure 4.8 can be applied by noting the loop nesting forest structure as shown in (b). The crucial observation that allows this transformation to create an equivalent acyclic graph is the fact that the dominator tree of the transformed graph remains identical to that of the original graph containing an irreducible cycle. One of the drawbacks of the transformation is the possible explosion in the number of dummy nodes and edges for graphs with many irreducible cycles, as a unique dummy node needs to be created for each irreducible loop. However, such cases may be rare for practical applications.

Fig. 4.9 (a) An irreducible graph (b) The Loop Nesting Forest (c) The acyclic subgraph (d) Transformed graph (figure: nodes u, s, v, w between Entry and Exit; in (d) the dummy node θ has incoming edges from u and s and outgoing edges to v and w).

4.6 Summary

This chapter provides an overview of some advanced construction algorithms for SSA. These include the first linear-time algorithm by Sreedhar using the concept of DJ graphs, the merge relation by Pingali and Bilardi, a DJ graph and merge set based algorithm by Das and Ramakrishna, and finally an algorithm by Ramalingam based on the concept of loop nesting forests. Though all these algorithms claim to be better than the original algorithm by Cytron et al., they are difficult to compare due to the unavailability of these algorithms in a common compiler framework.

4.7 Further Reading

Cytron's approach for φ-placement involves a fully eager approach of constructing the entire DF graph. On the other hand, Sreedhar and Gao's approach is a fully lazy approach, as it constructs the DF on-the-fly only when a query is encountered. Pingali and Bilardi [?] suggested a middle ground by combining both approaches. They proposed a new representation called ADT (Augmented Dominator Tree). The ADT representation can be thought of as a DJ graph where the DF sets are pre-computed for certain nodes, called "boundary nodes", using an eager approach. For the rest of the nodes, termed "interior nodes", the DF needs to be computed on-the-fly as in the Sreedhar-Gao algorithm. The nodes which act as "boundary nodes" are detected in a separate pass. A factor β is used to determine the partitioning of the nodes of a CFG into boundary or interior nodes by dividing the CFG into zones. β is a number that represents a space/query-time tradeoff: β << 1 denotes a fully eager approach where the storage requirement for DF is maximum but query time is faster, while β >> 1 denotes a fully lazy approach where the storage requirement is zero but queries are slower.

Given the ADT of a control flow graph, it is straightforward to modify Sreedhar and Gao's algorithm for computing φ-functions in linear time. The only modification that is needed is to ensure that we need not visit all the nodes of a subtree rooted at a node y when y is a boundary node whose DF set is already known.

This change is reflected in Line 24 of Figure 4.2, where a subtree rooted at y is visited or not visited based on whether it is a boundary node or not.

For iterative merge set computation, TDMSC-II is an improvement to algorithm TDMSC-I. This improvement is fueled by the observation that, for an inconsistent node u, the merge sets of all nodes w such that w dominates u and w.level ≥ u.level can be locally corrected for some special cases. This eliminates extra passes, as an inconsistent node is made consistent immediately on being detected. This heuristic works very well for a certain class of problems – especially for CFGs with DF graphs having cycles consisting of only a few edges. For details of this algorithm refer to [?].

TODO: citations
• SSA form [?]
• Sreedhar & Gao [?]
• Das & Ramakrishna, no explicit DF graph construction [?]
• Ramalingam, loop nesting forest [?]
• computing IDF set without full DF set [?, ?]
• [?]


. . . .1 Live-Range Splitting In Figure 5. . . . maintaining SSA is often one of the more complicated and error-prone 59 . . . . . Let us first go through two examples before we present algorithms to properly repair SSA. such as loop unrolling or jump threading might duplicate code. Such program modifications are done by many optimizations. . and modify the control flow of the program. Hack Progress: 90% Text and figures formating in progress . . .1c.1 Introduction Some optimizations break the single-assignment property of the SSA form by inserting additional definitions for a single SSA value. This is indicated by assigning to the memory location X. . . . 5. . . . . . . . . . . our spilling pass decided to spill a part of the live range of the variable x 0 in the right block.1. Furthermore. This figure shows that maintaining SSA also involves placing φ -functions. . thus adding additional variable definitions. it inserted a store and a load instruction. . . . . hence SSA is violated and has to be reconstructed as shown in Figure 5. . . . 5. . . Other optimizations. . . A common example is liverange splitting by inserting copy operations or inserting spill and reload code during register allocation. .CHAPTER 5 SSA Reconstruction S. .1b. . . . . Not surprisingly. . . . The load however is a second definition of x 0 . . . Therefore. .

. Assume that the block containing that conditional branch has multiple predecessors and x can be proven constant for one of the predecessors. · · · ← x0 · · · ← x0 · · · ← x0 · · · ← x0 · · · ← x0 · · · ← x0 (a) Original program Fig. . . . . Consider following situation: A block contains a conditional branch that depends on some variable x.· ← x0 . . The first is an adoption of the classical dominance-frontier based algorithm. . . . . . ←← y1y1 . X ← spill x0 X ← spill x0 X ← spill x0 x ← reload X x ← reload X x ← reload X 1 1 1 . . this is shown by the assignment x1 ← 0. Hack) x0 ← · · · x0 ← · · · x0 ← · · · x0 ← · · · X ← spill x0 X ← spill x0 X ← spill x0 . . . The second performs a search . x1x ) 2 ← φ(x0 . ←← y1y1 . the code of the branch block is duplicated and attached to the predecessor. . Jump threading now partially evaluates the conditional branch by directly making the corresponding successor of the branch block a successor of the predecessor of the branch. . . . . 0 . . This also duplicates 1 1 1 Hence. x0 ← reload x X x0 ← reload X 0 ← reload X x2 ← φ(x0 . · · · ·← x0x0 ·← · ·. ← y1 . . y1 ) y3 . y ) . . · ← x0 . . · ·. ←← y1y1 y3 . . . . . . .2 Jump Threading Jump threading is a transformation performed by many popular compilers such as GCC and LLVM. . . . . the control flow has changed which poses an additional challenge for an efficient SSA reconstruction algorithm. . . . we will discuss two algorithms. . in contrast to the examples above. (y . . SSA is destroyed for potential definitions of variables in the branch block. . . . . . . It partially evaluates branch targets to save branches and provide more context to other optimizations. · · · ← x0 · ·. SSA broken (c) SSA reconstructed parts in such optimizations. the conditional branch tests if x > 0. . . x1 ) · · · ← x2 · · · ← x2 · · · ← x2 . . However. . those variables and has to be reconstructed. . . owing to the insertion of additional φ -functions and the correct redirection of the variable’s uses. 5. x1 ) x2 ← φ(x0 . . ←y 1 ← y3 . ←← y1y1 60 . y1 ) . . . . . .1 Adding a second definition (b) Spilling x 0 . . . . . . . . 5.1. . . . ← y3 ← y3 ← φ. . . . . In the example shown in Figure 5. . ← y1 ← y1 . . . . . .2 General Considerations In the following. . . . . . ← y1 y3 ← φ(y2 . . . 2 1 . . · ← x0 . . ← y1 2 2 Simple CFG CFG Simple CFG 2 Simple x0 ← · · · x0 ← · · · x0 ← · · · x0 ← · · · x0 ← · · · 5 SSA Reconstruction — (S. . . (y2 . . . · · · ← x0 ·← · ····← x0x0 ······ ← ←x x0 · ·. . .if x3 > 0 if x3 > 0 if x3 > 0 if x3 > 0 if x3 > 0 if x3 > 0 ← φ. . . . . .2.· ← x0 . . . . 5. In the example below. . . To this end. ←y 1 . . . . . . .

← · · · . . . . if x3 >if 0 x3 > 0 0 x3 > 0 if x3 > if 0 x3 >if x3 ← φ x (3 x1 ← . . ··· x0 . ..2 Jump Threading (b) Jump threaded. . . x2 x1 ←0 . A for v . However. In contrast to the first. Before reconstructing SSA x . . ← y1 ← y1 ← y3 ← y3 y3 . . . if x3 >if 0 x3 > 0 if x3 > 0 .. . . x1 ) x1 (x2 . ← y1 . . · 2 ··· x3 y1 .· ·x 2 ← ··· ·2 ← · x x1 ← 0x1 ← 0x1 0 . .. we assume that each instruction in the program only writes to . x2 ) 2) 31← 2) ( . we can a single variable. . . . 2 ← ·x . . . . x2φ ) . like in the examples above. xφ (x1 . . An optimization or transformation now violates SSA by inserting additional definitions for an existing SSA variable. ← y ← y ← y1 1 1 y1 . 0 A xand In the following. The original variable and the 3 Phi Optimization additional definitions can be seen as a single non-SSA variable that has multiple definitions uses. . y1 . . . . · x · ·· ... . . . . x1 ) For the flow graph (CFG) and is in SSA form and the dominance property holds.defs the B pseudocode denotes the set of all SSA variables that represent v . . x2 ) ··· if x3 > 0 x3 ← φ x (3 x1 ← . SSA broken (c) SSA reconstructed x0 x0 ··· x0 ··· ··· from the variables’ uses to the definition and places φ -functions on demand at X spill x0 appropriate places. . . . . . x x2 (x0 . y1 . .defs that contains C D D x0 . . y1 ). . . . First. . 5. . . . . ··· x2 sake of simplicity. . every definition of v is numbered. .. . . . yφ ) (3 y2 . . . . . y1 )( 1y . . . . . . xφ (x1 . y2 . . . x1 ) C x1 gorithm 9). . for every basic block b we compute a list b. . x1 ) B use of a variable is a pair consisting of a program point (a variable) and an integer denoting the index of the operand at the using instruction. Both algorithms presented in this chapter share the same driver routine (Al(x2 . . . . . y3 ← φ y (3 y2 ← . .. x2 ) x3 xx x (3 x1 ← . ··· x0 ··· x0 ··· x0 . What remains is associating every use of v with a suitable definition.SSA form. . . x2 ←. . . This establishes the single-assignment property. ·· . . . . . . . y1 2 Simple CFG Fig. . . ← ··· . . . y2 ← · y ·· ·· 2 ← ·y ··· 2 . xφ (x1x . . . . .3 x2 ) x2 2) . . x E . y1 ← · y ·· 1 ← · · · y1 ··· . . . ternal data structures when the CFG is modified. ← y1 ← y1← y1 ← y1 y1 (a) Original program . . it does not need to update its inspill x0 x1 reload X .. . the second algorithm might not con. . . . . ··· x0 · · the x0single-assignment property of the. . x0 ) 1 1 1 . . We consider the following x0 scenario: reload X The program is represented as a controlx2 (x 0 . . Due ·to then identify the program point of the instruction and the variable. . . . .1 Jump Threading 5. E . . . . . . . . . . (x1 . . . . ·· 1 y1 · ·y ·1 ← · y . struct minimal SSA form X in general. . . y1 ← · y ·· ·· 1 ← ··· 1 ← ··· y1 · ·y ·1 ← · y . . . A 0 . In x2 (xv.2 General Considerations 61 x1 0 x2 ··· 0 x1 ← 0x1 . . .. x1 ← 0x1 ← x0 ·· 2 ← ·x 2 . . .. v will be such a non-SSA variable.

order: d = l break # no local definition was found. we start looking for the reaching definition at the end of that block. The differences are described in the next two sections. find_def_from_end additionally considers definitions in the block. Here we have to differentiate whether the user is a φ -function or not. while find_def_from_begin does not. Algorithm 9: SSA Reconstruction Driver proc ssa_reconstruct(variable v): for d in v. we have to find one that reaches this block from outside. This list is sorted according to the schedule of the instructions in the block from back to front.defs.defs: if l. we scan the block’s list backwards until we find the first definition in front of the use. Hence. If there is no such definition.block if u. Otherwise.is_phi(): b = u.defs.pred(index) d = find_def_from_end(b) else: for l in b.defs: b = d.empty(): return b. the variable’s use occurs at the end of the predecessor block that corresponds to the position of the variable in the φ ’s argument list.defs according to schedule # latest definition is first in list for each use (u. Hack) all instructions in the block which define one of the variables in v.block. respectively. all uses of the variable v are scanned. Algorithm 9 shows that find_def_from_end can be expressed in terms of find_def_from_begin. the latest definition is the first in the list. If it is. In that case. index) of v: d = None b = u.block insert d in b.62 5 SSA Reconstruction — (S. Then. search in the predecessors if not d: d = find_def_from_begin(b) rewrite use at u to d proc find_def_from_end(block b): if not b. Our two approaches only differ in the implementation of the function find_def_from_begin. We use two functions find_def_from_begin and find_def_from_end that find the reaching definition at the beginning and end of a block.defs[0] return find_def_from_begin(b) .order < u.

Algorithm 10 shows this procedure in pseudo-code. .append(d) i = 0 for p in b. . The operands of the newly created φ -function will query their reaching definitions by recursive calls to find_def_from_end. . . IDF) set i-th operand of d to o i = i + 1 else: d = find_def_from_end(b.3 Reconstruction based on the Dominance Frontier 63 . . . The operands of that φ -function are then (recursively) searched in the predecessors of the block. . . . . . . as described in chapter (TODO: reference to the chapter). . no infinite recursion can occur. . . . . . . .defs. . we search for each use u the corresponding reaching definition. Then. . . . . This search starts at the block of u . . the search continues in the block’s immediate dominator. This φ -function has to be inserted into the list of definitions of v in that block. . Algorithm 10: SSA Reconstruction based on Dominance Frontiers proc find_def_from_begin(block b): if b in IDF: d = new_phi(b) b. . . 5. . the reaching definition is the same for all predecessors and hence for the immediate dominator of this block. Therefore. . . . .preds: o = find_def_from_end(p. every use of a variable must be dominated by its definition1 . If that block is in the IDF of v a φ -function needs to be placed at its entrance. . If the block is not in the IDF. . . We compute the iterated dominance frontier (IDF) of v . This set is a sound approximation of the set where φ -functions must be placed (it might contain blocks where a φ -function would be dead).defs before searching for the arguments. . .5. .idom) return d 1 The definition of an operand of a φ -function has to dominate the according predecessor block. . . . This is because in SSA. Because we inserted the φ -function into the b. . . . . . . . . .3 Reconstruction based on the Dominance Frontier This algorithm follows the same principles as the classical SSA construction algorithm by Cytron at al.

. In programs with loops. . some unnecessary φ -functions can not be removed by phi_necessary. . B . If multiple variables have to be reconstructed. as shown in Figure 5. . Its major advantage over the algorithm presented in the last section is that it does neither require dominance information nor dominance frontiers. . We record the SSA variable that corresponds to the reaching definition of v at the end of the block in the field def of the block. inserting φ -functions on the fly. . it also works well on control flow graphs.4 Search-based Reconstruction The second algorithm we present here is similar to the construction algorithm that Click describes in his thesis [76]. We perform a backward depth-first search in the CFG to collect the reaching definitions of the variable in question at each block. To avoid infinite recursion. x } where x is the φ -function inserted at b and w some other variable. this check is done by the function phi_necessary. . . C . . e. there are blocks for which not all reaching definitions can be computed before we can decide whether a φ -function has to be placed. the algorithm can be applied to each variable separately. . This can happen in loops where φ -functions can become unnecessary because other φ -functions are optimized away.g. . D . Its disadvantage is that potentially more blocks have to be visited during the reconstruction. . we only consider the reconstruction for a single variable called v in the following. . . . Recursively computing the reaching definitions for a block b can end up at b itself. . . If the set of reaching definitions is a subset of {w . We look for a definition of x from block E . While the φ -function in block C can be eliminated by the local criterion applied by phi_necessary. we create a φ -function phi without operands in the block before descending to the predecessors. . A . . . . it can be the case that the local optimization which Algorithm 11 performs by calling phi_necessary does not remove all unnecessary φ -functions. . . . while caching the found definitions at the basic blocks. . because the depth-first search carried out by find_def_from_begin will not visit . . When we return to b we decide whether a φ -function has to be placed in b by looking at the reaching definition for every predecessor. . C . the φ -function placed in the header eventually reaches itself (and can later be eliminated). Hence. . . . . If Algorithm 11 considers the blocks in a unfavorable order. Although his algorithm is designed to construct SSA from the abstract syntax tree. all definitions that reach b can be computed before we decide whether to place a φ -function in b or not. . . This is. . all predecessors of a block can be visited before the block itself is processed (post-order traversal). Consider the example in Figure 5. . Hack) . .64 5 SSA Reconstruction — (S. In Algorithm 11. . . .3. . . . . the φ -function in block B remains. . As in the last section. then the φ -function is not necessary and we can propagate w further downwards. If the CFG is a DAG. .3b. If the CFG has loops. . we need to place a φ -function. . The principle idea is to start a search from every use to find the corresponding definition. if a variable has no definition in a loop. . 5. Thus it is well suited to be used in transformations that change the control flow graph. . . Hence. D . . If more than one definition reaches the block.

. For reducible control flow. . . . . .def pending = make_phi() reaching_defs = [] for p in b. .preds: reaching_defs += find_def_from_end(p) definite = phi_necessary(pending. . . . .def = definite return definite # checks. . . such as loop unrolling or live-range splitting destroy the single-assignment property of the SSA form. This is the case if it is a subset of { other. . . pending_phi } proc phi_necessary(variable pending_phi. . . . . if the parameter list in reaching_defs makes a phi function # necessary. . . . reaching_defs) b. . . . .4 illustrates that this does not hold for the irreducible case: . . .def != None: return b. . Both algorithms rely on computed def-use chains because they traverse all uses from a . . . . . a routine is called that restores SSA. list of variables reaching_defs): other = None for d in reaching_defs: if d == pending_phi: continue if other == None: other = d elif other != d: return pending_phi # this assertion is violated if reaching_defs only contains pending_phi # which never can happen assert other != None return other block B a second time. the local criterion can be iteratively applied to all placed φ -functions until a fixpoint is reached. The classical (∗)-graph in Figure 5.5 Conclusions 65 Algorithm 11: Search-based SSA Reconstruction proc find_def_from_begin(block b): if b. . . this then produces the minimal number of placed φ functions. . . In this chapter we presented two generic algorithms to reconstruct SSA. . .5. . . 5. . To remove the remaining φ -functions. . . . reaching_defs) if pending == definite: set_arguments(definite. . . .5 Conclusions Some optimizations. . The algorithms are independent of the transformation that violated SSA and can be used as a black box: For every variable for which SSA was violated. . .

However. x1 ) x D B x2 (x0 . Furthermore. (x0 . reload X reload X ··· x0 ··· x0 B B ·· x· 2. x0 spill x0 reload X .. x1 ) ..3 Removal of unnecessary φ -functions 1 x0 . x1 ) x2 x0 x0 ··· ··· x0 x0 ··· 5 SSA Reconstruction — (S. it is less suited for optimizations that also change the flow of control since that would require recomputing the iterated dominance frontiers. Hence..... it is well 2 .. x1 ) E C x1 (x2 .4 The irreducible (∗)-Graph 1 SSA variable to find suitable definitions. the algorithm can find the reaching definitions quickly by scanning the dominance tree upwards. .. x E . Hack) 3 A X Phi Optimization ··· X ··· x0 ··· x0 x0 .. Thus. x1 ) x1 (x2 .... by using the iterated dominance frontiers. spill x0 . one could also envision applying incremental algorithms to construct the dominance tree [218. . D . 5. A x0x 1 x2 x0 . x1 ) x2 (x0 . x D E . . both algorithms differ in their prerequisites and their runtime behavior: 1.. . 2. . .. reload X Simple CFG ··· x0 ··· 66 (x0 . 5. x (a) Original program (b) Unnecessary φ -functions placed Fig.. . On the other hand.1 y1 y1 y1 ··· x0 ··· x0 y3 ··· x0 X .. x0 ) Fig. . spill x0 x0 x . x x2 (x0 . this has not yet been done and no practical experiences have been reported so far. x1 x2 ··· . The second algorithm does not depend on additional analysis information such as iterated dominance frontiers or the dominance tree. x1 ) Phi Optimization . C C x1 (x2 .. 240] and the dominance frontier [241] to account for changes in the control flow. [82]. The first one [66] is based on the iterated dominance frontiers like the classical SSA construction algorithm by Cytron et al. . . A x0 ..

Both approaches construct pruned SSA (TODO: cite other chapter). Such a nonlocal check also eliminates unnecessary φ -functions in the irreducible case.. The first approach produces minimal SSA by the very same arguments Cytron et al. .e.5 Conclusions 67 suited for transformations that change the CFG because no information needs to be recomputed. [82] give. no φ function is dead. i. These two local rules can be extended to a non-local one which then has to find strongly connected φ -components that all refer to the same exterior variable. The second one does only produce minimal SSA (in the sense of Cytron) in the case of reducible control flow. This follows from the iterative application of the function phi_necessary which implements the two simplification rules presented by Aycock and Horspool [21]. They show that the iterative application of both rules yields minimal SSA on reducible graphs.5. On the other hand. it might find the reaching definitions slower than the first one because they are searched by a depth-first search in the CFG.


2. Beringer Progress: 70% Structural reordering in progress This chapter discusses alternative representations of SSA using the terminology and structuring mechanisms of functional programming languages. explicitly in syntax. 1. improving the compiler’s robustness. less code code is required. 69 . The development of functional representations of SSA is motivated by the following considerations. reformulating SSA as a functional program makes explicit some of the syntactic conditions and semantic invariants that are implicit in the definition and use of SSA. The reading of SSA as a discipline of functional programming arises from a correspondence between dominance and syntactic scope that subsequently extends to numerous aspects of control and data flow structure. by enforcing a particular naming discipline. and code readability. maintainability. “the variables assigned to by these φ -functions must be distinct”. In a similar way. namely the def-use relationships. functional representations directly enforce invariants such as “all φ -functions in a block must be of the same arity”. Constraints such as these would typically have to be validated or (re-)established after each optimization phase of an SSA-based compiler. relating the core ideas of SSA to concepts from other areas of compiler and programming language research provides conceptual insight into the SSA discipline and thus contributes to a better understanding of the practical appeal of SSA to compiler writers. but are typically enforced by construction if a functional representation is chosen. or “each use of a variable should be dominated by its (unique) definition”. “φ -functions are only allowed to occur at the beginning of a basic block”. the introduction of SSA itself was motivated by a similar goal: to represent aspects of program structure. Indeed.def-use chains CHAPTER 6 Functional representations of SSA L. Consequently.

a formal basis for comparing variants of SSA – such as the variants discussed elsewhere in this book –. functional intermediate code can often be directly emitted as a program in a high-level mainstream functional language. Rather than discussing all these considerations in detail. Armed with this background. 5. they can be expected to generalize more easily to interprocedural analyses than other static analysis formalisms. the purpose of the present chapter is to informally highlight particular aspects of the correspondence and then point the reader to some more advanced material. Like the remainder of the book. our discussion is restricted to code occurring in a single procedure. giving the compiler writer access to existing interpreters and compilation frameworks. pointers to the original literature. facilitating the rapid implementation of interpreters. without requiring SSA destruction. As type systems for functional languages typically support higher-order functions. We briefly discuss historical aspects. formal frameworks of program analysis that exist for functional languages become applicable. Correctness criteria for program analyses and associated transformations can be stated in a uniform manner. we then discuss the relationship between SSA and functional forms in more detail by describing how various steps that occur during the construction and destruction of SSA mirror corresponding transformations on functional code.70 6 Functional representations of SSA — (L. 4. . and can be proved to be satisfied using well-established reasoning principles. Chapter outline We start by discussing the fundamental observation that (imperative) points of definition correspond to sites of variable binding in functional languages. the intuitive meaning of “unimplementable” φ -instructions is complemented by a concrete execution model. Beringer ) 3. Our exposition is example-driven but leads to the identification of concrete correspondence pairs between the imperative/SSA world and the functional world. This enables the compiler developer to experimentally validate SSA-based analyses and transformations at their genuine language level. for translating between these variants. Thus. and for constructing and destructing SSA is obtained. rapid prototyping is supported and high-level evaluation of design decisions is enabled. Type systems provide a particularly attractive formalism due to their declarativeness and compositional nature. before concluding with some remarks on alternative representations of SSA that do not employ the terminology of functional languages. namely continuation-passing and direct style. We then outline two low-level functional formats that may be used to represent SSA. and topics of current and future work. Indeed.

x n ) = e (6.6.1) where the syntactic category of expressions e conflates the notions of expressions and commands of imperative languages. . yielding the overall result of 72. . . For example. . For example. . . Bindings are treated in a stack-like fashion. .1 Low-level functional program representations Functional languages represent code using declarations of the form function f (x 0 . . . . . . .2) end spans the both inner let-bindings. . the binding of v to 3 comes back into force. . . The code affected by this binding. . the scope associated with the top-most binding of v in code let v = 3 in let y = (let v = 2 ∗ v in 4 ∗ v in y ∗ v end v .1. . . e 2 . . A declaration of the form (6.1) binds the variables x i and also the function name f within e . . . . . . . . is called the static scope of x and is easily syntactically identifiable. . . . in the above code. 6. 6. . resulting in a tree-shaped nesting structure of boxes in our code excerpts. . . we occasionally indicate scopes by code-enclosing boxes. .y v end) v (6. Once this evaluation has terminated (resulting in the binding of y to 24). . . and list the variables that are in scope using subscripts. . . The effect of this expression is to evaluate e 1 and bind the resulting value to variable x for the duration of the evaluation of e 2 . In the following. . . . . . . e may contain further nested or (mutually) recursive function declarations. . . the inner binding of v to value 2 ∗ 3 = 6 shadows the outer binding of v to value 3 precisely for the duration of the evaluation of the expression 4 ∗ v . . . .1 Low-level functional program representations 71 . . . Typically. . the scopes of which are themselves not nested inside one other as the (inner) binding to v occurs in the e 1 position of the letbinding for y . a let-binding for variable x hides any previous value bound to x for the duration of evaluating e 2 but does not permanently overwrite it. .1 Variable assignment versus binding A language construct provided by almost all functional languages is the letbinding let x = e 1 in e 2 end. . . . In contrast to an assignment in an imperative language.

functional languages achieve the association of definitions to uses without imposing the global uniqueness of variables. and program transformations α-rename bound variables whenever necessary. the semantics-preserving renaming of bound variables is called α-renaming. and thus a property typically enjoyed by functional languages.3) end Note that this conversion formally makes the outer v visible for the expression 4 ∗ z . As a consequence of this decoupling. Thus. Hence.72 6 Functional representations of SSA — (L.indeed any occurrences of x or y in e 1 refer to conceptually different but identically named variables. and can be calculated from the meaning of its subexpressions. we may rename the inner v in code (6. as witnessed by the duplicate binding occurrences for v in the above code. is compositional equational reasoning : the meaning of a piece of code e is only dependent its free variables. program analyses for functional languages are compatible with α-renaming in that they behave equivalently for fragments that differ only in their choice of bound variables. functional languages enjoy a strong notion of referential transparency : the choice of x as the variable holding the result of e 1 depends only on the free variables of e 2 . Indeed. In order to avoid altering the meaning of the program. the meaning of a phrase let x = e 1 in e 2 end only depends on the free variables of e 1 and on the free variables of e 2 other than x . but the static scoping discipline ensures these will never be confused with the variables involved in the renaming. this means that a renaming let x = e 1 in e 2 end to let y = e 1 in e 2 [y /x ] end can only be carried out if y is not a free variable of e 2 . A consequence of referential transparency.y v . For example. Typically. the point of definition for a use of x is given by the nearest enclosing binding of x . In contrast to SSA. in case that e 2 already contains some preexisting bindings to y . Occurrences of variables in an expression that are not enclosed by a binding are called free. Also note that the renaming only affects e 2 . Formally.2) to z without altering the meaning of the code: let v = 3 in let y = (let z = 2 ∗ v in 4 ∗ z in y ∗ v end v . as indicated by the index v . languages with referential transparency allow . A wellformed procedure declaration contains all free variables of its body amongst its formal parameters. the substitution of x by y in e 2 (denoted by e 2 [y /x ] above) first renames these preexisting bindings in a suitable manner. the notion of scope makes explicit a crucial invariant of SSA that is often left implicit: each use of a variable should be dominated by its (unique) definition. Beringer ) The concepts of binding and static scope ensure that functional programs enjoy the characteristic feature of SSA. namely the fact that each use of a variable is uniquely associated with a point of definition. Moreover. In general. the choice of the newly introduced variable has to be such that confusion with other variables is avoided. z decorating its surrounding box.z end) v (6. For example.

6. i.e where x typically occurs free in e .2). 2 ∗ x .1 Low-level functional program representations 73 one to replace a subexpression by some semantically equivalent phrase without altering the meaning of the surrounding code. as is the case for the variable k in let v = 3 in let y = (let v = 2 ∗ v in 4 ∗ v end) . i. 6. a program representation routinely used in compilers for functional languages. It is common practice to write these continuation-defining expressions in λ-notation. For example. are typically applied to argument expressions). k represents any function that may be applied to the result of expression (6.1.e. the suitability of SSA for formulating and implementing such transformations can be explained by the proximity of SSA to functional languages.e. a continuation specifies how the execution should proceed once the evaluation of the current code fragment has terminated. a client of the above code fragment wishing to multiply the fragment’s result by 2 may insert code (6.4) In effect. as λ acts as a binder) place-holder for the argument to which the continuation is applied. in k (y ∗ v ) end end (6. Syntactically. 2 ∗ x in let v = 3 in let y = (let z = 2 ∗ v in 4 ∗ z end) in k (y ∗ v ) end end end .. continuations are expressions that may occur in functional position (i.2 Control flow: continuations The correspondence between let-bindings and points of variable definition in assignments extends to other aspects of program structure. in particular to code in continuation-passing-style (CPS). Surrounding code may specify the concrete continuation by binding k to a suitable expression. as in the following code: let k = λ x . formal parameter x represents the (α-renameable. in the form λ x . Since semantic preservation is a core requirement of program transformations. The effect of the expression is to act as the (unnamed) function that sends x to e (x ).e.4) in the e 2 position of a let-binding for k that (in its e 1 position) contains λ x . Satisfying a roughly similar purpose as return addresses or function pointers in imperative languages.

Contrary to the functional representation. k ) = let x = 4 in let k = λ z .7) constructs from k a continuation k that is invoked (with different arguments) in each branch of the conditional.4) in a function definition with formal argument k and construct the continuation in the calling code. In a similar way. k (z ∗ x ) in if y > 0 then let z = y ∗ 2 in k (z ) end else let z = 3 in k (z ) end end end (6. just like in an ordinary function application. The SSA form of this CFG is shown in Figure 6. Alternatively. as in function g (k ) = let k = λ x .e 72) is substituted for x in the expression 2 ∗ x .it roughly corresponds to the frame slot that holds the return address in an imperative procedure call. the top-level continuation parameter k is not explicitly visible in the CFG . where he would be free to choose a different name for the continuation-representing variable: function f (k ) = let v = 3 in let y = (let z = 2 ∗ v in 4 ∗ z end) in k (y ∗ v ) end (6. the (dynamic) value y ∗ v (i.6) where f is invoked with a newly constructed continuation k that applies the multiplication by 2 to its formal argument x (which at runtime will hold the result of f ) before passing the resulting value on as an argument to the outer continuation k . as indicated by the CFG corresponding to h in Figure 6.5) end in let k = λ x . k (2 ∗ x ) in f (k ) end. the caller of f is itself parametric in its continuation. Beringer ) When the continuation k is applied to the argument y ∗ v .1 (left). 2 ∗ x in f (k ) end end. Typically. (6. we obtain .7).74 6 Functional representations of SSA — (L. the sharing of k amounts to the definition of a control flow merge point. If we apply similar renamings of z to z 1 and z 2 in the two branches of (6.1 on the right. the client may wrap fragment (6. the function function h (y . In effect.

6.8) only involves the referentially transparent process of α-renaming indicates that program (6. Programs in CPS equip all functions declarations which continuation arguments. By interspersing ordinary code with continuation-forming expressions as shown above. the so-called direct-style representation. . and invoking continuations. 6. As expected from the above understanding of scope and dominance. Before examining further aspects of the relationship between CPS and SSA. k ) = let x = 4 in let k = λ z .7) already contains the essential structural properties that SSA distills from an imperative program. The fact that transforming (6.1 Control flow graph for code (6. they model the flow of control exclusively by communicating.8) We observe that the role of the formal parameter z of continuation k is exactly that of a φ -function: to unify the arguments stemming from various calls sites by binding them to a common name for the duration of the ensuing code fragment – in this case just the return expression.1 Low-level functional program representations 75 Fig. the scopes of the bindings for z 1 and z 2 coincide with the dominance regions of the identically named imperative variables: both terminate at the point of function invocation / jump to the control flow merge point. k (z ∗ x ) in if y > 0 then let z 1 = y ∗ 2 in k (z 1 ) end else let z 2 = 3 in k (z 2 ) end end end (6. we discuss a close relative of CPS. constructing.7) (left). and SSA representation (right) function h (y .7) into (6.

Thus.y .h . the effect is similar to the imperative compilation discipline of always setting the return address of a procedure call to the instruction pointer immediately following the call instruction.1. code (6.z .7) may be represented as function h (y ) = let x = 4 in in if y > 0 then let z = y ∗ 2 in h (z ) else let z = 3 in h (z ) end x .9) x .h . no continuation terms are constructed dynamically and then passed as function arguments. h simply returns control back to the caller. who is then responsible for any further execution.h .z . rather than handing its result directly over to some caller-specified receiver (as communicated by the continuation argument k ).h (6.h end end y .h x . Also note that neither the declaration of h nor that of h contain additional continuation parameters. and we hence exclusively employ λ-free function definitions in our representation. A stricter format is obtained if the granularity of local functions is required to be that of basic blocks: . however.76 6 Functional representations of SSA — (L.3 Control flow: direct style An alternative to the explicit passing of continuation terms via additional function arguments is to represent code as a set of locally named tail-recursive functions.h end x .h where the local function h plays a similar role as the continuation k and is jointly called from both branches.h function h (z ) = z ∗ x x . In direct style. the body of h returns its result directly rather than by passing it on as an argument to some continuation. In contrast to the CPS representation. For example. Beringer ) 6.z .h .y .y . a representation that is often referred to as direct style.y .y . Roughly speaking.

In our discussion below.1 Low-level functional program representations 77 function h (y ) = let x = 4 in function h (z ) = z ∗ x in if y > 0 then function h 1 () = let z = y ∗ 2 in h (z ) end in h 1 () end else function h 2 () = let z = 3 in h (z ) end in h 2 () end end end (6. and has so far not yet led to a clear consensus in the literature. The choice between CPS and direct style is orthogonal to the granularity level of functions: both CPS and direct style are compatible with the strict notion of basic blocks but also with more relaxed formats such as extended basic blocks. the role of the formal parameter z of the control flow merge point function h is identical to that of a φ -function. reflecting more directly the CFG from Figure 6. The exploration of the resulting design space is ongoing. In accordance with the fact that the basic blocks representing the arms of the conditional do not contain φ -functions.11) Again.e. In the extreme case. the process of moving from the CFG to the SSA form is again captured by suitably α-renaming the bindings of z in h 1 and h 2 : function h (y ) = let x = 4 in function h (z ) = z ∗ x in if y > 0 then function h 1 () = let z 1 = y ∗ 2 in h (z 1 ) end in h 1 () end else function h 2 () = let z 2 = 3 in h (z 2 ) end in h 2 () end end end (6. function invocations correspond precisely to jumps.. the local functions h 1 and h 2 have empty parameter lists – the free occurrence of y in the body of h 1 is bound at the top level by the formal argument of h . we employ the arguably easier-to-read direct style. i. . one local function or continuation is introduced per instruction. Independent of the granularity level of local functions. all control flow points are explicitly named.6. but the gist of the discussion applies equally well to CPS.10) Now.1.

1.3) pulls the let-binding for z to the outside of the binding for y .78 6 Functional representations of SSA — (L. Using this convention. and function arguments must be names or constants. neither jumps (conditional or unconditional) nor let-bindings are primitive.y end y subject to the side condition that y not be free in e . interspersed occasionally by the definition of continuations or local functions.z (6. In particular. x . (6. While still enjoying referential transparency.y .12) Programs in let-normal form thus do not contain let-bindings in the e 1 -position of outer let-expressions.13) Summarizing our discussion up to this point. The stack discipline in which let-bindings are managed is simplified as scopes are nested inside each other. Let-normalized form is obtained by repeatedly rewriting code of the form let x = let y = e in e in e end x y end into let y = e in let x = e in e end. Exploiting this correspondence. we apply a simplified notation in the rest of this chapter where the scope-identifying boxes (including their indices) are omitted and chains of let-normal bindings are abbreviated by single comma-separated (and ordered) let-blocks. For example.4 Let-normal form For both direct style and CPS the correspondence to SSA is most pronounced for code in let-normal-form: each intermediate result must be explicitly named by a variable. letnormal-form isolates basic instructions in a separate category of primitive terms a and then requires let-bindings to be of the form let x = a in e end. yielding let v = 3 in let z = 2 ∗ v in end end v let y = 4 ∗ z in y ∗ v v . let-normalizing code (6.1 collects some correspondences between functional and imperative/SSA concepts. Syntactically. let-normal code is in closer correspondence to imperative code than nonnormalized code as the chain of nested let-bindings directly reflects the sequence of statements in a basic block. Beringer ) 6. Table 6. . y =4∗z in y ∗ v end let v = 3.z end v . code (6.12) becomes z = 2 ∗ v.

6. . Functional concept Imperative/SSA concept subterm relationship control flow successor relationship arity of function f i number of φ -functions at beginning of b i distinctness of formal parameters of f i distinctness of LHS-variables in the φ -block of b i number of call sites of function f i arity of φ -functions in block b i parameter lifting/dropping addition/removal of φ -function block floating/sinking reordering according to dominator tree structure potential nesting structure dominator tree Table 6.2. . . . . . . . .2 Correspondence pairs between functional form and SSA: program structure Fig. . . .1 Correspondence pairs between functional form and SSA (part I) . .2 Functional construction of SSA: running example 6. . . . The body of each f i arises by intro- . . . . . . . . . . . using the program in Figure 6. . . . . . . . . . . .2 as a running example. We discuss some of these aspects by considering the translation into SSA. .2 Functional construction and destruction of SSA Functional concept Imperative/SSA concept 79 variable binding in let assignment (point of definition) α-renaming variable renaming unique association of binding occurrences to uses unique association of defs to uses formal parameter of continuation/local function φ -function (point of definition) lexical scope of bound variable dominance region Table 6.2. . . 6. . . . . . 6. . . .1 Initial construction using liveness analysis A simple way to represent this program in let-normalized direct style is to introduce one function f i for each basic block b i .2 Functional construction and destruction of SSA The relationship between SSA and functional languages is extended by the correspondences shown in Table 6. . . .

v 2 ) else f 2 (v 2 . As the function declarations in code (6. z . • variable names are not unique. z .2): each formal parameter of a function f i is the target of one φ -function for the corresponding block b i . y 2 ) = let x 1 = 5 + y 2 . . v 3 ) = let w 1 = y 4 + v 3 in w 1 end in f 1 () end (6. The arguments of these φ -functions are the arguments in the corresponding positions in the calls to f i . v ) else f 2 (v . y ) end and f 2 (v . z 1 = 8. z 2 . Beringer ) ducing one let-binding for each assignment and converting jumps into function calls. y 1 ) end and f 2 (v 2 . the above construction of parameter lists amounts to equipping each b i with φ -functions for all its live-in variables. If desired. As the number of arguments in each call to f i coincides with the number of f i ’s formal parameters. The resulting code (6. z 1 .80 6 Functional representations of SSA — (L. y ) end and f 3 (y . we choose an arbitrary enumeration of its live-in variables. z . In order to determine the formal parameters of these functions we perform a liveness analysis.15) Under this perspective. x = x − 1 in if x = 0 then f 3 (y . x 2 = x 1 − 1 in if x 2 = 0 then f 3 (y 3 . y = 4 in f 2 (v . the φ -functions in b i are all of the same arity. all variable renamings are independent from each other. z 2 . • each subterm e 2 of a let-binding let x = e 1 in e 2 end corresponds to the control flow successor of the assignment to x . y ) = let x = 5 + y . v ) = let w = y + v in w end in f 1 () end The resulting program has the following properties: (6. z = 8. y 3 = x 1 ∗ z 2 . we choose an arbitrary enumeration of these call sites. y 1 = 4 in f 2 (v 1 . y 3 ) end and f 3 (y 4 . We collect all function definitions in a block of mutually tail-recursive functions at the top level: function f 1 () = let v = 1. but the unique association of definitions to uses is satisfied.15) corresponds precisely to the SSA-program shown in Figure 6. and also as the list of actual arguments in calls to f i . For each basic block b i .3 (see also Table 6. with subsequent 1 Apart from the function identifiers f i which can always be chosen distinct from the variables.14) are closed. we may α-rename to make names globally unique. function f 1 () = let v 1 = 1. We then use this enumeration as the list of formal parameters in the declaration of the function f i . namely the number of call sites to f i . In order to coordinate the relative positioning of the arguments of the φ -functions. y = x ∗ z .14) • all function declarations are closed: the free variables of their bodies are contained in their formal parameter lists1 .
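The construction of the function interfaces can be condensed into a few lines. The following OCaml sketch (the block type and helper names are our own simplification) derives the formal parameter list of f i from the live-in variables of b i and reuses the same enumeration for the actual arguments of every call to f i, so that the i-th argument position of each call corresponds to the i-th φ-function of the target block.

    type block = { name : string; live_in : string list }

    (* Formal parameters of the function representing block b: a fixed
       enumeration of b's live-in variables. *)
    let params_of (b : block) : string list = b.live_in

    let header_of (b : block) : string =
      Printf.sprintf "function f_%s (%s)" b.name (String.concat ", " (params_of b))

    (* Calls use the same enumeration, so argument positions line up with
       the phi-functions of the target block. *)
    let call_to (b : block) : string =
      Printf.sprintf "f_%s (%s)" b.name (String.concat ", " (params_of b))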

While resulting in a legal SSA program, the construction clearly introduces more φ-functions than necessary. In particular, the above method corresponds to the construction of pruned (but not minimal) SSA (see Chapter 2). Each superfluous φ-function corresponds to the situation where all call sites to some function f i pass identical arguments. The technique for eliminating such arguments is called λ-dropping, and is the inverse of the more widely known transformation λ-lifting.

Fig. 6.3 Pruned (non-minimal) SSA form

6.2.2 λ-dropping

λ-dropping may be performed before or after variable names are made distinct, but for our purpose the former option is more instructive. The transformation consists of two phases, block sinking and parameter dropping.

6.2.2.1 Block sinking

Block sinking analyzes the static call structure to identify which function definitions may be moved inside each other. For example, whenever our set of function declarations contains definitions f (x 1 , . . . , x n ) = e f and g (y 1 , . . . , y m ) = e g where f ≠ g such that all calls to f occur in e f or e g , we can move the declaration for f into that of g – note the similarity to the notion of dominance. Thus, algorithms for computing the dominator tree from a CFG discussed elsewhere in this book can be applied to identify block sinking opportunities, where the CFG is given by the call graph of functions. If applied aggressively, block sinking indeed amounts to making the entire dominance tree structure explicit in the program representation.
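A minimal sketch of dominator-tree-driven block sinking, assuming the immediate-dominator relation has already been computed by one of the algorithms mentioned above; all names are illustrative. Each dominatee becomes a local function declared inside its immediate dominator (here placed at the beginning of the host).

    (* idom lists (block, immediate dominator) pairs. *)
    let children (idom : (string * string) list) (b : string) : string list =
      List.filter_map (fun (c, d) -> if d = b then Some c else None) idom

    (* Print one possible nesting: dominatees first, then the host's body. *)
    let rec emit idom indent b =
      Printf.printf "%sfunction f_%s (...) =\n" indent b;
      List.iter (emit idom (indent ^ "  ")) (children idom b);
      Printf.printf "%s  (* body of %s, with calls to its CFG successors *)\n" indent b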

y ) end end in f 1 () end (6. . . function f 1 () = let v = 1.16) An alternative strategy is to insert f near the end of its host function g . one would insert f directly prior to its call if g contains only a single call site for f . y ) = function f 3 (y . x n ) = e f in e g end. . Several options exist as to where f should be placed in its host function. y ) end end in let v = 1. in the vicinity of the calls to f . Beringer ) In our example. y = x ∗ z . . f 3 is only invoked from within f 2 . x = x − 1 in if x = 0 then function f 3 (y . y = 4 in f 2 (v . . . . v ) else f 2 (v .82 6 Functional representations of SSA — (L. z = 8.14) yields the following code: function f 1 () = function f 2 (v . In case that g contains multiple call sites for f . y ) end end in f 1 () end (6. these are (due to . z = 8. v ) = let w = y + v in w end in let x = 5 + y . . and f 2 is only called in the bodies of f 2 and f 1 (see the dominator tree in Figure 6. v ) end else f 2 (v . . z . y ) = let x = 5 + y . x = x − 1 in if x = 0 then f 3 (y . by rewriting to function g (y 1 . Applying this transformation to example (6. v ) = let w = y + v in w end in f 3 (y . We may thus move the definition of f 3 into that of f 2 . . Again. z . the alternative strategy would yield code (6. z . y = 4 in function f 2 (v .17).3 (right)). y ) end in f 2 (v . The first option is to place f at the beginning of g . y m (and also into the scope of g itself ) does not alter the bindings to which variable uses inside e f refer. and the latter one into f 1 .17) In general. y m ) = function f (x 1 . y = x ∗ z . z . z . z . . as the declaration of f is closed: moving f into the scope of g ’s formal parameters y 1 . This transformation does not alter the meaning of the code. This brings the declaration of f additionally into the scope of all let-bindings in e g . . referential transparency and preservation of semantics are respected as the declaration on f is closed. In case of our example code.

In case . In (6.e. and v will always be bound to the value passed via the parameter v of f 2 . In both cases. we may simultaneously drop a parameter x occurring in all declarations of a block of mutually recursive functions f 1 .3).17) can be removed without altering the meaning of the code. the tightest scope for x enclosing the declaration of f coincides with the tightest scope for x surrounding each call site to f outside f ’s declaration. Removing a parameter x from the declaration of some function f has the effect that any use of x inside the body of f will not be bound by the declaration of f any longer. Similarly. if the scope for x at the point of declaration of the block coincides with the tightest scope in force at any call site to some f i outside the block. a call to f in the body of f ) is the one associated with the formal parameter x in f ’s declaration.17). these conditions sanction the removal of both parameters from the nonrecursive function f 3 .2.16) and (6. The scope applicable for v at the site of declaration of f 3 and also at its call site is the one rooted at the formal parameter v of f 2 .17) both nest f 3 inside f 2 inside f 1 . In our example. the tightest scope for x is the one associated with the formal parameter x of f j .2. . in accordance with the dominator tree of the imperative program (see Figure 6. . and we would insert f directly prior to this conditional. For example. .2 Functional construction and destruction of SSA 83 their tail-recursive positioning) in different arms of a conditional. since the binding applicable at the call site determines the value that is passed as the actual argument (before x is deleted from the parameter list). as we can statically predict the values they will be instantiated with: parameter y will be instantiated with the result of x ∗ z (which is bound to y in the first line in the body of f 2 ). we thus have to ensure that this f -containing binding for x also contains any call to f . Both outlined placement strategies result in code whose nesting structure reflects the dominance relationship of the imperative code. and also from the argument lists of the corresponding function calls. the two parameters of f 3 in (6.2 Parameter dropping The second phase of λ-dropping. f n . we may drop a parameter x from the declaration of a possibly recursive function f if 1. the bindings for y and v at the declaration site of f 3 are identical to those applicable at the call to f 3 . code (6. the tightest scope for x enclosing any recursive call to f (i. In general. parameter dropping. . In order to ensure that removing the parameter does not alter the meaning of the program. In particular. and 2. 6. removes superfluous parameters based on the syntactic scope structure. and if in each call to some f i inside some f j . dropping a parameter means to remove it from the list of formal parameter lists of the function declarations concerned.6. but by the (innermost) binding for x that contains the declaration of f .

y ) end in f 2 (v . z = 8. y = 4 in function f 2 (y ) = let x = 5 + y . function f 1 () = let v = 1.17)) is in general preferable to inserting them at the beginning of their hosts (as in code (6.84 6 Functional representations of SSA — (L. y = 4 in function f 2 (v . y = x ∗ z .18). Thus.19) back in SSA yields the desired minimal code with a single φ -function. neither v nor z have binding occurrences in the body of f 2 . In contrast. The scopes applicable at the external call site to f 2 coincide with those applicable at its site of declaration and are given by the scopes rooted in the let-bindings for v and z . y ) end end in f 1 () end (6.16)): the placement of function declarations in the vicinity of their calls . z . z . y = x ∗ z . parameters v and z may be removed from f 2 : function f 1 () = let v = 1. preventing us from removing y . x = x − 1 in if x = 0 then function f 3 () = let w = y + v in w end in f 3 () end else f 2 (v . z = 8. x = x − 1 in if x = 0 then function f 3 () = let w = y + v in w end in f 3 () end else f 2 (y ) end in f 2 (y ) end end in f 1 () end (6. Beringer ) of y . for variable y at the beginning of block b 2 – see Figure 6. We thus obtain code (6. The reason that this φ -function can’t be eliminated (the fact that y is redefined in the loop) is precisely the reason why y survives parameterdropping.19) Interpreting the uniquely-renamed variant of (6. y ) = let x = 5 + y .18) Considering the recursive function f 2 next we observe that the recursive call is in the scope of the let-binding for y in f 2 ’s body. Given this understanding of parameter dropping we can also see why inserting functions near the end of their hosts during block sinking (as in code (6. the common scope is the one rooted at the let-binding for y in the body of f 2 . z .4.

3 Nesting.4 SSA code after λ-dropping potentially enables the dropping of more parameters. Thus. The choice as to where functions are placed corresponds to variants of SSA. the arity of these trivial Φ-nodes grows with the number 2 Refinements of this representation will be sketched in Section 6. As the loop is unrolled. analyzing whether function definitions may be nested inside one another is tantamount to analyzing the imperative dominance structure: function f i may be moved inside f j exactly if all non-recursive calls to f i come from within f j exactly if all paths from the initial program point to block b i traverse b j exactly if b j dominates b i . For example. 6. in loop-closed SSA form (see chapters 18 and 11). specialpurpose unary Φ-nodes are inserted for these variables at the loop exit points. as witnessed by our use of the let-binding construct for binding code-representing expressions to the variables k in our syntax for CPS. hence we can always nest f inside g . An immediate consequence of this strategy is that blocks with a single predecessor indeed do not contain φ -functions.2 Functional construction and destruction of SSA 85 Fig. functional languages do not distinguish between code and data when aspects of binding and use of variables are concerned. loop-closure As we observed above. all parameters of f can be dropped and no φ -function is generated. Indeed. dominance. Inserting f in e g directly prior to its call site implies that condition (1) is necessarily satisfied for all parameters x of f . This observation is merely the extension to function identifiers of our earlier remark that lexical scope coincides with the dominance region.6. SSA names that are defined in a loop must not be used outside the loop. 6.2. Such a block b f is necessarily dominated by its predecessor b g . Thus. namely those that are letbound in the host function’s body.3 . where all children of a node are represented as a single block of mutually recursive function declarations2 . To this end. and that points of definition/binding occurrences should dominate the uses. the dominator tree immediately suggests a function nesting scheme.

to obtain code (6.21) coincides with the dominance structure of the original imperative code and its loop-closed SSA form. function f 1 () = let v = 1. x = x − 1 in if x = 0 then f 3 (y .21) contains the desired loop-closing Φ-node for y at the beginning of b 3 . z . y = x ∗ z . and the program continuation is always supplied with the value the variable obtained in the loop’s final iteration.21) The SSA form corresponding to (6.20) We may now drop v (but not y ) from the parameter list of f 3 . y ) end end in f 1 () end (6. z . x = x − 1 in if x = 0 then f 3 (y ) else f 2 (y ) end end in f 2 (y ) end end in f 1 () end (6. we would still like to drop as many parameters from f 2 as possible. v ) else f 2 (v . the only loopdefined variable used in f 3 is y . preceding the let-binding for y . We unroll the loop by duplicating the body of f 2 . as shown in Figure 6. Hence we apply the following placement policy during block sinking: functions that are targets of loop-exiting function calls and have live-in variables that are defined in the loop are placed at the beginning of the loop headers. y ) = function f 3 (y . v ) = let w = y + v in w end in let x = 5 + y . z . Of course. y = 4 in function f 2 (v .86 6 Functional representations of SSA — (L. z = 8. without duplicating the declaration of f 3 : . and v and z from f 2 . In our example. Beringer ) of unrollings.20) and (6.14) yields (6. y = 4 in function f 2 (y ) = function f 3 (y ) = let w = y + v in w end in let x = 5 + y . Applying this policy to our original program (6.5 (top).21). The nesting structure of both (6. Other functions are placed at the end of their hosts. y = x ∗ z .and we already observed in code (6. z = 8. y ) end end in f 2 (v .16) how we can prevent the dropping of y from the parameter list of f 3 : we insert f 3 at the beginning of f 2 . function f 1 () = let v = 1.20).

22) Both calls to f 3 are in the scope of the declaration of f 3 and contain the appropriate loop-closing arguments. Moreover. y = 4 in function f 2 (y ) = function f 3 (y ) = let w = y + v in w end in let x = 5 + y .22). x = x − 1 in if x = 0 then f 3 (y ) else f 2 (y ) end in f 2 (y ) end end in f 2 (y ) end end in f 1 () end (6. y = x ∗ z . x = x − 1 in if x = 0 then f 3 (y ) else function f 2 (y ) = let x = 5 + y . respectively.5 (bottom) – the first instruction in b 3 has turned into a nontrivial Φ-node. y = x ∗ z .6. the parameters of this Φ-node correspond to the two control flow arcs leading into b 3 .2 Functional construction and destruction of SSA 87 Fig. corresponding to codes (6.22). one for each call site to f 3 in code (6. 6.21) and (6. z = 8. In the SSA reading of this code – shown in Figure 6. the call and .5 Loop-closed and loop-unrolled forms of running example program. As expected. function f 1 () = let v = 1.

nesting structure of (6.22) is indeed in agreement with the control flow and dominance structure of the loop-unrolled SSA representation.

6.2.4 Destruction of SSA

The above example code excerpts where variables are not made distinct exhibit a further pattern: the argument list of any call coincides with the list of formal parameters of the invoked function. Programs that do obey this discipline can be immediately converted to imperative non-SSA form, although the target format is not immune against α-renaming and thus only syntactically a functional language. This discipline is not enjoyed by functional programs in general, however, and is often destroyed by optimizing program transformations. Thus, the task of SSA destruction amounts to converting a functional program with arbitrary argument lists into one where argument lists and formal parameter lists coincide for each function. Appropriate transformations can be formulated as manipulations of the functional representation, in line with the results of [187]. This can be achieved by introducing additional let-bindings of the form let x = y in e end: for example, a call f (v , z , y ), where f is declared as function f (x , y , z ) = e , may be converted to let x = v , a = z , z = y , y = a in f (x , y , z ) end, in correspondence to the move instructions introduced in imperative formulations of SSA destruction (see Chapters 3 and 22). Instead of introducing let-bindings for all parameter positions of a call, a local algorithm that considers each call site individually and avoids the “lost-copies” and “swap” problems identified by Briggs et al. [48] can easily be given [29]. The algorithm scales with the number and size of cycles that span identically named arguments and parameters (like the cycle between y and z above), and employs a single additional variable (called a in the above code) to break all these cycles one by one; a sketch of such a sequentialization is given below.

6.3 Refined block sinking and loop nesting forests

As discussed when outlining λ-dropping, block sinking is governed by the dominance relation between basic blocks. Thus, a typical dominance tree with root b and subtrees rooted at b 1 , . . . , b n is most naturally represented as a block of function declarations for the f i , nested inside the declaration of f , as shown in code (6.23) below.
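Before turning to the refined placements, the following OCaml fragment sketches the copy-insertion step of SSA destruction described in Section 6.2.4: the parallel copy from actual arguments to formal parameters is sequentialized into ordinary moves, with one auxiliary variable breaking the cycles. It is our own illustration, not the exact algorithm of the cited papers.

    (* Sequentialize a parallel copy given as (dest, src) pairs with pairwise
       distinct destinations, using the single auxiliary variable "a". *)
    let sequentialize (moves : (string * string) list) : (string * string) list =
      let tmp = "a" in
      let moves = List.filter (fun (d, s) -> d <> s) moves in
      let rec go pending acc =
        match pending with
        | [] -> List.rev acc
        | _ ->
            let is_read v = List.exists (fun (_, s) -> s = v) pending in
            (match List.partition (fun (d, _) -> not (is_read d)) pending with
             | (d, s) :: ready, blocked ->
                 (* d is not read by any pending move: safe to overwrite it *)
                 go (ready @ blocked) ((d, s) :: acc)
             | [], (d, s) :: rest ->
                 (* only cycles remain: save d's value in the auxiliary and
                    let the moves that read d read the auxiliary instead *)
                 let retarget (d', s') = if s' = d then (d', tmp) else (d', s') in
                 go ((d, s) :: List.map retarget rest) ((tmp, d) :: acc)
             | [], [] -> List.rev acc)
      in
      go moves []

For the call f (v, z, y) to a function declared as f (x, y, z), i.e. the parallel copy [("x","v"); ("y","z"); ("z","y")], this yields the sequence x := v; a := y; y := z; z := a, one of several valid orders.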

6(b). .6. (∗ body of b ∗) . consider the CFG in Fig- Fig.24). b i ∗) in . .) = e n (∗ body of b n . end By exploiting additional control flow structure between the b i . . In the case of a reducible CFG. . . (∗ calls to f5 and g ∗) . . . A possible (but not unique) ordering of b ’s children is [b 5 . . in function f 5 (.) = e 5 in function f 1 (.) = e g (∗ contains call to f 3 ∗) in . (∗ calls to f 1 . . with calls to b . resulting in the nesting shown in code (6.3 Refined block sinking and loop nesting forests 89 function f (.e. . . . the descendants of b i in the dominance tree) to b j .) = e 3 in function f 2 (. . .6 Example placement for reducible flow graphs.) = e 1 (∗ body of b 1 .) = e 4 (∗ contains calls to f 3 and f 4 ∗) in . Ignoring these self-loops. we perform a post-order-DFS (or more generally a reverse topological ordering) amongst the b i and stagger the function declarations according to the resulting order. .) = let . in function f 1 (. namely placements that correspond to notions of loop nesting forests that have been identified in the SSA literature. . . end (6. . b 4 ]. . (∗ b s calls to b . . . . . . . b 3 . b 1 . b 2 . . .23) and f n (. however. 6. . . . b i ∗) . .) = e 1 (∗ contains calls to f and f 5 ∗) in function f 3 (. end in function f 4 (. .24) . . b i ∗) : (6. . . the resulting graph contains only trivial loops. . function f (. (∗ body of f ∗) .6(a) and its enriched dominance tree shown in Figure6. in function g (. . with calls to b . .) = let . . f 2 . . (a) control flow graph (b) overlay of CFG (dashed edges) arcs between dominatees onto dominance graph (solid edges) ure 6. . . . . f 4 . f 5 ∗) . (∗ body of f 2 ∗) .) = let . it is. As an example. possible to obtain refined placements. These refinements arise if we enrich the above dominance tree by adding arrows b i → b j whenever the CFG contains a directed edge from one of b i ’s dominance successors (i.

but additionally makes f 1 inacessible from within e 5 . . i. its only control flow predecessors are b 2 / g and b 4 .e. (∗ body of f ∗) . f 4 ) but f 3 remains non-recursive. the CFG G 0 shown in Figure 6. even when self-loops are ignored. . Beringer ) The code respects the dominance relationship in much the same way as the naive placement. and makes f 3 inaccessible from within f 1 or f 5 . different partitionings are possible. {u . x }. For example. . . the scheme introduced by Steensgaard is particularly appealing from a functional perspective. H ) are precisely the entry nodes of its body B .90 6 Functional representations of SSA — (L. In Steengaard’s notion of loops. e 5 in let f 1 = λ p 1 . . (∗ body of f 2 ∗) . f 4 ) ends here ∗) in . e 3 in (λ p 2 . As the reordering does not move function declarations inside each other (in particular: no function declaration is brought into or moved out of the scope of the formal parameters of any other function) the reordering does not affect the potential to subsequently perform parameter dropping. . . the functional representation not only depends on the chosen DFS order but additionally on the partitioning of the enriched graph into loops. f 5 ∗) . In general. w . corresponding to different notions of loop nesting forests (LNF’s). We would hence like to make the declaration of f 3 local to the tuple ( f 2 . those nodes in B that have a predecessor outside of B . the role of f 3 is played by any merge point b i that is not directly called from the dominator node b . . the headers H of a loop L = ( B . e 1 (∗contains calls to f and f 5 ∗) in letrec ( f 2 . This can be achieved by combining let/letrec binding with pattern matching. . the enriched dominance tree is no longer acyclic. if we insert the shared declaration of f 3 between the declaration of the names f 2 and f 4 and the λ-bindings of their formal parameters p i : letrec f = λ p . f 4 . (∗ calls to f 5 and g ∗) . let . e 4 (∗ contains calls to f 3 and f 4 ∗)) (6. e g (∗ contains call to f 3 ∗) in . .7 contains the outer loop L 0 = ({u . This enables us not only to syntactically distinguish between loops and nonrecursive control flow structures using the distinction between let and letrec present in many functional languages. . in let g = λ p g . but also to further restrict the visibility of function names. f 4 ) = let f 3 = λ p 3 . . In the case of irreducible CFG’s. i. Indeed. v }). in let f 5 = λ p 5 . end (∗ declaration of tuple ( f 2 . let . f 4 ). . whose constituents B 0 are determined as the maximal SCC of . end. .e. (∗ calls to f 1 . . while b 3 is immediately dominated by b in the above example. . f 2 . end λ p 4 . Of the various LNF strategies proposed in the SSA-literature. A further improvement is obtained if functions are declared using λ-abstraction.25) The recursiveness of f 4 is inherited by the function pair ( f 2 . invisible to f . In this case. As each loop forms a strongly connected component (SCC). v .

. The LNF resulting from Steensgaard’s contruction is shown in Figure 6. 6. .6. In our example. h n . h n } yields a function declaration block for functions h 1 . x }). terminating the process. H ) with headers H = {h 1 . whose (only) SCC determines a further. 6. is nested inside the definition of L 0 . w ) and (x . Loops are drawn as ellipsis that are decorated with the appropriate header nodes.e. . Fig. with private declarations for the non-headers from B \ H . . . x }.8(b). and likewise the body of L 1 once the cycles (u . . Fig. . Removing L 0 ’s back-edges (i. Figure 6. The body of loop L 0 is easily identified as the maximal SCC. loop L 1 = ({w . L 1 . (b) resulting loop nesting forest. a loop L = ( B .8(a) shows the CFG-enriched dominance tree of G 0 . . In the functional representation. the loop comprised of the latter nodes. in accordance with the LNF. . Removing L 1 ’s back-edges from G 1 results in the acyclic G 2 . Instead. {w . edges from B 0 to H 0 ) from G 0 yields G 1 .7 Illustration of Steensgaard’s construction of loop nesting forests (from Ramalingam [?]): Iterative construction of graphs G 1 and G 2 . loop L 0 provides entry points for the headers u and v but not for its non-headers w and x .8 Illustration of Steensgaard’s construction of loop nesting forests: (a) CFG-enriched dominance tree.3 Refined block sinking and loop nesting forests 91 G 0 . and are nested in accordance with the containment relation B 1 ⊂ B 0 between the bodies. inner. v ) are broken by the removal of L 0 ’s backdeges w → u and x → v.
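The iterative structure of the construction can be sketched as follows; the graph representation, the names, and the helper that returns the non-trivial strongly connected components (for instance via Tarjan's algorithm, ignoring self-loops) are our own simplification, and termination assumes that every loop body found has at least one header.

    type node = string
    type loop = { headers : node list; body : node list; inner : loop list }

    let rec loops_of (edges : (node * node) list)
                     (sccs : (node * node) list -> node list list) : loop list =
      List.map
        (fun body ->
           let inside n = List.mem n body in
           (* headers: nodes of the body with a predecessor outside the body *)
           let headers =
             List.filter
               (fun n -> List.exists (fun (p, q) -> q = n && not (inside p)) edges)
               body in
           (* keep the edges inside the body, minus the back-edges leading to
              a header, and look for inner loops in the remaining graph *)
           let edges' =
             List.filter
               (fun (p, q) -> inside p && inside q && not (List.mem q headers))
               edges in
           { headers; body; inner = loops_of edges' sccs })
        (sccs edges)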

. . 165]. and exit ∗). 6. .) = let . . . . (∗ calls from entry to u and v ∗) . . . . . As the relationship between the trees is necessarily acyclic. As the bodies of these SCC’s are necessarily nonoverlapping. v ∗) letrec (u . (∗ body of v . . . Appel [15. . in (∗ Define outer loop L 0 . . x . Early studies of CPS and direct style include [223. (∗ body of w . . . . the functional reading of the LNF extends the earlier correspondence between the nesting of individual functions and the dominance relationship to groups of functions and basic blocks: loop L 0 dominates L 1 in the sense that any path from entry to a node in L 1 passes through L 0 . 224. . with calls to u . . . . the construction yields a forest comprised of trees shaped like the LNF in Figure 6. . (∗ body of x . . . .25) and making exit private to L 1 . . λp v . . in ( λp w . . any path from entry to a header of L 1 passes trough a header of L 0 . . . with call to x ∗) ) end (∗ End of outer loop ∗) in . see Reynolds [225] and Wadsworth [269]. . we obtain a representation (6. λp x . . . . .4 Pointers to the literature Shortly after the introduction of SSA.92 6 Functional representations of SSA — (L. [253] . . . . . . v ) = (∗ Define inner loop L 1 .26) By placing L 1 inside L 0 according to the scheme from code (6. as a CFG may contain several maximal SCC’s. . . . (∗ body of entry ∗) . . . (6. . . 215]. . . .8(b)). For retrospective accounts of the historical development. (∗ body of u . . Two prominent examples of CPS-based compilers are those by Sussman et al. x ) = let exit = λp exit . . . x ∗) letrec (w . Effectively. . . . . 14] popularized the correspondence using a direct-style representation. each step of Steensgaard’s construction may identify several loops.26) which captures all the essential information of Steensgaard’s construction. building on his earlier experience with continuation-based compilation [13]. . . . . . . and exit ∗) ) end (∗ End of inner loop ∗) in (λp u . . with headers u . . . . In general. . . . . with calls to w . Continuations and low-level functional languages have been an object of intensive study since their inception about four decades ago [265. . . more specifically. with headers w . with call to w ∗). . . . . Beringer ) function entry(. the declarations of the function declaration tuples corresponding to the trees can be placed according to the loop-extended notion of dominance. v . O’Donnell [201] and Kelsey [153] noted the correspondence between let-bindings and points of variable declaration and its extension to other aspects of program structure using continuation-passing style.

4 Pointers to the literature 93 and Appel [13]. Closely related to continuations and direct-style representation are monadic intermediate languages as used by Benton et al. Occasionally. This allows one to treat side-effects (memory access. Gao and Lee’s construction would essentially yield the nesting mentioned at the beginning of Section 6. or of loop analyses in the functional world that correspond to LNF’s. and thus simplifies reasoning about program analyses and the associated transformations in the presence of impure language features. [28] and Peyton-Jones et al. in particular with respect to their integration with program analyses and optimizing transformations. the task of correctly introducing φ -compensating assignments. Beringer et al. dataflow analyses and type systems. Bform [254].. 215]). 222. a uniform representation is obtained that admits optimizing transformations to be implemented efficiently. which also served as the source of the example in Figure 6. algorithms for formal conversion between these formats. [226] present an in-depth study of SSA destruction. Two alternative constructions discussed by Ramalingam are those by Sreedhar. and their efficient compilation to machine code. and SIL [258]. avoiding the administrative overhead to which some of the alternative approaches are susceptible. To us. and the conditions on the latter notion are strengthened so that only variables may appear as branch conditions. ) in a uniform way. [89]. the fact that entry points of loops are not necessarily classified as headers appears to make an elegant representation in functional form at least challenging. where the approach to explicitly name control flow points such as merge points is taken to its logical conclusion. similar to the isolation of primitive terms in let-normal form (see also [224. [30] . Lambda-lifting and dropping are well-known transformations in the functional programming community. Gao and Lee [?]. . following Moggi [192]. and Havlak [?].7. remain an active area of research. the term direct style refers to the combination of tail-recursive functions and let-normal form. 154]. These partition expressions into categories of values and computations. i. Our discussion of Steensgaard’s construction was based on a classification of LNF’s by Ramalingam [?]. it appears that a functional reading of Sreedhar. Extending the syntactic correspondences between SSA and functional languages. Variations of this discipline include administrative normal form (A-normal form.3. exceptions.. A particularly appealing variant is that of Kennedy [154]. By mandating that all control flow points be explicitly named. Chakravarty et al. Rideau et al. including a verified implementation in the proof assistant Coq for the “windmills”-problem. ANF [118]).6. and are studied in-depth by Johnsson [148] and Danvy et al. [64] prove the correctness of a functional representation of Wegmann and Zadeck’s SSAbased sparse conditional constant propagation algorithm [274].e. The relative merit of the various functional representations. We are not aware of previous work that transfers the analysis of loop nesting forests to the functional setting. IO. similarities may be identified between their characteristic program analysis frameworks. Regarding Havlak’s construction. [210]. Recent contributions include [88. .


consider data flow equations for liveness and read-once variables, and formally translate their solutions to properties of corresponding typing derivations. Laud et al. [166] present a formal correspondence between dataflow analyses and type systems but consider a simple imperative language rather than SSA. The textbook [199] presents a unifying perspective on various program analysis techniques, including data flow analysis, abstract interpretation, and type systems. As outlined in loc. cit., the extension of type systems to effect systems is particularly suitable for integrating data with control flow analyses. Indeed, as functional languages treat code and data in an integrated fashion, type-based analyses provide a natural setting for developing extensions of (SSA-based) program analyses to inter-procedural analyses. For additional pointers on type-based program analysis, see [207].


6.5 Concluding remarks
In addition to low-level functional languages, alternative representations for SSA have been proposed, but their discussion is beyond the scope of this chapter. Glesner [127] employs an encoding in terms of abstract state machines [130] that disregards the sequential control flow inside basic blocks but retains control flow between basic blocks to prove the correctness of a code generation transformation. Later work by the same author uses a more direct representation of SSA in the theorem prover Isabelle/HOL for studying further SSA-based analyses. Matsuno and Ohori [186] present a formalism that captures core aspects of SSA in a notion of (non-standard) types while leaving the program text in unstructured non-SSA form. Instead of classifying values or computations, types represent the definition points of variables. Contexts associate program variables at each program point with types in the standard fashion, but the nonstandard notion of types means that this association models the reaching-definitions relationship rather than characterizing the values held in the variables at runtime. Noting a correspondence between the types associated with a variable and the sets of def-use paths, the authors admit types to be formulated over type variables whose introduction and use corresponds to the introduction of φ -nodes in SSA. Finally, Pop et al.’s model [216] dispenses with control flow entirely and instead views programs as sets of equations that model the assignment of values to variables in a style reminiscent of partial recursive functions. This model is discussed in more detail in Chapter 11.


Part II

Analysis

CHAPTER

7

Introduction

A. Ertl



7.1 benefits of SSA

7.2 properties: suffixed representation can use dense vector, vs DF needs (x, PP) hashtable

7.3 phi node construction precomputes DF merge

7.4 dominance properties imply that simple forward analysis works

7.5 Flow insensitive somewhat improved by SSA

7.6 building def-use chains: for each SSA variable can get pointer to next var

7.7 improves SSA DF-propagation

CHAPTER
8
Propagating Information using SSA
F. Brandner, D. Novillo

8.1 Overview

A central task of compilers is to optimize a given input program such that the resulting code is more efficient in terms of execution time, code size, or some other metric of interest. However, in order to perform these optimizations, typically some form of program analysis is required to determine if a given program transformation is applicable, to estimate its profitability, and to guarantee its correctness. Data flow analysis is a simple yet powerful approach to program analysis that is utilized by many compiler frameworks and program analysis tools today. Traditionally, data flow analysis is performed on a control flow graph representation (CFG) of the input program. Nodes in the graph represent operations and edges represent the potential flow of program execution. Information on certain program properties is propagated among the nodes along the control flow edges until the computed information stabilizes, i.e., no new information can be inferred from the program.

We will introduce the basic concepts of traditional data flow analysis in this chapter and will show how the static single assignment form (SSA) facilitates the design and implementation of equivalent analyses. We will also show how the SSA property allows us to reduce the compilation time and memory consumption of the data flow analyses that this program representation supports.

The propagation engine presented in the following sections is an extension of the well known approach by Wegman and Zadeck for sparse conditional constant propagation (also known as SSA-CCP). Instead of using the CFG, they represent the input program as an SSA graph as defined in Chapter 18: operations are again represented as nodes in this graph; the edges, however, represent data dependencies instead of control flow. This representation allows a selective propagation of program properties among data dependent graph nodes only. As before, the propagation is typically performed iteratively until the computed results stabilize. The basic algorithm is not limited to constant propagation and can also be applied to solve a large class of other data flow problems efficiently. However, not all data flow analyses can be modeled; we will also investigate the limitations of the SSA-based approach.

The remainder of this chapter is organized as follows. First, the basic concepts of (traditional) data flow analysis are presented in Section 8.2. This will provide the theoretical foundation and background for the discussion of the SSA-based propagation engine in Section 8.3. We then provide an example of a data flow analysis that can be performed efficiently by the aforementioned engine, namely copy propagation in Section 8.4. We conclude and provide links for further reading in Section 8.5.

8.2 Preliminaries

Data flow analysis is at the heart of many compiler transformations and optimizations, but also finds application in a broad spectrum of analysis and verification tasks in program analysis tools such as program checkers, profiling tools, and timing analysis tools. This section gives a brief introduction to the basics of data flow analysis. Due to space considerations, however, we cannot cover this topic in full depth.

As noted before, data flow analysis derives information from certain interesting program properties that may help to optimize the program. Typical examples of interesting properties are: the set of live variables at a given program point, the particular constant value a variable may take, or the set of program points that are reachable at run-time. Liveness information, for example, is critical during register allocation, while the two latter properties help to simplify computations and to identify dead code. The analysis results are gathered from the input program by propagating information among its operations considering all potential execution paths. The processing stops when the information associated with the graph nodes stabilizes. Formally, a data flow problem can be specified using a monotone framework that consists of a complete lattice representing the property space, a flow graph resembling the control flow of the input program, and a set of transfer functions modeling the effect of individual operations on the property space.
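A minimal sketch of how these ingredients can be collected in code, assuming an OCaml module signature of our own design; a concrete analysis instantiates the signature and is then handed to a generic work-list solver.

    module type DATA_FLOW_PROBLEM = sig
      type prop                            (* the property space L *)
      val bottom   : prop                  (* least element of the lattice *)
      val leq      : prop -> prop -> bool  (* the partial order on L *)
      val join     : prop -> prop -> prop  (* least upper bound; the meet is dual *)

      type node                            (* nodes of the flow graph *)
      val succs    : node -> node list     (* control flow successors *)
      val transfer : node -> prop -> prop  (* effect of the node's operation *)
    end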

Property Space: A key concept for data flow analysis is the representation of the property space via partially ordered sets (L ..e. However.2 Preliminaries 103 • a set of transfer functions modeling the effect of individual operations on the property space. A particularly interesting class of partially ordered sets are complete lattices. the information from those edges has to be combined using the meet or join operator.. i. a complete lattice. where L represents some interesting program property and represents a reflexive. a flow graph.e. while the latter is termed the meet operator. as well as least upper and greatest lower bounds. where all subsets have a least upper bound as well as a greatest lower bound. if a node has multiple incoming edges. Using the relation.2. An analogous definition can be given for descending chains. . Such analyses are termed backward analyses. l n } of a complete lattice. ). A very popular and intuitive way to solve these equations is to compute the maximal (minimal) fixed point (MFP) . An ascending chain is a totally ordered subset {l 1 . and a set of transfer functions. reverse the direction of the edges in the control flow graph. . Data flow information is then propagated from one node to another adjacent node along the respective graph edge using in and out sets associated with every node. while those using the regular flow graph are forward analyses. can be defined for subsets of L . 8. . data can be simply copied from one set to the other. the operations of the program need to be accounted for during analysis. Program Representation: The functions of the input program are represented as control flow graphs. These bounds are unique and are denoted by and respectively. upper and lower bounds. i. This framework describes a set of data flow equations whose solution will ultimately converge to the solution of the data flow analysis. transitive. often denoted by ⊥ and respectively. . and edges denote the potential flow of execution at run-time. Complete lattices have two distinguished elements. Every operation is thus mapped to a transfer function. Usually these operations change the way data is propagated from one control flow node to the other. the least element and the greatest element. where ∀i > m : l i = l m . it is helpful to reverse the flow graph to propagate information. Transfer Functions: Aside from the control flow. In the context of program analysis the former is often referred to as the join operator. which transforms the information available from the in set of the operation’s flow graph node and stores the result in the corresponding out set. and antisymmetric relation.8. yields an instance of a monotone framework. Sometimes. If there exists only one edge connecting two nodes. where the nodes represent operations. or instructions. A chain is said to stabilize if there exists an index m .1 Solving Data Flow Problems Putting all those elements together.

The basic idea is to directly propagate information computed at the unique definition of a variable to all its uses. . . the lattice for copy propagation consists of the cross product of many smaller lattices. using the meet or join operator. .3 Data Flow Propagation under SSA Form SSA form allows us to solve a large class of data flow problems more efficiently than the iterative work list algorithm presented previously. we have to propagate data flow to all relevant program points. . this gives an O (|V |2 · h ) general algorithm to solve data flow analyses. . . . . Recalling the definition of chains from before (see Section 8. Since the number of edges is bounded by the number of graph nodes |V |. . . E ) it is easy to see that the MFP solution can be computed in O (|E | · h ) time. . . . more precisely. . . Novillo) using an iterative work list algorithm. . Visiting an edge consists of first combining the information from the out set of the source node with the in set of the target node. . . . . The total height of the lattice thus directly depends on the number of variables in the program. Nodes are required to hold informations even when these are not directly related to the node. .2). The algorithm terminates when the data flow information stabilizes and the work list becomes empty. . . which ensures that the lattice has finite height. where |E | represents the number of edges. . 8. . . . . . . . . . hence. the height of a lattice is defined by the length of its longest chain. |E | ≤ |V |2 . . . We can ensure termination for lattices fulfilling the ascending chain condition. . In this way. then applying the transfer function of the target node. . We are thus interested in bounding the number of times a flow edge is processed. . . The obtained information is then propagated to all successors of the target node by appending the corresponding edges to the work list. . . It may even happen that an infinite feedback loop prevents the algorithm from terminating. . . . . . D. . thus reducing memory consumption and compilation time. intermediate program points that neither define nor use the variable of interest do not have to be taken into consideration. . Note that the height of the lattice often depends on properties of the input program. . . . . Brandner. A single flow edge can be appended several times to the work list in the course of the analysis. which might ultimately yield bounds worse than cubic in the number of graph nodes. each representing the potential values of a variable occurring in the program. In terms of memory consumption. For instance. Given a lattice with finite height h and a flow graph G = (V .104 8 Propagating Information using SSA— (F. each node must store complete in and out sets. . The work list contains edges of the flow graph that have to be revisited.

Consider for example the code excerpt shown in Figure 8. edges represent true dependencies.1 Example program and its SSA graph. propagates the desired information directly from definition to use sites. x2 . 5: x3 = φ x1 . ( b ) SSA-graph. Fig. . At the same time. Data flow analyses under SSA form rely on a specialized program representation based on SSA graphs. the SSA graph captures the relevant join points of the CFG of the program. which resemble traditional def-use chains and simplify the propagation of data flow information. In the following. ( c ) Control flow graph. without any intermediate step. 6: z1 = x3 + y1 2: if (. The edges of the graph connect the unique definition of a variable with all its uses.8. we assume that possibly many φ -operations are associated with the corresponding CFG nodes at those join points. at the beginning of the code. down to its unique use at the end. The traditional CFG representation causes the propagation to pass through several intermediate program points. The SSA form properties ensure that a φ -operation is placed at the join point and that any use of the variable that the φ -function defines has been properly updated to refer to the correct name.1 Program Representation Programs in SSA form exhibit φ -operations placed at join points of the original CFG.) 3: then x1 = 4. 8. 2: if (.e. x2 . The SSA graph representation. we also find that the control flow join following . The nodes of an SSA graph correspond to the operations of the program. 5: x3 = φ x1 . and x3 . Besides the data dependencies. A join point is relevant for the analysis whenever the value of two or more definitions may reach a use by passing through that join. x2 6: z1 = x3 + y1 ( a ) SSA pseudo code. i. along with its corresponding SSA graph and CFG. .3 Data Flow Propagation under SSA Form 105 1: y1 = 6 3: x1 = 4 4: x2 = 5 1: y1 = 6. These program points are concerned only with computations of the variables x1 . . Assume we are interested in propagating information from the assignment of variable y1 .3. including the φ -operations that are represented by dedicated nodes in the graph. 6: z1 = x3 + y1 . 8. .. and are thus irrelevant for y1 .1. x2 1: y1 = 6 4: else x2 = 5. on the other hand.) 3: x1 = 4 4: x2 = 5 5: x3 = φ x1 .
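One possible encoding of such an SSA graph is sketched below in OCaml; the field and constructor names are our own simplification. Each node is the unique definition of a value, its outgoing SSA edges are the uses of that value, and the data flow fact is stored once per definition rather than per program point.

    type op_kind =
      | Phi    of string list           (* one operand per incoming CFG edge *)
      | Instr  of string * string list  (* operator and operands *)
      | Branch of string                (* branch condition *)

    type 'fact ssa_node = {
      kind         : op_kind;
      cfg_block    : int;                   (* CFG node the operation belongs to *)
      mutable fact : 'fact;                 (* fact of the defined variable *)
      mutable uses : 'fact ssa_node list;   (* SSA edges: definition -> uses *)
    }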

a control flow graph capturing the execution flow of the program. D. due to the shortcomings of the SSA graph. For other operations. However. .3.2 Sparse Data Flow Propagation Similar to monotone frameworks for traditional data flow analysis. Continue with step 2 until both work lists become empty. proceed as follows: • Mark the edge as executable. the φ -operation will select the value of x2 in all cases. 2. visit all its operations (Algorithm 13). If the element is a control flow edge.1. Remove the top element of one of the two work lists. Consequently. • The SSAWorkList is empty. a set of transfer functions for the operations of the program. process the target operation as follows: a. Novillo) Algorithm 12: Sparse Data Flow Propagation 1. • Visit every φ -operation associated with the edge’s target node (Algorithm 13). • The CFGWorkList is seeded with the outgoing edges of the CFG’s start node. b. When the target operation is a φ -operation visit that φ -operation. which is known to be of constant value 5. Even though the SSA graph captures data dependencies and the relevant join points in the CFG. Initialization • Every edge in the CFG is marked not executable. • If the target node was reached the first time via the CFGWorkList. 8. • If the target node has a single. 5.106 8 Propagating Information using SSA— (F. and an SSA graph representing data dependencies. 3. frameworks for sparse data flow propagation under SSA form can be defined using: • • • • a complete lattice representing the property space. and assume that the condition associated with the if operation is known to be false for all possible program executions. However. this information cannot be derived. append that edge to the CFGWorkList. analysis results can often be improved significantly by considering the additional information that is available from the control dependencies in the CFG. It is thus important to use both the control flow graph and the SSA graph during data flow analysis in order to obtain the best possible results. 4. non-executable outgoing edge. Visit the operation if any of the edges is executable. If the element is an edge from the SSA graph. As an example consider once more the code of Figure 8. examine the executable flag of the incoming edges of the respective CFG node. it lacks information on other control dependencies. the if is properly represented by the φ -operation defining the variable x3 in the SSA graph. Brandner.

which contains edges from the SSA graph.8. Once the algorithm has determined that a CFG edge is executable. However. It proceeds by removing the top element of either of those lists and processing the respective edge. the control flow graph is used to track which operations are not reachable under any program execution and thus can be safely ignored during the computation of the fixed point solution.2—are obsolete. it will be processed by Step 3 of the main algorithm. Determine all outgoing edges of the branch’s CFG node whose condition is potentially satisfied. which containing edges of the control flow graph.3 Data Flow Propagation under SSA Form 107 Algorithm 13: Visiting an Operation 1. Throughout the main algorithm. Append the CFG edges that were non-executable to the CFGWorkList. i. Other operations: Update the operation’s data flow information by applying its transfer function. We again seek a maximal (minimal) fixed point solution (MFP) using an iterative work list algorithm. where the data flow analysis cannot rule out that a program execution traversing the edge exists. b. the operation of . Whenever the data flow information of an operation changes. since φ -operations already provide the required buffering. For regular uses the propagation is then straightforward due to the fact that every use receives its value from a unique definition. and the SSAWorkList. all φ -operations of its target node need to be reevaluated due to the fact that Algorithm 13 discarded the respective operands of the φ -operations so far—because the control flow edge was not yet marked executable. Special care has to be taken only for φ -operations. operations of the program are visited to update the work lists and propagate information using Algorithm 13. As data flow information is propagated along SSA edges that have a single source. c. Propagate data flow information depending on the operation’s kind: a. We will highlight the most relevant steps of the algorithms in more detail in the following paragraphs. The algorithm is shown in Algorithm 12 and processes two work lists.e. data flow information is not propagated along the edges of the control flow graph but along the edges of the SSA graph. In addition. First. in contrast to the algorithm described previously. the CFGWorkList. 2. append all outgoing SSA graph edges of the operation to the SSAWorkList.. φ -operations: Combine the data flow information from the node’s operands where the corresponding control flow edge is executable. The CFGWorkList is used to track edges of the CFG that were encountered to be executable. The data flow information of the incoming operands has to be combined using the meet or join operator of the lattice. it is sufficient to store the data flow information with the SSA graph node. Conditional branches: Examine the branch’s condition(s) using the data flow information of its operands. which select a value among their operands depending on the incoming control flow edges. Similarly. The in and out sets used by the traditional approach— see Section 8.
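A condensed OCaml sketch of the two-work-list engine follows. The problem-specific parts are passed in as functions; the record fields, the representation of CFG edges as integer pairs, and the decision to re-evaluate all operations of a node whenever one of its incoming edges becomes executable (a mild relaxation of step 3 of Algorithm 12) are our own simplifications.

    type 'op engine = {
      start_edges : (int * int) list;         (* outgoing edges of the CFG entry *)
      in_edges    : int -> (int * int) list;  (* incoming CFG edges of a node *)
      ops_of      : int -> 'op list;          (* operations of a CFG node *)
      block_of    : 'op -> int;
      visit       : 'op -> (int * int) list * 'op list;
        (* Algorithm 13: evaluate one operation; returns the CFG edges found
           executable (for branches) and the SSA uses whose input changed *)
    }

    let propagate (e : 'op engine) : unit =
      let executable : (int * int, unit) Hashtbl.t = Hashtbl.create 64 in
      let cfg_wl = Queue.create () and ssa_wl = Queue.create () in
      let eval op =
        let new_edges, changed_uses = e.visit op in
        List.iter (fun edge -> Queue.add edge cfg_wl) new_edges;
        List.iter (fun u -> Queue.add u ssa_wl) changed_uses
      in
      List.iter (fun edge -> Queue.add edge cfg_wl) e.start_edges;
      while not (Queue.is_empty cfg_wl && Queue.is_empty ssa_wl) do
        if not (Queue.is_empty cfg_wl) then begin
          let (_, dst) as edge = Queue.pop cfg_wl in
          if not (Hashtbl.mem executable edge) then begin
            Hashtbl.add executable edge ();
            List.iter eval (e.ops_of dst)    (* phis and newly reachable ops *)
          end
        end else begin
          let op = Queue.pop ssa_wl in
          (* re-evaluate a use only if its CFG node is reachable *)
          if List.exists (fun edge -> Hashtbl.mem executable edge)
                         (e.in_edges (e.block_of op))
          then eval op
        end
      done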

i. 6: z1 = x3 + y1 (z1 . or whenever the data flow information of one of their predecessors in the SSA graph changed.e. Consequently. the target node has to be evaluated when the target node is encountered to be executable for the first time. and y1 . ⊥) 5: x3 = φ x1 . However.108 8 Propagating Information using SSA— (F. Brandner. The transfer functions immediately yield that the variables are all constant. also the assignment to variable z1 is reevaluated. but the analysis . ⊥) Fig. Regular operations as well as φ -operations are visited by Algorithm 13 when the corresponding control flow graph node has become executable. 5. Note that this is only required the first time the node is encountered to be executable. all control flow edges in the program will eventually be marked executable. Finally. 6) 3: x1 = 4 4: x2 = 5 (x2 . At φ -operations. due to the processing of operations in Step 4b.1 and the constant propagation problem. assume that the condition of the if cannot be statically evaluated. Finally. we thus have to assume that all its successors in the CFG are reachable. x2 (x3 . 4) 4: x2 = 5 (x2 . 5) 1: y1 = 6 (y1 . This new information will trigger the reevaluation of the φ -operation of variable x3 . control flow edges are appended to the CFGWorkList to ensure that all reachable operations are considered during the analysis. which thereafter triggers the reevaluation automatically when necessary through the SSA graph. Conditional branches are handled by examining its conditions based on the data flow information computed so far. (z1 . D. i. As both of its incoming control flow edges are marked executable. This will trigger the evaluation of the constant assignments to the variables x1 . x2 (x3 . 5) 5: x3 = φ x1 .e. only those operands where the associated control flow edge is marked executable are considered. holding the values 4. and 6 respectively. the combined information yields 4 5 = ⊥. consider the program shown in Figure 8.2 Sparse conditional data flow propagation using SSA graphs. Novillo) 3: x1 = 4 (x1 . 11) ( b ) With unreachable code. 8. 6) 6: z1 = x3 + y1 ( a ) All code reachable. the value is known not to be a particular constant value. 5) 1: y1 = 6 (y1 . x2 . As an example.. the information from multiple control flow paths is combined using the usual meet or join operator. all regular operations are processed by applying the relevant transfer function and possibly propagating the updated information to all uses by appending the respective SSA graph edges to the SSAWorkList. Depending on whether those conditions are satisfiable or not.. the currently processed control flow edge is the first of its incoming edges that is marked executable. First.

However, if the condition of the if is known to be false for all possible program executions, a more precise result can be computed. Neither the control flow edge leading to the assignment of variable x1 nor its outgoing edge leading to the φ-operation of variable x3 is marked executable. Consequently, the reevaluation of the φ-operation considers the data flow information of its second operand x2 only, which is known to be constant. This enables the analysis to show that the assignment to variable z1 is, in fact, constant as well, as shown in Figure 8.2b.

8.3.3 Discussion

During the course of the propagation algorithm, edges of the control flow graph are processed at most once, while every edge of the SSA graph is processed at least once; an edge can be revisited several times, depending on the height h of the lattice representing the property space of the analysis. This leads to an upper bound in execution time of O(|E_SSA| · h + |E_CFG|), where E_SSA and E_CFG represent the edges of the SSA graph and the control flow graph respectively. The size of the SSA graph increases with respect to the original non-SSA program, as variables typically have more than one use. Measurements indicate that this growth is linear, yielding a bound that is comparable to the bound of traditional data flow analysis. In practice, however, the SSA-based propagation engine outperforms the traditional approach. This is due to the direct propagation from the definition of a variable to its uses, without the costly intermediate steps that have to be performed on the CFG; the exclusive propagation of information between data-dependent operations leads to considerable savings in practice. The overhead is also reduced in terms of memory consumption: instead of storing the in and out sets capturing the complete property space at every program point, it is sufficient to associate every node in the SSA graph with the data flow information of the corresponding variable only.

8.3.4 Limitations

Unfortunately, the presented approach also has its limitations, which arise from two sources: first, the semantics and placement of φ-operations, and second, the exclusive propagation of information between data-dependent operations. The former issue prohibits the modeling of data flow problems that propagate information to program points that are not directly related by either a definition or a use of a variable, while the latter prohibits the modeling of general backward problems. Consider, for example, the problem of available expressions that often occurs in the context of redundancy elimination. An expression is available at a given program point when the expression is computed and not modified thereafter on all paths leading to that program point. In particular, this might include program points that neither define nor use any of the expression's operands.

Such points are independent from the expression and its operands, and the SSA graph does not cover them. Secondly, data flow analysis using SSA graphs is limited to forward problems. φ-operations are placed at join points with respect to the forward control flow and thus do not capture join points of the reversed control flow graph. Due to the structure of the SSA graph, it is not possible to simply reverse its edges as is done with flow graphs; in addition, this would invalidate the nice property of having a single source for the incoming edges of a given variable. SSA graphs are consequently not suited to model backward problems in general. There are, however, program representations akin to the SSA format that can handle backward analyses; Chapter 14 gives an overview of such representations.

8.4 Example – Copy Propagation

Even though data flow analysis based on SSA graphs has its limitations, it is still a useful and effective solution for interesting problems, as we will show in the following example. Copy propagation under SSA form is, in principle, very simple. Given the assignment x = y, all we need to do is to traverse the immediate uses of x and replace them with y, thereby effectively eliminating the original copy operation. However, such an approach will not be able to propagate copies past φ-operations, particularly those in loops. A more powerful approach is to split copy propagation into two phases: first, a data flow analysis is performed to find copy-related variables throughout the program; second, a rewrite phase eliminates spurious copies and renames variables. The analysis for copy propagation can be described as the problem of propagating the copy-of value of variables. Given a sequence of copies as shown in Figure 8.3a, we say that y1 is a copy of x1 and z1 is a copy of y1.

[Fig. 8.3: Analysis of copy-related variables. (a) Transitive copy relations: 1: x1 = ...; 2: y1 = x1; 3: z1 = y1. (b) Copy relations on φ-operations: 1: y1 = ...; 2: x1 = y1; 3: x2 = y1; 4: x3 = φ(x1, x2).]

The problem with this representation is that there is no apparent link from z1 to x1. In order to handle transitive copy relations, all transfer functions operate on copy-of values instead of the direct source of the copy. If a variable is not found to be a copy of anything else, its copy-of value is the variable itself. The lattice of this data flow problem is thus similar to the lattice used previously for constant propagation, except that the lattice elements correspond to variables of the program instead of integer numbers; the least element of the lattice represents the fact that a variable is a copy of itself. For the example above, this yields that both y1 and z1 are copies of x1. Similarly, we would like to obtain the result that x3 is a copy of y1 in the example of Figure 8.3b. Assuming that the value assigned to variable x1 is not a copy, the data flow information for this variable is lowered to ⊥. When visiting the φ-operation for x3, the analysis finds that x1 and x2 are both copies of y1 and consequently propagates that x3 is also a copy of y1. This is accomplished by choosing the join operator such that a copy relation is propagated whenever the copy-of values of all the operands of the φ-operation match; a sketch of such a join is shown below.

The next example shows a more complex situation, where copy relations are obfuscated by loops—see Figure 8.4.

[Fig. 8.4: φ-operations in loops often obfuscate copy relations. (a) Control flow graph: 1: x1 = ...; 2: x2 = φ(x1, x4); 3: if (x2 == x1) goto 5; 4: x3 = ...; 5: x4 = φ(x3, x2); 6: if (...) goto 2; with edges e1–e6. (b) SSA graph: 2: x2 = φ(x1 (e1), x4 (e5)); 5: x4 = φ(x3 (e4), x2 (e3)).]

Note that the actual visiting order depends on the shape of the CFG and the immediate uses; the ordering used here is meant for illustration only.
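For concreteness, here is a small Python sketch of the copy-of join just described. The representation of lattice values (a variable name, ⊤ for unknown, ⊥ for "not a copy") is an assumption chosen for illustration, not the chapter's implementation, and the caller is assumed to pass only the operands whose incoming edges are executable.

TOP, BOT = "top", "bot"   # unknown / not a copy of anything else

def copy_of(var, info):
    # Follow copy-of values transitively; a variable that is not a copy maps to itself.
    seen = set()
    while var in info and info[var] not in (TOP, BOT) and var not in seen:
        seen.add(var)
        var = info[var]
    return var

def join_phi(executable_operands, info):
    # Propagate a copy relation only if all executable operands agree on their copy-of value.
    values = {copy_of(v, info) for v in executable_operands}
    values.discard(TOP)               # optimistic: ignore operands not yet analyzed
    return values.pop() if len(values) == 1 else BOT

# Figure 8.3b: y1 = ...; x1 = y1; x2 = y1; x3 = phi(x1, x2)
info = {"y1": BOT, "x1": "y1", "x2": "y1"}
print(join_phi(["x1", "x2"], info))   # -> "y1": x3 is a copy of y1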

Processing starts at the operation labeled 1, with both work lists empty and the data flow information associated with all variables initialized to ⊤:

1. Assuming that the value assigned to variable x1 is not a copy, the data flow information for this variable is lowered to ⊥; the SSA edges leading to operations 2 and 3 are appended to the SSAWorkList, and the control flow graph edge e1 is appended to the CFGWorkList.
2. Processing the control flow edge e1 from the work list causes the edge to be marked executable and the operations labeled 2 and 3 to be visited. Since edge e5 is not yet known to be executable, the processing of the φ-operation yields a copy relation between x2 and x1. Examining the condition shows that only edge e3 is executable and thus needs to be added to the work list.
3. Control flow edge e3 is processed next and marked executable for the first time. Thus, the φ-operation labeled 5 is visited. Due to the fact that edge e4 is not known to be executable, this yields a copy relation between x4 and x1 (via x2). The condition of the branch labeled 6 cannot be analyzed and thus causes its outgoing control flow edges e5 and e6 to be added to the work list.
4. Now, control flow edge e5 is processed and marked executable. Since the target operations are already known to be executable, only the φ-operation is revisited. However, variables x1 and x4 have the same copy-of value x1, which is identical to the previous result computed in Step 2. Thus, neither of the two work lists is modified.
5. Assuming that the control flow edge e6 leads to the exit node of the control flow graph, the algorithm stops after processing the edge without modifications to the data flow information computed so far.

The straightforward implementation of copy propagation would have needed multiple passes to discover that x4 is a copy of x1. The iterative nature of the propagation, along with the ability to discover non-executable code, allows even obfuscated copy relations to be handled. Moreover, this kind of propagation will only reevaluate the subset of operations affected by newly computed data flow information, instead of the complete control flow graph, once the set of executable operations has been discovered.

8.5 Further Reading

Traditional data flow analysis is well established and well described in numerous papers. The book by Nielson, Nielson, and Hankin [197] gives an excellent introduction to the theoretical foundations and practical applications. For reducible flow graphs, the order in which operations are processed by the work list algorithm can be optimized [136, 151, 197], allowing tighter complexity bounds to be derived. However, experiments have shown that the tighter bounds do not necessarily lead to improved compilation times [79]. Furthermore, relying on reducibility is problematic, because flow graphs are often not reducible even for properly structured languages; for instance, reversed control flow graphs for backward problems can be—and in fact almost always are—irreducible even for programs with reducible control flow graphs.

Apart from computing the fixed point (MFP) solution, traditional data flow equations can also be solved using a more powerful approach called the meet over all paths (MOP) solution, which computes the in data flow information of a basic block by examining all possible paths from the start node of the control flow graph. Even though more powerful, computing the MOP solution is often harder or even undecidable [197]; consequently, the MFP solution is preferred in practice.

The sparse propagation engine presented in this chapter [200, 272] is based on the underlying properties of the SSA form. Other intermediate representations offer similar properties. The sparse evaluation graph of Choi et al. [68] is based on the same basic idea: intermediate steps are eliminated by by-passing irrelevant CFG nodes and merging the data flow information only when necessary. Their approach is closely related to the placement of φ-operations and similarly relies on the dominance frontier during construction. A similar approach, presented by Johnson and Pingali [145], is based on single-entry/single-exit regions; the resulting graph is usually less sparse, but is also less complex to compute. Ramalingam [219] further extends these ideas and introduces the compact evaluation graph, which is constructed from the initial CFG using two basic transformations; this approach is superior to the sparse representations of Choi et al. as well as to the approach presented by Johnson and Pingali. The previous approaches derive a sparse graph suited for data flow analysis using graph transformations applied to the CFG; Duesterwald et al. [100] instead examine the data flow equations, eliminate redundancies, and apply simplifications to them. Ruf [230] introduces the value dependence graph, which captures both control and data dependencies; he derives a sparse representation of the input program, suited for data flow analysis, using a set of transformations and simplifications. Bodík uses an extended SSA form, e-SSA, to eliminate array bounds checks [35].

Static Single Information form (SSI) [237] allows both backward and forward problems to be modeled by introducing σ-operations, which are placed at program points where data flow information for backward problems needs to be merged [236]. Chapter 14 provides additional information on the use of SSI form for static program analysis.


CHAPTER 9
Liveness
B. Boissinot, F. Rastello
Progress: 30% — Material gathering in progress

9.1 TODO
9.2 SSA tells nothing about liveness — Do we really need this? I am not convinced. Starts with a counter-example to the intuition that a live range is "between" the definition and the uses. Discussion about the φ-function semantics (this is transverse to the technique).
9.3 Computing liveness sets — Provides Appel's technique (from the Modern Compiler book) to compute liveness sets under SSA.

9.4 Liveness check

CHAPTER 10
??? Alias analysis
Progress: 0% — Material gathering in progress

10.1 TODO


CHAPTER 11
Loop tree and induction variables
S. Pop, A. Cohen
Progress: 90% — Text and figures formatting in progress

This chapter presents an extension of the SSA under which the extraction of the reducible loop tree can be done only on the SSA graph itself. This extension also captures reducible loops in the CFG. The induction variable analysis presented in this chapter is based on the detection of self references in the SSA representation and on their characterization. The chapter first illustrates this property, then shows its usefulness through the problem of induction variable recognition.

11.1 Part of the CFG and Loop Tree can be exposed from the SSA

During the construction of the SSA representation based on a CFG representation, a large part of the CFG information is translated into the SSA representation. As the construction of the SSA has precise rules to place the φ-nodes at special points of the CFG (i.e., at the merge of control-flow branches), it is possible, by identifying patterns of uses and definitions, to expose a part of the CFG structure from the SSA representation. Based only on the patterns of the SSA definitions and uses, it is possible to identify higher-level constructs inherent to the CFG representation, such as strongly connected components of basic blocks (or reducible loops). This first section shows that the classical SSA representation is not enough to represent the semantics of the original program. We will see the minimal amount of information that has to be added to the classical SSA representation in order to represent the loop information: similar to the η-function used in the Gated SSA presented in Chapter 18, the loop closed SSA form adds an extra variable at the end of a loop for each variable defined in a loop and used after the loop.

11.1.1 An SSA representation without the CFG

In the classic definition of the SSA, the CFG provides the skeleton of the program: basic blocks contain assignment statements defining SSA variable names, and basic blocks with multiple predecessors contain φ-nodes. Let's look at what happens when, starting from a classic SSA representation, we remove the CFG. In order to remove the CFG, imagine a pretty printer function that dumps only the arithmetic instructions of each basic block and skips the control instructions of an imperative program, traversing the CFG structure in any order and listing the definitions in an arbitrary order. Does the representation obtained from this pretty printer contain enough information to enable us to compute the same thing as the original program¹? Let's see what happens with an example in its CFG-based SSA representation:

bb_1 (preds = {bb_0}, succs = {bb_2}) {
  a = #some computation independent of b
}
bb_2 (preds = {bb_1}, succs = {bb_3}) {
  b = #some computation independent of a
}
bb_3 (preds = {bb_2}, succs = {bb_4}) {
  c = a + b;
}
bb_4 (preds = {bb_3}, succs = {bb_5}) {
  return c;
}

After removing the CFG structure, listing the definitions in an arbitrary order, we could obtain this:

return c;
c = a + b;
b = #some computation independent of a
a = #some computation independent of b

¹ To simplify the discussion, we consider the original program to be free of side effect instructions.


and this SSA code is enough, in the absence of side effects, to recover an order of computation that leads to the same result as in the original program. For example, the evaluation of this sequence of statements would produce the same result:

b = #some computation independent of a
a = #some computation independent of b
c = a + b;
return c;

11.1.2 Natural loop structures on the SSA
We will now see how to represent the natural loops in the SSA form by systematically adding extra φ -nodes at the end of loops, together with extra information about the loop exit predicate. Supposing that the original program contains a loop:

bb_1 (preds = {bb_0}, succs = {bb_2}) {
  x = 3;
}
bb_2 (preds = {bb_1, bb_3}, succs = {bb_3, bb_4}) {
  i = phi (x, j)
  if (i < N) goto bb_3 else goto bb_4;
}
bb_3 (preds = {bb_2}, succs = {bb_2}) {
  j = i + 1;
}
bb_4 (preds = {bb_2}, succs = {bb_5}) {
  k = phi (i)
}
bb_5 (preds = {bb_4}, succs = {bb_6}) {
  return k;
}

Pretty printing, with a random order traversal, we could obtain this SSA code:


x = 3;
return k;
i = phi (x, j)
k = phi (i)
j = i + 1;
We can remark that some information is lost in this pretty printing: the exit condition of the loop has been lost. We will have to record this information in the extension of the SSA representation. However, the loop structure still appears through the cyclic definition of the induction variable i. To expose it, we can rewrite this SSA code using simple substitutions, as:

i = phi (3, i + 1)
k = phi (i)
return k;
Thus, we have the definition of the SSA name "i" defined as a function of itself. This pattern is characteristic of the existence of a loop. We can remark that there are two kinds of φ-nodes used in this example:

• loop-φ nodes "i = φ(x, j)" have an argument that contains a self reference j and an invariant argument x: here the defining expression "j = i + 1" contains a reference to the same loop-φ definition i, while x (here 3) is not part of the circuit of dependencies that involves i and j. Note that it is possible to define a canonical SSA form by limiting the number of arguments of loop-φ nodes to two.
• close-φ nodes "k = φ(i)" capture the last value of a name defined in a loop. Names defined in a loop can only be used within that loop or in the arguments of a close-φ node (which is "closing" the set of uses of the names defined in that loop). In a canonical SSA form it is possible to limit the number of arguments of close-φ nodes to one.

11.1.3 Improving the SSA pretty printer for loops
As we have seen in the above example, the exit condition of the loop disappeared during the basic pretty printing of the SSA. To capture the semantics of the computation of the loop, we have to specify in the close-φ node when we exit the loop, so as to be able to derive which value will be available at the end of the loop. With our extension that adds the loop exit condition to the syntax of the close-φ, the SSA pretty printing of the above example would be:

x = 3;
i = loop-phi (x, j)
j = i + 1;
k = close-phi (i >= N, i)
return k;
So k is defined as the “first value” of i satisfying the loop exit condition, “i ≥ N ”. In the case of finite loops, this is well defined as being the first element satisfying the loop exit condition of the sequence defined by the corresponding loop-φ node.
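As a small illustration of this semantics, the following sketch evaluates the example above for a concrete N; the explicit Python loop is only a stand-in for the sequence defined by the loop-φ node, not part of the representation itself.

def close_phi_value(N, base=3, step=1, limit=1000):
    # k = close-phi(i >= N, i): the first value of the sequence i = {3, +, 1}
    # that satisfies the exit condition i >= N.
    i = base
    for _ in range(limit):
        if i >= N:
            return i
        i += step
    raise ValueError("loop did not terminate within the given bound")

print(close_phi_value(10))   # -> 10: the first i in 3, 4, 5, ... with i >= 10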


In the following, we will look at an algorithm that translates the SSA representation into a representation of polynomial functions, describing the sequence of values that SSA names take during the execution of a loop. The algorithm is restricted to loops that are reducible. All such loops are labeled. Note that irreducible control flow is not forbidden: only the loops carrying self-definitions must be reducible.

11.2 Analysis of Induction Variables
The purpose of the induction variables analysis is to provide a characterization of the sequences of values taken by a variable during the execution of a loop. This characterization can be an exact function of the canonical induction variable of the loop (i.e., a loop counter that starts at zero with a step of one for each iteration of the loop) or an approximation of the values taken during the execution of the loop represented by values in an abstract domain. In this section, we will see a possible characterization of induction variables in terms of sequences. The domain of sequences will be represented by chains of recurrences: as an example, a canonical induction variable with an initial value 0 and a stride 1 that would occur in the loop with label x will be represented by the chain of recurrence {0, +, 1}x .

11.2.1 Stride detection
The first phase of the induction variables analysis is the detection of the strongly connected components of the SSA. This can be performed by traversing the usedef SSA chains and detecting that some definitions are visited twice. For a self referring use-def chain, it is possible to derive the step of the corresponding induction variable as the overall effect of one iteration of the loop on the value of the loop-φ node. When the step of an induction variable depends on another cyclic definition, one has to further analyze the inner cycle. The analysis of the induction variable ends when all the inner cyclic definitions used for the computation of the step are analyzed. Note that it is possible to construct SSA graphs with strongly connected components that are impossible to characterize with the chains of recurrences. This is precisely the case of the following example that shows two inter-dependent circuits, the first that involves a and b with step c + 2, and the second that involves c and d with step a + 2. This leads to an endless loop, which must be detected.

a = loop-phi-x (0, b)
c = loop-phi-x (1, d)
b = c + 2
d = a + 3

[Fig. 11.1: Detection of the cyclic definition using a depth-first search traversal of the use-def chains. All four panels show the same code—a = 3; b = 1; loop x: c = φ(a, f); d = φ(b, g); e = d + 7; f = e + c; g = d + 5; end loop x—with a different edge highlighted: (1) initial condition edge; (2) searching for "c"; (3) backtrack and found "c"; (4) cyclic def-use chain.]

Let’s now look at an example, presented in Figure 11.1, to see how the stride detection works. The arguments of a φ -node are analyzed to determine whether they contain self references or if they are pointing towards the initial value of the induction variable. In this example, (1) represents the use-def edge that points towards the invariant definition. When the argument to be analyzed points towards a longer use-def chain, the full chain is traversed, as shown in (2), until a phi node is reached. In this example, the phi node that is reached in (2) is different to the phi node from which the analysis started, and so in (3) a search starts over the uses that have not yet been analyzed. When the original phi node is found, as in (3), the cyclic def-use chain provides the step of the induction variable: in this example, the step is “+ e”. Knowing the symbolic expression for the step of the induction variable may not be enough, as we will see next, one has to instantiate all the symbols (“e” in the current example) defined in the varying loop to precisely characterize the induction variable.
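The following Python sketch mimics this depth-first traversal of use-def chains on the SSA graph of Figure 11.1; the graph encoding (a dictionary from each name to its defining operation) and the way the step is collected are assumptions chosen purely for illustration.

# SSA graph of Figure 11.1: each name maps to its defining operation.
defs = {
    "a": ("const", 3), "b": ("const", 1),
    "c": ("loopphi", "a", "f"), "d": ("loopphi", "b", "g"),
    "e": ("add", "d", 7), "f": ("add", "e", "c"), "g": ("add", "d", 5),
}

def find_step(phi, name, path=()):
    """DFS over use-def chains; returns the symbolic step accumulated along the
    cycle back to `phi`, or None if this chain does not reach it."""
    if name == phi:
        return 0                          # closed the cycle: start accumulating
    if not isinstance(name, str) or name in path:
        return None                       # constant operand or already visited
    kind, *ops = defs[name]
    if kind in ("const", "loopphi"):
        return None                       # inner cycles stay symbolic here
    lhs, rhs = ops                        # kind == "add": try both operands
    for cur, other in ((lhs, rhs), (rhs, lhs)):
        sub = find_step(phi, cur, path + (name,))
        if sub is not None:
            return other if sub == 0 else (other, "+", sub)
    return None

# Step of c: follow the self-referring operand f of c = loopphi(a, f).
print(find_step("c", defs["c"][2]))   # -> 'e': one loop iteration adds e to c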

11.2.2 Translation to chains of recurrences
Once the def-use circuit and its corresponding overall loop update expression have been identified, it is possible to translate the sequence of values of the induction variable to a chain of recurrences. The syntax of a polynomial chain of recurrences is {base, +, step}x, where base and step may be arbitrary expressions or constants, and x is the loop to which the sequence is associated. As a chain of recurrences represents the sequence of values taken by a variable during the execution of a loop, the associated expression of a chain of recurrences is given by {base, +, step}x(ℓx) = base + step × ℓx, that is, a function of ℓx, the number of times the body of loop x has been executed.


When base or step translates to sequences varying in outer loops, the resulting sequence is represented by a multivariate chain of recurrences. For example {{0, +, 1}x, +, 2}y defines a multivariate chain of recurrences with a step of 1 in loop x and a step of 2 in loop y, where loop y is enclosed in loop x. When step translates into a sequence varying in the same loop, the chain of recurrences represents a polynomial of a higher degree. For example, {3, +, {8, +, 5}x}x represents a polynomial evolution of degree 2 in loop x. In this case, the chain of recurrences is also written omitting the extra braces: {3, +, 8, +, 5}x. The semantics of a chain of recurrences is defined using the binomial coefficient $\binom{n}{p} = \frac{n!}{p!(n-p)!}$, by the equation:

$$\{c_0, +, c_1, +, c_2, +, \ldots, +, c_n\}_x(\vec{\ell}) \;=\; \sum_{p=0}^{n} c_p \binom{\ell_x}{p},$$

with $\vec{\ell}$ the iteration domain vector (the iteration loop counters for all the loops in which the chain of recurrences varies), and $\ell_x$ the iteration counter of loop x. This semantics is very useful in the analysis of induction variables, as it makes it possible to split the analysis into two phases, with a symbolic representation as a partial intermediate result:

1. first, the analysis leads to an expression where the step part "s" is left in a symbolic form, i.e., {c0, +, s}x;
2. then, by instantiating the step, i.e., s = {c1, +, c2}x, the chain of recurrences is that of a higher degree polynomial, i.e., {c0, +, {c1, +, c2}x}x = {c0, +, c1, +, c2}x.
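A minimal sketch of this semantics in Python, evaluating a single-loop polynomial chain of recurrences from its list of coefficients; the function name and the list encoding are assumptions made for illustration.

from math import comb

def cr_eval(coeffs, l):
    # Value of {c0, +, c1, +, ..., +, cn}_x after l iterations of loop x:
    # the sum of c_p * binomial(l, p).
    return sum(c * comb(l, p) for p, c in enumerate(coeffs))

# {3, +, 8, +, 5}_x from the running example:
print([cr_eval([3, 8, 5], l) for l in range(5)])   # -> [3, 11, 24, 42, 65]
# consecutive differences 8, 13, 18, 23 themselves follow {8, +, 5}_x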

11.2.3 Instantiation of symbols and region parameters
The last phase of the induction variable analysis consists in the instantiation (or further analysis) of symbolic expressions left from the previous phase. This includes the analysis of induction variables in outer loops, computing the last value of the counter of a preceding loop, and the propagation of closed form expressions for loop invariants defined earlier. In some cases, it becomes necessary to leave in a symbolic form every definition outside a given region, and these symbols are then called parameters of the region. Let us look again at the example of Figure 11.1 to see how the sequence of values of the induction variable c is characterized with the chains of recurrences notation. The first step, after the cyclic definition is detected, is the translation of this information into a chain of recurrence: in this example, the initial value (or base of the induction variable) is a and the step is e , and so c is represented by a chain of recurrence {a , +, e }1 that is varying in loop number 1. The symbols are then instantiated: a is trivially replaced by its definition leading to {3, +, e }1 . The analysis of e leads to this chain of recurrence: {8, +, 5}1 that is then used in the chain of recurrence of c , {3, +, {8, +, 5}1 }1 and that is equivalent to {3, +, 8, +, 5}1 ,


a polynomial of degree two:

$$F(\ell) \;=\; 3\binom{\ell}{0} + 8\binom{\ell}{1} + 5\binom{\ell}{2} \;=\; \frac{5}{2}\ell^2 + \frac{11}{2}\ell + 3.$$

11.2.4 Number of iterations and computation of the end of loop value
One of the important static analyses for loops is to evaluate their trip count, i.e., the number of times the loop body is executed before the exit condition becomes true. In common cases, the loop exit condition is a comparison of an induction variable against some constant, parameter, or another induction variable. The number of iterations is then computed as the minimum solution of a polynomial inequality with integer solutions, also called a Diophantine inequality. When one or more coefficients of the Diophantine inequality are parameters, the solution is left under a parametric form. The number of iterations can also be an expression varying in an outer loop, in which case, it can be characterized using a chain of recurrence. Consider a scalar variable varying in an outer loop with strides dependent on the value computed in an inner loop. The expression representing the number of iterations in the inner loop can then be used to express the evolution function of the scalar variable varying in the outer loop. For example, the following code

x = 0;
for (i = 0; i < N; i++)    // loop_1
  for (j = 0; j < M; j++)  // loop_2
    x = x + 1;
would be written in loop closed SSA form as:

x0 = 0
i  = loop-φ1 (0, i + 1)
x1 = loop-φ1 (x0, x2)
x4 = close-φ1 (i < N, x1)
j  = loop-φ2 (0, j + 1)
x3 = loop-φ2 (x1, x3 + 1)
x2 = close-φ2 (j < M, x3)

x4 represents the value of variable x at the end of the original imperative program. The analysis of scalar evolutions for variable x4 would trigger the analysis of scalar evolutions for all the other variables defined in the loop closed SSA form as follows:

• first, the analysis of variable x4 triggers the analysis of i, N and x1:


  – the analysis of i leads to i = {0, +, 1}1, i.e., the canonical loop counter ℓ1 of loop 1;
  – N is a parameter and is left under its symbolic form;
  – the analysis of x1 triggers the analysis of x0 and x2:
    · the analysis of x0 leads to x0 = 0;
    · analyzing x2 triggers the analysis of j, M, and x3;
    · j = {0, +, 1}2, i.e., the canonical loop counter ℓ2 of loop 2;
    · M is a parameter;
    · x3 = loop-φ2 (x1, x3 + 1) = {x1, +, 1}2;
    · x2 = close-φ2 (j < M, x3) is then computed as the last value of x3 after loop 2, i.e., it is the chain of recurrences of x3 applied to the first iteration of loop 2 that does not satisfy j < M, or equivalently ℓ2 < M. The corresponding Diophantine inequality ℓ2 ≥ M has minimum solution ℓ2 = M. So, to finish the computation of the scalar evolution of x2, we apply M to the scalar evolution of x3, leading to x2 = {x1, +, 1}2 (M) = x1 + M;
  – the scalar evolution analysis of x1 then leads to x1 = loop-φ1 (x0, x2) = loop-φ1 (x0, x1 + M) = {x0, +, M}1 = {0, +, M}1;

• and finally the analysis of x4 ends with x4 = close-φ1 (i < N, x1) = {0, +, M}1 (N) = M × N.
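As a quick sanity check of the derivation above, the following sketch evaluates the resulting chain of recurrences for concrete values of N and M and compares it against a direct execution of the original nest; the encoding is an illustration, not part of the analysis.

def run_original(N, M):
    x = 0
    for i in range(N):          # loop_1
        for j in range(M):      # loop_2
            x = x + 1
    return x

def run_scev(N, M):
    # x1 = {0, +, M}_1 evaluated at the exit iteration of loop_1 (l1 = N),
    # which is exactly the end-of-loop value x4 derived above.
    return 0 + M * N

for N, M in [(0, 5), (3, 4), (7, 7)]:
    assert run_original(N, M) == run_scev(N, M) == M * N
print("end-of-loop value matches M * N")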

11.3 Further reading
Induction variable detection has been studied extensively in the past because of its central role in loop optimizations. Wolfe [?] designed the first SSA-based induction variable recognition technique. It abstracts the SSA graph and classifies inductions according to a wide spectrum of patterns. When operating on a low-level intermediate representation with arbitrary gotos, detecting the natural loops is the first step in the analysis of induction variables. In general, and when operating on low-level code in particular, it is preferable to use analyses that are more robust to complex control flow and that do not resort to an early classification into predefined patterns. Chains of recurrences [22, 158, 287] have been proposed to characterize the sequence of values taken by a variable during the execution of a loop [?], and this representation has proven to be more robust to the presence of complex, unstructured control flow, to the characterization of induction variables over modulo arithmetic such as unsigned wrap-around types in C, and to the implementation in a production compiler [?]. The formalism and presentation of this chapter are derived from the thesis work of Sebastian Pop. The manuscript [?] contains pseudo-code and links to the implementation of scalar evolutions in GCC since version 4.0. The same approach has also influenced the design of LLVM's scalar evolution, but the implementation is different.

Induction variable analysis is used in dependence tests for scheduling and parallelization [?].² The Omega test [?] and parametric integer linear programming [?] have typically been used to reason about parametric affine Diophantine inequalities. But in many cases, depending on the context and aggressiveness of the optimization, simplifications and approximations can lead to polynomial decision procedures [?]. Modern parallelizing compilers tend to implement both kinds. Substituting an induction variable with a closed form expression is also useful for the removal of the cyclic dependences associated with the computation of the induction variable itself [?]. Other applications include enhancements to strength reduction and loop-invariant code motion [?], induction variable canonicalization (reducing induction variables to a single one in a given loop) [?], and, more recently, the extraction of short-vector SIMD instructions [?].

The number of iterations of loops can also be computed based on the characterization of induction variables. This information is essential to advanced loop analyses such as value-range propagation [?], and enhances dependence tests for scheduling and parallelization [?, ?]. Loop transformations also benefit from the replacement of the end-of-loop value, as this removes scalar dependencies between consecutive loops. It also enables more opportunities for scalar optimization when the induction variable is used after its defining loop. Another interesting use of the end-of-loop value is the estimation of the worst case execution time (WCET), where one tries to obtain an upper bound approximation of the time necessary for a program to terminate.

² Note however that the computation of closed form expressions is not required for dependence testing itself [?].

TODO (author note, translated from French): discuss types and modulo arithmetic, cite them as a difficulty, and describe their treatment in GCC and LLVM.

CHAPTER 12
Redundancy Elimination
F. Chow
Progress: 80% — Minor polishing in progress

12.1 Introduction

Redundancy elimination is an important category of optimizations performed by modern optimizing compilers. In the course of program execution, certain computations may be repeated multiple times and yield the same results. Such redundant computations can be eliminated by saving the results of the earlier computations and reusing, rather than recomputing, them later. There are two types of redundancies: full redundancy and partial redundancy. A computation is fully redundant if the computation has occurred earlier regardless of the flow of control; the elimination of full redundancy is also called common subexpression elimination. A computation is partially redundant if the computation has occurred only along certain paths. Full redundancy can be regarded as a special case of partial redundancy where the redundant computation occurs regardless of the path taken.

There are two different views of a computation related to redundancy: how it is computed and the computed value. The former relates to the operator and the operands it operates on, which translates to how it is represented in the program representation. The latter refers to the value generated by the computation in the static sense¹; a static value can map to different dynamic values during program execution. As a result, algorithms for finding and eliminating redundancies can be classified as syntax-driven or value-driven. In syntax-driven analyses, two computations are the same if they are the same operation applied to the same operands that are program variables or constants; in this case, redundancy can arise only if the variables' values have not changed between the occurrences of the computation. In value-based analyses, redundancy arises whenever two computations yield the same value: for example, a + b and a + c compute the same result if b and c can be determined to hold the same value. In this chapter, we deal mostly with syntax-driven redundancy elimination; the last section will extend our discussion to value-based redundancy elimination.

We distinguish between statements and expressions. Statements have side effects by potentially altering memory contents or control flow, and are not candidates for redundancy elimination. Expressions compute values without generating any side effect. Our algorithm will focus on the optimization of a lexically identified expression, like a + b, that appears in the program; during compilation, the compiler will repeat the redundancy elimination algorithm on all the other lexically identified expressions in the program. The style of the program representation can impact the effectiveness of the algorithm applied. In dealing with lexically identified expressions, we advocate a maximal expression tree form of program representation²: in this style, a large expression tree like a + b ∗ c − d is represented as is, without having to specify any assignments to temporaries for storing the intermediate values of the computation. We also assume the Conventional SSA Form of program representation, in which each φ-congruence class is interference-free and the live ranges of the SSA versions of each variable do not overlap. We further assume the HSSA form that completely models the aliasing in the program [reference to HSSA chapter].

¹ All values referred to in this Chapter are static values viewed with respect to the program code.
² The opposite of maximal expression tree form is the triplet form, in which each arithmetic operation always defines a temporary.

12.2 Why PRE and SSA are related

Figure 12.1 shows the two most basic forms of partial redundancy. In Figure 12.1(a), a + b is redundant when the right path is taken. In Figure 12.1(b), a + b is redundant whenever the branch-back edge of the loop is taken. Both are examples of strictly partial redundancies, in which insertions are required to eliminate the redundancies. In contrast, a full redundancy can be deleted without requiring any insertion. Partial redundancy elimination (PRE) is powerful because it subsumes global common subexpressions and loop-invariant code motion.

[Fig. 12.1: Two basic examples of partial redundancy elimination.]

We can visualize the impact on redundancies of a single computation as shown in Figure 12.2. In the region of the control flow graph dominated by the occurrence of a + b, any further occurrence of a + b is fully redundant, assuming a and b are not modified. Following the program flow, once we are past the dominance frontiers, any further occurrence of a + b is partially redundant.

[Fig. 12.2: Dominance frontiers are boundaries between fully and partially redundant regions.]

Since partial redundancies start at dominance frontiers, they must be related to SSA's φ's. The SSAPRE algorithm that we present performs PRE efficiently by taking advantage of the use-def information inherent in its input conventional SSA form. The same sparse approach to modeling the use-def relationships among the occurrences of a program variable can be used to model the redundancy relationships among the different occurrences of a + b. If an occurrence aj + bj is redundant with respect to ai + bi, SSAPRE builds a redundancy edge that connects ai + bi to aj + bj. To expose potential partial redundancies, we introduce the operator Φ at the dominance frontiers of the occurrences, which has the effect of factoring the redundancy edges at merge points in the control flow graph.³ The resulting factored redundancy graph (FRG) can be regarded as the SSA form for expressions. To make the expression SSA form more intuitive, we introduce the hypothetical temporary h, which can be thought of as the temporary that will be used to store the value of the expression for reuse in order to suppress redundant computations. The FRG can be viewed as the SSA graph for h, which is not precise, because we have not yet determined where h should be defined or used. The SSA form for h is constructed in two steps similar to ordinary SSA form: the Φ-Insertion step followed by the Renaming step. In constructing SSA form, dominance frontiers are where φ's are inserted. In the Φ-Insertion step, we insert Φ's at the dominance frontiers of all the expression occurrences, to ensure that we do not miss any possible placement positions for the purpose of PRE.

³ Adhering to SSAPRE's convention, we use lower-case φ's in the SSA form of variables and upper-case Φ's in the SSA form for expressions. In referring to the FRG, a use node will refer to a node in the FRG that is not a definition.

We also insert Φ's caused by expression alteration. Such Φ's are triggered by the occurrence of φ's for any of the operands in the expression. In Figure 12.3(a), the Φ in block 3 is inserted because block 3 is on the dominance frontier of the occurrences of a1 + b1; in Figure 12.3(b), the Φ at block 3 is caused by the φ for a in the same block, which in turn reflects the assignment to a in block 2.

[Fig. 12.3: Examples of Φ insertion. (a) Occurrences of a1 + b1 [h] in both predecessors of block 3, which receives [h] = Φ([h], [h]). (b) After a1 + b1 [h], block 2 assigns a2 and block 3 contains a3 = φ(a1, a2) and [h] = Φ([h], [h]).]
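For concreteness, here is a small sketch of the first part of the Φ-Insertion step—placing Φ's at the iterated dominance frontiers of the blocks containing occurrences of the expression. The dominance-frontier sets are taken as given input, and all names are illustrative assumptions rather than SSAPRE's actual data structures.

def place_phis(occ_blocks, df):
    """Iterated dominance frontier: blocks that receive a Phi for the expression.
    occ_blocks: blocks containing a real occurrence of the expression (e.g. a + b).
    df: dict mapping each block to its dominance-frontier set."""
    phi_blocks, worklist = set(), list(occ_blocks)
    while worklist:
        b = worklist.pop()
        for d in df.get(b, ()):
            if d not in phi_blocks:
                phi_blocks.add(d)
                worklist.append(d)   # a Phi is itself a definition point of h
    return phi_blocks

# Diamond of Figure 12.1(a): block 1 branches to 2 and 3, which merge in 4.
df = {1: set(), 2: {4}, 3: {4}, 4: set()}
print(place_phis({2}, df))   # -> {4}: a Phi for h is placed at the merge block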


The Renaming step assigns SSA versions to h such that occurrences renamed to identical h-versions will compute to the same values. We conduct a pre-order traversal of the dominator tree, similar to the renaming step in SSA construction for variables, but with the following modifications. In addition to a renaming stack for each variable, we maintain a renaming stack for the expression; entries on the expression stack are popped as our dominator tree traversal backtracks past the blocks where the expression originally received the version. Maintaining the variable and expression stacks together allows us to decide efficiently whether two occurrences of an expression should be given the same h-version. There are three kinds of occurrences of the expression in the program: (a) the occurrences in the original program, which we call real occurrences; (b) the Φ's inserted in the Φ-Insertion step; and (c) Φ operands, which are regarded as occurring at the ends of the predecessor blocks of their corresponding edges. During the visitation in Renaming, a Φ is always given a new version. For a non-Φ, i.e., cases (a) and (c), we check the current version of every variable in the expression (the version on the top of each variable's renaming stack) against the version of the corresponding variable in the occurrence on the top of the expression's renaming stack. If all the variable versions match, we assign it the same version as the top of the expression's renaming stack, as in the example of Figure 12.4(a). If any of the variable versions does not match, then for case (a) we assign it a new version, as in the example of Figure 12.4(b), and for case (c) we assign the special class ⊥ to the Φ operand to denote that the value of the expression is unavailable at that point. If a new version is assigned, we push the version on the expression stack.

[Fig. 12.4: Examples of expression renaming. (a) A dominated occurrence of a1 + b1 receives the same version [h1] as the earlier a1 + b1 [h1]. (b) After b2 = ..., the occurrence a1 + b2 receives a new version [h2]; a Φ operand with no available value is renamed ⊥.]

The FRG captures all the redundancies of a + b in the program. In fact, it contains just the right amount of information for determining the optimal code placement. Because strictly partial redundancies can only occur at the Φ nodes, insertions for PRE only need to be considered at the Φ's.

12.3 How SSAPRE Works

Referring to the expression being optimized as X, original computation points refer to the points in the original program where X's computation took place, while we use the term placement to denote the set of points in the optimized program where X's computation occurs. The original program will be transformed into the optimized program by performing a set of insertions and deletions. The objective of SSAPRE is to find a placement that satisfies the following four criteria, in this order:

1. Correctness — X is fully available at all the original computation points.
2. Safety — There is no insertion of X on any path that did not originally contain X.
3. Computational optimality — No other safe and correct placement can result in fewer computations of X on any path from entry to exit in the program.
4. Lifetime optimality — Subject to computational optimality, the live range of the temporary introduced to store X is minimized.

Each occurrence of X at its original computation point can be qualified with exactly one of the following attributes: fully redundant, strictly partially redundant (SPR), or non-redundant. SSAPRE follows the same two-step process used in all PRE algorithms. The first step determines the best set of insertion points that render as many SPR occurrences fully redundant as possible; the second step deletes fully redundant computations, taking into account the effects of the inserted computations. In the SSAPRE approach, the first step tackles the safety, computational optimality and lifetime optimality criteria, while the correctness criterion is delegated to the second step. As a code placement problem, the challenge lies in the first step: coming up with the best set of insertion points. Since the second, full redundancy elimination step is trivial and well understood, for the rest of this section we only focus on the first step for finding the best insertion points.

12.3.1 The Safety Criterion

As pointed out at the end of Section 12.2, insertions only need to be considered at the Φ's. When we say a Φ is a candidate for insertion, it means we will consider inserting at its operands to render X available at the entry to the basic block containing that Φ. Thus, insertions are only performed at Φ operands. An insertion at a Φ operand means inserting X at the incoming edge corresponding to that Φ operand; in reality, the actual insertion is done at the end of the predecessor block. We assume that all critical edges in the control flow graph have been removed by inserting empty basic blocks at such edges.⁴

The safety criterion implies that we should only insert at Φ's where X is downsafe (fully anticipated). Thus, we perform data flow analysis on the FRG to determine the downsafe attribute for Φ's, driven by the SPR occurrences; this data flow analysis can be performed with linear complexity on SSA graphs, which we illustrate with the Downsafety computation. A Φ is not downsafe if there is a control flow path from that Φ along which the expression is not computed before program exit or before being altered by redefinition of one of its variables. Except for loops with no exit, this can happen only due to one of the following cases: (a) there is a path to exit or an alteration of the expression along which the Φ result version is not used; or (b) the Φ result version appears as the operand of another Φ that is not downsafe. Case (a) represents the initialization for our backward propagation of ¬downsafe; all other Φ's are initially marked downsafe. The Downsafety propagation is based on case (b). Since a real occurrence of the expression blocks the case (b) propagation, we define a has_real_use flag attached to each Φ operand and set this flag to true when the Φ operand is defined by another Φ and the path from its defining Φ to its appearance as a Φ operand crosses a real occurrence. The initialization of the has_real_use flags is performed in the earlier Renaming phase, and the propagation of ¬downsafe is blocked whenever the has_real_use flag is true. Figure 12.5 gives the DownSafety propagation algorithm.

Procedure Reset_downsafe(X) {
 1: if def(X) is not a Φ
 2:   return
 3: f ← def(X)
 4: if (not downsafe(f))
 5:   return
 6: downsafe(f) ← false
 7: for each operand ω of f do
 8:   if (not has_real_use(ω))
 9:     Reset_downsafe(ω)
}
Procedure DownSafety {
10: for each f ∈ {Φ's in the program} do
11:   if ∃ path P to program exit or alteration of expression along which f is not used
12:     downsafe(f) ← false
13: for each f ∈ {Φ's in the program} do
14:   if (not downsafe(f))
15:     for each operand ω of f do
16:       if (not has_real_use(ω))
17:         Reset_downsafe(ω)
}
Fig. 12.5 Algorithm for DownSafety

⁴ A critical edge is one whose tail block has multiple successors and whose head block has multiple predecessors.
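The same propagation can be written down directly; the sketch below is a Python transcription of the two procedures above over an explicit list of Φ nodes, with the graph encoding (operand lists and flags) assumed purely for illustration.

class Phi:
    def __init__(self, name, not_used_on_some_exit_path=False):
        self.name = name
        self.downsafe = True
        self.not_used_on_some_exit_path = not_used_on_some_exit_path
        self.operands = []   # list of (defining Phi or None, has_real_use flag)

def reset_downsafe(operand):
    defining, _has_real_use = operand
    if defining is None or not defining.downsafe:
        return                        # not defined by a Phi, or already reset
    defining.downsafe = False
    for op in defining.operands:
        if not op[1]:                 # propagation blocked by has_real_use
            reset_downsafe(op)

def down_safety(phis):
    for f in phis:                    # case (a): initialization
        if f.not_used_on_some_exit_path:
            f.downsafe = False
    for f in phis:                    # case (b): backward propagation
        if not f.downsafe:
            for op in f.operands:
                if not op[1]:
                    reset_downsafe(op)

# Tiny example: phi2's value can reach exit unused, and phi1 feeds phi2 without a real use.
phi1, phi2 = Phi("phi1"), Phi("phi2", not_used_on_some_exit_path=True)
phi2.operands = [(phi1, False)]
down_safety([phi1, phi2])
print(phi1.downsafe, phi2.downsafe)   # -> False False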

12.3.2 The Computational Optimality Criterion

At this point, we have eliminated the unsafe Φ's based on the safety criterion. An unsafe Φ can still be an insertion candidate if the expression is fully available there, though the inserted computation will itself be fully redundant. Next, we want to identify all the Φ's that are possible candidates for insertion by disqualifying Φ's that cannot be insertion candidates in any computationally optimal placement. We define the can_be_avail attribute for the current step, whose purpose is to identify the region where, after appropriate insertions, the computation can become fully available. A Φ is ¬can_be_avail if and only if inserting there violates computational optimality. The can_be_avail attribute can be viewed as:

can_be_avail(Φ) = downsafe(Φ) ∪ avail(Φ)

We could compute the avail attribute separately using the full availability analysis, but this would have performed some useless computation, because we do not need to know its values within the region where the Φ's are downsafe. Thus, we choose to compute can_be_avail directly, by initializing a Φ to be ¬can_be_avail if the Φ is not downsafe and one of its operands is ⊥. In the propagation phase, we propagate ¬can_be_avail forward when a ¬downsafe Φ has an operand that is defined by a ¬can_be_avail Φ and that operand is not marked has_real_use.

After can_be_avail has been computed, it would be possible to perform insertions at all the can_be_avail Φ's. There would be full redundancies created among the insertions themselves, but they would not affect computational optimality, because the subsequent full redundancy elimination step will remove any fully redundant inserted or non-inserted computation.

12.3.3 The Lifetime Optimality Criterion

To fulfill lifetime optimality, we perform a second forward propagation called Later, which is derived from the well-understood partial availability analysis and involves propagation in the forward direction with respect to the control flow graph. The purpose is to disqualify can_be_avail Φ's where the computation is partially available based on the original occurrences of X. A Φ is marked later if it is not necessary to insert there because a later insertion is possible, leaving the earliest computations as the optimal code placement; in other words, there exists a computationally optimal placement under which X is not available immediately after the Φ. We optimistically regard all the can_be_avail Φ's to be later, except in the following cases: (a) the Φ has an operand defined by a real computation; or (b) the Φ has an operand that is a can_be_avail Φ marked not later. Case (a) represents the initialization for our forward propagation of not later; all other can_be_avail Φ's are marked later. The Later propagation is based on case (b).

The final criterion for performing insertion is to insert at the Φ's where can_be_avail and ¬later hold. We call such Φ's will_be_avail. At these Φ's, insertion is performed at each operand that satisfies either of the following conditions:

1. it is ⊥, or
2. has_real_use is false and it is defined by a ¬will_be_avail Φ.

We illustrate our discussion in this section with the example of Figure 12.6, where the program exhibits partial redundancy that cannot be removed by safe code motion. The two Φ's with their computed data flow attributes are as shown.

[Fig. 12.6: Example to show the need of the Later attribute. After a1 + b1 [h1] in block 1, the Φ [h3] = Φ([h1], ⊥) in block 3 has downsafe = 0 and can_be_avail = 0; after a2 = ... and a3 = φ(a1, a2), the Φ [h5] = Φ([h3], [h5]) in block 6 has downsafe = 1, can_be_avail = 1 and later = 1, and is followed by a3 + b1 [h5] and the exit.]

If insertions were based on can_be_avail, a + b would have been inserted at the exits of blocks 4 and 5 due to the Φ in block 6, which would have resulted in unnecessary code motion that increases register pressure. By considering later, no insertion is performed, which is optimal under safe PRE for this example.

12.4 Speculative PRE

If we ignore the safety requirement of PRE discussed in Section 12.3, the resulting code motion will involve speculation. Speculative code motion suppresses redundancy on some path at the expense of another path where the computation is added but its result is unused. As long as the paths that are burdened with more computations are executed less frequently than the paths where the redundant computations are avoided, a net gain in program performance can be achieved. Thus, speculative code motion should only be performed when there are clues about the relative execution frequencies of the paths involved.

Without profile data, speculative PRE can be conservatively performed by restricting it to loop-invariant computations. Figure 12.7 shows a loop-invariant computation a + b that occurs in a branch inside the loop. This loop-invariant code motion is speculative because, depending on the branch condition inside the loop, it may be executed zero times, while moving it to the loop header causes it to execute one time. This speculative loop-invariant code motion is profitable unless the path inside the loop containing the expression is never taken, which is usually not the case. When performing SSAPRE, marking Φ's located at the start of loop bodies downsafe will effect speculative loop-invariant code motion.

[Fig. 12.7: Speculative loop-invariant code motion.]

Computations like indirect loads and divides are called dangerous computations because they may fault. Dangerous computations in general should not be speculated.

Dangerous computations are sometimes protected by tests (or guards) placed in the code by the programmers or automatically generated by language compilers like those for Java. When such a test occurs in the program, we say the dangerous computation is safety-dependent on the control flow point that establishes its safety. At the points in the program where its safety dependence is satisfied, the dangerous instruction is fault-safe and can still be speculated.

We can represent safety dependences as value dependences in the form of abstract tau variables. Each run-time test that succeeds defines a tau variable on its fall-through path. During SSAPRE, we attach these tau variables as additional operands to the dangerous computations related to the test. The tau variables are also put into SSA form, so their definitions can be found by following the use-def edges. A dangerous computation can be defined to have more than one tau operand, depending on its semantics. When all its tau operands have definitions, the computation is fault-safe; otherwise, it is unsafe to speculate. By including the tau operands into consideration, speculative PRE automatically honors the fault-safety of dangerous computations when it performs speculative code motion.

As an example, if we replace the expression a + b in Figure 12.7 by a / b and the speculative code motion is performed, it may cause a run-time divide-by-zero fault after the speculation, because b can be 0 at the loop header while it is never 0 in the branch that contains a / b inside the loop body. In Figure 12.8, the program contains a non-zero test for b. We define an additional tau operand for the divide operation in a / b in SSAPRE to provide the information whether a non-zero test for b is available. At the start of the region guarded by the non-zero test for b, the compiler inserts the definition of tau1 with the abstract right-hand-side value tauedge. The definitions of the tau variables have abstract right-hand-side values that are not allowed to be involved in any optimization; because they are abstract, they are also omitted in the generated code after the SSAPRE phase. Any appearance of a / b in the region guarded by the non-zero test for b will have tau1 as its tau operand.

Having a defined tau operand allows a/b to be freely speculated in the region guarded by the non-zero test, while the definition of tau1 prevents any hoisting of a/b past the non-zero test.

Fig. 12.8 Speculative and fault-safe loop-invariant code motion

12.5 Register Promotion via PRE

Variables and most data in programs normally start out residing in memory. It is the compiler's job to promote these memory contents to registers as much as possible to speed up program execution. Load and store instructions have to be generated to transfer contents between memory locations and registers. But when these memory contents are implicitly accessed through aliasing, it is necessary that their values are up-to-date with respect to their assigned registers. The compiler also has to deal with the limited number of physical registers and find an allocation that makes the best use of them. Instead of solving these problems all at once, we can tackle them as two smaller problems separately:

1. Register promotion — We assume there is an unlimited number of registers, called pseudo-registers, symbolic registers or virtual registers. Register promotion will allocate variables to pseudo-registers whenever possible and optimize the placement of the loads and stores that transfer their values between memory and registers.
2. Register allocation — This phase will fit the unlimited number of pseudo-registers to the limited number of real or physical registers. When it runs out of registers due to high register pressure, it will generate code to spill contents from registers back to memory.
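A minimal C sketch (ours) of what register promotion, the first of the two problems above, achieves for an aliased global: the per-iteration loads and stores of g are replaced by a pseudo-register r, with one load before the loop and one store after it.

    int g;                       /* a global: aliased, lives in memory */

    void count_orig(int n) {
        for (int i = 0; i < n; i++)
            g = g + i;           /* one load and one store of g per iteration */
    }

    void count_promoted(int n) {
        int r = g;               /* load placed before the loop               */
        for (int i = 0; i < n; i++)
            r = r + i;           /* r acts as the pseudo-register for g       */
        g = r;                   /* store placed after the loop (note: if the */
    }                            /* loop may run zero times, this final store */
                                 /* is itself speculative)                    */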

In this chapter, we only address the register promotion problem, because it can be cast as a redundancy elimination problem.

12.5.1 Register Promotion as Placement Optimization

Variables with no aliases are trivial register promotion candidates. They include the temporaries generated during PRE to hold the values of redundant computations. Variables in the program can also be determined via compiler analysis or by language rules to be alias-free. Because pseudo-registers have no alias, assigning these trivial candidates to registers involves only renaming them: we can just rename them to unique pseudo-registers, and no load or store needs to be generated.

Our register promotion is mainly concerned with scalar variables that have aliases, indirectly accessed memory locations and program constants. A scalar variable can have aliases whenever its address is taken, or if it is a global variable, since it can be accessed by function calls. In program representations, loads can either be indirect through a pointer or direct. Direct loads correspond to scalar variables in the program. Indirect loads are automatically covered by the PRE for expressions. A program constant is a register promotion candidate whenever some operation is involved to put its value in a register and its register residence is required for its value to be used during execution. In the case of program constants, our use of the term load will extend to refer to the computation operation performed to put the constant value in a register.

From the point of view of redundancy, loads are like expressions because the later occurrences are the ones to be deleted. For stores, the reverse is true: the earlier stores are the ones to be deleted, as is evident in the examples of Figure 12.9(a) and (b). The PRE of stores, also called partial dead code elimination, can thus be treated as the dual of the PRE of loads. Performing PRE of stores thus has the effect of moving stores forward while inserting them as early as possible. The PRE of stores does not apply to program constants. Since the goal of register promotion is to obtain the most efficient placements for the loads and stores, register promotion can be modeled as two separate problems: PRE of loads, followed by PRE of stores. Combining the effects of the PRE of loads and stores results in optimal placements of loads and stores while minimizing the live ranges of the pseudo-registers, by virtue of the computational and lifetime optimalities of our PRE algorithm.
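A small C sketch (ours) mirroring the duality illustrated by Figure 12.9: the later of two loads is redundant, whereas the earlier of two stores is the (partially) dead one.

    static int  x;
    static void use(int v) { (void)v; }

    static void loads(void) {
        use(x);            /* first load of x                          */
        use(x);            /* later load: fully redundant, deletable   */
    }

    static void stores(int c) {
        x = 1;             /* earlier store: partially dead, because   */
        if (c)             /* it is overwritten only on the path       */
            x = 2;         /* where c is non-zero                      */
    }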

12.5.2 Load Placement Optimization

PRE applies to any computation, including loads from memory locations and creation of constants. Since our input program representation is in HSSA form, the aliasing that affects the scalar variables is completely modeled by the χ and µ operations. In our tree representation, both direct loads and constants are leaves in the expression trees. When we apply SSAPRE to direct loads, the FRG corresponds somewhat to the variable's SSA graph, since the hypothetical temporary h can be regarded as the candidate variable itself, so the Φ-insertion step and Rename step can be streamlined.

Fig. 12.9 Duality between load and store redundancies: (a) loads of x; (b) stores to x

When working on the PRE of memory loads, it is important to also take into account the stores, which we call l-value occurrences. A store of the form x ← <expr> can be regarded as being made up of the sequence:

    r ← <expr>
    x ← r

Because the pseudo-register r contains the current value of x, any subsequent occurrences of the load of x can reuse the value from r, and thus can be regarded as redundant. Figure 12.10 gives examples of loads made redundant by stores.

Fig. 12.10 Redundant loads after stores

When we perform the PRE of loads, we thus include the store occurrences into consideration. The Φ-insertion step will insert Φ's at the iterated dominance frontiers of store occurrences. In the Rename step, a store occurrence is always given a new h-version.
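The decomposition of a store can be written directly in C; the sketch below (our own names) shows how splitting x ← <expr> into r ← <expr>; x ← r makes the following load of x reusable from r.

    static int  x;
    static int  expr(void) { return 42; }   /* stand-in for <expr>          */
    static void use(int v) { (void)v; }

    static void before(void) {
        x = expr();
        use(x);            /* load of x: redundant with the store above     */
    }

    static void after(void) {
        int r = expr();    /* r <- <expr>                                   */
        x = r;             /* x <- r : r holds the current value of x       */
        use(r);            /* the later load of x is replaced by r          */
    }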

Because a store is a definition, any subsequent load renamed to the same h-version is redundant with respect to the store.

Fig. 12.11 Register promotion via load PRE followed by store PRE (r is a register, a is a memory location)

The example in Figure 12.11 illustrates what we discuss in this section. We apply the PRE of loads first, followed by the PRE of stores. This ordering is based on the fact that the PRE of loads is not affected by the results of the PRE of stores. In addition, the PRE of loads creates more opportunities for the PRE of stores by deleting loads that would otherwise have blocked the movement of stores. During the PRE of loads (LPRE), a = is regarded as a store occurrence. The occurrence of a = causes r to be updated by splitting the store into the two statements r = followed by a = r. The hoisting of the load of a to the loop header does not involve speculation.

12.5.3 Store Placement Optimization

As mentioned earlier, speculation is required for the PRE of loads and stores in order for register promotion to do a decent job in loops. In Figure 12.11, speculation is needed to sink a = to outside the loop because the store occurs in a branch inside the loop. Without performing LPRE first, the load of a inside the loop would have blocked the sinking of a =.

SPRE is the dual of LPRE: in the PRE of stores, in the presence of store redundancies, the earlier occurrences are the redundant ones. Code motion in SPRE will have the effect of moving stores forward with respect to the control flow graph. Any presence of (aliased) loads has the effect of blocking the movement of stores or rendering the earlier stores non-redundant.
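For concreteness, a C sketch (ours) of the combined effect of LPRE followed by SPRE on the kind of loop shown in Figure 12.11: a lives in a register inside the loop, with the load hoisted above it and the store sunk, speculatively, below it.

    static int  a;
    static void use(int v) { (void)v; }

    static void before(int n, int cond) {
        for (int i = 0; i < n; i++) {
            if (cond)
                a = i;       /* store to a inside a branch             */
            use(a);          /* load of a on every iteration           */
        }
    }

    static void after(int n, int cond) {
        int r = a;           /* LPRE: load hoisted to the header (not   */
                             /* speculative: a is read every iteration) */
        for (int i = 0; i < n; i++) {
            if (cond)
                r = i;       /* the store becomes a register update     */
            use(r);
        }
        a = r;               /* SPRE: store sunk out of the loop; this  */
    }                        /* is speculative, since it also executes
                                when the branch was never taken         */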

To apply the dual of the SSAPRE algorithm, it is necessary to compute a program representation that is the dual of the SSA form, and we call this the static single use (SSU) form. In SSU, use-def edges are factored at divergence points in the control flow graph. We call this factoring operator Σ. The left-hand side of the Σ is the multiple definitions, each of which is post-dominated by their single uses [reference to SSI chapter]. Each use of a variable establishes a new version (we say the load uses the version), and every store reaches exactly one load. The Σ is regarded as a use of a new version of the variable; the use post-dominates all the stores of its version.

We call our store PRE algorithm SSUPRE, which is made up of the corresponding steps in SSAPRE. Σ-Insertion and SSU-Rename construct the SSU form for the variable whose store is being optimized. The data flow analyses consist of UpSafety to compute the upsafe (fully available) attribute, CanBeAnt to compute the can_be_ant attribute, and Earlier to compute the earlier attribute. Though store elimination itself does not require the introduction of temporaries, lifetime optimality still needs to be considered for the temporaries introduced in the LPRE phase, which hold the values to the point where the stores are placed: it is desirable not to sink the stores too far down.

Figure 12.12 gives an example program, with (b) being the SSU representation for the program in (a), and (c) showing the result of applying SSUPRE to the code. The store can be sunk to outside the loop only if it is also inserted in the branch inside the loop that contains the load. The optimized code no longer exhibits any store redundancy.

Fig. 12.12 Example of program in SSU form and the result of applying SSUPRE: (a) original code; (b) a in SSU form; (c) after SPRE

12.6 Value-based Redundancy Elimination

The redundant computations stored into a temporary introduced by PRE may be of different static values, because the same lexically identified expression may yield different results along different code paths in the program. The PRE algorithm we have described so far is not capable of recognizing redundant computations among lexically different expressions that yield the same value.

In this section, we discuss redundancy elimination based on value analysis.

12.6.1 Value Numbering

The term value number originated from the hash-based method developed by Cocke and Schwartz for recognizing when two expressions evaluate to the same value within a basic block. The local algorithm for value numbering conducts a scan down the instructions in a basic block, assigning value numbers to the expressions. An expression tree is hashed recursively bottom-up, starting with the leaf nodes. Each internal node is hashed based on its operator and the value numbers of its operands. The value number of an expression tree can be regarded as the index of its hashed entry in the hash table. At an assignment, the assigned variable will be assigned the value number of the right-hand-side expression. The assignment will also cause any value number that refers to that variable to be killed. For example, the program code in Figure 12.13(a) will result in the value numbers v1, v2 and v3 shown in Figure 12.13(b). Note that variable c is involved with both value numbers v2 and v3, because it has been re-defined.

Fig. 12.13 Value numbering in a local scope: (a) code in a basic block: b ← c; c ← a + c; d ← a + b; (b) value numbers: v1: a; v2: b, c; v3 = v1 + v2: d

SSA form enables value numbering to be easily extended to the global scope, called global value numbering (GVN), because each SSA version of a variable corresponds to at most one static value for the variable. In the example of Figure 12.14, the statements processed are b1 ← c1; c2 ← a1 + c1; d1 ← a1 + b1, yielding the value numbers v1: a1; v2: b1, c1; v3 = v1 + v2: c2, d1.

Fig. 12.14 Global value numbering on SSA form: (a) statements processed; (b) value numbers

One subtlety regards the φ's. When we value number a φ, we would like the value numbers for its operands to have been determined already. One strategy is to perform the global value numbering by visiting the nodes in the control flow graph in a reverse postorder traversal of the dominator tree. This traversal strategy can minimize the instances when a φ operand has an unknown value number when the φ itself needs to be value-numbered, except in the case of back edges arising from loops.
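The following is a minimal, self-contained C sketch of local value numbering over the basic block of Figure 12.13(a); it uses a linear table lookup instead of a real hash table, and the data-structure and function names are ours. The point is only that a + c and a + b hash to the same entry once b has been given c's value number.

    #include <stdio.h>

    #define MAX 64
    typedef struct { char op; int l, r; } Entry;     /* one table entry          */
    static Entry table[MAX];
    static int   n_entries = 0;
    static int   vn_of_var[26];                      /* value number of 'a'..'z' */

    static int value_number(char op, int l, int r) { /* lookup or create         */
        for (int i = 0; i < n_entries; i++)
            if (table[i].op == op && table[i].l == l && table[i].r == r)
                return i;
        table[n_entries] = (Entry){ op, l, r };
        return n_entries++;
    }

    /* handle "x = a op b": x receives the value number of the right-hand side */
    static int assign(char x, char a, char op, char b) {
        int vn = value_number(op, vn_of_var[a - 'a'], vn_of_var[b - 'a']);
        vn_of_var[x - 'a'] = vn;
        return vn;
    }

    int main(void) {
        for (int v = 0; v < 26; v++)                 /* leaves: one number each  */
            vn_of_var[v] = value_number('v', v, -1);
        /* the block of Figure 12.13(a): b <- c ; c <- a + c ; d <- a + b */
        vn_of_var['b' - 'a'] = vn_of_var['c' - 'a']; /* copy: b shares c's number */
        int vc = assign('c', 'a', '+', 'c');
        int vd = assign('d', 'a', '+', 'b');
        printf("c and d share a value number: %s\n", vc == vd ? "yes" : "no");
        return 0;
    }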

For example, in this loop:

    i1 ← 0; j1 ← 0;
    do {
        i2 ← φ(i3, i1);  j2 ← φ(j3, j1);
        i3 ← i2 + 4;     j3 ← j2 + 4;
    } while <cond>;

when we try to hash a value number for either of the two φ's, the value numbers for i3 and j3 are not yet determined. When this arises, we have no choice but to assign a new value number to the φ. As a result, we create different value numbers for i2 and j2. This makes the above algorithm unable to recognize that i2 and j2 can be given the same value number, or that i3 and j3 can be given the same value number.

The above hash-based value numbering algorithm can be regarded as pessimistic, because it will not assign the same value number to two different expressions unless it can prove they compute the same value. There is a different approach to performing value numbering that is not hash-based and is optimistic. It does not depend on any traversal over the program's flow of control, and so is not affected by the presence of back edges. The algorithm partitions all the expressions in the program into congruence classes. Expressions in the same congruence class are considered equivalent because they evaluate to the same static value, and each congruence class is assigned a unique value number. The algorithm is optimistic because, when it starts, it assumes all expressions that have the same operator to be in the same congruence class. As the algorithm iterates, the congruence classes are subdivided into smaller ones while the total number of congruence classes increases. The algorithm terminates when no more subdivision can occur. At that point, the set of congruence classes in this final partition will represent all the values in the program that we care about.

Given two expressions within the same congruence class, if their operands at the same operand position belong to different congruence classes, the two expressions may compute to different values, and thus should not be in the same congruence class. This is the criterion for subdividing a class s, because those members of s that do not have a member of c at their p-th operand potentially compute to a different value. The details of this partitioning algorithm are shown in Figure 12.15. At line 8, (s ∩ touched) being not empty implies s uses a member of c as its p-th operand. If (s ∩ touched) is a proper subset of s, it implies some member of s does not have a member of c as its p-th operand. This is the subdivision criterion. After the partition into n and the new s, if s is not in the worklist (i.e., processed already), the partitioning was already stable with respect to the old s, and we can add either n or the new s to the worklist

for the partitioning algorithm to re-stabilize with respect to that split of s; but in this case, choosing the smaller one results in less overhead.

Fig. 12.15 The partitioning algorithm
     1: Place expressions with the same operator in the same congruence class
     2: N ← maximum number of operands among all operators in the program
     3: worklist ← set of all congruence classes
     4: while worklist ≠ {}
     5:   Select and delete an arbitrary congruence class c from worklist
     6:   for (p = 0; p < N; p++)
     7:     touched ← set of expressions in the program that have a member of c in position p
     8:     for each class s such that (s ∩ touched) is not empty and (s ∩ touched) ⊂ s
     9:       Create a new class n ← s ∩ touched
    10:       s ← s − n
    11:       if s ∈ worklist
    12:         Add n to worklist
    13:       else
    14:         Add smaller of n and s to worklist

While this partition-based algorithm is not obstructed by the presence of back edges, it does have its own deficiencies. Because it has to consider one operand position at a time, it is not able to apply commutativity to detect more equivalences. Since it is not applied bottom-up with respect to the expression tree, it is not able to apply algebraic simplifications while value numbering. To get the best of both the hash-based and the partition-based algorithms, it is possible to apply both algorithms independently and then combine their results together to shrink the final set of value numbers.

12.6.2 Redundancy Elimination under Value Numbering

So far, we have discussed finding computations that compute to the same values, but we have not addressed eliminating the redundancies among them. Two computations that compute to the same value exhibit redundancy only if there is a control flow path that leads from one to the other. An obvious approach is to consider PRE for each value number separately. A temporary t will be introduced to store the redundant computations for each value number. If there are φ's introduced for the t, they will be merging identical values, so its value will stay the same throughout the lifetime of t. A subset of such φ's is expected to come from PRE's insertions, and we know from experience that such φ's are rare, which implies that insertions introduced by value-number-based PRE are also rare.

Value-number-based PRE also has to deal with the additional issue of how to generate an insertion. Because the same value can come from different forms of expressions at different points in the program, it is necessary to determine which form to use at each insertion point. If the insertion point is outside the live range

of any variable version that can compute that value, then the insertion point has to be disqualified. By regarding the live range of each SSA version to extend from its definition to program exit, we allow its value to be used whenever convenient. Due to this complexity, and the expectation that strictly partial redundancy is rare among computations that yield the same value, it is sufficient to perform only full redundancy elimination among computations that have the same value number.

But it is possible to broaden the scope and consider PRE among lexically identical expressions and value numbers at the same time. This hybrid approach can be implemented based on an adaptation of the SSAPRE framework. In this hybrid approach, it is best to relax our restriction on the style of program representation described in Section 12.1. The program representation can even be in the form of triplets, in which the result of every operation is immediately stored in a temporary. By not requiring Conventional SSA Form, we can more effectively represent the flow of values among the program variables¹. The Φ-Insertion step will be driven by the presence of φ's for the program variables, since each φ in the input can be viewed as merging different value numbers from the predecessor blocks to form a new value number. The FRGs can be formed from some traversal of the program code. Each FRG can be regarded as a representation of the flow and merging of computed values, based on which PRE can be performed by applying the remaining steps of the SSAPRE algorithm.

¹ We would remove the restriction that the result and operands of each φ have to be SSA versions derived from the same original program variable.

12.7 Additional Reading

The concept of partial redundancy was first introduced by Morel and Renvoise. In their seminal work [193], Morel and Renvoise showed that global common subexpressions and loop-invariant computations are special cases of partial redundancy, and they formulated PRE as a code placement problem. The PRE algorithm developed by Morel and Renvoise involves bi-directional data flow analysis, which incurs more overhead than uni-directional data flow analysis. In addition, their algorithm does not yield optimal results in certain situations. A better placement strategy, called lazy code motion (LCM), was later developed by Knoop et al. [161][163]. It improved on Morel and Renvoise's results by avoiding unnecessary code movements, by removing the bi-directional nature of the original PRE data flow analysis, and by proving the optimality of their algorithm. After lazy code motion was introduced, there have been alternative formulations of PRE algorithms that achieve the same optimal results but differ in the formulation approach and implementation details [99][98][206][280].

The above approaches to PRE are all based on encoding program properties in bit vector form and on the iterative solution of data flow equations. Since the bit vector representation uses basic blocks as its granularity, a separate algorithm is needed to detect and suppress local common subexpressions. Chow et al. [70][155] came up with the first SSA-based approach to perform PRE. Their SSAPRE algorithm is an adaptation of LCM that takes advantage of the use-def information inherent in SSA. It avoids having to encode data flow information in bit vector form, taking advantage of SSA's sparse representation so that fewer steps are needed to propagate data flow information, and it eliminates the need for a separate algorithm to suppress local common subexpressions. Their algorithm was the first to make use of SSA to solve data flow problems for expressions in the program. The SSAPRE algorithm thus brings the many desirable characteristics of SSA-based solution techniques to PRE.

VanDrunen and Hosking proposed their anticipation-based SSAPRE (A-SSAPRE), which removes the requirement of Conventional SSA Form and is best for program representations in the form of triplets [266]. Their algorithm determines optimization candidates and constructs FRGs via a depth-first, preorder traversal over the basic blocks of the program. Within each FRG, non-lexically identical expressions are allowed, as long as there are potential redundancies among them.

In the area of speculative PRE, Murphy et al. introduced the concept of fault-safety and used it in the SSAPRE framework for the speculation of dangerous computations [195]. When execution profile data are available, it is possible to tailor the use of speculation to maximize run-time performance for the execution that matches the profile. Xue and Cai presented a computationally and lifetime optimal algorithm for speculative PRE based on profile data [279]. Their algorithm uses bit-vector-based data flow analysis and applies minimum cut to flow networks formed out of the control flow graph to find the optimal code placement. Zhou et al. apply the minimum cut approach to flow networks formed out of the FRG in the SSAPRE framework to achieve the same computational and lifetime optimal code motion [285]. They showed that their sparse approach based on SSA results in smaller flow networks, enabling the optimal code placements to be computed more efficiently.

Lo et al. showed that register promotion can be achieved by load placement optimization followed by store placement optimization [174]. Other optimizations can potentially be implemented using the SSAPRE framework. They include operator strength reduction [162] and linear function test replacement [156]. Moreover, PRE has traditionally provided the context for integrating additional optimizations into its framework, like code hoisting, register shrink-wrapping [72] and live range shrinking.

Hash-based value numbering originated from Cocke and Schwartz [77], and Rosen et al. extended it to global value numbering based on SSA [228]. The partition-based algorithm was developed by Alpern et al. [9]. Briggs et al. presented refinements to both the hash-based and partition-based algorithms [46], including applying the hash-based method in a postorder traversal of the dominator tree. VanDrunen and Hosking subsequently presented their Value-based Partial Redundancy Elimination algorithm (GVN-PRE),

which they claim subsumes both PRE and GVN [267]. But in this algorithm, the PRE part is not SSA-based, because it uses a system of data-flow equations at the basic block level to solve for insertion points.


Part III
Extensions

Progress: 48%


Progress: 88%
TODO: Aimed at experts, but with a track for beginners.
TODO: Section on SigmaSSA?
TODO: Section on Dynamic SSA?


At an instruction that defines variable a and uses variables b and c (say). say a == 0. However. these richer flavors of extended SSA-based analyses still retain many of the benefits of SSA form (e. even though both uses have the same reaching definition. The extensions arise from the fact that many analyses need to make finer-grained distinctions between program points and data accesses than is achieved by vanilla SSA form. the information for b and c is propagated directly to the instruction from the definitions of those variables. SSA-based constant propagation aims to compute for each single assignment variable. we have introduced the foundations of SSA form and its use in different program analyses.. For example. Like- 155 . we now have additional information that can distinguish the value of a in the then and else parts. sparse data flow propagation) which distinguish them from classical data flow frameworks for the same analysis problems. As an example. consider an if-then-else statement with uses of a in both the then and else parts. However. Rastello Progress: 50% Review in progress Thus far. there are many analyses for which definition points are not the only source of new information. We now motivate the need for extensions to SSA form to enable a larger class of program analyses.CHAPTER 13 Introduction V. If the branch condition involves a . the (usually over-approximated) set of possible values carried by the definition for that variable.g. Static Single Information form The sparseness in vanilla SSA form arises from the observation that information for an unaliased scalar variable can be safely propagated from its (unique) definition to all its reaching uses without examining any intervening program points. Sarkar F.

Note that MayDef functions can result in the creation of new names (versions) of variables. These versions are grouped together into a single version called the zero version of the variable. The goal of Chapter 14 is to present a systematic approach to dealing with these additional sources of information. both of which establish orthogonality among transfer functions for renamed variables. F. SSI form can provide distinct names for the two uses of a in the then and else parts of the if-then-else statement mentioned above. Information can be propagated in the forward and backward directions using SSI form..156 13 Introduction — (V. a static use or definition of indirect memory access ∗p in the C language could represent the use or definition of multiple local variables whose addresses can be taken and may potentially be assigned to p along different control flow paths. The sparseness of SSI form follows from formalization of the Partitioned Variable Lattice (PVL) problem and the Partitioned Variable Problem (PVP). This additional renaming is also referred to as live range splitting. The φ functions inserted at join nodes in vanilla SSA form are complemented by σ functions in SSI form that can perform additional renaming at branch nodes and interior (instruction) nodes. the fact that p is non-null because no exception was thrown at the first use. Hashed SSA (HSSA) form introduced in Chapter 15 addresses another important requirement viz.e. compared to vanilla SSA form. do not appear in a real instruction outside of a φ . This approach is called Static Single Information (SSI) form. For example. Hashed SSA form The motivation for SSI form arose from the need to perform additional renaming to distinguish among different uses of an SSA variable. the need to model aliasing among variables. enabling analyses such as range analysis (leveraging information from branch conditions) and null pointer analysis (leveraging information from prior uses) to be performed more precisely than with vanilla SSA form.. Rastello) wise. akin to the idea behind splitting live ranges in optimized register allocation. As summarized above. a major concern with HSSA form is that its size could be quadratic in the size of the vanilla SSA form. For example. To represent aliasing of local variables. HSSA form extends vanilla SSA form with MayUse (µ) and MayDef (χ ) functions to capture the fact that a single static use or definition could potentially impact multiple variables. while still retaining the space and time benefits of vanilla SSA form. It is capable of representing the output of any alias analysis performed as a pre-pass to HSSA construction. Sarkar. since each use or definition can now be augmented by a set of MayUse’s and MayDef’s respectively. and it involves additional renaming of variables to capture new sources of information.. A heuristic approach to dealing with this problem is to group together all variable versions that have no “real” occurrence in the program i. the use of a reference variable p in certain contexts can be the source of new information for subsequent uses e.g. µ or χ function. HSSA form does not take a position on the accuracy of alias analysis that it represents. .

However. type information can be used to obtain a partitioning of the heap into disjoint subsets e. To provide A [k ] with a single reaching definition. Array SSA form builds on the observation that all indirect memory operations can be modeled as accesses to elements of abstract arrays that represent disjoint subsets of the heap. Array SSA form (Chapter 16) takes an alternate approach to modeling aliasing of indirect memory operations by focusing on aliasing in arrays as its foundation. even in the absence of pointers and heap-allocated data structures. For example. the Hashed SSA name in HSSA form comes from the use of hashing in most value-numbering algorithms. The aliasing problem for arrays is manifest in the fact that accesses to elements A [i ] and A [ j ] of array A refer to the same location when i = j . Array SSA form introduces a definition-Φ (d Φ) operator that represents the merge of A 2 [ j ] with the prevailing value of array A prior to A 2 .g. Further. The accuracy of analyses for Array SSA form depends on the accuracy with which pairs of array subscripts can be recognized as being definitely-same ( ) or definitely-different ( ). which serves as the single definition that reaches use A [k ] (which can then be renamed to A 3 [k ]). the reaching definition for A [k ] will be A 2 or A 1 if k == j or k == i (or a prior definition A 0 if k = j and k = i ). Consider a program with a definition of A [i ] followed by a definition of A [ j ]. In fact. as illustrated by the algorithm for sparse constant propagation of array elements presented in this chapter. The vanilla SSA approach can be used to rename these two definitions to (say) A 1 [i ] and A 2 [ j ].13 Introduction — (V. This aliasing can occur with just local array variables. F. the reaching definition for this use can be uniquely identified in vanilla SSA form. This extension enables sparse data flow propagation algorithms developed for vanilla SSA form to be applied to array variables. Global value numbering is used to increase the effectiveness of the virtual variable approach. Rastello) 157 In addition to aliasing of locals. HSSA form addresses this possibility by introducing a virtual variable for each address expression used in an indirect memory operation. and renaming virtual variables with φ functions as in SSA form. The challenge with arrays arises when there is a subsequent use of A [k ]. for array variables. ∗p and ∗q may refer to the same location in the heap. the reaching definition depends on the subscript values. Array SSA form In contrast to HSSA form. In this example. thereby leading to the insertion of µ or χ functions for virtual variables as well. it is important to handle the possibility of aliasing among heap-allocated variables. Sarkar. For modern objectoriented languages like Java. To model heap-allocated objects. The result of this d Φ operator is given a new name. instances of field x are guaranteed to be . the alias analysis pre-pass is expected to provide information on which virtual variables may potentially be aliased.. A 3 (say). even if no aliasing occurs among local variables. since all indirect memory accesses with the same address expression can be merged into a single virtual variable (with SSA renaming as usual). For scalar variables.

To model heap-allocated objects, Array SSA form builds on the observation that all indirect memory operations can be modeled as accesses to elements of abstract arrays that represent disjoint subsets of the heap. For modern object-oriented languages like Java, type information can be used to obtain a partitioning of the heap into disjoint subsets, e.g., instances of field x are guaranteed to be disjoint from instances of field y. Thus, the set of instances of field x can be modeled as a logical array (map) x that is indexed by the object reference (key). The problem of resolving aliases among field accesses p.x and q.x then becomes equivalent to the problem of resolving aliases among array accesses x[p] and x[q], thereby enabling Array SSA form to be used for analysis of objects, as in the algorithm for redundant load elimination among object fields presented in that chapter. For weakly-typed languages such as C, the entire heap can be modeled as a single heap array. (TODO: check if this is correct.) As in HSSA form, an alias analysis pre-pass can be performed to improve the accuracy of definitely-same and definitely-different information; in particular, global value numbering is used to establish definitely-same relationships, analogous to its use in establishing equality of address expressions in HSSA form. In contrast to HSSA form, the size of Array SSA form is guaranteed to be linearly proportional to the size of a comparable scalar SSA form for the original program (if all array variables were treated as scalars for the purpose of the size comparison).

Psi-SSA form

Psi-SSA form (Chapter 17) addresses the need for modeling static single assignment form in predicated operations. A predicated operation is an alternate representation of a fine-grained control flow structure, often obtained by using the well-known if-conversion transformation. A key advantage of using predicated operations in a compiler's intermediate representation is that it can often enable more optimizations by creating larger basic blocks, compared to approaches in which predicated operations are modeled as explicit control flow graphs. From an SSA form perspective, the challenge is that a predicated operation may or may not update its definition operand, depending on the value of the predicate guarding that assignment. This challenge is addressed in Psi-SSA form by introducing ψ functions that perform merge functions for predicated operations, analogous to the merge performed by φ functions at join points in vanilla SSA form. In general, a ψ function has the form a0 = Psi(p1?a1, ..., pi?ai, ..., pn?an), where each input argument ai is associated with a predicate pi, as in a nested if-then-else expression for which the value returned is the rightmost argument whose predicate is true. If all predicates are false, then the assignment is a no-op and the result is undefined. (TODO: check if this is correct.) A number of algebraic transformations can be performed on ψ functions, including ψ-inlining, ψ-reduction, ψ-projection, ψ-permutation and ψ-promotion. Chapter 17 also includes an algorithm for transforming a program out of Psi-SSA form that extends the standard algorithm for destruction of vanilla SSA form.

Gated Single Assignment form

TO BE COMPLETED

Control dependences

Technically, we say that SSA provides data flow (data dependences); the goal here is to enrich it with control dependences. SSA goes with a control flow graph that inherently imposes some ordering: we know well how to schedule instructions within a basic block, and we know how to extend this to hyper-blocks. However, we believe that, so as to expose parallelism and locality, one needs to get rid of the CFG at some points. To illustrate this point:

1. For loop transformations, software pipelining, etc., one needs to manipulate a higher degree of abstraction to represent the iteration space of nested loops and to extend data flow information to this abstraction. In particular, loop-carried control dependences complicate the recognition of possible loop transformations: it is usually better to represent loops and their corresponding iteration space using a dedicated abstraction.

2. One can expose even more parallelism (at the level of instructions) by replacing control flow by control dependences: the goal is either to express a predicate expression under which a given basic block is to be executed, or to select afterward (using similar predicate expressions) the correct value among a set of eagerly computed ones.

Program dependence graphs (PDG) constitute the basis of such IR extensions. Gated-SSA (GSA) provides an interpretable (data or demand driven) IR that uses this concept. Psi-SSA is a very similar IR, but more appropriate to code generation for architectures with predication. Note that such extensions sometimes face difficulties in handling loops correctly (the need to avoid deadlock between the loop predicate and the computation of the loop body, to replicate the behavior of infinite loops, etc.). PDG and GSA are described in Chapter 18; Psi-SSA is described in Chapter 17.

Memory based data-flow

SSA provides data flow / dependences between scalar variables. Execution order of side-effect instructions must also be respected. Indirect memory accesses can be considered very conservatively as such, and lead to (sometimes called state, see Chapter 18) dependence edges. Too conservative dependences annihilate the potential of optimizations, and alias analysis is the first step toward more precise dependence information. Representing this information efficiently in the IR is important. One could simply add a dependence (or a flow arc) between two "consecutive" instructions that may or must alias. Then, in the same spirit as SSA phi-nodes, which aim at combining the information as early as possible (as opposed to standard def-use chains, see the discussion in Chapter 2), similar nodes can be used for memory dependences. To illustrate this point, consider the following sequential code, where p and q may alias: *p=...; *q=...; ...=*p; ...=*p. Without the use of phi-nodes, the amount of def-use chains required to link the assignments to their uses would be

quadratic (4 here). Hence the usefulness of generalizing SSA and its phi-node for scalars to handle memory accesses for sparse analyses. HSSA (see Chapter 15) and Array-SSA (see Chapter 16) are two different implementations of this idea. One has to admit that, if this early combination is well suited for analysis or interpretation, the introduction of a phi-function might add a control dependence to an instruction that would not exist otherwise. In other words, only simple loop-carried dependences can be expressed this way. Let us illustrate this point using a simple example: for i { A[i]=f(A[i-2]) }, where the computation at the iteration indexed by i + 2 accesses the value computed at the iteration indexed by i. Suppose we know that f(x) > x and that prior to entering the loop A[*] ≥ 0. Then an SSA-like representation

    for i { A2=phi(A0,A1); A1=PHI(j==i: A1[j]=f(A2[j-2]), j!=i: A1[j]=A2[j]) }

would easily allow for the propagation of the information that A2 ≥ 0. On the other hand, by adding this phi-node, it becomes difficult to devise that iterations i and i + 1 can be executed in parallel: the phi-node adds a loop-carried dependence. If one is interested in performing more sophisticated loop transformations than just exposing fully parallel loops (such as loop interchange, loop tiling or multidimensional software pipelining), then (Dynamic) Single Assignment forms should be his friend. There exist many formalisms, including Kahn Process Networks (KPN) or Fuzzy Data Flow Analysis (FADA), that implement this idea, but each time restrictions apply. This is part of the huge research area of automatic parallelization, outside of the scope of this book. For further details we refer to the corresponding Encyclopedia of Parallel Computing [205].
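A runnable version of the loop discussed above may help; this is our own instantiation, with f(x) = x + 1 so that f(x) > x holds. The dependence distance is 2, so iterations i and i + 1 touch disjoint elements and could in principle run in parallel, which the phi-based memory representation above obscures.

    #define N 100
    void smooth(int A[N]) {
        for (int i = 2; i < N; i++) {
            A[i] = A[i - 2] + 1;   /* stands for A[i] = f(A[i-2]);          */
        }                          /* if A[*] >= 0 on entry, it stays >= 0  */
    }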

CHAPTER 14
Static Single Information Form
F. Pereira, F. Rastello

Progress: 90%. Text and figures formatting in progress.

14.1 Introduction

The objective of a data-flow analysis is to discover facts that are true about a program. We call such facts information. Using the notation introduced in Section 8, an information is an element of the data-flow lattice. For example, the information that concerns liveness analysis is the set of variables alive at a certain program point. Similarly to liveness analysis, many other classic data-flow approaches bind information to pairs formed by a variable and a program point.

In Chapter 8 we have shown how SSA form allows us to solve sparse forward data-flow problems such as constant propagation. In the particular case of constant propagation, the SSA format lets us assign to each variable the invariant – or information – of being constant or not. The SSA intermediate representation gives us this invariant because it splits the live ranges of variables, in such a way that each variable name is defined only once. However, if an invariant occurs for a variable v at any program point where v is alive, then we can associate this invariant directly with v. If a program's intermediate representation guarantees this correspondence between information and variable for every variable, then we say that the program representation provides the Static Single Information (SSI) property. Now we will show that live range splitting can also provide the SSI property not only to forward, but also to backward data-flow analyses.

Different data-flow analyses might extract information from different program facts. Therefore, a program representation may afford the SSI property to some data-flow analyses, but not to all of them. For instance, the SSA form naturally provides the SSI property to the reaching definition analysis. Indeed, the SSA form provides the static single information property to any data-flow analysis that obtains information at the definition sites of variables. These analyses and transformations include copy and constant propagation, as illustrated in Chapter 8. However, for a data-flow analysis that derives information from the use sites of variables, such as the class inference analysis described in Section 14.2.6, the information associated with v might not be unique along its entire live range: in that case the SSA form does not provide the SSI property.

There exist extensions of the SSA form that provide the SSI property to more data-flow analyses than the original SSA does. Two classic examples, that we will detail further, are the Extended-SSA (e-SSA) form and the Static Single Information (SSI) form. The e-SSA form provides the SSI property to analyses that take information from the definition sites of variables, and also from conditional tests where these variables are used. The SSI form provides the static single information property to data-flow analyses that extract information from the definition sites of variables, from conditional tests where these variables are used, and from last use sites (which we define later). These different intermediate representations rely on a common strategy to achieve the SSI property: live range splitting. In this chapter we show how to use live range splitting to build program representations that provide the static single information property to different types of data-flow analyses.

14.2 Static Single Information

The goal of this section is to define the notion of Static Single Information, and to explain how it supports the sparse data-flow analyses discussed in Chapter 8. With this purpose, we revisit the concept of sparse analysis in Section 14.2.1. There exists a special class of data-flow problems, which we call Partitioned Variable Lattice (PVL), that fits in this chapter's sparse data-flow framework very well. We will look more carefully into these problems in Section 14.2.2. The intermediate program representations that we discuss in this chapter let us provide the Static Single Information property – formalized in Section 14.2.3 – to any PVL problem. This sparse framework is very broad: many well-known data-flow problems are partitioned lattice, as we will see in the examples from Section 14.2.6. And the algorithms that we give in Section 14.2.5 let us solve sparsely any data-flow problem that has the SSI property.

14.2.1 Sparse Analysis

Traditionally, non-relational data-flow analyses bind information to pairs formed by a variable and a program point. Let's consider, as an example, the problem of estimating the interval of values that any integer variable may assume throughout the execution of a program. An algorithm that solves this problem is called a range analysis. A traditional implementation of this analysis would find, for each pair formed by a variable v plus a program point i, the interval of possible values that v might assume at i. In this case we call a program point any region between two consecutive instructions, and we let [v] denote the abstract information that is associated with variable v. Figure 14.1 illustrates this analysis with an example. Because this approach keeps information at each program point, we call it dense, in contrast with the sparse analyses seen in Section 8.3.

Fig. 14.1 An example of a dense data-flow analysis that finds the range of possible values associated with each variable at each program point. (a) the program: l1: i = 0; l2: s = 0; l3: (i < 100)?; l4: i = i + 1; l5: s = s + i; l6: return s; (b) the intervals computed for i and s at the program points p0–p6.

The dense approach might keep redundant information during the data-flow analysis. This redundancy happens because some transfer functions are identities. For instance, an instruction that neither defines nor uses any variable is associated with an identity transfer function. In our example, the transfer function that updates the abstract state of i at program point p2 is an identity: [i]^2 is updated with the information that flows directly from the predecessor point p1. Because the instruction immediately before p2 does not add any new information to the abstract state of i, if we let [v]^i denote the abstract state of variable v at program point i, then we have that [i]^1 = [i]^2. Similarly, [s]^3 = [s]^4 and [i]^4 = [i]^5.

The goal of sparse data-flow analysis is to shortcut the identity transfer functions, a task that we accomplish by grouping contiguous program points bound to identities into larger regions. Solving a data-flow analysis sparsely has many advantages over doing it densely: because we do not need to keep bitvectors associated with each program point, the sparse solution tends to be more economical in terms of space and time. Going back to our example, a given variable v may be mapped to the same interval along many consecutive program points. Not every data-flow problem can be easily solved sparsely; however, many of them can. These problems are called Partitioned Data-flow Analyses (PDA): they can be decomposed into a set of sparse data-flow problems – usually one per variable – each independent of the other.

14.2.2 Partitioned Variable Lattice (PVL) Data-Flow Problems

There exists a class of data-flow problems that can be naturally solved via sparse data-flow analyses. We define the PDA family of data-flow problems as the sum of two notions: Partitioned Variable Problem (PVP) and Partitioned Variable Lattice (PVL). After defining these concepts formally, we discuss two examples. A typical example in the PVP category is the well-known liveness analysis; among the many examples of PVL – but not PVP – problems, we cite range analysis.

We consider, however, without loss of generality, a forward data-flow analysis. This data-flow analysis is an equation system that associates, with each program point i, an element x^i of a lattice L, given by the equation x^i = ⋀_{s ∈ pred(i)} F^s(x^s), where x^i denotes the abstract state associated with program point i, pred(i) is the set of control flow predecessors of i, and F^s is the transfer function associated with s. The analysis can be written¹ as a constraint system that binds, to each program point i and each s ∈ pred(i), the equation x^i = x^i ∧ F^s(x^s) or, equivalently, the inequation x^i ⊑ F^s(x^s).

¹ See for example Section 1; as far as we are concerned with finding its maximum solution.

Definition 1 (PVP/PVL). PARTITIONED DATA-FLOW PROBLEM: Let V = {v1, ..., vn} be the set of program variables. The corresponding Maximum Fixed Point (MFP) problem is said to be a Partitioned Variable Lattice problem iff:
[PVL]: the lattice L can be decomposed into the product L = L_{v1} × ... × L_{vn}, where each L_{vi} is the lattice associated with program variable vi.

A PVL problem is said to be a Partitioned Variable Problem iff:
[PVP]: each transfer function F^s can also be decomposed into a product F^s_{v1} × F^s_{v2} × ... × F^s_{vn}, where F^s_{vj} is a function from L_{vj} to L_{vj}.

Liveness analysis is a partitioned variable problem. At each program point we are interested in computing the set of variables alive at that point. The abstract states associated with all the variables at each program point can be written as the cross product L = B^n, where the information related to each variable belongs to a lattice B of Boolean values and n is the number of program variables. The liveness information can be computed for each individual variable independently: for a given variable v, its liveness information at program point i can be expressed in terms of its state at the successors s of i, because it is associated with the transfer function [v]^i = [v]^i ∧ F^s_v([v]^s), with F^s_v from B to B.

Consider the problem of range analysis again. If we denote by I the lattice of integer intervals, then the overall lattice can be written as L = I^n, where n is the number of variables (PVL property). Contrary to liveness analysis, the range interval of some variable v at program point i has to be expressed in terms of the intervals associated with some other variables (not only v) at the predecessors s of i, because they are associated with the transfer function [v]^i = [v]^i ∧ F^s_v([v1]^s, ..., [vn]^s), with F^s_v from L to I (and not from I to I). As opposed to liveness information, most data-flow analyses do not provide the PVP property. However, a large number of them do fulfill the PVL property.

14.2.3 The Static Single Information Property

If the information associated with a variable is invariant along its entire live range, then we can bind this information to the variable itself. In other words, we can replace all the constraint variables [v]^i by a single constraint variable [v], for each variable v and every i ∈ live(v). Going back to range analysis, there are two types of program points associated with non-identity transfer functions: definitions and conditionals. (1) At the definition point of variable v, F_v simplifies to a function that depends only on some [u], where each u is an argument of the instruction defining v; this information does not depend on any program variable other than v itself. (2) At the conditional tests that use a variable v, F_v can be simplified to a function that uses [v] and possibly other variables that appear in the test. The other program points can be ignored, since they are associated with identity transfer functions. This gives the intuition on why a propagation engine along the def-use chains of an SSA-form program can be used to solve the constant propagation problem in an equivalent, yet "sparser", manner.

Property 1 (SSI). STATIC SINGLE INFORMATION: Consider a forward (resp. backward) monotone PVL problem E_dense stated as a set of constraints [v]^i ⊑ F^s_v([v1]^s, ..., [vn]^s)

live-in) at inst. i. F. . [LINK]: each instruction inst for which Fvi n s t depends on some [u ]s should contain an use (resp. create new definitions of variables. [INFO]: each program point i ∈ live(v ) should be bound to an undefined transfer function. Each of these special instructions.i )∈v a r i a b l e s ×p ro g _poi n t s be a maximum solution to E d e n s e . φ -function. and for which Fvs (Yvs1 .3 we provide a sparser way to split live ranges that fit Property 1. together with φ -functions.2. this strategy would lead to the creation of many trivial program regions. The LINK property forces the def-use chains to reach the points where information is available for a transfer function to be evaluated. and a parallel copy is a copy that must be done in parallel with another instruction. and each s ∈ pred(i ) (resp. These properties allows us to attach the information to variables. 14. These notations are important elements of the propagation engine described in the section that follows. then we would automatically provide these properties. then s should contain a definition (resp. Rastello) for every variable v . The VERSION property provides an one-to-one mapping between variable names and live ranges. s )) should have a φ -function at the entry of i (resp. [ VERSION]: for each variable v . . Each node i that has several predecessors (resp. However. s ∈ succ(i )).4 Special instructions used to split live ranges We perform live range splitting via special instructions: the σ-functions and parallel copies that. i. If we split them between each pair of consecutive instructions.⊥ is non-trivial. σ-functions. instead of program points. last use) of v . We must split live ranges to provide the SSI properties.. successor) of a program point i where a variable v is live and such that Fvs = λx . is not the simple projection on v . . as the newly created variables would be live at only one program point. Possibly. In Section 14. . We accomplish this last task by inserting into the program pseudo-uses and pseudo-definitions of this variable. each program point i . σ-function at the exit of i ) for v as defined in Section 14. a σ-function (for a branch point) is the dual of a φ -function (for a join point). and parallel copies split live ranges at different kinds of program points: interior nodes. .166 14 Static Single Information Form — (F. branches and joins. In short.4. live(v ) is a connected component of the CFG.e.e. The SPLIT property forces the information related to a variable to be invariant along its entire live-range. outgoing edges (i . Yvsn ) has different values on its incoming edges (s . Fvi = λx . successors).2. Let (Yvi )(v . i ) (resp. INFO forces this information to be irrelevant outside the live range of the variable. and we would lose sparsity. Pereira.⊥. A program representation fulfills the Static Single Information property iff: [SPLIT ]: let s be the unique predecessor (resp. def ) of u live-out (resp. we may have to extend the live-range of a variable to cover every program point where the information is relevant.

Interior nodes are program points that have a unique predecessor and a unique successor. At these points we perform live range splitting via copies. If the program point already contains another instruction, then this copy must be done in parallel with the existing instruction. The notation

    inst ∥ v1 = v1', ..., vm = vm'

denotes m copies vi = vi' performed in parallel with instruction inst. This means that all the uses of inst plus all right-hand variables vi' are read simultaneously, then inst is computed, then all definitions of inst plus all left-hand variables vi are written simultaneously. The null pointer analysis example of Figure 14.4(d) uses parallel copies at nodes l4 and l6.

We call joins the program points that have one successor and multiple predecessors. For instance, two different definitions of the same variable v might be associated with two different constants, providing two different pieces of information about v. To avoid that these definitions reach the same use of v, we merge them at the earliest program point where they meet. We do it via our well-known φ-functions: a φ-function such as a = φ(a1: l1, ..., am: lm) performs a parallel assignment depending on the execution path taken, assigning to a the value in aj if control comes from block lj.

In backward analyses, the information that emerges from different uses of a variable may reach the same branch point, which is a program point with a unique predecessor and multiple successors. To ensure Property 1, the use that reaches the definition of a variable must be unique, in the same way that in an SSA-form program the definition that reaches a use is unique. We ensure this property via special instructions called σ-functions. The σ-functions are the dual of φ-functions. The assignment

    (v1: l1, ..., v1: lq) = σ(v1)
    ...
    (vm: l1, ..., vm: lq) = σ(vm)

represents m σ-functions that assign to each variable vi the value in vi if control flows into block lj. These assignments happen in parallel, i.e., the m σ-functions encapsulate m parallel copies. Also, notice that variables live in different branch targets are given different names by the σ-function that ends that basic block.

14.2.5 Propagating Information Forwardly and Backwardly

Let us consider a unidirectional forward (resp. backward) PVL problem E^ssi_dense stated as a set of equations [v]^i ⊑ F^s_v([v1]^s, ..., [vn]^s) (or equivalently [v]^i = [v]^i ∧ F^s_v([v1]^s, ..., [vn]^s)) for every variable v, each program point i, and each s ∈ pred(i) (resp. s ∈ succ(i)). To simplify the discussion, any φ-function (resp. σ-function) is seen as a set of copies, one per predecessor (resp. successor), which leads to as many constraints.
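To make the role of σ-functions tangible, here is a small C example of our own showing where an e-SSA/SSI construction would split a live range at a branch; the renaming in the comments is illustrative, not produced by a tool.

    int clamp(int v) {
        if (v < 10) {        /* (v_t, v_f) = sigma(v): v_t named on the true  */
                             /* edge, v_f on the false edge                   */
            return v;        /* uses v_t, whose range is (-inf, 9]            */
        }
        return 10;           /* on this path v_f has range [10, +inf),        */
    }                        /* and that information is bound to v_f alone    */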

F. worklist −= i foreach v ∈ i. . these algorithms can be made control flow sensitive. . the corresponding constraint [v ] G v where a . If necessary. We define a system of sparse equations E s p a r s e as follows: • For each instruction at a program point i that defines (resp. . uses) a variable v . . defined) at i . . .168 14 Static Single Information Form — (F. i. like Algorithm 12. [v n ]s ). we let a . . [z ]i .defs() [v ] = [v ]n e w usually simplifies into [a ]i [a j ]l . . . . . . Fvi depends only on some [a ]i . [v n ]l ). and each defi ([a ]. we show here how to build an equivalent sparse constrained system. z are used (resp. Consider that a program in SSI form gives us a constraint system that associates with each variable v the constraints [v ]i ssi Fvs ([v 1 ]s . . . Definition 2 (SSI constrained system). .defs()]) [v ]n e w = [v ] ∧ G v if [v ] = [v ]n e w : worklist += v. z be its set of used (resp. Pereira. • The sparse constrained system associates with each variable v . we can use a very similar propagation algorithm to communicate information backwardly. . defined) variables. seen in Figure 15 propagates information forwardly. a σ-function classical meet [a ]i l j ∈pred(i ) [a j ] (l 1 : a 1 . G v ([a ]. [v n ]). . Thus.uses(): i ([i. The propagation engine discussed in Chapter 8 sends information forwardly along the def-use chains naturally formed by the SSA-form program. [v ] appears (and has to) on the right hand side of the assignment for Algorithm 14 while it does not for Algorithm 15. . This last can be written equivalently into the l j used in Chapter 8. . A slightly modified version of this algorithm. Rastello) Algorithm 14: Backward propagation engine under SSI 1 2 3 4 5 function back_propagate(transfer_functions ) 1 6 7 8 9 10 11 worklist = foreach v ∈ vars: [v ] = foreach i ∈ insts: worklist += i while worklist = : let i ∈ worklist. .e. use) point i of v . [z ]) inition (resp. . . there exists a function i i defined as the restriction of F i on Gv a × ··· × z . If a given program fulfills the SSI property for a backward analysis. Because of the LINK property. . l m : a m ) = σ(a ) at program point i yields n constraints such as j j j [a j ]l j Fvl ([v 1 ]l . which usually simplifies into [a j ]l j [a ]i . . This comes from the asymmetry of our SSI form that ensures (for practical purpose only as we will explain soon) the Static Single Assignment property but not the Static Single j . . . . . Figure 14 shows a worklist algorithm that propagates information backwardly. . . Given a ssi program that fulfills the SSI property for E dense and the set of transfer functions Fvs . Still we should outline a quite important subtlety that appears in line 8 of algorithms 14 and 15. [z ]) = v i Fv ([v 1 ]. . Similarly. .

is that i (. 2 in the ideal world with monotone framework and lattice of finite height . ) with the old value of [v ] to ensure the monotonicity one needs to meet G v of the consecutive values taken by [v ]. If we have several uses of the same variable. worklist −= i foreach v ∈ i. Technically this is the reason why we manipulate a constraint system (system with inequations) and not an equation system as in Chapter 8. It is not necessary to enforce the SSU property in this instance of dead-code elimination. in addition to the SSA property.14. of our intermediate representation at the expenses of adding more φ -functions and σ-functions.uses() [v ] = [v ]n e w Use (SSU) property. many data-flow analyses can be classified as PVL problems. then the sparse backward constraint system will have several inequations – one per variable use – with the same left-hand-side.2 Static Single Information 169 Algorithm 15: Forward propagation engine under SSI 1 2 3 4 5 function forward_propagate(transfer_functions ) 1 6 7 8 9 10 11 worklist = foreach v ∈ vars: [v ] = foreach i ∈ insts: worklist += i while worklist = : let i ∈ worklist. The sparse constraint system does have several equations (one per variable use) for the same left-hand-side (one for each variable). .6 Examples of sparse data-flow analyses As we have mentioned before. 14. In this section we present some meaningful examples. It would be still possible to enforce the SSU property. replacing G v in Algorithm 14 by “i is a useful instruction or one of its definitions is marked as useful” leads to the standard SSA-based dead-code elimination algorithm. in terms of compilation time and memory consumption.uses()]) [v ]n e w = G v if [v ] = [v ]n e w : worklist += v. However. this guarantee is not necessary to every sparse analysis. The slight and important difference for a constraint system as opposed to an equation system.defs(): i ([i. .2. Both systems can be solved 2 using a scheme known as chaotic iteration such as the worklist algorithm we provide here. The dead-code elimination i problem illustrates well this point: for a program under SSA form. and doing so would lead to a less efficient solution.
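For instance, the dead-code elimination instance just mentioned can be phrased directly on top of such a worklist engine. The sketch below is ours, not the chapter's implementation: the instruction and def-use representations are invented for illustration, and it only shows how liveness facts propagate backward along use-def chains, which is the sparse analogue of Algorithm 14 when G_v is "the instruction is useful or one of its definitions is marked as useful".

#include <stdbool.h>
#include <stddef.h>

/* A hedged sketch of SSA-based dead-code elimination seen as a sparse
 * backward problem: the lattice per variable is {dead, live}, and a
 * definition becomes live as soon as its instruction has side effects or
 * one of its uses feeds a live instruction. Types are illustrative only. */
typedef struct inst {
    bool has_side_effects;   /* stores, calls, branches, returns, ...      */
    bool live;               /* data-flow fact attached to the instruction */
    int  n_uses;             /* number of SSA variables read by this inst  */
    struct inst **use_defs;  /* defining instruction of each variable read */
} inst;

void mark_live(inst **insts, size_t n) {
    enum { MAX_WORK = 1024 };        /* enough for the sketch               */
    inst *worklist[MAX_WORK];
    size_t top = 0;

    /* Seed the worklist with the "actual" side-effecting instructions.     */
    for (size_t i = 0; i < n && top < MAX_WORK; i++)
        if (insts[i]->has_side_effects) {
            insts[i]->live = true;
            worklist[top++] = insts[i];
        }

    /* Propagate backward along use-def chains, the only edges the sparse
     * constraint system contains.                                          */
    while (top > 0) {
        inst *cur = worklist[--top];
        for (int u = 0; u < cur->n_uses; u++) {
            inst *def = cur->use_defs[u];
            if (def != NULL && !def->live && top < MAX_WORK) {
                def->live = true;    /* the fact only grows, so this terminates */
                worklist[top++] = def;
            }
        }
    }
    /* Instructions left with live == false can then be deleted.            */
}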

Range analysis revisited

We start this section by revisiting this chapter's initial example of data-flow analysis: the range analysis of the program given in Figure 14.1. A range analysis acquires information from either the points where variables are defined, or from the points where variables are tested. In order to achieve the SSI property, we should therefore split the live ranges of variables at their definition points, or at the conditionals where they are used. Splitting at conditionals is done via σ-functions. The representation that we obtain by splitting live ranges at definitions and conditionals is called the Extended Static Single Assignment (e-SSA) form.

In order to ensure the SSI property in this example, the live range of variable i must be split at its definition, and at the conditional test. The live range of s, on the other hand, must be split only at its definition point, as it is not used in the conditional. Figure 14.2(a) shows the original example after live range splitting, and Figure 14.2(b) shows the result that we obtain by running the range analysis on this intermediate representation. This solution assigns to each variable a unique range interval. For instance, in Figure 14.2 we know that i must be bound to the interval [0, 0] immediately after program point l1. Similarly, due to the conditional test that happens before, we know that this variable is upper bounded by 100 at entry of program point l4.

Fig. 14.2  (a) The example from Figure 14.1 after live range splitting. (b) A solution to this instance of the range analysis problem.
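To make the client concrete, here is a minimal sketch of the interval information that such a range analysis could attach to each variable name of Figure 14.2; the encoding, names, and meet operator below are our own illustration, not the chapter's implementation.

#include <limits.h>

/* One interval per variable name: thanks to the SSI property a single range
 * is valid over the whole live range, so the fact can be stored in the
 * variable itself rather than at every program point.                      */
typedef struct { long lo, hi; } range;

/* Initial "unknown" fact before any definition has been visited.           */
static const range range_unknown = { LONG_MIN, LONG_MAX };

/* Meet used at phi-functions (join points): the union of the two intervals. */
static range range_meet(range a, range b) {
    range r;
    r.lo = a.lo < b.lo ? a.lo : b.lo;
    r.hi = a.hi > b.hi ? a.hi : b.hi;
    return r;
}

/* Transfer function of a sigma-function on the true edge of a test such as
 * "i < 100": the fresh name created on that edge is clamped below the bound. */
static range sigma_on_true_branch(range in, long bound) {
    if (in.hi > bound - 1)
        in.hi = bound - 1;
    return in;
}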

Null pointer analysis The objective of null pointer analysis is to determine which references may hold null values. otherwise an exception would have been thrown during the invocation v 1 . we split live ranges at these program points.4 illustrates this analysis. JavaScrip. and rely on σ-functions to merge them back. possibly null. we split live ranges after each variable is used. This instance of class inference is a Partitioned Variable Problem (PVP) 3 . definitely not-null. where [v j ] denotes the set of methods associated with variable v j .m (). Ruby or Lua. A class inference engine tries to assign a class to a variable v based on the ways that v is defined and used. .14. in Figure 14. Our objective is to infer the correct suite of methods for each object bound to variable v . The Python program in Figure 14. as we see in Figure 14.2 Static Single Information 171 Class Inference Some dynamically typed languages.4(c) we notice that the state of v 4 is the meet of the state of v 3 . Figure 14. as we show in Figure 14. For instance.3(e).4(b).3(a) illustrates this optimization. Because information is both produced at definition but also at use sites. Figure 14. On the other hand. 3 Actually class inference is no more a PVP as soon as we want to propagate the information through copies.m () cannot result in a null pointer dereference exception. It is possible to improve the execution of programs written in these languages if we can replace these simple tables by actual classes with virtual tables. Because type inference is a backward analysis that extracts information from use sites. and the state of v 1 . and we must conservatively assume that v 4 may be null. A fix-point to this constraint system is a solution to our data-flow problem.3(c) shows the results of a dense implementation of this analysis. The use-def chains that we derive from the program representation lead naturally to a constraint system.3(b) shows the control flow graph of the program. such as Python. This analysis allows compilers to remove redundant null-exception tests and helps developers to find null pointer dereferences. which we show in Figure 14. and Figure 14.3(f ). because the data-flow information associated with a variable v can be computed independently from the other variables. we know that v 2 cannot be null and hence the call v 2 . represent objects as tables containing methods and fields.
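The facts manipulated by this analysis form a very small lattice. The enum and meet below are our own encoding of the two states used in Figure 14.4, plus an "uninitialised" top element that is our addition; they are a sketch, not the chapter's code.

/* Forward null pointer lattice: facts are attached to each variable name,
 * copies such as "v2 = v1" inserted after a successful dereference refine
 * the fact to NP_NOT_NULL, and phi-functions meet the facts of their
 * arguments. NP_TOP stands for names not yet visited.                      */
typedef enum { NP_TOP, NP_NOT_NULL, NP_POSSIBLY_NULL } np_fact;

static np_fact np_meet(np_fact a, np_fact b) {
    if (a == NP_TOP) return b;
    if (b == NP_TOP) return a;
    return (a == b) ? a : NP_POSSIBLY_NULL;  /* any disagreement may be null */
}

/* Example from the text: [v4] = np_meet([v3], [v1]) with [v3] = NP_NOT_NULL
 * and [v1] = NP_POSSIBLY_NULL yields NP_POSSIBLY_NULL, so the call v4.m()
 * cannot be proven safe.                                                    */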

Fig. 14.3  Class inference analysis as an example of backward data-flow analysis that takes information from the uses of variables: (a) the Python program, (b) its control-flow graph, (c) the result of a dense analysis, (d) the program after live range splitting, and (e) the constraint system built from the use-def chains, e.g., [v6] = {m3}, [v5] = [v6], [v4] = [v6], [v2] = {m1} ∪ [v4], [v3] = {m2} ∪ [v5], [v7] = {}, [v1] = [v2] ∧ [v7].

Fig. 14.4  Null pointer analysis as an example of forward data-flow analysis that takes information from the definitions and uses of variables: (a) the original program, (b) the program after live range splitting, and (c) the inferred facts, e.g., [v1] = Possibly Null, [v2] = Not Null, [v3] = Not Null, [v4] = [v3] ∧ [v1].

14.3 Construction and Destruction of the Intermediate Program Representation

In the previous section we have seen how the static single information property gives the compiler the opportunity to solve a data-flow problem sparsely. However, we have not yet seen how to convert a program to a format that provides the SSI property.

The list in Figure 14. there exist two definitions of variable i . is a forward analysis that takes information from points where variables are defined. Firstly.2.3 Construction and Destruction of the Intermediate Program Representation 173 the SSI property. many different sources of information reaching a common program point.4. Secondly. ensuring that the live ranges of two different re-definitions of the same variable do not overlap.e.1. in Figure 14. l 4 }↓ . we have that v = {l 4 . Hence the live-range of i has to be split . 14. the program point immediately after l i ). it splits live ranges. The algorithm SSIfy in Figure 14. For instance.5 gives further examples of live range splitting strategies.3. we might have. for each variable. l 2 . we have the live range splitting strategies enumerated below. For instance. We describe each of these phases in the rest of this section. in Figure 14. l 3 . For instance. these points are not the only ones where splitting might be necessary. l 6 . Finally. the live-range splitting strategy is characterized by the set Uses↑ where Uses is the set of use points. Similarly. thus. and that s = {l 2 . As we have pointed out in Section 14. and conditional tests that use these variables. hence. We let I ↓ denote a set of points i with forward direction. The live-range of v must be split at least at every point in v .g. Out(l 3 ). via the three-steps algorithm from Section 14. it renames these newly created definitions.6 implements a live range splitting strategy in three steps.2.2. 14. l 1 and l 4 . The information that flows forward from l 1 and l 4 collides at l 3 . inserting new definitions of variables into the program code.6.. in Figure 14.2 Splitting live ranges In order to implement v we must split the live ranges of v at each program point listed by v . e.2. l 7 }↑ . • Range analysis. • Null pointer analysis takes information from definitions and uses and propagates this information forwardly. that reach the use of i at l 3 . l 4 }↓ where Out(l i ) denotes the exit of l i (i. in Figure 14. we have that v = {l 1 .3(d). we let I ↑ denote a set of points i with backward direction.1 Splitting strategy A live range splitting strategy v = I ↑ ∪ I ↓ over a variable v consists of a set of “oriented” program points.4. for the same original variable. we have that i = {l 1 .3.3. However. the loop entry. • Class inference is a backward analysis that takes information from the uses of variables. Going back to the examples from Section 14. it removes dead and non-initialized definitions from the program code. l 5 }↓ . Corresponding references are given in the last section of this chapter.14. For instance. This is a task that we address in this section.

Uses) denotes the set of instructions that define (resp. For control flow graphs that contain several exit nodes or loops with no exit. the post-dominance frontier is the dominance frontier in a CFG where direction of edges have been reversed.7 shows the algorithm that we use to create new definitions of variables. LastUses denotes the set of instructions where a variable is used. 1 function SSIfy(var v. First. and after which it is no longer live. to a new definition i 1 . Similarly. which is the dual of DF+ .174 14 Static Single Information Form — (F. In general. v ) rename(v ) 3 clean(v ) 4 2 v) Fig. Defs (resp. Out(Conds) denotes the exits of the corresponding basic blocks. in our example. as we . These new definitions are created at the iterated post-dominance frontier of points that originate information. in lines 3-9 we create new definitions to split the live ranges of variables due to backward collisions of information. This algorithm has three main phases. Class inference Hochstadt’s type inference Null-pointer analysis Defs↓ Splitting strategy Defs↓ LastUses↑ Defs↓ Defs↓ Out(Conds)↓ Uses↑ Defs↓ Out(Conds)↓ Uses↑ Uses↑ Uses↑ Defs↓ Out(Conds)↑ Uses↓ Fig. That is. the set of program points where information collides can be easily characterized by the notion of join sets and iterated dominance frontier (DF+ ) seen in Chapter 4. the notion of post dominance requires the existence of a unique exit node reachable by any CFG node. just as the notion of dominance requires the existence of a unique entry node that can reach every CFG node. reaching definitions cond. F. Figure 14. use) the variable. split sets created by the backward propagation of information can be over-approximated by the notion of iterated post-dominance frontier (pDF+ ). Splitting_Strategy split(v . then each of its predecessors will contain the live range of a different definition of v . we can ensure the single-exit property by creating a dummy common exit node and inserting some never-taken exit edges into the program. 14.6 Split the live ranges of v to convert it to SSI form immediately before l 3 – at In(l 3 ) –. Conds denotes the set of instructions that apply a conditional test on a variable. leading. constant propagation Partial Redundancy Elimination ABCD. Note that. range analysis Stephenson’s bitwidth analysis Mahlke’s bitwidth analysis An’s type inference. Rastello) Client Alias analysis. Pereira. 14.5 Live range splitting strategies for different data-flow analyses. If a program point is a join node. taint analysis.

3. and Out(l ) to denote a program point immediately after l . .i s _ j oi n : insert “v = φ (v . ). In lines 10-16 we perform the inverse operation: we create new definitions of variables due to the forward collision of information. as we see in line 11.3 Variable Renaming The algorithm in Figure 14. To rename a variable v we traverse the program’s dominance tree. which we denote by Out(. v )" at i elseif i . . in this case. this means that the variable is not defined at this point. by φ -functions (due to v or to the splitting in lines 10-16).i s _b r a n c h : foreach e ∈ ou t g oi n g _e d g e s (i ) S ↓ = S ↓ In(D F + (e )) else: S ↓ = S ↓ In(D F + (i )) S = v S↑ S↓ “Split live range of v by inserting φ . ensure in lines 6-7 of our algorithm.7 Live range splitting. Finally. . Our starting points S ↓ .i s _b r a n c h : insert “(v . or by parallel copies.3 Construction and Destruction of the Intermediate Program Representation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 175 function split(var v. . and copies" foreach i ∈ S : if i does not already contain any definition of v : if i . in lines 17-23 we actually insert the new definitions of v . The . The definition currently on the top of the stack is used to replace all the uses of v that we find during the traversal.i s _ j oi n : foreach e ∈ i n com i n g _e d g e s (i ): S ↑ = S ↑ Out(p D F + (e )) else: S ↑ = S ↑ Out(p D F + (i )) S↓ = foreach i ∈ S ↑ Defs(v ) I ↓ : if i . but in the region immediately after it. because we want to stay in SSA form in order to have access to a fast liveness check as described in Chapter 9. 14.8 builds def-use and use-def chains for a program after live range splitting.. stacking each new definition of v that we find. This algorithm is similar to the classic algorithm used to rename variables during the SSA construction that we saw in Chapter 3. include also the original definitions of v . σ.. We use In(l ) to denote a program point immediately before l .14.. Notice that these new definitions are not placed parallel to an instruction. v ) = σ(v )" at i else: insert a copy “v = v " at i Fig. Splitting_Strategy v = I↓ ∪ I↑) “compute the set of split points" S↑ = foreach i ∈ I ↑ : if i . 14. These new definitions might be created by σ functions (due to v or to the splitting in lines 3-9). from top to bottom... If the stack is empty..

set_def (instruction inst): let v i be a fresh version of v 27 replace the defs of v by v i in inst 28 set Def(v i ) = inst 29 stack. . We mean by “actual” instructions.set_use(u ) 10 if exists instruction d in n that defines v : 11 stack. then we treat it as a simple assignment. if we must rename a single definition inside a σ-function. like we do in lines 15-16 of the algorithm. .) = σ(v ) in Out(n ): 13 stack. . F.set_use and stack. the algorithm in Figure 14. By symmetry.set_use.peek() 23 replace the uses of v by v i in inst 24 if v i = ⊥: set Uses(v i ) = Uses(v i ) inst 25 26 function stack. .9 eliminates φ -functions and parallel copies that define variables not actually used in the code. . .) in In(m ): 19 stack. . For simplicity we consider this single use as a simple assignment when calling stack. it also eliminates σ-functions and parallel copies that use variables not actually defined in the code. 14. . . . 14. .set_def that build the chains of relations between variables. Notice that sometimes we must rename a single use inside a φ -function. Rastello) function rename(var v ) “Compute use-def & def-use chains" “We consider here that stack.3.176 1 2 3 14 Static Single Information Form — (F.set_def(v : l i = v ) 17 foreach m in s u c c e s sor s (n ): 18 if exists v = φ (. v : l q ) = σ(v ) in Out(n ): 15 foreach v : l i = v in (v : l 1 . and that d e f (⊥) = e n t r y " 4 st ack = 5 foreach CFG node n in dominance order: 6 if exists v = φ (v : l 1 . as one can see in line 20.8 Versioning renaming process replaces the uses of undefined variables by ⊥ (line 3). v : l q ) in In(n ): 7 stack.set_use(instruction inst ): 21 while d e f (stack. . .pop() 22 v i = stack. . v : l n . We have two methods.peek()) does not dominate inst: stack.set_def(v = φ (v : l 1 . as in lines 19-20 of the algorithm. . . v : l q )) 8 foreach instruction u in n that uses v : 9 stack.push(v i ) 30 Fig. v : l q ) = σ(v ): 16 stack. .) = σ(v )) 14 if exists (v : l 1 . those instructions that already existed in the program before we transformed it with split.set_use(v = v : l n ) 20 function stack. . stack.isempty(). .peek() = ⊥ if stack. . . In line 3 we let . Pereira. . . . Similarly.4 Dead and Undefined Code Elimination Just as Algorithm 7. .set_use((. .set_def(d ) 12 foreach instruction (.

uses = {⊥} remove inst Fig. We let inst. σ. In this case.t. in lines 11-14 performs a similar process. and inst. The corresponding version of v is then marked as used (line 14).14. in a compiler that already supports the SSA form.defs\defined = : foreach v i ∈ web ∩ inst. so as to restrict the cleaning process to variable v .defs = } while ∃inst ∈ active s. web ∩ inst. σ-functions.uses denote the set of variables used by inst. As an example. Each non live variable (see line 15). 14. The corresponding version of v is then marked as defined (line 8). Original instructions not inserted by split are called actual instruction. this time to add to the active set instructions that can reach actual uses through def-use chains.t.uses = } while ∃inst ∈ active s.3(d).uses\used = : foreach v i ∈ web ∩ inst. or copy functions are removed by lines 19-20. The next loop.uses\used: active = active ∪ Def(v i ) used = used ∪ {v i } let live = defined ∩ used foreach non actual inst ∈ Def(w e b ): foreach v i operand of inst s.defs denote the set of variable(s) defined by inst. and copies that can reach actual definitions through use-def chains. Finally every useless φ . as we see in lines 4-6 and lines 10-12.defs = {⊥} or inst.9 Dead and undefined code elimination. inst.e. is to represent them by φ -functions.t. Then.3 Construction and Destruction of the Intermediate Program Representation 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 177 clean(var v ) let web = {v i |v i is a version of v } let defined = let active = { inst |inst actual instruction and web ∩ inst. 14.10(a) shows how we would represent the σ-functions of Figure 14. either undefined or dead (non used) is replaced by ⊥ in all φ . σ. or copy functions where it appears in. This is done by lines 15-18. v i ∈ / live: replace v i by ⊥ if inst. The set “active” is initialized to actual instructions in line 4.5 Implementation Details Implementing σ-functions: The most straightforward way to implement σ-functions.3. the σ-functions can be implemented as single arity φ -functions. during the loop in lines 5-8 we add to active φ -functions. Figure 14. “web” be the set of versions of v . If l is a branch point with n successors that would contain a σ-function (v 1 : .defs\defined: active = active ∪ Uses(v i ) defined = defined ∪ {v i } let used = let active = { inst |inst actual instruction and web ∩ inst. i.

A critical edge links a basic block with several successors to a basic block with several predecessors. Thus. . . . ?].m2( )||v5 = v3 tmp = i + 1 v3 = OY( ) v1. the compiler must implement these instructions. Many compiler textbooks describes the theoretic basis of the notions of lattice. . . . 14. . Kildall [?] and Hecht [?]. v n : l n ) = σ(v ). then we rename v j to v . F. v3) v6. l 1 .m1( )||v4 = v2 v7 =ϕ (v1) v3 = OY( ) v3. . Rastello) (a) v1 = OX( ) (i%2)? (b) v1 = OX( ) (i%2)? v2 =ϕ (v1) tmp = i + 1 v2. . . . . . Allen [?. v j . Pereira. . A simple way to eliminate all the σ-functions and parallel copies is via copy-propagation. . . . . . data-flow analyses such as reaching definitions.4 Further Reading The monotone data-flow framework is an old ally of compiler writers. (b) implementing σ-functions via single arity φ functions.m3( ) Fig. . . . . For a comprehensive . . v5) v6.m2( ) v6 =ϕ (v4. . . .10(b) shows the result of copy folding applied on Figure 14. . Figure 14. . . . . . We have already seen in Chapter 3 how to replace φ -functions with actual assembly instructions. . . . . . As an example. before producing an executable program. . This case happens when the control flow edge l → l j is critical. . . now we must also replace σ-functions and parallel copies.10 (a) getting rid of copies and σ-functions. available expressions and liveness analysis have made their way into the implementation of virtually every important compiler. . . . In this case. . . . we insert at the beginning of l j an instruction v j = φ (v : l ). . Notice that it is possible that l j already contains a φ -function for v . we copy-propagate the variables that these special instructions define.10(a).m1( ) v6 =ϕ (v1. . . . . however.). monotone data-flow framework and fixed points. . . If l j already contains a φ -function v = φ (. . . then.178 14 Static Single Information Form — (F. . . 14. . SSI Destruction: Traditional instruction sets do not provide φ -functions nor σ-functions. Since the work of pionners like Prosser [?]. . .m3( ) v3. for each successor l j of l . .

There are data-flow analyses that do not meet the Partitioned Variable Lattices property. Noticeable examples include abstract interpretation problems on relational domains. make it possible to solve some data-flow problems sparsely. defining building algorithms and giving formal proofs that these algorithms are correct. Singer proposed new algorithms to convert programs to SSI form. to provide a fast algorithm to eliminate array bound checks in the context of a JIT compiler [36]. [255] have extended the literature on SSI representations. we refer the interested reader to Nielson et al. Choi et al. for example. Well-known among these representations is the Extended Static Single Assignment form. including algorithms and formal proofs. which they call weak and strong.’s Sparse Evaluation Graph (SEG) [69].14. have also separated the SSI program representation in two flavors. The original description of the intermediate program representation known as Static Single Information form was given by Ananian in his Master’s thesis [12]. such as Polyhedrons [81]. For instance. by Johnson et al. 175]). however.’s Quick Propagation Graphs [146]. These data-structures work best when applied on partitioned variable problems. He also discussed a number of data-flow analyses that are partitioned variable problems. We did not discuss bidirectional data-flow problems. There exist other intermediate program representations that. or Ramalingan’s Compact Evaluation Graphs [217]. In terms of data-structures. Zadeck proposed fast ways to build data-structures that allow one to solve these problems efficiently.’s work. Boissinot et al. The notion of Partitioned Variable Problem (PVP) was introduced by Zadeck. The nodes of this graph represent program regions where information produced by the data-flow analysis might change. Boissinot et al. is the Static Single Use form (SSU). The notation for σ-functions that we use in this chapter was borrowed from Ananian’s work. having seen use in a number of implementations of flow analyses. and also showed how this program representation could be used to handle truly bidirectional data-flow analyses. Working on top of Ananian’s and Singer’s work. like the SSI form. but the interested reader can find examples of such analyses in Khedker et al. Another important representation. [214. and best known method proposed to support sparse data-flow analyses is Choi et al. Tavares et al. The SSI program representation was subsequently revisited by Jeremy Singer in his PhD thesis [238]. 124. All these program representations are very effective. whereas SSI and other variations of SSU allow two consecutive uses of a variable on the same path.’s work [157].’s book [198] on static program analysis. which supports data-flow analyses that acquire information at use sites. the “strict” SSU form enforces that each definition reaches a single use.4 Further Reading 179 overview of these concepts. The presentation that we use in this chapter is mostly based on Tavares et al. [37] have proposed a new algorithm to convert a program to SSI form.’s ideas have been further expanded. As uses and definitions are not fully symmetric (the live-range can “traverse” a use while it cannot traverse a definition) there exists different variants of SSU (eg. Nowadays we have efficient algorithms that build such data-structures [211. in his PhD dissertation [284]. the first. they only fit specific data-flow problems. . Octagons [191] and Pentagons [176]. introduced by Bodik et al. 212. 144].

The data-flow analyses discussed in this chapter are well-known in the literature. and later Singer [238]. The intermediate representation based approach has two disadvantages. Solving any coupled data flow analysis problem along with a SEG was mentioned in [69] as an opened problem. First it has to be maintained and at some point destructed.5 was taken from Hochstadt et al. in order to compile Self programs more efficiently [65]. However. Rastello) As opposed to those approaches. The type inference analysis that we mentioned in Figure 14. For another example of bidirectional bitwidth analysis. because it increases the number of variables. . Class inference was used by Chambers et al. and which may not [196]. and backwards. In addition to being used to eliminate redundant array bound checks [36]. SSI is compatible with SSA extensions such as gated-SSA described in Chapter 18 which allows demand driven interpretation. it might add some overhead to analyses and transformations that do not require it. For instance. taking information from definitions. Stephenson et al. F. as illustrated by the conditional constant propagation problem described in Chapter 8. which we have already discussed in the context of the standard SSA form. Second.’s work [257].’s algorithm [180].180 14 Static Single Information Form — (F. and has advantages and disadvantages when compared to the data-structure approach. 122]. see Mahlke et al. coupled data-flow analysis can be solved naturally in IR based evaluation graphs. and range analysis [252. uses and conditional tests. have showed how to use the SSI representation to do partial redundancy elimination sparsely. Pereira. [246] described a bit-width analysis that is both forward. Last. IR based solutions to sparse dataflow analyses have many advantages over data-structure based approaches. Ananian [12]. an IR allows concrete or abstract interpretation. the solution promoted by this chapter consists in an intermediate representation (IR) based evaluation graph. On the other hand. Nanda and Sinha have used a variant of nullpointer analysis to find which method dereferences may throw exceptions. the eSSA form has been used to solve Taint Analysis [227].

CHAPTER 15

Hashed SSA form: HSSA

M. Mantione, F. Chow

15.1 Introduction

Hashed SSA (or in short HSSA) is an SSA extension that can effectively represent how aliasing relations affect a program in SSA form. It works equally well for aliasing among scalar variables and, more generally, for indirect load and store operations on arbitrary memory locations. This allows the application of all common SSA based optimizations to perform uniformly both on local variables and on external memory areas.

The following sections explain how HSSA works. Initially, given aliasing information, we will see how to represent its effects in SSA form for scalar variables. Subsequently we will represent indirect memory operations on external memory areas as operations on "virtual variables" in SSA form, which will be handled uniformly with scalar (local) variables. Then we will introduce a technique that reduces the overhead of the above representation, avoiding an explosion in the number of SSA versions for aliased variables. Finally we will apply global value numbering (GVN) to all of the above, obtaining the so called Hashed SSA form 1.

It should be noted, however, that HSSA is a technique useful for representing aliasing effects, but not for detecting aliasing. For this purpose, a separate alias analysis pass must be performed, and the effectiveness of HSSA will be influenced by the accuracy of this analysis.

1 The name Hashed SSA comes from the use of hashing in value-numbering.

. . uses can be "MustUse" or "MayUse" operands (respectively in the direct and indirect case). through the variable name. . or a "MayDef" operand in the indirect case. .2 SSA and aliasing: µ and χ -functions Aliasing occurs inside a compilation unit when a single one single storage location (that contains a value) can be potentially accessed through different program "variables". . . . . . . and we will represent MayUse through the use of µ-functions. . each definition can be a "MustDef" operand in the direct case. storage locations with global scope can obviously be accessed by different functions. . .182 \mufun \chifun 15 Hashed SSA form: HSSA — (M. . . when two or more storage locations partially overlap. F. This is a problem because every optimization pass is concerned with the actual value that is stored in every variable. . . This. . . and indirectly. assuming that we have already performed alias analysis. we must formally define the effects of indirect definitions and uses of variables. . . . the only safe option for the compiler is to handle them as "volatile" and avoid performing optimizations on them. which is not desirable. through the pointer that holds its address. . Mantione. In this case every function call can potentially access every global location. . Obviously the argument of the µ-operator is the potentially used variable. . the argument to the χ -operator is the assigned variable itself. when a local variable is referred by a pointer used in an indirect memory operation. In this case the variable can be accessed in two ways: directly. 15. . . in the presence of aliasing the compiler could try to track the values of variable addresses inside other variables (and this is exactly what HSSA does). Chow) . . where different parts of a program can access the same storage location under different names. . • Second. . • Finally. . This expresses the fact . Intuitively. . Similarly. The first thing that is needed is a way to model the effects of aliasing on a program in SSA form. happens with the C "union" construct. . . • Third. . . when the address of a local variable is passed to a function. because the access depends on the address that is effectively stored in the variable used in the indirect memory operation. Particularly. In all the other cases (indirect accesses through the address of the variable) the situation becomes more complex. . but the formalization of this process is not trivial. . unless the compiler uses global optimizations techniques where every function is analyzed before the actual compilation takes place. . If variables can be accessed in unpredictable program points.1 where ∗p represents an indirect access with address p . The real problem with aliasing is that these different accesses to the same program variable are difficult to predict. This can happen in one of the four following ways: • First. . which in turn can then access the variable indirectly. Less obviously. . . and when those values are used. . . . We will represent MayDef through the use of χ -functions. . for instance. . . Only in the first case (explicitly overlapping locations) the compiler has full knowledge of when each access takes place. To do this. . The semantic of µ and χ operators can be illustrated through the C like example of Figure 15. . .

so the original value could "flow through" it. .. i 3 = φ (i 1 . Parallel instructions are represented in Figure 15. . . 4 else f (). if j then 3 . . return. . . if j 1 then . allows detecting that the assignment to i 4 is not dead. .. . an assignment of any scalar variable can be safely considered dead if it is not marked live by a standard SSA based dead-code elimination. .. . . Thanks to the systematic insertion of µ-functions and χ -functions. The use of µ and χ -operators does not alter the complexity of transforming a program in SSA form. if j then 3 . . . . Ideally. (a) Initial C code (b) After µ and χ insertion (c) After φ insertion and versioning Fig. a µ and a χ should be placed in parallel to the instruction that led to its insertion. . i 4 = 4. 4 else f () µ(i ).4.3 Introducing zero versions to limit the explosion of the number of variables While it is true that µ and χ insertion does not alter the complexity of SSA construction. the potential side effect of any assignment to i . .2.. 5 ∗p = 3 i = χ (i ). Still. . This distinction allows us to model call effects correctly: the called function appears to potentially use the values of variables before the call. . 1 2 6 7 i = 4. . . . The assignment of value 2 to i would have been considered as dead in the absence of the function call to f (). . represented through the µ-function at the return. and function f might indirectly use i but not alter it . . else f () µ(i 1 ). i = 2. i 1 = 2. All that is necessary is a pre-pass that inserts them in the program.1 using the notation introduced in Section 14. .1(c).1 A program example where ∗p might alias i . . and χ -operators immediately after it. . . . µ-functions could be inserted immediately before the involved statement or expression.15.. . . applying it to a production compiler as described in the previous sec- .3 Introducing “zero versions” to limit the explosion of the number of variables 183 parallel instruction zero version that the χ -operator only potentially modifies the variable. 5 ∗p = 3. . . return µ(i ). . ∗p = 3 i 2 = χ (i 1 ). . i = 2.. 15. .. practical implementations may choose to insert µ and χ s before or after the instructions that involve aliasing. . . . . . i 2 ).. . . In our running example of Figure 15. 15. . . . 1 2 6 7 1 2 3 4 5 6 7 8 i = 4. . . . and the potentially modified values appear after the call.. return µ(i 4 ). . . Particularly. which potentially uses it (detected through its corresponding µ-function). .

setting it to true whenever a real occurrence is met in the code. Therefore. inducing use-def chains forward control flow graphthe insertion of new φ -functions which in turn create new variable versions. The idea is that variable versions that have no real occurrence do not influence the program output. This can be done while constructing the SSA form. In our example of Figure 15. the φ result has zero version if it has no real occurrence. Algorithm 16 performs zero-version detection if only use-def chains. Since they do not directly appear in the code. All in all. Mantione. iterate as many times as the longest chain of contiguous φ assignments in the program. A "HasRealOcc" flag is associated to each variable version. or it could be the one indirectly assigned by the χ . would be needlessly large. This bound can easily be reduced to the largest loop depth of the program by traversing the versions using a topological order of the forward control flow graph. is also associated to each original program variable. F. χ and φ -functions are not "real occurrences". while the corresponding reduction in the number of variable versions is definitely desirable. recursive definition is the following: • The result of a χ has zero version if it has no real occurrence. Chow) variable version tion would make working with code in SSA form terribly inefficient. which in turn is proportional to the number of definitions and therefore to the code size. the while loop may. in the worst case. χ definitions adds uncertainty to the analysis of variables values: the actual value of a variable after a χ definition could be its original value.1. An equivalent. are available. A list "NonZeroPhiList". Our notion of useless versions relies on the concept of "real occurrence of a variable". Once the program is converted back from SSA form these variables are removed from the code. and their value is usually unknown. On the other hand. zero version detection in the presence of µ and χ -functions does not change the complexity of SSA construction in a significant way. Intuitively. initially empty. and call it "zero version". In practice the resulting IR. distinguishing among them is almost pointless. The time spent in the first iteration grows linearly with the number of variable versions. we consider zero versions. so that no space is wasted to distinguish among them. versions of variables that have no real occurrence. and not def-use chains. • If the operand of a φ has zero version. For those reasons. .184 15 Hashed SSA form: HSSA — (M. which is an actual definition or use of a variable in the original program. variable occurrences in µ. the solution to this problem is to factor all variable versions that are considered useless together. The biggest issue is that the SSA versions introduced by χ -operators are useless for most optimizations that deal with variable values. We assign number 0 to this special variable version. i 2 have no real occurrence while i 1 and i 3 have. and especially the number of distinct variable versions. in SSA form. and whose value comes from at least one χ -function (optionally through φ -functions). This is bedata-flow cause χ -operators cause an explosion in the number of variable values.

which in turn can have a real occurrence as argument. v j .d e f .op e r a n d s .r e mov e (v i ). not being able to distinguish them usually only slightly affects the quality of optimizations that operate on values.r e mov e (v i ).v e r s ion = 0. v j .Ha s Re a l Oc c = t r u e . foreach v i ∈ v . There is only one case in dead-code elimination where the information loss is evident. else if ∃v j ∈ V .op e r a n d s .NonZ e roPhi Li s t . while c ha n g e s c ha n g e s = f a l s e . This loss of information has almost no consequences on the effectiveness of subsequent optimization passes.NonZ e roPhi Li s t . If the µ is eliminated by some optimization. v .v e r s ion = 0.a d d (v i ) c ha n g e s = t r u e . Zero versions have no real occurrence.v e r s ion = 0 then v i . c ha n g e s = t r u e .Ha s Re a l Oc c then v i . if ¬Ha s Re a l Oc c ∧ v i .v e r s ion = 0 then v i . else if ∃v j ∈ V . zero versions for which use-def chains have been broken must be assumed live.op e r a t or = χ then v i .Ha s Re a l Oc c ∧ v i .d e f . so there is no real statement that would be eliminated by the dead-code elimination pass if we could detect that a zero version occurrence is dead. On the other hand.Ha s Re a l Oc c then v i . the real occurrence used in the χ becomes dead but will not be detected as it. This case is sufficiently rare in real world code that using zero versions is anyway convenient.Ha s Re a l Oc c = t r u e .15. c ha n g e s = t r u e . .v e r s ion = 0. Since variables with zero versions have uncertain values.NonZ e roPhi Li s t do let V = v i . v . if ∀v j ∈ V .d e f . v j .op e r a t or = φ let V = v i .NonZ e roPhi Li s t . v j . else v .d e f .3 Introducing “zero versions” to limit the explosion of the number of variables 185 dead code elimination Algorithm 16: Zero-version detection based on SSA use-def chains 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 foreach variable v do foreach version v i of v do if ¬v i . when performing sparse dead-code elimination along use-def chains. since the χ is conservatively marked as non dead. However this has no practical effect. A zero version can be used in a µ and defined in a χ . if ∀v j ∈ V .

HSSA allows to apply SSA based optimizations to variables also when they are affected by aliasing. for simplicity. . . . What HSSA does is to represent the "target" locations of indirect memory operations with virtual variables. As an example. we assume that the code intermediate representation supports a dereference operator. in Figure 15. . . . it may alias with an operand. Mantione. .4 SSA and indirect memory operations: virtual variables The technique described in the previous sections only apply to ”regular“ variables in a compilation unit. We will then. The . and be able to apply all SSA based optimizations uniformly on both of them. . but it reveals nothing about the values stored in the locations "p->x" and "p->y". In other words. . . . . As we noted above indirect memory operations cannot be handled by the SSA construction algorithm because they are operations while SSA construction works renaming variables. which performs an indirect load (memory read) from the given address. but ∗p is not considered as an SSA variable. Chow) . . • *p = expression: indirect store. . Java or C#) typically contains many operations on data stored in global memory areas. It is worth noting that operations on array elements suffer from the same problem. and then to have a piece of code that computes the modulus of a vector of address "p": "m = (p->x * p->x) + (p->y * p->y). Putting that code snipped in SSA form tells us that the value of "p" never changes. . . . . . . . and not to arbitrary memory locations accessed indirectly. . in C we can imagine to define bidimensional vector as a struct: "typedef struct {double x. . . .} point.". A virtual variable is an abstraction of a memory area and appears under HSSA thanks to the insertion of µ and φ -functions: indeed as any other variable. . Some examples of this notation are: • *p: access memory at address p. . . . This situation is far from ideal. . For instance. because code written in current mainstream imperative languages (like C++. . 15.186 15 Hashed SSA form: HSSA — (M. . and we will represent it with the usual "*" C language operator. represent indirect stores with the usual combination of dereference and assignment operators. . . . . . To do this. . • *(p+1): access memory at address p+1 (like to access an object field at offset 1). . F. and φ -functions have been introduced to keep track of i ’s live-range. Looking at the code it is obvious that "x" is accessed twice but both accesses give the same value so the second access could be optimized away (the same for "y"). • **p: double indirection. . µ. in the general case. . . . This operator can be placed in expressions on both the left and right side of the assignment operator. . . The purpose of HSSA is to handle indirect memory operations just like accesses to scalar variables. The problem is that "x" and "y" are not “regular” variables: "p" is a variable while "p->x" and "p->y" are indirect memory operations.1. but memory access operations are still excluded from the SSA form. . .". . χ . . up to now. double y. . . like in C.

p = p + 1. The insertion of µ-function and χ -function for virtual variables is performed in the same pass than for scalar variables. ∗p = . . Note that the zero versioning technique can be applied unaltered to virtual variables. drives the choice of virtual variables2 . 2 I’m not sure what this means. · · · = ∗p .2(b). q = b. . y with 5. and choose two virtual variables to represent the two corresponding distinct memory locations. and 8 (c) x alias with op 3. the most factored HSSA form would have only one single virtual variable on the overall. considered as a given. ∗p = · · · v ∗ = χ (v ∗ ) w ∗ = χ (w ∗ ). 1 2 3 4 5 6 7 8 p = b. ∗p = · · · y ∗ = χ (y ∗ ). In Figure 15. there exists many possible ways to chose virtual variables. q = q + 1. and 8 Fig. p = p + 1. 6. At this point. q = q + 1. q = b. 1 2 3 4 5 6 7 8 p = b.2(c) we considered as given by alias analysis the non aliasing between b and b + 1. 5. virtual variables can obviously alias with one another as in Figure 15. Use-def relationships of virtual variables now represent use-def relationships of the memory locations accessed indirectly. In the example of Figure 15. q = q + 1. · · · = ∗q µ(v ∗ ) µ(w ∗ ). (a) Initial C code (b) v and w alias with ops 3. . two virtual variables v ∗ and w ∗ are respectively associated to the memory expressions ∗p and ∗q .15. q = b. · · · = ∗q . To summarise. and 8). 1 2 3 4 5 6 7 8 p = b. ∗p = .2(c) v is associated to the base pointer of b and w to b + 1 respectively.2 shows two possible forms after µ-function and χ -function insertion for the same code.7. The only discipline imposed by HSSA is that each indirect memory operand must be associated to a single virtual variable. Assignment factoring corresponds to make each virtual variable represents more than one single indirect memory operand. when the program is converted into SSA form. virtual variables can be renamed just like scalar variables. Rephrase? (dnovillo) .7. · · · = ∗p µ(y ∗ ). As one can see. In the general case. · · · = ∗p µ(v ∗ ) µ(w ∗ ). So on one extreme. Also.2(b). 15.4 SSA and indirect memory operations: virtual variables 187 example of Figure 15. · · · = ∗q µ(y ∗ ). we can complete the construction of HSSA form by applying global value numbering.2 Some virtual variables and their insertion depending on how they alias with operands. On the other extreme. . p = p + 1. there would be one virtual variable for each indirect memory operation. v ∗ and w ∗ alias with all indirect memory operands (of lines 3. alias analysis. . In Figure 15. ∗p = · · · x ∗ = χ (x ∗ ). ∗p = · · · v ∗ = χ (v ∗ ) w ∗ = χ (w ∗ ).5. .
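To see what a virtual variable buys in practice, consider again the vector-modulus code of the previous section. The annotation below is our own sketch of what an HSSA pass could produce when alias analysis maps both p->x and p->y to one virtual variable v*; the comments stand in for the µ, χ, and version information that the IR would actually carry.

/* A sketch only: the struct and the computation come from Section 15.4; the
 * v* annotations are illustrative and assume a single virtual variable for
 * the memory reachable through p.                                          */
typedef struct { double x; double y; } point;

double modulus_squared(point *p) {
    /* p1 = p                 (scalar pointer, renamed normally by SSA)      */
    double m = (p->x * p->x)      /* each load is ivar(addr of p1->x) mu(v1*) */
             + (p->y * p->y);     /* each load is ivar(addr of p1->y) mu(v1*) */
    /* No store redefines v* between the loads, so both p->x loads (and both
     * p->y loads) keep the same virtual variable version v1*; once global
     * value numbering runs, the second load of each field receives the same
     * value number as the first and can be reused instead of re-read.       */
    return m;
}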

. . If the program is in SSA form computing value numbers that satisfy the above property is straightforward: variables are defined only once. As the name suggests. for p 1 + 1. . . p 1 and 1 are replaced by their respective value numbers. . HSSA is complete only once Global Value Numbering is applied to scalar and virtual variables. every occurrence of the expression "p->x" will always have the same GVN and therefore will be guaranteed to return the same value. therefore two expressions that apply the same operator to the same SSA variables are guaranteed to give the same result and so can have the same value number. . F. . typically storing the result of a computation in a temporary variable and reusing it later instead of performing the computation again. . . . introduced µ and χ -operators to handle them. . While computing global value numbers for code that manipulates scalar variables is beneficial because it can be used to implement redundancy elimination. and defined virtual variables as a way to apply SSA also to memory locations accessed indirectly. . . it can be used to determine when two address expressions compute the same address: this is guaranteed if they have the same global value number. . . This puts them in the same rank as scalar variables and allows the transparent application of SSA based optimizations on indirect memory operations. applied zero versioning to keep the number of SSA versions acceptable. . . . .5 GVN and indirect memory operations: HSSA In the previous sections we sketched the foundations of a framework for dealing with aliasing and indirect memory operations in SSA form: we identified the effects of aliasing on local variables. . . handling all of them uniformly (GVN.188 Global Value Numbering 15 Hashed SSA form: HSSA — (M. . This value number is obtained using a hash function represented here as H(k e y ). . . . . In our example. Mantione. To identify identical expressions. . . . . . Chow) . Similarly. . . However. 15. Loads of lines 5 and 8 can not be considered as redundant because ∗ ∗ the versions for v (v 1 then v 2 ) are different. This allows indirect memory operands that have the same GVN for their address expressions and the same virtual variable version to become a single entity in the representation. allowing the compiler to emit code that stores it in a temporary register instead of performing the redundant memory reads (the same holds for "p->y"). . . . applying GVN to virtual variables has additional benefits. . . then the expression is put into canonical form.3. . On the other hand the load of line 8 can be safely avoided using a rematerialization of the value computed in line 7 as ∗ both the version for v (v 2 ) and the value number for the memory operands are . GVN is normally used as a way to perform redundancy elimination which means removing redundant expression computations from a program. . For instance. . . . . p 2 and q2 will have the same value number h 7 while p 1 will not. in the vector modulus computation described above. GVN works assigning a unique number to every expression in the program with the idea that expression identified by the same number are guaranteed to give the same result. as H(p 1 ) = H(b ) in our example. . . . and finally hashed. Which ends up to H(+(H(b ). . each expression tree is hashed bottom up: as an example. See Chapter 12). . . . consider the example of Figure 15. First of all. H(1))).

v 1 ) q2 4 ∗ v2 ∗ ) i v a r (∗h 7 . ∗p = 3. h 6 ) h7 ∗ i v a r (∗h 7 . v 2 ) that simplifies ∗ into i v a r (∗h 7 . ∗ · · · = ∗q2 µ(v 2 ). In other words HSSA transparently extends DEADCE to work also on indirect memory stores. and as a consequence will lead to a live µ-function . h 6 ) will be hashed to h 7 . identical. h2 = h0. q1 = b . 1 2 3 4 5 6 7 8 p1 = b . then it can be safely eliminated 3 . p 2 = p 1 + 1. v 2 ) will be hashed to h 14 . v 2 value b h0 h0 con s t (3) ∗ v1 h3 con s t (1) +(h 0 . This happens naturally because each node of the expression is identified by its value number.5 GVN and indirect memory operations: HSSA 189 1 2 3 4 5 6 7 8 p = b. and i v a r (∗q2 . As a last example. h 14 = h 12 h 13 = χ (h 4 ). q2 = q1 + 1. v 1 p1 + 1 ∗ i v a r (∗p 2 . q = q + 1. v 1 1 +(h 0 . ∗p = 4. and the fact that it is used as an address in another expression does not raise any additional complication. Having explained why HSSA uses GVN we are ready to explain how the HSSA intermediate representation is structured. v 1 ) h7 con s t (4) ∗ v2 h 12 (d) Hash Table Fig. · · · = h 14 µ(h 13 ). ∗ · · · = ∗p 2 µ(v 1 ). p = p + 1.15. q1 + 1 that simplifies into +(h 0 . ∗ ∗ ∗p 1 = 3 v 1 = χ (v 0 ).3 Some code after variables versioning. h 11 = h 7 . ∗ h 5 = h 3 h 4 = χ (H(v 0 )). if all the associated virtual variable versions for an indirect memory store (defined in parallel by χ -functions) are found to be dead. 1 2 3 4 5 6 7 8 h1 = h0. its corresponding HSSA form along with its hash table ∗ entries. v 2 hash h0 h1 h2 h3 h4 h5 h6 h7 h8 h 10 h 11 h 12 h 13 h 14 ∗ ) i v a r (∗p 1 . v 1 ) ∗ ) i v a r (∗p 2 . · · · = ∗p . q = b. h 6 ) p2 ∗ i v a r (∗h 7 . 15. A program in HSSA keeps its CFG 3 Note that any virtual variable that aliases with a memory region live-out of the compiled procedure is considered to alias with the return instruction of the procedure. (c) HSSA statements (a) Initial C code (b) with one virtual variable key b p1 q1 3 ∗ v1 ∗ ) i v a r (∗h 0 . · · · = h 10 µ(h 4 ). ∗ ∗ ∗p 2 = 4 v 2 = χ (v 1 ). h8 = h7. Another advantage of GVN in this context is that it enables uniform treatment of indirect memory operations regardless of the levels of indirections (like in the "**p" expression which represents a double indirection). · · · = ∗q .

. . . . 7. . F. . It is in this way that HSSA extends all SSA based optimizations to indirect memory operations. In this sense virtual variables are just an aid to apply aliasing effects to indirect memory operations. . As already explained. . . even if they were originally designed to be applied to "plain" program variables. . . zero versioning and virtual variable . . . Such entries are referred as ivar node. . in terms of code generation. Second. . . Each constant. . will emit indirect memory accesses to the address passed as their operand. even thought virtual variables do not contribute to the emission of machine code. its entry in the hash table is made up of both the address expression and its corresponding virtual variables version. While each ivar node is related to a virtual variable (the one corresponding to its address expression). Chow) structure. . . However. . . . the virtual variable has no real counterpart in the program code. address and variable (both scalar and virtual) finds its entry in the hash table. . . . are operations and not variables: the compiler back end. . be aware of liveness and perform SSA renaming (with any of the regular SSA renaming algorithms). . . This is illustrated in lines 3. . . . but the left and right hand-side of each statement are replaced by their corresponding entry in the hash table.3. because it is a simple composition of µ and χ insertion. and 8 of Example 15.6 Building HSSA We now present the HSSA construction algorithm. . . . First of all. with basic-blocks made of a sequence of statements (assignments and procedure calls). their use-def links can be examined by optimization passes to determine the program data dependency paths (just like use-def links of scalar variables). . . when processing them.190 15 Hashed SSA form: HSSA — (M. . . . . . . . It is straightforward. and its GVN is determined taking into account the SSA version of its virtual variable. . This allows two entries with identical value numbers to “carry” the same value. . Each operand that corresponds to an indirect memory operation gets a special treatment on two aspects. . Mantione. . . 15. Therefore in HSSA value numbers have the same role of SSA versions in "plain" SSA (while in HSSA variable versions can effectively be discarded: values are identified by GVNs and data dependencies by use-def links between IR nodes. It should be noted that ivar nodes. for indirect variables so as to recall its operational semantic and distinguish them from the original program scalar variables. . . expressions are also hashed using the operator and the global value number of its operands. . . it is considered semantically both as an expression (that has to be computed) and as a memory location (just as other variables).

15.6 Building HSSA

The construction of HSSA is straightforward: it is a simple composition of µ and χ insertion, zero versioning and virtual variable introduction (described in the previous sections), together with regular SSA renaming and GVN application. (Some basics about GVN are recalled in Chapter 12.)

Algorithm 17: SSA form construction
1. Perform alias analysis and assign a virtual variable to each indirect memory operand
2. Insert µ-functions and χ-functions for scalar and virtual variables
3. Insert φ-functions (considering both regular and χ assignments), as for standard SSA construction
4. Perform SSA renaming on all scalar and virtual variables, as for standard SSA construction (any of the regular SSA renaming algorithms can be used)

At the end of Algorithm 17 we have code in plain SSA form: indirect memory operations are "annotated" with virtual variables, and virtual variables also have SSA version numbers. The use of µ and χ operands guarantees that SSA versions are correct also in case of aliasing.

The next steps correspond to Algorithm 18, where steps 5 and 6 can be done using a single traversal of the program that is aware of liveness.

Algorithm 18: Detecting zero versions
5. Perform DEADCE (also on χ and φ stores)
6. Initialize HasRealOcc and NonZeroPhiList as for Algorithm 16, then run Algorithm 16 (zero-version detection)

At the end of this phase the code has exactly the same structure as before, but the number of unique SSA versions has diminished because of the application of zero versions.

Algorithm 19: Applying GVN
7. Perform a pre-order traversal of the dominator tree, applying GVN to the whole code:
   a. generate a unique hash table var node for each scalar and virtual variable version that is still live;
   b. expressions are processed bottom up, reusing existing hash table expression nodes and using var nodes of the appropriate SSA variable version (the current one in the dominator tree traversal).

Moreover, two ivar nodes have the same value number if these conditions are both true:
• their address expressions have the same value number, and
• their virtual variables have the same versions, or are separated by definitions that do not alias the ivar (possible to verify because of the dominator tree traversal order).

The application of global value numbering is itself straightforward.
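A possible way to read the construction as a whole is as a short driver chaining the steps of Algorithms 17, 18 and 19; the sketch below only illustrates this ordering, each helper being a stub standing for the corresponding pass and not an implementation from the chapter.

# Hedged sketch of how the construction steps listed above could be chained.
def build_hssa(proc):
    # Algorithm 17: SSA form construction
    assign_virtual_variables(proc)   # step 1: alias analysis + virtual vars
    insert_mu_and_chi(proc)          # step 2
    insert_phis(proc)                # step 3: phis for regular and chi defs
    rename_all_variables(proc)       # step 4: scalar + virtual renaming

    # Algorithm 18: detecting zero versions (one traversal for steps 5 and 6)
    dead_code_elimination(proc)      # step 5: DEADCE, also on chi/phi stores
    detect_zero_versions(proc)       # step 6: HasRealOcc / NonZeroPhiList

    # Algorithm 19: applying GVN in a pre-order dominator-tree traversal
    apply_gvn_preorder(proc)         # step 7: hash vars, exprs and ivars

def assign_virtual_variables(proc): print("alias analysis on", proc)
def insert_mu_and_chi(proc):        print("insert mu/chi in", proc)
def insert_phis(proc):              print("insert phis in", proc)
def rename_all_variables(proc):     print("rename variables in", proc)
def dead_code_elimination(proc):    print("DEADCE on", proc)
def detect_zero_versions(proc):     print("zero-version detection on", proc)
def apply_gvn_preorder(proc):       print("GVN over dominator tree of", proc)

if __name__ == "__main__":
    build_hssa("example_procedure")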

The pre-order traversal of the dominator tree matters because, when generating value numbers, we must be sure that all the involved definitions already have a value number. Since the SSA form is a freshly created one, it is strict (i.e. all definitions dominate their use); as a consequence, a dominator tree pre-order traversal satisfies this requirement.

Algorithm 20: Linking definitions
8. The left-hand side of each assignment (direct and indirect, real, φ and χ) is updated to point to its var or ivar node (which will point back to the defining statement).
9. Also, all φ, µ and χ operands are updated to point to the corresponding GVN table entry.

At the end of the last steps, listed in Algorithm 20, HSSA form is complete: each node in the code representation has a proper value number, nodes with the same number are guaranteed to produce the same value (or hold it, in the case of variables), and every value in the program code is represented by a reference to a node in the HSSA value table. Note that after this step virtual variables are not needed anymore, and can be discarded from the code representation: the information they convey about aliasing of indirect variables has already been used to generate correct value numbers for ivar nodes.

15.7 Using HSSA

As seen in the previous sections, HSSA is an internal code representation that applies SSA to indirect memory operations and builds a global value number table, valid also for values computed by indirect memory operations. This representation, once built, is particularly memory efficient because expressions are shared among use points (they are represented as HSSA table nodes). In fact the original program code can be discarded, keeping only a list of statements pointing to the shared expression nodes.

Indirect memory operation (ivar) nodes are both variables and expressions, and benefit from the optimizations applied to both kinds of nodes. All optimizations conceived to be applied on scalar variables work "out of the box" on indirect locations: of course the implementation must be adapted to use the HSSA table, but their algorithm (and computational complexity) is the same even when dealing with values accessed indirectly. Particularly, it is relatively easy to implement register promotion.

Of course the effectiveness of optimizations applied to indirect memory operations depends on the quality of alias analysis: if the analysis is poor the compiler will be forced to "play safe", and in practice the values will have "zero version" most of the time, so few or no optimizations will be applied to them. On the other hand, for all indirect memory operations where the alias analyzer determines that there are no interferences caused by aliasing, all optimizations can happen naturally, like in the scalar case.
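As an illustration of an optimization working "out of the box" on ivar nodes, the sketch below removes redundant indirect loads within a straight-line sequence using only value numbers. The statement encoding is invented; the key assumption, which matches the HSSA description above, is that the value number of an ivar already accounts for the version of its virtual variable, so a store that may alias produces a different value number and stale entries are never consulted.

# Hedged sketch: redundant load elimination driven purely by value numbers.
def eliminate_redundant_loads(statements):
    # statements: list of ("load", dest, ivar_vn) or ("store", ivar_vn, src)
    available = {}   # value number of an ivar node -> temporary holding it
    out = []
    for stmt in statements:
        kind = stmt[0]
        if kind == "load":
            _, dest, vn = stmt
            if vn in available:
                out.append(("copy", dest, available[vn]))  # reuse prior value
            else:
                out.append(stmt)
                available[vn] = dest
        elif kind == "store":
            _, vn, src = stmt
            out.append(stmt)
            available[vn] = src   # the stored value is now available
        else:
            out.append(stmt)
    return out

if __name__ == "__main__":
    prog = [("store", "h14", "t0"),   # *p = t0
            ("load", "t1", "h14"),    # t1 = *p  -> becomes a copy of t0
            ("load", "t2", "h14")]    # t2 = *p  -> becomes a copy of t0
    for s in eliminate_redundant_loads(prog):
        print(s)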

15.8 Further readings

Cite the original paper from Zhou et al. [286]. Cite the original paper from Chow. Cite the paper mentioned in Fred's paper about factoring. Cite the Alpern, Wegman, Zadeck paper for GVN. Cite work done in the GCC compiler (which was later scrapped due to compile time and memory consumption problems, but the experience is valuable): "Memory SSA - A Unified Approach for Sparsely Representing Memory Operations", D. Novillo, 2007 GCC Developers' Summit, Ottawa, Canada, July 2007; http://www.airs.com/dnovillo/Papers/mem-ssa.pdf. Talk about the possible differences (in terms of notations) that might exist between this chapter and the paper. Give some pointers to computation of alias analysis, but also representation of alias information (e.g. points-to). Discuss the differences with Array SSA. Add a reference on the paper about register promotion and mention the authors call it indirect removal.


CHAPTER 16
Array SSA Form
Vivek Sarkar, Kathleen Knobe, Stephen Fink

Progress: 80%. Minor polishing in progress.

16.1 Introduction

In this chapter, we introduce an Array SSA form that captures element-level data flow information for array variables, and coincides with standard SSA form when applied to scalar variables. Any program with arbitrary control flow structures and arbitrary array subscript expressions can be automatically converted to this Array SSA form. A key extension over standard SSA form is the introduction of a definition-Φ function that is capable of merging values from distinct array definitions on an element-by-element basis.

There are several potential applications of Array SSA form in compiler analysis and optimization of sequential and parallel programs. In this chapter, we focus on sequential programs and use constant propagation as an exemplar of a program analysis that can be extended to array variables using Array SSA form, and redundant load elimination as an exemplar of a program optimization that can be extended to heap objects using Array SSA form, thereby making it applicable to structures, heap objects and any other data structure that can be modeled as a logical array.

The rest of the chapter is organized as follows. Section 16.2 introduces full Array SSA form for run-time evaluation and partial Array SSA form for static analysis. Section 16.3 extends the scalar SSA constant propagation algorithm to enable constant propagation through array elements. This includes an extension

. . Stephen Fink) to the constant propagation lattice to efficiently record information about array elements and an extension to the work-list algorithm to support definition-Φ functions (section 16. @A j [k ] := i where i is the iteration vector for the loops surrounding the definition of A j .2).4 shows how Array SSA form can be extended to support elimination of redundant loads of object fields and array elements in strongly typed languages. Kathleen Knobe. . . 2. These loops could be for-loops. <one for each enclosing loop. . ( ). . and to ensure that each use refers to precisely one definition. . Renamed array variables: All array variables are renamed so as to satisfy the static single assignment property. thereby ensuring that n ≥ 1. . . at the start of program execution. (This assumption is not necessary for partial Array SSA form. . we use the concept of an iteration vector to differentiate among multiple dynamic instances of a static definition. is followed by the statement. Array-valued @ variables: For each static definition A j . . . 3. . 1. f (). and section 16. . . A single point in the iteration space is specified by the iteration vector i = (i 1 . . .. 16. . . . .3. . . . we treat the outermost region of acyclic control flow in a procedure as a dummy outermost loop with a single iteration. or even loops constructed out of goto statements. Each update of a single array element. . this definition of iteration vectors assumes that all loops are single-entry. which is an n -tuple of iteration numbers. . we know that each def executes at most once in a given iteration of its surrounding loops. Analogous to standard SSA form. . .1). i n ). .2 Array SSA Form To introduce full Array SSA form with runtime evaluation of Φ functions. . . . . S k . . control Φ operators are introduced to generate new names for merging two or more prior definitions at control flow join points. Section 16. that occur in the same dynamic instance of S k ’s enclosing procedure. we introduce an @ variable (pronounced “at variable”) @A j that identifies the most recent iteration vector at which definition A j was executed. two renamed arrays that originated from the same array variable in the source program such that A 1 [k ] := . . . is an update of a single array element and A 0 . or equivalently. . . For convenience. . A j [k ] := .) For single-entry loops. Definition Φ’s: A definition-Φ operator is introduced in Array SSA form to deal with preserving (“non-killing”) definitions of arrays. Let n be the number of loops that enclose S k in procedure f (). . .3. . The key extensions in Array SSA form relative to standard SSA form are as follows. and a further extension to support non-constant (symbolic) array subscripts (section 16. .5 contains suggestions for further reading. .196 16 Array SSA Form — (Vivek Sarkar. . We assume that all @ variables are initialized to the empty vector. . . . . . that the control flow graph is reducible. hence the iteration vector serves the purpose of a “timestamp”. . . . . For convenience. . Consider A 0 and A 1 . . . . while-loops. . .

A 0 . A 2 represents an element-level merge of arrays A 1 and A 0 . Figures 16. that serves as an approximation of full Array SSA form. A 2 := Φ(A 1 . Consider a (control or definition) Φ statement. Definition Φ’s did not need to be inserted in standard SSA form because a scalar definition completely kills the old value of the variable. A 2 [ j ]. and the conversion of the program to full Array SSA form as defined above. The inputs and output of Φ will be lattice elements for scalar/array variables that are compile-time approximations of their run-time values. 4.1 and 16. 16. A static analysis will need to approximate the computation of this Φ operator by some data flow transfer function. A 0 . @A 1 . A 2 := Φ(A 1 . We use the notation (V ) to denote the lattice element for a scalar or array variable V .1) The key extension over the scalar case is that the conditional expression specifies an element-level merge of arrays A 1 and A 0 . will in general be modeled by the data flow equation. Consider a (control or definition) Φ operator of the form. @A 0 ). many compile-time analyses do not need . A [∗] := initial value of A i := 1 C := i < 2 if C then k := 2 ∗ i A [k ] := i print A [k ] endif print A [2] Fig. A 0 . in the result array A 2 : if @A 1 [ j ] A 2 [ j ] = else A 0 [ j ] end if @A 0 [ j ] then A 1 [ j ] (16. A 2 := d Φ(A 1 .1). Array-valued Φ operators: Another consequence of renaming arrays is that a Φ operator for array variables must also return an array value. @A 1 . (@A 1 ). A definition Φ. A 0 .1 Example program with array variables We now introduce a partial Array SSA form for static analysis. @A 1 . Its semantics can be specified precisely by the following conditional expression for each element. Therefore. (@A 0 )).2 Array SSA Form 197 is the prevailing definition at the program point just prior to the definition of A 1 . is inserted immediately after the definitions for A 1 and @A 1 . Φ . A 2 := Φ(A 1 . (A 0 ). Since definition A 1 only updates one element of A 0 .2 show an example program with an array variable. @A 0 ). the statement. While the runtime semantics of Φ functions for array variables critically depends on @ variables (Equation 16. @A 0 ). @A 1 .16. (A 2 ) = Φ ( (A 1 ). @A 0 ).

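The run-time behaviour of an array Φ, as specified by Equation 16.1, can be pictured with the small sketch below: each @ array records the iteration vector at which the corresponding element was last written, and the merge keeps, element by element, the value with the most recent timestamp. The encoding (plain Python lists and tuples) and the tie-breaking in favour of the first argument are assumptions made only for this illustration.

# Hedged sketch of the element-level merge performed by an array Phi.
def later_or_equal(t1, t0):
    # Iteration vectors are compared lexicographically; the empty vector ()
    # (the initial value of every @ variable) is the earliest timestamp.
    return tuple(t1) >= tuple(t0)

def array_phi(a1, at_a1, a0, at_a0):
    """Element-level merge of arrays a1 and a0 guided by their @ arrays."""
    result, at_result = [], []
    for x1, t1, x0, t0 in zip(a1, at_a1, a0, at_a0):
        if later_or_equal(t1, t0):
            result.append(x1); at_result.append(t1)
        else:
            result.append(x0); at_result.append(t0)
    return result, at_result

if __name__ == "__main__":
    a0    = [0, 0, 0]            # A0[*] := initial value of A
    at_a0 = [(1,), (1,), (1,)]   # @A0[*] initialised at iteration (1)
    a1    = [0, 5, 0]            # A1[1] := 5 executed at iteration (1)
    at_a1 = [(), (1,), ()]       # only element 1 carries a write timestamp
    a2, at_a2 = array_phi(a1, at_a1, a0, at_a0)
    print(a2)                    # [0, 5, 0]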
@A 0 ) print A 2 [k ] A 3 := Φ(A 2 . . . . a partial Array SSA form can be obtained by dropping dropping @ variables. For analyses that do not distinguish among iteration instances. . . A 0 . . For such cases. A consequence of dropping @ variables is that partial Array SSA form does not need to deal with iteration vectors. . . For scalar variables.3 Sparse Constant Propagation of Array Elements 16. @A 0 ) print A 3 [2] endif fig: Fig. . . . . A 0 . . A 2 := φ (A 1 . @A 2 . @A 0 ) by a data flow equation. @A 0 ) @A 3 := max(@A 2 . (A 0 )). . A 0 . . . . .1 Array Lattice for Sparse Constant Propagation In this section. . . (A 2 ) = φ ( (A 1 ). . . @A 0 ) @A 2 := max(@A 1 . . .2 Conversion of program in figure 16. A 0 ) instead of A 2 := Φ(A 1 . . that does not use lattice variables (@A 1 ) and (@A 0 ). . Let A i n d and e l e m be the universal set of index values and the universal set of array element values respectively for an array variable A . . .3. it is sufficient to model A 2 := Φ(A 1 . . . . @A 1 .1 to Full Array SSA Form the full generality of @ variables. and using the φ operator. Stephen Fink) @A 0 [∗] := ( ) . . @A 1 [∗] := ( ) A 0 [∗] := initial value of A @A 0 [∗] := (1) i := 1 C := i < n if C then k := 2 ∗ i A 1 [k ] := i @A 1 [k ] := (1) A 2 := d Φ(A 1 . . . . . the set denoted by lattice element (A ) is a subset of A index-element pairs in A i nd × e l e m . Kathleen Knobe. . There are three kinds of lattice elements for array variables that are of interest in our framework: . . and therefore does not require the control flow graph to be reducible as in full Array SSA form. . . . @A 1 . @A 1 . . . . 16. For an array variable. . . . 16.198 16 Array SSA Form — (Vivek Sarkar. . . the resulting φ operator obtained by dropping @ variables exactly coincides with standard SSA form. . . . A 0 . @A 0 ). we describe the lattice representation used to model array values A for constant propagation.

. . (A 1 [k ]) (A 1 ) = (A 1 ) = 〈(i 1 . Note that (A ) = ⊥ is equivalent to an empty list. (i 2 . . . (A ) = ⊥ ⇒ SET( (A )) = A i nd × el em This “bottom” case indicates that.〉. e 1 ). e 1 ).3 Lattice computation for operator (k ) = (k ) = Con s t a n t e j . .3 shows how (A 1 [k ]). 1. such that A [i 1 ] = e 1 . 16. We denote this function by [ ] . (i 2 . . . (i ))〉 ⊥ (i ) = ⊥ ⊥ ⊥ ⊥ ⊥ (k ).. e 2 ). e 2 ). otherwise ⊥ (k ) = ⊥ ⊥ ⊥ ⊥ ⊥ (A 1 [k ]) = [ ]( (A 1 ). . i 2 . . if ∃ (i j . (Extensions to handle non-constant indices are described in section 16.16.4 Lattice computation for operator (A 1 ) = d [ ]( (i ) = (i ) = Con s t a n t 〈( (k ). (k )). i 1 .3 Sparse Constant Propagation of Array Elements 199 (A ) = ⇒ SET( (A )) = { } This “top” case indicates that the set of possible index-element pairs that have been identified thus far for A is the empty set. (k )) = true ⊥. . where A 1 [k ] is an array element read (A 1 ) (k ) = (k ) = Cons t a n t (k ) = ⊥ Fig. . etc. The meaning of this lattice element is that the current stage of analysis has identified some finite number of constant index-element pairs for array variable A . a read access to an array element.}) × el em The lattice element for this “constant” case is represented by a finite list of constant index-element pairs. . 2. in case (2) above.) A 3.. they both denote the universal set of indexelement pairs. The constant indices. e 1 ). e 1 ). . . array A may take on any value from the universal set of index-element pairs. { }. (A ) = 〈 〉. e j ) ∈ (A 1 ) with (i j . . according to the approximation in the current stage of analysis. e 2 ). must represent distinct (non-equal) index values. . the lattice element for array reference A 1 [k ].2. is computed as a function of (A 1 ) and (k ). where A 1 [k ] := i is an array element write We now describe how array lattice elements are computed for various operations that appear in Array SSA form. i 2 . (i )). 16.〉 (A 1 ) = ⊥ Fig.3.} ∪ ( A i n d − {i 1 . . . 〈(i 1 .〉 A ⇒ SET( (A )) = {(i 1 . the lattice elements for A 1 and k . (i 2 . We start with the simplest operation viz. All other elements of A are assumed to be nonconstant. Figure 16. A [i 2 ] = e 2 . (A ) = 〈(i 1 .

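A compact way to picture the lattice and the read operator of Figure 16.3 is sketched below, with TOP and BOTTOM handled in the usual sparse-constant-propagation way; the representation of a lattice element as a Python list of (index, value) pairs is an illustrative choice, not the chapter's.

# Hedged sketch of the lattice value of an array element read A1[k].
TOP, BOTTOM = "TOP", "BOTTOM"

def read_element(lattice_a, lattice_k):
    """Lattice value of A1[k] given the lattice values of A1 and k."""
    if lattice_a == TOP or lattice_k == TOP:
        return TOP
    if lattice_a == BOTTOM or lattice_k == BOTTOM:
        return BOTTOM
    # lattice_a is a finite list of (constant index, constant value) pairs and
    # lattice_k is a constant index: the read is constant only if the index
    # is definitely the same as one recorded in the list.
    for index, value in lattice_a:
        if index == lattice_k:      # DS(index, k) for constant indices
            return value
    return BOTTOM                   # any other element is assumed unknown

if __name__ == "__main__":
    l_a = [(2, 1)]                  # we know A[2] = 1
    print(read_element(l_a, 2))     # 1
    print(read_element(l_a, 3))     # BOTTOM
    print(read_element(l_a, BOTTOM))  # BOTTOM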
. Consider a definition φ operation of the form. A 2 := φ (A 1 . and the size of list I exceeds a threshold size Z . i j ) = true }. 〈(i 1 . The join operator ( ) is used to compute (A 2 ). Since A 1 corresponds to a definition dφ( of a single array element.〉). A 0 ). e 1 ). The notation UPDATE((i . The lattice computation for (A 2 ) = (A 1 ). e 1 ) .. 2. . (A 1 [k ]) = [ ] ( (A 1 ). We denote this function by d [ ] i.〉 with respect to the constant index-element pair (i . the lattice element for A 2 . the three cases considered for (A 1 ) in figure 16. (i )). (Optional) If there is a desire to bound the height of the lattice due to compiletime considerations. . which contains exactly one constant indexelement pair..〉) ⊥ (A 1 ).e. e )〉. UPDATE involves four steps: 1. 〈 ( (k ). Now. Therefore.3 represents a “definitely-same” binary relation i.e. (A 2 ) (A 1 ) = (A 1 ) = 〈(i .3 occurs in the middle cell when neither (A 1 ) nor (k ) is or ⊥. e 1 ). Compute the list T = { (i j . 4. e ). As before. then one of the pairs in I can be dropped from the output list so as to satisfy the size constraint.5. I . (i )) 〉..4 occurs in the middle cell when both (k ) and (i ) are constant. . (a . .5 denotes a special update of the list (A 0 ) = 〈(i 1 . . e 1 ). 〈(i 1 . denotes a “definitely-different” binary relation i. as a function of (A 1 ) and (A 0 ). (A 1 ) = 〈(i . A 2 := d φ (A 1 . is computed as a function of (k ) and (i ). For this case. e ). e j ) | (i j . . e )〉 ⊥ ⊥ dφ( .〉) used in the middle cell in figure 16. e ). the list for (A 1 ) can contain at most one pair (see figure 16. e )〉 (A 1 ) = ⊥ Fig. b ) = true if and only if a and b are known to have exactly the same value. the interesting case in figure 16.e. (A 2 ) = φ ( (A 1 ).4). consider a write access of an array element.. . e ).. . the value returned for (A 1 ) is simply the singleton list.e.200 16 Array SSA Form — (Vivek Sarkar. e j ) ∈ (A 0 ) and (i . the lattice elements for k and i .4 shows how (A 1 ). we turn our attention to the φ functions. consider a control φ operation that merges two array values. e ) into T to obtain a new list. b ) = true if and only if a and b are known to have distinct (non-equal) values. (k )). 〈(i 1 . . 16. Stephen Fink) i. (a . the lattice elements for A 1 and A 0 i. Figure 16. . Return I as the value of UPDATE((i . . and (A 1 ) = ⊥.〉 UPDATE ((i (A 0 ) = ⊥ 〈(i . which in general has the form A 1 [k ] := i .e. The rules for computing this join . .5 Lattice computation for operation (A 2 ) = (A 0 ) = (A 0 ) = 〈(i 1 . A 0 ) is a definition φ Finally. 3. . (A 0 )) where A 2 := d φ (A 1 . . The interesting case in figure 16. (A 1 ) = d [ ] ( (k ). A 0 ). Analogous to . Insert the pair (i . Kathleen Knobe.5 are (A 1 ) = . e 1 ). (A 0 )) is shown in figure 16. the lattice element for the array being written into. The notation in the middle cell in figure 16. (A 0 )) = (A 1 ) (A 0 ). Next.

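The write operator and the UPDATE step used for definition φ's (Figure 16.5) can be sketched along the same lines; definitely_different below only handles constant indices, and the threshold Z that bounds the lattice height is an arbitrary illustrative value.

# Hedged sketch of the definition-phi transfer function of Figure 16.5.
TOP, BOTTOM = "TOP", "BOTTOM"
Z = 4   # optional bound on the number of pairs kept per lattice element

def definitely_different(i, j):
    # For constant indices, distinct constants are definitely different.
    return i != j

def update(pair, old_list):
    index, value = pair
    # 1. keep only the pairs whose index is definitely different from `index`
    kept = [(i, v) for (i, v) in old_list if definitely_different(index, i)]
    # 2. insert the new pair
    kept.append((index, value))
    # 3. optionally bound the size of the list to bound the lattice height
    if len(kept) > Z:
        kept = kept[-Z:]
    return kept

def dphi(lattice_a1, lattice_a0):
    """Lattice value of A2 for A2 := dPhi(A1, A0); A1 holds a single pair."""
    if lattice_a1 == TOP or lattice_a0 == TOP:
        return TOP
    if lattice_a1 == BOTTOM:
        return BOTTOM
    if lattice_a0 == BOTTOM:
        return list(lattice_a1)   # only the freshly written element is known
    return update(lattice_a1[0], lattice_a0)

if __name__ == "__main__":
    print(dphi([(1, 198)], [(3, 99)]))   # [(3, 99), (1, 198)]
    print(dphi([(3, 198)], [(3, 99)]))   # [(3, 198)]: the old pair is killed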
the numbering S1 through S8 indicates the correspondence. . .8) results in one data flow equation (in figure 16. e 1 ). the resulting array element constants revealed by the algorithm can be used in whatever analyses or transformations the compiler considers to be profitable to perform.11.6 Lattice computation for a control φ operation (A 2 ) = (A 0 ) = 〈(i 1 . The fixpoint solution is shown in figure 16.3 Sparse Constant Propagation of Array Elements (A 2 ) = (A 1 ) (A 0 ) (A 0 ) = (A 1 ) ⊥ φ( 201 (A 0 ) = ⊥ ⊥ ⊥ ⊥ (A 1 ) = (A 1 ) = 〈(i 1 . and the data flow equations for this example are shown in figure 16. as a generalization of the algorithm for constant subscripts described in . e 1 ). A 0 ) is operator are shown in figure 16.6. (A 0 )) = (A 0 ). variable I is known to equal 3 i. . . if C then else Y [3] := 99 D [1] := Y [3] ∗ 2 Z := D [1] endif D [1] := Y [I ] ∗ 2 Fig.〉 (A 0 ) (A 1 ) ∩ (A 0 ) ⊥ (A 1 ) (A 1 ). If.10. Each assignment statement in the partial Array SSA form (in figure 16.〉 (A 1 ) = ⊥ Fig. The partial Array SSA form for this example is shown in figure 16.16.9.8.9). (I ) = 3. including the standard worklist-based algorithm for constant propagation using scalar SSA form.6 denotes a simple intersection of lists (A 1 ) and (A 0 ) — the result is a list of pairs that appear in both (A 1 ) and (A 0 ). 16. 16. .2 Beyond Constant Indices In this section we address constant propagation through non-constant array subscripts. In either case ( (I ) = ⊥ or (I ) = 3).e.7 Sparse Constant Propagation Example 16. then the lattice variables that would be obtained after the fixpoint iteration step has completed are shown in figure 16. where A 2 := φ (A 1 . instead. depending on different cases for (A 1 ) and (A 0 ). This solution was obtained assuming (I ) = ⊥. .7. The notation (A 1 ) ∩ (A 0 ) used in the middle cell in figure 16.. Any solver can be used for these data flow equations.3. We conclude this section by discussing the example program in figure 16.

9. 198) 〉 〈 (1. assuming I is unknown S1: S2: S3: S4: S5: S6: S7: S8: (Y1 ) (Y2 ) (D 1 ) (D 2 ) (D 3 ) (D 4 ) (D 5 ) (Z ) = = = = = = = = 〈 (3. 198) 〉 〈 (1.8 Array SSA form for the Sparse Constant Propagation Example S1: S2: S3: S4: S5: S6: S7: S8: (Y1 ) (Y2 ) (D 1 ) (D 2 ) (D 3 ) (D 4 ) (D 5 ) (Z ) = < (3. Kathleen Knobe. 16. 99) 〉 〈 (3. D 0 ) D 3 [1] := Y2 [I ] ∗ 2 D 4 := d φ (D 3 . 99) 〉 〈 (1.202 16 Array SSA Form — (Vivek Sarkar. (D 0 )) = d [ ] ( ∗ ( [ ] ( (Y2 ). 198) 〉 198 Fig. assuming I is known to be = 3 . (D 0 )) = φ ( (D 2 ).9 Data Flow Equations for the Sparse Constant Propagation Example S1: S2: S3: S4: S5: S6: S7: S8: (Y1 ) (Y2 ) (D 1 ) (D 2 ) (D 3 ) (D 4 ) (D 5 ) (Z ) = = = = = = = = 〈 (3. 198) 〉 ⊥ ⊥ ⊥ ⊥ Fig. (Y0 )) = d [ ] ( ∗ ( [ ] ( (Y2 ).10 Solution to data flow equations from figure 16. 16. D 5 := φ (D 2 . D 0 ) . 99) > = d φ ( (Y1 ). (D 4 )) = [ ] ( (D 5 ). 16.9... 198) 〉 〈 (1. 1) Fig. D 4 ) Z := D 5 [1] endif Fig. 2)) = d φ ( (D 3 ). 99) 〉 〈 (3. Y0 ) if C then D 1 [1] := Y2 [3] ∗ 2 D 2 := d φ (D 1 . 198) 〉 〈 (1. Stephen Fink) Y0 and D 0 in effect here. 16.11 Solution to data flow equations from figure 16. 99) 〉 〈 (1. 2)) = d φ ( (D 1 ). S1: S2: S3: S4: S5: S6: S7: S8: else Y1 [3] := 99 Y2 := d φ (Y1 . 198) 〉 〈 (1. (I ))). 3)).

. and that (i . C 2 ) = (C 1 . it is clear that (i . a [i ] := k ∗ 5 . that is. (I 1 . for two symbols. I 2 ) = false and (I 1 . I 2 ) = FALSE (I 1 . In the loop in figure 16.12. (C 1 . C 1 and C 2 . some program analysis is necessary to compute the and relations. S 2 ) and (S 1 . C ) = true. i + 1) = true. we don’t know if they are the same or different. even though the index/subscript value i is not a constant.1. (I 1 . only one of the following three cases is possible: Case 1: Case 2: Case 3: (I 1 . However. I 2 ) = TRUE. unlike . 0) = true if i is a loop index variable that is known to be ≥ 1.3. := a [i ] 203 enddo Fig. We now discuss the compile-time computation of and for symbolic indices. . As an example. given index values I 1 and I 2 . I 2 ) = FALSE (I 1 . it is always correct to state that (I 1 .. We would like to extend the framework from section 16. C 2 ).3. S 2 ) are FALSE. . we can leverage past work on array dependence analysis to identify cases for which can be evaluated to true. For example. I 2 ) = TRUE The first case is the most conservative solution.1 to be able to recognize the read of a [i ] as constant in such programs. consider the program fragment in figure 16. . 16. • For constants. . . If (A . For symbolic indices. Observe that. I 2 ) = false. C ) = true. is not an equivalence relation because is not transitive.. we see that the read access of a [i ] will have a constant value (k ∗ 5 = 10). the values for (C 1 . If two indices i and j have the same value number. it is possible that both (S 1 . The problem of computing is more complex.12. There are two key extensions that need to be considered for non-constant (symbolic) subscript values: • For constants. C 2 ) and (C 1 . I 2 ) = FALSE. C 1 and C 2 .3 Sparse Constant Propagation of Array Elements k := 2 do i := .12 Example of Constant Propagation through Non-constant Index section 16. C 2 ) can be computed by inspection. j ) must = true. it does not imply that (A . The problem of determining if two symbolic index values are the same is equivalent to the classical problem of global value numbering. In the absence of any other knowledge. I 2 ) = FALSE. However. S 1 and S 2 . B ) = true and ( B . however. then (i . (I 1 .16. Note that .

i j and k have the same value number).e. For the read operation in figure 16. where A 1 [k ] is an array element read operator.13.13 ⊥ (k ) = (k ) = VALNUM(k ) e j . SET( ) = { } and SET(⊥) = A i nd × e l e m . consider the lattice itself. i . the result is or ⊥ as specified in figure 16. 16. 16. the result is or ⊥ as specified in figure 16. (i )). (A 1 ) (k ) = (k ) = VALNUM(k ) (k ) = ⊥ Fig. if ∃ (i j .1 viz. the result is the singleton list 〈(VALNUM(k ). where A 1 [k ] := i is an array element write operator. Note that the specification of how and are used is a separate issue from the precision of the and values. if the value of the right-hand-side. ⊥ (i ))〉 (i ) = ⊥ ⊥ ⊥ ⊥ ⊥ Lattice computation for (A 1 ) = d [ ] ( (k ). These operations were presented in section 16.14. The only extension required relative to figure 16. if there exists a pair (i j .14. Each element in the lattice is a list of index-value pairs where the value is still required to be constant but the index may be symbolic — the index is represented by its value number. We now revisit the processing of an array element read of A 1 [k ] and the processing of an array element write of A 1 [k ]. (A 1 [k ]) (A 1 ) = (A 1 ) = 〈(i 1 . For the write operation in figure 16..3. First.13 and figure 16. (k )). Stephen Fink) Let us consider how the and relations for symbolic index values are used by our constant propagation algorithms. e j ) ∈ (A 1 ) with (i j . Otherwise. is a constant.3 and 16. (i ))〉. If (k ) = VALNUM(k ).1 can be extended to deal with non-constant subscripts..204 16 Array SSA Form — (Vivek Sarkar.VALNUM(k )) = true (i.3.3. Let us now consider the propagation of lattice values through d φ operators. VALNUM(k )) = true ⊥.〉 (A 1 ) = ⊥ Fig. e 1 ). Otherwise.e j ) such that (i j . the lattice value of index k is a value number that represents a constant or a symbolic value. . (If no symbolic . The versions for nonconstant indices appear in figure 16.4) for constant indices. The and ⊥ lattice elements retain the same A meaning as in section 16. . i j ) = true if i and i j are symbolic value numbers rather than constants. Kathleen Knobe.1 (figures 16. If (k ) = VALNUM(k ). We now describe how the lattice and the lattice operations presented in section 16. . otherwise ⊥ (k ) = ⊥ ⊥ ⊥ ⊥ Lattice computation for (A 1 [k ]) = [ ] ( (A 1 ).13. the lattice value of index k is a value number that represents a constant or a symbolic value.14. then the result is e j .5 is that the relation used in performing the UPDATE operation should be able to determine when (i .14 (i ) = (i ) = Con s t a n t 〈(VALNUM(k ).

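A hedged sketch of how definitely-same and definitely-different could be answered for symbolic subscripts follows: definitely-same falls back on global value numbering, while definitely-different combines the constant-constant case with a set of recorded symbolic facts such as DD(i, i + 1); everything not covered answers false, the conservative choice discussed above. All names and the encoding of value numbers are mine.

# Hedged sketch of DS/DD tests for symbolic subscripts.
def definitely_same(i, j, value_number):
    # Two indices are definitely the same if GVN gave them the same number.
    return value_number(i) == value_number(j)

def definitely_different(i, j, value_number, facts):
    vi, vj = value_number(i), value_number(j)
    # Distinct constants are definitely different.
    if isinstance(vi, int) and isinstance(vj, int):
        return vi != vj
    # Simple symbolic rules recorded elsewhere, e.g. DD(i, i + 1).
    return (vi, vj) in facts or (vj, vi) in facts

if __name__ == "__main__":
    vn = {"i": "v1", "j": "v1", "i+1": "v2", "k1": 3, "k2": 5}.get
    facts = {("v1", "v2")}                  # recorded: i and i + 1 differ
    print(definitely_same("i", "j", vn))                 # True
    print(definitely_different("k1", "k2", vn, facts))   # True, 3 != 5
    print(definitely_different("i", "j", vn, facts))     # False, conservative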
. we introduce redundant load elimination as an exemplar of a program optimization that can be extended to heap objects in strongly typed languages by using Array SSA form. followed by a scalar replacement transformation that uses the analysis results to perform load elimination (Section 16. . . . . For each field x .4. .4. 16. Thus. .1 Analysis Framework We introduce a formalism called heap arrays which allows us to model object references as associative arrays. An extended Array SSA form is constructed on heap arrays by adding use-φ functions. . While u φ functions are used by the redun- .4 Extension to Objects: Redundant Load Elimination In this section. A use φ (u φ ) function creates a new name whenever a statement reads a heap array element.x is modeled as a write of element x [q ]. we introduce a hypothetical x x one-dimensional heap array.x is modeled as a read of element x [p ]. 3. A control φ from scalar SSA form. . . . .4. . . . . .1). . i j ) = . . . . 16. a GETFIELD of p . . with one dimension indexed by the object reference as in heap arrays for fields. and the second dimension indexed by the integer subscript.16. and a PUTFIELD of q . This extension models object references (pointers) as indices into hypothetical heap arrays (Section 16.2).4 Extension to Objects: Redundant Load Elimination 205 information is available for i and i j . . . u φ functions represent the extension in “extended” Array SSA form. . . . .4. . . . Note that field x is considered to be the same field for objects of types C 1 and C 2 . Heap array consolidates all instances of field x present in the heap. . . . . We then describe how definitely-same and definitely-different analyses can be extended to heap array indices (Section 16. then it is always safe to return false. . . . The use of distinct heap arrays for distinct fields leverages the fact that accesses to distinct fields must be directed to distinct memory locations in a strongly typed language. . . . . Heap arrays are renamed in accordance with an extended Array SSA form that contains three kinds of φ functions: 1. .3). . Heap arrays are indexed by object references. A definition φ (d φ ) from Array SSA form. . . . . if C 2 is a subtype of C 1 . Accesses to one-dimensional array objects with the same element type are modeled as accesses to a single two-dimensional heap array for that element type. . . The main purpose of the u φ function is to link together load instructions for the same heap array in control flow order. 2. .) (i . .

thereby enabling the compiler to replace it by a use of scalar temporary that captures the value in p.15. It relies on two observations related to allocation-sites: 1. (p ) = (r ). q ) = true. For the original program in figure 16.. and relates to pointer alias analysis. we show how global value numbering and allocation site information can be used to efficiently compute definitely-same ( ) and definitelydifferent ( ) information for heap array indices.x to be eliminated i. . We outline a simple and sound approach below. so that (p .e. and 2) object references q and r are distinct (definitely different) in all program executions. Therefore. the presence of the allocation site in q := new Type1 ensures that (p . to be replaced by a use of T1. thus enabling the second load of r. which can be replaced by more sophisticated techniques as needed.15 illustrates two different cases of scalar replacement (load elimination) for object fields. p := r.g. constant propagation) that do not require the creation of a new name at each use. the goal of our analysis is to determine that the load of r. Figure 16. r ) = true. Figure 16. if (i ) = ( j ).x is redundant.. For the code fragment above. then (i ..15(a). we use the notation (i ) to denote the value number of SSA variable i . We need to establish two facts to perform this transformation: 1) object references p and r are identical (definitely same) in all program executions. 16. Stephen Fink) dant load elimination optimization presented in this chapter. it is not necessary for analysis algorithms (e.4. The problem of computing for object references is more complex than value numbering.x.x can enable the load (use) of r. ensures that p and r are given the same value number. they can be used to further refine the and information. Object references that contain the results of distinct allocation-sites must be different.x.e.15(b) contains an example in which a scalar temporary (T2) is introduced for the first load of p.x to be eliminated i. For example. introducing a scalar temporary T1 for the store (def ) of p. thereby reducing pointer analysis queries to array index queries. the statement. 2. If more sophisticated pointer analyses are available in a compiler. replaced by T2. in Figure 16. The notation Hx[p] refers to a readx /write access of heap array element. In both cases. j ) = true. An object reference containing the result of an allocation-site must be different from any object reference that occurs at a program point that dominates the allocation site in the control flow graph. Kathleen Knobe.2 Definitely-Same and Definitely-Different Analyses for Heap Array Indices In this section. As before. [p ]. As an example.206 16 Array SSA Form — (Vivek Sarkar.

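The two allocation-site observations can be turned into a small definitely-different collector, sketched below on an invented straight-line representation; dominance is approximated here by textual order, which is only valid for straight-line code and is one of the simplifications of this illustration.

# Hedged sketch of allocation-site based definitely-different facts.
def allocation_site_facts(statements):
    """statements: list of ("new", var, site) or ("copy", dst, src), in order."""
    seen = []          # references already defined before the current point
    alloc_site = {}    # variable -> allocation site, when known
    dd = set()         # pairs of references known to be definitely different
    for stmt in statements:
        if stmt[0] == "new":
            _, var, site = stmt
            # 1. the result of an allocation differs from every reference
            #    defined before the allocation site (textual order here
            #    stands in for dominance in a CFG)
            for prev in seen:
                dd.add(frozenset((var, prev)))
            # 2. results of distinct allocation sites differ from each other
            for other, other_site in alloc_site.items():
                if other_site != site:
                    dd.add(frozenset((var, other)))
            alloc_site[var] = site
            seen.append(var)
        elif stmt[0] == "copy":
            _, dst, _src = stmt
            seen.append(dst)
    return dd

if __name__ == "__main__":
    prog = [("copy", "p", "r"), ("new", "q", "site1")]
    facts = allocation_site_facts(prog)
    print(frozenset(("p", "q")) in facts)   # True: q is freshly allocated
    print(frozenset(("p", "r")) in facts)   # False: p and r may be the same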
. . 16...x := . (A 0 )) where A 2 := u φ (A 1 .. := q.17 Lattice computation for tion (A 2 ) = (A 0 ) = (A 0 ) = 〈(i1 ). 16.x := . // . .. := p new Type1 .x := q.. T2 := . (A 0 )) where A 2 := d φ (A 1 ...〉 (A 1 ) ∪ ⊥ (A 0 ) (A 0 ) = ⊥ (A 1 ) ⊥ ⊥ uφ( (A 1 ).. ... . . . .. . p. .x T2 ..4...x // .. A 0 ) is a use φ opera- The main program analysis needed to enable redundant load elimination is index propagation.... ..x // . // Hx[p] := .〉) ⊥ (A 1 ). T1 . ... . := Hx[r] After redundant load elimination: After redundant load elimination: r := q := ... T1 := p. T1 (a) r := q := ..4 Extension to Objects: Redundant Load Elimination Original program: Original program: 207 r := q := . . A 0 ) is a definition φ (A 2 ) (A 1 ) = (A 1 ) = 〈i 〉 (A 1 ) = ⊥ Fig.... . := Hx[r] r := q := .16..x . := p new Type1 p. := p new Type1 .. // Hx[q] := .. 16.〉 UPDATE (i (A 0 ) = ⊥ 〈i 〉 ⊥ ⊥ dφ( . := Hx[p] // Hx[q] := . := p new Type1 p.x := . .15 Examples of scalar replacement 16. r..3 Scalar Replacement Algorithm (A 2 ) (A 1 ) = (A 1 ) = 〈i 〉 (A 1 ) = ⊥ Fig.16 Lattice computation for operation (A 2 ) = (A 0 ) = (A 0 ) = 〈i1 . . . := q.x := .. T2 (b) Fig. 〈(i1 )....x := q. .. which identifies the set of indices that are available at a spe- . r.

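Index propagation can be pictured with the three transfer functions below, mirroring Figures 16.16, 16.17 and 16.18: a definition φ keeps only the indices definitely different from the stored one and adds it, a use φ adds the loaded index, and a control φ intersects. The set-of-value-numbers encoding and the explicit table of definitely-different pairs are illustrative assumptions.

# Hedged sketch of the index-propagation transfer functions.
def d_phi(stored_index, avail, definitely_different):
    # A store makes its own index available and kills every index that might
    # alias it (the UPDATE of Figure 16.16).
    kept = {i for i in avail if definitely_different(i, stored_index)}
    kept.add(stored_index)
    return kept

def u_phi(loaded_index, avail):
    # A load adds its index to the available set (Figure 16.17).
    return avail | {loaded_index}

def control_phi(avail_left, avail_right):
    # At a control-flow merge only indices available on both sides survive
    # (Figure 16.18).
    return avail_left & avail_right

if __name__ == "__main__":
    known_different = {frozenset(("VN(p)", "VN(q)"))}
    dd = lambda a, b: frozenset((a, b)) in known_different
    avail = set()
    avail = d_phi("VN(p)", avail, dd)   # p.x := ...  makes VN(p) available
    avail = d_phi("VN(q)", avail, dd)   # q.x := ...  (q definitely not p)
    avail = u_phi("VN(r)", avail)       # ... := r.x
    print(sorted(avail))                # ['VN(p)', 'VN(q)', 'VN(r)']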
19 = {} ={ ={ x 0) x 2) ={ ={ r := p q := new Type1 (p ) = (r )} . .18 give the lattice computations which define the index propagation solution. x [ q ] := .16. .. 3 x x 4 := d φ ( 3.. The notation UPDATE (i .16 denotes a special update of the list (A 0 ) = 〈i1 .〉 with respect to index i . ... 〈i1 .. A 0 ) is (a) Extended Array SSA form: (b) After index propagation: ( ( ( ( ( x 0) x 1) x 2) x 3) x 4) (c) After transformation: r := p q := new Type1 . Figure 16. . . and the size of list I exceeds a threshold size Z . These results depend on definitely-different analysis establishing that (p ) = (q ) and definitely-same analysis establishing that (p ) = (r ). 16. . (q )} . T1.19(c) shows the transformed code after performing the scalar replacement actions. Index propagation is a dataflow problem. 4. Note that the lattice element does not include the value of [i] (as in constant propagation). .15(a) cific def/use A i of heap array A .. (A 0 )) = (A 0 ).x := T1 (q )} q. . . 16. . and replaced by a use of the scalar temporary. then any one of the indices in I can be dropped from the output list. . . if there is a desire to bound the height of the lattice due to compile-time considerations.. (Optional) As before.19(b). for scalar replacement if and only if index propagation determines that an index with value number (x) is available at the def of A j . := [ r ] 4 Fig. . Figure 16. 〈i1 . Figures 16.〉 (A 0 ) (A 1 ) ∩ (A 0 ) ⊥ (A 1 ) (A 0 ) = ⊥ ⊥ ⊥ ⊥ (A 1 ). the algorithm selects a load. Kathleen Knobe. Figure 16.〉).〉 (A 1 ) = ⊥ Fig.. The results of index propagation are shown in figure 16. . A j [x]. Stephen Fink) (A 1 ) (A 0 ) (A 0 ) = (A 1 ) ⊥ φ( (A 1 ) = (A 1 ) = 〈i1 . Return I as the value of UPDATE(i . . p. x x 2 := d φ ( 1. := T1 Trace of index propagation and load elimination transformation for program in figure 16..x has thus been eliminated in the transformed code. .19(a) shows the extended Array SSA form computed for this example program. . Insert i into T to obtain a new list.. x 1 [p ] := . just the fact that it is available. After index propagation. where A 2 := φ (A 1 . . x .〉) used in the middle cell in figure 16. I . 3. .15(a). List T contains only those indices from (A 0 ) that are definitely different from i . (p ) = (r ).17 and 16. 2.18 Lattice computation for a control φ operation (A 2 ) = (A 0 ) = 〈(i1 ). The load of p..19 illustrates a trace of this load elimination algorithm for the example program in figure 16. the goal of which is to compute a lattice value ( ) for each renamed heap variable in the Array SSA form such that a load of [i] is available if (i) ∈ ( ). (p ) = (r )} T1 := .x := . ij ) = true }. Compute the list T = { ij | ij ∈ (A 0 ) and (i .. UPDATE involves four steps: 1. 16.208 (A 2 ) = 16 Array SSA Form — (Vivek Sarkar. .

16.5 Further Reading

In this chapter, we introduced an Array SSA form that captures element-level data flow information for array variables, illustrated how it can be used to extend program analyses for scalars to array variables using constant propagation as an exemplar, and illustrated how it can be used to extend optimizations for scalars to array variables using load elimination in heap objects as an exemplar.

In addition to reading the other chapters in this book for related topics, the interested reader can consult [159] for details on full Array SSA form, [160] for details on constant propagation using Array SSA form, [231] for efficient dependence analysis of pointer-based array objects, [115] for details on load and store elimination for pointer-based objects using Array SSA form, and [23] for extensions of this load elimination algorithm to parallel programs.

There are many possible directions for future research based on this work. The definitely-same and definitely-different analyses outlined in this chapter are sound, but conservative. In restricted cases, they can be made more precise using array subscript analysis from polyhedral compilation frameworks. Achieving a robust integration of Array SSA and polyhedral approaches is an interesting goal for future research. Another interesting direction is to extend the value numbering and definitely-different analyses mentioned in section 16.3.2 so that they can be combined with constant propagation rather than performed as a pre-pass. An ultimate goal is to combine conditional constant, type propagation, value numbering, partial redundancy elimination, and scalar replacement analyses within a single framework that can be used to analyze scalar variables, array variables, and pointer objects with a unified approach. Past work on Fuzzy Array Dataflow Analysis (FADA) [24] may provide a useful foundation for exploring such an integration.


. When a variable is defined by a predicated operation. the value of the variable after the predicated operation is either the value of the expression on the assignment if the predicate is true. . . and the value of a variable is the value of the expression on the unique assignment to this variable. . . . .1 Overview In the SSA representation. As a result. . . de Ferrière Progress: 95% Final reading in progress . This essential property of the SSA representation does not any longer hold when definitions may be conditionally executed. . . An example is given figure 17. . and is ignored otherwise. . . . We will also use the notation p to refer to the complement of predicate p. . . . . . . . . and new pseudo definitions are introduced on φ -functions to merge values coming from different control-flow paths. . Each definition is an unconditional definition. the value of the variable will or will not be modified depending on the value of a guard register.single definition property predicated execution guard predicate CHAPTER 17 Psi-SSA Form F. . or the value the variable had before this operation if the predicate is false.1(b). This is represented in figure 17. . . . . . . each definition of a variable is given a unique name. . . .1(c) where we use the notation p ? a = op to indicate that an operation a = op is executed only if predicate p is true. . . . . 17. 211 . . . . The goal of the ψ-SSA form advocated in this chapter is to express these conditional definitions while keeping the static single assignment property. . . . . . . . . . .

. p ? a 2 = op2. . flow. . . . de Ferrière) a 1 = op1. the compiler may also generate predicated operations through if-conversion optimizations as described in Chapter 21. the φ -function would be augmented with the predicate p to tell which value between a 1 and a 2 is to be considered. . presented in Chapter 18. The φ -function of the SSA code with control-flow is “replaced” by a ψ-function on the corresponding predicated code. . which holds the correct value. . or holds the value of a 1 if p is false. . which would prevent code reordering for scheduling Our solution is presented in figure 17. These multiple reaching definitions on the use of a cannot be represented by the standard SSA representation. . In the notation. .1 SSA representation (c) With predication . . a 3 = φ (a 1 . the predicate p i will be omitted if p i ≡ true.1(d). . = a ?? 17 Psi-SSA Form — (F. A ψ-function a 0 = ψ(p 1 ?a 1 . compiler guard gate predicate a = op1. . . . . . p ? a 2 = op2 | a 1 would have the following semantic: a 2 takes the value computed by op2 if p is true. each argument a i is associated with a predicate p i . . and takes a variable number of arguments a i . . the use of a on the last instruction refers to the variable a 1 if p is false. . . . . . a 3 = ψ(a 1 . This representation is adapted to code optimization and code generation on a low-level intermediate representation. . In figure 17.212 Gated-SSA form dependence. p i ?a i . . 17. if (p ) a 1 = op1. . p ?a 2 ) = a3 (d) ψ-SSA then =a then a 2 = op2. The use of a on the last instruction of Figure 17. a 1 = op1. a 0 . . . . . . . . p ? a 2 = op2. 17. .2 Denition and Construction Predicated operations are used to convert control-flow regions into straight-line code. with information on the predicate associated with each argument. However. p n ?a n ) defines one variable. The drawback of this representation is that it adds dependencies between operations (here a flow dependence from op1 to op2). Later on. This representation is more suited for program interpretation than for optimizations at code generation level as addressed in this chapter. Another possible representation would be to add a reference to a 1 on the definition of a 2 .1(c). . . .1(c) would now refer to the variable a 2 . . . . if (p ) a = op2. \psifun back-end. . . . a 2 ) = a3 (b) SSA (a) non-SSA Fig. . . Gated-SSA is a completely different intermediate representation where the control-flow is no longer represented. . . . . . . . . In such a representation. One possible representation would be to use the Gated-SSA form. . or to the variable a 2 if p is true. . Predicated operations may be used by the intermediate representation in an early stage of the compilation process as a result of inlining intrinsic functions. . . .

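The evaluation rule stated above (the value of a ψ-function is the value of its right-most argument whose predicate evaluates to true) can be captured by the small interpreter sketched below; the argument encoding is an assumption of this illustration.

# Hedged sketch of the evaluation rule for a psi-function.
def evaluate_psi(args, env):
    """args: list of (predicate_name or None, variable_name), left to right."""
    result = None
    for predicate, variable in args:        # evaluated from left to right
        if predicate is None or env[predicate]:
            result = env[variable]          # later arguments override earlier
    return result

if __name__ == "__main__":
    # x = psi(a1, p?a2)  for:  a1 = op1;  p? a2 = op2
    env = {"a1": 10, "a2": 20, "p": False}
    print(evaluate_psi([(None, "a1"), ("p", "a2")], env))   # 10
    env["p"] = True
    print(evaluate_psi([(None, "a1"), ("p", "a2")], env))   # 20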
p ? a 1 = 1. although this predicate is not explicit in the representation. we must have p i ⊆ q (or p i ⇒ q ). x 1 = φ (a 1 . the semantics of the original code requires that the definition of a 3 occurs after the definitions of a 1 and a 2 . x 2 = φ (x 1 . . 17. which may allow more optimizations to be performed on this code.2 Definition and Construction 213 dependence. • Rule on predicates : The predicate p i associated with the argument a i in a ψ-function must be included in or equal to the predicate on the definition of the variable a i . We take the convention that the order of the arguments in a ψ-function . The order of the arguments in a ψ-function gives information on the original order of the definitions. In other words. In the control-flow version of the code. a 3 ) (a) Control-flow code Fig. p ?a 2 ) q ? a 3 = 0. must be dominated by its definition. A ψ-function is evaluated from left to right. p ? a 2 = −1. which means the definition of a 3 must be executed after the value for x 1 has been computed. and each predicate p i . x 1 = ψ(p ?a 1 . A compiler transformation can now freely move these definitions independently of each other. . there is a control-dependence between the basic-block that defines x 1 and the operation that defines a 3 . a 2 and a 3 . The value of a ψ-function is the value of the right most argument whose predicate evaluates to true. if (p ) then else a 1 = 1.2 ψ-SSA with non-disjoint predicates then A ψ-function can represent cases where variables are defined on arbitrary independent predicates such as p and q in the example of Figure 17. . a 2 and a 3 and the variables x 1 and x 2 were introduced to merge values coming from different control-flow paths. for the code q ? a i = op. Each argument a i . a 2 ) if (q ) a 3 = 0. p i ?a i . In the predicated form of this example. It can occur at any location in a basic block where a regular operation is valid. . However. • It is predicated : A ψ-function is a predicated operation. . p ?a 2 . ). • It has an ordered list of arguments : The order of the arguments in a ψfunction is significant.2 : For this example. a 0 = ψ(. under the predicate n k =1 p k . there is no longer any control dependencies between the definitions of a 1 . . during the SSA construction a unique variable a was renamed into the variables a 1 . x 2 = ψ(p ?a 1 . control- A ψ-function has the following properties: • It is an operation : A ψ-function is a regular operation.17. q ?a 3 ) (b) Predicated code a 2 = −1.

predicated definitions behave exactly the same as non predicated . Usual algorithms that perform optimizations or transformations on the SSA representation can now be easily adapted to the ψ-SSA representation. The first argument of the ψ-function is already renamed and thus is not modified. from top to bottom. . . within the ψ-SSA representation. . .3. . . p 2 ?x ) (c) op-renaming p 2 ? x 2 = op x 3 = ψ(p 1 ?x 1 . for each predicated definition of a variable. The second argument is renamed into the current version of x which is x 2 . . p 2 ?x 2 ) (d) ψ-renaming Construction and renaming of ψ-SSA ψ-functions can also be introduced in an SSA representation by applying an ifconversion transformation.214 construction. .1). and that x 1 is defined through predicate p 1 (possibly true). . . . . 17. .3 SSA algorithms With this definition of the ψ-SSA representation. is inserted right after op. without compromising the efficiency of the transformations performed. p 2 ?x ) (b) ψ-insertion p 2 ? x 2 = op x = ψ(p 1 ?x 1 . . 17. of variables if-conversion def-use chains 17 Psi-SSA Form — (F. . . The construction of the ψ-SSA representation is a small modification on the standard algorithm to built an SSA representation (see Section 3. . . . . This insertion and renaming of a ψ-function is shown on Figure 17. Local transformations on control-flow patterns can also require to replace φ -functions by ψ-functions. . . . . . the variable x is given a new name. . . . . in the control-flow dominance tree of the program in a non-SSA representation. . a ψ-function of the form x = ψ(p 1 ?x 1 . . . On the definition of the ψ-function. x 3 . . . basic blocks are processed in their dominance order. . . . . of $\psi$SSA renaming. . de Ferrière) is. say x 2 . equal to the original order of their definitions. . The insertion of ψ-functions is performed during the SSA renaming phase. . On an operation. after renaming x into a freshly created version.3 p 2 ? x = op x = ψ(p 1 ?x 1 . . . . Then renaming of this new operation proceeds. This information is needed to maintain the correct semantics of the code during transformations of the ψ-SSA representation and to revert the code back to a non ψ-SSA representation. p 2 ?x ). . . a new ψ-function will be inserted just after the operation : Consider the definition of a variable x under predicate p 2 (p 2 ? x = op). implicit data-flow links to predicated operations are now explicitly expressed through ψ-functions. which becomes the current version for further references to the x variable. p 2 ? x = op (a) Initial Fig. from left to right. . . . . suppose x 1 is the current version of x before to proceeding op. . . During the SSA renaming phase. such as the one that is described in Chapter 21. and operations in each basic block are scanned from top to bottom. Actually.

. .. . are examples of algorithm that can easily be adapted to this representation with minor efforts. p n ?a n ). .. a 1 = op1 p 2 ? a 2 = op2 p 2 ? a 3 = op3 x 2 = ψ(a 1 . . . and induction variable analysis (see Chapter 11). Other algorithms such as dead code elimination(see Chapter 3). . . . . .. . p 3 ?a 3 ) Fig. . . . . . ψ-projection. the only modification is that ψ-functions have to be handled with the same rules as the φ -functions. The first argument a 1 of the ψ-function can safely be removed . . p 2 ?a 2 . p 2 ?a 3 ) ψ-reduction. The predicate p i associated with argument a i will be distributed with an and operation over the predicates associated with the inlined arguments. .4. As an example. In this algorithm. namely ψ-inlining.4 Psi-SSA algorithms 215 constant propagation dead code elimination global vale numbering partial redundancy elimination induction variable recognition $\psi$-inlining $\psi$-reduction ones for optimizations on the SSA representation. a 1 = op1 p 2 ? a 2 = op2 x 1 = ψ(a 1 . those transformations are defined as follows: ψ-inlining recursively replaces in a ψ function an argument a i that is defined on another ψ function by the arguments of this other ψ-function. . . p 2 ?a 3 ) Fig. .4 ψ-inlining of the definition of x 1 a 1 = op1 p 2 ? a 2 = op2 x 1 = ψ(a 1 . . . . 17. 17. . . p 1 ∧ p 2 ?a 2 . global value numbering.4 Psi-SSA algorithms In addition to standard algorithms that can be applied to ψ-functions and predicated code. ψ-reduction. An argument n a i associated with predicate p i can be removed if p i ⊆ k =i +1 p k . This can be illustrated by the example of Figure 17. .5 a 1 = op1 p 2 ? a 2 = op2 p 2 ? a 3 = op3 x 2 = ψ(p 2 ?a 2 . a number of specific transformations can be performed on the ψfunctions. . .5. . the classical constant propagation algorithm under SSA can be easily adapted to the ψ-SSA representation. This is shown in figure 17. . . . For a ψ-function a 0 = ψ(p 1 ?a 1 . p 2 ?a 2 ) p 3 ? a 3 = op3 x 2 = ψ(p 1 ?x 1 . . p 2 ?a 2 ) // dead p 3 ? a 3 = op3 x 2 = ψ(p 1 ?a 1 . . . . . . . p i ?a i . . . . . Only the ψ-functions have to be treated in a specific way. .17. . 17. partial redundancy elimination (see Chapter 12).. . . . ψ-permutation and ψ-promotion. . . p 3 ?a 3 ) ψ-reduction removes from a ψ-function an argument a i whose value will always be overridden by arguments on its right in the argument list.. .. .

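The insertion of ψ-functions during renaming (Figure 17.3) can be sketched, for a single variable x and a straight-line list of possibly predicated definitions, as follows; the statement encoding and helper names are invented for the illustration.

# Hedged sketch of psi insertion performed during SSA renaming of x.
def rename_with_psi(statements):
    version = 0      # last version number created for variable x
    current = None   # (current name of x, predicate of its definition)
    out = []
    for predicate, op in statements:          # each stmt: "pred? x = op"
        version += 1
        new_name = "x%d" % version
        out.append((predicate, new_name, op))
        if predicate is None or current is None:
            current = (new_name, predicate)   # an unpredicated def kills x
            continue
        # Predicated definition: insert x_k = psi(p_cur?x_cur, p_new?x_new)
        version += 1
        psi_name = "x%d" % version
        out.append((None, psi_name,
                    "psi(%s, %s)" % (fmt(current), fmt((new_name, predicate)))))
        current = (psi_name, None)            # the psi becomes the current def
    return out

def fmt(arg):
    name, pred = arg
    return "%s?%s" % (pred, name) if pred else name

if __name__ == "__main__":
    #   x = op1 ;  p2? x = op2
    for pred, name, op in rename_with_psi([(None, "op1"), ("p2", "op2")]):
        prefix = pred + "? " if pred else ""
        print(prefix + name + " = " + op)
    # prints:  x1 = op1 / p2? x2 = op2 / x3 = psi(x1, p2?x2)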
. An example of such a permutation is shown on Figure 17. p 2 ?a 3 ) p 2 ? y1 = x 2 p 2 ? a 2 = op2 p 2 ? a 3 = op3 x 2 = ψ(p 2 ?a 2 .. If p i is known to be disjoint with p . . k =1 p k . Two arguments in a ψ-function can be permuted if the intersection of their associated predicate in the ψ-function is empty. . then p i must fulfill n pi \ n pk k =i ∩ i −1 k =1 pk = (17. and in particular. p 2 ? a 2 = op2 p 2 ? a 3 = op3 x 2 = ψ(p 2 ?a 2 ..8 . p i ?x i . a i actually contributes no value to the ψ-function and thus can be removed. This promotion must also satisfy the properties of ψ-functions. 17.. where only a subset of the instructions supports a predicate operand. that the predicate associated with a variable in a ψ-function must be included in or equal to the predicate on the definition of that variable (which itself can be a ψ-function).8(c). ψprojection on predicate p is usually performed when the result of a ψ-function is used in an operation predicated by p ..6 ψ-projection of x 2 on p 2 . . p 2 ?a 3 ) x 3 = ψ(p 2 ?a 2 ) p 2 ? y1 = x 3 Fig. p 2 ?a 2 ) ψ-promotion changes one of the predicates used in a ψ-function by a larger predicate. Second argument a 3 can be removed.6.. partial 17 Psi-SSA Form — (F. p 2 ?a 3 ) Fig.1) where p i \ k =i p k corresponds to the possible increase of the predicate of the ψn function. ψ-permutation changes the order of the arguments in a ψ-function.. In this new ψ-function... The ψ-SSA representation can be used on a partially predicated architecture. p i ?x i .216 $\psi$-projection predication $\psi$-permutation $\psi$-promotion speculation predication. Promotion must obey the following condition so that the semantics of the ψ-function is not altered by the transformation : consider an operation a 0 = ψ(p 1 ?x 1 . In a ψfunction the order of the arguments is significant.7. p n ?x n ) promoted into a 0 = ψ(p 1 ?x 1 . p 2 ? a 3 = op3 p 2 ? a 2 = op2 x 2 = ψ(p 2 ?a 2 . . 17. This is illustrated in Figure 17. A simple ψ-promotion is illustrated in Figure 17.... Figure 17. p n ?x n ) with p i ⊆ p i . an argument a i initially guarded by p i shall be guarded by the conjunction p i ∧ p .7 ψ-permutation of arguments a 2 and a 3 p 2 ? a 3 = op3 p 2 ? a 2 = op2 x 2 = ψ(p 2 ?a 3 . de Ferrière) ψ-projection creates from a ψ-function a new ψ-function on a restricted predicate say p .

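ψ-reduction can be expressed compactly if predicates are modelled as sets of abstract conditions under which they hold, so that inclusion and union are exact; the sketch below drops every argument whose predicate is covered by the union of the predicates to its right, as in Figure 17.5. The encoding is an assumption of this illustration only.

# Hedged sketch of psi-reduction on set-encoded predicates.
def psi_reduce(args):
    """args: list of (predicate_set, name); returns the reduced list."""
    kept = []
    covered_right = set()
    for predicate, name in reversed(args):
        if not predicate <= covered_right:   # some state still reaches it
            kept.append((predicate, name))
        covered_right |= predicate
    kept.reverse()
    return kept

if __name__ == "__main__":
    P     = frozenset({"p"})
    NOT_P = frozenset({"not_p"})
    TRUE  = P | NOT_P
    # x2 = psi(a1, p?a2, !p?a3): a1 is always overridden by a2 or a3
    print(psi_reduce([(TRUE, "a1"), (P, "a2"), (NOT_P, "a3")]))
    # -> [(frozenset({'p'}), 'a2'), (frozenset({'not_p'}), 'a3')]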
1. p ?a 2 ) (b) ψ-SSA a 1 = ADD i 1 . a 2 ) (a) control flow Fig. . . . Taking the example of an architecture where the ADD operation cannot be predicated. results in a program with the same semantics as the original program. . . . 17. . 1. The property of the ψ-C-SSA form is that the renaming into a single variable of all variables that belong to the same ψ-φ -web. Now. . 1. . . x = φ (a 1 . the first argument of a ψ-function can be promoted under the true predicate. x = ψ(p ?a 1 . . and the removal of the ψ and φ functions. This algorithm uses ψ-φ -webs to create a conventional ψ-SSA representation.17. it may also be profitable to perform speculation in order to reduce the number of predicates on predicated code and to reduce the number of operations to compute these predicates. .5 Psi-SSA destruction The SSA destruction phase reverts an SSA representation into a non-SSA representation. .9 to illustrate the transformations that must be performed to convert a program from a ψ-SSA form into a program in ψ-C-SSA form. ψ-promotion of argument a 1 shows an example where some code with control-flow edges was transformed into a linear sequence of instructions. . one of them can be promoted to include the other conditions. . . . Also. Usually. This phase must be adapted to the ψ-SSA representation. . . . the predicate associated with this argument can be promoted. the ADD operation must be speculated under the true predicate. a 2 = ADD i 1 . . a 1 = ADD i 1 . . . set of variables such that if two variables are referenced on the same φ or ψ-function then they are in the same ψ-φ -web. . . .8 ψ-SSA for partial predication. . 2.5 Psi-SSA destruction 217 speculation destruction. . . x = ψ(a 1 . . a 2 = ADD i 1 . . . of SSA phiweb C-SSA if (p ) then else a 1 = ADD i 1 . . The notion of φ -webs is extended to φ and ψ operations so as to derive the notion of conventional ψ-SSA (ψ-C-SSA) form. Once speculation has been performed on the definition of a variable used in a ψ-function. . . . minimal. consider Figure 17. p ?a 2 ) (c) after ψ-promotion a 2 = ADD i 1 . usually reducing the number of predicates. when disjoint conditions are computed. . A ψ-φ -web is a non empty. . as will be explained in the following section. . . On an architecture where the ADD operation can be predicated. . .1). . A side effect of this transformation is that it may increase the number of copy instructions to be generated during the ψ-SSA destruction phase. . . provided that the semantic of the ψ-function is maintained (Equation 17. 2. 2. . . . . . . . 17. .

17.5 Psi-SSA destruction

The SSA destruction phase reverts an SSA representation into a non-SSA representation. This phase must be adapted to the ψ-SSA representation. The notion of φ-webs is extended to φ and ψ operations so as to derive the notion of conventional ψ-SSA (ψ-C-SSA) form. A ψ-φ-web is a non-empty, minimal, set of variables such that if two variables are referenced on the same φ- or ψ-function then they are in the same ψ-φ-web. The property of the ψ-C-SSA form is that the renaming into a single variable of all variables that belong to the same ψ-φ-web, and the removal of the ψ and φ functions, results in a program with the same semantics as the original program. Now, consider Figure 17.9 to illustrate the transformations that must be performed to convert a program from ψ-SSA form into a program in ψ-C-SSA form.

Fig. 17.9 Non-conventional ψ-SSA (ψ-T-SSA) form, ψ-C-SSA forms, and non-SSA form after destruction. Panels: (a), (d), (g) ψ-T-SSA form; (b), (e), (h) ψ-C-SSA form; (c), (f), (i) non-SSA form.

Looking at the first example (Figure 17.9(a)), the dominance order of the definitions of the variables a and b differs from their order, from left to right, in the ψ-function. Such code may appear after a code motion algorithm has moved the definitions for a and b relatively to each other. Here, the renaming of the variables a, b and x into a single variable will not restore the semantics of the original program: the order in which the definitions of the variables a, b and x occur must be corrected. This is done through the introduction of the variable c, defined as a predicated copy of the variable b, after the definition of a. In the ψ-function, variable b is then replaced by variable c. Now, the renaming of the variables a, c and x into a single variable will result in the correct behavior (Figure 17.9(b) and (c)).

In Figure 17.9(d), the definition of the variable b has been speculated. Such code will occur after copy folding has been applied on a ψ-SSA representation. However, the semantics of the ψ-function is that the variable x will only be assigned the value of b when p is true. A new variable c must be defined as a predicated copy of the variable b, after the definition of b and p; variable b is then replaced by variable c in the ψ-function. The renaming of the variables a, c and x into a single variable will now follow the correct behavior (Figure 17.9(e) and (f)).

In Figure 17.9(g), the renaming of the variables referenced by the two ψ-functions (a, b, c, x and y) into a single variable will not give the correct semantics: the value of a used in the second ψ-function would be overridden by the definition of b before the definition of the variable c. We see that the value of a has to be preserved before the definition of b. This is done through the definition of a new variable (d here). Now, the variables a, b and x can be renamed into a single variable, and the variables d, c and y will be renamed into another variable, resulting in a program in non-SSA form with the correct behavior (Figure 17.9(h) and (i)).
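All three repairs above boil down to the same step: insert a predicated copy into a fresh variable and substitute that variable in the ψ-function. A minimal sketch of this repair step is given below; the helper names (new_var, insert_after, replace of the argument) are illustrative placeholders, not an actual compiler API.

def repair_with_predicated_copy(psi, arg_index, insertion_point, ctx):
    pred, old = psi.args[arg_index]
    fresh = ctx.new_var(old)                       # e.g. c in Figure 17.9(b), d in 17.9(h)
    ctx.insert_after(insertion_point, ("pred_copy", pred, fresh, old))   # pred ? fresh = old
    psi.args[arg_index] = (pred, fresh)            # the psi-function now references the copy
    return fresh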

We will now present an algorithm that transforms a program from ψ-SSA form into its ψ-C-SSA form. This algorithm is made of three parts:

• Psi-normalize: This phase puts all ψ-functions in what we call a normalized form.
• Psi-web: This phase grows ψ-webs from ψ-functions, and introduces repair code where needed such that each ψ-web is interference free.
• Phi-web: This phase is the standard SSA-destruction algorithm (e.g., see Chapter 22) with the additional constraint that all variables in a ψ-web must be coalesced together. This can be done using the pinning mechanism presented in Chapter 22.

We now detail the implementation of each of the first two parts.
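At the top level the destruction therefore looks like the following driver. This is only an illustrative sketch: the three phase functions and the ctx object are assumed to be provided as described in the rest of this section.

def psi_ssa_destruction(function, ctx):
    # 1. Put every psi-function in normalized form (Section 17.5.1).
    for psi in function.psi_functions(order="dominator-tree-top-down"):
        psi_normalize(psi, ctx)
    # 2. Grow interference-free psi-webs, inserting repair code where needed (Section 17.5.2).
    psi_web(function, ctx)
    # 3. Standard SSA destruction, with all variables of a psi-web coalesced together.
    phi_web(function, ctx, initial_webs=ctx.psi_webs())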

17.5.1 Psi-normalize

We define the notion of the normalized-ψ. The normalized form of a ψ-function has two characteristics:

• The order of the arguments in a normalized-ψ-function is, from left to right, equal to the order of their definitions, from top to bottom, in the control-flow dominance tree.
• The predicate associated with each argument in a normalized-ψ-function is equal to the predicate used on the unique definition of this argument.

These two characteristics correspond respectively to the two cases presented in Figure 17.9(a) and Figure 17.9(d). When some arguments of a ψ-function are also defined by ψ-functions, the normalized-ψ characteristics must hold on a virtual ψ-function where ψ-inlining has been performed on these arguments.

When ψ-functions are created during the construction of the ψ-SSA representation, they are naturally built in their normalized form. Later, transformations are applied to the ψ-SSA representation: predicated definitions may be moved relatively to each other, and operation speculation and copy folding may enlarge the domain of the predicate used on the definition of a variable. These transformations may cause some ψ-functions to be in a non-normalized form. An analysis of the ψ-functions in a top-down traversal of the dominator tree reduces the amount of repair code that is inserted during this pass; we only detail the algorithm for such a traversal.

PSI-normalize implementation

Each ψ-function is processed independently. For a ψ-function a0 = ψ(p1?a1, . . ., pi?ai, . . ., pn?an), the argument list is processed from left to right. For each argument ai, the predicate pi associated with this argument in the ψ-function and the predicate used on the definition of this argument are compared.¹ If they are not equal, a new variable a′i is introduced and is initialized, at the highest point in the dominator tree after the definitions of ai and pi, by the predicated copy pi? a′i = ai. Then ai is replaced by a′i in the ψ-function.

Then, we consider the dominance order of the definition for ai with respect to the definition for ai−1. When ai is defined on a ψ-function, its definition may appear after the definition for ai−1, although the non-ψ definition for ai appears before the definition for ai−1. If the definition we found for ai dominates the definition for ai−1, some correction is needed. If the predicates pi−1 and pi are disjoint, a ψ-permutation can be applied between ai−1 and ai, so as to reflect into the ψ-function the actual dominance order of the definitions of ai−1 and ai. If ψ-permutation cannot be applied, a new variable a′i is created for repair; a′i is defined by the predicated copy pi? a′i = ai. This copy operation is inserted at the highest point that is dominated by the definitions of ai−1 and ai, that is, just above the definition for ai+1, or just above the ψ-function in the case of the last argument. ai is then replaced in the ψ-function by a′i. The algorithm continues with the argument ai+1, until all arguments of the ψ-function are processed. When all arguments are processed, the ψ-function is in its normalized form. When all ψ-functions are processed, the function contains only normalized-ψ-functions.

¹ When ai is defined by a ψ-function, we recursively look for the definition of the first argument of this ψ-function, until a definition on a non-ψ-function is found.
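The pass just described can be sketched as follows. This is an illustrative sketch only: the ctx queries (pred_of_def, definition_of, dominates, disjoint, new_var, insert_pred_copy, highest_point_after_defs, next_insertion_point) are assumed placeholders for the corresponding facilities of the host code generator, and the placement of the repair copies is simplified with respect to the full description above.

def psi_normalize(psi, ctx):
    args = psi.args                               # [(p_i, a_i), ...] left to right
    for i in range(len(args)):
        p_i, a_i = args[i]
        # 1. The guard must equal the predicate of the argument's definition.
        if ctx.pred_of_def(a_i) != p_i:
            a_rep = ctx.new_var(a_i)
            ctx.insert_pred_copy(ctx.highest_point_after_defs(a_i, p_i), p_i, a_rep, a_i)
            args[i] = (p_i, a_rep)
            a_i = a_rep
        if i == 0:
            continue
        # 2. Definitions must appear in dominance order, left to right.
        d_i = ctx.definition_of(a_i)
        while d_i.is_psi():                        # look through nested psi-functions
            d_i = ctx.definition_of(d_i.args[0][1])
        p_prev, a_prev = args[i - 1]
        if ctx.dominates(d_i, ctx.definition_of(a_prev)):
            if ctx.disjoint(p_prev, p_i):
                args[i - 1], args[i] = args[i], args[i - 1]      # psi-permutation
            else:
                a_rep = ctx.new_var(a_i)
                # repair copy, inserted just above the definition of a_{i+1}
                # (or just above the psi-function for the last argument)
                ctx.insert_pred_copy(ctx.next_insertion_point(psi, i), p_i, a_rep, a_i)
                args[i] = (p_i, a_rep)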

17.5.2 Psi-web

The role of the psi-web phase is to repair the ψ-functions that are part of a non-interference-free ψ-web. The cases where repair code is needed can be easily and accurately detected by observing which variables in a ψ-function interfere. For this purpose, in the same way as there is a specific point of use for the arguments of φ-functions in liveness analysis (e.g., see Section 22.3), we give a definition of the actual point of use of the arguments of normalized ψ-functions. With this definition, liveness analysis is computed accurately and an interference graph can be built.

Liveness and interferences in Psi-SSA. Consider the code in Figure 17.10(b). Instead of using a representation with ψ-functions, predicated definitions have been modified to make a reference to the value the predicated definition will have in case the predicate evaluates to false. We use in this example the notation of the select operator, x = cond ? exp1 : exp2, that assigns exp1 to x if cond is true and exp2 otherwise. Each of the predicated definitions makes an explicit use of the variable immediately to its left in the argument list of the original ψ-function from Figure 17.10(a). We can see that a renaming of the variables a, b, c and x into a single representative name will still compute the same value for the variable x. Note that this transformation can only be performed on normalized ψ-functions, since the definition of an argument must be dominated by the definition of the argument immediately at its left in the argument list of the ψ-function, and the same predicate must be used on the definition of an argument and with this argument in the ψ operation.

Fig. 17.10 ψ-functions and C conditional operations equivalence.
(a) ψ-SSA form: a = op1; p? b = op2; q? c = op3; x = ψ(a, p?b, q?c)
(b) conditional form: a = op1; b = p ? op2 : a; c = q ? op3 : b; x = c

Given this equivalence for the representation of a ψ-function, we now give a definition of the point of use for the arguments of a ψ-function.

Definition 3 (use points). Let a0 = ψ(p1?a1, . . ., pi?ai, . . ., pn?an) be a normalized ψ-function. For i < n, the point of use of argument ai occurs at the operation that defines ai+1. The point of use for the last argument an occurs at the ψ-function itself.

Given this definition of the point of use of ψ-function arguments, and using the usual point of use of φ-function arguments, a traditional liveness analysis can be run. Then an interference graph can be built to collect the interferences between variables involved in ψ- or φ-functions. For the construction of the interference graph, an interference between two variables that are defined on disjoint predicates can be ignored. Two ψ-webs interfere if at least one variable in the first ψ-web interferes with at least one variable in the other one.
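For liveness, the use points of Definition 3 can be enumerated directly from a normalized ψ-function. The helper below is an illustrative sketch; definition_of is assumed to return the operation defining a variable.

def psi_use_points(psi, ctx):
    # Yield (variable, operation) pairs: argument a_i is used at the operation that
    # defines a_{i+1}; the last argument is used at the psi-function itself.
    args = psi.args
    for i, (_, a_i) in enumerate(args):
        use_point = ctx.definition_of(args[i + 1][1]) if i + 1 < len(args) else psi
        yield a_i, use_point

A liveness pass then treats each yielded pair as an upward-exposed use at that operation, exactly as it does for φ-function arguments placed at the end of the corresponding predecessor blocks.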

Repairing interferences on ψ-functions. We now present an algorithm that resolves the interferences as it builds the ψ-webs. First, the ψ-webs are initialized with a single variable per ψ-web. Then, ψ-functions are processed one at a time, in no specific order, merging when non-interfering the ψ-webs of their operands together. The arguments of the ψ-function, say a0 = ψ(p1?a1, . . ., pi?ai, . . ., pn?an), are processed from right (an) to left (a1). If the ψ-web that contains ai does not interfere with the ψ-web that contains a0, they are merged together. Otherwise, repair code is needed: the current argument ai is replaced in the ψ-function by a new variable a′i, created and defined by the predicated copy pi? a′i = ai inserted just above the definition for ai+1, or just above the ψ-function in the case of the last argument. The interference graph is updated conservatively, by considering the set of variables ai interferes with: a variable u of the merged ψ-web is made to interfere with a′i if the definition of u dominates the definition of a′i (u is then live through that definition), or if the definition of a′i lies within the live-range of ai (see Chapter 9); otherwise it does not interfere with a′i. A pseudo-code of this algorithm is given in Algorithm 21.

Algorithm 21: ψ-web merging during the processing of a ψ-function a0 = ψ(p1?a1, . . ., pn?an)
 1 begin
 2   let psiWeb be the web containing a0;
 3   foreach ai in [an, an−1, . . ., a1] do
 4     let opndWeb be the web containing ai;
 5     if opndWeb ≠ psiWeb then
 6       if IGraph.interfere(psiWeb, opndWeb) then
 7         let a′i be a freshly created variable;
 8         let Ci be a new predicated copy pi? a′i ← ai;
 9         let op be the operation that defines ai+1, or the ψ operation;
10         while op is a ψ-function, replace op by the operation that defines its first argument;
11         append Ci right before op;
12         replace ai by a′i in the ψ-function;
13         foreach u in IGraph.interferenceSet(ai) do
14           if u ∈ psiWeb and (u.def.op dominates a′i.def.op or ai ∈ livein(u.def.op)) then
15             IGraph.addInterference(a′i, u);
16         opndWeb ← {a′i};
17       psiWeb ← psiWeb ∪ opndWeb;

Consider the code in Figure 17.11 to see how this algorithm works. At the beginning of the processing of the ψ-function x = ψ(p?a, q?b, r?c), the ψ-webs are the singletons {a}, {b}, {c}, {d} and {x}. The argument list is processed from right to left, starting with variable c. The live-range of variable c extends down to its use in the ψ-function that defines the variable x, but not further down. {c} does not interfere with {x}, so they can be merged together; psiWeb now becomes {x, c}. Then variable b is processed. The live-range of variable b extends down to its use in the definition of variable d. This live-range interferes with the variables c and x. Since it interferes with both x and c, repair code is needed: a new variable b′ is created as a predicated copy of b, initialized just below the definition of b, and b is replaced by b′ in the ψ-function; the interference graph is updated conservatively, resulting in psiWeb = {x, b′, c}. Then variable a is processed. The liveness rule on the ψ-function creates a live-range for variable a that extends down to the definition of b′; the variable a does not interfere with the variables b′, c or x.

As no interference is encountered when processing a, {a} is merged into psiWeb. The final code after SSA destruction is shown in Figure 17.11(c).

Fig. 17.11 Elimination of ψ live-interference: (a) before processing the ψ-function, (b) after processing the ψ-function, (c) after actual coalescing.

17.6 Additional reading

In this chapter we mainly described the ψ-SSA representation and we detailed specific transformations that can be performed thanks to this representation. More details on the implementation of the ψ-SSA algorithms, and figures on the benefits of this representation, can be found in [249] and [93].

There are also other SSA representations that can handle predicated instructions, one of which is the Predicated SSA representation [60]. This representation is targeted at very low-level optimizations to improve operation scheduling in the presence of predicated instructions. Another representation is the Gated SSA form, presented in Chapter 18.

We mentioned in this chapter that a number of classical SSA-based algorithms can be easily adapted to the ψ-SSA representation, usually by just adapting the rules on the φ-functions to the ψ-functions. Among these algorithms, we can mention the constant propagation algorithm described in [273], dead code elimination [194], global value numbering [75], partial redundancy elimination [71], and induction variable analysis [276], which have already been implemented into a ψ-SSA framework.

The ψ-SSA destruction algorithm presented in this chapter is inspired from the SSA destruction algorithm of Sreedhar et al. [242], which introduces repair code where needed as it grows φ-webs from φ-functions. The phi-web phase mentioned in this chapter to complete the ψ-SSA destruction algorithm can use exactly the same approach, by simply initializing the ψ-φ-webs with the ψ-webs.


. . An introduction to each graph will be given. . . representing the data dependencies in a program. . . . . . . Additionally. . . . The CFG models control flow. . . . . . . . discovering irreducibility and performing interval analysis techniques. . Therefore it comes as no surprise that a number of program graphs use SSA concepts as the core principle of their representation. . . . A cornucopia of program graphs have been presented in the literature and implemented in real compilers. . Traditionally. but many graphs model data flow. . . One of the seminal graph representations is the Control Flow Graph (CFG). . . . . . These range from those which are a very literal translation of SSA into a graph form to those which are more abstract in nature. . . . . . We will look at a number of different SSA-based graph representations.CHAPTER 18 Graphs and gating functions J. We aim to introduce a selection of program graphs which use SSA concepts. . which was introduced by Allen to explicitly represent possible control paths in a program. . . . . . . along with diagrams to show how sample programs look 225 . The graphs we consider in this chapter are all data flow graphs. . .1 Introduction Many compilers represent the input program as some form of graph in order to aid analysis and transformation. . . . Stanier Progress: 90% Text and figures formating in progress . . These range from very literal translations of SSA into graph form to more abstract graphs which are implicitly SSA. 18. representing a program in this way makes a number of operations simpler to perform. This is useful as a large number of compiler optimizations are based on data flow. . the CFG is used to convert a program into SSA form. such as identifying loops. and examine how they may be useful to a compiler writer.

. Thus.2 Data Flow Graphs The data flow graph (DFG) is a directed graph G = (V . . . . Notice that many different variations exist in the literature. . . . . Stanier ) when translated into that particular graph. . . . However.1 which is then translated into an SSA Graph. we must first demand the results of the operands and then perform the operation indicated on that vertex. . The DFG can be seen as a companion to the CFG.3 The SSA Graph We begin our exploration with a graph that is very similar to SSA: the SSA Graph. . However. . . . nor does the DFG contain whole program information. . . . . . and directed edges connect uses to definitions of values. . . . . the DFG has no such concept. . Whereas the CFG imposes a total ordering on instructions. 18. . . . . . . . An instruction executes once all of its input data values have been consumed. .226 18 Graphs and gating functions — (J. optimisations such as dead code elimination. Additionally. . . . . . Note that each node mixes up the operation and the variable(s) it defines. . . . . . . . . . . constant folding and common subexpression elimination can be performed effectively. . . . target code cannot be generated directly from the DFG. . . For us. . . We present some sample code in Figure 18. . . we will touch on the literature describing a usage of a given graph with the application that it was used for. The outgoing edges from a vertex represent the arguments required for that operation. . With access to both graphs. . . . and they can be generated alongside each other. . . an SSA Graph consists of vertices which represent operations (such as add and load) or φ functions. . . . . This graph is therefore a demand-based representation. The textual representation of SSA is much easier for a human to read. . In order to compute a vertex. the primary benefit of representing the input program in this form is that the compiler writer is able to apply a wide array of graph-based optimizations by . . . . . . . . . . . . . . . E ) where the edges E represent the flow of data from the result of one operation to the input of another. . . and the ingoing edge(s) to a vertex represents the propagation of that operation’s result(s) after it has been computed. . . . . . . . . Reversing the direction of each edge would provide the widely used data-based representation. . When an instruction executes it produces a new data value which is propagated to other connected instructions. keeping both graphs updated during optimisation can be costly and complicated. . . . The SSA Graph can be constructed from a program in SSA form by explicitly adding use-definition chains. . . . 18. . . . as actual data structures might be able to find one from the other. .

using standard graph traversal and transformation techniques.

Fig. 18.1 Some SSA code translated into a (demand-based) SSA graph. The SSA code of the figure is: begin: a0 = 0; i0 = 0. loop: a1 = φ(a0, a2); i1 = φ(i0, i2); if i1 > 100 goto end; a2 = a1 * i1; if a2 > 20 goto end; i2 = i1 + 1; goto loop. end: a3 = φ(a1, a2); i3 = φ(i1, i2); print(a3 + i3).

It is possible to augment the SSA Graph to model memory dependencies. This can be achieved by adding state edges that enforce an order of interpretation. These edges are extensively used in the Value State Dependence Graph, which we will look at later, after we touch on the concept of gating functions. In the literature, the SSA Graph has been used to detect a variety of induction variables in loops, for performing instruction selection techniques, operator strength reduction and rematerialization, and it has been combined with an extended SSA language to aid compilation in a parallelizing compiler.
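As a concrete illustration of the demand-based form, an SSA graph can be held as little more than operation vertices carrying use-to-definition edges. The structure below is an illustrative sketch rather than any particular compiler's representation, and the evaluator only makes sense on acyclic parts of the graph; cycles created by loop φ-functions need the gating functions discussed later in this chapter.

from dataclasses import dataclass, field

@dataclass
class SSANode:
    op: str                                          # e.g. "const", "add", "mul", "phi"
    operands: list = field(default_factory=list)     # use -> def edges (other SSANodes)

def demand(node, evaluate, memo=None):
    # Demand-driven interpretation: first demand the operands, then apply the operator.
    memo = {} if memo is None else memo
    if id(node) not in memo:
        values = [demand(d, evaluate, memo) for d in node.operands]
        memo[id(node)] = evaluate(node.op, values)
    return memo[id(node)]

# e.g. building the node for a2 = a1 * i1 of Figure 18.1, with a1 and i1 already built:
# a2 = SSANode("mul", [a1, i1])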

. . . . . . .. Scans and reductions also show a similar SSA Graph pattern and can be detected using the same approach. . . . . .3. . such as wrap-around variables. . . . . as the strict . . SCCs can be easily discovered in linear time using any depth first search traversal. Given that a program is represented as an SSA Graph. . . . A simple IV recognition algorithm is based on the observation that each basic linear induction variable will belong to a non-trivial strongly connected component (SCC) in the SSA graph. .228 18 Graphs and gating functions — (J. . . . . A more sophisticated technique is developed in Chapter 11. . . . • The other operand of each addition or subtraction is loop invariant.1 Finding induction variables with the SSA Graph We illustrate the usefulness of the SSA Graph through a basic induction variable (IV) recognition technique. . • The SCC contains only addition and subtraction operators. . endloop where k is a constant or loop invariant.. . . . .. .. The PDG was developed to aid optimizations requiring reordering of instructions and graph rewriting for parallelism. 18. . . . the task of finding induction variables is simplified.4 Program Dependence Graph The Program Dependence Graph (PDG) represents both control and data dependencies together in one graph. . . The reader should note that the exact specification of what constitutes an SSA Graph changes from paper to paper. . . . . . . . . as each author tends to make small modifications for their particular implementation. A basic linear induction variable i is a variable that appears only in the form: i = 10 loop . This technique can be expanded to detect a variety of other classes of induction variables. Each such SCC must conform to the following constraints: • The SCC contains only one φ -function at the header of the loop. 18. Stanier ) SSA language to aid compilation in a parallelizing compiler. . i = i + k . . The essence of the intermediate representation (IR) has been presented here. non-linear induction variables and nested induction variables. and the right operand of the subtraction is not part of the SCC (no i=n-i assignments). .

E ) where nodes V are statements. let S consist of all edges (A . Predicate nodes test a conditional statement and have true and false edges to represent the choice taken on evaluation of the predicate. The PDG is a directed graph G = (V . control dependence between statements and predicate nodes are computed. Thus. B ) in the CFG such that B is not an ancestor of A in the postdominator tree. the postdominator tree is traversed backwards from B until we reach A ’s parent. v ) if w post-dominates v and w does not strictly post-dominate u . Loops in the PDG are represented by back edges in the control dependence subgraph. predicate expressions or region nodes. and circular nodes are region nodes. if a region node has three different control-independent statements as immediate children. through which control flow enters and exits the program respectively. A node w is said to be control dependent on node u along CFG edge (u . Dashed edges represent control dependence. Then. Construction of the PDG is tackled in two steps from the CFG: construction of the control dependence subgraph and construction of the data dependence subgraph. Next. For this purpose. and solid edges represent data dependence. A slight difference in its construction process is that the corresponding control dependence edges from u to w is labelled by the boolean value taken by the predicate computed in u when branching on edge (u . v ) or simply control dependent on edge (u . then region nodes are added. rectangular nodes represent statements. First. Similar to the CFG. i. v ). Control dependence between nodes is nothing else than post-dominance frontier. Given (A . for . Region nodes group control dependencies with identical source and label together.. To begin with. an unpruned PDG is created by checking. Region nodes are also inserted so that predicate nodes will only have two successors. the ENTRY node is added with one edge labelled true pointing to the CFG entry node. Each of these edges has an associated label true or false. and another labelled false going to the CFG exit node.e. diamond nodes predicates. then these could potentially be executed in parallel. it is assumed that every node of the CFG is reachable from the entry node and can reach the exit node.18. and edges E represent either control or data dependencies. Statement nodes represent instructions in the program. If the control dependence for a region node is satisfied. Then. then it follows that all of its children can be executed. B ). Diagrammatically. the set of all edges E has two distinct subsets: the control dependence subgraph E C and the data dependence subgraph E D . Each region node summarizes a set of control conditions and “groups” all nodes with the same set of control conditions together. a PDG also has two nodes ENTRY and EXIT. marking all nodes visited (including B ) as control dependent on A with the label of S .2.4 Program Dependence Graph 229 ordering of the CFG is relaxed and complemented by the presence of data dependence information. region nodes are added to the PDG. We show an example code translated into a PDG in Figure 18. each edge in S is considered in turn. dominance frontier on the reverse CFG. Thus. the postdominator tree is constructed for the procedure. To compute the control dependence subgraph. Then. The construction of the control dependence subgraph falls itself into two steps.

Nodes associated with the evolution of the induction variable i are omitted. and using a hash table to map sets of control dependencies to region nodes. if a > 20 goto end. the intersection I N T of C D is computed for each immediate child of N in the postdominator tree.2 F R3 A[i]=a Some code translated into a PDG. If I N T = C D then the corresponding dependencies are deleted from the child and replaced with a single dependence on the child’s control predecessor. i = i + 1. If none exists. Next. the corresponding edges . a new region node R is created with these control dependencies and entered into the hash table. loop: if i > 100 goto end. A[i] = a.230 18 Graphs and gating functions — (J. which control region it depends on. 18. For each node N visited in the postdominator tree. end: return a. the hash table is checked for an existing region node with the same set C D of control dependencies. goto loop. a = 2 * B[i]. a pass over the graph is made to make sure that each predicate node has a unique successor for each truth value. This is done by traversing the postdominator tree in postorder. Fig. Stanier ) entry T a=2*B[i] (a>20)? R1 F R2 return a (i>100)? begin: i = 1. If more than one exists. each node of the CFG. R is made to be the only control dependence predecessor of N . Then.

In this context. that is not contained in an SCC of the PDG (considering both control and data dependence edges) can be vectorized. the statement a=2*B[i] cannot.4. The PDG’s structure has been exploited for generating code for vectorisation. As an example. On the other hand. (anti) B writes to a storage location previously accessed through a read by A . Memory access locations often vary with the iterations of the enclosing loops. any node of a CFG loop. 18. Sequential code can be derived from a PDG. The PDG can also be constructed during parsing. (output) A and B both have a write access to the same storage location. since the statement A[i]=a in the loop do not form an SCC in the PDG. A statement B that is to be executed after a statement A in the original sequential ordering of the program depends on A in the following situations: (flow) B reads to a storage location that was lastly accessed by A through a write.1 Detecting parallelism with the PDG The structure of the PDG allows for parallelism to be detected easily. a scheduler based on data dependencies will generally restrict its scope to hyper-blocks. and has also been used in order to perform accurate program slicing and testing.2. As an example. Memory accesses can not always be analyzed with enough precision. a conservative dependence is to be inserted also. . Side effects can also dictate the insertion of a dependence between A and B to force the sequential ordering of the final schedule. it can be vectorized provided array expansion of variable a .18. code transformations such as loop unrolling or if-conversion (see Chapter 21) that effectively change control dependencies into data dependencies can expose instruction level parallelism. Dependence analysis can take advantage of some abstract representations of the access function such as when the memory address can be represented as an affine function of the induction variables. the PDG can directly be used to detect parallelism. but also the dependence can be labelled with a distance vector as a function itself of the loop indices. a loop indexed by i . In the presence of a may-alias between two consecutive accesses. Not only does this enables a refinement of the alias information. but generating the minimal size CFG from a PDG turns out to be an NP-Complete problem. On a regular CFG representation. because of the circuit involving the test on a . would lead to a flow dependence of distance i + 1. then as a read at iteration 2i + 1. first as a write at iteration i .4 Program Dependence Graph 231 are replaced by a single edge to a freshly created region node that itself points to the successor nodes. In the example in Figure 18. that would access array A [i ] twice. However.

. has Vi n i t as the initial input value for the loop. . . . Vf i n a l ) has P as predicate and Vf i n a l as the definition reaching beyond the loop. . A φif function of the form φif (P. . end: return a. . . . . These gating functions are directly interpretable versions of φ -nodes. A φexit function of the form φexit (P. .232 18 Graphs and gating functions — (J. • The φentry function is inserted at loop headers to select the initial and loop carried values. . . if a > 20 goto end. A[i] = a. . . A φentry function of the form φentry (Vi n i t . . . . and Vi t e r as the iterative input. . Stanier ) . • The φexit function determines the value of a variable when a loop terminates. . . . . Gated Single Assignment form (GSA. Fig. we cannot directly interpret the SSA Graph. .3 A graph representation of our sample code in GSA form. By this logic. . . . and V1 and V2 as the values to be selected if the predicate evaluates to true or false respectively. . 18. . goto loop. and also reduces the complexity of performing code generation. . V1 . . 18. they cannot be directly interpreted. begin: i = 1. as they do not specify the condition which determines which of the variable definitions to choose. . . . . V2 ) has P as a predicate. . . . This can be read simply as if-then-else. We replace φ -functions at loop headers with φentry functions. . Being able to interpret our IR is a useful property as it gives the compiler writer more information when implementing optimizations. φ -functions are used to identify points where variable definitions converge. .5 Gating functions and GSA In SSA form. However. . . . . . . . i = i + 1. . sometimes called Gated SSA) is an extension of SSA with gating functions. loop: if i > 100 goto end. . Vi t e r ). . We usually distinguish the three following forms of gating functions: • The φif function explicitly represents the condition which determines which φ value to select. . and replace φ -nodes in the representation. a = 2 * B[i]. .

Here.. If the code is not already under SSA. a 2 ) are not evaluated before p is known 1 . (c) the DAG representation of the nested gated φif The more natural way uses a data flow analysis that computes.r is taken or when the path ¬p . a 2 .. This example shows several interesting points. the semantic of both the φexit and φif are strict in their gate: here a 1 or φexit (q .3 shows how our earlier code in Figure 18. else if (r ) then (. Finally. (b) the CDG (with region nodes ommited). computing the values of gates is highly related to the simplification of path expressions: in our running example a 2 should be selected when the path ¬p followed by q (denoted ¬p . . At the header of our sample loop. Of course. φif (q .2 translates into GSA form. for each program points and each variable. this representation of the program does not allow for an interpreter to decide whether an instruction with a side effect (such as A [i 1 ] = a 2 in our running example) has to be executed or not.. First. Figure 18..¬r is taken which simplifies to ¬p . for our nested if-then-else example. (¬p ∧ q )?a 2 ) instead.. a 1 ). a φif function that results from the nested if-then-else code of Figure 18. and if at a merge point of the CFG its predecessor basic blocks are reached by different variables.. a classical path compression technique 1 As opposed to the ψ-function described in Chapter 17 that would use a syntax such as a 3 = φ ((p ∧ ¬q )?a 1 . a 1 should be selected either when the path ¬p . Diverse approaches can be used to generate the correct nested φif or φexit gating functions. This set of paths is abstracted using a path expression. 18. a φ -function is inserted.) (a) A structured code.4 would be itself nested as a = φif (p . the corresponding path expressions are merged. the φ -function has been replaced by a φentry function which determine between the initial and iterative value of i .. the nested φexit functions selects the correct live-out version of a .. a 3 ).. Second.4 a 1 = .q ) is taken while a 1 should be selected when the path p is taken. If a unique variable reaches all the predecessor basic blocks.) else (a 3 = .5 Gating functions and GSA 233 It is easiest to understand these gating functions by means of an example. The gates of each operand is set to the path expression of its corresponding incoming edge.18. After the loop has finished executing.. Similarly. we can see the use of both φentry and φexit gating functions. if (p ) then if (q ) then (a 2 = .) Fig.) else (. its unique reaching definition and the associated set of reaching paths.

This relationship is illustrated by Figure 18. demanddriven substitution can be performed using GSA which only substitutes needed information. Stanier ) can be used to minimize the number of visited edges. These gating functions are important as the concept will form components of the Value State Dependence Graph later.4 in the simple case of a structured code. Instead. Traditionally.5 A program on which to perform symbolic analysis. one can derive the gates of a φif function from the control dependencies of its operands. By using gating functions it becomes possible to construct IRs based solely on data dependencies. Alternatively. backward.234 18 Graphs and gating functions — (J.1 Backwards symbolic analysis with GSA GSA is useful for performing symbolic analysis. as is done in the PDG. which can be complex and prone to human error. Consider the following program: Fig.5. Those variations concern the correct handling of loops in addition to the computation and representation of gates.1 else J = JMAX T: a s s e r t (J ≤ JMAX) If forwards substitution were to be used in order to determine whether the assertion is correct. R: JMAX = Expr S: if(P) then J = JMAX . many variants of GSA have been proposed. we can utilize gating functions along with a data flow graph for an effective way of representing whole program information using data flow information. This is also a more attractive proposition than generating and maintaining both a CFG and DFG. complete forward substitution is expensive and can result in a large quantity of unused information and complicated expressions. Forward propagation through this program results in statement T being a s s e r t ((i f P t he n Expr-1 e l s e Expr) ≤ Expr ). With the diversity of applications (see chapters 24 and 11). then the symbolic value of J must be discovered. 18. starting at the top of the program in statement R. However. symbolic analysis is performed by forward propagation of expressions through a program.5. One can observe the similarities with the φ -function placement algorithm described in Section 4. . making them good for analysis and transformation. GSA has seen a number of uses in the literature including analysis and transformations based on data flow. 18. There also exists a relationship between the control dependencies and the gates: from a code already under strict and conventional SSA form. These IRs are sparse in nature compared to the CFG. One approach has been to combine both of these into one representation.

. . . . . . . . Cycles are permitted but must satisfy various restrictions. J 2 ) T: a s s e r t ( J 3 ≤ J M AX 1 ) Using this backwards substitution technique. . . The corresponding VSDG (b) shows both value and state edges and a selection of nodes. This allows for skipping of any intermediate statements that do not affect variables in T.7 Substitution steps in backwards symbolic analysis. . . .1 else J 2 = J M AX 1 J 3 = φif (P. . non-trivial programs.6 Value State Dependence Graph The gating functions defined in the previous section were used in the development of a sparse data flow graph IR called the Value State Dependence Graph (VSDG). . . J 2 ) = φif (P. . In (a) we have the original C source for a recursive factorial function. demand-driven substitution. . . . . In non-trivial programs this can greatly reduce the number of redundant substitutions. . . J 3 = φif (P. An example VSDG is shown in Figure 18. J 1 . . In real. . . . .6 Figure 18. . . loop and merge nodes together with value and state dependency edges. J M AX 1 − 1. . . . and follow the SSA links of the variables from J 3 .5 in GSA form. J M AX 1 ) The backwards substitution then stops because enough information has been found. J 1 . . . . Thus the substitution steps are: Fig. avoiding the redundant substitution of J M AX 1 by E x p r . making symbolic analysis significantly cheaper. we start at statement T. . . . .8. . A VSDG represents a single procedure: this matches the classical CFG. . . 18. . .6 Value State Dependence Graph 235 thus the assert statement evaluates to true. .18. R: J M AX 1 = Expr S: if(P) then J 1 = J M AX 1 . . . . . . these expressions can get unnecessarily long and complicated. . . The VSDG is a directed graph consisting of operation nodes. 18. The program above has the following GSA form: Fig. 18. Using GSA instead allows for backwards.

state dependency edges (dashed lines). a γ-node. The labelling function associates each node with an operator. a conditional node. . State dependency (E S ) represents two things. Stanier ) public : fac() n STATE 1 1 eq sub fac() int fac(int n) { int result. and a return node in general must wait for an earlier loop to terminate even though there might be no value-dependency between the loop and the return node. 18. if(n == 1) result = n. Value dependency (E V ) indicates the flow of values between nodes. the first is essential sequential dependency required by the original program. return result.6. a const node. } call mul T F C γ return Ê (a) Fig. and the function entry and exit nodes. 18. else result = n * fac(n . e. E V . The VSDG corresponds to a reducible program.1 Definition of the VSDG A VSDG is a labelled directed graph G = (N . The second purpose is that state dependency edges can be added incrementally until the VSDG corresponds to a unique CFG. a given load instruction may be required to follow a given store instruction without being re-ordered. a call node. N 0 . e.236 18 Graphs and gating functions — (J.8 (b) A recursive factorial function.g. Such state dependency edges are called serializing edges. E S . there are no cycles in the VSDG except those mediated by θ (loop) nodes.1).g. value dependency edges E V ⊆ N × N . N ∞ ) consisting of nodes N (with unique entry node N 0 and exit node N ∞ ). whose VSDG illustrates the key graph components—value dependency edges (solid lines). state dependency edges E S ⊆ N × N . .
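The tuple structure of this definition can be mirrored almost directly in code. The sketch below is illustrative only, with node labels, ports and the well-formedness checks of a real implementation omitted.

from dataclasses import dataclass, field

@dataclass
class VSDG:
    nodes: set = field(default_factory=set)
    value_edges: set = field(default_factory=set)   # E_V, a subset of N x N
    state_edges: set = field(default_factory=set)   # E_S, including serializing edges
    entry: object = None                             # N_0
    exit: object = None                              # N_infinity

    def add_serializing_edge(self, n1, n2):
        # State edges can be added incrementally until the graph corresponds to a
        # unique CFG, e.g. to keep a given load ordered after a given store.
        self.state_edges.add((n1, n2))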

The majority of nodes in a VSDG generate a value based on some computation (add. otherwise F . 3 4 5 F<x> F<y> a) ÊÊif (P) ÊÊÊÊÊÊÊÊx = 2. T . else y = 5. and returns the value of T if C is true. We also see how two syntactically different programs can map to the same structure in the VSDG. r .6. Note that..3 γ-Nodes The γ-node is similar to the φif gating function in being dependent on a control predicate. with the effect that two separate γ-nodes with the same condition can be later combined into a tuple using a single test. Figure 18. F ) evaluates the condition dependency C . we use a pair of values (2-tuple) of values for the T and F ports. in implementation terms. n 2 ) where n 2 ∈ succ(n 1 ). 2 T<y> T<x> P { C Ɣ { x y Fig. y = 3.2 Nodes There are four main classes of VSDG nodes: value nodes (representing pure arithmetic). A γ-node γ(C . rather than the control-independent nature of SSA φ -functions.. and state nodes (side-effects).18. ÊÊÊÊelse ÊÊÊÊÊÊÊÊx = 4. n . { . which are treated as tuples). ÊÊÊÊif (P) y = 3. subtract. γ-nodes (conditionals). ÊÊÊÊÊÊÊ. y = 5. Here.9 illustrates two γ-nodes that can be combined in this way.6 Value State Dependence Graph 237 The VSDG is implicitly represented in SSA form: a given operator node. 18. We generally treat γ-nodes as single-valued nodes (contrast θ -nodes. are a special case).6. it is therefore useful to talk about the idea of an output port for n being allocated a specific register. which have no dependent nodes. 18.9 Two different code schemes (a) & (b) map to the same γ-node structure. will have zero or more E V -consumers using its value. b) ÊÊif (P) x = 2. to abbreviate the idea of r being used for each edge (n 1 . else x = 4. etc) applied to their dependent values (constant nodes. a single register can hold the produced value for consumption at all consumers. 18. Êθ -nodes (loops).
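Because a γ-node carries its own condition, it can be evaluated directly, unlike a plain φ-function. The following is a small illustrative evaluator for γ-nodes holding tuples of values on their T and F ports; the encoding is an assumption made for exposition, not the representation of a specific VSDG implementation.

def eval_gamma(cond_value, true_port, false_port):
    # gamma(C, T, F): returns the T operand(s) when C is true, otherwise F.
    # The ports are tuples so that several gamma-nodes guarded by the same
    # condition can be combined into a single test.
    return true_port if cond_value else false_port

# e.g. the combined gamma of Figure 18.9: (x, y) = eval_gamma(P, (2, 3), (4, 5))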

The θ -node directly implements pre-test loops (while. R . . 18. loop return ports R 1 .10 also shows a pair (2tuple) of values being used for I . I .4 θ -Nodes The θ -node models the iterative behaviour of loops. which both reduces the cost of optimization. It has five specific ports which represent dependencies at various stages of computation... When C is false. X k . Evaluating the X port triggers it to evaluate the I value (outputting the value on the L port). ++i) ÊÊÊ--j. .. A θ -node θ (C . post-test loops (do. While C evaluates to true. . As i is not used after the loop there is is no dependency on the i port of X. loop iteration ports L 1 . while condition value C holds true.. R ... . The θ -node corresponds to a merge of the φentry and φexit nodes in Gated SSA. i < 10. .10 An example showing a for loop.238 18 Graphs and gating functions — (J.6. . and increases the likelihood of two schematically-dissimilar loops being isomorphic in the VSDG. . . .. = j. I k . . L k . . loop-peeling). L . j = . .. When C evaluates to false computation ceases and the last internal value is returned through the X port. but it has two important benefits: (i) it exposes the first loop body iteration to optimization in post-test loops (cf. sets L to the current internal value and updates the internal value with the repeat value R . for). . .while. it returns the final internal value through the X port. A loop which updates k variables will have: a single condition port C . repeat. 10 i j j I R { 1 i 1 < { C θ X L i + - { j { i j Fig.. it evaluates the R value (which in this case also uses the θ -node’s L value). one for each loop-variant value. .. for(i = 0. modelling loop state with the notion of an internal value which may be updated on each iteration of the loop. R k . initialvalue ports I 1 . 0 . X ) sets its internal value to initial value I then. and (ii) it normalizes all loops to one loop structure. . Stanier ) 18. .until) are synthesised from a pre-test loop preceded by a duplicate of the loop body. . L . X . The example in Figure 18. At first this may seem to cause unnecessary duplication of code. and loop exit ports X 1 .
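The iterative behaviour of a θ-node can be summarized by a tiny evaluator over its ports. This is only an illustrative sketch, with the condition and repeat sub-graphs passed in as functions of the current internal (L) value.

def eval_theta(init_value, cond, repeat):
    # theta(C, I, R, L, X): set the internal value to I, then while C holds on the
    # current internal value, replace it by the R value; when C turns false, the
    # last internal value is returned through the X port.
    internal = init_value                  # I port
    while cond(internal):                  # C port, computed from the L value
        internal = repeat(internal)        # R port, may also use the L value
    return internal                        # X port

# For the loop of Figure 18.10, with the 2-tuple (i, j) as internal value and j0
# standing for the initial value of j:
# eval_theta((0, j0), lambda s: s[0] < 10, lambda s: (s[0] + 1, s[1] - 1))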

. . . . . . . and returns a list of results. . . . such placement of loop invariant nodes for later phases. and a second pass to delete the unmarked (i. For example. . .7 Further readings A compiler’s intermediate representation can be a graph. which returns at least one result (which will be a state value in the case of void functions). . It is safe because all nodes which are deleted are guaranteed never to be reachable from the return node. many optimisations become trivial. . . . 97]. 18. we suppose that the number of arguments to. Note also that the VSDG neither forces loop invariant code into nor out-of loop bodies. We can represent the control flow of a program as a Control Flow Graph (CFG) [6]. Thus. a function is smaller than the number of physical registers—further arguments can be passed via a stack as usual. Dead code generates VSDG nodes for which there is no value or state dependency path from the return node. . We can also represent programs as a type of Data Flow Graph (DFG) [96. .11). . . . a dead node is a node that is not postdominated by the exit node N ∞ . and many different graphs exist in the literature. A CFG is traditionally used to convert a program to SSA form [86]. . To perform dead node elimination. 18. . .6. and SSA can be represented in this way as an SSA Graph [80]. . . . the result of the function does not in any way depend on the results of the dead nodes. . . . .18. .6. . . . .. . This combines both dead code elimination and unreachable code elimination. consider dead node elimination (Figure 18. only two passes are required over the VSDG resulting in linear runtime complexity: one pass to identify all of the live nodes.e. but rather allows later phases to determine.6 Dead node elimination with the VSDG By representing a program as a VSDG. . and results from. . The call node takes both the name of the function to call and a list of arguments. by adding serializing edges. . We maintain the simplicity of the VSDG by imposing the restriction that all functions have one return node (the exit node N ∞ ). i. . . To ensure that function calls and definitions are able to be allocated registers easily.7 Further readings 239 18. or become dead after some other optimisation. . it is treated as a state node as the function body may read or update state. . Unreachable code generates VSDG nodes that are either dead. dead) nodes. .e. . . . . . . .5 State Nodes Loads and stores compute a value and state. . where straight-line instructions are contained within basic blocks and edges show where the flow of control may be transferred to once leaving that block. .

converts control dependencies into data dependencies. [11] presented a precursor of GSA for structured code. E V . Procedure DNE(G ) { 1: WalkAndMark(N ∞ . If conversion (see Chapter 21). 26]. [114] represents control and data dependencies in one graph. to detect equality of variables. rematerialization [50]. E S . } Procedure DeleteMarked(G ) { 1: foreach (node n ∈ N ) do 2: if n is unmarked then d e l e t e (n ). [203] as an intermediate stage in the construction of the Program Dependence Web IR. Section 18. φentry by µ. The PDG has been used for program slicing [204]. and widely for parallelization [112. } Fig. G ). N ∞ ) with zero or more dead nodes. 235. 110.240 18 Graphs and gating functions — (J. Gating functions can be used to create directly interpretable φ -functions. This chapter adopts their notations. the book of Darte et al. [90] provides a good overview of such representations. We showed an example of how the PDG directly exposes parallel code. 125]. It has also been used for performing instruction selection techniques [102. 232].11 Dead node elimination on the VSDG. The original usage of GSA was by Ballance et al. 18. Among others. m ) ∈ (E V ∪ E S )) do 4: WalkAndMark(m ). and φexit by η. Havlak [135] presented an algorithm for construction of a simpler version of GSA—Thinned GSA—which is constructed from a CFG in SSA form. These are used in Gated Single Assignment Form. Stanier ) Input: A VSDG G (N . } Procedure WalkAndMark(n . 2: mark n .e. 3: foreach (node m ∈ N ∧ (n . and has been combined with an extended SSA language to aid compilation in a parallelizing compiler [247]. testing [25]. The Program Dependence Graph (PDG) as defined by Ferrante et al. a φentry for the entry of a loop. The construction technique sketched in this chapter is developed in more detail in [261]. The example given of how to perform backwards demand-driven symbolic analysis using GSA has been borrowed from [262]. We choose to report the definition of Bilardi and Pingali [32]. and a φexit for its exit. Further GSA papers replaced φif by γ. Alpern et al. Their definition of control dependencies that turns out to be equivalent to post-dominance frontier leads to confusions at it uses a non standard definition of post-dominance. 2: DeleteMarked(G ). G ) { 1: if n is marked then finish. a φif for a if-then-else construction. Output: A VSDG with no dead nodes. i. To avoid the potential . An example was given that used the SSA Graph to detect a variety of induction variables in loops [277. operator strength reduction [80].4 mentions possible abstractions to represent data dependencies for dynamically allocated objects. GSA has been used for a number of analyses and transformations based on data flow.

Further study has taken place on the problem of generating code from the VSDG [263. data dependencies and state to model a program.7 Further readings 241 loss of information related to the lowering of φ -functions into conditional moves or select instructions. It uses the concept of gating functions. 244]. We then described the Value State Dependence Graph (VSDG) [143]. . the Value Dependence Graph [275]. Detailed semantics of the VSDG are available [143]. as well as semantics of a related IR: the Gated Data Dependence Graph [264]. 168. gating ψ-functions (see Chapter 17) can be used. which is an improvement on a previous.18. unmentioned graph. and it has also been used to perform a combined register allocation and code motion algorithm [142]. We gave an example of how to perform dead node elimination on the VSDG.


Part IV: Machine code generation and optimization (Progress: 80%)


Note: no performance results; try for a maximum of 10 pages per SSA input file.


• Instruction scheduling and software pipelining. Ten years later.CHAPTER 19 SSA form and code generation B. laying out data objects in sections and composing the stack frames. code generation techniques have significantly evolved. the 1986 edition of the “Compilers Principles. and produce an assembly source or relocatable file with debugging information as result. Register allocation and stack frame building. or FORTRAN. and Tools” Dragon Book by Aho et al. • Branch optimizations and basic block alignment. Techniques. Peephole optimizations. and producing assembly source or object code. In current releases of compilers such as the Open64 or GCC. Dupont de Dinechin Progress: 55% Review in progress In a compiler for imperative languages like C. the code generator covers the set of code transformations and optimizations that operate on a program representation close to the target processor ISA. allocating variable live ranges to architectural registers. Control-flow (dominators. Precisely. the 1997 textbook “Advanced Compiler Design & Implementation” by Muchnich extends code generation with the following: • Loop unrolling and basic block replication. In these compilers. code generator optimizations include: 247 . loops) and data-flow (variable liveness) analyses. lists the tasks of code generation as: • • • • Instruction selection and calling conventions lowering. scheduling instructions to exploit micro-architecture. The main duties of code generation are: lowering the program intermediate representation to the target processor instructions and calling conventions. as they are mainly responsible for exploiting the performance-oriented features of processor architectures and microarchitectures. C++.

or even the stack pointer. . . . . or instruction modifiers. along with operand naming constraints. . . . . . Instructions access values and modify the processor state through operands. but are required to connect the flow of values between operations. such as the processor status register. including pre-fetching and pre-loading. from implicit operands. Exploitation of hardware looping or static branch prediction hints. and has a set of clobbered registers. Machine instructions may also have multiple target operands.1. Explicit operands correspond to allocatable architectural registers. . . For instance. .1. . using the SSA form on machine code raises issues. . Implicit operands correspond to the registers used for passing arguments and returning results at function call sites. which are not apparent in the instruction behavior. . . . . . • • • • • • If-conversion using SELECT. . . a number of which we list in Section 19.3. The compiler view of operations also involves indirect operands. . . . . . 19. Yet basic SSA form code cleanups such as constant propagation and sign extension removal need to know what is actually computed by machine instructions. . c ) may either compute c ← a − b or c ← b − a or any such expression with permuted operands. without any encoding bits. Memory hierarchy optimizations. the procedure link register. that may interfere with instruction scheduling. Matching fixed-point arithmetic and SIMD idioms by special instructions. and may also be used for the registers encoded in register mask immediates. immediate values. Dupont de Dinechin) This increasing sophistication of compiler code generation motivates the introduction of the SSA form in order to simplify analyses and optimizations. b . . We distinguish explicit operands. . there is no straightforward mapping between machine instruction and their operational semantics. It is seen by the compiler as an operator applied to a list of operands (explicit & implicit). or predicated. In this chapter. . a subtract with operands (a . Use of specialized addressing modes such as auto-modified and modulo. . An operation is an instance of an instruction that compose a program. . . . conditional move.1 Representation of instruction semantics Unlike IR operators. . . . . . instructions. An instruction is a member of the processor Instruction Set Architecture (ISA). . However. . .2.1 Issues with the SSA form on machine code 19.248 19 SSA form and code generation — (B. . VLIW instruction bundling. . we adopt the following terminology. . which are associated with a specific bit-field in the instruction encoding. . Some options for engineering a code generator with the SSA form as a first class representation are presented in Section 19. . . . . The suitability of the SSA form for main code generation phases is discussed in Section 19. Implicit operands correspond to single instance architectural registers and to registers implicily used by some instructions. . . .
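This kind of semantic information needs to be associated with each machine instruction of the code generator's internal representation. One concrete shape for it, given purely as an illustrative sketch (the field names and the subtract description are assumptions, not an actual ISA description), is a per-instruction record carrying operator and operand properties together with a small operator tree per target operand.

from dataclasses import dataclass, field

@dataclass
class OperandDesc:
    name: str
    properties: set = field(default_factory=set)     # e.g. {"isLeft"}, {"isBase"}, {"isOffset"}

@dataclass
class InstrDesc:
    opcode: str
    properties: set                                    # e.g. {"isSub"}, {"isAdd", "isCommutative"}
    sources: list                                      # list of OperandDesc
    semantics: dict                                    # target operand -> expression tree over sources

# A subtract whose encoding actually computes c <- b - a; the expression tree
# removes the ambiguity that the opcode name alone would leave:
sub_desc = InstrDesc(
    opcode="sub",
    properties={"isSub"},
    sources=[OperandDesc("a", {"isRight"}), OperandDesc("b", {"isLeft"})],
    semantics={"c": ("sub", "b", "a")},
)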

There are at least two ways to address this issue. where an implicit conditional branch back to the loop header is taken whenever the program counter match some address. Alternate semantic combinators. besides variants of φ functions and parallel COPY instructions. ’isPredicated’. 19. etc. and to extract independently register allocatable operands from wide operands. or that a memory access is safe. An effective way to deal with these constraints in the SSA form is by inserting parallel COPY instruction that write to the constrained source operands or read from the constrained target operands of the instructions. ’isRigh’. For instance.19. etc. yet properties such as safety for control speculation. Typical operation properties include ’isAdd’.1. such as the long multiplies on the ARM. Most information can be statically tabulated by the instruction operator. The implied loop-back branch is also conveniently materialized by a pseudoinstruction. or modifiers of the instruction operator semantic combinator. can be refined by the context where the instruction appears. a technique used by the Open64 compiler. Typical operand properties include ’isLeft’. range propagation may ensure that an addition cannot overflow. • Add properties to the instruction operator and to its operands. Finally. that is. etc. ’isCommutative’. In such cases there is a need for pseudo-instructions to compose wide operands in register tuples. • A number of embedded processor architectures such as the Tensilica Xtensa provide hardware loops. or combined divisionmodulus instructions. ’isOffset’. code generation for some instruction set architectures require that pseudo-instructions with known semantics be available. • Associate a ’semantic combinator’. An issue related to the representation of instruction semantics is how to factor it. are common. to each target operand of a machine instruction. ’isLoad’.2 Operand naming constraints Implicit operands and indirect operands are constrained to specific architectural registers either by the instruction set architecture (ISA constraints). need to be associated with each machine instruction of the code generator internal representation. ’isBase’. The new SSA variables thus created are pre-colored with the . that a division by zero is impossible. This more ambitious alternative was implemented in the SML/NJ [171] compiler and the LAO compiler [91]. or more generally on register tuples. • Machine instructions that operate on register pairs. or being equivalent to a simple IR instruction. a tree of IR-like operators. Extended properties that involve the instruction operator and some of its operands include ’isAssociative’.1 Issues with the SSA form on machine code 249 such as memory accesses with auto-modified addressing. or by application binary interface (ABI constraints).

Explicit instruction operands may also be constrained to use the same register between a source and a target operand, or even to use different registers between two source operands (MUL instructions on the ARM); operand constraints between one source and the target operand are the general case on popular instruction set architectures such as the x86 and the ARM Thumb. In the setting of [40], these constraints are represented by inserting a COPY between the constrained source operand and a new variable, then using this new variable as the constrained source operand; the COPY is a parallel copy in case of multiple constrained source operands. The COPY operations are processed when going out of SSA form, where the parallel COPY operations are coalesced away or sequentialized [40].

A difficult case of ISA or ABI operand constraint is when a variable must be bound to a specific register at all points in the program. This is the case of the stack pointer, as interrupt handling may reuse the program run-time stack. One possibility is to inhibit the promotion of the stack pointer to a SSA variable. Stack pointer definitions, including memory allocations through alloca() and activation frame creation/destruction, are then encapsulated in a specific pseudo-instruction. Instructions that use the stack pointer must be treated as special cases as far as the SSA form analyses and optimizations are concerned.

19.1.3 Non-kill target operands

The SSA form assumes that variable definitions are kills. This is not the case for target operands such as a processor status register that contains several independent bit-fields. Moreover, some instruction effects on bit-fields may be 'sticky', that is, with an implied OR with the previous value. Typical sticky bits include exception flags of the IEEE 754 arithmetic, or the integer overflow flag on DSPs with fixed-point arithmetic. When mapping a processor status register to a SSA variable, any operation that partially reads or modifies the register bit-fields should appear as reading and writing the corresponding variable.

Predicated execution and conditional execution are the other main source of definitions that do not kill the target register. The Multiflow TRACE 500 architecture was to include predicated store and floating-point instructions [177]. We extend the classification of [182] and distinguish four classes of ISA:

Partial predicated execution support: SELECT instructions like those of the Multiflow TRACE architecture [78] are provided.

Full predicated execution support: Most instructions accept a Boolean predicate operand, which nullifies the instruction effects if the predicate evaluates to false. The execution of predicated instructions is guarded by the evaluation of a single-bit operand, whereas the execution of conditional instructions is guarded by the evaluation of a condition on a multi-bit operand. EPIC-style architectures also provide predicate define instructions (PDIs).
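Returning to the non-kill behaviour of predicated definitions discussed above, here is a minimal C rendering of what such a definition computes: the old value of the target flows into the result whenever the predicate is false, which is exactly why the target operand must also appear as a source once such code is put under SSA. The function name is only illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* A predicated addition "p ? x = a + b" does not kill x: when p is        */
/* false, x keeps its previous value.  Under SSA this is made explicit     */
/* by passing the old value of x as an extra source operand.               */
static int32_t predicated_add(int p, int32_t x_old, int32_t a, int32_t b) {
    return p ? a + b : x_old;
}

int main(void) {
    int32_t x = 10;
    x = predicated_add(0, x, 3, 4);   /* predicate false: x stays 10  */
    printf("%d\n", x);
    x = predicated_add(1, x, 3, 4);   /* predicate true:  x becomes 7 */
    printf("%d\n", x);
    return 0;
}
```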

PDIs efficiently evaluate predicates corresponding to nested conditions: Unconditional, Conditional, parallel-OR, and parallel-AND [126].

Partial conditional execution support: Conditional move (CMOV) instructions similar to those introduced in the Alpha AXP architecture [34] are provided. CMOV instructions are available in the ia32 ISA since the Pentium Pro.

Full conditional execution support: Most instructions are conditionally executed depending on the evaluation of a condition of a source operand. On the ARM architecture, the implicit source operand is a bit-field in the processor status register and the condition is encoded on 4 bits. On the VelociTI TMS320C6xxx architecture, the source operand is a general register encoded on 3 bits and the condition is encoded on 1 bit.

Observe however that under the SSA form, a CMOV is equivalent to a SELECT with a 'must be same register' naming constraint between one source and the target operand. Unlike other predicated or conditional instructions, the SELECT instructions kill the target register. Generalizing this observation provides a simple way to handle predicated or conditional execution in vanilla SSA form:

• For each target operand of the predicated or conditional instruction, add a corresponding source operand in the instruction signature.
• For each added source operand, add a 'must be same register' renaming constraint with the corresponding target operand.

This simple transformation enables SSA form analyses and optimizations to remain oblivious to predicated and conditional code. The drawback of this solution is that non-kill definitions of a given variable (before SSA renaming) remain in dominance order across transformations, as opposed to ψ-SSA, where predicate value analysis may enable relaxing this order. Extensions of the SSA form such as the ψ-SSA [249] presented in Chapter 17 specifically address the handling of operations that do not kill the target operand because of predicated or conditional execution.

19.2 SSA form in a code generator

19.2.1 Control tree flavor

• Control tree and memory dependences and user hints
• Control structures: Bourdoncle, Havlak
• Choice to use DJ graph or not
• Loop nesting forest for abstract interpretation (under SSA)
• Mismatch with the IR control tree

“Efficiently Computing Static Single Assignment Form and the Control Dependence Graph” [ TOPLAS 13(4) 1991] – insert COPY for the arguments of Φ-functions in the predecessors.252 19 SSA form and code generation — (B.2. • Variable interference. “Practical Improvements to the Construction and Destruction of Static Single Assignment Form” [SPE 28(8) 1998] – identifies ’Lost Copy’ and ’Swap’ problems – fix incorrect behavior of Cytron et al. Dupont de Dinechin) 19. liveness checking (Chapter 9). after register allocation Sreedhar. intersection 19.2. during.2.4 SSA form construction • • • • When is SSA form constructed Copy folding or not Inherit from byte-code of from IR Conventional SSA is assumed for inputs of phases. then rely on the register allocator coalescing • Briggs et al.5 SSA form destruction • • • • Before. 19.3 Variable lineness and interference • liveness sets.2. [ TOPLAS 13(4) 1991] when critical edges are not split . SSI flavor or not SSA on stack slots for register allocation 19.2 SSA form variant • • • • • • Mixes with non-SSA form or not Choice to use multiplex or not Choices of phi function semantics Choice to keep conventional or not. Boissinot Parallel copy sequentialization Pseudo-kill insertion if no regalloc under SSA • Cytron et al.

“Translating Out of Static Single Assignment Form” [SAS’99] (US patent 6182284): – Method 1 inserts COPY for the arguments of Φ-functions in the predecessors and for the Φ-functions targets in the current block. then applies a new SSA-based coalescing algorithm – Method 3 maintains liveness and interference graph to insert COPY that will not be removed by the new SSA-based coalescing algorithm – the new SSA-based coalescing algorithm is more effective than register allocation coalescing • Leung & George “Static Single Assignment Form for Machine Code” [PLDI’99] – handles the operand constraints of machine-level SSA form – builds on the algorithm by Briggs et al.19. x 3 . [SAS 1999] • a Φ-congruence class is the closure of the Φ-connected relation • liveness under SSA form: Φ arguments are live-out of predecessor blocks and Φ targets are live-in of Φ block • SSA form is conventional if no two members of a Φ-congruence class interfere under this liveness • correct SSA destruction is the removal of Φ-functions from a conventional SSA form • after SSA construction (without COPY propagation). x 3 . [SPE 28(8) 1998] • Rastello et al. then partition into non-interfering classes – introduce the “dominance forest” data-structure to avoid quadratic number of interference tests – critical edge splitting is required • Sreedhar et al. y 1 . [SAS 1999] Liveness and Congruence Example • variables x 1 . y 2 . the SSA form is conventional . several interferences inside the congruence class Insights of Sreedhar et al. [SAS’99] (and avoid patent) Sreedhar et al. “Optimizing Translation Out of SSA using Renaming Constraints” [CGO’04] (STMicroelectronics) – fix bugs and generalize Leung & George [PLDI’99] – generalize Sreedhar et al.2 SSA form in a code generator 253 • Budimli´ c et al. “Fast Copy Coalescing and Live-Range Identification” [PLDI’02] – lightweight SSA destruction motivated by JIT compilation – use the SSA form dominance of definitions over uses to avoid explicit interference graph – construct SSA-webs with early pruning of interfering variables. y 3 are in the same congruence class • in this example.
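The notes above mention parallel copy sequentialization and the 'Lost Copy' and 'Swap' problems. The self-contained C sketch below shows one simple way a parallel copy can be sequentialized, breaking register cycles with a temporary; it is only an illustration of the idea, not the algorithm of any of the papers cited above.

```c
#include <stdio.h>

/* A parallel copy dst[i] <- src[i]: all reads happen before any write.    */
/* Sequentialization: emit a copy whose destination is no longer needed    */
/* as a source; when only cycles remain, break one with a temporary        */
/* register (here register 99).                                            */
#define N 3
#define TEMP 99

int main(void) {
    int dst[N] = {1, 2, 3};   /* r1 <- r2, r2 <- r3, r3 <- r1 : a cycle    */
    int src[N] = {2, 3, 1};
    int done[N] = {0};
    int remaining = N;

    while (remaining > 0) {
        int progress = 0;
        for (int i = 0; i < N; i++) {
            if (done[i]) continue;
            int needed = 0;           /* is dst[i] still a pending source? */
            for (int j = 0; j < N; j++)
                if (!done[j] && j != i && src[j] == dst[i]) needed = 1;
            if (!needed) {
                printf("r%d = r%d\n", dst[i], src[i]);
                done[i] = 1; remaining--; progress = 1;
            }
        }
        if (!progress) {              /* only cycles left: break one       */
            for (int i = 0; i < N; i++) {
                if (!done[i]) {
                    printf("r%d = r%d\n", TEMP, dst[i]);
                    for (int j = 0; j < N; j++)   /* redirect pending reads */
                        if (!done[j] && src[j] == dst[i]) src[j] = TEMP;
                    break;
                }
            }
        }
    }
    return 0;
}
```

On the cyclic input above this prints the familiar swap-safe sequence r99 = r1; r1 = r2; r2 = r3; r3 = r99.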

. . . Dupont de Dinechin) • Methods 1 – 3 restore a conventional SSA form • the new SSA-based coalescing is able to coalesce interfering variables. any if-conversion technique targeted to full predicated or conditional execution support may be adapted to partial predicated or conditional execution support.254 19 SSA form and code generation — (B. . . . . The basic idea is to replace conditional branches by straight-line computations that directly use the condition as a source operand. architectural support for if-conversion can be improved by supporting speculative execution. . in particular those resulting from control-flow constructs. . . so the SSA form analyses and optimizations no longer apply. Speculative execution (control speculation) refers to executing an operation before knowing that its execution is required. . . . . . . In principle. . non-predicated instructions with side-effects such as memory accesses can be used in combination with SELECT to provide a harmless effective address in case the operation must be nullified [182]. . 19. . Unlike classic techniques that match one IR tree or one DAG at a time. . . 19. . . .3. . . . program variables are mapped to architectural registers or to memory locations. After register allocation. . . .3. . . . as long as the SSA form remains conventional • • • • Region-based or not Register allocate under SSA or not Reassociation under SSA Coalescing aggressiveness before instruction scheduling (dedicated etc. Besides predicated execution. . .2 If-conversion If-conversion refers to optimization techniques that remove conditional branches from a program region. down to register allocation. . . . . . . . . .1 Instruction selection Instruction selection (Chapter 20). . . such as when moving code above a branch [177] or promot- . The scope and effectiveness of if-conversion depends on ISA support for predicated execution [182]. . . 19. . For instance. using the SSA form as input extends the scope of pattern matching to more complex IR graphs. .3 Code generation phases and the SSA form The applicability of the SSA form only spans the early phases of the code generation process: from instruction selection.) . .

The algorithm is general enough to process cyclic and irreducible rooted flow graphs. even though seminal techniques [8. Consequently. The proposed algorithm is tailored to acyclic regions with single entry and multiple exits. The result of an if-converted region is typically a hyperblock. The main improvement of this algorithm over [209] is that it also speculates instructions up the dominance tree through predicate promotion. so speculating potentially excepting instructions requires architectural support. control-flow regions selected for if-conversion are acyclic. They assume a fully predicated architecture wit only Conditional PDIs. If-conversion is primarily motivated by instruction scheduling on instructionlevel parallel processors [182]. • Leupers [173] focuses on if-conversion of nested if-then-else (ITE) statements on architectures with full conditional execution support. non-trapping memory loads known as ’dismissible’ are provided. 209] consider more general control-flow.19. The IMPACT EPIC architecture speculative execution [18] is generalized from the ’sentinel’ model [181]. a set of predicated basic blocks in which control may only enter from the top. Speculative execution assumes instructions have reversible side effects. • Lowney et al. and as such is able to compute R and K functions without relying on explicit control dependences.3 Code generation phases and the SSA form 255 ing predicates [182]. • Blickstein et al. A dynamic programming technique appropriately selects either a conditional jump or a conditional instruction based implementation scheme for each ITE statement. which is often single-issue. The R function assigns a minimal set of Boolean predicates to basic blocks. • reduces uses of the branch unit. and the objective is the reduction of worst-case execution time (WCET). as removing conditional branches: • eliminates branch resolution stalls in the instruction pipeline. On the Multiflow TRACE 300 architecture and later on the Lx VLIW architecture [108]. but it practice it is applied to single entry acyclic regions. [34] pioneer the use of CMOV instructions to replace conditional branches in the GEM compilers for the Alpha AXP architecture. that is. This work further proposes a pre-optimization pass to hoist or sink common sub-expressions before predication and speculation. • increases the size of the instruction scheduling regions. • Fang [106] assumes a fully predicated architecture with Conditional PDIs. A few contributions to if-conversion only use the SSA form internally: . and the K function express the way these predicates are computed. In case of inner loop bodies. if-conversion further enables vectorization [8] and software pipelining (modulo scheduling) [209]. [177] match the innermost if-then constructs in the Multiflow compiler in order to generate the SELECT and the predicated memory store operations. except for stores and PDIs. Classic contributions to if-conversion dot not consider the SSA form: • Park & Schlansker [209] propose the RK algorithm based the control dependences. but may exit from one or more locations [184].

Dupont de Dinechin) • Jacome et al. [73] introduce a predicated execution support aimed at removing non-kill register writes from the processor micro-architecture. The proposed framework reduces acyclic control-flow constructs from innermost to outermost. adapted to the Unconditional PDIs and the ORP instructions. in particular predicated move operations. The first idea of the SSA-PS transformation is to realize the conditional assignments corresponding to φ -functions via predicated switching operations. The source operands of the φ -functions are replaced by so-called φ -lists. predicated memory accesses. • Bruel [52] targets VLIW architectures with SELECT and dismissible load instructions. and the monitoring of the if-conversion benefits provides the stopping criterion. • Chuang et al. The SSA-PS transformation predicates non-move operations and is apparently restricted to innermost if-then-else statements. which also accepts ψ-SSA form as input. [139] propose the Static Single Assignment – Predicated Switching (SSA-PS) transformation aimed at clustered VLIW architectures. • Stoutchinin & Gao [250] propose an if-conversion technique based on the predication of Fang [106] and the replacement of φ -functions by ψ-functions. and performs tail duplication. The φ -lists are processed by topological order of the operand predicates to generate the ’phi-ops’. a correctness condition re-discovered later by Chuang et al. The technique is implemented in the Open64 for the ia64 architecture. The ψfunctions arguments are paired with predicates and are ordered in dominance order. The other contribution from this work is the generation of ’phi-ops’. reduces height of predicate computations. where each operand is associated with the predicate of its source basic block. [73] for their phi-ops. and ORP instructions for OR-ing multiple predicates. is described in Chapter 21. This work also details how to transform the ψ-SSA form to conventional ψ-SSA . by formulating simple correctness conditions for the predicate promotion of operations that do not have side-effects. whose insertion points are computed like the SSA form placement of the φ -functions. • Ferrière [93] extends the ψ-SSA form algorithms of [249] to the partial predicated execution support. The core technique control speculates operations. Recent contributions to if-conversion input SSA form and output ψ-SSA form: • Stoutchinin & Ferrière [249] introduce ψ-functions in order to bring predicated code under the SSA form for a fully predicated architecture. The second idea is that the predicated external moves leverage the penalties associated with inter-cluster data transfers. They propose SELECT instructions called ’phi-ops’. A restriction of the RK algorithm to single-entry single-exit regions is proposed.256 19 SSA form and code generation — (B. Unconditional PDIs. A generalization of this framework. with predicated move instructions that operate inside clusters (internal moves) or between clusters (external moves). It can also generate ψ-functions instead of SELECT operations. They prove the conversion is correct provided the SSA form is conventional.

and dead code elimination.19. Memory packing optimizations select wider memory access instructions whenever the effective addresses are provably adjacent. 19. The main issues that remain for the selection of an if-conversion technique are the match with the architectural support. such as unroll-and-jam. Induction variable classification (Chapter 11). the SSA version of the self assignment technique also pioneered by the Multiflow Trace Scheduling compiler [177]. Loop unwinding replicates the loop body while keeping the loop exit branches. and requires that pre-conditioning or post-conditioning code be inserted [177]. A self-contained explanation of these techniques appears in Chapter 17.4 Memory addressing and memory packing Memory addressing optimizations include selection of auto-modified addressing modes. with the added benefit that already predicated code can be part of input. SIMD idioms.3 Inner loop optimizations Non-inner loop transformations. copy folding.3 Code generation phases and the SSA form 257 by generating CMOV operations.5 Code cleanups and instruction re-selection Constant propagation. and the approach to if-conversion region formation. Inner loop unrolling or unwinding is facilitated by using the loop-closed SSA form (Chapter 11). A main motivation for inner loop unrolling is to unlock modulo scheduling benefits [167].3. Loop unrolling replicates the loop body and removes all but one of the replicated loop exit branches. and no side effects such as possible address misalignment traps are introduced.3. Uses bit width analysis. 19. Required because analyses and code specialization .3. Thanks to these two latter contributions. This style of loop unrolling necessarily applies to counted loops. Bit-width analysis. GVN or LICM or PRE Instruction re-selection. virtually all if-conversion techniques formulated without the SSA form can be adapted to the ψ-SSA form. are usually left to the IR loop optimizer. 19.

On a classic code generator.) So scheduling instructions with SSA variables as operands is not effective unless extra dependences are added to the scheduling problem to prevent such code motion. and an instruction scheduler is expected to do so. Scheduling OR-type predicate define operations [233] raises the same issues.6 Pre-pass instruction scheduling Further down the code generator. are limited to true data dependences on operands. An instruction scheduler is also expected . Keeping those φ -functions in the scheduling problem has no benefits and raises engineering issues. super-block scheduling [138]. If the only ordering constraints between instructions. besides control dependences and memory dependences. • Instruction scheduling must account for all the instruction issue slots required to execute a code region. or trace scheduling [177]. the classic scheduling regions are single-entry and do not have control-flow merge. all ’sticky’ bit-field definitions can be reordered with regards to the next use.1. the next major phase is pre-pass instruction scheduling. • Some machine instructions have side effects on special resources such as the processor status register. For instance. Dupont de Dinechin) 19. Other reasons why it seems there is little to gain to schedule instructions on a SSA form program representation include: • Except in case of trace scheduling which pre-dates the use of SSA form in production compilers. code motion will create interferences that must later be resolved by inserting COPY operations in the scheduled code region. pre-pass instruction scheduling operates before register allocation. hyper-block scheduling [184]. (Except for interferences created by the overlapping of live ranges that results from modulo scheduling. except for instructions with ISA or ABI constraints that binds them to specific architectural registers. Representing these resources as SSA variables even though they are not operated like the processor architectural registers requires coarsening the instruction effects to the whole resource.3. instruction operands are mostly virtual registers. So there are no φ -functions in case of acyclic scheduling. acyclic instruction scheduling techniques apply: basic block scheduling [128].3. tree region scheduling [134]. as discussed in Section 19. In turn this implies def-use variable ordering that prevents aggressive instruction scheduling. and only φ -functions in the loop header in case of software pipelining.258 19 SSA form and code generation — (B. preparation to pre-pass instruction scheduling include virtual register renaming. due to their parallel execution semantics and the constraint to keep them first in basic blocks. super-block or hyperblock body are candidates for software pipelining techniques such as modulo scheduling. also known as register web construction. By definition. and for other program regions. Moreover. Innermost loops with a single basic block. in order to reduce the number of anti dependences and output dependences in the instruction scheduling problem. For innermost loops that are not software pipelined. as these are resolved by modulo renaming.

To summarize.19.7 Register allocation The last major phase of code generation where SSA form has demonstrated benefits is register allocation and its three sub-problems: variable spilling. A first example is ’move renaming’ [283]. If required by the register allocation. where the dependence between additive induction variables and their use as base in base+offset addressing modes is relaxed to the extent permitted by the induction step and the range of the offset. the dynamic switching of the definition of a source operand defined by a COPY operation when the consumer operations ends up being scheduled at the same cycle or earlier. variable coloring. Another example is ’inductive relaxation’ [92]. trying to keep the SSA form inside the pre-pass instruction scheduling currently appears more complex than operating on the program representation with classic compiler temporary variables. These techniques apply to acyclic scheduling and to modulo scheduling. sufficiency . [104] [147] Issue with aliased registers. the SSA form should be re-constructed. • Aggressive instruction scheduling relaxes some flow data dependences that are normally implied by SSA variable def-use ordering. saturation. Pre conditioning to reduce MaxLive.3 Code generation phases and the SSA form 259 to precisely track accesses to unrelated of partially overlapping bit-fields in a processor status register. This representation is obtained from the SSA form after aggressive coalescing and SSA destruction. and variable coalescing (Chapter 23).3. Issue with predicated or conditional instructions. 19.

.

. . . . . . . . . . . . The instruction code selector seeks for a cost-minimal cover of the DFT. . . . . . . Each of the selected 261 . . . .code selection pattern matching data flow tree. . Krall B. The unit of translation for tree pattern matching is expressed as a tree structure that is called a data flow tree (DFT). . . .1 Introduction Instruction code selection is a transformation step in a compiler that translates a machine-independent intermediate code representation into a low-level intermediate representation or to machine code for a specific target architecture. 20. generator tools have been designed and implemented that generate the instruction code selector based on a specification of the machine description of the target. . The Intermediate Representation (IR) of an input program is passed on to an optional lowering phase that breaks down instructions and performs other machine dependent transformations. . . A possible scenario of a code generator in a compiler is depicted in Figure 20. . . One of the widely used techniques in code generation is tree pattern matching. . . The basic idea is to describe the target instruction set using an ambiguous cost-annotated graph grammar. . . . . . . . the instruction selection performs the mapping to machine code or lowered IR based on the machine description of the target architecture. . . . . . Thereafter. . . DFT grammar CHAPTER 20 Instruction Code Selection D. Scholz Progress: 95% Final reading in progress . . This approach is used in large compiler infrastructures such as GCC or LLVM that target a range of architectures. . . . Instead of hand-crafting an instruction selector for each target architecture. .1. Ebner A.

The . SHL(reg. Scholz) Machine Description IR Lowering (optional) Instruction Code Selection MachineDependent Backend Target Program Fig. A. and LD. Rules that translate from one non-terminal to another are called chain rules. Ebner. imm) ADD(reg.262 symbol. ADD. Krall. chain Code Generator 20 Instruction Code Selection — (D. either by constructing a new intermediate representation or by rewriting the DFT bottom-up. e. reg)) LD(ADD(reg. The terminals of the grammar are VAR. and terminal symbols (shown in upper-case). 20. Note that there are multiple possibilities to obtain a cover of the dataflow tree for the example shown in Figure 20. terminal rule. Non-terminal s denotes a distinguished start symbol for the root node of the DFT. Each rule has associated costs.2. rules have an associated semantic action that is used to emit the corresponding machine instructions.1 Scenario: An instruction code selector translates a compiler’s IR to a low-level machinedependent representation. SHL. 20. reg ← imm that translates an immediate value to a register. An example of a DFT along with a set of rules representing valid ARM instructions is shown in Figure 20.2. CST. non-terminal symbol. VAR:a rule R1 R2 R3 R4 R5 R6 R7 R8 R9 R 10 VAR:i CST:2 s reg imm reg reg reg reg reg reg reg ← ← ← ← ← ← ← ← ← ← reg imm CST VAR SHL(reg. reg) SHL(reg. B. Terminal symbols match the corresponding labels of nodes in the dataflow trees. imm))) cost 0 1 0 0 1 1 1 1 1 1 SHL ADD LD Fig. Each rule consists of non-terminals (shown in lower-case)..2 Example of a data flow tree and a rule fragment with associated costs. Non-terminals are used to chain individual rules together.g. reg) LD(reg) LD(ADD(reg.

some operations have different semantics. Third. To get a handle on instruction code selection for SSA graphs. we will discuss in the following an approach based on a reduction to a quadratic mathematical programming problem (PBQP). The example in Figure 20. R 3 . e. To accommodate for fixed-point values. SSA graphs capture acyclic and cyclic information flow beyond basic block boundaries. the DFT could be covered by rule R 2 . Alternatively. output or anti-dependencies in SSA graph do not exist.. Consider the code fragment of a dot-product routine and the corresponding SSA graph shown in Figure 20. For example. Second. However.20.1 Introduction 263 def-use chains fixed-point arithmetic cost of a tree cover is the sum of the costs of the selected rules. Nodes in the SSA graph represent a single operation while edges describe the flow of data that is produced at the source node and consumed at the target node. we add the following rules to the grammar introduced in Figure 20. R 4 . For fixed-point values most arithmetic and bit-wise operations are identical to their integer equivalents. Instead of using a textual SSA representations.i results in a value with 2i fractional digits. R 7 . In the figure the color of the nodes indicates to which basic block the operations belong to. and R 8 which yields four cost units for the cover for issuing four assembly instructions.2: 1 We consider its data-based representation here. SSA graphs are a suitable representation for code generation: First.multiplying two fixed-point values in format m . The result of the multiplication has to be adjusted by a shift operation to the right (LSR). we employ a graph representation of SSA called the SSA graph 1 that is an extension of DFTs and represents the data-flow for scalar variables of a procedure in SSA form. Tree pattern matching on a DFT is limited to the scope of tree structures. The code implements a simple vector dot-product using fixed-point arithmetic. A dynamic programming algorithm selects a cost optimal cover for the DFT. R 5 . As even acyclic SSA graphs are in the general case not restricted to a tree. and R 10 which would give a total cost for the cover of one cost unit.g. no dynamic programming approach can be employed for instruction code selection. See Chapter 18 .3. we can extend the scope of the matching algorithm to the computational flow of a whole procedure. To overcome this limitation. The use of the SSA form as an intermediate representation improves the code generation by making def-use relationships explicit. the DFT could be covered by rules R 3 . SSA exposes the data flow of a translation unit and utilizes the code generation process. Hence. SSA graphs often arise naturally in modern compilers as the intermediate code representation usually already is in SSA form.3 has fixed-point computations that need to be modeled in the grammar. Incoming edges have an order which reflects the argument order of the operation.

However. fp_ *q ) { fp_ s = 0. as a tree-pattern matcher generates code at statement-level. is able to propagate non-terminal fp2 across the φ node prior to the return and emits the code for the shift to the right in the return block. fp_ stands for unsigned short fixed point type. Rd. . Scholz) rule reg fp fp2 fp fp fp2 fp fp2 In the example the accumulation for double-precision fixed point values (fp2) can be performed at the same cost as for the single-precision format (fp). Rm. . Ebner. fp) 1 MUL fp2 1 LSR ADD ADD(fp. ← ← ← ← ← ← ← ← VAR is_fixed_point ? ∞ otherwise 0 VAR is_fixed_point ? 0 otherwise ∞ MUL(fp.. ∗e = p + N . Rd.) 0 cost instruction Rd.3 Instruction code selection SSA Graph for a vector dot-product in fixed-point arithmetic. Rd. Krall.. . ADD < ADD MUL φ CST:2N LD LD CST:0 φ φ VAR:p INC VAR:q INC Fig. 20. Rs i Rs Rs fp_ dot_product(fp_ *p . Thus.) 0 PHI(fp2. Rm. ret φ } } return s . An instruction code selector that is operating on the SSA graph.264 20 Instruction Code Selection — (D.. performing the intermediate calculations in the extended format. B. the information of having values as double-precision cannot be hoisted across basic block boundaries. Rm. A. Rm. q = q + 1. fp) 1 ADD(fp2.. it would be beneficial to move the necessary shift from the inner loop to the return block. fp2) 1 ADD PHI(fp. p = p + 1. while (p < e ) { s = s + (∗p ) ∗ (∗q ).
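The fixed-point adjustment mentioned above can be spelled out in C as follows. The 16-fractional-bit format and the shift amount are assumptions made for this sketch, not the exact convention of the dot-product code.

```c
#include <stdint.h>
#include <stdio.h>

/* Unsigned fixed-point value with 16 fractional bits (an assumption for   */
/* this sketch).  Multiplying two such values yields a product with 32     */
/* fractional bits, so the double-precision result must be shifted right   */
/* (LSR) to return to the single-precision format.                         */
typedef uint16_t fp16;

static fp16 fp_mul(fp16 a, fp16 b) {
    uint32_t wide = (uint32_t)a * (uint32_t)b;   /* double-precision product */
    return (fp16)(wide >> 16);                   /* adjust back to 16 fractional bits */
}

int main(void) {
    fp16 half = 0x8000;                               /* 0.5 in this format */
    printf("0.5 * 0.5 = %#x\n", fp_mul(half, half));  /* prints 0x4000 = 0.25 */
    return 0;
}
```

Keeping the accumulation in the wider format and performing a single adjustment outside the loop is precisely the opportunity that a matcher working on the whole SSA graph can exploit across basic-block boundaries.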

In the following, we discuss the instruction code selection problem by employing a discrete optimization problem called the partitioned boolean quadratic problem. First, we will explain how to perform instruction code selection on SSA graphs with the means of a specialized quadratic assignment problem (PBQP): we will introduce the PBQP problem and then describe the mapping of the instruction code selection problem to PBQP. An extension of patterns to arbitrary acyclic graph structures, which we refer to as DAG grammars, is discussed in Sub-Section 20.3.1.

20.2 Instruction Code Selection for Tree Patterns on SSA-Graphs

The matching problem for SSA graphs reduces to a discrete optimization problem called Partitioned Boolean Quadratic Problem (PBQP).

20.2.1 Partitioned Boolean Quadratic Problem

Partitioned Boolean Quadratic Programming (PBQP) is a generalized quadratic assignment problem that has proven to be effective for a wide range of applications in embedded code generation, e.g., register assignment, address mode selection, or bank selection for architectures with partitioned memory. PBQP is flexible enough to model irregularities of embedded architectures that are hard to cope with using traditional heuristic approaches. Instead of problem-specific algorithms, these problems can be modeled in terms of generic PBQPs that are solved using a common solver library.

Consider a set of discrete variables X = {x_1, ..., x_n} and their finite domains {D_1, ..., D_n}. A solution of PBQP is a mapping h of each variable to an element in its domain, i.e., an element of D_i needs to be chosen for variable x_i. The chosen element imposes local costs and related costs with neighboring variables. Hence, the quality of a solution is based on the contribution of two sets of terms:

1. For assigning variable x_i to the element d_i in D_i, the quality of the assignment is measured by a local cost function c(x_i, d_i).
2. For assigning two related variables x_i and x_j to the elements d_i ∈ D_i and d_j ∈ D_j, we measure the quality of the assignment with a related cost function C(x_i, x_j, d_i, d_j).

The total cost of a solution h is given as

f = ∑_{1≤i≤n} c(x_i, h(x_i)) + ∑_{1≤i<j≤n} C(x_i, x_j, h(x_i), h(x_j)).    (20.1)
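As a concrete reading of equation (20.1), the following small C program evaluates the objective f for a toy instance with two variables; the cost vectors and the matrix contain made-up numbers chosen only to show how local and related costs combine. An actual solver of course does not enumerate all assignments; the enumeration is only there to make the cost model tangible.

```c
#include <stdio.h>

/* Toy PBQP instance: two variables x1, x2, each with a domain of two      */
/* elements {0, 1}.  c1/c2 are the local cost vectors, C12 the related     */
/* cost matrix.  The numbers are illustrative only.                        */
static const int c1[2]     = {0, 2};
static const int c2[2]     = {1, 0};
static const int C12[2][2] = {{0, 3},
                              {1, 0}};

/* Evaluate f = c1(h1) + c2(h2) + C12(h1, h2), cf. equation (20.1). */
static int objective(int h1, int h2) {
    return c1[h1] + c2[h2] + C12[h1][h2];
}

int main(void) {
    int best_cost = -1, best_h1 = 0, best_h2 = 0;
    for (int h1 = 0; h1 < 2; h1++)
        for (int h2 = 0; h2 < 2; h2++) {
            int f = objective(h1, h2);
            printf("h = (%d, %d)  f = %d\n", h1, h2, f);
            if (best_cost < 0 || f < best_cost) {
                best_cost = f; best_h1 = h1; best_h2 = h2;
            }
        }
    printf("minimal solution: h = (%d, %d) with cost %d\n",
           best_h1, best_h2, best_cost);
    return 0;
}
```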

branch-&-bound techniques can be applied.many of the cost matrices i . E .2 Instruction Code Selection with PBQP In the following. x j ). i. . d j ) is decomposed for each pair (x i . j are zero matrices and do not contribute to the overall solution. This means that the number of decision vectors and the number of cost matrices in the PBQP are determined by the number of nodes and edges in the SSA graph respectively. As in traditional tree pattern matching. x j . In the following we represent both the local cost function and the related cost function in matrix form. Ebner. A. i. j that is not the zero matrix.. In general. we describe the modeling of instruction code selection for SSA graphs as a PBQP problem. d i . 20. A matrix element corresponds to an assignment (d i .. For each decision variable x i we have a corresponding node v i ∈ V in the graph. c ). optimal or near-optimal solutions can often be found within reasonable time limits.e. The local costs reflect the costs of the rules and the related costs reflect the costs of chain rules making rules compatible with each other.. for many practical cases. We will present an example later in this chapter in the context of instruction code selection. the solution may not be optimal. and for each cost matrix i . Simi− → larly. i. which we refer to as a PBQP graph. Input grammars have to be normalized. The cost functions c and C map nodes and edges to the original cost vectors and matrices respectively. For general PBQPs. B. v j ). the PBQP instances are sparse. however. C .2. Scholz) The PBQP problem seeks for an assignment of variables x i with minimum total costs. However. The costs for the pair are represented as | i |-by-| j | matrix/table i j . Thus. .266 20 Instruction Code Selection — (D.e. the local cost function c (x i . The sizes of i depend on the number of rules in the grammar. . d j ). an ambiguous graph grammar consisting of tree patterns with associated costs and semantic actions is used. where n is the number of discrete variables and m is the maximal number of elements in their domains. i. In the basic modeling. . we introduce an edge e = (v i . A solution for the PBQP instance induces a complete cost minimal cover of the SSA graph. Krall. Currently. | n |). To obtain still an optimal solution outside the subclass.e. the related cost function C (x i .. This means that each rule is either a . finding a solution to this minimization problem is NP hard. d i ) is represented by a cost vector c i enumerating the costs of the elements. the algorithm produces provably optimal solutions in time (n m 3 ). there are two algorithmic approaches for PBQP that have been proven to be efficient in practice for instruction code selection problems.e.a polynomial-time heuristic algorithm and a branch-&-bound based algorithm with exponential worst case complexity. The variables x i of the PBQP are decision variables reflecting the choices of applicable rules (represented by i ) for the corresponding node of x i . A PBQP problem has an underlying graph structure graph G = (V . For a certain subclass of PBQP .m = max (| 1 |. SSA and PBQP graphs coincide.

consider the grammar in Figure 20. If the non-terminal nt and nt’ are identical. There are three possible cases: 1..2 Instruction Code Selection for Tree Patterns on SSA-Graphs 267 rule. .an operation represented by a node in the SSA graph. Weight w u is used as a parameter to optimize for various objectives including speed (e.. In our example. For each node u in the SSA graph.e. ntk p ) where nti are non-terminals and OP is a terminal symbol. If the non-terminals nt and nt differ and there exists a rule r : nt ← nt in the transitive closure of all chain rules. A base rule is a production of the form nt0 ← OP(nt1 . . Temporary non-terminal symbols t1. . e. R 6 ) that can be used to cover the shift operation SHL in our example. i. . An edge in the SSA graph represents data transfer between the result of an operation u . 0〉. . . the corresponding element in matrix u v is zero. since the result of u is compatible with the operand of node v . nt . 3. . nt .4. An element in the matrix corresponds to the costs of selecting a specific base rule ru ∈ R u of the result and a specific base rule rv ∈ R v of the operand node. A chain-rule is a production of the form nt0 ← nt1 . To illustrate this transformation. ) and rv is · · · ← OP(α.there are three rules (R 4 . and t3 are used to decompose larger tree patterns into simple base rules. This transformation can be iteratively applied until all production rules are either chain rules or base rules. − → The cost vector c u = w u · 〈c (R 1 ).w v · c (r ). production so-called base rule or a chain rule. both R 4 and R 5 have associated costs of one. the w u is set to one). which is the source of the edge. OP2 (β ). Assume that ru is nt ← OP(. a PBQP variable x u is introduced. we impose costs dependent on the selection of variable x u and variable x v in the form of a cost matrix u v .e.20. β and γ denote arbitrary pattern fragments.. c (R k u )〉 of variable x u encodes the local costs for a particular assignment where c (R i ) denotes the associated cost of base rule R i .g. To ensure consistency among base rules and to account for the costs of chain rules. Rule R 6 contributes no local costs as we account for the full costs of a complex tree pattern at the root node. 2. A production rule nt ← OP1 (α. The domain of variable x u is determined by the subset of base rules whose terminal symbol matches the operation of the SSA node. thus the cost vector for the SHL node is 〈1.g. and the operand v which is the tail of the edge. the corresponding element in u v has the costs of the chain rule. The instruction code selection problem for SSA graphs is modeled in PBQP as follows. base rule. β ) where nt is the non-terminal of operand v whose value is obtained from the result of node u . i. . Each base rule spans across a single node in the SSA graph. . which is a normalized version of the tree grammar introduced in Figure 20. where nt0 and nt1 are non-terminals. γ) and nt ← OP2 (β ) where nt is a new non-terminal symbol and α. 1. Otherwise. The last rule is the result of automatic normalization of a more complex tree pattern. . γ)) can be normalized by rewriting the rule into two production rules nt ← OP1 (α. w u is the expected execution frequency of the operation at node u ) and space (e.2. t2.g. the corresponding element in u v has infinite costs prohibiting the selection of incompatible base rules. All nodes have the same weight of one. R 5 . .

of which the first one expects non-terminal reg as its second argument. Ebner.4 PBQP instance derived from the example shown in Figure 20. and R 6 are applicable for the shift. Base rules R 4 . The execution of the semantic action rules inside a basic block follow the order of basic blocks. SHL(reg. Krall. ADD(reg. ADD(reg. 20. SHL(reg. A. ADD(reg. Scholz) R2 0 R1 0 VAR:a ( ) VAR:i ( ) CST:2 ( ) reg reg 0 reg 0 reg 0 imm reg 1 imm 0 imm 0 SHL reg reg 0 ( R4 1 R5 1 R6 0 ) reg 0 reg 0 reg reg t1 rule R1 R2 R3 R4 R5 R6 R7 R8 R9 R 10 R 11 R 12 cost 0 0 1 1 1 0 1 0 0 1 1 1 reg 0 0 ∞ t1 ∞ ∞ 0 reg 0 0 ∞ imm reg reg reg reg t1 reg t2 t3 reg reg reg ← ← ← ← ← ← ← ← ← ← ← ← CST VAR imm SHL(reg.2. B. The grammar has been normalized by introducing additional non-terminals.268 R2 0 20 Instruction Code Selection — (D. A solution of the PBQP directly induces a selection of base and chain rules for the SSA graph. 1) and is zero otherwise. Highlighted elements in Figure 20. rules R 5 and R 6 both expect imm.4. Special care is necessary for chain . the corresponding cost matrix accounts for the costs of converting from reg to imm at index (1. There is a single base rule R 1 with local costs 0 and result non-terminal imm for the constant. Consequently. LDW(reg) LDW(t2) LDW(t3) ADD ( R7 1 R8 0 R9 0 ) reg) imm) imm) reg) t1) reg) reg t2 t3 reg 0 ∞ ∞ R 10 1 t2 ∞ 0 ∞ R 11 1 t3 ∞ ∞ 0 LDW ( R 12 1 ) Fig. consider the edge from CST:2 to node SHL in Figure 20. As an example.4 show a cost-minimal solution of the PBQP with costs one. R 5 .

. . the load and the increment of both p and q can be done in a single instruction benefiting both code size and performance. . and r point to mutually distinct memory locations. .3. However. . . . . On such a machine.or post-process these patterns heuristically often missing much of the optimization potential. especially in situations where patterns are automatically derived from generic architecture descriptions. . the IRC instruction of the HPPA architecture. . autoincrement. . q . Assuming that p . . . We will now outline. with the only restriction that their arguments have to be the same. . . a numbering of chosen nodes. These architecture-specific tweaks also complicate re-targeting. . . the graph induced by SSA edges becomes acyclic. . . it produces multiple results and spans across two non-adjacent nodes in the SSA graph. Consider the introductory example shown in Figure 20. e. . .there is no support for machine instructions with multiple results. . . the idea is to express in the modeling of the problem.3 Extensions and Generalizations 20. If we select all three instances of the post-increment store pattern concurrently. a possible problem formulation for these generalized patterns in the PBQP framework discussed so far.the DIVU instruction in the Motorola 68K architecture performs the division and the modulo operation for the same pair of inputs.. .g. .20. Similar examples can be found in most architectures. . . . Other examples are the RMS (read-modify-store) instructions on the IA32/AMD64 architecture. . and the code cannot be emitted. Many architectures have some form of auto-increment addressing modes.g. through the example in Figure 20.1 Instruction Code Selection for DAG Patterns In the previous section we have introduced an approach based on code patterns that resemble simple tree fragments. . The code fragment contains three feasible instances of a post-increment store pattern. To overcome this difficulty. . Instead. . . This restriction often complicates code generators for modern CPUs with specialized instructions and SIMD extensions. . . . . that reflects the existence of a topological order. . Such chain rules may be placed inefficiently and a placement algorithm [232] is required for some grammars. 20. . . or fsincos instructions of various math libraries. . . ..and decrement addressing modes of several embedded systems architectures.3. Compiler writers are forced to pre.3 Extensions and Generalizations 269 rules that link data flow across basic blocks. e. . . . . . post-increment loads cannot be modeled using a single tree-shaped pattern. .5. . . there are no further dependencies apart from the edges shown in the SSA graph. . .

Rules for the VAR nodes are omitted for simplicity. They encode whether a particular instance is chosen or not. i. VAR: p INC ST ST VAR: q Fig. Scholz) ∗p = r + 1. Continuing our example. 20. the PBQP for the SSA graph introduced in Figure 20. In addition to the post-increment store pattern with costs three. . each of these new states can be seen as a proxy for the whole set of instances of (possibly different) complex productions including the node.2 : reg ← INC(reg) For our modeling.the postincrement store pattern P1 : tmt ← ST(x :reg.e. The local costs are set to the combined costs for the particular pattern for the on state and to 0 for the off state. e.1 : stmt ← ST(reg.5 is shown in Figure 20.. Thus. We can safely squash all identical base rules obtained from this process into a single state. The local costs for these proxy states are set to 0. DAG patterns are decomposed into simple base rules for the purpose of modeling. ∗q = p + 1. A.5 INC ST INC VAR: r DAG patterns may introduce cyclic data dependencies. There are three instances of the post-increment store pattern (surrounded by boxes) in the example shown in Figure 20. Modeling The first step is to explicitly enumerates instances of complex patterns. Ebner. reg) P1. Krall. i.6.concrete tuples of nodes that match the terminal symbols specified in a particular production. As for tree patterns. reg).e. new variables are created for each enumerated instance of a complex production. reg ← INC(x ) : 3 is decomposed into the individual pattern fragments P1.g. the domain of existing nodes is augmented with the base rule fragments obtained from the decomposition of complex patterns. Furthermore. we assume regular tree patterns for the store and the increment nodes with costs two denoted by P2 and P3 respectively..the domain basically consists of the elements on and off. B.270 20 Instruction Code Selection — (D. ∗r = q + 1.5..

We use k as a shorthand for the term 3 − 2M .2 off on1 on2 on3 0 0 ∞ 0 ∞ 0 ∞ 0 off on1 on2 on3 P2 P1. mapped to on2 ).20..1 (2 M ) 4: INC P3 P1.g. Their domain is defined by the simple base rule with costs two and the proxy state obtained from the decomposition of the post-increment store pattern. we refine the state on such that it reflects a particular index in a concrete topological order.2 P2 P1. 1〉 off on1 on2 on3 off on1 on2 on3 0 0 0 0 0 ∞ ∞ ∞ 0 0 ∞ ∞ 0 0 0 ∞ off on1 on2 on3 ( 0 k k k ) 8: instance 〈3.1 (2 M ) off on1 on2 on3 P3 P1.1 0 0 ∞ 0 ∞ 0 ∞ 0 P3 P1. Nodes 7.2 (2 M ) P3 P1.2 P2 P1.2 off on1 on2 on3 0 0 ∞ 0 ∞ 0 ∞ 0 3: ST P2 P1. Matrices among these nodes account for data dependencies.6 VAR: r PBQP Graph for the Example shown in Figure 20. 4〉 off on1 on2 on3 off on1 on2 on3 0 0 0 0 0 ∞ ∞ ∞ 0 0 ∞ ∞ 0 0 0 ∞ off on1 on2 on3 ( 0 k k k ) 9: instance 〈5. 20.1 0 0 0 0 5: ST P2 P1. To this end.2 (2 M ) 2: ST P2 P1.consider the matrix established among nodes 7 and 8. Assuming instance 7 is on at index 2 (i. As noted before. the only remaining choices for .2 (2 M ) VAR: q Fig. M is a large integer value. 6〉 off on1 on2 on3 P2 P1.e. and 9 correspond to the three instances identified for the post-increment store pattern.2 P2 P1.1 0 0 ∞ 0 ∞ 0 ∞ 0 P3 P1.2 P3 P1. e. Nodes 1 to 6 correspond to the nodes in the SSA graph.1 0 0 0 0 off on1 on2 on3 off on1 on2 on3 0 0 0 0 0 ∞ ∞ ∞ 0 0 ∞ ∞ 0 0 0 ∞ off on1 on2 on3 ( 0 k k k ) 7: instance 〈2.1 off on1 on2 on3 0 0 ∞ 0 ∞ 0 ∞ 0 P3 P1.3 Extensions and Generalizations 271 VAR: p 1: INC P3 P1. 8.1 (2 M ) 6: INC P3 P1.5.1 0 0 0 0 0 0 ∞ 0 ∞ 0 ∞ 0 P2 P1. we have to guarantee the existence of a topological order among the chosen nodes.

The complexity of the approach is hidden in the discrete optimization problem called PBQP . mapped to off) or to enable it at index 3 (i. . etc. . . . . which is similar to macro substitution techniques and ensures a correct but possibly non-optimal matching. as node 7 has to precede node 8. . it becomes less clear in which basic block.e. we subtract these costs from the cost vector of instances. . This would be a NP-complete problem. . that the heuristic solver of PBQP cannot guarantee to deliver a solution that satisfies all constraints modeled in terms of ∞ costs. These limitations do not apply to exact PBQP solvers such as the branch-&-bound algorithm. the penalties for the proxy states are effectively eliminated unless an invalid solution is selected. . . . . . A. . . this formulation allows for the trivial solution where all of the related variables encoding the selection of a complex pattern are set to off (accounting for 0 costs) even though the artificial proxy state has been selected.272 20 Instruction Code Selection — (D. mapped to on3 ). The whole flow of a function is taken into account rather a local scope. . . . . Scholz) instance 8 are not to use the pattern (i. For chain rules. Ebner. As we move from acyclic linear code regions to whole-functions. .e. . The interested reader is referred to the relevant literature [103. . . Cost matrices among nodes 1 to 6 do not differ from the basic approach discussed before and reflect the costs of converting the non-terminal symbols involved. . . . In exchange. the selected machine instructions should be emitted. Many aspects of the PBQP formulation presented in this chapter could not be covered in detail. . One way to work around this limitation is to include a small set of rules that cover each node individually and that can be used as a fallback rule in situations where no feasible solution has been obtained. . . . .) is provided. 20. . . It is also straight-forward to extend the heuristic algorithm with a backtracking scheme on RN reductions. Thus. . . Additional cost matrices are required to ensure that the corresponding proxy state is selected on all the variables forming a particular pattern instance (which can be modeled with combined costs of 0 or ∞ respectively). a polynomial-time algorithm based on generic network flows is introduced that . . . . . the obvious choices are often non-optimal. . . . B. 101] for an in-depth discussion. . . . . . . . However. which would of course also be exponential in the worst case.4 Summary and Further Reading Aggressive optimizations for the instruction code selection problem are enabled by the use of SSA graph. Krall. In [232]. . . Free PBQP libraries are available from the web-pages of the authors and a library is implemented as part of the LLVM [178] framework. We can overcome this problem by adding a large integer value M to the costs for all proxy states. . It should be noted that for general grammars and irreducible graphs. The move from basic tree-pattern matching [4] to SSA-based DAG matching is a relative small step as long as a PBQP library and some basic infrastructure (graph grammar translator.

.20.4 Summary and Further Reading 273 allows a more efficient placement of chain rules across basic block boundaries. This technique is orthogonal to the generalization to complex patterns.


• the schedule height have been reduced. assuming that all instructions have a one cycle latency. r = q ? r1 : r2 stands for a select instruction where r is assigned r1 if q is true. But the main benefit here is that it can be executed without branch disruption. instruction speculation 21 If-Conversion — (C. Bruel) Consider the simple example given in figure 21.1 Static schedule of a simple region on a 4 issue processor. 21. • the variables have been renamed. regardless of the test outcome. and r2 otherwise.276 select. In this figure. to seven cycles. implying a better exploitation of the available resources. .1. because instructions can be control speculated before the branch. From this introductory example. After if-conversion the execution path is reduced to six cycles with no branches. and assuming a very optimistic one cycle branch penalty. ALU1 ALU2 - ALU3 - ALU4 - q = test q ? br l 1 q = test q ? br l 1 x =b r =x ∗y l2 : r = r + 1 l1 : x =a y =2 r =x +y br l 2 x =b y =3 r =x ∗y y =3 - - - return r - - - l2 : r =r +1 return r l1 : x = a r =x +y br l 2 y =2 - - - (a) Control Flow ALU1 q = test x1 = b r1 = x 1 ∗ y 1 r = q ? r1 : r2 r =r +1 return r ALU2 y1 = 3 r2 = x 2 + y 2 - (b) with basic block ordering ALU3 x2 = a ALU4 y2 = 2 - (c) After if-conversion Fig. that represents the execution of an if-then-else-end statement on a 4-issue processor. and a merge pseudo-instruction have been introduced. we can observe that: • the two possible execution paths have been merged into a single execution path. the schedule height goes from six cycles in the most optimistic case. With standard basic-block ordering.

or in the STMicroelectronics ST231 processor.1 Architectural requirements The merge pseudo operations needs to be mapped to a conditional form of execution in the target’s architecture. 21. an instruction must not have any possible side effects. full speculation predication.d dismissible load operation in the Multiflow Trace series of computers. Given this. speculative loads can be very effective to fetching data earlier in the instruction stream. usually memory operations that are not easily speculated. r2 otherwise. In this figure. • (Control) speculative execution: The instruction is executed unconditionally. t (c) speculative using cmov Conditional execution using different models To be speculated.1. and then committed using conditional moves (cmov) or select instructions. to represent a select like operation. Still exploiting those properties on larger scale control flow regions requires a framework that we will develop further. p ? x = a +b p ? x = a ∗b (a) fully predicated Fig. because as any other long latency instructions. we also use the notation r = c ? op to represent the predicated execution of op if the predicate c is true. This is regrettable. Modern architectures provide architectural support to dismiss invalid address exceptions. Examples are the ldw. reducing stalls. Its semantic is identical to the gated φif -function presented in Chapter 18: r takes the value of r1 if c is true. r = p ? op if the predicate c is false.21. The main difference is that with a dismissible model. other instructions are speculated. instruction gated instruction $\phiif$-function Thanks to SSA. Memory operations are a major impediment to if-conversion. or hazards. but also the speculative load of the Intel IA64.2 we differentiate the three following models of conditional execution: • Fully predicated execution: Any instruction can be executed conditional on the value of a predicate operand. For instance a memory load must not trap because of an invalid address. the transformation to generate if-converted code seems natural locally. and register renaming was performed by SSA construction.1 Introduction 277 predication. Similarly. 21. the merging point is already materialized in the original control flow as a φ pseudo-instruction. we use the notation r = c ? : r1 : r2 . partial select. As illustrated by Figure 21.2 t1 = a + b t2 = a ∗ b x = p ? t1 : t2 (b) speculative using select x = a +b t = a ∗b x = cmov p . . • partially predicated execution: Only a subset of the ISA is predicated.

The main difference is that with a dismissible model, invalid memory access exceptions are not delivered, which can be problematic in embedded or kernel environments that rely on memory exceptions for correct behavior. A speculative model allows to catch the exception thanks to the token bit check instruction. Some architectures, such as the IA64, offer both speculative and predicated memory operations. Figure 21.3 shows examples of various forms of speculative memory operations. Stores can also be executed conditionally by speculating part of their address value, with additional constraints on the ordering of the memory operations due to possible aliasing between the two paths.

Fig. 21.3 Examples of speculated memory operations: (a) IA64 speculative load (t = ld.s(addr) ; chk.s ; p ? x = t), (b) ST231 dismissible load (t = ldw.d(addr) ; x = select p ? t : x), (c) base store hoisting (x = select p ? addr : dummy ; stw(x, value)), (d) index store hoisting (index = select p ? i : j ; stw(x[index], value)).

Note that the select instruction is an architectural instruction that does not need to be replaced during the SSA destruction phase. If the target architecture does not provide such a gated instruction, it can be emulated using two conditional moves, and the select instruction can still be used as an intermediate form; this translation can be done afterward. It allows the program to stay in full SSA form where all the data dependencies are made explicit, and can thus be fed to all SSA optimizers.

This chapter is organized as follows: we start by describing the SSA techniques that transform a CFG region in SSA form to produce an if-converted SSA representation using speculation. We then describe how this framework is extended to use predicated instructions, using the ψ-SSA form presented in Chapter 17. Finally we propose a global framework to pull together those techniques.

21.2 Basic Transformations

Global if-conversion approaches identify a control-flow region and if-convert it in one shot. As opposed to global approaches, the technique described in this chapter is based on incremental reductions, incrementally enlarging the scope of the if-converted region to its maximum beneficial size. To this end, we consider basic SSA transformations whose goal is to isolate a simple diamond-DAG structure (informally an if-then-else-end) that can be easily if-converted. The complete framework, which identifies and incrementally performs those transformations, will be described in Section 21.3.
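To fix ideas, the sketch below shows one way the "simple diamond-DAG" shape could be recognized on a toy CFG. The Block representation is purely hypothetical (it is reused by later sketches), and only the simplest shape, with both branches non-empty, is handled; the text's definition also allows degenerate forward paths.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)          # eq=False keeps identity hashing, so blocks can key dicts
class Block:
    name: str
    instructions: list = field(default_factory=list)
    phis: list = field(default_factory=list)   # each phi: (result, {pred_block: value})
    succs: list = field(default_factory=list)
    preds: list = field(default_factory=list)

def as_simple_diamond(head):
    """Return (head, then_bb, else_bb, exit_bb) if head starts an
    if-then-else-end diamond-DAG with no side entries or exits, else None."""
    if len(head.succs) != 2:
        return None
    then_bb, else_bb = head.succs
    for side in (then_bb, else_bb):
        if len(side.preds) != 1 or len(side.succs) != 1:
            return None                       # a side entry or side exit breaks the diamond
    exit_bb = then_bb.succs[0]
    if exit_bb is not else_bb.succs[0]:
        return None                           # the two forward paths must reconverge
    if sorted(b.name for b in exit_bb.preds) != sorted([then_bb.name, else_bb.name]):
        return None
    return head, then_bb, else_bb, exit_bb
```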

21.2.1 SSA operations on basic blocks

The basic transformation that actually if-converts the code is the φ removal, which takes a simple diamond-DAG as an input, i.e., a single-entry-node/single-exit-node (SESE) DAG with only two distinct forward paths from its entry node to its exit node. The φ removal consists in (1) speculating the code of both branches in the entry basic block (denoted head), (2) then replacing the φ-function by a select, and (3) finally simplifying the control flow to a single basic block. This transformation is illustrated by the example of Figure 21.4. One can notice the similarity with the nested arity-two φif-functions used for gated-SSA (see Chapter 18).

Fig. 21.4 φ removal: the code of both branches (r1 = g + 1 and r2 = x) is speculated into head and the φ-function r = φ(r1, r2) of exit is replaced by the select r = p ? r1 : r2.

The goal of the φ reduction transformation is to isolate a diamond-DAG from a structure that resembles a diamond-DAG but has side entries to its exit block. Nested if-then-else-end constructs in the original code can create such control flow. In the most general case, the join node B0 of the considered region has n predecessors, with φ-functions of the form B0 : r = φ(B1 : r1, B2 : r2, ..., Bn : rn). After the transformation, B1 and B2 point to a freshly created basic block, say B12, that itself points to B0; a new variable B12 : r12 = φ(B1 : r1, B2 : r2) is created in this new basic block, and the φ-function in B0 is replaced by B0 : r = φ(B12 : r12, B3 : r3, ..., Bn : rn). Removing the edges from B3, ..., Bn would then give a diamond-DAG, which can be reduced using the φ removal transformation. This is illustrated through the example of Figure 21.5.

The objective of path duplication is to get rid of all side-entry edges that prevent a single-exit-node region from being a diamond-DAG: through path duplication, all edges that point to a node different than the exit node or the entry node are "redirected" (via duplicated paths) to the exit node, and φ reduction can then be applied to the obtained region. More formally, consider two distinguished nodes, named head and the single exit node of the region, exit, such that there are exactly two different control-flow paths from head to exit, and consider (if it exists) the first node side_i on one of the forward paths head → side_0 → ... → side_p → exit that has at least two predecessors.
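Reusing the hypothetical Block representation of the earlier sketch, a minimal version of the φ removal step might look as follows; Select is simply an assumed pseudo-instruction record, not an existing API.

```python
from collections import namedtuple

Select = namedtuple("Select", "result predicate if_true if_false")

def phi_removal(head, then_bb, else_bb, exit_bb, predicate):
    """(1) speculate both branches into head, (2) turn each φ of exit into a
    select, (3) merge everything into a single block (returned)."""
    head.instructions += then_bb.instructions + else_bb.instructions   # (1)
    for result, incoming in exit_bb.phis:                              # (2)
        head.instructions.append(
            Select(result, predicate, incoming[then_bb], incoming[else_bb]))
    exit_bb.phis = []
    head.instructions += exit_bb.instructions                          # (3)
    head.succs = list(exit_bb.succs)
    for succ in head.succs:
        succ.preds = [head if p is exit_bb else p for p in succ.preds]
    return head
```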

Fig. 21.5 φ reduction: B1 and B2 are redirected to a fresh block B12 holding r12 = φ(r1, r2) (turned into r12 = p ? r1 : r2 by φ removal), and the φ-function of B0 becomes r = φ(r12, r3).

The transformation duplicates the path P = side_i → ... → side_p → exit into P' = side'_i → ... → side'_p → exit and redirects side_{i−1} (or head if i = 0) to side'_i. All the φ-functions that are along P and P' for which the number of predecessors has changed have to be updated accordingly: a φ-function r = φ(side_{i−1} : r0, ..., rm) originally in side_i will be updated into r = φ(r1, ..., rm) in side_i and into r = φ(r0), i.e., r = r0, in side'_i; a φ-function r = φ(side_p : r1, B2 : r2, ..., Bn : rn) in exit will be updated into r = φ(side_p : r1, B2 : r2, ..., Bn : rn, side'_p : r1). Variable renaming (see Chapter 5), along with copy-folding, can then be performed on P and P'. All steps are illustrated through the example of Figure 21.6.

The last transformation, namely the conjunctive predicate merge, concerns the if-conversion of a control-flow pattern that sometimes appears in codes to represent logical and or or conditional operations. As opposed to path duplication, the goal here is to get rid of side-exit edges that prevent a single-entry-node region from being a diamond-DAG. The transformation is actually restricted to a very simple pattern, highlighted in Figure 21.7, made up of three distinct basic blocks: head, which branches with predicate p to side or to exit, and side, which is empty and branches itself with predicate q to another basic block outside of the region or to exit. Conceptually the transformation can be understood as first isolating the outgoing path p → q and then if-converting the obtained diamond-DAG.

Implementing the same framework on a non-SSA form program would require more effort: the φ reduction would require variable renaming, and inferring the minimum amount of select operations would require having and updating liveness information, involving either a global data-flow analysis or the insertion of copies at the exit node of the diamond-DAG. SSA form solves the renaming issue for free. Hence, as illustrated by Figure 21.8, the minimality and the pruned flavor of the SSA form allow to avoid inserting useless select operations.
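As a companion to the conjunctive predicate merge described above, the following sketch (hypothetical, built on the same toy Block representation; instructions are plain (opcode, defs, uses) tuples) merges the two branches on p and q into a single branch on s = p ∧ q and updates the φ-functions of exit accordingly.

```python
import itertools

_preds = itertools.count()

def conjunctive_merge(head, side, b1, exit_bb, p, q):
    """Merge 'head branches on p to side; side branches on q to b1' (both
    falling through to exit) into a single branch on s = p ∧ q in head."""
    s = f"s{next(_preds)}"                           # fresh predicate name (hypothetical)
    head.instructions.append(("and", [s], [p, q]))   # s = p ∧ q, computed in head
    head.succs = [b1, exit_bb]                       # head now branches on s
    b1.preds = [head if blk is side else blk for blk in b1.preds]
    exit_bb.preds = [blk for blk in exit_bb.preds if blk is not side]
    if head not in exit_bb.preds:
        exit_bb.preds.append(head)
    for result, incoming in exit_bb.phis:            # operands that flowed from side
        if side in incoming:                         # now flow from head
            incoming[head] = incoming.pop(side)
    return s
```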

Fig. 21.6 Path duplication: (a) almost diamond-DAG, (b) after path duplication, (c) after renaming/copy-folding, (d) after φ reduction.

Fig. 21.7 Convergent conjunctive merge: the branch on q in the empty block side is merged with the branch on p in head into a single branch on s = p ∧ q.

21.2.2 Handling of predicated execution model

The φ removal transformation described above considered a speculative execution model. As we will illustrate hereafter, in the context of a predicated execution model,

the computation of p has to be done before the computations of x 1 and x 2 . but then an anti-dependence from x = a + b and p ? x = c appears that forbid its execution in parallel. . . then x 1 and x 2 can be coalesced to x . . speculation adds anti-dependencies. antiparallel execution \psifun 21 If-Conversion — (C.282 predicated execution coalescing $\psi$-SSA form dependence.8 (b) if-conversion on pruned SSA model. The use of ψ-SSA (see Chapter 17). As illustrated by Figure 21. t 2 ) r = φ (r1 . . . r2 = f(t 2 ) t = φ (t 1 . speculating an operation on an if-converted code corresponds to remove the data dependence with the corresponding predicate. datadependence. Just as (control) speculating an operation on a control flow graph corresponds to ignore the control dependence with the conditional branch. p ? x 2 ) (c) partially speculated p = (a < ? b) x = a +b p ?x =c (d) after coalescing Speculation removes the dependency with the predicate but adds anti-dependencies between concurrent computations. . r2 = f(t 2 ) t = p ? t1 : t2 r = p ? r1 : r2 · · · = f(r ) p ? br t1 = . r2 ) · · · = f(r ) SSA predicate minimality t1 = . . . . speculation is performed during the φ removal transformation. if only the computation of x 1 is speculated. if both the computation of x 1 and x 2 are speculated. whenever it is possible (operations with side effect cannot be speculated) and considered as beneficial. Bruel) p ? br t1 = . r1 = f(t 1 ) t2 = . 21. Also. controldependence. 21. p ? x 2 ) (a) predicated code Fig. the choice of speculation versus predication is an optimization decision that should not be imposed by the intermediate representation. r1 = f(t 1 ) t2 = . transforming speculated code into predicated code can be viewed as a coalescing problem. r1 = f(t 1 ) t2 = .9: For the fully predicated version of the code. . . r1 = f(t 1 ) t2 = . speculating the computation of x 1 removes the dependence with p and allows to execute it in parallel with the test (a < ? b ). as the intermediate form of if-conversion. . . r2 = f(t 2 ) r = p ? r1 : r2 · · · = f(r ) r = φ (r1 . In practice. . This trade-off can be illustrated through the example of Figure 21. p ? x 2 ) (b) fully speculated p = (a < ? b) x1 = a + b p ? x2 = c x = ψ(p ? x 1 . only the operations part . they cannot be coalesced and when destruction ψ-SSA. allows to postpone the decision of speculating some code. . .10b. p = (a < ? b) p ? x1 = a + b p ? x2 = c x = ψ(p ? x 1 . r2 ) · · · = f(r ) (a) if-conversion on minimal SSA Fig. the ψ-function will give rise to some select instruction. r2 = f(t 2 ) t1 = .9 p = (a < ? b) x1 = a + b x2 = c x = ψ(p ? x 1 . while the coalescing problem is naturally handled by the ψ-SSA destruction phase. On the other-hand on register allocated code.

d 2 ) (a) nested if Fig. s ? x 2 ) q ? d 1 = f(x ) q ? d2 = 3 d = ψ(q ? d 1 .e. p ? x 2 ) d 1 = f(x ) q ? d2 = 3 d = ψ(q ? d 1 . Note that as opposed to speculating it. p ) and the current branch predicate (e. which forbids in turn the operation d 1 = f(x ) to be speculated.10a containing a sub region already processed.g. originally nongated operand x 1 gets gated by q . All operations that cannot be speculated. As an example.10a.g. Consider the code of Figure 21. Marking operations that cannot be speculated can be done easily using a forward propagation along def-use chains. 21. The only subtlety related to projection is related to generating the new predicate as the logical conjunction of the original guard (e. then the ψ-function cannot be speculated. and in particular for speculation. q ). q ? d 2 ) (c) predicating both branches d = φ (d 1 . A ψ-function can also be projected on a predicate q as described in Chapter 17: All gates of each operand are individually projected on q . if (q ) then p = (a < ? b) p = (a < ? b) x1 = a + b p ? x2 = c x = ψ(x 1 . q ? d 2 ) (b) speculating the then branch else end x1 = a + b p ? x2 = c x = ψ(x 1 . i. this operation might be already predicated (on p here) prior to the if-conversion. while the p gated operand x 2 gets gated by s = q ∧ p . must be predicated. the ψ-function. Just as for x 2 = c in Figure 21. In that case.10 Inner region ψ Speculating code is the easiest part as it could be done prior to the actual ifconversion by simply hoisting the code above the conditional branch. they can be considered for inclusion in a candidate region for ifconversion. . However. Similarly. Still we would like to outline that since ψ-functions are part of the intermediate representation.2 Basic Transformations 283 def-use chains projection $\psi$-projection guard of one of the diamond-DAG branch are actually speculated. the strength of ψ-SSA allows to treat ψ-functions just as any other operation. also has to be speculated. meaning that instead of predicating x 2 = c by p it gets predicated by q ∧ p . If one of them can produce hazardous execution. all the operations defining the operands x 1 and x 2 should also be speculated. a projection on q is performed. that is part of the then branch on predicate q .21. the operation defining x . Suppose we are considering a none-speculated operation we aim at if-converting. This partial speculation leads to manipulating code made up of a mixed of predication and speculation. p ? x 2 ) d 1 = f(x ) d2 = 3 p = (a < ? b) s =q ∧p q ? x1 = a + b s ? x2 = c x = ψ(q ? x 1 . including possibly some ψ-functions. To speculate the operation d 1 = f(x ). predicating a ψ-function does not impose to predicate the operation that define its operands.

. and q ∧ p . of SSAHere s needs to be computed at some point. operands of projected operations follow. . . BB12}. . . BB14. . . The post-order list of conditional blocks (represented in bold) is [BB11. . . .3. Bruel) incremental update. . φ reduction can be applied. the number of predicate registers will determine the depth of the if-conversion so that the number of conditions does not exceed the number of available predicates. . . q and q are already available. used predicates are q . . However. BB9. . . and basic-block BB3 that contains a function call cannot be if-converted (represented in gray). . . The earlier place where q ∧ p can be computed is just after calculating p. . . . . .1 SSA Incremental if-conversion algorithm The algorithm takes as input a CFG in SSA form and applies incremental reductions on it using a list of candidate conditional basic-blocks sorted in post-order. . . . . Once. Note that as the reduction proceeds. . . The exit node BB7. or because a basic block cannot be if-converted. . on q for the else). . basic adhoc updates can be implemented instead. operations have been speculated or projected (on q for the then branch. each φ -functions at the merge point is replaced by a ψfunction: operands of speculated operations are placed first and guarded by true. . Each basic-block in the list designates the head of a sub-graph that can be ifconverted using the transformations described in Section 21. BB2]. . .3 Global Analysis and Transformations Hot regions are rarely just composed of simple if-then-else-end control flow regions but processors have limited resources: the number of registers will determine the acceptable level of data dependencies to minimize register pressure. BB10. . Post-order traversal allows to process the regions from inner to outer. . . BB16. . guarded by the predicate of the corresponding branch. When the if-converted region cannot grow anymore because of resources. . . . 21. . the number of processing units will determine the number of instructions that can be executed simultaneously. The first candidate region is composed of {BB11. . promoting the instructions of BB12 in BB11.2. 21.284 21 If-Conversion — (C. The inner-outer incremental process advocated in this chapter allows to evaluate finely the profitability of if-conversion. . Consider for example the CFG from the gnu wc (word count) program reported in Figure 21.11a. . Here. . q . BB17. Our heuristic consists in first listing the set of all necessary predicates and then emitting the corresponding code at the earlier place. . . maintaining SSA can be done using the general technique described in Chapter 5. . BB6. . BB2 becomes the . . then another candidate is considered in the post-order list until all the CFG is explored. BB2. .

BB16 can be promoted. some basic-blocks (such as BB3) may have to be excluded from the region to if-convert.3. BB1 BB1 BB2 BB3 BB5 BB3 BB2 BB5 BB4 BB6 BB7 BB9 BB1 BB9’ BB4 BB6’ BB7 BB6 BB9 BB10 BB14 BB15 BB11 BB16 BB12 BB17 BB19 BB7 BB3 BB2 BB5 BB11’ BB12’ BB6 BB9 BB10’ BB14’ BB16’ BB15’ BB14 BB15 BB10 BB11 BB16 BB12 BB17 BB4 BB17’ BB19’ BB19 (a) Initial Fig. 21.11 (b) If-conversion (c) Tail duplication If-conversion of wc (word count program).11. a newly predicate must be created to hold the merged conditions. The region headed by BB17 is then considered. BB19 is duplicated into BB19 . Tail duplication can be used to exclude BB3 from the to-be-if-converted region. the goal of tail . Similar to path duplication described in Section 21. BB2. Basic-blocks in gray cannot be if-converted.2 Tail Duplication Just as for the example of Figure 21. BB19 is duplicated into a BB19 with BB2 as successor. BB19 cannot yet be promoted because of the side entries coming both from BB15 and BB16.21. a new merging predicate is computed and inserted. After the φ removal is fully completed BB16 has a unique successor. So. so as to promote BB17 and BB19 into BB16 through φ reduction. The process finished with the region head by BB9. The region headed by BB16 that have now BB17 and BB19 as successors is considered. BB10 is then considered. BB14 needs to be duplicated to BB14 . Again since BB16 contains predicated and predicate setting operations. BB19 already contains predicated operations from the previous transformation. BB14 is the head of the new candidate region where BB15. Tail duplication can be used for this purpose.3 Global Analysis and Transformations 285 tail duplication path duplication single successor of BB11.2. 21. BB19 can then be promoted into BB17.

Note that a new node. This would lead to the CFG of Figure 21. BB4. To evaluate those costs. A global approach would just do as in Figure 21. and BB7. leading to the code depicted in Figure 21. The prevalent idea is that a region can be if-converted if the cost of the resulting if-converted basic block is smaller than the cost of each basic block of the region taken separately weighted by its execution probability. has been added here after the tail duplication by a process called branch coalescing. Aggressive if-conversion can easily exceed the processor resources. (2) any branch from B j that points to a basic-block B k of the region is rerouted to its duplicate B k . Formally. consider a region made up of a set of basic-blocks. . we would have entry = BB2. Consider again the example of Figure 21. Moving operations earlier in the instruction stream increases live-ranges. Tail duplication consists in: (1) for all B j (including B s ) reachable from B s in .12b. create a basic-block B j as a copy of B j . consider the example of Figure 21. second. and out = BB4. tail duplication would be performed prior to any if-conversion. . which technique consists in.11c: first select the region.12e.3. BB7.286 hyper-block formation branch coalescing 21 If-Conversion — (C. a distinguished one entry and the others denoted ( B i )2≤i ≤n . This is usually done in the context of hyper-block formation. . to ifconvert a region in “one shot”.11b. . leading to excessive register pressure or moving infrequently used long latencies instructions into the critical path. To illustrate this point. get-rid of side incoming edges using tail duplication. Applying tail duplication to get rid of side entry from BB2 would lead to exactly the same final code as shown in Figure 21. 21. Applying if-conversion on the two disjoint regions respectively head by BB4 and BB4’ would lead to the final code shown if Figure 21. . finally perform ifconversion of the whole region in one shot.12c. B s = BB6. as opposed to the inner-outer incremental technique described in this chapter. we consider all possible paths impacted by the if-conversion.12a where BB2 cannot be if-converted.12f. Bruel) duplication is to get rid of incoming edges of a region to if-convert. Getting rid of the incoming edge from BB4 to BB6 is possible by duplicating all basic-blocks of the region reachable from BB6 as shown in Figure 21. The selected region is made up of all other basic-blocks. Our incremental scheme would first perform if-conversion of the region head by BB4. Suppose a basic-block B s has some predecessors out1 .11a. In our example. such that any B i is reachable from entry in . (3) any branch from a basic-block outk to B s is rerouted to B s . Using a global approach as in standard hyperblock formation. outm that are not in . We would like to point out that there is no phasing issue with tail duplication.3 Profitability Fusing execution paths can over commit the architectural ability to execute in parallel the multiple instructions: Data dependencies and register renaming introduce new register constraints. and suppose the set of selected basic-blocks defining the region to if-convert consists of all basic-blocks from BB2 to BB19 excluding BB3.
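Before turning to the formulas, the following small sketch shows the shape of the comparison that is made precise below: the expected cost of the branching code (weighted by path probabilities, plus branch penalties) against the cost of the single if-converted block. Block costs, probabilities, and the branch latency are hypothetical inputs, and the composition of blocks is approximated here by a plain sum, whereas the text uses the schedule height of the merged block, which can be smaller thanks to ILP.

```python
def cost_control(head_cost, then_costs, else_costs, prob_p, br_lat):
    # taken branch on p: pay the branch latency and the 'then' blocks;
    # fall-through on p̄: pay only the 'else' blocks
    return (head_cost
            + prob_p * (br_lat + sum(then_costs))
            + (1.0 - prob_p) * sum(else_costs))

def cost_predicated(head_cost, then_costs, else_costs, merge_overhead=0):
    # after if-conversion, every instruction of both paths is always executed
    return head_cost + sum(then_costs) + sum(else_costs) + merge_overhead

def profitable(head_cost, then_costs, else_costs, prob_p, br_lat):
    return (cost_predicated(head_cost, then_costs, else_costs)
            <= cost_control(head_cost, then_costs, else_costs, prob_p, br_lat))

# e.g. a rarely taken, expensive 'then' path makes if-conversion unprofitable:
# profitable(head_cost=2, then_costs=[10], else_costs=[1], prob_p=0.05, br_lat=2)
```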

Fig. 21.12 Absence of phasing issue for tail duplication: (a) initial, (b) tail duplication, (c) if-conversion, (d) initial, (e) if-conversion, (f) tail duplication.

For a path Pq = [B0, B1, B2, ..., Bn[ of probability prob(q), its cost is given by

Pq = prob(q) × Σ_{i=0..n−1} [Bi, Bi+1[

where [Bi, Bi+1[ represents the cost of basic block Bi (estimated using its schedule height) plus the branch latency br_lat if the edge (Bi, Bi+1) corresponds to a taken conditional branch, and [Bi] otherwise. In other words, if Bi branches to Sq on predicate q and falls through to Sq̄, then [Bi, Sq[ = [Bi] + br_lat while [Bi, Sq̄[ = [Bi].

For all transformations but the conjunctive predicate merge, there are two such paths starting at basic block head. Let path_p = [head, B1, ..., Bn, exit[ and path_p̄ = [head, B̄1, ..., B̄m, exit[ be the two paths, of respective probabilities prob(p) and prob(p̄), with the conditional branch taken on p. For the code of Figure 21.13, we would have path_p = [head, B1, exit[ and path_p̄ = [head, exit[. Then, the overall cost before if-conversion simplifies to

cost_control = path_p + path_p̄ = [head] + prob(p) × br_lat + prob(p) × Σ_{i=1..n} [Bi] + prob(p̄) × Σ_{i=1..m} [B̄i]

This is to be compared to the cost after if-conversion,

cost_predicated = [head ∘ B1 ∘ ... ∘ Bn ∘ B̄1 ∘ ... ∘ B̄m]

where ∘ is the composition function that merges basic blocks together, removes the associated branches, and creates the predicate operations.

Fig. 21.13 φ reduction (profitability pattern): head branches on p to B1 or falls through to exit.

The profitability for the logical conjunctive merge of Figure 21.14 can be evaluated similarly. There are three paths impacted by the transformation: path_{p∧q} = [head, side, B1[, path_{p∧q̄} = [head, side, exit[, and path_p̄ = [head, exit[, of respective probabilities prob(p ∧ q), prob(p ∧ q̄), and prob(p̄). The overall cost before the transformation (if branches are on p and q), path_{p∧q} + path_{p∧q̄} + path_p̄, simplifies to

cost_control = [head] + prob(p) × (1 + prob(q)) × br_lat

(recall that side is empty, so its own cost does not contribute), which should be compared to (if the branch on the new head block is on p ∧ q)

cost_predicated = [head ∘ side] + prob(p) × prob(q) × br_lat

Fig. 21.14 Conjunctive predicate merge (profitability pattern).

Note that if prob(p) ≪ 1, emitting a conjunctive merge might not be beneficial. In that case, another strategy, such as path duplication from the exit block, will be evaluated. Profitability for any conjunctive predicate merge (disjunctive or conjunctive, convergent or not) is evaluated similarly. Naturally, the objective function needs the target machine description to derive the instruction latencies, resource usage, and scheduling constraints. The local dependencies computed between instructions are used to compute the dependence height. The branch probability is obtained either from static branch prediction heuristics, profile information, or user-inserted directives. Note, however, that this heuristic

. . the region selection can be reevaluated at each nested transformation. . Basic block selection and if-conversion are performed as a single process. . . . Thanks to conditional moves and ψ transformations. Predication and speculation are often presented as two different alternatives for if-conversion. . . . using well known techniques such as tail duplication or branch coalescing only when the benefit is established. . 21. . While it is true that they both require different hardware support. . . . inner-outer process. predictability [117] becomes a major criteria for profitability. excluding basic blocks which do not justify their inclusion into the ifconverted flow of control.21. .4 Conclusion We presented in this chapter how an if-conversion algorithm can take advantage of the SSA properties to efficiently assign predicates and lay out the new control flow in an incremental.5 Additional reading 289 can be either pessimistic. . . or uncertainty in the branch prediction. As opposed to the alternative top-down approach. because it does not take into account new optimization opportunities introduced by the branch removals or explicit new dependencies. . . . . reconstructing the control flow at schedule time. . . . . . . . . exposes ILP in VLIW architectures using trace scheduling and local ifconverted if-then-else regions using the select and dismissible load operations. . using local analysis. . . . . . . . . such as new predicate merging instructions or new temporary pseudo registers. . . . . Hyperblocks [185] was proposed as the primary if-converted scheduling framework. . . . Reverse if-conversion was proposed in [19]. . . or optimistic because of bad register pressure estimation leading to register spilling on the critical path. . after application of more aggressive region selection criteria. The idea behind was to enable the compiler to statically reorganize the instruction. . . . In this respect. . . they should coexist in an efficient if-conversion process such that every model of conditional execution is accepted. . . . . with the advantage that all the instructions introduced by the ifconversion process in the inner regions. . . . . . . . . . . . . they are now generated together in the same framework. . . . . . This cost function is fast enough to be reapplied to each region during the incremental processing. . can be taken into account. . . . . . . . . the size and complexity of the inner region under consideration makes the profitability a comprehensive process. . . . . . . . . . hyperblocks being created lazily. . . . . . To overcome the hard to predict profitability in conventional if-conversion algorithms. . . . . . But since the SSA incremental if-conversion framework reduces the scope for the decision function to a localized part of the CFG. . . 21.5 Additional reading [221]. .

A global SSA framework was presented [53] to support s e l e c t moves using aggressive speculation techniques. holding guard and topological information. but they are based on speculation using conditional moves. further extended to ψ-SSA [54] allowing a mix of speculative and predicated techniques. starting from an if-converted region fully predicated. or light weight φ -SSA generation. In SSA-PS [140]. In both works. [183] evaluates how predicated operations can be performed using an equivalent sequences of speculative and conditional moves. . enabling support for a fully predicated ISA. ψ instructions are inserted while in SSA using a modified version of the classical Fang algorithm [107]. In [251]. to map phi-predication with phi-lists. Bruel) The duality between SSA like φ s and predicate dependencies have been used in other works. the use of SSA aims at solving the multiple definition problem exploiting variable renaming and join points.290 21 If-Conversion — (C. Predicated Switching operations are used to realize the conditional assignments using aggressive speculation techniques with conditional moves. but do not yet exploit the native SSA properties to build up the if-converted region. Phi-Predication [74] uses a modified version of the RK algorithm. Those works established that standard if-conversion techniques can be applied to a SSA form using the ψ-SSA representation.

. .e. . . . SSA imposes a strict discipline on variable naming: every “name” must be associated to only one definition which is obviously most of the time not compatible with the instruction set of the targeted architecture. . .interference SSA destruction critical edge CHAPTER 22 SSA destruction for machine code F. . it increases subsequently the size of the intermediate representation. . . . . second. . . . . . Unfortunately. Also.2 involves the splitting of every critical edges. . . . thus making it not suitable for just-in-time compilation. . . as we will see further. the compiler might not allow the splitting of a given edge. Correctness SSA at machine level complicates the process of destruction that can potentially lead to bugs if not performed carefully. care must be taken with duplicated edges. . . . a two291 . . As an example. . This can occur after control flow graph structural optimizations like dead code elimination or empty block elimination. or exception handling code. . . . The algorithm described in Section 3. . . . . i. . finally. . .1 Introduction Chapter 3 provides a basic algorithm for destructing SSA that suffers from several limitations and drawbacks: first. . this obstacle can easily be overcame. . 22. region boundaries. . . Rastello Progress: 90% Text and figures formating in progress . . because of specific architectural constraints. . . . the edges should be considered as critical and then split. . . But then it becomes essential to be able to append a copy operation at the very end of a basic block which might neither be possible. In such case. Fortunately. it works under implicit assumptions that are not necessarily fulfilled at machine level. . it must rely on subsequent phases to remove the numerous copy operations it inserts. . . when the same basic block appears twice in the list of predecessors. .

reducing the amount of these copies is the role of the coalescing during the register allocation phase.292 variable version \phiweb 22 SSA destruction for machine code — (F. Speed and Memory Footprint The cleanest and simplest way to perform SSA destruction with good code quality is to first insert copy instructions to make the SSA form conventional. this aggressive coalescing would not cope with the interference graph colorability. register constraints related to calling conventions or instruction set architecture might be handled by the register allocation phase. with less effort. The difficulty comes both from the size of the interference graph (the information of colorability is spread out) and the presence of many overlapping live-ranges that carry the same value (so noninterfering). Code quality The natural way of lowering φ -functions and expressing register constraints is through the insertion of copies (when edge-splitting is not mandatory as discussed above). and only effectively insert the non-coalesced ones. If done carelessly. In theory. before eventually renaming φ -webs and getting rid of φ -functions. in a transitional stage. This is why some prefer using. while two names can. A few memory and time-consuming existing coalescing heuristics mentioned in Chapter 23 are quite effective in practice. as we already mentioned. Apart from dedicated registers for which optimizations are usually very careful in managing there live-range. as we will see. To overcome this difficulty one can compute liveness and interference on demand which. Coalescing can also. The former simplifies the SSA destruction phase. Unfortunately this approach will lead. However. then take advantage of the SSA form to run efficiently aggressive coalescing (without breaking the conventional property). enforcement of register constraints impacts the register pressure as well as the number of copy operations. two versions of the same original variable should not interfere. This makes the SSA destruction phase the good candidate for eliminating (or not inserting) those copies. Rastello) address mode instruction. As we will see. For those reasons we may want those constraints to be expressed earlier (such as for the pre-pass scheduler). is made simpler by the use of SSA form. As opposed to a (so-called conservative) coalescing during register allocation. in which case the SSA destruction phase might have to cope with them. . one idea is to virtually insert those copies. be performed prior to the register allocation phase. for SSA construction. To fulfill memory and time constraints imposed by just-in-time compilation. Remains the process of copy insertion itself that might still take a substantial amount of time. such as auto-increment (x = x + 1) would enforce its definition to use the same resource than one of its arguments (defined elsewhere). Implicitly. while the latter simplifies and allows more transformations to be performed under SSA. the notion of versioning in place of renaming. thus imposing two different definitions for the same temporary variable. the resulting code will contain many temporary-to-temporary copy operations. strict SSA form is really helpful for both computing and representing equivalent variables. to an intermediate representation with a substantial number of variables: the size of the liveness sets and interference graph classically used to perform coalescing become prohibitively large for dynamic compilation.

. . . 22. . . . . B1 c ? br B 0 B2 . . . . . B1 a1 ← a1 c ? br B 0 B2 . .1 Isolation of a φ -node (b) After isolating the φ -function When incoming edges are not split. if a 0 is live out of the block B i where a i is defined.1. B2 : a 2) a0 ← a0 (a) Initial code Fig. a2 ← a2 B0 B0 : a 0 ← φ(B1 : a 1. . as they concern different variables. . a copy is also inserted on the outgoing edge (one for its defining operand). right after φ -functions. . . and algorithm efficiency (speed and memory footprint).e. . . . just before the branching operation. . if copies are sequentialized blindly. edge splitting can be avoided by treating symmetrically φ -uses and definition-operand: instead of just inserting copies on the incoming control-flow edges of the φ -node (one for each use operand). or a 0 is used in B 0 as a φ -function argument (as in the “swap problem”). . . . if parallel copies are used.22. . an empty parallel copy is inserted both at the beginning (i. 22. . . i > 0. For that reason. If.2 Correctness Isolating φ -node using copies In most cases. . . . . This has the effect of isolating the value associated to the φ -node thus avoiding (as discussed further) SSA destruction issues such as the well known lost-copy problem. code quality (elimination of copies). a 0 is dead before a i is defined but. The process of φ -node isolation is illustrated by Figure 22. . . . . . . if any). . . inserting a copy not only for each argument of the φ -function.e. . The layout falls into three corresponding sections. in which case the edge from B i to B 0 is critical (as in the “lost copy problem”). In this latter case. the live range of a 0 can go beyond the definition point of a i and lead to incorrect code after renaming a 0 and a i with the same name. several copies are introduced at the same place. . . . . Note that. . . they should be viewed as parallel copies. . The corresponding pseudo-code is given in Algorithm 22. . . because of different φ -functions. as far as correctness is concerned. if any) and at the end of each basicblock (i. but also for its result is important: without the copy a 0 ← a 0 . . . . . . . the φ -function defines directly a 0 whose live range can be long enough to intersect the live range of some a i . . . B2 : a 2) B0 B0 : a 0 ← φ(B1 : a 1. Two cases are possible: either a 0 is used in a successor of B i = B 0 . . copies can be sequentialized in any order.2 Correctness 293 This chapter addresses those three issues: handling of machine level constraints.

or the counter variable must not be promoted to SSA.3 where copyfolding would remove the copy a 2 ← b in Block B 2 . 22. B 1 B2 t 1 ← φ (t 0 . To go out of SSA.294 copy-folding 22 SSA destruction for machine code — (F. or some instruction must be changed. a counter u is decremented by the instruction itself. .2 (b) Branch with decrement (c) C-SSA with additional edge splitting Copy insertion may not be sufficient. In other words. ← u 2 B3 . the φ -function of Block B 2 . suppose that for the code of Figure 22. . . then branches to B 1 if u = 0.2c. It must then be given the same name as the variable defined by the φ -function. t 2 ) . Or one could split the critical edge between B 1 and B 2 as in Figure 22. . . t2 ← t1 + 2 c ? br B 2 B2 B3 . .2a. This is the case for the example of Figure 22. If B 2 is eliminated. u2 ← u1 − 1 t0 ← u 2 u 2 = 0 ? br B 1 B1 . However. . Rastello) φ -node isolation allows to solve most of the issues that can be faced at machine level. . For example. t 2 ) . B 1 t0 ← u t 1 = φ (t 0 . In addition to the condition. cannot be translated out of SSA by standard copy insertion because u interferes with t 1 and its live range cannot be split. B1 br_dec u . t 2 ) . SSA destruction by copy insertion alone is not always possible. one could add t 1 ← u − 1 in Block B 1 to anticipate the branch. ← u 2 (a) Initial SSA code Fig. no copy insertion can split its live range. br_dec u . when the basic block contains variables defined after the point of copy insertion. . br_dec u . there is . or the control-flow edge must be split somehow. . Then. which uses u . . . If u is used in a φ -function in a direct successor block. There is another tricky case when a basic-block have twice the same predecessor block. If both variables interfere. t2 ← t1 + 2 c ? br B 2 B2 t 1 ← φ (u . . . simple copy insertions is not enough in this case. . this is just impossible! To solve the problem.2b). ← u 2 B3 . there remains subtleties listed below. Limitations There is a tricky case. t2 = t1 + 2 c ? br B 2 . This is for example the case for the PowerPC bclr branch instructions with a behavior similar to hardware loop. u 2 ) . depending on the branch instructions and the particular case of interferences. B 1 decrements u . . B1 u 1 ← φ (u 0 . the instruction selection chooses a branch with decrement (denoted br_dec) for Block B 1 (Figure 22. . This can result from consecutively applying copy-folding and control flow graph structural optimizations like dead code elimination or empty block elimination. the SSA optimization could be designed with more care. .

. a1 ← a1 b ← b b > a 1 ? br B 0 B0 B0 B0 : a 0 ← φ(B1 : a 1. c ↑R 1 ) (c) Operand pining of a function call Fig. or where an operand must use a given register. B2 : b ) a0 ← a0 B0 : a 0 ← φ(B1 : a 1. b > a 1 ? br B 0 B1 a1 ← . b > a 1 ? br B 2 B2 a2 ← b B0 B1 a1 ← . The live-range pining of a variable v to resource R will be represented R v . b ← . 22... 22. just as if v were a version of temporary R .4 Operand pining and corresponding live-range pining ↑T ↑T (b) Corresponding live-range pining R 0b = b R 1c = c R 0a = f (R 0b . This last case encapsulates ABI constraints. Examples of operand pining are operand constraints such as 2-address-mode where two operands of one instruction must use the same resource.... For the sake of the discussion we differentiate two kinds of resource constraints that we will refer as operand pining and live-range pining. R 1c ) a = R 0a (d) Corresponding live-range pining Tp 1 ← p 1 Tp 2 ← Tp 1 + 1 p 2 ← Tp 2 . B2 : a 2) (a) Initial C-SSA code Fig.. On the other hand the pining of an operation’s operand to a given resource does not impose anything on the live-range of the corresponding variable.. Live-range pining expresses the fact that the entire live-range of a variable must reside in a given resource (usually a dedicated register). B1 a1 ← .. b ← .22. The scope of the constraint is restricted to the operation.2 Correctness 295 instruction set architecture application binary interface pining no way to implement the control dependence of the value to be assigned to a 3 other than through predicated code (see chapters 17 and 18) or through the reinsertion of a basic-block between B 1 and B 0 by the split of one of the edges. An example of live-range pining are versions of the stack-pointer temporary that must be assigned back to register SP..3 (b) T-SSA code (c) After φ -isolation Copy-folding followed by empty block elimination can lead to SSA code for which destruction is not possible through simple copy insertion The last difficulty SSA destruction faces when performed at machine level is related to register constraints such as instruction set architecture (ISA) or application binary interface (ABI) constraints.. p2 ← p1 + 1 (a) Operand pining of an auto-increment a ↑R 0 ← f (b ↑R 0 . B2 : b ) B0 : a 0 ← φ(B1 : a 1. b ← . An operand pining to a resource R will be represented using the exponent ↑R on the corresponding operand..
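To make the conversion of operand pining into live-range pining more concrete, here is a minimal sketch under an assumed toy IR (instructions as (opcode, defs, uses) tuples): a fresh, short-lived variable pinned to the resource is introduced around the constrained operation through parallel copies inserted just before and just after it, mimicking the R_v notation of the text. Names and helpers are illustrative only.

```python
import itertools

_fresh = itertools.count()

def expand_operand_pining(block, index, use_pins, def_pins, pin):
    """use_pins/def_pins map operand names to the resource they are pinned to;
    'pin' records the live-range pining of the freshly created variables."""
    op, defs, uses = block[index]
    pre, post, new_uses, new_defs = [], [], [], []
    for u in uses:
        if u in use_pins:                        # operand must sit in resource R
            t = f"t{next(_fresh)}"
            pin[t] = use_pins[u]                 # fresh variable, live-range pinned to R
            pre.append(("copy", [t], [u]))       # t <- u   (parallel copy just before)
            new_uses.append(t)
        else:
            new_uses.append(u)
    for d in defs:
        if d in def_pins:
            t = f"t{next(_fresh)}"
            pin[t] = def_pins[d]
            post.append(("copy", [d], [t]))      # d <- t   (parallel copy just after)
            new_defs.append(t)
        else:
            new_defs.append(d)
    block[index:index + 1] = pre + [(op, new_defs, new_uses)] + post
```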

We assume this to always be the responsibility of register allocation. it will not be correct if the insertion point (for the definition-operand) of copy a 0 ← a 0 does not dominate all the uses of a 0 . Then it checks for interferences that would persist. It is designed to be simple. It first inserts parallel copies to isolate φ -functions and operand pining. Algorithm 22 splits the data-flow between variables and φ -nodes through the insertion of copies. For a given φ -function a 0 ← φ (a 1 . We say that x and y are pin-φ related to one another if they are φ -related or if they are pinned to a common resource. We will denote such interferences as strong. . this transformation is correct as long as the copies can be inserted close enough to the φ -function. As far as correctness is concerned. Precisely this leads to inserting in Algorithm 22 the following tests: • line 10: “if the definition of a i does not dominate PC i then continue.2b). The resulting pseudo-code is given by Algorithm 23. the pin-φ -webs. At this point. The computation of φ -webs given by Algorithm 4 can be generalized easily to compute pin-φ -webs. as they cannot be tackled through the simple insertion of temporary-to-temporary copies in the code. It might not be the case if the insertion point (for a use-operand) of copy a i ← a i is not dominated by the definition point of a i (such as for argument u of the φ -function t 1 ← φ (u . The transitive closure of this relation defines an equivalence relation that partitions the variables defined locally in the procedure into equivalence classes. . we will denote as split operands the newly created local variables to differentiate them to the ones concerned by the two previous cases (designed as non-split operands). the pin-φ -equivalence class of a resource represents a set of resources “connected” via φ -functions and resource pining. But not only: the set of all variables pinned to a common resource must also be interference free. t 2 ) for the code of Figure 22.296 22 SSA destruction for machine code — (F. . . a n ). the code is still under SSA and the goal of the next step is to check that it is conventional: this will obviously be the case only if all the variables of a φ -web can be coalesced together.4: parallel copies with new variables pinned to the corresponding resource are inserted just before (for use-operand pining) and just after (for definition-operand pining) the operation. We consider that fixing strong interferences should be done on a case by case basis and restrict the discussion here on their detection. We suppose a similar process have been performed for operand pining to express them in terms of live-range pining with very short (when possible) live-ranges around the concerned operations. symmetrically.” For the discussion. Intuitively. Rastello) Note that looser constraints where the live-range or the operand can reside in more than one resource are not handled here. Detection of strong interferences The scheme we propose in this section to perform SSA destruction that deals with machine level constraints does not address compilation cost (in terms of speed and memory footprint). We first simplify the problem by transforming any operand pining to a live-range pining as sketched in Figure 22.” • line 17: “if one use of a 0 is not dominated by PC 0 then continue. .

based on the union-find pattern 1 2 3 4 begin for each resource R web(R ) ← {R }. . for each variable v web(v ) ← {v }. foreach B 0 : basic block of the CFG do foreach φ -function at the entry of B 0 of the form a 0 = φ ( B 1 : a 1 . Note that some interferences such as the one depicted in Figure 22. If any interference have been discovered. /* all a i can be coalesced and the φ -function removed */ Now. add copy a i ← a i to PCi . one need to check that each web is interference free. a n ) for each source operand a i in instruction union(web(a dest ).22. web(a i )) . A web contains variables and resources. The notion of interferences between two variables is the one discussed in Section 2.7 for which we will propose an efficient implementation later in this chapter. add copy a 0 ← a 0 to PC0 . let a i be a freshly created variable. let a 0 be a freshly created variable. . it has to be fixed on a case by case basis. . insert an empty parallel copy at the end of B .2 Correctness 297 Algorithm 22: Algorithm making non-conventional SSA form conventional by isolating φ -nodes 1 2 3 4 5 6 7 8 10 11 12 13 14 15 17 18 19 20 begin foreach B : basic block of the CFG do insert an empty parallel copy at the beginning of B . . .3 can be detected and handled initially (through edge splitting if possible) during the copy insertion phase. Algorithm 23: The pin-φ -webs discovery algorithm. . A variable and a physical resource do not interfere while two distinct physical resources will interfere with one another. replace a i by a i in the φ -function. web(v )) for each instruction of the form a dest = φ (a 1 . if v pinned to a resource R union(web(R ). B n : a n ) do foreach a i (argument of the φ -function corresponding to B i ) do let PC i be the parallel-copy at the end of B i . replace a 0 by a 0 in the φ -function. . . begin let PC 0 be the parallel-copy at the beginning of B 0 .

(e) Interferences . . . x 3 ) x2 ← x2 x3 ← x2 + 1 x3 ← x3 p ? br B 1 B0 B0 x ← . . .. . x1 ← . . between x 2 and x 3 ) materialize the presence of an interference between the two corresponding variables. . Rastello) . x 3 } x x2 x3 (c) Interferences Fig. . . x 3 can then be coalesced with x leading to the code of Figure 22. . .e. A solid edge between two nodes (e. ← x 2 B1 x1 ← . . . . . .e. it is important to remove as many copies as possible.298 22 SSA destruction for machine code — (F. x 3 ) B1 x3 ← x2 + 1 p ? br B 1 B2 . . .5 SSA destruction for the lost-copy problem. To improve the code however. x 3 } is coalesced into a representative variable x . . expressing the fact that they cannot be coalesced and share the same resource. . . 22. . according to the interference graph of Figure 22. the corresponding φ -web {x 1 . . i. . i. x2 ← φ (x 1 .5d. . x2 ← x B1 x ← x 2 + 1 p ? br B 1 B2 . . . x 2 . . . . . . . . . between x 2 and x 2 ) that could be removed by their coalescing. An interference graph (as depicted in Figure 22.5e. . . the presence of a copy (e. . A dashed edge between two nodes materializes an affinity between the two corresponding variables.. liveness and interferences can then be defined as for regular code (with parallel copies).3 Code Quality Once the code is in conventional SSA. . . . . This process is illustrated by Figure 22. . . 22. as it relies in renaming all variables in each φ -web into a unique representative name and then remove all φ -functions. the correctness problem is solved: destructing it is by definition straightforward. .5e) can be used. . ← x 2 B2 . . x 3 . . . . . . . . . . x 2 . . Aggressive coalescing This can be treated with standard non-SSA coalescing technique as conventional SSA allows to get rid of φ -functions: the set of variables of a SSA web can be coalesced leading to a single non-SSA variable. . . ← x 2 (a) Lost-copy problem (b) C-SSA x1 x1 x2 x2 x3 x3 (d) After coalescing x1 {x 1 .5: the isolation of the φ -function leads to inserting the three copies that respectively define x 1 . x 1 . .g.g. and uses x 2 . . B0 x ← x 1 1 x 2 ← φ (x 1 .

when a and b are coalesced. liveness of φ -function operands should reproduce the behavior of the corresponding non-SSA code as if the variables of the φ -web were coalesced all together. Furthermore.3 Code Quality 299 basic-block exit Liveness under SSA If the goal is not to destruct SSA completely but remove as many copies as possible while maintaining the conventional property. to be efficient coalescing must use an accurate notion of interference. after coalescing a and b . The corresponding interference graph on our example is depicted in Figure 22. Because of this. . However. The scope of variable coalescing is usually not so large. Definition 4 (multiplexing mode). However. the set of interferences of the new variable may be strictly smaller than the union of interferences of a and b . after the φ -isolation phase and the treatment of operand pining constraints.22. c should not interfere anymore with the coalesced variable. Hence the interference graph would have to be updated or rebuilt. with this conservative interference definition. the “has-the-same-value” binary relation defined on variables is. . in SSA. simply merging the two corresponding nodes in the interference graph is an over-approximation with respect to the interference definition. given by its unique definition. in other words variable a i for i > 0 is live-out of basic-block Bi . in other words variable a 0 is live-in of B 0 . statically. For example. the code contains many overlapping live-ranges that carry the same value.. related to the execution) notions: the notion of liveness and the notion of value. each variable has. As already mentioned in Chapter 2 the ultimate notion of interference contains two dynamic (i. Value-based interference As said earlier. an equiv- . and graph coloring based register allocation commonly take the following conservative test: two variables interfere if one is live at a definition point of the other and this definition is not a copy between the two variables. . if the SSA form fulfills the dominance property.e. B n : a n ) be in multiplexing mode. The semantic of the φ -operator in the so called multiplexing mode fits the requirements. a unique value. Thus. Analyzing statically if a variable is live at a given execution point or if two variables carry identical values is a difficult problem. . One can notice that.5c. Let a φ -function B 0 : a 0 = φ ( B 1 : a 1 . its use-operands are at the exit of the corresponding predecessor basic-blocks. then its liveness follows the following semantic: its definition-operand is considered to be at the entry of B 0 . in a block with two successive copies b = a and c = a where a is defined before. and b and c (and possibly a ) are used after. it is considered that b and c interfere but that none of them interfere with a .

finding the value of a variable can be done by a simple topological traversal of the dominance tree: when reaching an assignment of a variable b . using the same scheme as in SSA copy folding. Then. consider the following loop body if(i = 0) {b ← a . such a coalescing may be incorrect as it would increase the live range of the dominating variable. the fact that b is live at the definition of c does not imply that b and c interfere. a ← c . the interference between b and c is actual. Note that our notion of values is limited to the live ranges of SSA variables. and c have the same value V (c ) = V (b ) = V (a ) = a . . Hence. This sharing problem is difficult to model and optimize (the problem of placing copies is even worse). . the presence of two copies from the same source. . By comparison. e. A shared copy corresponds precisely to the previous example of two successive copies b = a and c = a i. Suppose however that a (after some other coalescing) interferes with b and c . ← b . We coalesce two variables b and c if they are both copies of the same variable a and if their live ranges intersect. our equality test in SSA comes for free. . we would face the complexity of general value numbering. as we consider that each φ -function defines a new variable. . .300 22 SSA destruction for machine code — (F. no coalescing can occur although coalescing b and c would save one copy. by “sharing” the copy of a .e. This can be done in a second pass after all standard affinities have been treated. Note that if their live ranges are disjoint. if the instruction is a copy b = a . But. 1 Dominance property is required here. We have seen that. thus they do not interfere. In the previous example. possibly creating some interference not taken into account. The interference test in now both simple and accurate (no need to rebuild/update after a coalescing): if live(x ) denotes the set of program points where x is live. but we can optimize it a bit. } c ← . . The value of an equivalence class 1 is the variable whose definition dominates the definitions of all other variables in the class. Shared copies It turns out that after the φ -isolation phase. otherwise V (b ) is set to b . a interfere with b if live(a ) intersects live(b ) and V (a ) = V (b ) The first part reduces to def(a ) ∈ live(b ) or def(b ) ∈ live(a ) thanks to the dominance property. thanks to our definition of value. V (b ) is set to V (a ). and the treatment of operand pining constraints the code also contains what we design as shared copies.g. b . We could propagate information through a φ -function when its arguments are equivalent (same value). a . Rastello) alence relation.

. . . . . . . and rely on the liveness check described in Chapter 9 to test if two liveranges intersect or not.def. . . . . . . .islive(p ). too. . .4 Speed and Memory Footprint 301 . This motivates. Also. . not to build any interference graph at all. .def.22. . . . . . b) avoid the quadratic complexity of interference check between two sets of variables by an optimistic approach that first coalesces even interfering variables. . . Interference check Liveness sets and interference graph are the major source of memory usage. . updates.op a. . . All these constructions. under strict SSA form. b ) V (a ) = V (b ) . . meaning that for a given variable a and program point p . it inserts many instructions before realizing most are useless. and copy insertion is already by itself time-consuming.def. say a and b intersect: intersect(a . .def. to check is two variables live-ranges. . one can answer if a is live at this point through the boolean value of a . then traverses each set of coalesced variables and un-coalesce one by one all the interfering ones.3 this can be done linearly (without requiring any hash map-table) on a single traversal of the program if under strict SSA form. in the context of JIT compilation. b ) ⇔ liverange(a ) ∩ liverange(b ) = a. . if a general coalescing algorithm is used.def. Let us suppose for this purpose that a “has-the-samevalue” equivalence relation. . . We also suppose that liveness check is available. .def.op dominates b.op = b. .op dominates a. .op) Which leads to our refined notion of interference: interfere(a . .op) b. . . . First. . c) emulating (“virtualizing”) the introduction of the φ -related copies.islive out(b. . b ) ⇔ intersect(a .4 Speed and Memory Footprint Implementing the technique of the previous section may be considered too costly. .islive out(a. . . . . a graph representation with adjacency lists (in addition to the bit matrix) and a working graph to explicitly merge nodes when coalescing variables. manipulations are time-consuming and memory-consuming.op ⇔ a.def. It introduces many new variables. 22.def.op b. . would be required. . . This can directly be used. is available thanks to a mapping V of variables to symbolic values: variables a and b have the same value ⇔ V (a ) = V (b ) As explained in Paragraph 22. The size of the variable universe has an impact on the liveness analysis and the interference graph construction. We may improve the whole process by: a) avoiding the use of any interference graph and liveness sets.

To this end. We denote the (interference free) set of variables pinned to a common resource that contains variable v . it is not put in another set. cur_anc = ⊥ which allows to conclude that there is no dominating variable that interfere with c . Variables defined at the same program point are arbitrarily ordered.eanc (updated line 13) represents the immediate intersecting dominator with the same value than v . when . variables are identified by their definition point and ordered using dominance accordingly. So the process of uncoloring a variable might have the effect of un-coloring some others. thanks to the dominance property.idom = b ). coalescing has the effect of merging vertices and interference queries are actually to be done between sets of vertices. so as to use the standard definition of immediate dominator (denoted v . so we have b . and different sets are assigned different colors. The idea is to first merge all copy and φ -function related variables together. However. updated lines 8-9). and graph coloring. set to ⊥ if not exists. Here we focus on a single merged-set and the goal is to make it interference free within a single traversal. and that a merged-set cannot contain variables pinned to different physical resources. We suppose that if two variables are pinned together they have been assigned the same color. atomic-merged-set(v ). In other words. this can be done linearly using a single traversal of the set. we will abusively design as the dominators of a variable v .idom. To overcome this complexity issue. and any un-colored variable v is to be coalesced only with the variables it is pinned with. Actually. The principle is to identify some variables that interfere with some other variables within the merged-set. In reference with register allocation. atomic-merged-set(v ).302 22 SSA destruction for machine code — (F. We will say un-colored. a colored variable is to be coalesced with variables of the same color. cur_anc (if not ⊥) represents a dominating variable interfering with v and with the same value than v . Rastello) De-coalescing in linear time The interference checking outlined in the previous paragraph allows to avoid building an interference graph of the SSA form program. The process of decoalescing a variable is to extract it from its set.idom: when v is set to c (c . as b does not intersect c and as b . we will associate the notion of colors to merged-sets: every variables of the same set are assigned the same color.eanc = ⊥. the set of variables of color identical to v which definition dominates the definition of v . From now. The idea exploits the tree shape of variables live-ranges under strict SSA. When line 13 is reached. As we will see. To illustrate the role of v . let us consider the example of Figure 22.eanc = ⊥ and d . enforcing at each step the sub-set of already considered variables to be interference free. the technique proposed here is based on a de-coalescing scheme.6 where all variables are assumed to be originally in the same merged-set: v . just isolated. Algorithm 24 performs a traversal of this set along the dominance order.eanc in Algorithm 24. and remove them (along with the one they are pinned with) from the merged-set.eanc = a . variables pinned together have to stay together. We suppose that variables have already been colored and the goal is to uncolor some of them (preferably not all of them) so that each merged-set become interference free. A merged-set might contain interfering variables at this point.

v . break.eanc = a and thus cur_anc = a . ← c + 1 Fig.. Procedure DeCoalesce(v . This allows to detect on line 15 the interference of e with a . cur_anc ← cur_anc. v .22. cur_anc ← v .idom.6 d ←a e ←b +2 . cur_idom ← v . if cur_anc = ⊥ if V (cur_anc) = V (v ) v . 6 7 8 9 10 11 12 13 14 15 16 17 18 a ← .4 Speed and Memory Footprint 303 v is set to e . d does not intersect e but as a intersects and has the same value than d (otherwise a or d would have been un-colored). u ) begin while (u = ⊥) (¬(u dominates v ) ∨ uncolored(u )) do u ← u . else(/* cur_anc and v interfere */) if preferable to uncolor v uncolor atomic-merged-set(v ). 22.idom.. Algorithm 24: De-coalescing of a merged-set 1 2 3 4 5 begin cur_idom = ⊥...idom ← u . cur_idom). break . while cur_anc = ⊥ while cur_anc = ⊥ ¬ (colored(cur_anc) ∧ intersect(cur_anc. b ←a c ←b +1 .eanc ← ⊥.eanc ← cur_anc.eanc.. v )) cur_anc ← cur_anc.eanc. foreach variable v of the merged-set in DFS pre-order of the dominance tree do DeCoalesce(v .. we have d . ← a + e Variables live-ranges are sub-trees of the dominator tree Virtualizing φ -related copies The last step toward a memory friendly and fast SSA-destruction algorithm consists in emulating the initial introduction of copies and only actually insert them . else uncolor atomic-merged-set(cur_anc).

identified as a “virtual” parallel copy. Rastello) on the fly when they appear to be required. it is materialized in the parallel copy and a i (resp.2b) to the others. def of a 0 ) is now assumed. a copy is inserted. The overall scheme works as follow: (1) every copy related (even virtual) variables are first coalesced (unless pining to physical resources forbid it). A key implementation aspect is related to the handling of pining. to be on the corresponding control flow edge. as in the multiplexing mode. Let a φ -function B 0 : a 0 = φ ( B 1 : a 1 . In particular. the de-coalescing scheme given by Algorithm 24 have to be adapted to emulate liveranges of split-operands. materialization of remaining virtual copies is performed through a single traversal of all φ -functions of the program: whenever one of the two operands of a virtual copy is uncolored. We use the term non-split and split operands introduced in Section 22. Because of this we consider a different semantic for φ -functions than the multiplexing mode previously defined. The corresponding φ -operator is replaced and the use of a i (resp. where the real copies. (3) finally. coalescing is performed in two separated steps. in other words variable a 0 is not live-in of B 0 . Because of the virtualization of φ -related copies. First pin-φ -webs have to be coalesced. and as the late point of B i (late( B i ) – just before the branching instruction) for a use-operand. . if any. When the algorithm decides that a virtual copy a i ← a i (resp. . The liveness for non-split operands follows the multiplexing mode semantic. then the liveness for any split-operand follows the following semantic: its definition-operand is considered to be at the early point of B 0 . for correctness purpose. and use a special location in the code.304 basic-block early basic-block late 22 SSA destruction for machine code — (F. only copies that the first approach would finally leave un-coalesced are introduced. . a 0 ) becomes explicit in its merged-set. The so obtained merged sets (that contain local virtual . a 0 ← a 0 ) cannot be coalesced. . For a φ -function B 0 : a 0 = φ ( B 1 : a 1 . the cases to consider is quite limited: a local virtual variable can interfere with another virtual variable (lines 3-4) or with a “real” variable (lines 5-12). This way. . . The trick is to use the φ -function itself as a placeholder for its set of local virtual variables. Definition 5 (copy mode). B n : a n ) be in copy mode. (2) then merged-sets (identified by their associated color) are traversed and interfering variables are de-coalesced. We choose to postpone the materialization of all copies along a single traversal of the program at the really end of the de-coalescing process. we denote the program point where the corresponding copy would be inserted as the early point of B 0 (early( B 0 ) – right after the φ -functions of B 0 ) for the definitionoperand. We use exactly the same algorithms as for the solution without virtualization. B n : a n ) and a split operand B i : a i .2. in other words variable a i for i > 0 is (unless used further) not live-out of basic-block B i . The pseudo-code for processing a local virtual variable is given by Algorithm 26. As the live-range of a local virtual variable is very small. To this end we differentiate φ -operands for which a copy cannot be inserted (such as for the br_dec of Figure 22. its use-operands are at the late point of the corresponding predecessor basic-blocks. 
Detection and fixing of strong interferences are handled at this point.

c. .22.cur_idom ← v . else foreach operation OP at l (including φ -functions) do foreach variable v defined by OP do if ¬colored(v ) then continue else c ← color(v ). begin if (/* Interference */)Phi = ⊥ ∧ V (operand_from_B(Phi )) = V (a ) uncolor atomic-merged-set(Phi). Phi’.curphi. variables) have to be identified as atomic i. while (u = ⊥) (¬(u dominates B ) ∨ ¬colored(u )) u ← u . This systematic copy insertion can also be avoided and managed lazily just as for φ -nodes isolation.eanc. For φ -functions. Without any virtualization. B : v . the trick is again to use the φ -function itself as a placeholder for its set of local virtual variables: the pining of the virtual variables is represented through the pining of the corresponding φ -function. while cur_anc = ⊥ while cur_anc = ⊥ ¬ (colored(cur_anc) ∧ cur_anc. B : v . . u ). A variable cannot be de-coalesced from a set without de-coalescing its atomic merged-set from the set also. the process of transforming operand pining into live-range pining also introduces copies and new local variables pinned together.cur_idom). Non singletons atomic merged-sets have to be represented somehow. DeCoalesce_virtual(Phi. .curphi = ⊥ foreach basic-block B successor of B do foreach operation Phi: “B : a 0 = φ (.cur_idom). Algorithm 26: Process (de-coalescing) a virtual variable 1 2 3 4 5 6 7 8 9 10 11 12 Procedure DeCoalesce_virtual(Phi. We do not address this aspect of the virtualization here: to . )” in B do if ¬colored(Phi) then continue else c ← color(Phi ). foreach basic-block B in CFG in DFS pre-order of the dominance tree do foreach program point l of B in topological order do if l = late( B ) foreach c ∈ COLORS do c .curphi ← Phi. cur_anc ← cur_anc.eanc ← ⊥.idom ← u .eanc.e.4 Speed and Memory Footprint 305 Algorithm 25: De-coalescing with virtualization of φ -related copies 1 2 3 4 5 6 7 8 9 10 11 12 13 14 foreach c ∈ COLORS do c . B : v . . As a consequence. c. . they cannot be separated. any φ -function will be pinned to all its non-split operands.cur_idom = ⊥. v . atomic merged sets will compose larger merged sets. After the second step of coalescing. break. cur_anc ← u .idom. if (/* interference */)cur_anc = ⊥ if preferable to uncolor Phi uncolor atomic-merged-set(Phi). . DeCoalesce(v . v . else uncolor atomic-merged-set(cur_anc). if colored(v ) then c . if colored(Phi) then c . c . return.islive(out( B ))) cur_anc ← cur_anc.

.. . . b) eliminate redundant copies. is due to the fact that in the presence of many copies the code contains many intersecting variables that . . was proposed by Sreedhar et al.. . The first one. The initial value that must be copied into b is stored in pred(b ). the to_do list is considered. The list to_do contains the destination of all copies to be treated. . This gives several benefits: a simpler implementation. . . namely the “lost copy problem” and the “swap problem”. . The reason for that.5 Further Readings SSA destruction was first addressed by Cytron et al. . . When a variable a is copied in a variable b . . . . . . Copies are first treated by considering leaves (while loop on the list ready). . This information is stored into loc(a ). [243]. . 22. we treat the copies placed at a given program point as parallel copies. . . at first sight. The initialization consists in identifying the variables whose values are not needed (tree leaves). Briggs et al. in particular for defining and updating liveness sets. . . . They address the associated problem of coalescing and describe three solutions. . The third solution that turns out to be nothing else than the first solution that would virtualizes the isolation of φ -functions shows to introduce less copies. . Rastello) simplify we consider any operand pining to be either ignored (handled by register allocation) or expressed as live-range pining. Then. thanks to an extra copy into the fresh variable n . . . . . . and fewer constraints for the coalescer. . correct. [49] pointed subtle errors due to parallel copies and/or critical edges in the control flow graph. c) eliminate φ -functions and leave CSSA. [87] who propose to simply replace each φ -function by copies in the predecessor basic-block. possibly breaking a circuit with no duplication. . write the final copies in some sequential order.e. we need to go back to standard code. Although this naive translation seems.2 (Algorithm 6) the sequentialization of a parallel copy can be done using a simple traversal of a windmill farm shaped graph from the tip of the blades to the wheels. . . . . . . identified by Boissinot et al. allowing to overwrite a variable as soon as it is saved in some other variable. . a more symmetric implementation. . . . . . . Two typical situations are identified. . Sequentialization of parallel copies During the whole algorithm. which are stored in the list ready.306 22 SSA destruction for machine code — (F. However. ignoring copies that have already been treated. i. As explained in Section 3. . which are indeed the semantics of φ -functions. The first solution. by isolating φ -functions. . Algorithm 27 emulates a traversal of this graph (without building it). both simple and correct. consists in three steps: a) translate SSA into CSSA. . the algorithm remembers b as the last location where the initial value of a is available. . . at the end of the process. .
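For readers who prefer executable code, the traversal just described can be transcribed compactly; the following Python sketch is one possible reading of it (the corresponding pseudo-code, Algorithm 27, follows), with temp playing the role of the extra fresh variable n.

def sequentialize_parallel_copy(copies, temp):
    """copies: list of (src, dst) pairs to be performed simultaneously, all
    destinations distinct; temp: an extra fresh location.  Returns the moves
    (src, dst) in a valid sequential order."""
    loc, pred, ready, to_do, out = {}, {}, [], [], []
    for a, b in copies:
        loc[b] = None                       # b's initial value is not needed yet
        pred[a] = None
    for a, b in copies:
        loc[a] = a                          # a's initial value is available in a itself
        pred[b] = a                         # the (unique) source that must reach b
        to_do.append(b)
    for a, b in copies:
        if loc[b] is None:                  # b is not a source: free to overwrite
            ready.append(b)
    while to_do:
        while ready:
            b = ready.pop()
            a = pred[b]
            c = loc[a]                      # current location of a's initial value
            out.append((c, b))              # emit  b <- c
            loc[a] = b                      # that value now also lives in b
            if a == c and pred[a] is not None:
                ready.append(a)             # a itself may now be overwritten
        b = to_do.pop()
        if b != loc[pred[b]]:               # a circuit remains: break it via temp
            out.append((b, temp))
            loc[b] = temp
            ready.append(b)
    return out

# A swap (a, b) <- (b, a) needs the extra location:
assert sequentialize_parallel_copy([("a", "b"), ("b", "a")], "n") == \
       [("a", "n"), ("b", "a"), ("n", "b")]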

5 Further Readings 307 Algorithm 27: Parallel copy sequentialization algorithm Data: Set P of parallel copies of the form a → b .push(b ) . Register renaming constraints. see for example how the interference graph is computed in [16].pop() . /* initialization */ forall the (a → b ) ∈ P do loc(a ) ← a . /* available in c */ 12 emit_copy(c → b ) . pred(a ) ← ⊥ . but it was considered to be too space consuming. /* needed and not copied yet */ pred(b ) ← a . such as calling conventions or dedicated registers. [38] revisited Sreedhar et al. Simple data-flow analysis scheme is used to place repairing copies. a = b . if b = loc(pred(b )) emit_copy(b → n ) . Recomputing it from time to time was preferred [63. /* b is not used and can be overwritten 1 2 3 4 5 6 7 8 9 to_do = [] while ready = [] b ← ready. /* generate the copy */ 13 loc(a ) ← b . available in n */ /* b can be overwritten */ do not actually interfere.22. This interference notion is the most commonly used. loc(b ) ← n . 62].push(a) .push(b ) . /* just copied. /* copy into b to be done */ forall the (a → b ) ∈ P do if loc(b ) = ⊥ then ready. /* pick a free location */ a ← pred(b ) . They proposed a simple conservative test: two variables interfere if one is live at a definition point of the other and this definition is not a copy between the two variables. /* now. c ← loc(a ) . are treated with pinned variables. with this conservative interference definition. forall the (a → b ) ∈ P do loc(b )← ⊥ . ready. By revisiting this approach to address the coalescing of copies Rastello et al. after coalescing some variables the interference graph has to be updated or rebuilt. pred(n ) ← ⊥ . The notion of value may be approximated using data-flow analysis on specific lattices [11] and under SSA form simple global value numbering [228] can be used.’s approach in the light of this remark and proposed the value based interference described in this chapter. Leung and George [172] addressed SSA destruction for machine code. can be overwritten 10 while 11 15 16 17 18 19 */ b ← to_do. */ /* look for remaining copy */ /* break circuit with copy */ /* now.push(b ) . one extra fresh variable n Output: List of copies in sequential order ready ← [] . [63] in the context of register allocation. Still they noticed that. /* (unique) predecessor */ to_do. The value based technique described here can also obviously be used in the context of register allocation even if the code is not under SSA form. A counting mechanism to update the interference graph was proposed. Boissinot et al. available in b */ 14 if a = c and pred(a ) = ⊥ then ready. to_do ← [] . [220] pointed out and fixed a few errors present in the original al- .pop() . The ultimate notion of interference was discussed by Chaitin et al.

gorithm. The first technique to address speed and memory footprint was proposed by Budimlić et al. [56]. While being very efficient in minimizing the introduced copies, this algorithm is quite complicated to implement and not suited for just-in-time compilation. It proposes the de-coalescing technique, revisited in this chapter, which exploits the underlying tree structure of the dominance relation between variables of the same merged-set. Last, this chapter describes a fast sequentialization algorithm that requires the minimum number of copies. A similar algorithm has already been proposed by C. May [188].

CHAPTER 23
Register Allocation (F. Bouchez, S. Hack)
Progress: 90%. Text and figures formatting in progress.

Register allocation maps the variables of a program to physical memory locations. Ideally, as many operations as possible should draw their operands from processor registers without loading them from memory beforehand. Due to the huge latency of the memory hierarchy (even loading from L1 cache takes three to ten times longer than accessing a register), register allocation is one of the most important optimizations in a compiler. There is only a small number of registers available in a CPU, with usual values ranging from 8 to 128. Hence, the task of register allocation is not only assigning the variables to registers but also deciding which variables should be evicted from registers and when to store and load them from memory (spilling). The compiler determines the location for each variable and each program point. Furthermore, register allocation has to remove spurious copy operations (copy coalescing) inserted by previous phases in the compilation process, and to deal with allocation restrictions that the instruction set architecture and the runtime system impose (register targeting).

23.1 Introduction

Register allocation is usually performed per procedure. In each procedure, a liveness analysis determines for each variable the program points where the variable is live. A variable is live at a program point if there exists a path from this point on which the variable will be used later on and is not overwritten before that use.

Hence, storage needs to be allocated for that variable, ideally a register. The set of all program points where a variable is live is called the live-range of the variable. The resource conflict of two variables is called interference and is usually defined via liveness: Two variables interfere if (and only if) there exists a program point where they are simultaneously live.¹ The number of live variables at a program point is called the register pressure at that program point. The maximum register pressure over all program points in a procedure is called the register pressure of that procedure, or "Maxlive."

It is helpful to think of interferences as an undirected interference graph: The nodes are the variables of the program, and two nodes are connected if they interfere. The set of variables live at some program point form a clique in this graph: Any two variables in this set interfere, hence their nodes are all mutually connected. The size of the largest clique in the interference graph is called its clique number, denoted ω. A k-coloring is a mapping from the nodes of the graph to the first k natural numbers (called colors) such that two neighbouring nodes have different colors. The smallest k for which a coloring exists is called the chromatic number, denoted χ. A coloring of this graph with k colors corresponds to a valid register allocation with k registers. Hence, we will use the terms "register" and "color" interchangeably in this chapter. Of course χ ≥ ω holds for every graph, because the nodes of a clique must all have different colors.

Chaitin et al. [61] showed that for every undirected graph, there is a program whose interference graph is this graph. From the NP-completeness of graph coloring then follows the NP-completeness of register allocation. That is, determining the chromatic number of an interference graph is not possible in polynomial time unless P=NP. This is the major nuisance of classical register allocation: The compiler cannot efficiently determine how many registers are needed. The somewhat unintuitive result is that, even if at every program point there are no more than Maxlive variables live, we still might need more than Maxlive registers for a correct register allocation!² Maxlive being only a lower bound on the chromatic number, this means one might need more registers than Maxlive, as can be seen in Figure 23.1. In that example, the register pressure is at most two at every program point. However, the interference graph cannot be colored with two colors: its chromatic number is three. The inequality between Maxlive and the chromatic number is caused by the cycle in the interference graph. The causes of this problem are control-flow merges.

This situation changes if we permit live-range splitting, i.e., inserting a copy (move) instruction at a program point that creates a new variable. Thus, the value of the variables is allowed to reside in different registers at different times.

¹ This definition of interference by liveness is an overapproximation. There are refined definitions that create less interferences. However, in this chapter we will restrict ourselves to this definition.
² In theory, the gap between the chromatic number and Maxlive can even be arbitrarily large. This can be shown by applying Chaitin's proof of the NP-completeness of register allocation to Mycielski graphs.
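As a small, self-contained illustration of this gap (not an example taken from the book), the Python sketch below builds interference edges from hypothetical per-point live sets: every live set forms a clique, Maxlive is 2, yet the resulting odd cycle needs three colors.

from itertools import combinations

def interference_edges_and_maxlive(live_sets):
    edges, maxlive = set(), 0
    for live in live_sets:
        maxlive = max(maxlive, len(live))
        edges |= {frozenset(p) for p in combinations(live, 2)}  # a live set is a clique
    return edges, maxlive

edges, maxlive = interference_edges_and_maxlive(
    [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}, {"e", "a"}])
print(maxlive)     # 2
print(len(edges))  # 5: the 5-cycle a-b-c-d-e, whose chromatic number is 3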

23. the node e in the graph is split into two nodes. 23. This interplay between live-range splitting and colorability is the key issue in register allocation. which degrades the runtime of the compiled program. Then. then it either spills this variable to memory.1 Non-SSA Register Allocators As finding a valid k -coloring is NP-complete. or make the graph harder to color for the heuristic. Both cases generate additional spills. i.” and not because we are actually out of registers. we might spill a variable because the heuristic is “not good enough. or frees a register by spilling the variable it contains. The interference graph degenerates into several disconnected components and its chromatic number drops to Maxlive. making its chromatic number equal to two.. The problem here is that the spilling decision is made to revive the coloring heuristic.1 Introduction 311 a ← ··· b ← ··· c ← a + ··· e←b ←c d ← ··· e ← a + ··· ←d d a e b c (a) Example program and live-ranges of its variables Fig. This is of course unacceptable because we never want to add memory accesses in favor of an eliminated copy. coalescing may increase the chromatic number of the graph.1 Example program and its interference graph (b) Interference graph In Figure 23.23. Even worse.e. At the same time. undoing live-range splitting) during coloring.1. In an extreme setting. assume we split the live-range of e in the left block: we rename it by e and add a copy e ← e at the end of the block. Maxlive. e and e . classical algorithms perform copy coalescing (i. one could split the live-ranges of all variables after every instruction.e. Of course.. If that algorithm fails to assign a register to a variable.1a. However. assigning registers is performed by some heuristic algorithm. Thus. and not because the variable that gets spilled is in itself a good candidate for spilling. that do not interfere: this breaks the cycle in the graph. such an extreme live-range splitting introduces a lot of shuffle code. existing techniques often apply conservative coalescing . if done aggressively.

3 It has even been 3 This also implies that chordal graphs do not have “holes” like the interference graph in Figure 23. Chordal graphs can be optimally colored in linear a time with respect to the number of edges in the graph.312 23 Register Allocation— (F. at the expense of the quality of the coalescing. induces a tree on the control-flow graph.2 SSA form to the rescue of register allocation The live-ranges in an SSA form program all have a certain property: SSA requires that all uses of a variable are dominated by its definition. . Hack) approaches which are guaranteed not to increase the chromatic number of the graph. Hence. 51. they are perfect. 131]. Under SSA form. The argument and result variables of that φ -function constitute new live-ranges. Hence a situation like in Figure 23. Thus. This is a very important property for register allocation because it means that Maxlive equals the chromatic number of the graph.2 SSA version of the example program b c e2 (b) Interference graph This special structure of the live-ranges leads to a special class of interference graphs: Gavril [121] showed that the intersection graphs of subtrees are the chordal graphs. however. 23. the clique number equals the chromatic number. the whole liverange is dominated by the definition of the variable. 23. S. They can branch downwards on the dominance tree but have a single root: the program point where the variable is defined. the live-ranges of SSA variables are all tree-shaped [41. which means that for every subgraph. Every cycle is spanned by chords.1. the live-range of e is split by a φ -function.1. giving more freedom to the register allocator since they can be assigned to different registers. Furthermore. Bouchez.1 can no longer occur: e had two “roots” because it was defined twice. Dominance. e2 ) (a) example program in SSA Fig. a ← ··· b ← ··· c ← a + ··· e2 ← b ←c d ← ··· e1 ← a + · · · ←d d a e1 e3 e3 ← φ(e1 .

. However. .3. . . . . 23. . Under SSA. . Obviously. if Maxlive ≤ R . . . . for every clique in the interference graph. . the number of registers. This implies inserting load instructions in front of every use and store 4 Considering all simultaneously alive variables interfere. So. . . . This can be done by checking on every program point the number of variables that are alive.” The pseudo-code is given on Figure 23. there is no need for spilling. . lower the register pressure to k everywhere in the program. . spilling is mandatory if Maxlive is strictly greater than R . This allows to decouples spilling from coloring: First. . . color the interference graph with k colors in polynomial time. . . The register pressure at a program point is the number of variables alive at this point. . .2 Spilling 313 shown [41. Maxlive is the coloring number of the chordal interference graph of the program (TODO: ref to chapter 2?). . i.. . We only need to know which uses are “last-uses. we can devise a polynomial test to know whether spilling is necessary or not by computing Maxlive. . . but double-check my words ! TODO: seb: do we need this? doing a standard liveness analysis does the job. .2 Spilling The term spilling is heavily overloaded: In the graph-coloring setting it often means to spill a node of the interference graph and thus the entire live-range of a variable. . .3 Maxlive and colorability of graphs TODO: Florent: This part does not seem to be useful. . 131] that. . . this is not required as Maxlive can easily be computed without liveness information by performing a scan of the dominance tree. . . the order does not matter. . . .1. . • • • • definition of Maxlive show that Maxlive is the minimum number of colors required scan of the dominance tree sufficient to know Maxlive: pseudo-code Fab: MAXLIVE minimum if interference graph==intersection graph. . the good news is that this is the only case where spilling is required. in the case of SSA. . . . . . TODO: check Maxlive scan needs last uses. . there is one program point where all the variables of that clique are live. starting from the root. Then.23. 23. . . Clearly we do not want to go into such details but be careful to use precise and correct terms. . since it is the maximum number of simultaneously alive variables. .4 The greatest register pressure over all a program is called Maxlive. . . .e. If liveness information is available on the whole program. Indeed. we can also refer to Appel’s liveness algo.

even in this simplistic setting. {}) Fig.2. Hack) function compute-branch (p.314 23 Register Allocation— (F.3 Pseudo-code for Maxlive instructions after each definition of the (non-SSA) variables.0 compute-branch (root. It is even NP-complete to find the minimum number of nodes to spill to decrease Maxlive just by one! Finding spill candidates is not facilitated by SSA.1 Spilling under the SSA Form As stated above.max( maxlive. live) live <. Other approaches. For the moment. Let us start with a simplistic setting where the location of loads and stores is not optimized: loads are placed directly in front of uses and stores directly after the definition. Furthermore we consider a spill everywhere setting: a variable is either spilled completely or not at all. TODO: figure? Flo: if space allows it. Because many compilers use SSA heavily. 23. the live-range of a variable then degenerates into small intervals: one from the definition and the store.{ uses in p that are last uses } live <.2 Finding Spilling Candidates We need to find program points to place loads and stores such that Maxlive ≤ R . Thus. let us ignore how to choose “what” to spill and “where. lowering Maxlive to R ensures that a register allocation with R registers can be found in polynomial time for SSA programs. 23. However. #live ) foreach child c of p compute-branch (c.2. . we consider spilling as an SSA program transformation that establishes Maxlive ≤ R by inserting loads and stores into the program. are able to keep parts of the live-range of a variable in a register such that multiple uses can potentially reuse the same reload.live . live) function compute-maxlive maxlive <. spilling should takes place before registers are assigned and yield a program in SSA form. 23.” and concentrate on the implications of spilling in SSA programs. S. yes. it is a reasonable assumption that the program was in SSA before spilling. Hence. If spilled.live + { definitions of p } maxlive <. it is NP-complete to find the minimum number of nodes to establish Maxlive ≤ R . Bouchez. and one from each load to its subsequent use. like some linear scan allocators.

Moreover, formulations like spill everywhere are often not appropriate for practical purposes. A variable might be spilled because at some program point the pressure is too high. However, if that same variable is later used in a loop where the register pressure is low, a spill everywhere will place a superfluous (and costly!) load in that loop. Spill everywhere approaches try to minimize this behavior by adding costs to variables, to make the spilling of such a variable less likely. These bring in a flow-insensitive information that a variable resides in a frequently executed area. However, such approximations are often too coarse to give good performance. Hence, it is imperative to cleverly split the live-range of the variable according to the program structure and spill only parts of it.

In the context of a basic block, a simple algorithm that works well is the "furthest first" algorithm. The idea is to scan the block from top to bottom and, wherever the register pressure is too high, to spill the variable whose next use is the furthest away; and we spill it only up to this next use. Spilling this variable frees a register for the longest time, hence diminishing the chances to have to spill other variables later. This algorithm is not optimal since the first time a variable is spilled, a load and a store are added. For subsequent spilling of the same variable (but on a different part of its live-range), only a load must be added. Hence this algorithm may produce more stores than necessary (e.g., by spilling two variables while spilling another part of the first one would have been enough), but it gives good results on "straight-line code," i.e., basic blocks.

We will present in the next section an algorithm that extends this "furthest first" algorithm to general control-flow graphs. We then have to find how and in which order to process the basic blocks of the control-flow graph in order to determine a reasonable initial register occupation for each block. The idea is to use a data-flow analysis to approximately compute the next-use values on the control-flow graph. In this next-use calculation, loops are virtually unrolled by a large factor such that code "behind" loops appears far away from code in front of the loop. This forces the spilling algorithm to prefer variables that are not used in loops and to leave in registers the parts of live-ranges that are used inside loops.

23.2.3 Spilling under SSA

Still, we give a possibility to perform spilling under SSA:
• Hack's algorithm: simplified version?
• à la Belady following dominance tree?
• Fab: The last algorithm from Sebastian is Belady-like. So yes, try to simplify it, provide the pseudo-code and give the intuitions on how the full version improves it.
• Seb: I think it's ok to briefly outline the approaches that are out there and refer to the papers.
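Coming back to the "furthest first" heuristic on straight-line code described above, a minimal sketch could look as follows. The instruction interface (.uses and .defs as sets) is hypothetical, the placement of the actual loads and stores is left out, and R is assumed to be at least the number of operands and results of any single instruction.

def furthest_first(block, live_in, live_out, R):
    in_regs = set(live_in)
    evictions = []                                   # (instruction index, variable)

    def next_use(v, i):
        return next((j for j in range(i, len(block)) if v in block[j].uses),
                    float("inf"))

    for i, insn in enumerate(block):
        in_regs |= insn.uses | insn.defs             # operands and results need registers
        while len(in_regs) > R:
            candidates = in_regs - insn.uses - insn.defs
            victim = max(candidates, key=lambda v: next_use(v, i + 1))
            in_regs.discard(victim)                  # spill victim up to its next use
            evictions.append((i, victim))
        # values dead from here on no longer occupy a register
        in_regs = {v for v in in_regs
                   if v in live_out or next_use(v, i + 1) != float("inf")}
    return evictions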

it can get stuck. . . . This scheme is based on the observation that given R colors—representing the registers—. and as such. . . Hack) TODO: Ask Sebastian to do it. . . This process can be iterated with the remaining nodes. . the greedy coloring scheme can never get stuck. the interference graph after spilling has two interesting properties. . . .. i. . . hence easily colorable in linear time. In particular. In traditional register allocation. . We will then present a lighter coloring algorithm based on the scanning on the dominance tree. . However. . .pdf in biblio . .e. . . if spilling has already been done so that the maximum register pressure is at most R . .3. we will see that under the SSA form. . . . Bouchez.3 Coloring and coalescing Due to the decoupled register allocation and the use of SSA. . whose degree may have decreased (if the simplified node was one of their neighbours).316 greedy coloring scheme 23 Register Allocation— (F. . this is the trigger for spilling some variables so as to unstuck the simplification process. . if a node in the graph has at most R − 1 neighbours. . . . . . . . 23. . it still is chordal. In that case. and show how it successfully color programs under SSA form. we know that R colors are sufficient to color it since Maxlive ≤ R . The greedy scheme is a coloring heuristic.1 Greedy coloring scheme In traditional graph coloring register allocation algorithms. we know it is possible to color the graph with R colors. . . we do not know whether the graph is R -colorable or not. . . . a valid R -coloring can be obtained by assigning colors to nodes in the reverse order of their simplification. popping nodes from the stack and assigning them one available color. This is always possible since they have at most R − 1 colored neighbours. Such a node node can be simplified. We call this algorithm the greedy coloring scheme. . In this section. . we will explain how to add coalescing to the coloring phase to reduce the number of copies in a program. removed from the graph and placed on a stack. i. it happens whenever all remaining nodes have degree at least R . . . there will always be one color available for this node whatever colors the remaining nodes have. we will first present the traditional graph coloring heuristic of Chaitin et al. 23. First. If the graph becomes empty. Finally. the assignment of registers to variables is done by coloring the interference graph using the greedy heuristic of Chaitin et al. [?]TODO: file braun09cc. . .e. . . . . . Second. .. S.

In this representation. The interference graph in hence the intersection graph of a family of subtrees. but it is in fact quite easy to understand why without resorting to this theory. Since spilling has already been done. say at point d (the instruction that defines the leaf ). that is defined last in this branch. In the graph theory. We can visually see that this variable will not have many intersecting variables. So. hence there is at most R − 1 intersecting live-ranges: the corresponding node in the interference graph . Consider the “leaf” of a branch of the dominance tree. worklist = {v in V | degree[v] < k} . the interference graph can be colored with R colors using the greedy scheme. Consider the greedy coloring scheme described in the previous section. it can be simplified.3. the live-range on this branch which starts last. while worklist != {} do let v in worklist .e. If it does.4 Iskgreedy function 23. then all its neighbour liveranges start before d —by definition of the leaf—and end after d —because they intersect. 23. For all v. k number of colors stack = {} .2 SSA graphs are colorable with a greedy scheme One of the interesting properties of SSA form for register allocation is that the live-ranges of variables are subtrees of the dominance tree (see Chapter 2. foreach w neighbour of v do degree[w] = degree[w]-1 . and once removed another variable will become the new leaf. degree[v] = #neighbours of v in G. if degree[w] = k . i. We will now prove more thoroughly that if the register pressure is at most R . we know that the register pressure at d is at most R . Fig.1 then worklist = worklist U {w} push v on stack .23. Let us consider such a subtree representation of the interference graph (see Figure 23.4). We see that simplification can always happens at the end of the branches of the dominance tree. moving upwards until the whole tree is simplified. it is in fact a candidate for simplification. this means this graph in chordal and as such colorable in polynomial time. worklist = worklist \ {v} . If this leaf does not intersect any other live-range. Section 2. a node candidate for simplification corresponds to a live-range which intersect with at most R − 1 other live-ranges.3 Coloring and coalescing 317 Function Simplify(G) Data: Undirected graph G = (V. /* Remove v from G */ if V != {} then Failure ``The graph is not simplifiable'' return stack .. Proof.5 TODO) and try to apply the greedy coloring scheme to it. all neighbours of the leaf are live at d . and picture a “tree” representation in you head. At the end of each dangling branch there is a “leaf” variable. E).

Fig. we still use the interference graph. the function Is_k_Greedy maintains a stack of simplified variables. By building on our current understanding of why the greedy scheme works for SSA graphs. i. It is also possible to assign colors directly after the greediness has been check during the spilling phase (see Section ??). a structure that takes time to build and uses memory up.5 Subtree We just proved that if we are under SSA form and the spilling has already been done so that Maxlive ≤ R . as we have reduced the interplay (hence complexity) between those two phases. 23.3.6. for instance.. S. this also means that existing register allocators can be easily modified to handle SSA code and that register allocation can benefit from powerful coalescing strategies such as aggressive or conservative ones. nodes that can be simplified. Moreover. the simplification process will empty the whole graph. . One disadvantage remains compared to lighter algorithms. Bouchez.318 23 Register Allocation— (F. • Biased coloring is used to coalesce variables linked by φ -functions. 23. is guaranteed to perform register allocation with R colors without any additional spilling.3 A light tree-scan coloring algorithm • On the dominance tree. Removing a live-range does not change the shape of the graph (the remaining live-ranges are still subtrees of the dominance tree): as long at the graph is not empty. possible to scan from top and assign colors to variables as they come. This stack can be directly used for coloring by making a call to Assign_colors given in Figure 23. Hack) has at most R − 1 neighbours and can be simplified. the fact that register allocation can be decoupled into a phase of spilling first and then a phase of coloring/coalescing allows the writing of more involved coalescing strategies. Indeed. This is practical for instance when working on an existing compiler that uses a classical register allocation algorithm. and colors can be assigned in the reverse order of the simplification. there will still be “leaf” live-ranges. Ok to talk about biased. Eventually. • see next section for more involved coalescing • Fab: I think that you already talked about scan coloring. we will show next how to design a lighter coloring algorithm. Among the good things.e. the classical greedy coloring scheme of Chaitin et al. the Iterated Register Coalescing. Insert subtree figure HERE.

i.e. Proof. However. and use less memory. when the scanning arrives at the definition of a variable. One of the advantages of SSA is to make things simpler. Figure 23.. first come first served! We give the pseudo-code in procedure Tree_Scan. for each neighbour w in G available[color(w)] = False col = 0. where compilation time and memory prints matter more. This means the variables are simply colored in the order of their definitions. . Intuitively. then j < i and the node has already been simplified. For the register assignment problem. it requires constructing and maintaining the interference graph. . a valid order of the nodes on the stack after the greedy simplification scheme. for each color c from 1 to R if available[c] col = c available[c] = True /* prepare for next round */ color(v) = col add v to G Fig. Let us consider an ordering v 1 . . faster. without actually performing any simplification. the only colored variables are “above” it and since there is at most R − 1 other variables live at the definition. .3 Coloring and coalescing 319 Function Assign_colors(G) available = new array of size R and values True while stack != {} do v = pop stack . We will show that this order correspond to a greedy simplification. coloring the variables from the root to the leaves in a top-down order. . then v j appear before v i in the ordering ( j < i ). It works because it mimics the effect of the greedy scheme (“leaves” are simplified first hence colored last ). The algorithm we propose scans the dominance tree. there is no variable defined after v i on the dominance tree: the live-range of v i is a “leaf” of the subtree representation (see proof . . So. v i −1 have been simplified by the greedy scheme (for i = 1. If v i dominates any variable.23. there is always a free color. We will now prove more formally that this method always work. v 2 . . we now know that the existing greedy scheme based on simplification still works. It is not more complicated and it improves results. this is the initial graph). a structure judged to big and cumbersome to be used in a JIT context. We will now show how to perform a fast register assignment for SSA after spilling that does not need more than the dominance tree and defuse chains.6 Assign_color function • Fab: You can talk about biased coloring that uses the result of an aggressive coalescing (see with Quentin). say v j .7. Suppose v 1 . 23. . v n of the nodes of the nodes based on the dominance: if v i dominates v j . The key observation is that the order in which variables are considered corresponds to a reverse order of simplification.

..320 23 Register Allocation— (F.3. Tree_Scan(T) function assign_color(p. Bouchez. As we have seen previously in Section 23. the variables are “split” at join points by the use of φ -functions. coloring a branch would constrain the coloring on leaves on a different branch. S. such live-ranges are split and their colorings are independent. . [True. . The tree-scan coloring algorithm is really fast as it only needs one traversal of the dominance tree. .g. The pseudocode for function choose_color is deliberately not given yet. . So.3.. Hack) in Section 23. True]) Fig. it can be implemented as a bit-set to speed-up updates of the available array. In that case. and by induction the ordering v 1 .. . Under SSA form. we would have live-ranges that spans on multiple branches of the dominance tree. Under SSA.4 Coalescing under SSA form The goal of coalescing is to minimize the number of register-to-register move instructions in the final code.. v i can be chosen for simplification.7 Tree scan coloring algorithm for SSA. A very basic implementation could just scan the available array until it finds one that is not taken. 23. available) /* choose available color */ available[c] = False color(v) = c for each child p' of p assign_color(p'. . Since the number of colors is fixed and small. instruction “a = b.1. 23. available) for each v last use at p available[color(v)] = True /* colors not used anymore */ for each v defined at p c = choose_color(v. v 1 is a valid order for coloring. available) assign_color(root(T). . v n −1 . If this was not the case. See Figure ?? for an example. this means the reverse order v n . It is important to understand why this method does not work in the general non-SSA case. While there may not be so many such “copies” at the high-level (e. creating cycles in the representation. True. .2).” in C)—especially after a phase of copy . We will see in the next section how to bias the choosing of colors so as to perform some coalescing. v n is a valid order of simplification.3.

w ) meaning that v has an affinity with u of weight w . This is possible if we are under conventional SSA (CSSA. .5 if in a conditional branch. . In the case of SSA. . for instance ×10 for each level of nested loop. .)). . ). By adding a metric to this notion. and a ← c on the one from the right. .23. Suppose now that register allocation decided to put a and b in register R 1 . It is also possible to use empirical parameters.4). given two variables a and b . .3 Coloring and coalescing 321 propagation under SSA (TODO: See chapter X)—. a ← φ (b . b . Then. the copy a ← b does not need to be executed anymore. acting as the converse of the relation of interference and expressing how much two variables “want” to share the same register.8. For instance. to the same register. Such two variables cannot be assigned to the same register without breaking the program. the weight of the affinity between them is the number of occurrence of copy instructions involving them (a ← b or b ← a ) or φ -functions (a ← φ (. .1 Biased coalescing Since the number of affinities is sparse compared to the number of possible pairs of variables. The bias is very simple: each choice of color maximizes locally—under the current knowledge—the coalescing for the variable being colored. For instance. adding copies is a common way to deal with register constraints (see Section 23. but optimizations can break this property and interferences can appear between subscripted variables with the same origin. so as to remove copies between subscripts of the same variable (a 1 . etc. Actually. c ) means that instruction a ← b should be executed on the edge coming from the left. .) or b ← φ (. a . We propose a version of function choose_color that bias the choosing of colors based on these affinity lists. but c in register R 2 . a 2 . . 23. An even more obvious and unavoidable reason in our case is the presence of φ -functions due to the SSA form. .3. shown on Figure 23. . . . but a ← c still does: the value of variable c will be contained in R 2 and needs to transfered to R 1 . To bypass this problem and also allow for more involved coalescing between SSA variables of different origins. it measures the benefit one could get if the two variables were assigned to the same register: the weight represents how much instructions we would save at execution if the two variables share the same register. and a more accurate cost for each copy can be calculated using actual execution frequencies based on profiling. see Chapter ??). it is best to keep for each variable v an “affinity list” where each element is a pair (u . many such instructions are added by different compiler phases by the time compilation reaches the register allocation phase. . some parts of the program are more often executed than others. For instance. R 2 ” or the like. it is obviously better to assign variables linked by a φ function.4. which means the final code will contain an instruction “mov R 1 . we define a notion of affinity. A φ -function represents in fact parallel copies on incoming edges of basic blocks. ×0. It is not of course optimal since the general prob- .

S. There is another simple and interesting idea to improve biased coalescing. working on all “greedy-k -colorable” graphs . • Iterated Register Coalescing: conservative coalescing that works on the interference graph • Brute force strategy.3. TODO: what kind of aggressive? Ask fab. We still consider a JIT context where compilation speed is very important. .3 More involved coalescing strategies It is beyond the scope of this book to describe in details deeply involved strategies for coalescing.322 23 Register Allocation— (F. available) for each (u.4. 23. .2 Aggressive coalescing to improve biased coalescing.8 Choosing a color with a bias for coalescing.3. Hack) lem of coalescing is NP-complete.weight) in affinity_list(v) count[color(u)] += weight max_weight = -1 col = 0 for each color c if available[c] == true /* color is not used by a neighbour of v */ if count[c] > max_weight col = c max_weight = count[c] count[c] = 0 /* prepare array for next call to choose_color */ Fig.4. and biased algorithms are known to be easily misguided. Bouchez. 23. 23. So we will just give a short overview of other coalescing schemes that might be use in a decoupled register allocator. global variables count = array of size #colors and containing only zeros choose_color(v.
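To make the affinity bookkeeping of Section 23.3.4 concrete, a minimal sketch is given below. The weights follow the empirical factors mentioned earlier (x10 per loop-nesting level, x0.5 inside a conditional branch), and the instruction interface (.dst, .srcs, .loop_depth, .in_branch) is hypothetical.

from collections import defaultdict

def build_affinities(copy_and_phi_instructions):
    affinity = defaultdict(float)                 # maps a pair of variables to a weight
    for insn in copy_and_phi_instructions:
        weight = (10 ** insn.loop_depth) * (0.5 if insn.in_branch else 1.0)
        for src in insn.srcs:                     # one source for a copy, several for a phi
            if src != insn.dst:
                affinity[frozenset((insn.dst, src))] += weight
    return affinity

Per-variable affinity lists, as used by choose_color in Figure 23.8, can then be derived by scattering each pair to its two members.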

some registers might be dedicated to perform memory operations.1 Handling registers constraints In theory. . . it is not possible to pre-color them with color R 1 as the interference graph could not be chordal anymore. . . and if this constraint appears somewhere else in the program. . . . . the copies ensure the constraint is respected. . . Additionally. . . For instance. . variables involved in constraining operations are split before and after the operation. . . Division in registers R x and R y a /b a in R x . it breaks the SSA property since there would be multiple definition of the same variable. In that case putting them in the same register would break the program. . b in R y Memory load cannot use R z a = l oa d (b ) b cannot be in R z Functions arguments in R 1 . Splitting variables to handle register constraints. . We propose two solutions to deal with this problem.4. if not. . . .9 Examples of register constraints. for instance if they are both the first parameter of successive function calls. . Some examples are given in Figure 23. . . and the second is newer and promising. register allocation algorithms are always nicely working with nodes and colors. The problem with such constraints is that they cannot be expressed directly in the register allocation problem under SSA form. . a is forced to be in R x and is used instead of a in the instruction. . . . . 23. c ) b in R 1 . Depending on the architecture.4 Practical and advanced discussions 23. The first is classic in the literature. If a happens to be in R x after the register allocation. a and b maybe interfere. some instructions expects their operands to reside in particular registers. Traditionally. a also is redefined which also breaks the SSA form. This adds constraints to the register allocation. 23.9. . . For instance. . . . . . not all variables or registers are equivalent. . . . and in general conventions are defined to simplify for example the writing of libraries. In practice. a = f (b . Context x86 st200 ABI Constraint Instruction Effect on reg. by specifying how parameters to functions are passed and how results are returned. In SSA form. . . . . however. the problem is that variable a is R x . Then the instructions a ← a and a ← a are issued before and after the division. . . Furthermore. the copies can be removed. alloc. . c in R 2 result in R 0 a in R 0 Fig. . . . . suppose a is involved in an instruction that requires it to be in register R x .4 Practical and advanced discussions 323 . . . usually by restricting the coloring possibilities of variables. if variable a and b must absolutely reside in register R 1 .23. . . . R 2 .
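A minimal sketch of the splitting approach described above is given below, assuming a hypothetical instruction interface: every constrained operand or result is renamed into a fresh variable, pinned to the required register, and reconnected by a copy, so that the surrounding code keeps its unconstrained names and the SSA property.

def isolate_register_constraints(insn, fresh_name, pin):
    """insn.uses / insn.defs: lists of variable names; insn.constraints: mapping
    from a variable to the register it must occupy at this instruction;
    fresh_name and pin are hypothetical helpers of the compiler."""
    copies_before, copies_after = [], []
    for k, u in enumerate(insn.uses):
        if u in insn.constraints:
            u2 = fresh_name(u)
            copies_before.append((u, u2))        # u2 <- u, inserted just before insn
            pin(u2, insn.constraints[u])
            insn.uses[k] = u2
    for k, d in enumerate(insn.defs):
        if d in insn.constraints:
            d2 = fresh_name(d)
            copies_after.append((d2, d))         # d <- d2, inserted just after insn
            pin(d2, insn.constraints[d])
            insn.defs[k] = d2
    return copies_before, copies_after

If the allocator later happens to place the original variable in the required register anyway, the inserted copies become coalescable and disappear, which matches the behaviour described above.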

d. an affinity of negative weight is added to the list. Their only relation with the other. it is however important to drive the coloring so that it still gives the right color whenever possible.10 Examples of full parallel splitting. d''.324 23 Register Allocation— (F. say R 2 . b. f') Live out = {a''. v a b . and add an interference between a and v a b and an . The use of parallel copies assures that the locally created variables. to the interference graph. Hack) A workaround solution is to split all variables alive before and after the operation using a parallel copy. d. with very short live-ranges (span of only one instruction). Bouchez. e} f = a/b Live out = {a. say a and b with weight w < 0. To minimize the number of conflicts on constraints. S. e. c'. On the contrary. and serves only to materialize affinities. c''. and then intervene to repair the coloring whenever it does not fit the constraints. Repairing problems afterwards. if variable a must reside in register R 1 we add R 1 to the affinity list of a . by using the following trick: we create a new “dummy” variable. d. e) f' = a'/b' (a''. as shown on Figure 23. one for each register. the second copy must must define new variables to keep the SSA property (and subsequent uses must be changed accordingly). b. c'. d''. we add a clique (complete graph) of R nodes. c. The only problem with this method is that many copy instructions can remain in the program if the coalescing is not good enough to assign most of the created variables to the same register. Moreover. e''. Existing coalescing algorithms in the literature usually do not support negative weight. e''. c. b.10. it is possible to emulate a negative weight between two variables. e') = (a. c. if a cannot be assigned to a register. “normal” variable are affinities that each variable share with the two other parts coming from the same original variable. d. e} (a'. f) = (a'. c''. b'. representing the added cost it will required if a is still put in register R 2 . 23. are completely disconnected from the rest of the interference graph. f} Live in = {a. c. d'. However. Another possibility is to let register allocation do its job. After Before Live in = {a. e'. by adding copies afterwards around mismatches. This clique is completely disconnected from the graph. For instance. To do so. d'. f} Fig.

2 Out-of-SSA and edge splitting φ -functions are not actual machine instructions. then copies a ← b and a ← c are created on the predecessors. they must be replaced by actual instructions when going “out-of-SSA. These copies can then be sequentialized using actual simple move instructions. and at the beginning of the current block. consider a ← φ (b . registers R 1 and R 2 form a cycle since their values must be swapped. coalescing v a b and b ensures a and b will not be in the same register. A very simple algorithm to sequentialize copies consist of first dealing with “cycles” in the parallel copy. if the definition and arguments of a φ -function have been allocated to the same register. This elegant method ensures the programs stays semantically correct. Of course. In practice. Hence. which requires three copies. non-SSA coloring. In our case.e. at the end the previous basic blocks. however. This creates a new shortlived variable for each φ -function. but the order of the final instructions is important. variables are already allocated to memory or registers. this translation out-ofSSA happens before register allocation. then it can be safely removed.” When going out-of-SSA after register allocation. The classical “Sreedhar” method involves inserting copies before and after the φ -functions. 23.23. We propose then a different approach. we could save a copy by choosing to save R 2 into R 3 instead of R 1 for the swap. as we come back to regular. for instance. For instance. Trying to color these variables could lead to a dead-end. and a ← a replaces the φ function. . Since the semantics of φ -functions is parallel. in Figure 23. c ). Better algorithms can be found in the literature but we prefer not to describe them here. The problem with this trick is that it adds many useless nodes and interferences in the graph. it is not actually possible to place code on edges. We then replace the remaining φ -functions by adding copies directly on the control-flow edges.11. uncolored variables. Sequentializing is not a difficult task.4 Practical and advanced discussions 325 affinity of weight |w | between v a b and b . First. We use the currently available register R 3 to act as a temporary register during the swap. Then. R 3 gets the value that was contained in R 2 and which is in R 1 after the swap. so we have to create new basic blocks on these edges to hold the added code.. the so-called “swap problem” and “lostcopy problem. and fixes problems of earlier methods.4. i. hence not requiring an actual modification of the graph. Shreedar’s method in this context would generate local.” Traditionally. it is best to modify the coalescing rules to act “as if” there were such a dummy node in case of negative weight affinities. hence arguments and results of φ -functions are not variables anymore but registers. we have to replace them with parallel copies on the edges.
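The first step of this lowering, turning each φ-function into per-predecessor register copies once colors are known, can be sketched as follows; critical edges, which would need to be split or handled as discussed below, are ignored in this sketch and the data layout is hypothetical.

from collections import defaultdict

def lower_phis(phis, reg_of):
    """phis: list of (dest_var, {pred_block: arg_var}) pairs;
    reg_of: mapping from SSA variable to its allocated register."""
    parallel_copies = defaultdict(list)           # predecessor block -> [(src, dst)]
    for dest, args in phis:
        for pred, arg in args.items():
            src, dst = reg_of[arg], reg_of[dest]
            if src != dst:                        # same register: the phi simply vanishes
                parallel_copies[pred].append((src, dst))
    return parallel_copies

Each resulting list is a parallel copy at the end of its predecessor block and must then be sequentialized, breaking cycles with a free register (or a spill, or a hardware swap) as explained above.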

Fig. 23.11 Replacing φ-functions by actual copies: the φ-functions R1 ← φ(R2, ...), R2 ← φ(R1, ...), R3 ← φ(R2, ...) give the parallel copy (R1, R2, R3) ← (R2, R1, R2), sequentialized for instance as R3 ← R1; R1 ← R2; R2 ← R3; R3 ← R1.

An important observation is that cycles can appear in parallel copies, for instance when a parallel copy swaps the values in R1 and R2. These cycles need a "free" register to be sequentialized. What happens if there is no such free register? This happens if there is a cycle and every other register contains a live variable. In those cases, there is no available register to "break" the cycle, so we need to store one value in memory (spill), then perform all copies starting with R1 ← R2 and ending by loading the stored value into Rn. It is also possible to swap integer values using three XOR instructions, in which case spilling is not necessary.⁵

5 Some architectures, like VLIW, can swap the values in registers; in that case, spilling is not necessary either.

23.4.3 Parallel copy motion

We have seen in the previous section that we can split control-flow edges to add copy code to get rid of φ-functions. We can see however that splitting is not always necessary; it depends on the notion of criticality of the edge. We say that an edge is critical if it goes from a block with multiple successors to a block with multiple predecessors. For instance, edges in a regular "if" construct are not critical, since their successors have only one arriving control-flow edge, while the back-edge of a "do...until" loop is critical. Splitting edges has multiple disadvantages: it adds a new indirection in the code (such as a goto instruction); code on this new block cannot be scheduled with code elsewhere; or it might prevent the use of hardware loop accelerators, to name a few.

It is still possible in some cases to move copies out of critical edges. Code added on a non-critical edge can in fact be moved either to the bottom of the predecessor block (move up) or to the top of the successor block (move down), depending on which one has only one exit or entry edge. For instance, if there is a φ-function after the join point of the if-construct of Figure 23.12, the move instructions can be placed at the end of the corresponding true and false branches. However, in the case of a critical edge, it is dangerous to move the copies up or down, as such code would then always be executed, even if the control-flow edge is not taken. For instance, in Figure 23.12, it is not possible to move the copy R1 ← R2 up or down as this would change the behaviour of the program. TODO: figure.

Fig. 23.12 Examples of critical and non-critical edges (critical edge definition, if-construct example, do..until loop example: todo).

For instance, if a parallel copy on a critical edge swaps the values in R1 and R2, then it is possible to move this swap up to the preceding block, as long as the values are swapped back into their correct registers on all other outgoing edges. If all those edges are not critical, then those swaps back can be moved down and no edge has been split. We call a critical edge with this property a weak critical edge; this one is weak at the top, since all other edges leaving the predecessor are non-critical, meaning a copy on it can be moved up. It is also possible to have critical edges weak at the bottom, i.e., all other edges arriving on the successor are non-critical, meaning we can move a parallel copy down. In practice, not all parallel copies can be moved up or down on weak critical edges, since their effects are not always "reversible": for instance, R1 ← R2 erases the value in R1 and, if this value is not saved elsewhere, it is not possible to get it back into R1 on the other edges. We will not elaborate on this subject as our goal here is to show that optimizations are possible, not to cover them extensively. We advise the interested reader to consult the existing literature on the subject, with some pointers given in the last section of this chapter. They will find there extended discussions on the "reversibility" of parallel copies, on the existence of "strong" critical edges (as opposed to "weak" ones), and on what can be done for "abnormal" edges, i.e., control-flow edges that cannot be split. Here we will just conclude that in case we are stuck with our copies, there is always a means to solve our problem by spilling.

23.5 Further Reading & References

This section needs to be completed/written. Stuff to mention for interested readers:
• Examples of coalescing simpler / more involved because of split compilation
• How to virtualize dummy nodes with negative affinities: reference [?]
• swap and lost-copy problems: cite Briggs' thesis
• better sequentialization algorithm
• parallel copy motion
• References to give: Chaitin et al. [61], IRC of George and Appel [123], Coalescing is NP-complete in most cases [42], Repairing ABI constraints: cite Fab's paper on tree-scan [?]

• Spilling techniques are based on page eviction: Belady's algorithm [27].
• Farrach and Liberatore [109] showed that Belady's algorithm gives good results on straight-line code.
• Braun and Hack [45] extended Belady's algorithm to general control-flow graphs.
• Gavril and Yannakakis [281] showed that, even in this simplistic setting (spill-everywhere "with holes"), it is NP-complete to find the minimum number of nodes to spill to establish Maxlive ≤ R.
• Bouchez et al. [43] showed that it is even NP-complete to find the minimum number of nodes to spill to decrease Maxlive just by one.

CHAPTER 24
Hardware Compilation using SSA
Pedro C. Diniz, Philip Brisk
Progress: 90%. Text and figures formatting in progress.

Abstract  This chapter describes the use of an SSA-based high-level program representation for the realization of the corresponding computation using hardware circuits. We begin by highlighting the benefits of using a compiler SSA-based intermediate representation in this hardware mapping process using an illustrative example. In this context we also outline several compiler transformations that benefit from the SSA-based representation of a computation. The following sections describe hardware translation schemes for the core hardware logic, or data-paths, of hardware circuits. We conclude with a brief survey of various hardware compilation efforts, both from academia as well as industry, that have adopted SSA-based internal representations.

24.1 Brief History and Overview

Hardware compilation is the process by which a high-level language, or behavioral, description of a computation is translated into a hardware-based implementation, i.e., a circuit expressed in a hardware design language such as VHDL or Verilog, which can be directly realized as an electrical (often digital) circuit. While, initially, developers were forced to design hardware circuits using schematic-capture tools, high-level behavioral synthesis allowed them over the years to leverage the wealth of hardware-mapping and design exploration techniques to realize substantial productivity gains.

As an example, Figure 24.1 illustrates these concepts of hardware mapping for the computation expressed as x = (a * b) - (c * d) + f.

In Figure 24.1(b) a graphical representation of a circuit that directly implements this computation is presented. Here there is a direct mapping of hardware operators such as adders and multipliers to the operations in the computation. The input data values are stored in the input registers on the left and, a clock cycle later, the corresponding results of the computation are ready and saved in the output registers on the right-hand side of the figure. The execution using this design can use a very simple control scheme, as it simply uses the data in the input registers, waits for a single clock cycle, and stores the outputs of the operations in the output registers. Values are stored in registers and the entire computation lasts a single (albeit long) clock cycle. Overall this direct implementation uses two multipliers, two adders/subtractors and six registers.

In Figure 24.1(c) we depict a different implementation variant of the same computation, this time using nine registers and the same number of adders and subtractors. The increased number of registers allows the circuit to be clocked at a higher frequency as well as to be executed in a pipelined fashion.

Finally, in Figure 24.1(d) we depict yet another possible implementation of the same computation, but using a single multiplier operator. This last version allows for the reuse in time of the multiplier operator and requires thirteen registers as well as multiplexers¹ to route the inputs to the multiplier in two distinct control steps. As is apparent, in this particular case the reduction of the number of operators carries a penalty of an increased number of registers, multiplexers and increased complexity of the control scheme.

1 A 2×1 multiplexer is a combinatorial circuit with two data inputs, a single output and a control input, where the control input selects which of the two data inputs is transmitted to the output. It can be viewed as a hardware implementation of the C programming language selection operator: out = (sel ? in1 : in2).

Fig. 24.1 Variants of mapping a computation to hardware: high-level source code (a); a simple hardware design (b); pipelined hardware design (c); design sharing resources (d).

This example illustrates the many degrees of freedom in high-level behavioral hardware synthesis. Synthesis techniques perform the classical tasks of allocation, binding and scheduling of the various operations in a computation given specific target hardware resources. For instance, a designer can use behavioral synthesis tools (e.g., Mentor Graphics' Monet®) to automatically derive an implementation for a computation as expressed in the example in Figure 24.1(a) by declaring the intent to use a single adder and a single multiplier, automatically deriving an implementation that resembles the one depicted in Figure 24.1(d). The tool then derives the control scheme required to route the data from registers to the selected units so as to meet the designers' goals.

Despite the introduction of high-level behavioral synthesis techniques in commercially available tools, hardware synthesis, and thus hardware compilation, has never enjoyed the same level of success as traditional, software compilation. Sequential programming paradigms popularized by programming languages such as C/C++ and more recently by Java allow programmers to easily reason about program behavior as a sequence of program memory state transitions. The underlying processors and the corresponding system-level implementations present a number of simple unified abstractions – such as a unified memory model (including I/O), a stack, and a heap – that do not exist (and often do not make sense) in customized hardware designs.

Hardware compilation, in contrast, has faced numerous obstacles that have impeded its progress and generality. The inherent complexity of hardware designs has hampered the development of robust synthesis tools that can offer the high-level programming abstractions enjoyed by tools that target traditional architectures and software systems. Solid and robust hardware design implies a detailed understanding of the precise timing of specific operations. Precise timing and synchronization between distinct hardware components are key abstractions in hardware that simply cannot be expressed in languages such as C, C++, or Java. When developing hardware solutions, designers must understand the concept of spatial concurrency that circuits offer. For this reason, alternatives, such as SystemC, have emerged in recent years, which give the programmer considerably more control over these issues, thus substantially raising the barrier of entry for hardware designers in terms of productivity and robustness of the generated hardware solutions.



At best, hardware compilers today can only handle certain subsets of mainstream high-level languages and, at worst, are limited to purely arithmetic sequences of operations with strong restrictions on control flow. Nevertheless, the emergence of multi-core processing has led to the introduction of new parallel programming languages and parallel programming constructs that may be more amenable to hardware compilation than traditional languages and, similarly, abstractions such as a heap (e.g., pointer-based data structures) and a unified memory address space are proving to be bottlenecks with respect to effective parallel programming. For example, MapReduce, originally introduced by Google to spread parallel jobs across clusters of servers, has been an effective programming model for FPGAs; similarly, high-level languages based on parallel models of computation such as synchronous data flow, or functional single-assignment languages, have also been shown to be good choices for hardware compilation. Although the remainder of this chapter will be limited primarily to hardware compilation for imperative high-level programming languages, with an obvious focus on SSA Form, many of the emerging parallel languages will retain sequential constructs, such as control flow graphs, within a larger parallel framework. The extension of SSA Form, and SSA-like constructs, to these emerging languages is an open area for future research; however, the fundamental uses of SSA Form for hardware compilation, as discussed in this chapter, are likely to remain generally useful.


24.2 Why use SSA for Hardware Compilation?
Hardware compilation, unlike its software counterpart, offers a spatially oriented computational infrastructure that presents opportunities to leverage information exposed by the SSA representation. We illustrate the direct connection between the SSA representation and hardware compilation using the mapping of the computation example in Figure 24.2(a). Here the value of a variable v depends on the control flow of the computation, as the temporary variable t can be assigned different values depending on the value of the predicate p. The representation of this computation is depicted in Figure 24.2(b), where a φ-function is introduced to capture the two possible assignments to the temporary variable t in the two control branches of the if-then-else construct. Lastly, we illustrate in Figure 24.2(c) the corresponding mapping to hardware. The basic observation is that the confluence of values for a given program variable leads to the use of a φ-function. This φ-function abstraction corresponds, in terms of hardware implementation, to the insertion of a multiplexer logic circuit.



Fig. 24.2 Basic hardware mapping using SSA representation.

This logic circuit uses the Boolean value of a control input to select which of its inputs' values is propagated to its output. The selection or control input of a multiplexer thus acts as a gated transfer of value that parallels the actions of an if-then-else construct in software. Equally important in this mapping to hardware is the notion that the computation in hardware can now take a spatial dimension. In the suggested hardware circuit in Figure 24.2(c) the computation derived from the statements in both branches of the if-then-else construct can be evaluated concurrently by distinct logic circuits. After the evaluation of both circuits the multiplexer defines which set of values is used, based on the value of its control input, in this case the value of the computation associated with p. In a sequential software execution environment, the predicate p would be evaluated first, and then either branch of the if-then-else construct would be evaluated, based on the value of p; as long as the register allocator is able to assign t1, t2, and t3 to the same register, the φ-function is executed implicitly; if not, it is executed as a register-to-register copy. There have been some efforts to automatically convert the sequential program above into a semi-spatial representation that could obtain some speedup if executed on a VLIW type of processor. For example, if-conversion (see Chapter 21) would convert the control dependency into a data dependency: statements from the if- and else-blocks could be interleaved, as long as they do not overwrite one another's values, and the proper result (the φ-function) could be selected using a conditional-move instruction. However, in the worst case, this would effectively require the computation of both sides of the branch, rather than one, so it could actually lengthen the amount of time required to resolve the computation. In the spatial representation, in contrast, the correct result can be output as soon as two of the three inputs to the multiplexer are known (p, and one of t1 or t2, depending on the value of p).
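A small C-level sketch makes the spatial reading of this pattern concrete: both branch expressions are computed unconditionally and the C selection operator plays the role of the 2×1 multiplexer. The variable names and the expressions below are illustrative assumptions only; the chapter's figure is not reproduced here.

   /* Spatial evaluation suggested by Figure 24.2: both "branch circuits"
    * run unconditionally, and a multiplexer (here the ?: operator) selects. */
   int mux_style(int a, int b, int p) {
       int t1 = a + b;        /* value computed by the "then" circuit        */
       int t2 = a - b;        /* value computed by the "else" circuit        */
       int t3 = p ? t1 : t2;  /* phi(t1, t2) realized as a multiplexer       */
       return t3;
   }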



When targeting a hardware platform, one advantage of the SSA representation is that assigning a value to each scalar variable exactly once makes the sequences of definitions and uses of each variable both formal and explicit. A spatially-oriented infrastructure can leverage this information to perform the computation in ways that would make no sense in a traditional processor. For example, in an FPGA, one can use multiple registers in space and in time to hold the same variable, and even simultaneously assign distinct values to it; then, based on the outcome of specific predicates in the program, the hardware implementation can select the appropriate register to hold the outcome of the computation. In fact, this is precisely what was done using a multiplexer in the example above. This example highlights the potential benefits of an SSA-based intermediate representation when targeting hardware architectures that can easily exploit spatial computation, namely:

• It exposes the data dependences in each variable computation by explicitly incorporating each variable's def-use chains into the representation. This allows a compiler to isolate the specific values, and thus possible registers, that contribute to each of the assumed values of the variable. Using separate registers for disjoint live ranges allows hardware generation to reduce the amount of resources in multiplexer circuits, for example.
• It exposes the potential for the sharing of hardware registers not only in time but also in space, providing insights for the high-level synthesis steps of allocation, binding and scheduling.
• It provides insight into control-dependent regions where control predicates, and thus the corresponding circuitry, are identical and can thus be shared. This aspect has so far been neglected, but might play an important role in the context of energy minimization.

While the Electronic Design Automation (EDA) community has for several decades now exploited similar information regarding data and control dependences for the generation of hardware circuits from increasingly higher-level representations (e.g., Behavioral HDL), SSA makes these dependences explicit in the intermediate representation itself. Similarly, the more classical compiler representations, using three-address instructions augmented with def-use chains, already expose the same data-flow information as the SSA-based representation. The latter, however, as we will explore in the next section, facilitates the mapping and selection of hardware resources.




24.3 Mapping a Control Flow Graph to Hardware
In this section we are interested in hardware implementations or circuits that are spatial in nature, and thus we do not address the mapping to architectures such as VLIW (Very Long Instruction Word) or Systolic Arrays. While these architectures pose very interesting and challenging issues, namely scheduling and resource use, we are more interested in exploring and highlighting the benefits of the SSA representation which, we believe, are more naturally (although not exclusively) exposed in the context of spatial hardware computations.

24.3.1 Basic Block Mapping
As a basic block is a straight-line sequence of three-address instructions, a simple hardware mapping approach consists in composing or evaluating the operations in each instruction as a data-flow graph. The inputs and outputs of the instructions are transformed into registers² connected by nodes in the graph that represent the operators. As a result of the "evaluation" of the instructions in the basic block, this algorithm constructs a hardware circuit whose input registers hold the values of the input variables to the various instructions, and whose output registers hold only the variables that are live outside the basic block.
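A minimal C sketch of this construction is given below, under the assumption of a simple binary three-address instruction format; the Instr type, the liveness list and the textual "netlist" output are illustrative stand-ins, not any particular tool's data structures, and repeated inputs are reported repeatedly rather than de-duplicated.

   #include <stdio.h>
   #include <string.h>

   typedef struct { char dst[8]; char op; char lhs[8]; char rhs[8]; } Instr;

   /* Was `name` defined by an earlier instruction of the block? */
   static int defined_in_block(Instr *code, int upto, const char *name) {
       for (int i = 0; i < upto; i++)
           if (strcmp(code[i].dst, name) == 0) return 1;
       return 0;
   }

   void build_dfg(Instr *code, int n, const char **live_out, int n_live) {
       for (int i = 0; i < n; i++) {
           if (!defined_in_block(code, i, code[i].lhs))
               printf("input register: %s\n", code[i].lhs);
           if (!defined_in_block(code, i, code[i].rhs))
               printf("input register: %s\n", code[i].rhs);
           printf("operator node %s = %s %c %s\n",
                  code[i].dst, code[i].lhs, code[i].op, code[i].rhs);
       }
       for (int i = 0; i < n_live; i++)        /* only live-out values get */
           printf("output register: %s\n", live_out[i]);   /* output registers */
   }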

24.3.2 Basic Control-Flow-Graph Mapping
One can combine the hardware circuits corresponding to a control-flow graph in two basic approaches: spatial and temporal. The spatial form of combining the hardware circuits consists in laying out the various circuits spatially, connecting variables that are live at the output of a basic block, and therefore the output registers of the corresponding hardware circuit, to the registers that will hold the values of those same variables in subsequent hardware circuits of the basic blocks that execute in sequence. In the temporal approach the hardware circuits corresponding to the various CFG basic blocks are not directly interconnected. Instead their input and output registers are connected via dedicated buses to a local storage module. An execution controller "activates" a basic block or a set of basic blocks by transferring data between the storage module and the input registers of the hardware circuits to be activated.
2 As a first approach these registers are virtual; after synthesis some of them are materialized to physical registers, in a process similar to register allocation in software-oriented compilation.



Upon execution completion, the controller transfers the data from the output registers of each hardware circuit to the storage module. These data transfers do not necessarily need to be carried out sequentially; instead they can leverage the aggregation of the outputs of each hardware circuit to reduce transfer time to and from the storage module via dedicated wide buses.
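The following short C sketch models the control structure of such a temporal execution controller. Each basic-block circuit is represented as a function that consumes input registers and produces output registers; the fixed register-file size, the function-pointer encoding and the next-block selection are assumptions made only for this illustration.

   #include <string.h>

   #define N_VARS 16
   typedef struct { int storage[N_VARS]; } Storage;
   typedef int (*BlockCircuit)(int in[N_VARS], int out[N_VARS]); /* returns next block id */

   void run_temporal(BlockCircuit *blocks, int entry, Storage *st) {
       int in[N_VARS], out[N_VARS];
       int next = entry;
       while (next >= 0) {                        /* -1 signals the exit block   */
           memcpy(in, st->storage, sizeof in);    /* storage -> input registers  */
           memcpy(out, in, sizeof out);           /* untouched values pass through */
           next = blocks[next](in, out);          /* "activate" one block circuit */
           memcpy(st->storage, out, sizeof out);  /* output registers -> storage */
       }
   }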

Fig. 24.3 Combination of hardware circuits for multiple basic blocks: CFG representation (a); spatial mapping (b); temporal mapping (c).

The temporal approach described above is well suited for the scenario where the target architecture does not have sufficient resources to simultaneously implement the hardware circuits corresponding to all basic blocks of interest, as it trades off execution time for hardware resources. In such a scenario, where hardware resources are very limited or the hardware circuit corresponding to a set of basic blocks is exceedingly large, one could opt for partitioning a basic block or set of basic blocks into smaller blocks until the space constraints for the realization of each hardware circuit are met. In reality this is the common approach in every processor today: it limits the hardware resources to the resources required for each of the ISA instructions and schedules them in time, at each step saving the state (registers) that was the output of the previous instruction.



The computation thus proceeds as described above, by saving the values of the output registers of the hardware circuit corresponding to each smaller block. These two approaches, illustrated in Figure 24.3, can obviously be merged in a hybrid implementation. As they lead to distinct control schemes for the orchestration of the execution of the computation in hardware, their choice depends heavily on the nature and granularity of the target hardware architecture. For fine-grain hardware architectures such as FPGAs³ a spatial mapping can be favored; for coarse-grain architectures a temporal mapping is common. While the overall execution control for the temporal mapping approach is simpler, as the transfers to and from the storage module are done upon the transfer of control between hardware circuits, the spatial mapping approach is more amenable to taking advantage of pipelined execution techniques and speculation⁴. The temporal mapping approach can be, however, area-inefficient, as often only one basic block will execute at any point in time. This issue can, nevertheless, be mitigated by exposing additional amounts of instruction-level parallelism by merging multiple basic blocks into a single hyper-block and combining this aggregation with loop unrolling. Still, as these transformations and their combination can lead to a substantial increase of the required hardware resources, a compiler can exploit resource sharing between the hardware units corresponding to distinct basic blocks to reduce the pressure on resource requirements and thus lead to feasible hardware implementation designs. As these optimizations are not specific to the SSA representation, we will not discuss them further here.

24.3.3 Control-Flow-Graph Mapping using SSA
In the case of the spatial mapping approach, the SSA form plays an important role in the minimization of multiplexers and thus in the simplification of the corresponding data-path logic and execution control. Consider the illustrative example in Figure 24.4(a). Here basic block BB0 defines a value for the variables x and y. One of the two subsequent basic blocks, BB1, redefines the value of x, whereas the other basic block, BB2, only reads these variables.

3 Field-Programmable Gate Arrays are integrated circuits with a two-dimensional topology of configurable logic, designed to be configurable after manufacturing.
4 Speculation is also possible in the temporal mode, by activating the inputs and execution of multiple hardware blocks; it is only limited by the available storage bandwidth needed to restore the input context in each block, which in the spatial approach is trivial.
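As a source-level companion to this CFG, the following C fragment is a hedged sketch (the actual statements of Figure 24.4 are not reproduced here, so the expressions are invented): y has a single definition in BB0 and so needs no multiplexer at the join, while x is redefined on one path and does.

   /* BB0 defines x and y; BB1 redefines x; BB2 only reads; BB3 joins. */
   int example(int p, int a) {
       int x = a;          /* BB0 */
       int y = a + 1;      /* BB0 */
       if (p)
           x = y * 2;      /* BB1: redefines x                       */
       /* else path: BB2, reads x and y only                         */
       return x + y;       /* BB3: a phi, and hence a multiplexer,   */
   }                       /*      is needed for x only              */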
A naive implementation based exclusively on live-variable analysis would use multiplexers for both variables x and y to merge their values as inputs to the hardware circuit implementing basic block BB3, as depicted in Figure 24.4(b). As can be observed, however, the SSA-form representation captures the fact that such a multiplexer is only required for variable x. The value for the variable y can be propagated either from the output value in the hardware circuit for basic block BB0 (as shown in Figure 24.4(c)) or from any other register that has a valid copy of the y variable. The direct flow of the single definition point to all its uses, across the hardware circuits corresponding to the various basic blocks in the SSA form, thus allows a compiler to use the minimal number of multiplexers strictly required⁵.

An important aspect regarding the implementation of a multiplexer associated with a φ-function is the definition and evaluation of the predicate associated with each multiplexer's control (or selection) input signal. In the basic SSA representation the selection predicates are not explicitly defined, as the execution of each φ-function is implicit when the control flow reaches it. When mapping a computation with φ-functions to hardware, however, one has to clearly define which predicate to use as part of the hardware logic circuit that drives the multiplexer circuit's selection input signal. The Gated-SSA form previously described in Chapter 18 explicitly captures the necessary information in the representation. As the variable name associated with each gating predicate fulfills referential transparency like any other SSA variable (as opposed to the base name of the variable, which might have been assigned a value that does not correspond to the actual value used in the gate), this variant of the SSA form is particularly useful for hardware synthesis.

Fig. 24.4 Mapping of variable values across hardware circuits using spatial mapping: CFG in SSA form (a); naive hardware circuit using live-variable information for multiplexer placement (b); hardware circuit using φ-function for multiplexer placement (c).

An important aspect regarding the implementation of a multiplexer associated with a φ -function is the definition and evaluation of the predicate associated with each multiplexer’s control (or selection) input signal. In the basic SSA representation the selection predicates are not explicitly defined, as the execution of each φ -function is implicit when the control-flow reaches it. When mapping a computation the φ -function to hardware, however, one has to clearly define which predicate to use to be included part of the hardware logic circuit that defines the value of the multiplexer circuit’s selection input signal. The GatedSSA form previously described in Chapter 18 of the SSA form explicitly captures the necessary information in the representation. As the variable name associated with each gating predicate fulfills the referential transparency as any SSA vari5

Under the scenarios of a spatial mapping and with the common disclaimers about static control-flow analysis.

The generation of the hardware circuit simply uses the register that holds the corresponding variable's version value of the predicate. Figure 24.5 illustrates an example of a mapping using the information provided by the Gated-SSA form.

Fig. 24.5 Hardware generation example using Gated-SSA form: (a) original code; (b) Gated-SSA representation; (c) hardware circuit implementation using spatial mapping.

When combining multiple predicates in the Gated-SSA form it is often desirable to leverage the control-flow representation in the form of the Program Dependence Graph (PDG) described in Chapter 18. In this representation, basic blocks that share common execution predicates (i.e., that execute under the same predicate conditions) are linked to the same region nodes. Nested execution conditions are easily recognized, as the corresponding nodes are hierarchically organized in the PDG representation. When generating code for a given basic block, an algorithm will examine the various region nodes associated with that basic block and compose (using AND operators) the outputs of the logic circuits that implement the predicates associated with these nodes. If a hardware circuit already exists for the same predicate, the implementation simply reuses its output signal; otherwise, it creates a new hardware circuit. This lazy code generation and predicate composition achieves the goal of hardware circuit sharing, as illustrated by the example in Figure 24.6, where some of the details were omitted for simplicity.
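The sketch below captures the spirit of this lazy composition in C: a circuit is "built" for a predicate expression only if no identical one exists, and a block's execution condition is obtained by ANDing its region predicates. The string-keyed cache and the printf "netlist" are assumptions made for the illustration, not the behavior of any particular synthesis tool.

   #include <stdio.h>
   #include <string.h>

   #define MAX_SIG 64
   static char exprs[MAX_SIG][128];
   static int  n_sig = 0;

   static int signal_for(const char *expr) {
       for (int i = 0; i < n_sig; i++)
           if (strcmp(exprs[i], expr) == 0)
               return i;                            /* reuse existing circuit */
       snprintf(exprs[n_sig], sizeof exprs[n_sig], "%s", expr);
       printf("build circuit s%d = %s\n", n_sig, expr);
       return n_sig++;
   }

   /* Execution condition of one basic block = AND of its region predicates. */
   int block_condition(const char **preds, int n) {
       char acc[128];
       snprintf(acc, sizeof acc, "%s", preds[0]);
       int sig = signal_for(acc);
       for (int i = 1; i < n; i++) {                /* one AND gate per term  */
           char next[128];
           snprintf(next, sizeof next, "(%s & %s)", acc, preds[i]);
           sig = signal_for(next);
           snprintf(acc, sizeof acc, "%s", next);
       }
       return sig;
   }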

Fig. 24.6 Use of the predicates in region nodes of the PDG for mapping into the multiplexers associated with each φ-function: a) source code; b) the PDG representation skeleton; c) outline of the hardware design implementation.

24.3.4 φ-Function and Multiplexer Optimizations

We now describe a set of hardware-oriented transformations that can be applied to reduce the amount of hardware resources devoted to multiplexer implementation. Although these transformations are not specific to the mapping of computations to hardware, the explicit representation of the selection constructs in SSA makes it very natural to map, and therefore manipulate/transform, the resulting hardware circuit using multiplexers. Other operations in the intermediate representation (e.g., predicated instructions) can also yield multiplexers in hardware without the explicit use of SSA Form.

A first transformation is motivated by a well-known result in computer arithmetic: integer addition scales with the number of operands. Building a large unified k-input integer addition circuit is more efficient than adding k integers two at a time. Moreover, hardware multipliers naturally contain multi-operand adders as building blocks: a partial product generator (a layer of AND gates) is followed by a multi-operand adder called a partial product reduction tree. For these reasons, there have been several efforts in recent years to apply high-level algebraic transformations to source code with the goal of merging multiple addition operations with the partial product reduction trees of multiplication operations. The basic flavor of these transformations is to push the addition operators toward the outputs of a data-flow graph, so that they can be merged at the bottom. Examples of these transformations that use multiplexers are depicted in Figure 24.7(a,b). Figure 24.7(c) depicts a similar transformation that merges two multiplexers sharing a common input, while exploiting the commutative property of the addition operator. The SSA-based representation facilitates these transformations, as it explicitly indicates which values are involved in the computation of the corresponding results, and thus, by tracing backwards in the representation, which operations can be merged. In the case of Figure 24.7(b) the compiler would quickly detect a as a common variable in the two expressions associated with the φ-function. As can be seen in Figure 24.7(a), the transformation leads to the fact that an addition is always executed, unlike in the original hardware design.

Fig. 24.7 Multiplexer-operator transformations: juxtaposing the positions of a multiplexer and an adder (a,b); reducing the number of multiplexers placed on the input of an adder (c).

A second transformation that can be applied to multiplexers is specific to FPGAs, whose basic building block consists of a k-input lookup table (LUT) logic element which can be programmed to implement any k-input logic function. For example, a 3-input LUT (3-LUT) can be programmed to implement a multiplexer with two data inputs and one selection bit; similarly, a 6-LUT can be programmed to implement a multiplexer with four data inputs and two selection bits. However, many FPGA devices are organized using 4-LUTs, which are too small to implement a multiplexer with four data inputs, but leave one input unused when implementing multiplexers with two data inputs. These features can be explored to reduce the number of 4-LUTs required to implement a tree of multiplexers.

24.3.5 Implications of using SSA-form in Floor-planning

For spatially oriented hardware circuits, the length of wires dictates the maximum allowed hardware clock rate for synchronous designs. As the boundaries of basic blocks are natural synchronization points, where values are captured in hardware registers, moving a φ-function from one basic block to another can alter the length of the wires that are required to transmit data between the hardware circuits corresponding to the various basic blocks. We illustrate this effect via the example depicted in Figure 24.8. In this figure each basic block is mapped to a distinct hardware unit, whose spatial implementation is approximated by a rectangle. A floor-planning algorithm must place each of the units in a two-dimensional plane while ensuring that no two units overlap. For the example in Figure 24.8(a), placing block 5 on the right-hand side of the plane results in several mid-range and one long-range wire connections. In contrast, placing block 5 at the center of the design will virtually eliminate all mid-range connections, as all connections corresponding to the transmission of the values for variable x are now next-neighboring connections.

As illustrated by this example, moving a multiplexer from one hardware unit to another can significantly change the dimensions of the resulting unit. Changing the dimensions of the hardware units fundamentally changes the placement of modules, which is not under the control of the compiler, so it is very difficult to predict whether moving a φ-function will actually be beneficial. For this reason, compiler optimizations that attempt to improve the physical layout must be performed using a feedback loop, so that the results of the lower-level CAD tools that produce the layout can be reported back to the compiler.

Fig. 24.8 Example of the impact of φ-function movement in reducing hardware wire length.

24.4 Existing SSA-based Hardware Compilation Efforts

Several research projects have relied on SSA-based intermediate representations that leverage control- and data-flow information to exploit fine-grain parallelism. Often, but not always, these efforts have been geared towards mapping computations to fine-grain hardware structures such as the ones offered by FPGAs. The standard approach followed in these compilers has been to translate a high-level programming language, such as Java in the case of the Sea Cucumber [259] compiler or C in the case of ROCCC [129], into sequences of intermediate instructions. These sequences are then organized in basic blocks that compose the control-flow graph (CFG). For each basic block a data-flow graph (DFG) is typically extracted, followed by conversion into SSA representation, possibly using predicates associated with the control flow in the CFG, thus explicitly using the Predicated SSA representation (PSSA [60, 248]). As an approach to increase the potential amount of exploitable instruction-level parallelism (ILP), many of these efforts (as well as others, such as the earlier Garp compiler [59]) restructure the CFG into hyper-blocks [179]. A hyper-block consists of a single-entry multi-exit region derived from the aggregation of multiple basic blocks, thus serializing longer sequences of instructions. As not all instructions are executed in a hyper-block (due to early exit of a block), hardware circuit implementation must rely on predication to exploit the potential for additional ILP.

The CASH compiler [57] uses an augmented predicated SSA representation with tokens to explicitly express synchronization and handle may-dependences, thus supporting speculative execution. This fine-grain synchronization mechanism is also used to serialize the execution of consecutive hyper-blocks, thus greatly simplifying code generation.

The following section taken from the main body contains many references that I removed. I moved all the references found in the main body here. Please reorganize it.
• Removed the reference [260] about gated-SSA in the section on SSA-based CFG mapping. Please insert it here if appropriate.
• I removed this citation [7] concerning if-conversion and cross-referred to the if-conversion chapter instead.
• Removed the reference for VLIW architectures [116] and for Systolic Arrays [164]. Can be put here if you feel this is necessary.
• Removed in the section on basic CFG mapping the references [137, 152]. Might be put here: for example, MapReduce, originally introduced by Google to spread parallel jobs across clusters of servers [95], has been an effective programming model for FPGAs [282]; similarly, high-level languages based on parallel models of computation such as synchronous data flow [169], or functional single-assignment languages, have also been shown to be good choices for hardware compilation [137, 152].
• Removed this citation given for resource sharing in the section on basic CFG mapping [189].
• I did not see any reference for the FPGAs discussed in the basic CFG mapping section; one should be given, together with a sentence summarizing what an FPGA is.
• Put the reference for VHDL [17]; a sentence summarizing what VHDL is would be helpful. Also Verilog [256] and SystemC [208].
• Add a reference for Mentor Graphics if you think this is appropriate.
• Removed the reference associated with the sentence "Building a large unified k-input integer addition circuit is more efficient than adding k integers two at a time" [245, 58].
• Removed the reference associated with the sentence "Figure 24.7(c) depicts a similar transformation that merges two multiplexers sharing a common input, while exploiting the commutative property of the addition operator" [268].
• Removed the reference associated with the sentence "These features can be explored to reduce the number of 4-LUTs required to implement a tree of multiplexers" [190].
• Removed the reference [152] in the section on floor-planning.
• Removed the reference to the PDG [111].

CHAPTER 25
Building SSA form in a compiler for PHP
Paul Biggar and David Gregg, Lero@TCD, Trinity College Dublin
Progress: 90%. Text and figures formatting in progress.

25.1 Introduction

Dynamic scripting languages such as PHP, Python and Javascript are among the most widely-used and fastest growing programming languages. Scripting languages provide flexible high-level features, a fast modify-compile-test environment for rapid prototyping, strong integration with traditional programming languages, and an extensive standard library. Scripting languages are widely used in the implementation of many of the best known web applications of the last decade such as Facebook, Wikipedia, Gmail and Twitter. Many prominent web sites make significant use of scripting, not least because of strong support from browsers and simple integration with back-end databases.

One of the most widely used scripting languages is PHP, a general-purpose language that was originally designed for creating dynamic web pages. PHP has many of the features that are typical of dynamic scripting languages. These include simple syntax, dynamic typing, interpreted execution and run-time code generation. These simple, flexible features facilitate rapid prototyping, exploratory programming, and, in the case of many non-professional websites, a copy-paste-and-modify approach to scripting by non-expert programmers.

Constructing SSA form for languages such as C/C++ and Java is a well-studied problem, and these solutions have been tried and tested in production-level compilers over many years. Techniques exist to handle the most common features of static languages. In these traditional languages it is not difficult to identify a set of scalar variables that can be safely renamed.

Any scalar variables which do not have their address taken may be renamed. In Java, all scalars may be renamed. In C and C++, it is straightforward to identify which variables may be converted into SSA form; significant numbers of such variables can be found with very simple analysis, and better analysis may lead to more such variables being identified.

In our study of optimizing dynamic scripting languages, specifically PHP, we find this is not the case. The information required to build SSA form — that is, some conservatively complete set of unaliased scalars, and the locations of their uses and definitions — is not available directly from the program source, and cannot be derived from a simple analysis. Instead, we find a litany of features whose presence must be ruled out in order to obtain a non-pessimistic estimate. Scripting languages commonly feature run-time code generation, built-in functions, and variable-variables, all of which may alter arbitrary unnamed variables, most dangerously a function's local symbol-table. Less common — but still possible — features include the existence of object handlers which have the ability to alter any part of the program state. Ruling out the presence of these features requires precise, inter-procedural, whole-program analysis.

The rest of this chapter describes our experiences with building SSA in phc, our open source compiler for PHP. We identify the features of PHP that make building SSA difficult, outline the solutions we found to some of these challenges, and draw some lessons about the use of SSA in analysis frameworks for PHP. We discuss the futility of the pessimistic solution, the analyses required to provide a precise SSA form, and how the presence of variables of unknown types affects the precision of SSA.

25.2 SSA form in traditional languages

In traditional, non-scripting languages, such as C, C++ and Java, it is trivial to convert each scalar variable into SSA form. Figure 25.1 shows a simple C function, whose local variables may all be converted into SSA form. For each statement, the list of variables which are used and defined is immediately obvious from a simple syntactic check.

Fig. 25.1 Simple C code
1: int factorial(int num) {
2:   int result = 1;
3:
4:   while ( num > 1 ) {
5:     result = result * num;
6:     num--;
7:   }
8:   return result;
9: }

By contrast, Figure 25.2 contains variables which cannot be trivially converted into SSA form. On line 3, the variable x has its address taken. As a result, to convert x into SSA form, we must know that line 4 modifies x, and introduce a new version of x accordingly. To discover the modification of x on line 4 requires an alias analysis. Alias analyses detect when multiple program names (aliases) represent the same memory location. In Figure 25.2, x and *y alias each other. Chapter hssa describes HSSA form, a powerful way to represent indirect modifications to scalars in SSA form.

Fig. 25.2 C code with pointers
1: int func() {
2:   int x = 5, *y;
3:   y = &x;
4:   *y = 7;
5:   return x;
6: }

There are many variants of alias analysis, of varying complexity. The most complex analyse the entire program text, taking into account control flow, function calls, and multiple levels of pointers. In fact, it is not difficult to perform a very simple alias analysis. Address-taken alias analysis identifies all variables whose addresses are taken by a referencing assignment. All variables whose address has been taken anywhere in the program are considered to alias each other (that is, all address-taken variables are in the same alias set). When building SSA form with address-taken alias analysis, variables in the alias set are not renamed into SSA form; all other variables are converted. Variables not in SSA form do not possess any SSA properties. As a result of address-taken alias analysis, it is straightforward to convert any C program into SSA form. Indeed, this allows more complex alias analyses to be performed on the SSA form.
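A minimal C sketch of address-taken alias analysis follows; the Statement representation, in which an address-taking assignment records which variable's address it takes, is an assumption made purely for the illustration.

   #include <stdio.h>

   /* One pass over the statements collects every variable whose address is
    * taken; together they form a single alias set, excluded from renaming. */
   typedef struct { const char *lhs; const char *addr_taken_of; } Statement;

   void address_taken(Statement *stmts, int n,
                      const char **alias_set, int *n_alias) {
       *n_alias = 0;
       for (int i = 0; i < n; i++)
           if (stmts[i].addr_taken_of != NULL)     /* e.g. "y = &x" records x */
               alias_set[(*n_alias)++] = stmts[i].addr_taken_of;
   }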

25.3 PHP and aliasing

In PHP, it is not possible to perform an address-taken alias analysis. The syntactic clues that C provides — notably as a result of static typing — are not available in PHP. Indeed, statements which may appear innocuous may perform complex operations behind the scenes, affecting any variable in the function, and without sophisticated analysis pessimistic assumptions must be made. The most common appearance of aliasing in PHP is due to variables that store references. Figure 25.3 shows a small piece of PHP code; creating references in PHP does not look so very different from C. Note that in PHP variable names start with the $ sign.

Fig. 25.3 Creating and using a reference in PHP
1: $x = 5;
2: $y =& $x;
3: $y = 7;

On first glance the PHP code in Figure 25.3 is not very different from the similar C code in Figure 25.2. From the syntax it is easy to see that a reference to the variable x is taken. However, the problem is actually with the variable y, which contains the reference to the variable x. In Figure 25.3, the variable y becomes a reference to the variable x. Once y has become an alias for x, the assignment to y (in line 3) also changes the value of x. Thus, it is clear that x cannot be easily renamed. There is no type declaration to say that y in Figure 25.3 is a reference, and PHP's dynamic typing makes it difficult to simply identify when this occurs. PHP variables may be references at one point in the program, and stop being references a moment later; or, at a given point in a program, a given variable may be a reference variable or a non-reference variable depending upon the control flow that preceded that point. A sophisticated analysis over a larger region of code is therefore essential to building a more precise conservative SSA form.

The size of this larger region of code is heavily influenced by the semantics of function parameters in PHP. Consider the PHP function in Figure 25.4(a). Superficially, it resembles the C function in Figure 25.4(b). In the C version, we know that x and y are simple integer variables and that no pointer aliasing relationship between them is possible. In fact, a relatively simple analysis can show that the assignment to x in line 2 of the C code can be optimized away, because it is an assignment to a dead variable. They are separate variables that can be safely renamed.

Fig. 25.4 Similar (a) PHP and (b) C functions with parameters
(a) 1: function snd($x, $y) {
    2:   $x = $x + $y;
    3:   return $y;
    4: }
(b) 1: int snd(int x, int y) {
    2:   x = x + y;
    3:   return y;
    4: }

In the PHP version, whether the formal parameters x and/or y are references depends on the types of the actual parameters that are passed when the function is invoked. If a reference is passed as a parameter to a function in PHP, the corresponding formal parameter in the function also becomes a reference. Thus, x may alias y upon function entry. This can happen if x is a reference to y, or vice versa, or if both x and y are references to a third variable. It is important to note, however, that the possibility of such aliasing is not apparent from the function prototype or any type declarations. The addition operation in line 2 of Figure 25.4(a) may therefore change the value of y, if x is a reference to y or vice versa. In the PHP version, there are no syntactic clues that the variables may alias. Furthermore, recall that dynamic typing in PHP means that whether or not a variable contains a reference can depend on the control flow leading to different assignments: on some executions of a function the passed parameters may be references, whereas on other executions they may not. As a result, a simple conservative aliasing estimate — similar to C's address-taken alias analysis — would need to place all variables in the alias set. This would leave no variables available for conversion to SSA form. Instead, an inter-procedural analysis is needed to track references between functions.

25.4 Our whole-program analysis

PHP's dynamic typing means that program analysis cannot be performed a function at a time. As function signatures do not indicate whether parameters are references, this information must be determined by inter-procedural analysis. Furthermore, each function must be analysed with full knowledge of its calling context. This requires a whole-program analysis. We present an overview of the analysis below. A full description is beyond the scope of this chapter, and can be found elsewhere [31].

25.4.1 The analysis algorithm

The analysis is structured as a symbolic execution. This is similar in principle to the sparse conditional constant propagation (SCCP) algorithm of Chapter sccp.¹ This means the program is analysed by processing each statement in turn, and modelling the effect of the statement on the program state. The SCCP algorithm models a function at a time; instead, our algorithm models the entire program.

The execution state of the program begins empty, and the analysis begins at the first statement in the program, which is placed in a worklist. The worklist is then processed a statement at a time. For each analysed statement, the results of the analysis are stored. If the analysis results change, the statement's successors (in the control-flow graph) are added to the worklist. This is similar to CFG-edges in the SCCP algorithm. There is no parallel to the SSA edges, since the analysis is not performed on the SSA form. As a result, loops must be fully analysed if their headers change. This analysis is therefore less efficient than the SCCP algorithm, in terms of time. It is also less efficient in terms of space: SSA form allows results to be compactly stored in a single array, using the SSA index as an array index, which is very space efficient. Since we cannot use SSA form, we must instead store a table of variable results at all points in the program.

1 The analysis is actually based on a variation, conditional constant propagation [213].
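A C sketch of this intra-procedural worklist loop is given below; Stmt, State, and the transfer, merge, and successor helpers are assumptions standing in for phc's real data structures, and the inter-procedural call handling described next is deliberately omitted.

   #include <stddef.h>

   typedef struct Stmt  Stmt;
   typedef struct State State;

   extern State *transfer(Stmt *s, State *in);     /* model one statement        */
   extern State *merge(State *a, State *b);        /* join two abstract states   */
   extern int    changed(State *before, State *after);
   extern State *input_of(Stmt *s);
   extern void   set_input(Stmt *s, State *st);
   extern Stmt **successors(Stmt *s, int *n);      /* CFG successors             */

   void analyse(Stmt *entry, State *initial) {
       Stmt *worklist[4096];                        /* sketch: fixed capacity    */
       int top = 0;
       set_input(entry, initial);
       worklist[top++] = entry;
       while (top > 0) {
           Stmt *s = worklist[--top];
           State *out = transfer(s, input_of(s));
           int n;
           Stmt **succ = successors(s, &n);
           for (int i = 0; i < n; i++) {
               State *old = input_of(succ[i]);
               State *new = old ? merge(old, out) : out;
               if (old == NULL || changed(old, new)) {  /* result changed:       */
                   set_input(succ[i], new);             /* re-analyse successor  */
                   worklist[top++] = succ[i];
               }
           }
       }
   }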

Upon reaching a function or method call, the analysis begins analysing the callee function, pausing the caller's analysis. The analysis results are copied into the scope of the callee. A new worklist is created, and initialized with the first statement in the callee function. The worklist is then run until it is exhausted. If another function call is analysed, the process recurses. Once a worklist has ended, the analysis results for the exit node of the function are copied back to the calling statement's results.

25.4.2 Analysis results

Our analysis computes and stores three different kinds of results. Each kind of result is stored at each point in the program.

The first models the alias relationships in the program in a points-to graph [105]. A points-to graph is stored for each point in the program. The graph contains variable names as nodes, and the edges between them indicate aliasing relationships. An aliasing relationship indicates that two variables either must-alias or may-alias; two unconnected nodes cannot alias. Graphs are merged at CFG join points.

Secondly, constants are identified and propagated through the analysis of the program, as described in Chapter sccp. Where possible, the algorithm resolves branches statically using propagated constant values, like the SCCP algorithm. Resolving these branches statically eliminates unreachable paths, leading to significantly more precise results from the analysis algorithm. This is particularly valuable because phc, our ahead-of-time compiler for PHP, creates many branches in the intermediate representation during early stages of compilation.

Finally, using our alias analysis, our analysis also computes a conservative estimate of the types of variables in the program: the set of possible types of each variable is stored at each point in the program. Since PHP is an object-oriented language, polymorphic method calls are possible, and they must be analysed. This portion of the analysis closely resembles using SCCP for type propagation [170].

25.4.3 Building def-use sets

To build SSA form we need to be able to identify the set of points in a program where a given variable is defined or used. Since we cannot easily identify these sets due to potential aliasing, we build them as part of our program analysis. Using our alias analysis, any variables which may be written to or read from during a statement's execution are added to a set of defs and uses for that statement. These are then used during construction of the SSA form. For an assignment by copy, $x = $y:

1. $x's value is defined,
2. $x's reference is used (by the assignment to $x),
3. $y's value is used,
4. $y's reference is used,
5. in addition, for each alias $x' of $x, $x'’s value is defined and $x'’s reference is used; if the alias is only possible, $x'’s value is may-defined instead of defined.

For an assignment by reference, $x =& $y:

1. $x's value is defined,
2. $x's reference is defined (it is not used — $x does not maintain its previous reference relationships),
3. $y's value is used,
4. $y's reference is used.

It is important to note that, due to potential aliasing and the potential side effects of some difficult-to-analyse PHP features (see Section 25.5), many of the definitions we compute are may-definitions. Whereas a normal definition of a variable means that that variable will definitely be defined, a may-definition means that the variable may be defined at that point². The final distinction from traditional SSA form is that we do not only model scalar variables. All names in the program, such as fields of objects or the contents of arrays, can be represented in SSA form.

25.4.4 HSSA

Once the set of locations where each variable is defined or used has been identified, we have the information needed to construct SSA. In order to accommodate these may-definitions in our SSA form for PHP we use a variation of hashed SSA, described in Chapter hssa. As in HSSA, we use χ nodes to model may-definitions.

2 Or more precisely, a may-definition means that a conservative analysis is unable to show that there does not exist some execution of the program where the variable is defined at that point.

25.5 Other challenging PHP features

For simplicity, the description so far of our algorithm has only considered the problems arising from aliasing due to PHP reference variables. However, in the process of constructing SSA in our PHP compiler we encountered several other PHP language features that make it difficult to identify all the points in a program where a variable may be defined. In this section we briefly describe these language features and how they may be dealt with in order to conservatively identify all may-definitions of all variables in the program.

25.5.1 Run-time symbol tables

Figure 25.5 shows a program which accesses a variable indirectly. On line 2, a string value is read from the user and stored in the variable var_name. On line 3, some variable, whose name is the value stored in var_name, is set to 6. It is not possible to know whether the user has provided the value "x", and so we cannot know whether $x has the value 5 or 6 when it is printed on line 4.

1: $x = 5;
2: $var_name = readline();
3: $$var_name = 6;
4: print $x;

Fig. 25.5 PHP code using a variable-variable

This feature is known as variable-variables. It is possible because a function's symbol table in PHP is available at run-time. Variables in PHP are not the same as variables in C: a C local variable is a name which represents a memory location on the stack, whereas a PHP local variable is an entry in a map from strings to values, and the same run-time value may be the target of multiple string keys (references, discussed above). Variable-variables allow this symbol table to be accessed dynamically at run-time, allowing arbitrary values to be read and written: any variable can be updated, and the updated variable is chosen by the user at run-time.

Upon seeing a variable-variable, we must conservatively assume that all variables may be written to. In HSSA form, this creates a χ node for every variable in the program. In order to reduce the set of variables that might be updated by assigning to a variable-variable, the contents of the string stored in var_name may be modelled using string analysis [271]. String analysis is a static program analysis that models the structure of strings. For example, string analysis may be able to tell us that the name stored in the variable-variable (that is, the name of the variable that will be written to) begins with the letter "x". In this case, all variables which do not begin with "x" do not require a χ node, leading to a more precise SSA form.
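A minimal sketch of how a string-analysis result could prune the χ nodes created for the assignment $$var_name = 6 follows. It is hypothetical Python; summarising the possible contents of $var_name as a regular expression is a simplification of the string analysis of [271].

import re

def chi_targets_for_variable_variable(all_variables, name_pattern=None):
    # Return the variables that need a chi node for '$$var_name = ...'.
    if name_pattern is None:
        return set(all_variables)   # no information: every variable may be written
    rx = re.compile(name_pattern)
    return {v for v in all_variables if rx.fullmatch(v)}

if __name__ == "__main__":
    variables = {"x", "xcount", "y", "total"}
    print(chi_targets_for_variable_variable(variables))            # all four variables
    print(chi_targets_for_variable_variable(variables, r"x.*"))    # only 'x' and 'xcount'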



25.5.2 Execution of arbitrary code
PHP provides a feature that allows an arbitrary string to be executed as code. This eval statement simply executes the contents of any string as if those contents appeared inline at the location of the eval statement. The resulting code can modify local or global variables in arbitrary ways, and the string may be computed by the program or read in from the user or from a web form. The PHP language also allows the contents of a file to be imported into the program text using the include statement. The name of the file is an arbitrary string which may be computed or read in at run time, so any file can be included in the text of the program, even one that has just been created by the program. Thus, the include statement is potentially just as flexible as the eval statement for arbitrarily modifying program state. Both of these may be modelled using the same string analysis techniques [271] as discussed in Section 25.5.1. The eval statement may also be handled using profiling [119], which restricts the set of possible eval statements to those which are actually used in practice.
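As a rough illustration, the sketch below models eval and include as a single "unknown code" effect that may-defines (and may-uses) every visible variable, unless string analysis or profiling [119] can narrow the code that may actually run. This is hypothetical Python with invented names, not PHC's implementation.

def unknown_code_effects(visible_variables, known_snippets=None):
    # Model 'eval($s)' or 'include($f)' at a program point. If profiling or
    # string analysis supplies the snippets that can actually run, only their
    # pre-analysed effects are used; otherwise every visible variable is
    # treated as may-used and may-defined.
    if known_snippets is None:
        return set(visible_variables), set(visible_variables)
    may_uses, may_defs = set(), set()
    for effects in known_snippets:
        may_uses |= effects["uses"]
        may_defs |= effects["defs"]
    return may_uses, may_defs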

25.5.3 Object handlers
PHP's reference implementation allows classes to be partially written in C. Objects which are instantiations of these classes can have special behaviour in certain cases. For example, the objects may be dereferenced using array syntax, or a special handler function may be called when the variable holding the object is read or written to. The handler functions for these special cases are generally unrestricted, meaning that they can contain arbitrary C code that does anything the programmer chooses. Being written in C, they are given access to the entire program state, including all local and global variables. These two characteristics make handlers very powerful, but they break any attempt at function-at-a-time analysis. If one of these objects is passed into a function and is read, it may overwrite all variables in the local symbol table. The overwritten variables might then have the same handlers; they can then be returned from the function or passed to any other called function (indeed, a handler can itself call any function). This means that a single unconstrained variable in a function can propagate to any other point in the program, and we need to treat this case conservatively.
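The sketch below shows the kind of conservative transfer functions this forces on a function-at-a-time analysis: reading a possibly-handled object may-defines every local variable, and any value flowing out of the function is itself marked as possibly handled. It is hypothetical Python with invented names, intended only to illustrate the propagation described above.

def read_variable(state, var):
    # Transfer function for reading 'var' inside the current function.
    if var in state["maybe_handled"]:
        # A handler may run arbitrary C code: every local may be overwritten,
        # and every local may now itself hold a handled object.
        state["may_defs"] |= set(state["locals"])
        state["maybe_handled"] |= set(state["locals"])
    return state

def call_or_return(state, outgoing_vars, result_var):
    # Arguments and return values carry the 'maybe handled' property along.
    if any(v in state["maybe_handled"] for v in outgoing_vars):
        state["maybe_handled"].add(result_var)
    return state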



25.6 Discussion, implications and conclusion
In this chapter we have described our experience of building SSA in PHC, an ahead-of-time compiler for PHP. PHP is quite different to traditional static languages such as C/C++ and Java. In addition to dynamic typing, it has other dynamic features such as very flexible and powerful reference types, variable-variables, and features that allow almost arbitrary code to be executed.

The main result of our experience is to show that SSA cannot be used as an end-to-end intermediate representation (IR) for a PHP compiler. The main reason is that in order to build SSA, significant analysis of the PHP program is needed to deal with aliasing and to rule out potential arbitrary updates of variables. We have found that in real PHP programs these features are seldom used in ways that make analysis really difficult [31], but analysis is nonetheless necessary to show the absence of the bad uses of these features. In principle our analysis could perform only the alias analysis prior to building SSA, and perform type analysis and constant propagation on SSA. Our experience, however, is that combining all three analyses greatly improves the precision of the alias analysis. In particular, type analysis significantly reduces the number of possible method callees in object-oriented PHP programs.

The complexity of building SSA for PHP suggests that we are unlikely to see SSA used in just-in-time (JIT) compilers for PHP soon. There is a great deal of interest in building JIT compilers for scripting languages, especially Javascript, but most of the effort is aimed at eliminating dynamic type checks using profiling and trace methods [120], rather than static analysis.

Once built, the SSA representation forms a platform for further analysis and optimization. For example, we have used it to implement an aggressive dead-code elimination pass which can eliminate both regular assignments and reference assignments (sketched below).

In recent years there has been increasing interest in the static analysis of scripting languages, with a particular focus on detecting potential security weaknesses in web-based PHP programs. For example, Jovanovic et al. [150] describe an analysis of PHP to detect vulnerabilities that allow the injection of executable code into a PHP program by a user or web form. Jensen et al. describe an alias analysis algorithm for Javascript which works quite differently to ours [31], but works well for Javascript [141].

To our knowledge we are the first to investigate building SSA in a PHP compiler. Our initial results show that building a reasonably precise SSA form for PHP requires a great deal of analysis. Nonetheless, as languages such as PHP are used to build a growing number of web applications, we expect that there will be an increasing interest in tools for the static analysis of scripting languages.
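As an indication of the kind of optimization the SSA form enables, here is a minimal sketch of such a dead-code elimination pass. It is hypothetical Python over a simplified statement representation, not the PHC pass itself: an assignment is removed when none of the SSA versions it defines is used and it has no side effects, and χ and φ operands count as uses, which is what keeps may-defined values alive.

def dead_code_eliminate(statements):
    # Each statement is a dict with 'defs' (SSA versions written), 'uses'
    # (SSA versions read, including chi and phi operands) and
    # 'has_side_effects' (calls, output, writes we cannot see).
    changed = True
    while changed:                      # removing one assignment may kill another
        changed = False
        used = set()
        for s in statements:
            used |= set(s["uses"])
        kept = []
        for s in statements:
            dead = (not s["has_side_effects"]
                    and s["defs"] and not (set(s["defs"]) & used))
            if dead:
                changed = True
            else:
                kept.append(s)
        statements = kept
    return statements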



Acknowledgements
This work was supported in part by Science Foundation Ireland grant 10/CE/I1855 to Lero – the Irish Software Engineering Research Centre (www.lero.ie).


Index

φif-function, 8
ψ-T-SSA, 224, 225
ψ-permutation, 222
ψ-projection, 222
ψ-promotion, 222
ψ-reduction, 221
gamma-function, 8
χ-function, 188
µ-function, 188
φ-function, 7
φ-function, as multiplexer, 7, 8
φ-function, parallel semantic of, 8
φ-web, 21, 36, 302
φ-web, discovery, 36
ψ-function, 218
ψ-web, 225-227
ψ-inlining, 221
aggressive coalescing, 37
aliasing, 187
application binary interface, 305
back-end, compiler, 12, 218
basic-block exit, 308
C-SSA, 223
coalescing, 225
coalescing, of variables, 36
code selection, 267
conservative coalescing, 37
constant propagation, 221
construction, of ψ-SSA, 220
construction, of SSA, 27
control flow graph, 7
conventional SSA, 36
conventional SSA form, 36
conventional SSA form, CSSA, 35
conventional SSA, C-SSA, 37
conventional, making SSA, 38
copy, insertion, 13
copy-folding, 305
copy-propagation, 40
critical edge, 36
data flow analysis, 10
data flow analysis, dense, 10
data flow analysis, non-zero value, 10
data flow analysis, sparse, 11
data flow tree, DFT, 267
data-flow, 190
dead φ-functions, 39
dead code elimination, 39, 191, 221
def-use, 69
def-use chain, 33, 39, 269
def-use chains, 11, 220
dependence, control, 219
dependence, flow-, 218
destruction, of SSA, 27, 223
DJ-graph, 31
dominance edge, D-edge, 31
dominance frontier edge, DF-edge, 31
dominance frontier, construction, 32
dominance frontier, DF, 29
dominance property, 29
dominance property, SSA form with, 35
dominates, strictly, 29
edge splitting, 36, 37
fixed-point arithmetic, 269
forward control flow graph, 16, 190
functional language, 12
gate, 218
Gated-SSA form, 218
global value numbering, 221
Global Value Numbering, 194
grammar, 267
greedy coloring scheme, 326
guard, 217, 218
Hashed SSA form, 187
HSSA, 187
if-conversion, 220
induction variable recognition, 221
insertion of φ-function, pessimistic, 40
insertion, of φ-function, 28
instruction set architecture, 305
interference, 226, 227, 301
interference graph, 19
iterated dominance frontier, DF+, 29
join edge, J-edge, 31
join set, 29
just-in-time compiler, JIT, 11
live-range, 18, 21, 24, 28, 228
live-range splitting, 28, 33
liveness, 19, 226
liveness check, 228
local variable, 39
minimal SSA form, 28, 35
minimal, making SSA, 40
minimal, making SSA non-, 40
name, variable's, 33
normalized-ψ, 225
out-of-SSA, 27
parallel copy, 36
parallel copy, sequentialization of, 38
parallel instruction, 189
partial redundancy elimination, 221
Partitioned Boolean Quadratic Programming, PBQP, 271
pattern matching, 267
phiweb, 223
pinning, 225, 305
predicate, 217, 218
predicated execution, 217
predication, 222
predication, partial, 222
pruned SSA form, 35
pruned SSA, building, 39
reachable, by a definition, 28
reaching definition, 35
reducible CFG, 40
referential transparency, 6
register allocation, 12, 37
register promotion, 198
renaming, of variable, 33
renaming, of variables, 28, 33, 220
rule, base, 273
rule, chain, 268
rule, production, 273
semi-pruned SSA, 39
single definition property, 28, 217
speculation, 222, 223
SSA destruction, 13, 36
ssa destruction, 301
SSA graph, data-based, 269
strict SSA form, 35
strict, making SSA, 38
strict, making SSA non-, 40
symbol, non-terminal, 268
symbol, terminal, 268
top, ⊤, 10
transformed SSA, T-SSA, 37
undefined variable, 35
use-def chains, 39, 190
uses, of ψ-functions, 226, 227
value numbering, 150
variable name, 7, 21, 28
variable version, 190, 302
variables, number of, 13
version, variable's, 33
zero version, 189

References

1. ACM Press, 1993. 2. Proceedings of the ACM SIGPLAN’96 Conference on Programming Language Design and Implementation (PLDI), volume 31. ACM Press, May 1996. 3. Proceedings of the third ACM SIGPLAN International Conference on Functional Programming (ICFP ’98). ACM Press, 1998. SIGPLAN Notices 34(1),January 1999. 4. A. V. Aho and S. C. Johnson. Optimal code generation for expression trees. J. ACM, 23(3):488–501, 1976. 5. Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers Principles Techniques and Tools. Addison Wesley, 1986. 6. Frances E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, 1970. 7. J. Allen, K. Kennedy, C. Porterfield, and J. Warren. Conversion of Control Dependence to Data Dependence. In Proc. of the ACM Conf. on Principles of Programming Languages (POPL), pages 177–189. ACM Press, 1983. 8. J. R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of control dependence to data dependence. In Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, POPL ’83, pages 177–189, New York, NY, USA, 1983. ACM. 9. B. Alpern, M. Wegman, and F. Zadeck. Detecting equality of values in programs. In Conference Record of the Fifteenth Annual ACM Symposium on Principles of Programming Languages, pages 1–11, January 1988. 10. Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. Detecting equality of variables in programs. In POPL ’88: Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 1–11, New York, NY, USA, 1988. ACM. 11. Bowen Alpern, Mark N. Wegman, and F. Kenneth Zadeck. Detecting equality of variables in programs. In Proceedings of the 15th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL ’88, pages 1–11. ACM, 1988. 12. Scott Ananian. The static single information form. Master’s thesis, MIT, September 1999. 13. Andrew W. Appel. Compiling with continuations. Cambridge University Press, 1992. 14. Andrew W. Appel. Modern Compiler Implementation in ML. Cambridge University Press, 1998. 15. Andrew W. Appel. SSA is functional programming. SIGPLAN Notices, 33(4):17–20, 1998. 16. Andrew W. Appel and Jens Palsberg. Modern Compiler Implementation in Java. Cambridge University Press, second edition, 2002. 17. P. Ashenden. The Designer’s Guide to VHDL. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2001. 18. David I. August, Daniel A. Connors, Scott A. Mahlke, John W. Sias, Kevin M. Crozier, BenChung Cheng, Patrick R. Eaton, Qudus B. Olaniran, and Wen-mei W. Hwu. Integrated predicated and speculative execution in the impact epic architecture. In Proceedings of the 25th annual international symposium on Computer architecture, ISCA ’98, pages 227– 237, Washington, DC, USA, 1998. IEEE Computer Society.


. R. ACM Press. Special Issue on Software and Compilers for Embedded Systems. New York. Systems Journal. ABCD: eliminating array bounds checks on demand. and Vivek Sarkar. Compiling Standard ML to Java Bytecodes. Optimizing code generation from ssa form: A comparison between two formal correctness proofs in isabelle/hol. The program dependence graph and vectorization. March 2009. 29. Noyce. Baxter and H. 35. Belady. In Proceedings of the third ACM SIGPLAN International Conference on Functional Programming (ICFP ’98) [3]. 20. 23. Rastislav Bodík. 85(1). Glossop. III. Revisiting out-of-SSA translation for correctness. USA. Alain Darte. Chains of recurrences a method to expedite the evaluation of closed-form functions. Hobbs. 27:381– 423. PhD thesis. 38. Zima. Caroline S. Electronic Notes in Theoretical Computer Science. Samuel Bates and Susan Horwitz. Craig. of Parallel and Distributed Computing.January 1999. Wang. NY. 1989. In Proceedings of the international symposium on Symbolic and algebraic computation. Denis Barthou. Benoit Boissinot. The GEM optimizing compiler system. 25. 24. 37. Alain Darte. USA. IEEE Computer Society Press. 1994. pages 129–140. Christophe Guillon. 30. Steven O. 2012. Washington. J. 22. and efficiency. Davidson. Kenneth MacKenzie. L. Gianfranco Bilardi and Keshav Pingali. In PACT’09: Proceedings of the 18th International Conference on Parallel Architectures and Compilation Techniques. In PLDI ’00: Proceedings of the Conference on Programming Language Design and Implementation. Nick Benton. A Study of Replacement of Algorithms for a Virtual Storage Computer. 2010. Interprocedural Load Elimination for Dynamic Optimization of Parallel Programs. Rajiv Gupta. 21. Fuzzy array dataflow analysis. Richard B. ACM Transactions on Embedded Computing Systems. . 26. In POPL ’93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. and Steffen Mülling. IEEE Computer Society. 33. Andrew Kennedy. SIGPLAN Notices 34(1). and George Russell. Electronic Notes in Theoretical Computer Science. Jean-François Collard. David S. pages 384–396. J. 2000. Lennart Beringer. Jr. and Fabrice Rastello. 34. pages 321–333. Int. Benoît Dupont de Dinechin. 1997. David I. and Vivek Sarkar. Compiler Construction. Mahlke. 141(2):33–51. 27. ACM. pages 1–11. 2007. IBM ˘ S101. Peter W. Paul S. Philip Brisk. ACM. 36. pages 110–125. NY. Electronic Notes in Theoretical Computer Science. Jan Olaf Blech.360 References 19. Simple generation of static single assignment form. 2003. W.. Paul Biggar. pages 110–125. Fall 1992. 2005. Kent D. Digital Technical Journal. ACM. Benoit Boissinot. Parallel Program. SIGPLAN Not. SSI properties revisited. volume 1781 of Lecture Notes in Computer Science. pages 242–249. ABCD: Eliminating array bounds checks on demand. and William B. Neil Faiman. Hwu. 31(5):291–300. ACM. 4(4):121–136. 40(2):210–226. code quality. 31. and Ian Stark. Grove. pages 41–52. Wen-Mei W. USA. In POPL ’89: Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. In PLDI. John Aycock and Nigel Horspool. and Scott A. Lennart Beringer. May 1996. Simple generation of static single-assignment form. August. 2000. 1993. Sep 2009. Blickstein. Rajkishore Barik and Vivek Sarkar. Grail: a functional form for imperative mobile code. Bauer. 1966. 5:78âA¸ 28.. pages 321–333. Functional elimination of phi-instructions. and Eugene V. October 1999. Rajiv Gupta. 2000. USA. Johannes Leitner. 
In International Symposium on Code Generation and Optimization (CGO’09). 176(3):3–20. New York. and Paul Feautrier. 2000. Springer. DC. John Aycock and Nigel Horspool. A framework for generalized control dependence. Incremental program testing using program dependence graphs. New York. Olaf Bachmann. In Proceedings of the 9th International Conference in Compiler Construction. and Fabrice Rastello. NY. The partial reverse if-conversion framework for balancing control flow and predication. R. Trinity College Dublin. Rastislav Bodik. 32. Sabine Glesner. Design and Implementation of an Ahead-of-Time Compiler for PHP.

and L.J. In PLDI ’92: Proceedings of the ACM SIGPLAN 1992 conference on Programming language design and implementation. ACM Transactions on Programming Languages and Systems. and Christophe Guillon. Timothy S. and tools for embedded systems. USA. Cooper. Matthias Braun and Sebastian Hack. August 2005. Revisiting out-of-ssa translation for correctness. In PLDI. Nov 1994. and L. USA. 2009. 53. Briggs. Fabrice Rastello. Timothy S. Alain Darte. USA. Keith D. 48. Technical Report RR2005-33. Foad Dabiri. 25(5):772–779. Cooper. Fast copy coalescing and live-range identification. Benoit Dupont de Dinechin. pages 5–13. 28(8):859–881. T.T. Harvey. 1997. In LCTES ’07: Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages. 2009. 41. K. Simpson. and L Simpson. 54. NY. Taylor Simpson. P. Florent Bouchez. Harvey. 2006. Benoit Dupont de Dinechin. NY. In ODES 4. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization. DC. USA. Springer. 42. P. Reeves. Harvey. IEEE Computer Society. IJES. Cooper. In Compiler Construction 2009. In International . Preston Briggs. DC. Software—Practice and Experience. Benoit Boissinot. 45. Zoran Budimlic. pages 174–189. On the complexity of register coalescing. 4(1):2–16. and Fabrice Rastello.References 361 39. code quality and efficiency. If-conversion ssa framework for partially predicated vliw architectures. Reeves. pages 102–114. Cooper. In CGO ’07: Proceedings of the International Symposium on Code Generation and Optimization. 1998. Florent Bouchez. Preston Briggs. pages 114–125. Taylor Simpson. Timothy J. In Proceedings of the 7th annual IEEE/ACM International Symposium on Code Generation and Optimization. Register Spilling and Live-Range Splitting for SSAForm Programs. Cooper. Washington. Briggs. Bruel. 1998. Harvey. Cooper. Oberg. Florent Bouchez. Practical improvements to the construction and destruction of static single assignment form. Exper. Ken Kennedy. Alain Darte. Brandis and Hanspeter Mössenböck. Keith D. C. Cooper. In International Symposium on Code Generation and Optimization (CGO) ODES-04: Workshop on Optimizations for DSP and Embedded Systems. Practical improvements to the construction and destruction of static single assignment form. 2002. volume 5501 of Lecture Notes In Computer Science. March 2007. Preston Briggs. 49. Revisiting out-of-ssa translation for correctness. and L. Software – Practice and Experience. July 1998. Value numbering. 40. New York. Roozbeh Jafari. 1992. 2009. 50. Oberg.D. France. ACM. on CAD of Integrated Circuits and Systems. Software Practice and Experience. pages 311–321. IEEE Computer Society Press. Keith D. If-conversion ssa framework. and Fabrice Rastello. LIP. IEEE Trans. 47. 56. Philip Brisk. Alain Darte. June 2007. Fabrice Rastello. and Linda Torczon. 51. and Fabrice Rastello. 2006. 43. Pract. March 2006. and Steven W. Timothy J. code quality and efficiency. 2009.. 16(6):1684–1698.. pages 114–125. 55. Zoran Budimli´ c. ENS Lyon. Ken Kennedy. compilers. Christian Bruel. and Steven W. Softw. Christophe Guillon. 44. 27(6):701–724. Register allocation and spill complexity under ssa. Alain Darte. 52. Christian Bruel. K. and Christophe Guillon. and Majid Sarrafzadeh. SIGPLAN. Timothy J. 46. Keith D. Single-pass generation of static singleassignment form for structured languages. If-conversion for embedded vliw architectures. New York. ACM and IEEE. On the complexity of spill everywhere under ssa form. Fast copy coalescing and live-range identification. Marc M. Keith D. 
Practical improvements to the construction and destruction of static single assignment form. Optimal Register Sharing for High-Level Synthesis of SSA Form Programs. CGO ’09. Alain Darte. Harvey. Benoit Boissinot. pages 25–32. Washington. Timothy J. pages 103–112. Rematerialization. ACM. 28(8):859–881. 28(8):859–882.

2002. S. Incremental computation of static single assignment form. Ron Cytron. and Jeanne Ferrante. 2003. Ferrante. and Jeanne Ferrante. Predicated static single assignment. Box 704 10598 Yorktown Heights NY. 1997. Watson Research Center P. IEEE Computer Society. Phi-predication for light-weight ifconversion. F. SIGPLAN Not. DC. Gregory J. Martin E. Computer Languages. 67. IEEE Computer Society. Register allocation & spilling via graph coloring. 61. Weihaw Chuang. The Journal of Supercomputing. W. Journal of Computer Languages. In POPL. Hauser. and P. Ron Cytron. and Patryk Zadarnowski. pages 179–190. John Cocke. A new algorithm for partial redundancy elimination based on ssa form. 21(2):117–130. Cocke. Jong Choi. ACM. J. Mapping a single assignment programming language to reconfigurable systems. B. T. In Proc. Hopkins. Vivek Sarkar. Robert Kennedy. Wawrzynek. Ashok K. 64.. Draper. of the Intl. Washington. Calder. DC. Gregory J. 1989. pages 25– 32. pages 853–863. New York. and W. on Parallel Architectures and Compilation Techniques (PACT’99).362 References Conference on Programming Language Design and Implementation (PLDI’02). 68. 2003. Hopkins. Automatic construction of sparse data flow evaluation graphs. Chaitin. Auslander. J. In SIGPLAN Symp. Box 704 10598 Yorktown Heights NY P. L. Markstein. W. 58. Weihaw Chuang. Computer. O. 59. In POPL ’91: Proceedings of the 18th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. and Peter W. Conf. Chawathe. Ron Cytron. Phi-predication for light-weight ifconversion. 70. Jong-Deok Choi. Chandra. Automatic construction of sparse data flow evaluation graphs. New York. Raymond Lo. pages 55–66. Brad Calder. S. and Jeanne Ferrante. 1991. R. pages 98–105. O. Markstein. 24(7):146–160. pages 223–237. Chow. Chandra. pages 179–190. Chakravarty. Auslander. 66. page 245. 71. In In Proceedings of the International Symposium on Code Generation and Optimization. and J. Simon. 60. NY. and Jeanne Ferrante. Minimizing register usage penalty at procedure calls. Automatic construction of sparse data flow evaluation graphs. Rinker. 6:45–57. pages 85–94. Ross. L. Gabriele Keller. M. pages 273–286. Jong-Deok Choi. Carter. Goldstein. ACM Press. Tu. January 1981. 6:47–57. 69. on Field Programmable Logic and Applications (FPL). and P. A functional perspective on SSA optimisation algorithms. B. Brad Calder. T. A new algorithm for partial redundancy elimination based on ssa form.J. Conf. Jong-Deok Choi. of the 1999 Intl. 1991. Register allocation via coloring. 74. ACM SIGPLAN Notices. 1999. 33(4):62–69. R. February 2002. Fred Chow. Chaitin. M. G. Najjar. In POPL ’91: Proceedings of the Symposium on Principles of Programming Languages. Craig Chambers and David Ungar. E. B. Sun Chan. USA. 32(5):273 – 286. Tu. Fred C. Compiling application-specific hardware. 2000. USA. BÃ˝ uhm. Washington. 82(2). 1981. pages 55–66. In Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization. USA. and J. NY. USA. The Garp architecture and C compiler. In Compiler Construction. Chow. Electronic Notes in Theoretical Computer Science. and P. K. July 1988. 72. Chaitin. and Jeanne Ferrante. Kennedy. 2003. on Compiler Construction (CC’82). J. Marc A. Budiu and S. CGO ’03. pages 55–66. June 2002. . 1996. J. Customization: optimizing compiler technology for self. 63. Manuel M. C. Shin-Ming Liu. M. R. 57. 65. M. 73. IBM T. ACM Press. Hammes. ACM. and Edith Schonberg. a dynamically-typed object-oriented programming language. 
A. A. Chan. In Proceedings of the ACM SIGPLAN ’97 Conference on Programming Language Design and Implementation. 62. 1991. Lo. Register allocation via graph coloring. Liu. In Proceedings of the ACM SIGPLAN ’88 Conference on Programming Language Design and Implementation. Carter. Springer Berlin / Heidelberg. 1997. In Proc. Callahan. 1982.

Click. 84. A vliw architecture for a trace scheduling compiler. B. NY.References 363 75. Timothy J. Syst. Jeanne Ferrante. . Harvey. New York University. 2007. and Ken Kennedy. Jeanne Ferrante. and F. Mark N. PhD thesis. Halbwachs. B. CA. Kenneth Zadeck. ACM. 1987. ACM Transactions on Programming Languages and Systems. 85. Syst. 91. IEEE Computer Society. ACM Trans. October 1991. Efficiently computing static single assignment form and the control dependence graph. Wegman. pages 84–96. 2000. ACM Trans. USA. 248(1-2):243–287. 86. Rice University. K. Nix. and Christopher A. Program. Papworth. Cooper. N. pages 25–35. J. New York. NY. Efficiently computing static single assignment form and the control dependence graph. Improvements to the psi-ssa representation. L. Cousot and N. Mark N. François de Ferrière. 82. 90.. de Ferri. Stoutchinin. Benoît Dupont de Dinechin. In POPL. New York. pages 189–200. 2006. Cytron. CASES ’00. Efficiently computing static single assignment form and the control dependence graph. Ron Cytron. Kenneth Zadeck. Kevin Millikin. USA. 77. Efficiently computing static single assignment form and the control dependence graph. Wegman. 23(5):603–625. Clifford Click. Wegman. DC. USA. Courant Institute of Mathematical Sciences. Code generator optimizations for the st120 dsp-mcu core. Lambda-dropping: transforming recursive equations into programs with block structure. Kenneth Zadeck. IEEE Computer Society Press. 76. Cocke and J. and A. Rosen. 88. 87. Jeanne Ferrante. SCOPES ’07. Rodman. and Paul K. Alain Darte. Birkhauser Boston. David B. Ron Cytron. Scheduling and Automatic Parallelization. October 1991. pages 111–121. O’Donnell. and F. Program. 13(4):451–490. 81. and F. Nielsen. A unified software pipeline construction scheme for modulo scheduled loops. Barry K. Vick. Ron Cytron. 13:451–490. M. J. Ron Cytron. 78. P. Lang. Ron Cytron. ACM Trans. 13(4):451–490. Technical report. February 1995. and synthesis for embedded systems. Robert P. Barry K. New York. 83. Rosen. 1995. Jeanne Ferrante. ACM Trans. Robert P. 4th International Conference. Olivier Danvy and Ulrik Pagh Schultz. 17(6):793–812. Barry K. Lang. Zadek. and Lasse R. 92. Combining Optimizations. Jeanne Ferrante. In Proceedings of the 2000 international conference on Compilers. pages 93–102. 79. Program. In Proceedings of the second international conference on Architectual support for programming languages and operating systems. Lang. 13(4):451–490. On one-pass CPS transformations. 89. volume 1277 of Lecture Notes in Computer Science. In POPL ’89: Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages. 2007. Syst. and F. 1991. K. Rosen.. Global code motion global value numbering. Lang. Schwartz. ACM. Program. pages 180–192. An empirical study of iterative data-flow analysis. Cooper.. 1989. 13(4):451–490. Mark N. architecture. Theoretical Computer Science. In SIGPLAN International Conference on Programming Languages Design and Implementation. Rosen. 2000. Olivier Danvy. April 1970. ACM Trans. Colwell. Rosen. In Parallel Computing Technologies. C. Syst. Keith D.. 1991. An efficient method of computing static single assignment form. USA. 2001. Program. pages 246 – 257. Journal of Functional Programming. R. Kenneth Zadeck. Ferrante. USA. Mark N. 1st edition. Wegman. Operator strength reduction. Barry K. ACM. 1991. In CIC ’06: Proceedings of the Conference on Computing. 2000. 80. Automatic discovery of linear restraints among variables of a program. 1978. and Frederic Vivien. 
Taylor Simpson. Efficiently computing static single assignment form and the control dependence graph. C. Wegman. Yves Robert. Wegman. 93. Kenneth Zadeck.. Lang. and F. 1997. Los Alamitos. Syst. ASPLOS-II. ACM. Barry K. F. Washington. NY. pages 266– 276. Combining Analyses. John J. In Proceedings of the 10th international workshop on Software & compilers for embedded systems. and F. Keith D. PaCT97. Programming languages and their compilers.. Guillon. Mark N. Rosen. Dupont de Dinechin.

Compilers. Archit. Dean and S. London. 1998. Philadelphia. USA. ACM. Dennis. B. UK. New York. UK. 1995. First version of a data flow procedure language. Andreas Krall. Alexandre E. Dhamdhere. The program dependence graph and its use in optimization. Springer-Verlag. Compiler algorithms on if-conversion. May 2000. K. Geoffrey Brown. Ghemawat. Jesse Zhixi Fang. 97. Hendren. 28(2):203–213. J. Evelyn Duesterwald. Compiler algorithms on if-conversion. Erik Eckstein. and Mary Lou Soffa. June 12-13. Springer-Verlag. UK. Rakesh Ghiya. and Bernhard Scholz. In Proceedings of the ACM SIGPLAN 1994 conference on Programming language design and implementation. 109. MICRO 28. 13(11):48–56. and Tools for Embedded Systems (LCTES’08). M. Florian Brandner. ACM. François de Ferrière. In ICS ’88: Proceedings of the 2nd international conference on Supercomputing. Mace. SpringerVerlag. Generating sequential code from parallel code. D. Fisher. pages 564–573. ACM. Bernhard Scholz. Lx: a technology platform for customizable vliw embedded processing. London. speculative predicates assignment and predicated code optimizations. compilers. 51(1):107–113. Peter Wiedermann. In Krisztián Flautner and John Regehr. Proceedings of the 2008 ACM SIGPLAN/SIGBED Conference on Languages. rüthing and steffen’s lazy code motion. Code Instruction Selection Based on SSA-Graphs. pages 49–65. MapReduce: simplified data processing on large clusters. 1974. speculative predicates assignment and predicated code optimizations. In LCTES ’08: Proceedings of the 2008 ACM SIGPLAN-SIGBED conference on languages. editors. ACM. UK. In SODA ’98: Proceedings of the ninth annual ACM-SIAM Symposium on Discrete Algorithms. M. Society for Industrial and Applied Mathematics. pages 357–373. Proceedings Colloque sur la Programmation. and J. J. A variation of knoop. E-path_pre: partial redundancy made easy. USA. USA. and Albrecht Kadlec. NY. Tucson. In SCOPES. Workshop on Software and Compilers for Embedded Systems (SCOPES). pages 31–40. 102. Register allocation for predicated code. IEEE Computer Society Press. Springer-Verlag. Rajiv Gupta. 107. In CC ’94: Proceedings of the Conference on Compiler Construction. ACM Trans. 9(3):319–349. J. Warren. 2008. pages 362–376. 2003. Eichenberger and Edward S. pages 135–153. Los Alamitos. AZ. NY. SIGARCH Comput. Stadel. LCPC ’96. In Proc. Computer. USA. 2008. 108. Drechsler and M. SIGPLAN Notices. 110. News. Simons. 37(8):53– 65. 105. ACM. 104. and B. Florian Brandner. Dennis. PA. Paolo Faraboschi. 95. and Albrecht Kadlec. 2008. In Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing. 2008. B. USA. PLDI ’94. Ferrante. Data flow supercomputers. New York. Context-sensitive interprocedural points-to analysis in the presence of function pointers. pages 582– 592. London. Dietmar Ebner. 101. 2007. New York.364 References 94. 2002. LCPC ’96. J. Improvements to the psi-SSA representation. USA. J. Andreas Krall. CA. Oliver König. Jesse Zhixi Fang. Commun. 106. 1988. Ferrante. Dietmar Ebner. Martin Farach and Vincenzo Liberatore. Davidson. UK. 1997. on Programming Languages and Systems. pages 135–153. Generalized instruction selection using SSA -graphs. 1994. pages 180–191. 103. pages 31–40. 111. Giuseppe Desoli. UK. 1994. Ottenstein. K. pages 242–256. SIGPLAN Notices. On local register allocation. 1997. and tools for embedded systems. and Fred Homewood. Reducing the cost of data flow analysis by congruence partitioning. 
In Proceedings of the 9th International Workshop on Languages and Compilers for Parallel Computing. 28(5):29–38. . Joseph A. 96. NY. 1987. Bernhard Scholz. 1993. 1980. Generalized instruction selection using SSA-graphs. Peter Wiedermann. In Programming Symposium. In Proceedings of the 28th annual international symposium on Microarchitecture. 100. 99. of the 10th Intl. Maryam Emami. and Laurie J. pages 111– 121. London. 98.

pages 1859–1866.. USA. and C. A. In PLDI. 119. Combin. Appel. pages 26–37. 130. PLDI ’09. Register Allocation for Programs in SSA form. UK. Brendan Eich. Amr Sabry. 2009. 2008.. Program. Efficient Algorithms. New York. Embedded computing: a VLIW approach to architecture. Helmut Seidl. 121. ICS ’88. Bruce F. Ottenstein. Parallel Program. . SAC ’09. Beyond induction variables: detecting and classifying sequences using a demand-driven SSA form. Smith.References 365 112. Eric Stoltz. Polynomial precise interval analysis revisited. and Michael Schlansker. Cortes. pages 442–452. 1987. Jesse Ruderman. volume 3923. Taming the ixp network processor. March 2006. 1:422 – 437. Sebastian Hack. Washington. On linearizing parallel code. 36(5):493–520. The Intersection Graphs of Subtrees in Trees are exactly the Chordal Graphs. USA. London. 116. Foster. ACM Trans. USA. Lang. In SAS ’00: Proceedings of the 7th International Symposium on Static Analysis. Springer. ACM. Boris Zbarsky. NY. Predicting conditional branch directions from previous runs of a program. ACM. Joseph A. New York. 115. pages 237–247. Code scheduling and register allocation in large basic blocks. DC. Warren. Gregoire Sutre. MICRO 29. 27:85–95. 2005. 114. September 1992. ACM. Ottenstein. Jason Orendorff. 113. 122. Young. Jeffrey S. 131. Warren. Jeanne Ferrante and Mary Mace. Elsevier Science Publishers B. Najjar. IEEE Computer Society. Program. Richard Johnson. Compiler Construction 2006. and W. David Anderson. 128. The essence of compiling with continuations. Syst. 1985. Unified analysis of array and object references in strongly typed languages. J. Karl J. Sequential abstract-state machines capture sequential algorithms. P. Iterated register coalescing. 127. Jerome Leroux. In Proceedings of the 2009 ACM symposium on Applied Computing. 2003. Cormac Flanagan. Mike Shaver. 18(3):300–324. New York. USA. Freudenberger. 9(3):319–349. Michael P. In Andreas Zeller and Alan Mycroft. Graydon Hoare. pages 114– 125. In Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture. R. Jeanne Ferrante. Fink. and Vivek Sarkar. Daniel Grund. Mitra. and Michael Hicks. and Gerhard Goos. F˘ anic˘ a Gavril. Lang. 9(3):319–349. Sabine Glesner. 2009. 1974. Buyukkurt. Blake Kaplan. Rick Reitmaier.. 125. 2000. J. The program dependence graph and its use in optimization. Gerlek. and Joe D. pages 465–478. The program dependence graph and its use in optimization. In POPL ’85: Proceedings of the 12th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. Michael Furr. Goodman and W. (16):47–56. 124. J. 126. Michael Bebenita. ACM Trans. J. Jong-hoon (David) An. 129. Guo. Lang. New York.. editors. An ASM semantics for SSA intermediate representations. SIGPLAN Not. USA. Fisher and Stefan M. 1987. 1996. In Proceedings of the 2009 ACM SIGPLAN conference on Programming language design and implementation. Jan Reineke.-C. ACM. J.. Dz-ching Roy Ju. Faraboschi. Stephen J. May 1996. Trace-based just-in-time type specialization for dynamic languages. 118. Hsu. Fisher. Thomas Gawlitza. compilers and tools. Yuri Gurevich. ACM Trans. and Reinhard Wilhelm. ACM Transactions on Computational Logic (TOCL). ACM. A compiler intermediate representation for reconfigurable fabrics. and Michael Franz. Karl J. Mohammad R. NY. Z. Gillies. ACM Transactions on Programming Languages and Systems. Syst. Jeanne Ferrante. Syst. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’93) [1]. 2009. Kathleen Knobe. 
Mason Chang. 120. V. Static type inference for ruby. 2000. and Michael Wolfe. B. Andreas Gal. NY. pages 155–174. NY. pages 179–190. B. Lal George and Andrew W. Haghighat. 1995. and Matthias Felleisen. Lal George and Blu Matthias. pages 144–160. Program. David M. 1988. 123. Duba. Springer-Verlag. In Proceedings of the 2nd international conference on Supercomputing. David Mandelin. 117. Global predicate analysis and its application to register allocation. and Joe D.. Int. 1(1):77–111. 17(1):85–122. Theory Ser. Edwin W.

Pingali. Johnson. Pohua P. Hormati. 0:266. 137. (DAC). Kudlur. pages 171– 185. 148. New York. Chang. Roger A. New York. Anders Møller. and Satish Pillai. pages 1–16. 138. 147. Richard E. Ullman. Nancy J. and R. Havanki. MICRO 29. 1985.nec.com/428688. Warter. 1976. USA. In PLDI. and K. volume 201 of LNCS. 12th International Conference on Compiler Construction (CC’03) (April 2003. Rabbah. Springer. 2008. J. 7(1-2):229–248. Gustavo De Veciana. Richard Johnson and Keshav Pingali. D. pages 477–499. Johnson. 135. 143.-F. Code size optimization for embedded processors. 1998. NY. and R. Washington. pages 282–287. November 2004. In PLDI. Mahlke. 139. Hagiescu. Thomas Johnsson. In POPL ’73: Proceedings of the Symposium on Principles of Programming Languages. Dependence-based program analysis. Rabbah. Clustered vliw architectures with predicated switching. on Compilers. 140. Architecture. 142. USA. S. Construction of thinned gated single-assignment form. In Jouannaud [149]. The superblock: an effective technique for vliw and superscalar compilation. Supercomput. pages 258–263. pages 190–203. Banerjia. In In Proceedings of the Design Automation Conference. 133. Available: citeseer. NY.. 149. 150. Global data flow analysis and iterative algorithms. ACM. D. University of Cambridge. Pixy: A static analysis tool for detecting web application vulnerabilities (short paper). W. Computer Laboratory. Proceedings. In Proceedings of the 16th International Symposium on Static Analysis. R. 144. 1994. 2009. May 1993. of the 2008 Intl. Functional Programming Languages and Computer Architecture. Heidelberg. Berlin. Margarida F. IEEE Computer Society. 2009. and Daniel M. Optimus: Efficient Realization of Streaming Applications on FPGAs. In 2006 IEEE Symposium on Security and Privacy. Mahlke. 6rd Workshop on Programming Languages and Compilers for Parallel Computing. Hank. ACM. 1993. William Y. of the 2009 ACM/IEEE Design Automation Conf. Using static single assignment form to improve flowinsensitive pointer analysis. Ullman. Margarida F. Roland G. The program tree structure. pages 207–217. Jacome. Richard Johnson and Michael Schlansker. 146. 1998. pages 41–50. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation. Springer-Verlag. Nenad Jovanovic. Conte. Dependence-based program analysis. Gustavo de Veciana. pages 100–113. 1996. Ouellette. and Peter Thiemann. In Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture. 23(1):158–171. 2003. In Proc. Chen. pages 97–105. Scott A. M. Haab. Neil Johnson and Alan Mycroft. Conf. and Engin Kirda. Matthew S. Holm. Paul Havlak. Hecht and Jeffrey D. 2001. In In Proc. Neil E. D. Rebecca Hasti and Susan Horwitz. Bacon. Lavery. In In Proc. Christopher Kruegel. A computing origami: folding streams in FPGAs. 141. and Synthesis for Embedded Systems (CASES). . Richard Johnson and Keshav Pingali. Treegion scheduling for wide issue processors. 134. 2001. 1973. 136. Analysis techniques for predicated code. pages 78–89. International Symposium on. 1993.nj. Type analysis for javascript. In PLDI ’93: Proceedings of the Conference on Programming Language Design and Implementation. Online]. pages 696–701. USA. ACM. DAC. pages 238– 255. pages 696–701. 2001. A. Bacon.html. W. pages 78–89. Tokuzo Kiyohara. Journal of the ACM (JACM). Clustered vliw architectures with predicated switching. Hwu. 145. 1993. DC. Lambda lifting: Transforming programs to recursive equations. Wong. Simon Holm Jensen.366 References 132. Wen-Mei W. 
2006. In Proceedings of the 38th Design Automation Conference. Grant E. High-Performance Computer Architecture. Kam and Jeffrey D. John B. SAS ’09. Jean-Pierre Jouannaud. editor. and T. A. Pearson. ACM. Jacome. John G. 151. Analysis of a simple algorithm for global data flow problems. Bringmann. S. Technical Report UCAM-CL-TR-607. In Proc. and Satish Pillai. Combined code motion and register allocation using the value state dependence graph. Springer Verlag.

and Varmo Vene. USA. 1994. Compiling with continuations. Technical report. Optimizing compilation with the Value State Dependence Graph. Liu. Twenty-fifth ACM Symposium on Principles of Programming Languages. Reprinted in Higher Order and Symbolic Computation. Rüthing. In Giorgio Levi. 155. University of Cambridge. 159. Multidimensional chains of recurrences. 170. A generalization of jumps and labels. SIGPLAN Not. Chan. In Proceedings of the Seventh International Conference on Compiler Construction. R. 1998. 21(3):627–676. January 1998. SAC ’00. Richard Kelsey. 1995. Allen Leung and Lal George. Kennedy. Theoretical Computer Science. P. New York. Knoop. P. continued. 169. (DAC’03). 1999. 154. R. In Proc. USA. Christopher Sadler. A Systolic Array Optimizing Compiler. Lo. Rüthing. and B. J. USA. Norwell. Ssa-based flow-sensitive type analysis: combining constant and type propagation. Proc. Lang. Steffen. Andrew Kennedy. and B.References 367 152. Streich. Bidirectional data flow analysis: myths and reality. Data communication estimation and reduction for reconfigurable systems. Chow. Peeter Laud. Kluwer Academic Publishers. 1993. Lam. 163. Program. Gupta. ACM Trans. Unrolling-based optimizations for modulo scheduling. ACM Trans. . and B. In Intermediate Representations Workshop. Uday P. 1988. S. Lazy code motion. ACM. Journal of Programming Languages. 364(3):292–310. Khedker and Dhananjay M. pages 199–206. P. 160. Zima. R. 75(9):1235–1245. August 1965. USA. Steffen. pages 616– 621. Alexandre Lenart. Rüthing. O. Static single assignment form for machine code. Syst. 2003. Kennedy. V. 161. Lo. R. 1999. 164. with a foreword by Hayo Thielecke. E. S. 165. 16(4):1117–1155. Lawrence. IEEE Computer Society Press. MICRO 28. Alan C. 1998. of the 40th Design Automation Conf. 156. 1503. Conditional constant propagation of scalar and array references using array SSA form. Partial redundancy elimination in ssa form. 34(6):47–57. Proceedings from the 5th International Static Analysis Symposium. Daniel M. Lecture Notes in Computer Science. New York. In Proceedings of the 2000 ACM symposium on Applied computing . Technical Report UCAM-CL-TR-705. of the IEEE. 167. Knoop. In Proceedings of the ACM SIGPLAN ’92 Conference on Programming Language Design and Implementation. 1992. Kaplan.. and E. Tarmo Uustalu. ACM. O. Lang. In Proceedings of the 28th annual international symposium on Microarchitecture. Liu. 166. Kathleen Knobe and Vivek Sarkar. pages 813–817. UNIVAC Systems Programming Research. Mitrofanov. S. Kastner. J. 162. Rec. 153. Lavery and Wen-Mei W. Knoop. and R. pages 13–23. California. A. 1987. Lazy strength reduction. 1(1):71–91. NY. December 2007. 168. In Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation. pages 177–190. 1995. pages 224– 234. Syst. 2006. F. Lee and D. Springer-Verlag. Program. Computer Laboratory. Peter Landin. 158. Optimal code motion: theory and practice. 1998. In Proceedings of the 1998 international symposium on symbolic and algebraic computation. 171. V. 1989. pages 327–337. Chow. A correspondence between continuation passing style and static single assignment form. CA. Kathleen Knobe and Vivek Sarkar. M. Steffen.. Tu. pages 33–56. ACM Press. Los Alamitos. 11(2):125-143. Type systems equivalent to data-flow analyses for imperative languages. Dhamdhere. P. San Diego. editor. Array SSA form and its use in Parallelization. 157. Synchronous data flow. Brisk. O. Hwu.Volume 2. MA. Strength reduction via ssapre. Tu. Conf. and M.. PLDI ’99. 
and Sandeep K. pages 204–214. Dahl. 1999. 2000. Messerschmitt. J. Kislenkov. S. and F. NY.

and Michael S. Exploiting conditional instructions in code generation for embedded vliw processors. The parallel assignment problem redefined. W. automation and test in Europe.. Hank. In SAC. David I. 175. A comparison of full and partial predicated execution support for ilp processors. 1999. Metzgen and D. May 1993. In Proceedings of the 22nd annual international symposium on Computer architecture. SIGMICRO Newsl. pages 138–150. Lichtenstein. Allen Leung and Lal George. USA. Cathy May. ACM. 180. R. In Proc. and Wen mei W. New York. Liu. New York. Kennedy. Chen. In Proceedings of the fifth international conference on Architectural support for programming languages and operating systems. Cathy May. Effective compiler support for predicated execution using the hyperblock. 1995. Global resource sharing for synthesis of control data flow graphs on FPGAs. Tu. IEEE Computer Society Press. Supercomput. Stefan M. USA. December 1992. G. 7(1-2):51–142. (DAC’03). D. June 1989. IEEE Transactions on. pages 184–188. Scott A. Richard E. Mahlke. Geoffrey Lowney. and T. Bringmann. 15:821– 824. Hwu. pages 138–150. Memik. McCormick. Lin. Multiplexer restructuring for FPGA implementation cost reduction. Eng. Hank.uiuc. Hank. Mahlke. pages 249–260. Ramakrishna Rau. Lin.. M. Mahlke. 182. S. R. IEEE Transactions on Software Engineering. LLVM Website. 23:45–54. DATE ’99. 190. Chen. David C. Mahlke. John S. J. ASPLOS-V. and Roger A. and R. 184.. Softw. pages 604–609. James E. Lin. of the ACM/IEEE Design Automation Conf. 174. 2001. 186. Mahlke. and John Ruttenberg. Pentagons: a weakly relational abstract domain for the efficient validation of array accesses. William Y. Hank. Chen. Schlansker. ISCA ’95. USA. ACM. 183. S. O’Donnell. 1998. In IN PROCEEDINGS OF THE 22TH INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE. In PLDI. R. B. http://llvm. and E. Hank. Mahlke. and P. S. P. December 1992. Fred Chow. Raymond Lo. Bringmann. ACM. 2008. 185. Memik. NY. A comparison of full and partial predicated execution support for ilp processors. A type system equivalent to static single assignment. USA. Register promotion by sparse partial redundancy elimination of loads and stores. August. Rainer Leupers. Wen-mei W. 189. Effective compiler support for predicated execution using the hyperblock. and Roger A. W. June 1989. S. Schlansker.edu. James E. on Microarchitecture (MICRO). R. 188. 1998. 176. 15(6):821–824. Sherwood. 1999.368 References 172. Scott A. In Proceedings of the ACM SIGPLAN ’98 Conference on Programming Language Design and Implementation. The parallel assignment problem redefined. R. R. Symp. 177. Scott A. 181. McCormick. 187. Register promotion by sparse partial redundancy elimination of loads and stores. Schreiber. Richard E. Jafari. Static single assignment form for machine code. Shin-Ming Liu. Chen. F. David I. 178. 1995. Chow. 179. D. CA. SIGMICRO Newsl. and Wen-Mei W. Scott A. David C. 20(11):1355–1371. Computer-Aided Design of Integrated Circuits and Systems. pages 204–214. Los Alamitos. Richard E. Kursun. Robert Kennedy. New York. and Peng Tu. Mahlke. 1992. August. NY. Fancesco Logozzo and Manuel Fahndrich. In Proc. William Y. The multiflow trace scheduling compiler. ACM. . Bringmann. P. of the 40th Design Automation Conf. Effective compiler support for predicated execution using the hyperblock. ACM Press.cs. In Proc. Scott A. NY. Hwu. Nix. 173. 23(1-2):45–54.. Karzes. In Proceedings of the conference on Design. Thomas J. William Y. Nancekievill. pages 45–54. pages 26–37. 
Sentinel scheduling for vliw and superscalar processors. (DAC). Ravindran. pages 238–247. Richard E. Bitwidth cognizant architecture synthesis of custom hardware accelerators. Yutaka Matsuno and Atsushi Ohori. of the 25th Annual Intl. pages 26–37. IEEE Trans. Lo. Robert P. 1992. ACM. Hwu. pages 421–426. In International Conference on Programming Language Design and Implementation (PLDI’99). 2003. Freudenberger. 2005.

data-. Mark Shields. V. 201. 93(1):55–92. Srikant. See also http://www. 196. Padua. print). Ciaran O’Donnell. Diego Novillo. Robert A. Inc. 212.edu/~palsberg/tba. 1979. In Proceedings of the GCC DevelopersâA Summit. and Arthur B. Antoine Miné. Optimal control dependence computation and the roman chariots problem. 2011. 19:31–100. 22(2):96–103. Ballance. 205. Type-based analysis and applications. Shpeisman. E.enst. MacCabe. 202. pages 144–154. V. 1997. Principles of program analysis. In TOPLAS. Robert A. APT: A data structure for optimal control dependence computation. Hanne Riis Nielson. PhD thesis. and P. USA. In Proc. 210. Paleri. 208. Science of Programming Programming. In ICSE. 2005. and Chris Hankin. The octagon abstract domain. Shankar. Robert Morgan. USA. Springer. 2005. Karl J. NY. Global optimization by suppression of partial redundancies. Information and Computation. New York. Secaucus. pages 462–491. In Proceedings of the 6th annual IEEE/ACM international symposium on Code generation and optimization. New York. NY. T. Murphy. pages 175–184. A propagation engine for GCC. Adl-Tabatabai. Eugenio Moggi. NY. Karl J. Technical Report HPL-91-58. USA. and Chris Hankin. Notions of computation and monads. Flemming Nielson. Building an optimizing compiler. Flemming Nielson. Ecole Nationale Superieure des Telecommunications. Ottenstein. pages 20–27. Y. 2005. 1994. Mangala Gowri Nanda and Saurabh Sinha. and Andrew Tolmach. High level compiling for low level machines.N. New York. Jens Palsberg. SystemC: a modeling platform supporting multiple design abstractions. . In PLDI ’90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation. Karl J. Springer. California. On predicated execution. 2001. J. 194. Simon Peyton Jones. Available from ftp. USA. pages 257–271. 207. The program dependence graph in a software development environment. 1984. Panda. Partial redundancy elimination: a simple. Park and M. Principles of program analysis (2. 198. Comput. January 1998. on Systems Synthesis (ISSS’01). USA. Springer. MacCabe. ACM. Flemming Nielson. Encyclopedia of Parallel Computing. Keshav Pingali and Gianfranco Bilardi. pages 257–271. 1990. Hanne Riis Nielson. S. and Arthur B.. 2006. Hanne R. and Chris Hankin. P. 206. F. Principles of Program Analysis. and demand-driven interpretation of imperative languages. B.ucla. 192. Schlansker. 1991. In Proceedings of the ACM SIGPLANSIGSOFT Workshop on Program analysis for software tools and engineering (PASTE’01). ˘Z ´ 200. 1995. Ottenstein and Linda M. Renvoise. of the 14th Intl. Schneider. 203. ACM.K.cs. pages 133–143. corr.References 369 191. pages 49–61. 193. editor. NY. 2003. Nielson. Fault-safe code motion for type-safe languages. 195. The program dependence web: a representation supporting control-. Accurate interprocedural null-dereference analysis for java. In SDE 1: Proceedings of the first ACM SIGSOFT/SIGPLAN software engineering symposium on Practical software development environments. 209. data-. 1991. Morel and C. 2001. and A. Palo Alto. ACM. New York. Ottenstein. Hewlett Packard Laboratories. Bridging the gulf: a common intermediate language for ML and Haskell. 197. ACM. 199. John Launchbury. Communications of the ACM. In PLDI ’90: Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation. ACM.. C. Symp. pages 211–222. 204. pages 75–80. ACM Press. In PLDI. and demand-driven interpretation of imperative languages. Ottenstein. pages 177– 184. 48(1):1–20. 1990. 