
Seminar

Constraint Based Analysis

Toni Suter

Fall Term 2014

Supervised by Prof. Peter Sommerlad


Abstract
Static program analysis is used in compilers and refactoring tools as a
means to examine the structure of a program. The result of an analysis
can help in optimizing the program and in finding bugs in the source
code. Many analyses need some knowledge about the control flow of the
program in order to do their job. However, some programming language
features can make this difficult because their use leads to code that is
very dynamic and therefore hard to analyse at compile time.
This paper briefly explains these problems and then shows how Constraint
Based Control Flow Analysis can be used as a tool that enables
the analysis of such code. Finally, it walks through an example which
illustrates the steps that are involved in this process.

Contents

1 Introduction
  1.1 Types of static program analysis
    1.1.1 Intraprocedural analysis
    1.1.2 Interprocedural analysis

2 Control flow analysis
  2.1 How to represent an analysis result
  2.2 Acceptability
  2.3 Constraint generation
  2.4 Constraint resolution
    2.4.1 Graph Formulation

3 Conclusion

1 Introduction
Constraint Based Analysis is a form of static program analysis that uses
systems of set constraints to describe certain aspects of a program. It
can be used for many different kinds of analyses (e.g., to perform type
checking in a statically typed programming language or to analyse the
control flow in a dynamic programming language where functions are
first-class objects). This paper gives an overview of the various kinds
of static program analysis and explores control flow analysis with the use
of constraints in more detail.

1.1 Types of static program analysis


Static program analysis is often used in compilers to perform optimizations
such as constant folding and common subexpression elimination
[ALSU06]. It can also be used to support stand-alone analysis and
refactoring tools that help in finding bugs and improving the software quality.
Subsections 1.1.1 and 1.1.2 describe the two main types of static program
analysis: intraprocedural analysis and interprocedural analysis.

1.1.1 Intraprocedural analysis


In intraprocedural analysis, the analyser examines individual procedures
in isolation, without the context of the entire program. As mentioned
above, static program analysis is often used in compilers to optimize a
program. Such an optimization consists of an analysis and a transformation.
The analysis examines the program to determine whether it
is possible and worthwhile to do the transformation. The transformation
makes the actual changes that should lead to a “better” (e.g., faster,
smaller or more energy-efficient) program. Many optimizations are
based on intraprocedural analyses [CT11].

For example, Common Subexpression Elimination (CSE) is an optimization
that is based on an intraprocedural data flow analysis called Available
Expression Analysis. Consider the pseudo-code in Listing 1.1:
Listing 1.1: Before optimization
a := (c + d) * 2
b := (c + d) * 5

The Available Expression Analysis determines, for each point p in the
procedure, the expressions that are available. An expression is considered
available at p if:
1. the expression is evaluated on every path to the point p
2. and the expression is not invalidated (killed) after the evaluation
   but before the point p (e.g., by assigning a new value to a variable
   that is used in the expression)
Therefore, the expression (c + d) is considered available at the second
assignment and its result can be reused. This is shown in Listing 1.2:
Listing 1.2: After optimization
tmp := c + d
a := tmp * 2
b := tmp * 5

This optimization improves the speed of the program by storing the result
of the expression (c + d) in a temporary variable thereby removing the
need to calculate the result twice.
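
The transformation from Listing 1.1 to Listing 1.2 can be sketched for straight-line code as follows. This is a minimal illustration rather than a full Available Expression Analysis: it handles a single basic block only, and the instruction format ({ dest, op, left, right }), the temporaries t1 and t2 and the function name eliminateCommonSubexpressions are assumptions made for this example.

```javascript
// Local CSE over one basic block of three-address instructions.
// Hypothetical instruction format: { dest, op, left, right }.
function eliminateCommonSubexpressions(block) {
  // Maps an available expression "left op right" to the variable holding it.
  const available = new Map();
  const result = [];

  for (const instr of block) {
    const key = `${instr.left} ${instr.op} ${instr.right}`;
    const holder = available.get(key);

    // Reuse the earlier result if the expression is available here.
    result.push(
      holder !== undefined
        ? { dest: instr.dest, op: "copy", left: holder, right: null }
        : instr
    );

    // Kill: writing to dest invalidates expressions that read dest
    // and expressions whose value was stored in dest.
    for (const [expr, variable] of available) {
      const [left, , right] = expr.split(" ");
      if (left === instr.dest || right === instr.dest || variable === instr.dest) {
        available.delete(expr);
      }
    }

    // Gen: the expression is now available, unless it reads its own target.
    if (instr.left !== instr.dest && instr.right !== instr.dest && !available.has(key)) {
      available.set(key, instr.dest);
    }
  }
  return result;
}

// Listing 1.1 in three-address form (t1 and t2 are temporaries).
const before = [
  { dest: "t1", op: "+", left: "c", right: "d" },
  { dest: "a", op: "*", left: "t1", right: "2" },
  { dest: "t2", op: "+", left: "c", right: "d" },
  { dest: "b", op: "*", left: "t2", right: "5" },
];
const after = eliminateCommonSubexpressions(before);
console.log(after[2]); // t2 is now a copy of t1 instead of a second c + d
```

The kill step corresponds to condition 2 above: once a variable used in an expression is reassigned, the expression is no longer available and must be recomputed.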
Live Variable Analysis, Reaching Definitions Analysis and Very Busy
Expressions Analysis are other examples of intraprocedural analyses [NNH99].
All of these techniques depend on the fact that it is relatively easy to
determine the control flow within a procedure. For example, if the control
flow reaches the condition of an if-else statement, there are exactly
two positions where it can continue: either in the if-block or in the
else-block. Since intraprocedural analysis doesn’t analyse the interactions
between different functions, the control flow is more or less predictable
at compile time.

1.1.2 Interprocedural analysis


Some intraprocedural optimizations may be used on a whole-program
level. For example, Dead Code Elimination is an optimization technique
that removes code which is never executed. If the analyser starts
analysing the control flow between functions (inter-flow), it may be able
to apply this optimization across function calls. Consider the C++ code
in Listing 1.3:
Listing 1.3: Example for possible Dead Code Elimination
void f(int x) {
    if (x < 0) {
        // do something
    }
    else {
        // do something else
    }
}

int main() {
    f(10);
    f(5);
}

In this example, an optimizing compiler may be able to use interprocedural
data flow analysis to find out that the parameter x in the function f
only ever takes on positive values in this program and that it is therefore
possible to eliminate the if-branch (the x < 0 case) from the code.
However, calculating the interprocedural control flow of a program is often
not a straightforward task. For example, at the end of a procedure
there could be many different places where the control flow continues
because procedures are usually called from various locations. Additionally,
modern programming languages often support a form of dynamic
dispatch, which means that it is sometimes not possible to determine at
compile time which function body is executed as a result of a function
application. The following two subsections describe these problems in
more detail.

Higher-order functions in JavaScript


In JavaScript, functions are first-class objects which means that they can
be stored in variables, passed as an argument or returned from another
function [AS96]. Higher-order functions take one or more functions as
an input or return a function [Lip11]. Listing 1.4 shows a higher-order
function in JavaScript:
Listing 1.4: Higher-order function in JavaScript
//filters an array based on a custom predicate
function filter(arr, predicate) {
    var filtered = [];

    for(var i = 0; i < arr.length; ++i)
        if(predicate(arr[i]))
            filtered.push(arr[i]);

    return filtered;
}

function is_even(x) {
    return x % 2 == 0;
}

var numbers = [0,1,2,3,4,5,6,7,8,9,10];
console.log(numbers); //[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

var even_numbers = filter(numbers, is_even);
console.log(even_numbers); //[0, 2, 4, 6, 8, 10]

At the point inside the filter() function where the predicate() function
is called, it is not immediately clear where the control flow continues. In
this simple program it is easy to find out from the context that the function
is_even() will be called. However, in a real-world program, a utility
function like filter() will be called many times, possibly with different
predicates. Thus, there is no single specific predicate function body that
will be executed each time the filter() function is called. The function
body that is actually executed depends on the function that is stored in
the parameter predicate at the time of invocation.
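
The following sketch illustrates this point by calling the same utility function with three different predicates. The function is_odd() and the arrow function are hypothetical additions that do not appear in Listing 1.4; filter() is repeated here only so that the example is self-contained.

```javascript
// Same shape as filter() in Listing 1.4.
function filter(arr, predicate) {
  const filtered = [];
  for (const item of arr) {
    // Which function body runs here depends on the value of `predicate`.
    if (predicate(item)) {
      filtered.push(item);
    }
  }
  return filtered;
}

function is_even(x) {
  return x % 2 === 0;
}

// Hypothetical second predicate, not part of the original listing.
function is_odd(x) {
  return x % 2 !== 0;
}

const numbers = [0, 1, 2, 3, 4, 5];

// Three call sites, three different targets for the call predicate(item).
const evens = filter(numbers, is_even);
const odds = filter(numbers, is_odd);
const large = filter(numbers, (x) => x > 3);

console.log(evens, odds, large);
```

For this program, an analysis has to report that the call predicate(item) may continue in any of the three function bodies.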

Dynamic dispatch through inheritance in C++


Many programming languages support dynamic dispatch through inheritance.
This can be used to implement polymorphic types, that is, multiple
types that provide a common interface [Str14]. Listing 1.5 shows
an example of dynamic dispatch in C++:
Listing 1.5: Dynamic dispatch through inheritance in C++
#include <iostream>
#include <string>

class Person {
    std::string name;
public:
    Person(std::string const& n):name{n} {}
    virtual std::string description() const {
        return name;
    }
    virtual ~Person()=default;
};

class Student : public Person {
    int student_id;
public:
    Student(std::string const& n, int i):Person{n}, student_id{i} {}
    std::string description() const override {
        return Person::description() + " (" + std::to_string(student_id) + ")";
    }
};

void print(Person const& p) {
    std::cout << p.description() << std::endl;
}

int main(int argc, char *argv[]) {
    Person p{"Toni Suter"};
    Student s{"Toni Suter", 1234};
    print(p); // "Toni Suter"
    print(s); // "Toni Suter (1234)"
}

As in the JavaScript example in Listing 1.4, there is not just one function
body that can be executed as a result of calling p.description().
The result varies depending on whether the argument that is passed to the
function print() is a Person object or an instance of a Person subclass.
The function body that is actually executed depends on the dynamic
type of the object that the parameter p refers to.

Motivation for Constraint Based Control Flow Analysis (CFA)


Constraints can be used to calculate the control flow between functions
(inter-flow) even for programs written in programming languages that
support higher-order functions and dynamic dispatch as shown in the
examples in Listing 1.4 and Listing 1.5. The general process of a
Constraint Based Analysis is as follows:
1. Start with an initial solution
2. Generate a system of constraints (section 2.3)
3. Resolve the constraints to get the final solution (section 2.4)
For example, in the JavaScript code in Listing 1.4, the analyser would
start with an initial solution by assigning to each subexpression the set of
functions that it may evaluate to. In the beginning those sets are incomplete.
Constraints that describe relationships and dependencies between
the sets of different subexpressions are then used to propagate the values
from one set to another if necessary [Aik99]. Finally, after the constraints
are resolved, the analyser knows that the parameter predicate can only
evaluate to the is_even() function and may be able to perform
interprocedural optimizations based on that knowledge.
Control flow analysis therefore depends to some extent on data flow analysis:
in order for the analyser to know where the control flow continues
when the expression predicate(arr[i]) is executed, it has to know the
values that the variable predicate can evaluate to.

2 Control flow analysis
This section describes how to analyse the control flow of a program using
constraints. As described in section 1.1, control flow analysis is
particularly interesting in programming languages that support some form of
dynamic dispatch. In this paper simple JavaScript examples are used to
show the process of Constraint Based Control Flow Analysis. Some of
the code is derived from the examples in the book Principles of Program
Analysis [NNH99] where the untyped lambda calculus [Chu32] with a
few extensions is used.

2.1 How to represent an analysis result


The goal of a Constraint Based Control Flow Analysis is to determine
the control flow between functions (inter-flow). As described in
subsection 1.1.2, the control flow of a program often depends on its data flow.
For example, in a programming language like JavaScript where functions
are first-class objects, the control flow may depend on the function that
is currently stored in a variable or parameter. Therefore, the result of an
analysis must contain for each subexpression the set of functions that it
may evaluate to.
Usually, such an analysis result is defined by a pair (Ĉ, p). The abstract
cache Ĉ is a function that maps each subexpression of the program to
the set of functions that it may evaluate to. The function p is called the
abstract environment and it maps the variables of the program to the set
of functions that may be stored in them. In order to be able to refer to
each of the different subexpressions in the program, each of them needs a
unique label. Listing 2.1 shows a simple, labelled JavaScript program:

Listing 2.1: JavaScript example (the labels, written as ^n, are not part of JavaScript)
(function apply_two(f) {
    (return f^1(2);)^2
})^3

(function square(x) {
    (return x^4 * x^5;)^6
})^7

(function triple(y) {
    (return y^8 + 3;)^9
})^10

(apply_two(square^11);)^12
(apply_two(triple^13);)^14

The function apply_two() is a higher-order function which has one function
parameter f. It calls f with the argument 2 and returns the result of
this function call. The most interesting part of this program is at label 1
where dynamic dispatch happens. Table 2.1 and Table 2.2 show the analysis
result in terms of the abstract cache and the abstract environment,
respectively:

Table 2.1: Abstract cache Ĉ

Label   Ĉ
1       { square, triple }
2       {}
3       { apply_two }
4       {}
5       {}
6       {}
7       { square }
8       {}
9       {}
10      { triple }
11      { square }
12      {}
13      { triple }
14      {}

Table 2.2: Abstract environment p

Variable    p
apply_two   { apply_two }
square      { square }
triple      { triple }
f           { square, triple }
x           {}
y           {}

Sections 2.3 and 2.4 show how one may get to such a result by
generating and resolving a set of constraints. Since this is a pure control
flow analysis, the result contains only function values. This is also the
reason why there are so many empty sets in the result. For example,
in this program the parameters x and y can only ever have the value 2.
Because this is a number and not a function value, it does not appear in
the abstract environment p.
The abstract cache for label 1 is the set { square, triple }. We therefore
know that the control flow at this point continues either in the function
square() or in the function triple(). As described in subsection 1.1.2,
this may enable the compiler to perform intraprocedural optimizations
on a whole-program level.
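
The result in Table 2.1 and Table 2.2 could be held in memory as two maps from labels or variable names to sets of function names. The following is only a sketch of one possible representation; the names cache, env and lookup are assumptions made for this example, and empty sets are simply left out of the maps.

```javascript
// Abstract cache: label -> set of function names (empty sets omitted).
const cache = new Map([
  [1, new Set(["square", "triple"])],
  [3, new Set(["apply_two"])],
  [7, new Set(["square"])],
  [10, new Set(["triple"])],
  [11, new Set(["square"])],
  [13, new Set(["triple"])],
]);

// Abstract environment: variable -> set of function names.
const env = new Map([
  ["apply_two", new Set(["apply_two"])],
  ["square", new Set(["square"])],
  ["triple", new Set(["triple"])],
  ["f", new Set(["square", "triple"])],
]);

// Labels and variables without an entry map to the empty set.
function lookup(table, key) {
  return table.has(key) ? table.get(key) : new Set();
}

// The operator of the call at label 1 determines the possible call targets.
const targets = [...lookup(cache, 1)].sort();
console.log(targets); // square and triple
```

A client of the analysis, such as an optimizer, would query the cache at the label of a call's operator to learn which function bodies may run next.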

2.2 Acceptability
An analysis result is considered acceptable (valid) if, for each
subexpression, it contains at least every function that the subexpression
may evaluate to. Thus, a result that contains the set of all functions for
each subexpression would be a valid result, although not a very useful one.
The goal is to narrow the result down to the least solution because this
increases the chances of being able to perform additional optimizations
[NNH99].
There are a few simple rules that define for each type of expression (e.g.,
function application, binary expression, etc.) the conditions that have to
be true in order for an analysis result to be valid. These rules are used to
check whether an analysis result is acceptable for a particular program.
Consider for example Listing 2.2:
Listing 2.2: Acceptability example
(function my_func(f) {
    (f^1();)^2
})^3

(my_func((function() {
    // brilliant code
})^4);)^5

An analysis result is only acceptable for this program if Ĉ(4) ⊆ p(f),
or in other words, if the anonymous function in subexpression 4 is in the
set of possible values for the parameter f.
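
The condition Ĉ(4) ⊆ p(f) can be checked mechanically. The sketch below tests only this single condition and is not a complete acceptability check; the name fn4 for the anonymous function and the helper names isSubset and satisfiesArgumentCondition are assumptions made for this example.

```javascript
// Returns true if every element of a is also an element of b.
function isSubset(a, b) {
  for (const value of a) {
    if (!b.has(value)) return false;
  }
  return true;
}

// "fn4" is a hypothetical name for the anonymous function with label 4.
const candidateA = {
  cache: new Map([[4, new Set(["fn4"])]]),
  env: new Map([["f", new Set(["fn4"])]]),
};

// Same cache, but the parameter f is claimed to hold no function at all.
const candidateB = {
  cache: new Map([[4, new Set(["fn4"])]]),
  env: new Map([["f", new Set()]]),
};

// Checks the condition C(4) ⊆ p(f) for the call my_func(...) in Listing 2.2.
function satisfiesArgumentCondition(result) {
  return isSubset(result.cache.get(4), result.env.get("f"));
}

console.log(satisfiesArgumentCondition(candidateA)); // true
console.log(satisfiesArgumentCondition(candidateB)); // false
```

Candidate B is rejected because it claims that f never holds a function, although the anonymous function is passed to my_func().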

2.3 Constraint generation


Constraint generation is the process of building a set of constraints for a
particular program. The constraints describe relationships between the
abstract caches of the subexpressions and the abstract environments of the
variables. The goal is to generate a set of constraints and then to find the
least solution that satisfies these constraints and is still considered
acceptable. The constraints can be constructed by following specific rules
for each kind of language construct. For example, consider the following
rule var:

\[
[\mathit{var}]\quad \mathcal{C}[\![x^{l}]\!] = \{\, p(x) \subseteq \hat{C}(l) \,\}
\]

C is a function that maps expressions to the set of constraints that they
generate. The rule var defines the constraints that are generated for
expressions that consist of a single variable. In this case it is only one
constraint: p(x) ⊆ Ĉ(l). This means that the abstract cache Ĉ for the
subexpression with the label l should at least contain all the values from
the abstract environment of the variable x. In other words, if a variable
x can take on a certain set of values, an expression that only consists of
this variable can evaluate to the same set of values.
This was a very simple example. The rule for function definition fn is a
little more complex:

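\[
[\mathit{fn}]\quad \mathcal{C}[\![(\mathtt{function}(x)\ \{\ \mathtt{return}\ e_0;\ \})^{l}]\!]
= \{\, \{\mathtt{function}(x)\ \{\ \mathtt{return}\ e_0;\ \}\} \subseteq \hat{C}(l) \,\} \cup \mathcal{C}[\![e_0]\!]
\]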

The rule fn defines the constraints that are generated for function
definitions. First, the constraint {function(x) { return e0; }} ⊆ Ĉ(l) is
generated, which causes the function value to propagate to the abstract
cache of the surrounding expression with the label l. Additionally, the
constraints that are generated by the body expression e0 are added to the
set as well.
Similar to the examples above, there are rules that define the set of
constraints that needs to be generated for every language construct (e.g.,
function application, binary expression, etc.).
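
The two rules shown above can be sketched as a recursive function over a small syntax tree. The tree format ({ kind, name, param, body, label }), the function name generate and the textual node names such as p(x) and C(1) are assumptions made for this example; rules for other constructs, such as function application, are deliberately left out.

```javascript
// Hypothetical AST for labelled expressions:
//   { kind: "var", name, label }
//   { kind: "fn", param, body, label }
function generate(expr) {
  switch (expr.kind) {
    case "var":
      // [var]: p(x) ⊆ C(l)
      return [{ from: `p(${expr.name})`, to: `C(${expr.label})` }];
    case "fn":
      // [fn]: {fn} ⊆ C(l), plus all constraints of the body e0.
      // The function value is named after its label, e.g. fn@2.
      return [
        { from: `{fn@${expr.label}}`, to: `C(${expr.label})` },
        ...generate(expr.body),
      ];
    default:
      throw new Error(`no rule sketched for kind "${expr.kind}"`);
  }
}

// (function(x) { return x^1; })^2
const identity = {
  kind: "fn",
  param: "x",
  label: 2,
  body: { kind: "var", name: "x", label: 1 },
};

const constraints = generate(identity);
for (const c of constraints) {
  console.log(`${c.from} ⊆ ${c.to}`);
}
```

For the identity function this yields two constraints: one from the fn rule for label 2 and one from the var rule for the body with label 1.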

2.4 Constraint resolution


Constraint resolution is the process in which the constraints of a program
are turned into an actual analysis result. The goal of the constraint
resolution is to find the least solution without actually running the program.
A solution is considered to be the least solution if the abstract cache and
the abstract environment are narrowed down as much as possible and
the result is still acceptable.

2.4.1 Graph Formulation


One way to solve a system of constraints is to build a graph from the
environment p, the cache Ĉ and the constraints and then use a worklist
algorithm to propagate the contents of the sets through the graph. To do
so, we have to start with an initial cache and environment. This initial
result should contain just the values that can be identified by looking at
each subexpression in isolation from the rest of the program. For example,
consider the labelled JavaScript code in Listing 2.3:
Listing 2.3: Graph Formulation example code
((function(x) {
    return x^1;
})^2 ((function(y) {
    return y^3;
})^4))^5

In this simple example, an anonymous identity function is defined and
immediately called with another anonymous identity function passed as
argument. This may be a strange example, but longer, more meaningful
programs would make it very hard to illustrate the Graph Formulation
algorithm concisely.
For this example, the initial result would just contain Ĉ(2) = {id_x} and
Ĉ(4) = {id_y} where id_x and id_y are shorthands for the two identity
functions.

Furthermore, let’s assume that there is already an existing set of
constraints that was generated from rules similar to those shown in
section 2.3:
• p(x) ⊆ Ĉ(1)
• p(y) ⊆ Ĉ(3)
• {id_x} ⊆ Ĉ(2) ⇒ Ĉ(1) ⊆ Ĉ(5)
• {id_x} ⊆ Ĉ(2) ⇒ Ĉ(4) ⊆ p(x)
• {id_y} ⊆ Ĉ(2) ⇒ Ĉ(3) ⊆ Ĉ(5)
• {id_y} ⊆ Ĉ(2) ⇒ Ĉ(4) ⊆ p(y)
The following subsections describe the different steps in the Graph
Formulation algorithm. Each step description contains a figure that shows
the state of the graph after the step has been executed as well as an
explanation of what happened.
Additionally, each step title contains the state of the worklist W. The
worklist W is the list of nodes that still have to be processed (orange
nodes in the figures).

Step 1 : W = [Ĉ(4), Ĉ(2)]


Step 1 is the starting position of the algorithm. The graph has nodes for
the cache of each subexpression and for the environment of each variable.
For each constraint, there is at least one directed edge in the graph that
connects the nodes that are involved in this constraint. For example, the
constraint p(x) ⊆ Ĉ(1) is responsible for the edge that leads from the
environment of the variable x to the cache of the subexpression with the
label 1.
Figure 2.1: Step 1

Step 2 : W = [Ĉ(4), Ĉ(2)]


In step 2, some constraints are grayed out. The edges of these constraints
should not be traversed by the algorithm because the first part of these
constraints is not true. For example, in the constraint
{id_y} ⊆ Ĉ(2) ⇒ Ĉ(4) ⊆ p(y) the first part of the implication is not true
because Ĉ(2) contains only id_x, and therefore the whole constraint is
“inactive”.

Figure 2.2: Step 2

Step 3 : W = [p(x), Ĉ(2)]


In step 3, the first node in the worklist (Ĉ(4)) is processed by traversing
each of its outgoing, active edges. In this case, there is only one such
edge, namely the one with the constraint {id_x} ⊆ Ĉ(2) ⇒ Ĉ(4) ⊆ p(x).
Since the first part of the implication is true, the values in Ĉ(4) can
be propagated to p(x). Additionally, the node p(x) is added to the
worklist.

Figure 2.3: Step 3

Step 4 : W = [Ĉ(1), Ĉ(2)]


The same thing happens when the node p(x) is processed. The values
in p(x) can be propagated to Ĉ(1) because of the edge that leads from
p(x) to Ĉ(1). The node Ĉ(1) becomes a new member of the worklist.

Figure 2.4: Step 4

Step 5 : W = [Ĉ(5), Ĉ(2)]


The constraint {id_x} ⊆ Ĉ(2) ⇒ Ĉ(1) ⊆ Ĉ(5) causes the values from
Ĉ(1) to be propagated to the node Ĉ(5), which is added to the worklist.

Figure 2.5: Step 5

Step 6 : W = []
The node Ĉ(5) doesn’t have any outgoing edges, so it can be removed
from the worklist. The node Ĉ(2) is the only remaining member of the
worklist. It can be removed too, because its outgoing edges are due to
constraints that have already been processed, so traversing them adds no
new values. This is the final result of the analysis.

Figure 2.6: Step 6

Final result
The tables 2.3 and 2.4 show the final result of the Graph Formulation in
tabular form:

Table 2.3: Abstract cache Ĉ

Label   Ĉ
1       { id_y }
2       { id_x }
3       {}
4       { id_y }
5       { id_y }

Table 2.4: Abstract environment p

Variable   p
x          { id_y }
y          {}

This represents the least solution for the code in Listing 2.3. For example,
it shows that the overall expression with the label 5 can only evaluate to
the identity function id_y. Since all the possible function values for each
subexpression are now known, the analyser also knows which function
bodies may be executed as a result of calling a function (inter-flow).
In larger, more meaningful programs this knowledge may be useful to
perform intraprocedural optimizations on a whole-program level as shown
in subsection 1.1.2.
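
The steps above can be replayed with a small worklist solver. This is a sketch under several assumptions rather than the definitive algorithm: nodes are named by the strings C1 to C5, px and py, the functions are named idx and idy, each conditional constraint is attached both to its source node and to the node in its premise, and new nodes are put at the front of the worklist so that the order matches the step titles.

```javascript
const nodes = ["C1", "C2", "C3", "C4", "C5", "px", "py"];
const data = new Map(nodes.map((n) => [n, new Set()])); // current solution
const edges = new Map(nodes.map((n) => [n, []]));        // constraints per node
const worklist = [];

// Adds values to a node; a changed node goes to the front of the worklist.
function add(node, values) {
  const set = data.get(node);
  let changed = false;
  for (const v of values) {
    if (!set.has(v)) {
      set.add(v);
      changed = true;
    }
  }
  if (changed && !worklist.includes(node)) worklist.unshift(node);
}

// The six constraints for Listing 2.3; cond is the premise of an implication.
const constraints = [
  { from: "px", to: "C1" },
  { from: "py", to: "C3" },
  { cond: { fn: "idx", node: "C2" }, from: "C1", to: "C5" },
  { cond: { fn: "idx", node: "C2" }, from: "C4", to: "px" },
  { cond: { fn: "idy", node: "C2" }, from: "C3", to: "C5" },
  { cond: { fn: "idy", node: "C2" }, from: "C4", to: "py" },
];
for (const c of constraints) {
  edges.get(c.from).push(c);
  if (c.cond) edges.get(c.cond.node).push(c);
}

// Initial result: the two function definitions seen in isolation.
add("C2", ["idx"]);
add("C4", ["idy"]);

const trace = []; // worklist contents before each node is processed
while (worklist.length > 0) {
  trace.push([...worklist]);
  const node = worklist.shift();
  for (const c of edges.get(node)) {
    // A conditional constraint is active only if its premise holds.
    const active = !c.cond || data.get(c.cond.node).has(c.cond.fn);
    if (active) add(c.to, data.get(c.from));
  }
}

for (const [node, values] of data) {
  console.log(node, [...values]);
}
```

Under these assumptions the recorded worklist states follow the step titles, and the final sets agree with Table 2.3 and Table 2.4: only idy reaches p(x), Ĉ(1) and Ĉ(5), while Ĉ(3) and p(y) stay empty.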

3 Conclusion
My goal for this paper was to give a brief overview of different kinds
of static program analysis as well as a more detailed description of
Constraint Based Control Flow Analysis (CFA). After reading this paper it
should be clear to the reader that Constraint Based CFA is a powerful
tool to analyse the control flow between procedures (inter-flow), which
can lead to further optimization possibilities.
It is worth noting that there exist more constraint based analysis
techniques than the one that was described in this paper. However, explaining
all of them in detail would be beyond the scope of this paper.

Bibliography
[Aik99] Alexander Aiken. Introduction to Set Constraint-Based Program
Analysis. 1999.
[ALSU06] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D.
Ullman. Compilers - Principles, Techniques and Tools. 2006.
[AS96] Harold Abelson and Gerald Jay Sussman. Structure and
Interpretation of Computer Programs. 1996.
[Chu32] Alonzo Church. A Set of Postulates for the Foundation of
Logic. 1932.
[CT11] Keith Cooper and Linda Torczon. Engineering a Compiler.
2011.
[Lip11] Miran Lipovaca. Learn You a Haskell for Great Good! 2011.
[NNH99] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin.
Principles of Program Analysis. 1999.
[Str14] Bjarne Stroustrup. Bjarne Stroustrup’s C++ Glossary,
December 2014. http://www.stroustrup.com/glossary.html.
