
1 Introduction

( 1. Introduction, p. 3)

1.1 Compiler Phases

• Due to the complexity of the compilation task, a compiler typically proceeds in a sequence of compilation phases.

– In the Tiger book, as in this lecture, each chapter is devoted to one compilation phase.
– The phases communicate with each other via clearly defined interfaces.
– Interface:
  · data structure (e.g., a tree),
  · set of exported functions
– Each phase operates on an abstract intermediate representation of the source
program, not the source program text itself (except the first phase).

c 2002/03 T.Grust · Compiler Construction: 1. Introduction 19
• Breaking the compiler into many phases enables reuse (of phase implementations).

– Example:
If we need to adapt our compiler to translate a different source language than Tiger,
we only need to rewrite the early phases (Lex → Translate, see below).
All phases following Translate remain untouched (after Translate, all specifics of the
source language have been “abstracted away”).


1.1.1 A Brief Overview of the Tiger Compiler Phases

• Let us trace the compilation of the Tiger program below and see how the different
phases transform the initial source program (later on, we pick certain source
program fragments only to keep the exposition short).
rem.tig
    /* compute the remainder when dividing x by y */
    let function rem (x : int, y : int) : int =
          let var d := x / y
          in
            x - d * y
          end

        var r := 0
    in
      r := rem (10, 3)
    end


1.2 Intermediate Representations (Tree Languages)

• We have seen that the compiler phases pass different intermediate representations
(IR).

– Often, these IR take the form of a tree.


– In these trees, each node assumes one of many node types.
– Each node type carries a number of attributes to store details about the program
(fragment) being represented by that node.

Example: Recall from our review of the compiler phases:

– Nodes of type CallExp (function call) carry as attributes:
  · function name (a string)
  · argument list (an IR tree rooted in an ExpList node).
– Nodes of type IntExp (integer constant) carry as attribute:
  · constant value (an integer).


• How do we precisely describe the valid IR (tree) forms?

– Use grammars (grammars and trees are equivalent).


– Grammar: set of rules of the form

    L → R    (T)

  · T denotes one possible tree node type in our IR.
  · The righthand side R indicates what the subtree below a node of type T may look like (L may occur in R; one grammar may have several L → . . . rules).
– Example:
Grammar:
E → E Op E (OpExp)
E → num (NumExp)
Op → + (Plus)
Op → * (Times)


A conforming tree:

    OpExp
    ├── NumExp
    │   └── num
    ├── Plus
    └── NumExp
        └── num

N.B.

– Some node types are marked to be leaves (typewriter font).


– For the num node it might be sensible to add an attribute holding the actual numeric
value represented by that node.


• Example: Grammar that describes the valid IR trees for a simple straight-line
programming language (no loops, no gotos):

    Stm     → Stm ; Stm            (CompoundStm)
    Stm     → id := Exp            (AssignStm)
    Stm     → print ( ExpList )    (PrintStm)
    Exp     → id                   (IdExp)
    Exp     → num                  (NumExp)
    Exp     → Exp Binop Exp        (OpExp)
    Exp     → ( Stm , Exp )        (EseqExp)
    ExpList → Exp , ExpList        (PairExpList)
    ExpList → Exp                  (LastExpList)
    Binop   → +                    (Plus)
    Binop   → -                    (Minus)
    Binop   → *                    (Times)
    Binop   → /                    (Div)

• A valid program written in the straight-line programming language (provided that 1, 3, 5, 10 and a, b are acceptable for num and id, respectively):

a := 5+3; b := (print (a, a-1), 10*a); print (b)

• Informal semantics of the straight-line programming language shown on next slide.



– Stm (statement): may have side effects (variable assignment, I/O).
  · Stm ; Stm
    Execute the left statement, then execute the right statement.
  · id := Exp
    Evaluate Exp, then assign the numeric value to variable id.
  · print ( ExpList )
    Evaluate all expressions in the list (left to right), then print the resulting numeric values separated by spaces, terminated by a newline.
– Exp (expression): evaluates to a numeric value.
  · id
    Evaluates to the current value of variable id.
  · num
    Evaluates to the value of the numeric constant.
  · Exp Binop Exp
    Evaluate the left expression, then evaluate the right expression, then apply the binary operator.
  · ( Stm , Exp )
    Execute the statement, then evaluate Exp; the value of Exp is the value of the whole expression.

• The IR tree corresponding to the above example program:

a := 5+3; b := (print (a, a-1), 10*a); print (b)

    CompoundStm
    ├── AssignStm
    │   ├── a
    │   └── OpExp
    │       ├── NumExp
    │       │   └── 5
    │       ├── Plus
    │       └── NumExp
    │           └── 3
    └── CompoundStm
        ├── AssignStm
        │   ├── b
        │   └── EseqExp
        │       ├── PrintStm
        │       │   └── PairExpList
        │       │       ├── IdExp
        │       │       │   └── a
        │       │       └── LastExpList
        │       │           └── OpExp
        │       │               ├── IdExp
        │       │               │   └── a
        │       │               ├── Minus
        │       │               └── NumExp
        │       │                   └── 1
        │       └── OpExp
        │           ├── NumExp
        │           │   └── 10
        │           ├── Times
        │           └── IdExp
        │               └── a
        └── PrintStm
            └── LastExpList
                └── IdExp
                    └── b

N.B.
– This IR tree shows all node attributes (not just the IR subtrees of a node).

• How can we represent these IR trees in C code (i.e., inside our compiler)?

– Represent each IR tree node by a C struct.


A C struct lets us attach attributes (= struct fields) as well as subtrees to a node.
Rule: For each lefthand side grammar symbol (Stm, Exp, ExpList, Binop),
introduce a C struct type.
– Example:
    Stm     ↦ struct A_stm_
    Exp     ↦ struct A_exp_
    ExpList ↦ struct A_expList_
    Binop   ↦ struct A_binop_

– We will use pointers to these structs to link tree nodes, thus:


C code
    typedef struct A_stm_ *A_stm;
    typedef struct A_exp_ *A_exp;
    typedef struct A_expList_ *A_expList;
    typedef struct A_binop_ *A_binop;


• In each of these structs, embed

1 a kind field to indicate which node type this node actually has (e.g., for Exp (struct A_exp_), kind could be IdExp, NumExp, OpExp, EseqExp),
2 all attributes and subtree (pointers) for this specific node type.

Rule: If a node type is described by a single attribute value (e.g., NumExp),
embed this value in the struct; if we need to represent more attribute
values/subtrees, embed a nested struct that groups this information.

• Example:
C code
    struct A_exp_ {
      enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
      string id;                        /* A_idExp */
      int num;                          /* A_numExp */
      struct { A_exp left;
               A_binop oper;
               A_exp right; } op;       /* A_opExp */
      struct { A_stm stm;
               A_exp exp; } eseq;       /* A_eseqExp */
    };


• The kind field determines which attribute/subtree information is valid for any given node⁴. All other fields are unused and must not be accessed!

• Unused fields? ⇒ Use C union to save space for each node.

We get:
C code
    struct A_exp_ {
      enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
      union {
        string id;                      /* A_idExp */
        int num;                        /* A_numExp */
        struct { A_exp left;
                 A_binop oper;
                 A_exp right; } op;     /* A_opExp */
        struct { A_stm stm;
                 A_exp exp; } eseq;     /* A_eseqExp */
      } u;
    };

4 For example, accessing the op attributes (right, oper, left) while kind == A_idExp will result in havoc!
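
A common way to respect the rule that only the fields selected by kind may be touched is to switch on kind and let each branch read only its own union member. The following is a minimal sketch of that pattern, not code from the lecture: the helper name countExpNodes is hypothetical, and A_stm and A_binop are left as opaque pointer types because the walk never looks inside them.

```c
#include <assert.h>
#include <stddef.h>

typedef char *string;
typedef struct A_stm_ *A_stm;       /* opaque in this sketch */
typedef struct A_binop_ *A_binop;   /* opaque in this sketch */
typedef struct A_exp_ *A_exp;

struct A_exp_ {
    enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
    union {
        string id;                                          /* A_idExp */
        int num;                                            /* A_numExp */
        struct { A_exp left; A_binop oper; A_exp right; } op;  /* A_opExp */
        struct { A_stm stm; A_exp exp; } eseq;              /* A_eseqExp */
    } u;
};

/* Hypothetical helper: count the Exp nodes reachable from e.
   Statements inside an EseqExp are not entered. */
int countExpNodes(A_exp e)
{
    assert(e != NULL);
    switch (e->kind) {
    case A_idExp:                   /* leaves: no subtree to visit */
    case A_numExp:
        return 1;
    case A_opExp:                   /* only u.op is valid here */
        return 1 + countExpNodes(e->u.op.left)
                 + countExpNodes(e->u.op.right);
    case A_eseqExp:                 /* only u.eseq is valid here */
        return 1 + countExpNodes(e->u.eseq.exp);
    }
    return 0;                       /* not reached for a valid kind */
}
```

Every access to u sits inside the case that matches it, so a reader can check the rule branch by branch.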


slp.h


    typedef struct A_stm_ *A_stm;
    typedef struct A_exp_ *A_exp;
    typedef struct A_expList_ *A_expList;
    typedef struct A_binop_ *A_binop;

    struct A_stm_ {
      enum { A_compoundStm, A_assignStm, A_printStm } kind;
      union {
        struct {
          A_stm stm1, stm2;
        } compound;
        struct {
          string id;
          A_exp exp;
        } assign;
        struct {
          A_expList exps;
        } print;
      } u;
    };
    struct A_exp_ {
      enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
      union {
        string id;
        int num;
        struct {
          A_exp left;
          A_binop oper;
          A_exp right;
        } op;
        struct {
          A_stm stm;
          A_exp exp;
        } eseq;
      } u;
    };
    struct A_expList_ {
      enum { A_pairExpList, A_lastExpList } kind;
      union {
        struct {
          A_exp head;
          A_expList tail;
        } pair;
        A_exp last;
      } u;
    };
    struct A_binop_ {
      enum { A_plus, A_minus, A_times, A_div } kind;
    };
• As IR tree nodes “live on the heap”, they need to be allocated via malloc() and
initialized appropriately:

– Example: create an A_opExp node with subtrees A_exp e1 and e2 and A_binop op:
C code
    A_exp n;

    n = malloc (sizeof (*n));
    if (!n) { /* ... handle memory allocation failure ... */ }

    n->kind = A_opExp;
    n->u.op.left = e1;
    n->u.op.oper = op;
    n->u.op.right = e2;

• Such node creation routines will be needed over and over throughout the compiler.
⇒ Provide node constructors to allocate and initialize IR tree nodes.

Rule: never call malloc() outside these constructors.


• Example: node constructors for node types CompoundStm and IdExp.
C code
    A_stm A_CompoundStm (A_stm stm1, A_stm stm2)
    {
      A_stm s = checked_malloc (sizeof (*s));

      s->kind = A_compoundStm;
      s->u.compound.stm1 = stm1;
      s->u.compound.stm2 = stm2;

      return s;
    }
C code
    A_exp A_IdExp (string id)    /* typedef char *string */
    {
      A_exp e = checked_malloc (sizeof (*e));

      e->kind = A_idExp;
      e->u.id = id;

      return e;
    }


• To actually construct larger IR trees, we can now simply plug the constructors together
and build trees bottom-up:

1 Use constructors to build the leaf nodes,



2 use the results of these calls as arguments to constructors for inner tree nodes.

– Example: build the IR tree corresponding to the straight-line program

a := 42; print (a)

    CompoundStm
    ├── AssignStm
    │   ├── a
    │   └── NumExp
    │       └── 42
    └── PrintStm
        └── LastExpList
            └── IdExp
                └── a

C code
    A_stm p = A_CompoundStm (A_AssignStm ("a", A_NumExp (42)),
                             A_PrintStm (A_LastExpList (A_IdExp ("a"))));
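
Once a tree has been built, later code consumes it by walking the nodes. As a minimal sketch, the interpreter below follows the informal semantics of the straight-line language given earlier. It is not part of the lecture's code: the names lookup, update, interpExp and interpStm, the fixed-size variable table, and the choice that never-assigned variables read as 0 are all assumptions. To keep the block short, A_binop is the plain enum variant and char * stands in for string.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
typedef struct A_stm_ *A_stm;
typedef struct A_exp_ *A_exp;
typedef struct A_expList_ *A_expList;
typedef enum { A_plus, A_minus, A_times, A_div } A_binop; /* enum variant */

struct A_stm_ {
    enum { A_compoundStm, A_assignStm, A_printStm } kind;
    union {
        struct { A_stm stm1, stm2; } compound;
        struct { char *id; A_exp exp; } assign;
        struct { A_expList exps; } print;
    } u;
};
struct A_exp_ {
    enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
    union {
        char *id;
        int num;
        struct { A_exp left; A_binop oper; A_exp right; } op;
        struct { A_stm stm; A_exp exp; } eseq;
    } u;
};
struct A_expList_ {
    enum { A_pairExpList, A_lastExpList } kind;
    union {
        struct { A_exp head; A_expList tail; } pair;
        A_exp last;
    } u;
};

enum { MAX_VARS = 32, MAX_ARGS = 16 };   /* hypothetical fixed limits */
static struct { char *name; int value; } vars[MAX_VARS];
static int nvars = 0;

int lookup(const char *id)
{
    for (int i = 0; i < nvars; i++)
        if (strcmp(vars[i].name, id) == 0) return vars[i].value;
    return 0;   /* assumption: never-assigned variables read as 0 */
}

void update(char *id, int v)
{
    int i = 0;
    while (i < nvars && strcmp(vars[i].name, id) != 0) i++;
    if (i == nvars) { assert(nvars < MAX_VARS); vars[nvars++].name = id; }
    vars[i].value = v;
}

void interpStm(A_stm s);

int interpExp(A_exp e)
{
    int l, r;
    switch (e->kind) {
    case A_idExp:  return lookup(e->u.id);
    case A_numExp: return e->u.num;
    case A_eseqExp:
        interpStm(e->u.eseq.stm);       /* side effects first */
        return interpExp(e->u.eseq.exp);
    case A_opExp:
        l = interpExp(e->u.op.left);    /* left operand first */
        r = interpExp(e->u.op.right);
        switch (e->u.op.oper) {
        case A_plus:  return l + r;
        case A_minus: return l - r;
        case A_times: return l * r;
        case A_div:   assert(r != 0); return l / r;
        }
    }
    return 0;   /* not reached for valid kind fields */
}

void interpStm(A_stm s)
{
    int vals[MAX_ARGS], n = 0;
    A_expList l;
    switch (s->kind) {
    case A_compoundStm:
        interpStm(s->u.compound.stm1);
        interpStm(s->u.compound.stm2);
        break;
    case A_assignStm:
        update(s->u.assign.id, interpExp(s->u.assign.exp));
        break;
    case A_printStm:
        for (l = s->u.print.exps; l->kind == A_pairExpList; l = l->u.pair.tail) {
            assert(n < MAX_ARGS - 1);
            vals[n++] = interpExp(l->u.pair.head);
        }
        vals[n++] = interpExp(l->u.last);   /* evaluate all arguments ... */
        for (int i = 0; i < n; i++)         /* ... then print them */
            printf("%d%c", vals[i], i + 1 < n ? ' ' : '\n');
        break;
    }
}
```

The print case buffers all argument values before printing so that a nested print inside an argument produces its output first, as the "evaluate all, then print" wording requires.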


1.2.1 Summary of IR Tree Representation Rules

1 Valid IR trees are described by a grammar.


2 Each lefthand side grammar symbol E is translated into a corresponding struct definition:
    E ↦ struct X_E_ { ... }

3 The struct X_E_ itself is never used anywhere else; instead, declare X_E (pointer to struct):
    typedef struct X_E_ *X_E;

4 Each struct X_E_ contains a kind enum with one enumeration constant for each grammar rule with lefthand side E, and a union u to carry the specific attributes/subtrees:

    struct X_E_ { enum { ... } kind; union { ... } u; };


5 In union u, collect the information represented on the righthand side for each grammar rule for E. If several attributes/subtrees need to be represented, embed a struct carrying this information (e.g., compound in A_stm_).

6 If a single value describes the righthand side of a grammar rule for E, embed this value directly (e.g., num in A_exp_).

7 Each IR node type X_E will have a constructor that initializes all struct fields; malloc() is never called outside these constructors.

8 Each C file (compiler phase or module) will have a prefix X_ unique to that file.

9 Naming/capitalization:

    Exp (IdExp) ↦ struct X_exp_ { enum { X_idExp } kind; ... };
                  typedef struct X_exp_ *X_exp;
                  X_exp X_IdExp (...) { ... }
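
To see the rules applied end to end, here is a sketch for the small grammar from earlier (E → E Op E (OpExp), E → num (NumExp), Op → + (Plus), Op → * (Times)). The prefix G_ and all names derived from it are hypothetical, the num attribute is assumed to be an int, and a local stand-in for checked_malloc() is included only so the block compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Local stand-in for checked_malloc() from util.h (assumption of this sketch). */
void *checked_malloc(int len)
{
    void *p = malloc((size_t) len);
    if (!p) { fprintf(stderr, "out of memory\n"); exit(1); }
    return p;
}

/* Rule 3: only the pointer typedefs are used outside the definitions. */
typedef struct G_e_ *G_e;      /* E  */
typedef struct G_op_ *G_op;    /* Op */

/* Rules 2 and 4: one struct per lefthand side, one enum constant per rule. */
struct G_op_ {
    enum { G_plus, G_times } kind;   /* Plus and Times carry no attributes */
};

struct G_e_ {
    enum { G_opExp, G_numExp } kind;
    union {
        struct { G_e left; G_op oper; G_e right; } op;  /* rule 5: nested struct */
        int num;                                        /* rule 6: single value  */
    } u;
};

/* Rules 7 and 9: one capitalized constructor per node type. */
G_op G_Plus(void)
{
    G_op o = checked_malloc(sizeof(*o));
    o->kind = G_plus;
    return o;
}

G_op G_Times(void)
{
    G_op o = checked_malloc(sizeof(*o));
    o->kind = G_times;
    return o;
}

G_e G_NumExp(int num)
{
    G_e e = checked_malloc(sizeof(*e));
    e->kind = G_numExp;
    e->u.num = num;
    return e;
}

G_e G_OpExp(G_e left, G_op oper, G_e right)
{
    G_e e;
    assert(left && oper && right);   /* subtrees must already exist */
    e = checked_malloc(sizeof(*e));
    e->kind = G_opExp;
    e->u.op.left = left;
    e->u.op.oper = oper;
    e->u.op.right = right;
    return e;
}
```

With these definitions, a call such as G_OpExp (G_NumExp (1), G_Plus (), G_NumExp (2)) builds the conforming tree for an addition of two numbers.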


• Variations of these general IR representation rules:

1 Use a single struct definition to represent all IR node types uniformly.



C code
    typedef struct A_node_ *A_node;

    struct A_node_ {
      enum { A_compoundStm, A_assignStm, A_printStm,
             A_idExp, A_numExp, A_opExp, A_eseqExp,
             A_pairExpList, A_lastExpList,
             A_plus, A_minus, A_times, A_div } kind;
      union {
        struct { A_node stm1;
                 A_node stm2; } compound;
        ...
        struct { A_node left;
                 A_node oper;
                 A_node right; } op;
        ...
      } u;
    };


– Bad idea, because the C compiler loses the ability to check that we do not build
“nonsense” IR trees (everything is a generic A_node and may occur anywhere).

Example:
C code (buggy)
    A_node n = A_OpExp (A_IdExp ("a"),
                        A_AssignStm ("b", A_NumExp (42)),
                        A_PrintStm (A_LastExpList (A_NumExp (0))));

N.B.
– In a real compiler, we would write code to build complex IR trees, and bugs in that code would be far less obvious than in this small example.
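
With a uniform node type, the check the C compiler no longer performs has to be done at run time instead. The sketch below illustrates this cost on a reduced A_node; it is an assumption-laden illustration, not code from the lecture. The enum tag A_kind and the helpers isExp, isBinop and opExpWellFormed are hypothetical, and only a subset of the node kinds is modelled.

```c
#include <assert.h>
#include <stddef.h>

typedef struct A_node_ *A_node;

/* Reduced set of kinds (hypothetical subset of the full enum). */
enum A_kind { A_assignStm, A_printStm,           /* statements  */
              A_idExp, A_numExp, A_opExp,        /* expressions */
              A_plus, A_minus, A_times, A_div }; /* operators   */

struct A_node_ {
    enum A_kind kind;
    union {
        char *id;
        int num;
        struct { A_node left, oper, right; } op;
    } u;
};

/* Does n represent an expression? */
int isExp(A_node n)
{
    if (n == NULL) return 0;
    switch (n->kind) {
    case A_idExp: case A_numExp: case A_opExp: return 1;
    default: return 0;
    }
}

/* Does n represent a binary operator? */
int isBinop(A_node n)
{
    if (n == NULL) return 0;
    switch (n->kind) {
    case A_plus: case A_minus: case A_times: case A_div: return 1;
    default: return 0;
    }
}

/* Shallow run-time check of one A_opExp node: the children must be
   expression, operator, expression. With separate A_exp / A_binop
   types the C compiler would enforce this when the tree is built. */
int opExpWellFormed(A_node n)
{
    assert(n != NULL && n->kind == A_opExp);
    return isExp(n->u.op.left) && isBinop(n->u.op.oper)
        && isExp(n->u.op.right);
}
```

A node shaped like the buggy example above (a statement in operator position, a statement as right operand) fails this check, but only when the check is actually run.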


2 Consider the A_binop constructors:

C code
    A_binop A_Plus (void)
    {
      A_binop op = checked_malloc (sizeof (*op));
      op->kind = A_plus;
      return op;
    }

– The A_binop nodes encapsulate a single enum value kind only. This is uniform
but unnecessarily complex and wastes space.
– A_binop nodes only occur inside A_exp (of kind A_opExp) nodes.
⇒ Encode the operator inside A_opExp directly (using an enum) .


C code (modified A_exp node)
    typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

    struct A_exp_ {
      enum { A_idExp, A_numExp, A_opExp, A_eseqExp } kind;
      string id;                        /* A_idExp */
      int num;                          /* A_numExp */
      struct { A_exp left;
               A_binop oper;
               A_exp right; } op;       /* A_opExp */
      struct { A_stm stm;
               A_exp exp; } eseq;       /* A_eseqExp */
    };
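
With the operator stored as a plain enum value, code that consumes an A_opExp node can switch on oper directly, with no pointer to follow and no separate node to allocate. A minimal sketch, assuming integer arithmetic; the helper name applyBinop is hypothetical:

```c
#include <assert.h>

typedef enum { A_plus, A_minus, A_times, A_div } A_binop;

/* Hypothetical helper: apply a binary operator to two operand values. */
int applyBinop(A_binop oper, int left, int right)
{
    switch (oper) {
    case A_plus:  return left + right;
    case A_minus: return left - right;
    case A_times: return left * right;
    case A_div:
        assert(right != 0);     /* division by zero is not handled here */
        return left / right;
    }
    return 0;                   /* not reached for a valid operator */
}
```

An evaluator for A_opExp nodes would compute the values of left and right first and then pass them to such a function together with the node's oper field.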


1.3 C Coding Guidelines for the Tiger Compiler Project

• The Tiger compiler will be a rather complex piece of software.

We strongly suggest that you follow the guidelines below when you build C source code
for the compiler.

1 Each phase of the compiler belongs in its own .c source file (which #includes an associated .h header file containing exported function prototypes and type declarations).
[Separate compilation, handling, reusability]
2 Each phase shall have an identifier prefix X_ unique to this phase. All global names (struct/union fields are not global) shall start with the prefix.
[Organize the otherwise flat C namespace (avoid clashes), clarify origin of name]


3 All functions shall have prototypes and the C compiler shall be told to warn about uses of functions without prototypes. (gcc: -Wmissing-prototypes)
[In traditional C, a function used without a prototype is assumed to return int, and its arguments are not type-checked (e.g., pointers or characters may be implicitly treated as int)]

4 Each phase includes util.h and the compiler is linked against util.o.

util.h
    #include <assert.h>

    typedef char *string;
    typedef char bool;

    #define TRUE 1
    #define FALSE 0

    void *checked_malloc(int);
    string String(char *);


– assert: halt program if asserted expression yields 0.
Example:
C code
    A_exp e;
    e = malloc (sizeof (*e));
    assert (e);

Should malloc() fail:


tiger: phase.c:42: foo: Assertion ‘e’ failed. Aborted.
To disable all assertion checks, compile with -DNDEBUG.

– bool: simulates a boolean type in C; use type bool if a variable/function actually deals with truth values.

– checked_malloc(n) allocates n bytes and returns a pointer into the heap. Halts the program if allocation fails. ⇒ If checked_malloc() returns, the returned pointer is valid.
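
The slides give only the prototype of checked_malloc(). A minimal sketch of one possible implementation follows; the error message, the exit status, and the assertion on the requested size are assumptions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Allocate len bytes or halt the program; never returns NULL. */
void *checked_malloc(int len)
{
    void *p;

    assert(len > 0);            /* assumption: callers never ask for 0 bytes */
    p = malloc((size_t) len);
    if (!p) {
        fprintf(stderr, "Ran out of memory!\n");
        exit(1);
    }
    return p;
}
```

Because the failure case ends the program inside the function, callers such as the node constructors can use the result without any further check.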


5 Values of type string are heap-allocated strings. Constructor String("foo") allocates four bytes (three characters plus the terminating '\0') and copies the argument string.
Convention: a function that receives a string argument may assume the
string contents never change ⇒ it is safe to store the associated character
pointer, there is no need to copy the string.
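
A minimal sketch of how the String() constructor could be written is shown below. Only the prototype string String(char *) comes from util.h; the body is an assumption, and a small checked_malloc() is repeated so the block compiles on its own.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef char *string;

/* Local stand-in for checked_malloc() from util.h (assumption of this sketch). */
void *checked_malloc(int len)
{
    void *p = malloc((size_t) len);
    if (!p) { fprintf(stderr, "Ran out of memory!\n"); exit(1); }
    return p;
}

/* Copy s into freshly allocated heap memory and return the copy. */
string String(char *s)
{
    string p;

    assert(s != NULL);
    p = checked_malloc((int) strlen(s) + 1);   /* +1 for the terminating '\0' */
    strcpy(p, s);
    return p;
}
```

Since the result is a private copy, a later change to the caller's buffer does not affect it, which is what lets receivers store the pointer without copying again.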
6 Never call malloc() directly, always use checked_malloc().

[We may later re-implement checked_malloc() to, e.g., use a GC library.]

7 Never call free() to release heap-allocated memory.



– Correct usage of free() can be tricky: avoid space leaks (call free() early
enough), avoid corruption/overwrites (call free() not too early).
– Good practice: set p = 0 once you no longer intend to access *p.

[Again, a GC library could make the compiler production-strength nevertheless.]

