
CHAPTER-4 Syntax-Directed Translation

Syntax-directed translation (SDT) refers to a method of compiler implementation where the source-language translation is completely driven by the parser, i.e., based on the syntax of the language. The parsing process and parse trees are used to direct semantic analysis and the translation of the source program. Almost all modern compilers are syntax-directed.
SDT can be a separate phase of a compiler or we can augment our conventional grammar with
information to control the semantic analysis and translation. Such grammars are called attribute
grammars.
We augment a grammar by associating attributes with each grammar symbol to describe its properties. With each production in a grammar, we give semantic rules/actions, which describe how to compute the attribute values associated with each grammar symbol in a production.
The general approach to Syntax-Directed Translation is to construct a parse tree or syntax tree
and compute the values of attributes at the nodes of the tree by visiting them in some order. In
many cases, translation can be done during parsing without building an explicit tree.
A class of syntax-directed translations called "L-attributed translations" (L for left-to-right)
includes almost all translations that can be performed during parsing. Similarly, "S-attributed
translations" (S for synthesized) can be performed easily in connection with a bottom-up parse.
There are two ways to represent the semantic rules associated with grammar symbols.
1. Syntax-Directed Definitions (SDD)
2. Syntax-Directed Translation Schemes (SDT)

Syntax-Directed Definitions
A syntax-directed definition (SDD) is a context-free grammar together with attributes and rules.
Attributes are associated with grammar symbols and rules are associated with productions.
An attribute has a name and an associated value: a string, a number, a type, a memory location, an assigned register, and so on. The strings may even be long sequences of code, say code in the intermediate language used by a compiler. If X is a symbol and a is one of its attributes, then we
write X.a to denote the value of a at a particular parse-tree node labeled X. If we implement the
nodes of the parse tree by records or objects, then the attributes of X can be implemented by data
fields in the records that represent the nodes for X. The attributes are evaluated by the semantic
rules attached to the productions.
Example: PRODUCTION SEMANTIC RULE
E → E1 + T E.code = E1.code || T.code || ‘+’

SDDs are highly readable and give high-level specifications for translations. But they hide many
implementation details. For example, they do not specify order of evaluation of semantic actions.
Syntax-directed definition:

Syntax trees for assignment statements are produced by the following syntax-directed definition. Non-terminal S generates an assignment statement. The two binary operators + and * are representative of the full operator set in a typical language.

PRODUCTION SEMANTIC RULE

S → id := E      S.nptr := mknode('assign', mkleaf(id, id.place), E.nptr)

E → E1 + E2      E.nptr := mknode('+', E1.nptr, E2.nptr)

E → E1 * E2      E.nptr := mknode('*', E1.nptr, E2.nptr)

E → - E1         E.nptr := mknode('uminus', E1.nptr)

E → ( E1 )       E.nptr := E1.nptr

E → id           E.nptr := mkleaf(id, id.place)

Syntax-Directed Translation Schemes (SDT)


An SDT embeds program fragments called semantic actions within production bodies. The position of a semantic action in a production body determines the point at which the action is executed.
Example: In the rule E → E1 + T { print '+' }, the action is positioned after the body of the production, so it executes after E1 + T has been parsed.

Inherited and Synthesized Attributes


Terminals can have synthesized attributes, which are given to them by the lexical analyzer (not the parser). There are no rules in an SDD giving values to attributes for terminals. Terminals do not have inherited attributes.
A nonterminal A can have both inherited and synthesized attributes. The difference is how they
are computed by rules associated with a production at a node N of the parse tree.
1. A synthesized attribute for a nonterminal A at a parse-tree node N is defined by a
semantic rule associated with the production at N. Note that the production must have
A as its head.
A synthesized attribute at node N is defined only in terms of attribute values at the
children of N and at N itself.
2. An inherited attribute for a nonterminal B at a parse-tree node N is defined by a
semantic rule associated with the production at the parent of N. Note that the
production must have B as a symbol in its body.
An inherited attribute at node N is defined only in terms of attribute values at N's
parent, N itself, and N's siblings.

Construction of Syntax Trees

SDDs are useful for the construction of syntax trees. A syntax tree is a condensed form of parse
tree.

Syntax trees are useful for representing programming language constructs like expressions and
statements.
• They help compiler design by decoupling parsing from translation.
• Each node of a syntax tree represents a construct; the children of the node represent the
meaningful components of the construct.

• e.g. a syntax-tree node representing an expression E1 + E2 has label + and two children representing the subexpressions E1 and E2

• Each node is implemented by an object with a suitable number of fields; each object has an op field that is the label of the node, with additional fields as follows:
1. If the node is a leaf, an additional field holds the lexical value for the
leaf. This is created by function Leaf(op, val)

2. If the node is an interior node, there are as many fields as the node has
children in the syntax tree. This is created by function Node(op, c1, c2,...,ck) .

Example: The S-attributed definition below constructs syntax trees for a simple expression grammar involving only the binary operators + and -. As usual, these operators are at the same precedence level and are jointly left associative. All nonterminals have one synthesized attribute, node, which represents a node of the syntax tree.
Syntax tree for a-4+c using the above SDD is shown below.

CHAPTER-5 (Type Checking)
Types and Declarations

We begin with some basic definitions to set the stage for performing semantic analysis. A type is
a set of values and a set of operations operating on those values. There are three categories of
types in most programming languages:

Base types
int, float, double, char, bool, etc. These are the primitive types provided directly by the
underlying hardware. There may be a facility for user-defined variants on the base types (such as
C enums).

Compound types
arrays, pointers, records, structs, unions, classes, and so on. These types are constructed as
aggregations of the base types and simple compound types.

Complex types
lists, stacks, queues, trees, heaps, tables, etc. You may recognize these as abstract data types. A
language may or may not have support for these sort of higher-level abstractions.

In many languages, a programmer must first establish the name and type of any data object (e.g.,
variable, function, type, etc). In addition, the programmer usually defines the lifetime. A
declaration is a statement in a program that communicates this information to the compiler. The
basic declaration is just a name and type, but in many languages it may include modifiers that
control visibility and lifetime (i.e., static in C, private in Java). Some languages also allow
declarations to initialize variables, such as in C, where you can declare and initialize in one
statement. The following C statements show some example declarations:

double calculate(int a, double b); // function prototype

int x = 0;  // global variables available throughout
double y;   // the program

int main()
{
    int m[3]; // local variables available only in main
    char *n;
    ...
}
Function declarations or prototypes serve a similar purpose for functions that variable declarations do for variables. Function and method identifiers also have a type, and the compiler can use it to ensure that a program is calling a function/method correctly. The compiler uses the prototype to check the number and types of arguments in function calls. The location and qualifiers establish the visibility of the function (Is the function global? Local to the module? Nested in another procedure? Attached to a class?). Type declarations (e.g., C typedef, C++ classes) introduce new type names.

Type System
In programming languages, a type system is a set of rules that assign a property called type to
various constructs a computer program consists of, such
as variables, expressions, functions or modules. The main purpose of a type system is to reduce
possibilities for bugs in computer programs, by defining interfaces between different parts of a
computer program, and then checking that the parts have been connected in a consistent way.
This checking can happen statically (at compile time), dynamically (at run time), or as a
combination of static and dynamic checking.

A type system associates a type with each computed value and, by examining the flow of these
values, attempts to ensure or prove that no type errors can occur.

Advantages provided by compiler-specified type systems include:

 Optimization – Static type-checking may provide useful compile-time information. For example, if a type requires that a value must align in memory at a multiple of four bytes, the compiler may be able to use more efficient machine instructions.
 Safety – A type system enables the compiler to detect meaningless or probably invalid code. For example, we can identify an expression 3 / "Hello, World" as invalid when the rules do not specify how to divide an integer by a string.

If a type system is both sound (meaning that it rejects all incorrect programs) and decidable (meaning that it is possible to write an algorithm that determines whether a program is well-typed), then it cannot also be complete: it will always be possible to write a program that is rejected as ill-typed yet never encounters a runtime error. For example, consider a program containing the code:

if <complex test> then <do something> else <generate type error>

Even if the expression <complex test> always evaluates to true at run time, most type checkers will reject the program as ill-typed, because it is difficult (if not impossible) for a static analyzer to determine that the else branch will never be taken.

Type Checking

Type checking is the process of verifying that each operation executed in a program respects the type system of the language. This generally means that all operands in any expression are of appropriate types and number. Much of what we do in the semantic analysis phase is type checking. Sometimes the rules regarding operations are defined by other parts of the code (as in function prototypes), and sometimes such rules are a part of the definition of the language itself (as in "both operands of a binary arithmetic operation must be of the same type").

If a problem is found, e.g., one tries to add a char pointer to a double in C, we encounter a type error. A language is considered strongly typed if each and every type error is detected during compilation. Type checking can be done during compilation, during execution, or divided across both.

Static type checking is done at compile time. The information the type checker needs is obtained via declarations and stored in a master symbol table. After this information is collected, the types involved in each operation are checked. It is very difficult for a language that does only static type checking to meet the full definition of strongly typed.

Dynamic type checking is implemented by including type information for each data location at
runtime. For example, a variable of type double would contain both the actual double value and
some kind of tag indicating "double type". The execution of any operation begins by first
checking these type tags. The operation is performed only if everything checks out. Otherwise, a
type error occurs and usually halts execution.

Types of conversion:

In computer science, type conversion, type casting, and type coercion are different ways of changing an entity of one data type into another. An example would be the conversion of an integer value into a floating-point value or its textual representation as a string, and vice versa.

Two important aspects of a type conversion are whether it happens implicitly or explicitly, and whether the underlying data representation is converted from one representation into another, or a given representation is merely reinterpreted as the representation of another data type.

Implicit Conversion:
 Performed automatically by the compiler.
 No data loss takes place during the conversion.
 No possibility of throwing an exception during the conversion; it is therefore called type safe.
 Requires no special syntax.
 Example: conversion of a smaller type to a larger type is implicit.

float i = 0;
int j = 10;
i = j;  // implicit conversion: float is wider than int,
        // so no loss of data and no exception

Explicit Conversion:
 Done explicitly by the programmer.
 Data loss may take place during the conversion, so there is a risk of information loss.
 May raise an error if attempted without a type cast.
 Requires the cast operator to perform the conversion.
 Example: conversion of a larger type to a smaller type is explicit.

float k = 123.456;
int i = (int) k;  // explicit conversion; (int) is the cast operator.
                  // An exception is avoided, but there is noticeable
                  // data loss: i = 123, and .456 is lost

CHAPTER-6

INTERMEDIATE CODE GENERATION

INTRODUCTION

The front end translates a source program into an intermediate representation from
which the back end generates target code.
Benefits of using a machine-independent intermediate form are:

1. Retargeting is facilitated. That is, a compiler for a different machine can be created
by attaching a back end for the new machine to an existing front end.

2. A machine-independent code optimizer can be applied to the intermediate representation.

Position of intermediate code generator


parser → static checker → intermediate code generator → code generator

INTERMEDIATE LANGUAGES

Three ways of intermediate representation:

1. Syntax tree

2. Postfix notation

3. Three address code

The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.

Graphical Representations:

Syntax tree:
A syntax tree depicts the natural hierarchical structure of a source program. A dag
(Directed Acyclic Graph) gives the same information but in a more compact way because
common subexpressions are identified. A syntax tree and dag for the assignment statement a : =
b * - c + b * - c are as follows:

(a) Syntax tree (b) Dag

[Figure: both are rooted at assign with children a and +; in the dag, the common subexpression b * uminus c appears only once and is shared by both operands of +.]

Postfix notation:

Postfix notation is a linearized representation of a syntax tree; it is a list of the nodes of the tree in which a node appears immediately after its children. The postfix notation for the syntax tree given above is

a b c uminus * b c uminus * + assign

Syntax-directed definition:

Syntax trees for assignment statements are produced by the following syntax-directed definition. Non-terminal S generates an assignment statement. The two binary operators + and * are representative of the full operator set in a typical language. Operator associativities and precedences are the usual ones, even though they have not been put into the grammar. This definition constructs the tree from the input a := b * - c + b * - c.

PRODUCTION SEMANTIC RULE

S → id := E      S.nptr := mknode('assign', mkleaf(id, id.place), E.nptr)

E → E1 + E2      E.nptr := mknode('+', E1.nptr, E2.nptr)

E → E1 * E2      E.nptr := mknode('*', E1.nptr, E2.nptr)

E → - E1         E.nptr := mknode('uminus', E1.nptr)

E → ( E1 )       E.nptr := E1.nptr

E → id           E.nptr := mkleaf(id, id.place)

Three-Address Code:

Three-address code is a sequence of statements of the general form

x : = y op z

where x, y and z are names, constants, or compiler-generated temporaries; op stands for any operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on boolean-valued data. Thus a source language expression like x + y * z might be translated into a sequence

t1 := y * z
t2 := x + t1

where t1 and t2 are compiler-generated temporary names.

Advantages of three-address code:


 The unraveling of complicated arithmetic expressions and of nested flow-of-control
statements makes three-address code desirable for target code generation and
optimization.

 The use of names for the intermediate values computed by a program allows three-
address code to be easily rearranged – unlike postfix notation.
Three-address code is a linearized representation of a syntax tree or a dag in which
explicit names correspond to the interior nodes of the graph. The syntax tree and dag are
represented by the three-address code sequences. Variable names can appear directly in three-
address statements.

Three-address code corresponding to the syntax tree and dag given above:

(a) Code for the syntax tree:

t1 := - c
t2 := b * t1
t3 := - c
t4 := b * t3
t5 := t2 + t4
a := t5

(b) Code for the dag:

t1 := - c
t2 := b * t1
t5 := t2 + t2
a := t5

The reason for the term “three-address code” is that each statement usually contains three
addresses, two for the operands and one for the result.

Types of Three-Address Statements:

The common three-address statements are:

A. Assignment statements of the form x := y op z, where op is a binary arithmetic or logical operation.
B. Assignment instructions of the form x : = op y, where op is a unary operation. Essential unary
operations include unary minus, logical negation, shift operators, and conversion operators
that, for example, convert a fixed-point number to a floating-point number.

C. Copy statements of the form x : = y where the value of y is assigned to x.


D. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.
E. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator (
<, =, >=, etc. ) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed next,
as in the usual sequence.
F. param x and call p, n for procedure calls, and return y, where y representing a returned value is optional. For example, the sequence

param x1
param x2
...
param xn
call p, n

is generated as part of a call of the procedure p(x1, x2, …, xn).

G. Indexed assignments of the form x : = y[i] and x[i] : = y.

H. Address and pointer assignments of the form x : = &y , x : = *y, and *x : = y.

Implementation of Three-Address Statements:

A three-address statement is an abstract form of intermediate code. In a compiler, these statements can be implemented as records with fields for the operator and the operands. Three such representations are:
 Quadruples

 Triples

 Indirect triples

Quadruples:

 A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result.

 The op field contains an internal code for the operator. The three-address statement x := y op z is represented by placing y in arg1, z in arg2 and x in result.

 The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.

Triples:

 To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.
 If we do so, three-address statements can be represented by records with only three
fields: op, arg1 and arg2.
 The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table
or pointers into the triple structure ( for temporary values ).
 Since three fields are used, this intermediate code format is known as triples.

     op       arg1    arg2    result

(0)  uminus   c               t1
(1)  *        b       t1      t2
(2)  uminus   c               t3
(3)  *        b       t3      t4
(4)  +        t2      t4      t5
(5)  :=       t5              a

(a) Quadruples

     op       arg1    arg2

(0)  uminus   c
(1)  *        b       (0)
(2)  uminus   c
(3)  *        b       (2)
(4)  +        (1)     (3)
(5)  assign   a       (4)

(b) Triples

Indirect Triples:

 Another implementation of three-address code is that of listing pointers to triples, rather than listing the triples themselves. This implementation is called indirect triples.
 For example, let us use an array to list pointers to triples in the desired order. Then the triples shown above might be represented as follows:

Statement              op       arg1    arg2

(0)  (14)        (14)  uminus   c
(1)  (15)        (15)  *        b       (14)
(2)  (16)        (16)  uminus   c
(3)  (17)        (17)  *        b       (16)
(4)  (18)        (18)  +        (15)    (17)
(5)  (19)        (19)  assign   a       (18)

(c) Indirect triples representation of three-address statements

Declarations
As the sequence of declarations in a procedure or block is examined, we can lay out
storage for names local to the procedure. For each local name, we create a symbol-table entry
with information like the type and the relative address of the storage for the name. The relative
address consists of an offset from the base of the static data area or the field for local data in an
activation record.

Declarations in a Procedure:
The syntax of languages such as C, Pascal and Fortran allows all the declarations in a single procedure to be processed as a group. In this case, a global variable, say offset, can keep track of the next available relative address.

In the translation scheme shown below:

 Nonterminal P generates a sequence of declarations of the form id : T.

 Before the first declaration is considered, offset is set to 0. As each new name is seen ,
that name is entered in the symbol table with offset equal to the current value of offset,
and offset is incremented by the width of the data object denoted
by that name.
 The procedure enter( name, type, offset ) creates a symbol-table entry for name, gives its
type type and relative address offset in its data area.
 Attribute type represents a type expression constructed from the basic types integer and
real by applying the type constructors pointer and array. If type expressions are
represented by graphs, then attribute type might be a pointer to the node representing a
type expression.

 The width of an array is obtained by multiplying the width of each element by the
number of elements in the array. The width of each pointer is assumed to be 4.

Back patching

Backpatching is a technique for replacing the symbolic targets in goto statements with the actual target addresses.

Back patching usually refers to the process of resolving forward branches that have been planted
in the code, e.g. at 'if' statements, when the value of the target becomes known, e.g. when the
closing brace or matching 'else' is encountered.

Flow Control Statements


The main concern with flow control is the additional branching instructions that must fit between the other blocks of code represented by the simple nonterminals.

stmt → IF expr THEN stmt


| IF expr THEN stmt ELSE stmt
| WHILE expr DO stmt
Procedure Call
A procedure is an executable object stored on the data source. A procedure can have zero or
more parameters. It can also return a value, as indicated by the optional parameter marker ?= at
the start of the syntax.

{call procedure-name[([parameter][,[parameter]]...)]}

A programmatic subroutine (function, procedure, or subprogram) is a sequence of code which performs a specific task, as part of a larger program, grouped as one or more statement blocks with the typical intention of doing one thing well. The parameters which follow the procedure's name are passed to the procedure.

Examples
ERASE

This is a procedure call to a subroutine to erase the current window. There are no explicit inputs
or outputs. Other procedures have one or more parameters. For example:
PLOT, Circle, Square

calls the PLOT procedure with the parameters Circle and Square.

CHAPTER-7 (Table Representation)

Symbol Table:
A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen, and entries for the declarations in D1 are created in the new table. The new table points back to the symbol table of the enclosing procedure; the name represented by id itself is local to the enclosing procedure. The only change from the treatment of variable declarations is that the procedure enter is told which symbol table to make an entry in.

For example, consider the symbol tables for procedures readarray, exchange, and quicksort
pointing back to that for the containing procedure sort, consisting of the entire program. Since
partition is declared within quicksort, its table points to that of quicksort. The symbol table is
accessed by most phases of a compiler, beginning with lexical analysis and continuing through optimization. A compiler may use one large symbol table for all symbols or use separated, hierarchical symbol tables for different scopes.
Symbol tables for nested procedures

sort (previous = nil): a, x, readarray → readarray's table, exchange → exchange's table, quicksort → quicksort's table
readarray: i
exchange: (no locals)
quicksort: k, v, partition → partition's table
partition: i, j

Each table begins with a header that records, among other things, a pointer to the table of the enclosing procedure.

The semantic rules are defined in terms of the following operations:

1. mktable(previous) creates a new symbol table and returns a pointer to the new table. The
argument previous points to a previously created symbol table, presumably that for the
enclosing procedure.

2. enter(table, name, type, offset) creates a new entry for name name in the symbol table pointed
to by table. Again, enter places type type and relative address offset in fields within the entry.

3. addwidth(table, width) records the cumulative width of all the entries in table in the header
associated with this symbol table.

4. enterproc(table, name, newtable) creates a new entry for procedure name in the symbol table
pointed to by table. The argument newtable points to the symbol table for this procedure
name.
Hash Table
A common data structure used to implement symbol tables is the hash table. Hash tables are used
to organise a symbol table, where the keyword or identifier is 'hashed' to produce an array
subscript. Collisions are inevitable in a hash table, and a common way of handling them is to
store the synonym in the next available free space in the table.
Hashing is the process of mapping a large amount of data to a smaller table with the help of a hashing function. The essence of hashing is to provide a faster searching method than linear or binary search. The advantage of this searching method is its efficiency in handling vast amounts of data items in a given collection.

Example: Here, we construct a hash table for storing and retrieving data related to the citizens of a county, and the social-security numbers of the citizens are used as the keys. Let us assume that the table size is 12, so the hash function takes the key's value modulo 12.

Hence, the hash function is:

(sum of numeric values of the characters in the data item) % 12

Note: % is the modulus operator.

Let us consider the following social-security numbers and produce a hashcode:

120388113D => 1+2+0+3+8+8+1+1+3+13 = 40; hence 40 % 12 => hashcode = 4
310181312E => 3+1+0+1+8+1+3+1+2+14 = 34; hence 34 % 12 => hashcode = 10
041176438A => 0+4+1+1+7+6+4+3+8+10 = 44; hence 44 % 12 => hashcode = 8

Therefore, the Hashtable content would be as follows:


-----------------------------------------------------
0:empty
1:empty
2:empty
3:empty
4:occupied Name:Drew Smith SSN:120388113D
5:empty
6:empty
7:empty
8:occupied Name:Andy Conn SSN:041176438A
9:empty
10:occupied Name:Igor Barton SSN:310181312E
11:empty
-----------------------------------------------------
How to represent scope information in the symbol table:

Idea:
1. There is a hierarchy of scopes in the program.
2. Use a similar hierarchy of symbol tables.
3. One symbol table for each scope.
4. Each symbol table contains the symbols declared in its lexical scope. This solves the problem of resolving name collisions (the same name declared in overlapping scopes).
