
Chapter 1

Introduction to Compilers

1.1 Overview of Compilers

A compiler is system software that takes a source program as input and converts it into a target program.
Source programs are independent of the machine on which they are executed, whereas target
programs are machine dependent. The source program can be in any high level language like C, C++, etc.,
and the target program may be in assembly language or machine language.

1.1.1 Necessity of a compiler

To solve any problem using a computer, it is required to generate a set of instructions to solve the
problem. These instructions can be in machine level code, assembly level language or a high level
language. Machine level code consists of instructions written in machine code, i.e., sequences of 0s and 1s.

Example: A3 06 0000 0075

The above instruction is used to move contents from register AX to BX. These codes are directly
understood by the computer system and hence their execution is faster. They can be executed on the
system without any intermediary software. However, it is difficult for the programmer to read, write and
debug instructions in machine code.

In assembly level language, an instruction consists of a mnemonic and operands.

Example: MOV AX,BX

The above instruction copies the contents of register BX into register AX. These instructions are written
based on the number and type of general purpose registers available, the addressing modes and the
organization of memory. Though assembly code is easier to read and write than machine code, the
programmer must have thorough knowledge of using registers efficiently and choosing appropriate
instructions for faster execution and better utilization of memory. Assembly code requires an intermediary
program called an assembler, which converts it to machine code before execution; hence it is slower
compared to machine code.

In a high level language, instructions are written in a programming language like C, C++, Java, etc.

Example: c = a + b

This instruction adds the values of variables a and b, and stores the result in c. Such instructions are
easy for the programmer to read, write and debug, but are difficult for the system to understand. Hence
an intermediary software is required to convert the instructions to machine code and also check for errors.
Compilers serve this purpose: they convert a program written in a high level language to assembly level
language or machine language.

Compilers can also be used to convert code written in one high level language to another, like 'C' to
Java or Pascal to 'C'.

1.2 Why Compilers?

The sequence of steps involved in converting instructions in high level code to machine level code is
called language processing. The different components involved in language processing are

a. Preprocessor

b. Compiler

c. Assembler

d. Linker /loader

The steps involved in language processing are shown in Fig 1.1.

1.2.1 Preprocessor

The first step in language processing is pre-processing. Input to this phase is the source program. Different
parts of the source program may be stored in different files; for example, a function definition may be in
one file and the main program in another. The preprocessor collects all these files and creates a
single file. It also performs macro expansion. Macros are small sets of instructions written to perform
specific operations. In C, #define and #include are expanded during pre-processing. Some preprocessors
also delete comments from the source program.
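For instance, given the following small C fragment (a minimal sketch), the preprocessor replaces every use of the macro textually before the compiler proper ever sees the code:

/* before pre-processing */
#define PI 3.1416
#define AREA(r) (PI * (r) * (r))

double a = AREA(2.0);

/* after pre-processing, the compiler sees only: */

double a = (3.1416 * (2.0) * (2.0));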

1.2.2 Compiler

The compiler takes the pre-processed file and generates assembly level code. It also generates a symbol table
and a literal table. The compiler has an error handler which displays error messages and performs some error
recovery if necessary. In order to reduce the execution time and for better utilization of memory, the
compiler generates an intermediate form of the code and optimizes it. The functionality of the compiler
is divided into multiple phases, each performing a set of operations, like lexical analysis generating
tokens, the code optimiser optimizing code, etc.

1.2.3 Assembler

The assembler takes assembly code as input and converts it into relocatable object code. An instruction
in assembly code has two parts: an opcode and an operand part. The opcode specifies the type of
operation, like ADD for addition, SUB for subtraction, INC for increment, etc. The operand part
consists of the operands on which the operation is to be applied. These operands may be
memory locations, registers or immediate data. Assemblers may be single pass or two pass assemblers.
In a single pass assembler, reading the assembly code, generation of the symbol table and conversion of
opcodes to machine instructions are all done in a single pass. In a two pass assembler, the first pass reads the
input file and stores the identifiers in the symbol table. In the second pass, it translates opcodes to sequences of
bits (machine code or relocatable code) with the help of the symbol table.
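The second pass is needed because of forward references: an instruction may use a symbol that is defined only later in the file. A small illustrative fragment in the style of the earlier examples (the mnemonics and directives here are assumed):

MOV AX, COUNT ; COUNT is not yet defined at this point
JMP DONE ; neither is DONE
COUNT: DW 5 ; pass one records the address of COUNT in the symbol table
DONE: HLT ; pass one records the address of DONE

The first pass only assigns addresses and fills the symbol table; the second pass revisits each instruction and substitutes the recorded addresses for COUNT and DONE.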

1.2.4 Linker/Loader

In the final step, executable code is generated with the help of the linker and loader. The linker links
system wide libraries and resources supplied by the operating system, such as I/O routines and the memory
allocator. The loader resolves all relocatable addresses relative to a starting address and produces absolute
executable code.

1.3 The Translation Process

This section concentrates on the translation process of the compiler. The process of converting a source
program to target code requires many functions to be performed. The functions of the compiler can be
divided into six phases or steps, namely

• Lexical Analysis

• Syntax Analysis

• Semantic Analysis

• Intermediate code generation

• Code optimization

• Final code generation.

This is pictorially represented in Fig 1.2.

1.3.1 Lexical Analysis

The main function of the lexical analyser is to break the source program into tokens, which are
small meaningful sequences of characters. Tokens are classified as keywords, identifiers, operators,
etc. The lexical analyser creates a symbol table and a literal table to store identifiers and constants
respectively. This phase also detects some lexical errors and applies recovery strategies for them.

Consider the following example of 'C' code. The statement A=B*C+10 multiplies the values of
variables B and C, adds 10 to the product and stores the result in variable A. As A, B and C satisfy the rules for
an identifier, they are recognised as identifiers, a token called id is generated for each, and they are stored in the
symbol table. Similarly, * and + satisfy the rule for an operator, so they are recognised as operators and operator
tokens are generated. The constant 10 satisfies the rule for a numeric constant, so the token num is
generated and the value 10 is stored in the literal table. The lexical analyser uses regular expressions or finite
automata for defining the rules for tokens and validating them.

Example: A = B * C + 10

A, B, C are recognised as identifiers; *, +, = are treated as operators; 10 as a numeric value.
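Collecting these, the stream of tokens handed to the next phase for this statement would look roughly as follows (the token names are illustrative; each id carries a pointer to its symbol table entry):

Lexeme   Token
A        id (symbol table entry for A)
=        assignment operator
B        id (symbol table entry for B)
*        operator
C        id (symbol table entry for C)
+        operator
10       num (value 10 stored in the literal table)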

The lexical analyser skips blank spaces, tabs and new line characters in the source program.

1.3.2 Syntax Analyser

The syntax analyser determines the structure of the program. The tokens generated by the lexical
analyser are grouped together and checked for valid sequences as defined by the programming language.
The output of the syntax analyser is a parse tree or syntax tree, which is a hierarchical/tree structure of the
input. The syntax analyser uses a context free grammar to define and validate the rules for language constructs.
In the previous example, A, B and C would be converted to identifiers id1, id2 and id3 respectively. Fig 1.3
shows the syntax tree for the expression A=B*C+10. Leaf nodes are identifiers or constants, and
intermediate nodes and the root node are operators. In this phase each statement in the program is
checked against the definition of the language constructs. Example: expression = expression1 operator
expression2, where expression1 and expression2 can be identifiers or expressions themselves. The syntax analyser has
different error detection and recovery mechanisms, which help in correcting most syntax errors.
It also displays error messages to the user.

1.3.3 Semantic Analyser

Input to the semantic analysis phase is the parse tree or syntax tree. An important function of this
phase is type checking: it checks, for each operator, that the operands are of the types
permitted by the language specification. For example, + can be used for adding two integers, and the same + can
be used for adding two floating point numbers; some languages do not support adding an integer
to a floating point number. The semantic analyser also handles type conversion, since some languages support
mixed mode operations, i.e., binary operations on operands of different types. For the example A=B*C+10, if
A, B and C are floating point numbers and the language does not support mixed mode operations,
then the integer 10 has to be converted to the floating point number 10.0 using a function such as inttofloat( ).
The syntax tree generated would then be as in Fig 1.4. The semantic analyser also checks for semantic errors
such as a real identifier used for array indexing. The output of the semantic analyser is an annotated parse tree.

1.3.4 Intermediate Code Generator

The intermediate code generator generates intermediate code for the annotated parse tree. This code
depends on the source program and is independent of the target machine (the machine on which the target
program runs).

The intermediate code here uses three address code, with temporary variables to store the intermediate
results. Three address code is of the form address1 = address2 operator address3, where address2 and
address3 hold the two operands and the result of applying the operator to them is stored in
address1. Three-address code is commonly implemented as quadruples. For the above example, the product
of id1 (B) and id2 (C) is first stored in the temporary variable temp1; next the numeric value 10 is added to
temp1 and stored in temp2; finally the contents of temp2 are stored back in id3 (A).

Example: A = B * C + 10

temp1 = id1 * id2

temp2 = temp1 + 10

id3 = temp2
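Since the text notes that three-address code is implemented as quadruples, the same three statements can also be pictured as quadruple records (the column layout shown is the conventional one):

op   arg1    arg2    result
*    id1     id2     temp1
+    temp1   10      temp2
=    temp2           id3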

1.3.5 Code optimization

The code optimization phase is one of the most important phases of the compiler. The purpose of the code optimiser
is to reduce the number of operations or the amount of time taken for execution. It also takes care to use
the minimum number of temporaries to store intermediate values. Based on the time taken
to execute each instruction, the most appropriate instruction is selected. The code optimizer concentrates
on the parts of the code that will be executed many times and tries to optimise them. It also
guides the better utilization of registers for intermediate values, as register access is
always faster than memory access. Optimization is machine independent
when it is done on the intermediate code, and machine dependent when
it is done on the target code. Machine independent optimisations include:

• Replacing slower instructions by faster ones.

o Example: x = y * 2 may be replaced by x = y + y

• Eliminating parts of the code which are never reached.

• Concentrating effort on parts of the code which are executed many times.

All the above optimization strategies help in reducing the execution time of the program. Machine
dependent optimizations include using registers efficiently through a good register allocation
policy, and determining the cost of instructions so as to choose the most appropriate instruction.

1.3.6 Final Code Generation

Input to the code generation phase is the optimised intermediate code from the code optimization
phase, or the intermediate code with no optimization. This final phase generates code that runs on the target
machine, i.e., machine dependent code. Normally this will be assembly level code. An assembly instruction
has two parts: an opcode (mnemonic) part and an operand part. The target code generated depends on the number
of registers available on the target machine, the addressing modes available on
the system, the memory organization and the instruction set. There may be multiple ways to perform an
operation; the code generator has to find the most appropriate instruction sequence for the operation. For the
above example A=B*C+10, the final code generated could be as follows:

Example: MOV AX,B

MUL AX,C

ADD AX, #10

The first instruction moves the contents of memory location B into the AX register. The second instruction
multiplies the contents of memory location C with the AX register and stores the result in AX. The third
instruction adds the numeric value 10 to the contents of AX, so the final result is held in the AX
register.

1.4 Design of compilers

Compilers may be designed based on the functionalities they perform, like scanning, generating the
parse tree, generating intermediate code, optimizing it and finally generating target code. This type
of design is called design based on phases. Sometimes the design of a compiler may be
based on passes, on a front end and back end, or on an analysis part and a synthesis part. The following
sections deal with these different aspects of compiler design. Design based on functionality (phases)
has already been explained in detail in the previous section; let us now deal with the other aspects.

1.4.1 Single pass and Multipass

Compilers may be single pass or multipass compilers. In a single pass compiler all the
phases are executed in one pass: the input program is scanned only once, and
symbol table generation, error handling and code generation are all done during that scan. In a multipass compiler
the input program is scanned multiple times, for example one pass for scanning
and parsing, one pass for semantic analysis and source level optimization, and a final pass for code
generation.

1.4.2 Front End and Back End

In this form of design the complete set of compiler functions is divided into two groups, front end and back
end, as shown in Fig 1.5. The front end deals with the machine independent phases: lexical analysis,
syntax analysis, semantic analysis and intermediate code generation. It generates an intermediate
representation which is different from both the source and the target code. It also takes care of creation
of the symbol table and error handling. The front end depends on the source language and is independent of the target
machine. The back end deals with code optimization and code generation, and also takes care
of error handling. It depends on the target machine and is independent of the source language.

1.4.3 Analysis and synthesis

Compilers can also be studied in two parts, an analysis part and a synthesis part. In the analysis part the source
program is broken into pieces, its syntax and semantics are checked, errors are reported and corrected if
possible, and an intermediate form of the source program is created. The synthesis part constructs
the target program from this intermediate form.

1.5 Interactive Development Environment (IDE)

An Interactive Development Environment (IDE) provides a user friendly environment for the programmer for
creation of files, editing, compiling and running. Such tools make creating and
debugging code easier. Some of the important tools are the structured editor, debuggers or checkers, profilers
and the project manager.

Structured editor

A structured editor is the means through which the user enters the source program in the specified
language. Based on the language specification, the editor display helps in differentiating
keywords, identifiers, constants, etc. It also helps the user to differentiate between executable statements
like assignment statements and flow control statements (which are used by compilers) and non
executable statements like comments (which are not used by compilers). This provides a convenient
environment for

• Creation and editing of text: cut, paste, undo, search, etc.

• Indentation of the program to show its hierarchical structure.

• Display of keywords and identifiers in different colours.

• Display of comments in a different font.

• Support for running other tools.

Debuggers/Checkers

These are used to discover bugs without running the code, for example a part of the program that is never
executed, or logical errors like using a real variable as a pointer. Debuggers also provide break points and line
numbers, which can be used in debugging and error correction.

Profilers

Profilers are used to study the statistical behaviour of the object program, such as the
number of times a function is called or the percentage of execution time spent in each function. This helps in
estimating the time taken for execution of the entire program.

Project manager

Functions may be written in many different files. During compilation all of these
must be scanned and code generated for each independently. In such situations a project is
created to store all the related files so that they can be compiled together. The project manager helps in creating new
projects, compiling and linking project components, etc.

1.6 Bootstrapping and porting

In designing a compiler three languages are involved: the source language S, the target language T, and the
language L in which the compiler program that converts S to T is written. This can be represented as a
T diagram; L is the language that runs on the target machine. This is shown in Fig 1.6.

A compiler may be written in the same language that it compiles, as shown in Fig 1.7: the compiler
is written in S and converts source S to target code T.

If the target code is generated for a machine different from the one on which the compiler runs, the compiler is
called a cross compiler. This is shown in Fig 1.8, which shows how source S can be converted to target T with the
help of language K.

Bootstrapping is the process of refining a compiler from an inefficient design to an efficient one using quick
compilers. Some such cross compilers are shown in Fig 1.9 and Fig 1.10.

Fig 1.10 shows how an inefficient compiler can be improved into an efficient one.

1.6.1 Porting

Following are some examples of cross compilers that can help in porting a compiler to a new machine.
Fig 1.11 and Fig 1.12 show these cross compilers.

1.7 Data structure

A compiler needs to store all the identifiers defined in the source program. Identifiers may be
program variables or function names. Program variables are checked for their place of declaration
and their usage in the program, in order to avoid multiple definitions. In C all variables must be
declared before they are used. It is also necessary to check that a variable is assigned values of the
same data type as declared. Similarly, functions must be defined before they are called, and the number
of arguments, their data types and the return value of each call must match the definition. Hence
the compiler uses a data structure called the symbol table to store this
information. Apart from identifiers, the source program also has constants whose values do not change
during the execution of the program, like const A=10, and statements
which display constant strings.

Example: printf ("Hello"); here the string Hello has to be stored. Such strings and constants are stored in
a data structure called the literal table.

During compilation, errors like a missing parenthesis, an invalid function name or an invalid operator
may be found. In such cases the location of the error, like the line number, function name or file name, has to be
displayed to the user along with the error message; hence scope information of this kind is also kept in the symbol
table. While displaying an error message it is necessary to associate the message with the
location of the error. To store these locations, error tables are used. The error table is also
used by the error handler routine, which may perform some error corrections, like
adding a missing parenthesis, mainly in order to continue the compilation and generate
code.

1.7.1 Symbol table

The symbol table stores the description of the identifiers used in the source program. The contents of the
symbol table are the name of the identifier, its data type, scope and value. If the identifier is the name of a
function, the table also stores the list of arguments, their data types and the return type of the function.
The symbol table helps in avoiding multiple declarations of variables and in keeping track of scope
information. Each entry also has a pointer to the memory location where the identifier is stored.

Example: Consider the following C code.

int main()
{
    int x = 10;
    float y, z;
    printf("enter value for z\n");
    scanf("%f", &z);
    y = (float) x + z;
    printf("the value of y=%f\n", y);
}

Table Tab 1.1 shows the sample symbol table generated for the above example.
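Tab 1.1 itself appears as a figure in the original; based on the fields listed above (name, data type, scope, value), a plausible reconstruction for this fragment is:

Name   Data type   Scope    Value
main   int         global   function, no arguments
x      int         main     10
y      float       main     computed at run time
z      float       main     read at run time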

1.7.2 Literal table

The literal table stores the constant values and strings used in the source program. The main function of
the literal table is to conserve memory by reusing constants and strings. In the following
example a constant value is stored in 'a'; the value of 'a' does not change during the execution of
the program, so it can be stored in the literal table as Literal1. In the same way the printf statement displays the
string "Hello", which is also stored in the literal table, as Literal2. Tab 1.2 shows the sample literal table for
the example below.

Example:

int main()
{
    const int a = 7;
    printf("Hello");
}
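Tab 1.2 also appears as a figure in the original; following the description above, a plausible reconstruction is:

Entry      Value
Literal1   7
Literal2   "Hello"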

1.7.3 Error table

Whenever an error is found in any phase of the compiler, it is recorded in the error table. Based on the type of
error, error messages are displayed and, if required, some error corrections are made in order to
continue the compilation. A common error detected in lexical analysis is an
incomplete token, for which the error message would be "misspelled identifier"; similarly, an error detected during
syntax analysis is "mismatched parenthesis". Most syntax errors are detected
during syntax analysis. The error table stores the type of the error (lexical error, syntax error, etc.) along
with the offending string and the location of the error.

Example: Consider the following C code

Line no 1: main( )

Line no 2: {

Line no 3: in abc;

Line no 4: abc=10+(10*3;

Line no 5: }

In the above program there are two errors, one in line no 3 and another in line no 4. The error in line no
3 is a misspelled keyword: instead of int the user has entered in. So the error message will be "line no
3: misspelled keyword or unknown identifier in". Because int is entered as in, the data type of abc is
not clear, so there would be one more error message like "line no 3: data type of abc undefined".
The second error is an unmatched parenthesis, so the error message will be "line no 4: missing right
parenthesis". The error handler can also take care of exception handling.
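Putting the two errors above into the error-table format just described (error type, offending string, location), the entries would look roughly like:

Error type      String   Location
lexical error   in       line no 3
syntax error    (        line no 4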

Interpreters

Interpreters are similar to compilers, but an interpreter produces target behaviour without generating any
intermediate code. Some of the languages which use interpreters are BASIC, LISP, etc. An interpreter takes each
line of the source program and converts it to target code, whereas a compiler considers the entire program as a
single unit: it scans the entire program, generates intermediate code for all of it, and only then converts
it to target code. As interpreters do not generate intermediate code, there is less scope for code
optimization. If speed of execution is the primary concern then compilers are preferred, but if the time taken
to produce runnable output is the primary concern then interpreters are preferred.

1.8 Run time environment

One of the important aspects of compiler construction is the structure and behaviour of the runtime
environment. The run time environment deals with allocation of memory to the variables of the program.
Program variables may be static or dynamic. Static variables are those whose size is fixed
and does not change during program execution.

Example: int a, b[10];

The above 'C' declaration defines two variables: an integer a and an integer array b
consisting of 10 elements. If an integer value is stored in two bytes, the
memory allocated for a would be two bytes and the memory allocated for b would be 2*10 = 20 bytes.
Dynamic variables are those which come into existence only during runtime; hence their size cannot be
fixed in advance. Example: memory allocated using the function malloc().

There are basically three methods for memory allocation for runtime environment. They are

• Static allocation

• Stack based allocation

• Heap allocation

Some languages have only static variables and do not support dynamic variables or
recursive procedure calls. In this case memory can be allocated to variables at compile time; in
other words, the addresses of variables are known before the code is run. This binding of addresses
is called early binding, and such languages use a static run time environment. FORTRAN 77 uses a static run
time environment. A large set of languages like 'C' use pointers and dynamic variables
along with static variables, and also allow recursive procedure calls. For the local variables of recursive
procedures, memory cannot be allocated at compile time; it is allocated only at run time, which is called
late binding, and these languages use stack based allocation for it. In 'C', memory can additionally be allocated
dynamically using functions like malloc(), calloc() or realloc(); once the use of this dynamic memory is
completed, the programmer has to explicitly deallocate it, using a function like free(). In other
languages, memory has to be allocated when required and deallocated automatically after its use; this allocation
and freeing is done by the run time environment itself, through a facility like 'garbage collection'.
Languages with this feature use heap allocation.
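A short C fragment can make the three allocation classes concrete (a sketch, not from the text):

#include <stdio.h>
#include <stdlib.h>

int counter = 0;               /* static allocation: address fixed before the program runs */

int factorial(int n)           /* n and each recursive activation live on the stack */
{
    if (n <= 1) return 1;
    return n * factorial(n - 1);
}

int main(void)
{
    int *p = malloc(10 * sizeof *p);   /* heap allocation: late binding */
    if (p == NULL) return 1;
    p[0] = factorial(5);
    printf("%d\n", p[0]);
    free(p);                           /* explicit deallocation in C */
    return 0;
}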

1.9 Compiler Construction tools

Compiler construction tools are used to design the different phases of a compiler independently, so that
each phase can be designed efficiently for its purpose. Tools reduce the effort of designing the
compiler, and make it possible to integrate different types of analysis in the same tool. These tools are
called compiler-compilers, compiler-generators or translator-writing systems. Tools also help in
achieving compatibility within a range of compilers. Some of the tools are:

• LEX - This tool is used as a scanner generator. It generates tokens from the source program,
validates the tokens and displays error messages if any. Validation of tokens is done using
regular expressions or finite automata. Each regular expression has an associated
translation rule; if the input string satisfies a regular expression, the
corresponding translation is executed.

• YACC - YACC stands for Yet Another Compiler Compiler. This tool does the work of a parser
generator: it verifies whether the program statements match the language constructs.
It is based on context free grammar.

• Automatic code generators - These tools generate machine level code from
intermediate code using a set of conversion rules. The rules
are like templates for specific operations, and code conversion is
simply a template matching function.

• Data flow engines - These are mainly used for code optimization. There can be multiple ways
of implementing a single operation; a data flow engine performs data flow analysis and
finds the best way of executing the specific operation. The analysis can be based on the
amount of time taken to execute the code or on the minimum number of memory accesses.

Assignment:

1. Explain the working of two pass assembler.


2. Describe the concept of error detection and reporting in compiler.
3. Explain different compiler construction tools.
4. Describe the different data structures used in compiler.
5. Write a note on interpreters.
6. Explain the different tools used for analysis process of compiler.
7. Write a note on tool used for automatic code generator.
8. What are the different steps involved in converting source program to target code, explain
in brief.
9. Explain the function of syntax analysis with example.
10. Describe the functionality of preprocessors

Chapter 2
Lexical Analysis

2.1 The Role of lexical analyser

The primary goal of lexical analysis is to read the source program and break it into a sequence of units
called tokens. These tokens are grouped as identifiers, keywords, literals, operators, etc. Basically, the
lexical analyser is a pattern matching function which generates tokens with the help of regular
expressions or finite automata. Other functions of the lexical analyser are:

• Eliminating blank spaces, tabs and new line characters.

• Detecting lexical errors, if any, and correlating error messages with the position of the error, like the line
number, function where the error is detected or file where the error is found.

• Macro implementation.

The lexical analyser is designed in two parts. The source program is given as input to the first part, where the
program is broken into tokens; this part is called scanning. In the second part, each token
generated by scanning is analysed and grouped into a specific category; this part is called
analysis. The intention of dividing lexical analysis into two parts is to make the design of the lexical
analyser simpler and enhance portability. It also increases efficiency by allowing specialized
buffering techniques.

2.2 The Scanning Process

Any programming language statement consists of identifiers, control constructs, functions, etc. These
are to be converted to tokens, and so three terms are used in lexical analysis: lexeme, pattern and
token. A pattern defines the set of rules a string should satisfy in order to be placed in a specific category.
A lexeme is a string extracted from the source program which satisfies a pattern. A token is the symbol
sent to the parser for a lexeme that satisfies a pattern.

Example: The pattern to recognize an identifier is pattern-1 = "a string starting with a letter and followed by
letters or digits",

i.e. pattern-1 = l(l/d)* , and the token "identifier" is generated.

Consider the following source program

main()

{ int abc, pos, a12;

abc= 10;

…. }

When the lexemes abc, pos and a12 are considered, each satisfies pattern-1. Hence the token "identifier"
is sent to the parser for abc, pos and a12, i.e. id1 for abc, id2 for pos and id3 for a12.

Example: The pattern to recognize a number is pattern-2 = "one or more digits".

Then pattern-2 = digit+ and the token "number" is generated.

Considering the previous program statements, the lexeme 10 in line 3 satisfies pattern-2; hence the
token "number" is generated.

It is possible that more than one lexeme matches a pattern. Example: when the number 465 is scanned, it may
be treated as the number 4, the number 46 or the number 465, as all three match the pattern for
a number. The token is therefore generated for the longest matching lexeme, so only 465 is taken
as the lexeme and the token number is generated. Tokens influence parsing decisions, hence they
are entered into the symbol table or literal table along with their attributes. A token attribute may be
the string associated with an identifier or the value associated with a number. An entry in the symbol table
indicates the type of the identifier and the line numbers where it is defined or used.

Most languages define a list of keywords called reserved words. These words cannot be used
as identifiers; their meaning is predefined. Example: 'if' cannot be used as an identifier in 'C', as it is a
reserved word. In some languages like PL/1, keywords are not reserved words and can be used as
identifiers. Example: if then = else then else = then else then = else.

2.3 Data structure

Tokens may come from several sets: a set of reserved words {if, then, else, while, until, ...}, a set
of identifiers like {a, b, abdc, x, xyz, ...}, a set of numbers {34, 56, -99, 77.9, -45.8, ...}, a set of
operators {+, -, *, /, ...}, etc. In C each set can be declared as follows:

char *keywords[] = { "if", "then", "else", "while", ... };

char *operators[] = { "+", "-", "*", "/", ... };

char *ids[] = { "a", "abc", "x", "xyz", ... };

Tokens are stored in the symbol table along with their attributes. The records of the symbol table are of the
form token_record. A token can be any one of keyword, identifier, number or operator, hence
a union is used. Attributes of the token, like the name of a variable, are stored in stringval, and its value is stored
in numval.

struct token_record {
    union {
        char *keyword;
        char *op;
        char *id;
    } token;
    char *stringval;
    int numval;
};

2.4 Lexical errors

As stated earlier, the lexical analyser detects most lexical errors. Errors are specific to the source
program; some may be misspelled keywords or undefined variables. As soon as
errors are detected, the corresponding error messages are displayed.

Example: Consider the following C code

main()

{ int 2a, res4;

float x;

……}

In the above example, 2a does not match any pattern, so an error message like "2a is not a valid
identifier" would be displayed.

Example:

intx( );

In the above example there are two possible interpretations of the error:

Case 1: int x( ); may be the intended form, in which x is the function name and int is the return type of
the function. If x has not been declared, the error message would be "undefined identifier x".

Case 2: intx( ); may be the intended form, in which intx is the function name. If intx has not been declared,
the error message would be "undefined identifier intx".

The lexical analyser not only displays error messages, it should also perform some error recovery in
order to continue the compilation. Some of the lexical error recovery strategies are:

• Delete extraneous character from input

o Example: intt a; changed to int a; by deleting extra ‘t’.

• Inserting missing character to input

o Ex: flot x; insert ‘a’ to make it float x;

• Replacing an incorrect character with correct one

o Ex: whale( ) replace ‘a’ with ‘i’ to make it while( )

• Interchanging the adjacent characters

o Ex: fro( ;; ) interchange ‘o’ and ‘r’ to get for( ;; )

Recovery strategies should be chosen so that syntactically well formed code is obtained with the
minimum number of transformations.

2.5 Buffering

The source program is stored in buffers before scanning, and the buffer is scanned to retrieve
tokens. For this purpose two pointers are used, viz. the begin pointer and the forward pointer. Initially both
point to the beginning of the buffer (the start of the program).
The forward pointer moves one symbol at a time until a lexeme is found. The current lexeme is the string
between the begin pointer and the forward pointer. This lexeme is converted to a token and sent to the parser, and
the token is entered into the symbol table or literal table based on its type. In order to increase
the speed of token generation, the buffer is divided into two halves. Initially the first half is loaded,
with both the begin pointer and the forward pointer at its start. When the forward
pointer crosses the first half, the other half can be loaded with the next set of characters from the input
program. The following steps explain the operation of the lexical analyser with respect to buffering.

1. Initialize pointers (begin and forward) to beginning of buffer (program)

2. If the begin pointer and forward pointer are at the end of the file, exit from the lexical analyser

3. Advance forward pointer

4. If a proper lexeme exists between begin pointer and forward pointer,

begin

i. Convert lexeme into token and send to parser

ii. Make any entry in symbol table or literal table based on type of token

iii. Set begin pointer = forward pointer

iv. Advance forward pointer

v. go to step 5

end

else

i. advance forward pointer

5. If forward pointer is at end of first half

begin

i. Load second half

ii. Advance forward pointer

iii. go to step 2

end

6. If forward pointer is at end of second half

begin

i. Load first half

ii. Advance forward pointer

iii. go to step 2

end
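A minimal C sketch of this two-half buffer with begin and forward pointers follows (buffer size, refill details and the use of '\0' as a sentinel are assumptions; a real scanner must also ensure a lexeme never spans more than one reload):

#include <stdio.h>

#define HALF 4096

static char buf[2 * HALF + 2];   /* two halves, each followed by a '\0' sentinel */
static char *lexeme_begin;       /* begin pointer: start of the current lexeme */
static char *forward;            /* forward pointer: scanning position */
static FILE *src;

/* load one half with up to HALF characters and place the sentinel after them;
   assumes the source text itself contains no '\0' bytes */
static void load(char *half)
{
    size_t n = fread(half, 1, HALF, src);
    half[n] = '\0';
}

static void init(FILE *f)
{
    src = f;
    load(buf);
    lexeme_begin = forward = buf;
}

/* return the next character, reloading a half when a sentinel is crossed;
   returns EOF at the real end of input */
static int advance(void)
{
    int c = (unsigned char)*forward++;
    if (c == '\0') {
        if (forward == buf + HALF + 1) {            /* crossed end of first half */
            load(buf + HALF + 1);
            c = (unsigned char)*forward++;
        } else if (forward == buf + 2 * HALF + 2) { /* crossed end of second half */
            load(buf);
            forward = buf;
            c = (unsigned char)*forward++;
        } else {
            return EOF;        /* sentinel before a full half: real end of file */
        }
        if (c == '\0') return EOF;                  /* freshly loaded half was empty */
    }
    return c;
}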

2.6 Specification of tokens

The alphabet is the finite set of all characters, digits, operators and punctuation marks that can be used in
the source language.

Example: in the C language, ∑ = {0,1,…,9, a,b,c,…,z, A,B,…,Z, +, -, *, /, (, ), &, <, >, =, …}

A string is a finite sequence of symbols drawn from the alphabet.

Example: w = abc or s = abc234

2.6.1 Operations on string

1. Prefix of string s - a string obtained by eliminating zero or more trailing symbols of s.

Example: Let s = abcde; then the prefixes of s are ε, a, ab, abc, abcd, abcde.

If a prefix p is not equal to the string s, then p is called a proper prefix.

2. Suffix of string s - a string obtained by eliminating zero or more leading symbols of s.

Example: Let s = abcde; then the suffixes of s are ε, e, de, cde, bcde, abcde.

If a suffix s1 is not equal to the string s, then s1 is called a proper suffix.

3. Substring of string s - a string obtained by deleting a prefix and a suffix of s.

Example: Let s = abcde; then the substrings of s include ε, ab, abc, cde, bcd, abcde, etc.

Note that ae is not a substring of s. If a substring s2 is not equal to the string s, then s2 is called a
proper substring.

4. Subsequence of string s - any string obtained by deleting zero or more, not necessarily
consecutive, symbols of s.

Example: Let s = computer; then p = mue is a subsequence of s.

5. Concatenation (product) of strings (.) - if s1 and s2 are two strings of arbitrary length,
then their concatenation s = s1.s2 consists of all the symbols of s1, in the order they
appear in s1, followed by all the symbols of s2, in the order they appear in s2.

Example: s1 = comp, s2 = iler, s = s1.s2 = compiler

s.ε = ε.s = s

2.7 Regular Expressions and language

Regular expressions are a shorthand notation for sets of strings.

• Epsilon (ε) is a regular expression denoting the set containing only the empty string.
• Any letter of the alphabet is also a regular expression, denoting the set containing the one-
letter string consisting of that letter. Example: if a ∈ ∑, then a is a regular expression.
• If r and s are regular expressions then r | s is a regular expression denoting the union of r and
s. The language denoted by r|s is L(r) ∪ L(s), where L(r) and L(s) are the languages generated by r
and s respectively.
• If r and s are regular expressions then rs is a regular expression denoting the set of strings
consisting of a member of r followed by a member of s; this is called concatenation.
The language generated by rs is L(r)L(s).
• If r is a regular expression then r* is a regular expression denoting the set of strings
consisting of zero or more occurrences of r. The language generated by r* is (L(r))*.
• Regular expressions can be parenthesized to specify operator precedence (the usual precedence,
from highest to lowest, is closure, concatenation, union).

Although these operators are sufficient to describe all regular languages, in practice some extensions
are used:

• If r is a regular expression then r+ is a regular expression denoting the set of strings
consisting of one or more occurrences of r.
• If r is a regular expression then r? is a regular expression denoting the set of strings consisting
of zero or one occurrence of r. It is equivalent to r|ε.
• The notation [abc] is short for a|b|c, [a-z] is short for a|b|...|z, and [^abc] stands for any character
other than a, b or c.

2.7.1 Some Regular Expression Examples

The best way to get a feel for regular expressions is to see examples. Note that regular expressions
form the basis for pattern matching in many UNIX tools such as grep, awk, perl, etc.

Following is a list of regular expressions for the different lexical items that appear in C
programs.
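The original list appears as a figure; typical entries, written in the bracket notation introduced above (the exact table in the source may differ), are:

identifier         [a-zA-Z_][a-zA-Z_0-9]*
integer constant   [0-9]+
real constant      [0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?
string constant    "[^"]*"
relational op      <|<=|>|>=|==|!=
white space        [ \t\n]+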

2.8 Finite Automata

A finite automaton (FA) is an abstract mathematical machine, also known as a finite state machine,
with the following components:

1. A set of states S
2. A set of input symbols ∑ (the alphabet)
3. A transition function δ, where move(state, symbol) = new state
4. A start state S0
5. A set of final states F

The word finite refers to the set of states: the machine has a fixed, finite number of states. The word automaton
refers to the execution mode, which executes the same step over and over:

while ((c = getchar()) != EOF) S = move(S, c);

Finite automata are mainly of two types, Deterministic Finite Automata (DFA) and Nondeterministic
Finite Automata (NFA). In a DFA the transition function δ yields exactly one state per state and input symbol. An
NFA can yield multiple states on a single input. A special category of NFA, the NFA with ε-moves,
can move to further states on ε, i.e. without consuming any input.

2.9 Deterministic Finite Automata

The type of finite automaton that is easiest to understand and simplest to implement is called a
deterministic finite automaton (DFA). The word deterministic here refers to the return value of the
function move(state, symbol), which is at most one state.

Example: Following is an example of a DFA.

S = {s0, s1, s2}

∑ = {a, b, c}

Transition function δ

S0 = s0
F = {s2}

To draw the transition diagram for a finite automaton:

• Draw a circle for each state s in S; put a label inside the circle to identify each state by
number or name.
• Draw an arrow from Si to Sj labeled with x whenever move(Si, x) = Sj.
• Draw a "wedge" into the start state S0 to identify it.
• Draw a second circle inside each of the final states in F.

Following figure Fig 2.1 is the transition diagram for the above example.

2.9.1 DFA Implementation

DFAs represent regular expressions through transition diagrams, and they can be implemented
efficiently on computers. A lexical analyzer might associate different final states with different
token categories.

Example: Consider the DFA to recognize the increment, addition and assignment operators.

For this example S = {0,1,2,3,4}, start state S0 = 0, ∑ = {+, =, other}, final states F = {2,3,4}.

Transition function δ

Example: Consider the following DFA, which validates C comments.

For this example S = {0,1,2,3,4,5}, start state S0 = 1, ∑ = {/, *, other}, final states F = {5}.

Transition function δ
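The transition tables themselves appear as figures in the original; as an illustration of how such a table-driven DFA can be coded, here is a hedged C sketch for the C-comment recognizer (state numbering assumed to mirror the example above: state 1 is the start, state 5 accepts, and state 0 is used as the dead/error state):

#include <stdio.h>

enum { DEAD, START, SEEN_SLASH, IN_COMMENT, SEEN_STAR, ACCEPT };

/* classify a character into the DFA's input symbols: '/', '*', other */
static int class_of(int c) { return c == '/' ? 0 : c == '*' ? 1 : 2; }

/* delta[state][class]: the transition table of the comment DFA;
   unlisted entries default to DEAD */
static const int delta[6][3] = {
    /*             '/'         '*'         other      */
    [START]      = { SEEN_SLASH, DEAD,       DEAD       },
    [SEEN_SLASH] = { DEAD,       IN_COMMENT, DEAD       },
    [IN_COMMENT] = { IN_COMMENT, SEEN_STAR,  IN_COMMENT },
    [SEEN_STAR]  = { ACCEPT,     SEEN_STAR,  IN_COMMENT },
};

/* returns 1 if s is exactly one C comment, 0 otherwise */
int is_c_comment(const char *s)
{
    int state = START;
    for (; *s && state != DEAD && state != ACCEPT; s++)
        state = delta[state][class_of((unsigned char)*s)];
    return state == ACCEPT && *s == '\0';
}

int main(void)
{
    printf("%d\n", is_c_comment("/* hello */"));   /* 1 */
    printf("%d\n", is_c_comment("/* a * b"));      /* 0: unterminated */
    return 0;
}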

2.10 Nondeterministic Finite Automata (NFA)

Notational convenience motivates more flexible machines in which the function move() can go to more
than one state on a given input symbol, and some states can move to other states even without
consuming an input symbol (ε-transitions). This type of automaton is termed a nondeterministic finite
automaton.

One can prove that for any NFA there is an equivalent DFA; NFAs are just a notational convenience.
So, finite automata help us design a machine which converts a set of regular expressions into a
computer program that recognizes them efficiently.

NFA Examples

ε-transitions make it simpler to merge automata. Following is an automaton that recognizes the strings ++,
+= and +. If the automaton stops at state 3 it has recognized +=, at state 6 it has recognized ++, and
at state 8 it has recognized +.

For this example S = {0,1,2,3,4,5,6,7,8}, start state S0 = 0, ∑ = {+, =}, final states F = {3, 6, 8}.

Transition function δ

The above transition diagram can be simplified and rewritten as follows. From this example it can be
concluded that multiple transitions on the same symbol handle common prefixes.

By factoring the above transition diagram it is possible to reduce the number of states, giving the
following transition diagram.

Regular expressions can be converted automatically to NFA's using the following set of rules.

Each rule in the definition of regular expressions has a corresponding NFA.

1. For ε, draw two states with a single ε transition.

2. For any letter in the alphabet, draw two states with a single transition labeled with that letter.

3. For regular expressions r and s, draw r | s by adding a new start state with ε transitions to the
start states of r and s, and a new final state with ε transitions from each final state in r and s.

4. For regular expressions r and s, draw rs by adding ε transitions from the final states of r to the
start state of s.

5. For regular expression r, draw r* by adding new start and final states, and ε transitions
o From the start state to the final state,
o From the final state back to the start state,
o From the new start to the old start and from the old final states to the new final state.

6. For parenthesized regular expression (r) we can use the NFA for r.

Operations used to keep track of sets of NFA states:

ε_closure(s): the set of states reachable from state s via ε

ε_closure(T): the set of states reachable from any state in set T via ε

move(T,a): the set of states to which there is an NFA transition from states in T on symbol a

2.11 NFA to DFA Algorithm

Following is the algorithm (subset construction) for converting an NFA to a DFA:

Dstates := { ε_closure(start_state) }
while T := unmarked_member(Dstates) do {
    mark(T)
    for each input symbol a do {
        U := ε_closure(move(T,a))
        if not member(Dstates, U) then
            insert(Dstates, U)
        Dtran[T,a] := U
    }
}
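The ε_closure operation at the heart of this algorithm is easy to implement with bitsets when the NFA is small. The following C sketch (the state count and the contents of the eps table are assumptions for illustration) computes it by iterating until the set stops growing:

#include <stdio.h>

#define NSTATES 8
/* eps[i] = bitset of states reachable from state i by ONE epsilon move */
static unsigned eps[NSTATES];

/* epsilon-closure of the state set T (bit i set <=> NFA state i is in the set):
   keep adding epsilon successors until the set stops growing */
static unsigned eps_closure(unsigned T)
{
    unsigned old;
    do {
        old = T;
        for (int i = 0; i < NSTATES; i++)
            if (T & (1u << i))
                T |= eps[i];
    } while (T != old);
    return T;
}

int main(void)
{
    /* illustrative NFA: 0 --eps--> 1, 1 --eps--> 2 */
    eps[0] = 1u << 1;
    eps[1] = 1u << 2;
    printf("%#x\n", eps_closure(1u << 0));   /* prints 0x7, i.e. {0,1,2} */
    return 0;
}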

Example: Converting NFA to DFA

2.12 Design of Lexical analyzer generator

Following is a sketch of a lex specification for a small language; install_id() is used instead of
strdup() to avoid duplicating identifiers in the lexical data.

%{
/* #define's for token categories LT, LE, etc. */
%}

ws [ \t\n]+
digit [0-9]
id [a-zA-Z_][a-zA-Z_0-9]*
num {digit}+(\.{digit}+)?

%%

{ws} { /* discard */ }
if { return IF; }
then { return THEN; }
else { return ELSE; }
{id} { yylval.id = install_id(); return ID; }
{num} { yylval.num = install_num(); return NUMBER; }
"<" { yylval.op = LT; return RELOP; }
">" { yylval.op = GT; return RELOP; }

%%

install_id()
{
/* insert yytext into the literal table */
}

install_num()
{
/* insert the binary number corresponding to yytext into the literal table */
}

2.13 LEX

Lex programs take a lexical specification given in a .l file and create a corresponding C language
lexical analyzer in a file called lex.yy.c. The lexical analyzer is then linked with the rest of your
compiler.

The C code generated by lex has the following interface. Note the use of global variables instead of
parameters, and the use of the prefix yy to distinguish scanner names from program names. This
prefix is also used in the YACC parser generator.

FILE *yyin; /* set this variable prior to calling yylex() */


int yylex(); /* call this function once for each token */
char yytext[]; /* yylex() writes the token's lexeme to an array */
int yywrap(); /* called by lex when it hits end-of-file */

The .l file format consists of a mixture of lex syntax and C code fragments. The percent sign (%) is
used to define lex elements. The whole file is divided into three sections separated by %%:

Header section
%%
Body
%%
Helper functions

The header consists of C code fragments enclosed in %{ and %} as well as macro definitions
consisting of a name and a regular expression denoted by that name. lex macros are invoked explicitly
by enclosing the macro name in curly braces. Following are some example lex macros.

letter [a-zA-Z]
digit [0-9]
identifier {letter}({letter}|{digit})*

The body consists of sequence of regular expressions for different token categories and other lexical
entities. Each regular expression may have a C code fragment enclosed in curly braces that executes
when that regular expression is matched. For most of the regular expressions this code fragment (also
called a semantic action) consists of returning an integer that identifies the token category to the rest
of the compiler, particularly for use by the parser to check syntax. Some typical regular expressions
and semantic actions are

[ \t\n]+ { /* no-op: discard whitespace */ }

{identifier} { return IDENTIFIER; }
"*" { return ASTERISK; }

Regular expressions are also required for lexical errors such as unterminated character constants or
illegal characters.

The helper functions in a lex file typically compute lexical attributes, such as the actual integer or
string values denoted by literals. One of the helper functions is yywrap(), which is called when lex hits
end of file. yywrap() returns 1 if lex should quit at end of file. If yywrap() switches yyin to a different file
and wants lex to continue processing, it returns 0. The lex library (-ll) has a default
yywrap() function which returns 1.
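To make the workflow concrete, a specification in a file scanner.l (the file name is illustrative) is typically turned into a running scanner with the standard lex and cc pipeline:

lex scanner.l         # produces lex.yy.c
cc lex.yy.c -ll       # -ll supplies a default main() and yywrap()
./a.out < input.txt   # runs the scanner on an input file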

Assignment:

1. Classify the lexemes that make up the tokens in the following program.
Generate symbol table and literal table
float max (int i, int j)
{ int x; float y;
x = i + j * 10;
y = (float) x / 25.3;
return y;
}

2. Write regular expression for the following language and draw transition diagram
1. All strings starting with the letter ‘c’
2. All strings which has even no. of 0’s and odd no. of 1’s
3. All strings which has even no. of 0’s and even no. of 1’s

3. Construct NFA for the regular expression (a*/b*)*

4. Construct NFA for the regular expression (a/b)*abb (a/b)

5. Construct NFA for the regular expression and convert it to DFA


(a / b) * a (a / b)

6. Draw DFA that accepts the reserve words case, char, const and continue of C language.

7. Classify the lexemes that make up the tokens in the following program.
Generate symbol table and literal table.
main ( )
{ float a[10], x;
int i;
for (i = 1; i < 10; i++)
{
a[i] = 0;
printf("%d", a[i]);
}
}

8. Write transition diagram to recognize relational operators in C.

9. Construct NFA and convert to DFA for the regular expression (b)+a(a/b/)

10. Construct an NFA for recognizing a number, where the regular expression is digit+(.digit+)?(E(+/-)?digit+)?

Chapter 3

Syntax Analysis

3.0 Introduction

Syntax analysis is the part of a compiler which checks the syntax, or grammatical
correctness, of statements in a computer program written in a high level language like C,
C++ or Java. Just as in English we check a sentence against the grammar of the language to know whether
it is correct, a program is checked against the grammar of the programming language. The grammar used for
programming languages is known as 'Context Free Grammar', because the meaning of a statement
in a programming language is independent of the context in which it is used. In contrast, the meaning of a
sentence in English depends on the context in some cases. For example, the meaning of 'charge' could be 'cost',
'atomic charge' or 'fight' depending on the context; it could also mean 'charge a battery' if you are in a battery shop.

In this chapter we will study context free grammars, which are used to express the syntax of all
programming languages. Any statement can be expressed as a parse tree, which is a pictorial
representation of its syntax as derived from the given grammar. We will also discuss a type of
grammar called 'ambiguous grammar', which we try to avoid when defining the syntax of any programming
language. Formal properties of context free languages are briefly discussed, and finally a tool called a
'parser generator', which is part of all modern compilers, will be highlighted.

3.1 Role of a parser.

In the previous chapter we discussed the lexical analyzer, whose main job is to identify the tokens
in a given statement. This stream of tokens is the input to the syntax analyzer. The job of the parser is:

1. To recognize a valid statement, represented by the stream of tokens, as per the syntax of the
language. If it is a valid statement, it is represented by a parse tree.

2. If it is not a valid statement, to display a suitable error message, so that the
programmer can correct the syntax error.

3.2 The parsing process

To understand the parsing process, let us parse, i.e. check the grammatical correctness of, an English
language sentence, wherein only a subject and a verb are allowed. This can be expressed as a grammar as
follows.

<sentence> → <subject> <verb>

<subject> → Ram │ Raj

<verb> → goes │ comes │ eats

The above grammar says that all items enclosed in angle brackets (<>) are generic tokens which can
be replaced by the items on the right hand side of their rules. The items without angle brackets are actual tokens, or
terminal strings, that appear in a sentence.

If the sentence "Raj eats" is given, the lexical analyzer will produce the stream of generic tokens
<subject> <verb>. The job of the parser is to recognize that this stream represents a correct sequence of
tokens for a sentence as defined by the grammar given above.

Since the parse succeeds, it is represented as the parse tree shown in Fig. 3.1.

On the other hand, if the input sentence were "Raj jumps", the lexical analyzer would be able to
recognize the token 'Raj' as <subject> but would not be able to recognize the word 'jumps' as any valid
token. The lexical analyzer would indicate an error, and hence the syntax analyzer would not recognize the input as
any valid sentence.

Now consider the input sentence "comes Ram". The lexical analyzer recognizes this as the stream of
tokens <verb> <subject>. But the syntax analyzer cannot recognize it as a valid sentence, since
the <subject> <verb> tokens are reversed in the actual sentence. Therefore no parse tree is generated
and an error is indicated by the syntax analyzer. Note that the order of occurrence is also important.

3.3 Context free grammars.

Context Free Grammars (CFG) are used to represent the grammatical structure of all programming
languages like C, Java, etc. Understanding CFGs is important from the point of view of parsing an input
sentence: the parser uses the CFG as its database for checking the grammatical correctness of statements in a
programming language.

Definition of CFG: denoted by G = ( V, T, P, S )

where V is a set of variables (non-terminals),

T is a set of terminals,

P is a set of productions,

S is the unique start symbol, with S ∈ V.

Normally only the production rules are given, and V and T are recognized by the convention that V uses
single capital letters or symbols while T uses lower case letters and strings. Each production is
written as a sort of equation, with the LHS consisting of exactly one element from the set of variables and the
RHS consisting of a combination of non-terminals and terminals; LHS and RHS are separated by the symbol
→. For example, S → aaBd indicates that the non-terminal S can be replaced by aaBd.

Note: A production rule such as S → a S b │ c is shorthand for the two rules S → a S b and S → c.

Example: Let us consider the grammar G, where

S → a S b │ c. Find all four components of G.

Solution: G = ( V, T, P, S )

V= {S}

T = { a, b, c }

P = { S → a S b, S → c }

S= S

Example 3.2: Show how the above grammar validates the input string 'aacbb'.

Solution: The lexical analyzer gives no error, as each input symbol is a valid terminal. With the
following parse tree we conclude that the above input is indeed generated by the given grammar. Note
that we start the tree with the start symbol S as the root node. S is replaced by one of its productions,
either a S b or c; the input dictates that the suitable replacement is S → a S b. At each
level the non-terminal is replaced by the corresponding RHS of a production, as shown in Fig 3.2.

Example: Consider the grammar G with productions.

S→abB

B → c d B │ e f │ Є

where Є denotes no symbol, i.e. the empty string. Show that the above grammar derives abcdcd.

Solution: The grammar is defined by

V = { S, B}

T = { a, b, c, d, e, f }

P = { S → a b B, B→ c d B, B → e f, B → Є }

S=S

Derivation

The yield is the string derived by a parse tree, read from the leaves, left to right.

We have seen how a parse tree shows the derivation of a given string using the productions of the
grammar, always starting from the start symbol as the root node of the parse tree. There are other ways to
show how the input string can be derived, namely:

– using leftmost derivation

– using rightmost derivation.

3.4 Leftmost Derivation:

Here we use the symbol ⇒, read as 'directly derives'. For example, with the grammar
S → a S b │ c, the leftmost derivation of the string aaacbbb is shown below.

S ⇒ aSb /* always start with the start symbol and replace the leftmost non-terminal in the RHS;
here it is S (the underlined symbol); continue the procedure till the given string is derived */

⇒ aaSbb

⇒ aaaSbbb /* note S is replaced by aSb and the other symbols are undisturbed */

⇒ aaacbbb /* S is replaced by c */

Example: Consider a grammar

S → a A B

A → b B C

B → d

C → e

Show the derivation for abded

Solution:

S  a A B→ /* Note only one symbol is replaced at each step */

SabBCB /* only leftmost B is replaced by d */

SabdCB /* only leftmost C is replaced by e */

SabdeB

Sabded

3.5 Rightmost Derivation:

In deriving a sentence we replace the rightmost non-terminal in the RHS at each step. For the previous example
the rightmost derivation is as follows.

S ⇒ aAB /* B is the rightmost non-terminal of the RHS, hence replace it with its production */

⇒ aAd /* now the rightmost non-terminal is A */

⇒ abBCd /* the rightmost non-terminal now is C */

⇒ abBed /* the rightmost non-terminal is B */

⇒ abded

Rightmost and leftmost derivations are equivalent: they derive the same string. As you must have
observed, leftmost derivation is easier than rightmost derivation, as the input guides us on which
production to use at each step.

In Chapter 4 we will see that parsers are classified according to leftmost or rightmost derivation;
modern parsers derive the rightmost derivation (in reverse).

Example: Consider the grammar

S→aAS│a

A → S b A│S S│b a

Draw the derivation tree for aabbaa. Also show leftmost and rightmost derivations.

Solution:

Leftmost derivation:

S  aAS

 aSbAS

 aabAS

 aabbaS

 aabbaa

Rightmost derivation:

S  aAS

 aAa

 aSbAa

 aSbbaa

 aabbaa

Note: The rightmost derivation can be easily derived after the parse tree is drawn.

Example: Consider the grammar

S → b A │a B

A→bAA│ab│a

B→aBB│bS│b

Draw the derivation tree for

3.6 Abstract Syntax Tree:

A parse tree shows the derivation of a given string using the grammar. But the final aim is to get the
translation to intermediate code (intermediate code generation is explained in detail in Chapter 7).
To generate the intermediate code we do not require all the steps that are needed to show the
derivation. Hence abstract syntax trees are used. An abstract syntax tree retains the essential features of
a parse tree that are important for deriving the intermediate code.

Example: Consider the Grammar

E→E+E

E→E*E

E → a│b │ c

The parse tree for the string a + b * c is shown in Fig 3.6 and the corresponding abstract syntax tree in
Fig. 3.7.

We will see that from abstract syntax tree or simply syntax tree it is easier to construct the
intermediate code. As the computer is able to execute one operation at a time, the intermediate code
consists of series of calculations as shown below. Note that the evaluation is from left to right and
bottom to top.

t1 = b * c
t2 = a + t1

t2 is equivalent to the given expression in high level language i.e., a + b * c.

Example: Construct both Parse Tree and Abstract Syntax Tree for the following expressions using
the grammar given below

E → E + E

E → E * E

E → a │ b │ c

i) a+b+c ii) b*d*c

Solutions: i) a + b + c

Solutions: ii) b * d * c

Example:

Given the following grammar, construct Parse Trees and Abstract Syntax Trees

E → E + T │T

T → T* F│ F

F → a │b│ c

For i) a + c * b ii) a * b * c * b iii) a * c + b * a

Solution: i) a + c * b

Solution: ii) a * b * c * b

iii) a * c + b * a
(Figure: Parse Tree and Syntax Tree)

Note: Generation of Intermediate Code is left as an exercise (see Chapter – 7)

3.7 Ambiguity

The principle behind the design of programming languages is "Syntax Implies Semantics": from the
syntax itself the semantics, or meaning, is clear. As noted in our earlier discussion, English or any
natural language does not have this property. Programming languages are deliberately designed to
have it; otherwise translation would be difficult or ambiguous. We will now look at a grammar which,
like natural language, is ambiguous.

Example: Consider the following grammar

E → E + E

E → E * E

E → a│ b│ d

Parse tree for expression b * d + a can be as follows:

From our knowledge of precedence and associativity, we know that we must multiply b * d first and then
add a to it. But the tree of Fig. 3.18 interprets it as: add d and a, then multiply the result by b; this is the
representation in Fig. 3.18(a). To verify this, let us take the values b = 5, d = 2 and a = 3.

The correct answer is 5 * 2 + 3 = 13, but the syntax tree of Fig. 3.18(b) gives the answer as 5 * (2 + 3) = 25,
which is wrong.

The second tree, shown in Fig. 3.19, gives the correct answer. But the derivations of both Fig. 3.18 and
Fig. 3.19 are valid parse trees, and we have no method to force the derivation of only the second tree.

This is a serious matter: we were expecting the answer 13 and the computer would give us 25. This
happened because the grammar chosen was ambiguous.

An unambiguous grammar for the same language is discussed in ex 3.6; it always draws the tree as per
the rules of associativity and precedence, and always gives the correct answer.

3.8 Definition: Ambiguous grammar

If for any given sentence there are two or more parse trees or syntax trees, then the grammar is
ambiguous.

Example: Consider the following grammar

A → BC │ aaC

B → a │ Ba

C → b

Show that this grammar is ambiguous for the string ‘a a b’

Solution: There are two distinct leftmost derivations: A ⇒ aaC ⇒ aab, and A ⇒ BC ⇒ BaC ⇒ aaC ⇒ aab.
Hence the grammar is ambiguous.

Example : Consider the grammar given below

S→ aSbS│bSaS│ Є

Show that for a string ‘a b a b’ the grammar is ambiguous

Solution:

S aSbS S  aSbS

abS  abSaSbS

abaSbS  abaSbS

ababS (replace s by Є )  ababS

abab  abab

Leftmost derivation: Tree 1 Leftmost derivation: Tree 2

Example: Show the following grammar is ambiguous for string ‘abb’

S→abB│aA

A→bA│b

B→b

Solution:

Derivation – I: S ⇒ abB ⇒ abb

Derivation – II: S ⇒ aA ⇒ abA ⇒ abb

Again we have two leftmost derivations for the same string; therefore the grammar is ambiguous.

3.9 Extended Notations

We have seen the description of a context free grammar in terms of four components, namely
terminal symbols, non-terminal symbols, productions and a unique start symbol. We shall now study some
notations which are equivalent but offer varying degrees of convenience. In this context we
will discuss the three notations listed below.

1. BNF Notation

2. EBNF Notation

3. Syntax diagrams

3.10 BNF Notation:

The BNF (Backus-Naur Form) grammar was developed for the syntactic definition of
ALGOL (Algorithmic Language) by John Backus. At about the same time a similar grammar form,
the context-free grammar, was developed by Noam Chomsky for the definition of natural language
syntax. The BNF and context-free grammar forms are equivalent in power; the differences are
essentially only notational.

A BNF grammar or notation is composed of a finite set of BNF grammar rules, which
together define a programming language. A BNF grammar defines a language in simple cases by
directly listing the elements. For example

< binary digit> : : = 0 | 1

The above grammar rule is read as "A binary-digit is either a ‘0’ or a ‘1’". The term binary-digit
serves as a name for the language defined by the grammar rule. The meaning of each symbol is as follows:

: : = means “is defined as”

| is read as or and separates alternatives

< > defines a syntactic category or name

Once we define a basic set of syntactic categories, we may use these in constructing more complex
languages.

Example 1:

< digit > : : = 0 | 1 | 2 | 3 | … | 9

< unsigned integer > : : = < digit > | < unsigned integer > < digit >

Here we have defined an unsigned integer as a sequence of digits by using the syntactic
category <unsigned integer> recursively. We may parse the unsigned integer 342 as follows, using the
grammar rules given.

Example 2:

<signed integer> : : = <sign> <integer>

<sign> ::= + | -

<digit> ::= 0|1|2|3|4|5|6|7|8|9

<integer> : : = <digit> | <digit> <integer>

Parsing for + 450

Example 3: The previous Example 2 could alternatively be defined as

<signed integer> : : = <sign> <unsigned integer>

since unsigned integer is already defined in Example 1.

Example 4:

<variable> : : = x | y

<exp> : : = ( <exp> ) | <exp> + <exp> | <exp> * <exp> | <variable>

<exp> represents expression

Here exp is defined as exp enclosed in round brackets, or exp + exp, or exp * exp, or variable; variable
has already been defined as x or y. Using the BNF grammar, parse ( x + y ) * y.

3.11 EBNF Notation:

Extended BNF notation is a notation designed to avoid recursion; in the process it makes the following
modifications:

(1) no angled brackets for syntactic units, (2) = is used instead of ::=,
(3) quotes are used for terminal symbols

S = ‘a b’

B = ‘+’ [ ‘a b’ ]

C = ‘+’, { ‘a b’ }

Where [ ] denotes repetition zero or one time,

{ } denotes repetition zero, one or more times.

With this interpretation the derived strings are

S = a b

B = +, + a b

C = +, + a b, + a b a b, + a b a b a b, …

Examples: signed integer (SI)

D = ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’

SI = ( ‘+’ | ‘-’ ), D, { D }

Example: Signed Integer (SI)

D = ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’

SI = [ ‘+’ | ‘-’ ] D { D }

Note here that the sign is optional, i.e., the following expressions are all correct as per this grammar:

-56, 56, +56, +567, 858

(Note: As per the previous grammar, 56 and 858 are invalid, since the sign there is compulsory.)

3.12 Syntax Diagrams

Syntax diagrams are formed when EBNF rules are represented graphically. For example, the
signed integer (with optional sign) is represented as follows.

It can be interpreted as: a signed integer begins with +, -, or nothing, followed by one or more digits.
This explains the signed integer much more clearly, and all possible paths can be traced to know the
various ways of representing it.

Example:

Consider the syntax diagram of identifier in ‘C’ language.

Where L = ‘a’ | ‘b’ | ‘c’ | … | ‘z’ | ‘A’ | ‘B’ | … | ‘Z’

D = ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’

The interpretation is that an identifier should start with a letter and can optionally be followed by a
letter or a digit any number of times.

3.13 Formal properties of context free languages

Basically, a context free language is recognized by a stack machine. For example, consider
recognizing

L = { an b cn │ n ≥ 1 }

Using a stack

➢ Put a unique bottom-of-stack symbol, say Ǿ, to start with.

➢ Push the a’s till a b is encountered, as shown below. Assume there were 4 a’s.

➢ When b is read, neither push nor pop.

➢ When the c’s are read, pop one ‘a’ for each ‘c’.

➢ If the number of a’s and c’s are equal, then Ǿ is exposed. The stack machine accepts when no
more input is present and Ǿ is on the top of the stack.

➢ The stack machine is able to tell that the string is not of type L if, when the c’s are over, the top of
the stack is still ‘a’, or if, while c’s are still being read, there are no a’s left to pop; in either
condition the input is not of type L. A minimal C sketch of this recognizer follows.

The above mechanism leads us to the conclusion that L = { an bm cn dm │ n ≥ 1 and m ≥ 1 } cannot be
recognized: once the a’s and b’s are pushed, the top of the stack contains b’s, so the c’s cannot be
cross-checked against the a’s, which lie below the b’s and are not accessible. An implication of this
result is that the matching of formal and actual parameters in the ‘C’ language cannot be checked by the
grammar; the compiler does this checking using housekeeping and a database such as the symbol table.

Example 1: Consider the functions add and subtract. add is a function to add the three numbers in its
parameter list; subtract subtracts the second parameter from the first.

#include <stdio.h>

int add(int a, int b, int c)      /* adds the three numbers */
{ return (a + b + c); }

int subtract(int x, int y)        /* subtracts y from x */
{ return (x - y); }

int main()
{
    int e, f, g, h, i, result1, result2;
    scanf("%d %d %d %d %d", &e, &f, &g, &h, &i);
    result1 = add(e, f, g);
    result2 = subtract(h, i);
    printf("%d %d", result1, result2);
    return 0;
}

The above program is similar to

L = { (add)3 (subtract)2 (call add)3 (call subtract)2 }

Where (add)3 means add defined with 3 formal parameters a, b, c

(call add)3 means a call of add with 3 actual parameters e, f, g

A similar interpretation applies to the subtract function.

Since a stack machine cannot recognize L = { an bm cn dm │ n, m ≥ 1 }, L cannot be a context free
language. In the above example a feature that cannot belong to a context free language is being used,
and it cannot be checked automatically by the grammar.

Second Property: Consider languages of the form

L = { w c w │ w is formed from any combination of a and b }

for example { abbcabb, aaacaaa, baacbaa, … }. Such a language cannot be recognized by a stack
machine, because it is not a context free language, and therefore a CFG cannot be constructed for it.

Consider the following C language program:

main ( )
{
    int x, y, z;           /* like the 1st w = xyz of the previous example */
    int a, b;              /* like the c of the previous example */
    x = 0; y = 0; z = 0;   /* like the 2nd w = xyz of the previous example */
}

When x = 0 occurs, the compiler has to check whether x has been declared or not, and similarly when
y = 0 and z = 0 occur. The compiler does this check successfully, but it uses housekeeping rather than
doing it automatically through grammar constructs. It may be recalled here that arithmetic expressions
are checked for correctness entirely by the grammar, since arithmetic expressions have a context free grammar.

This leads us to the conclusion that features of the C language like

➢ matching of formal and actual parameters

➢ declaration of identifiers before their use

cannot be modeled by any context free grammar. That is to say, C is largely a context free language,
but it has features which cannot be modeled by a context free grammar.

Example 3:

L = { an bn cn │ n ≥ 0 } cannot be a context free language, and we cannot construct a stack
machine to recognize it. The corresponding example in the ‘C’ language for this grammar is too
involved and is beyond the scope of this text.

3.14 Parser Generator:

The earliest and still most common form of compiler-compiler is a parser generator, whose input is a
grammar (may be in BNF) of the language. A typical parser generator associates executable code
with each of the rules of the grammar, to be executed when these rules are applied by the
parser. These pieces of code are sometimes referred to as semantic action routines, since they define
the semantics of the syntactic structure that is analyzed by the parser. Depending upon the type of
parser that is to be generated, these routines may construct a parse tree or generate executable code
directly.

The computer program YACC is a parser generator developed by Stephen C. Johnson for the Unix
operating system. It generates a parser based on a grammar written in BNF notation; YACC emits the
code for the parser in the C programming language.

Assignments:

1. Consider the grammar with productions

S → aAB

A → bBb

B → A│Є

Show the derivation tree for a b b b b and also give rightmost and leftmost derivations.

2. Consider the grammar with productions

S → A S1 │ S1 B

S1 → a S1 b │ Є

A → aA│a

B → bB│b

And show the derivation for a a b b b

3. Show the following grammar is ambiguous

E → I

E → E+E

E → E*E

E → (E)

I → a│b│c

4. Show the following grammar is ambiguous

S→ A B │a a B

A → a│Aa

B → b

And discuss importance of not having ambiguous grammar.

5. Give BNF definition of floating point constant in ‘C’ language

6. Give EBNF for the problem in Q. No. 5

7. Consider the grammar

E → E + T / T

T → T * F / F

F → ( E ) / id

Construct parse tree for

i) (id * id) + id

ii) (id + id) * id

8. Consider the grammar

S → (L) / a

L → L, S/ S

Construct parse tree for

i) ( a, ( a, a) )

ii) ( (a, a), (a) )

9. Show that the grammar is ambiguous

S → AS│b

A → SA│a

10. Show that the grammar is ambiguous

S→ aSbS│bSaS│Є

Chapter 4

Top-down Parsers

4.0 Introduction:

Parsers can be classified as

1) Topdown Parser

2) Bottom up Parser

These are two broad categories of parsers. Top-down parsers are suitable for constructing a parser by
hand, whereas bottom-up parsers are the modern parsers constructed using tools. Top-down parsers
are again classified as

1) Recursive Descent Parser

2) Predictive Parser

Bottom-up parsers are classified as

1) LR (0) Parsers

2) SLR (1) Parsers

3) LR (1) Parsers

4) LALR(1) Parsers.

4.1 Prerequisites for topdown parsers

1. Elimination of Left-recursion

2. Left Factoring

3. First Set

4. Follow Set

Here we shall study these aspects without reference to top-down parsers; when we design top-down
parsers the importance of each of the above topics will become clear.

4.1.1 Elimination of Left recursion:

Recursion, as you know from your earlier studies, is describing something in terms of itself. Recursion is a
powerful method for thinking about solutions to problems. Recursive functions are those which call
themselves. Recursion must have a terminating condition.

Example: Function Fact (N) is used to compute the factorial of N

Fact (N) = 1, if N = 0
         = N * Fact ( N – 1 ), otherwise

Here Fact (N) is defined in terms of Fact itself.
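As a minimal C sketch (not from the text; plain int arithmetic is assumed), this definition translates
directly into a recursive function:

/* recursive factorial; the terminating condition is n == 0 */
int fact(int n)
{
    if (n == 0)
        return 1;               /* Fact(N) = 1 if N = 0 */
    return n * fact(n - 1);     /* N * Fact(N - 1) otherwise */
}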

In the context of context free grammars, consider the grammar

S → a S │ b

Here S is defined using S itself. Since S appears at the right end of the right-hand side, this is right
recursion. If instead we define

S → S a │ b

then S has left recursion, since S appears as the leftmost symbol of the right-hand side. Note that

S → b has no recursion of any type.

Example: Recognize the recursion and the terms containing recursion.

S → a b S │ a b │ S b │ S b a │ a S

Productions 1 and 5 are right recursive, productions 3 and 4 are left recursive, and production 2 has no
recursion.

Example:

S → a S b a /* recursion, but not left recursion */

S → b /* no recursion */

S → S A B /* left recursion */

A → A a /* left recursion, since A is defined in terms of itself at the left end */

B → a B b b /* recursion, but not left recursion */

Now that we know how to recognize left recursion we are ready to understand the procedure to
remove the same.

Consider S → S α │ β, where α, β are some combinations of terminals and non-terminals but do not
contain S. We recognize that S → S α has left recursion and β has no recursion. We can remove the
left recursion; the equivalent grammar is as follows.

S → β S1

S1 → α S1 │ Є

Where S1 is a new non-terminal. The above productions are equivalent to S → S α │ β. In
general the procedure is to

a. Separate the terms containing left recursion; the parts of these productions following the leading S
may be called α1, α2, …, αn. Concatenate each of these to the new non-terminal S1 to get the productions
S1 → α1 S1 │ α2 S1 │ … │ αn S1

b. Introduce the new production S1 → λ

c. All the productions without left recursion are called β1, β2, …, βm; each is appended with S1 in the
productions for S. For example, if

S → S α1 │ S α2 │ … │ S αn │ β1 │ β2 │ … │ βm

the equivalent grammar will be

S → β1 S1 │ β2 S1 │ … │ βm S1

S1 → α1 S1 │ α2 S1 │ … │ αn S1 │ Є

Example:

S → a S│b S│c S│S d│S e│S f│ g │ h

Then

S → S d│S e│S f contains left recursion

1 = d, 2 = e, 3 = f

The remaining terms do not contain left recursion; they are the β terms, namely

β1 = a S, β2 = b S, β3 = c S, β4 = g, β5 = h

Therefore the grammar without left recursion is as follows, where S1 is the new non-terminal
introduced.

S → a S S1 │ b S S1│c S S1│g S1│h S1

S1 → d S1│e S1│f S1│Є

Example:

S → a │b│c│S S a│S b│d S a│e S f│S g

Solution:

Terms with left recursion are

S → S S a │ S b │ S g, therefore

α1 = S a, α2 = b and α3 = g

The β terms are

1 = a, 2 = b, 3 = e

4 = d S a, 5 = e S f

The equivalent grammar without left recursion will be

S → a A│b A│c A│d S a A│e S f A

A → S a A│b A│g A│λ

Where A is the new non-terminal introduced where left-recursion is removed.

Recursion could exist in more than one non-terminal; we need to remove it from all of them.

Example:

S → S a│S b│g│f│g A

A → a│A e│c A B

B → B g│d

The terms containing left recursion are

S → S a│S b

A→Ae

B→Bg

All the other terms contain no left recursion.

S → g S1 │ f S1 │ g A S1

S1 → a S1 │ b S1 │ Є (S1 is a new non-terminal)

A → a A1 │ c A B A1

A1 → e A1 │ λ (A1 is a new non-terminal)

B → d B1

B1 → g B1 │ λ (B1 is a new non-terminal)

4.1.2 Left factoring:

If two or more productions of a non-terminal have the same prefix, then a left factor exists and left
factoring needs to be done. First let us recognize whether a left factor exists or not:

S → a b S│a b A S│a b A S d│d

Note that the first three productions contain ab as a prefix; therefore a left factor exists. To remove the
left factor we introduce a new non-terminal after the common prefix, namely ‘ab’ in this case.

S→abD

D → S│A S│A S d

The other production without a left factor, S → d, is unaffected.

Example : Consider the grammar

S → b S│b S B│b S D│e

Here the left factor is ‘bS’; introduce a new non-terminal, say E.

S → b S E│e

Now E should produce all the terms remaining in the original productions.

E → Є │B│D

In the 1st production only bS exists, therefore Є is added. In the 2nd production, after removing bS, only ‘B’
remains, and similarly D in the 3rd production.

4.1.3 Calculation of First Set

First refers to the set of terminal symbols that can appear first when a non-terminal is expanded.
For example, S → a S b │ c.

Then First ( S ) = { a, c }: when S is expanded, S can be replaced by the two strings a S b and c, and the
first symbols of these strings are a and c.

Consider another example:

S → b S d │ c S b│λ

First ( S ) = { b, c, λ }

When a production of a non-terminal, say A, begins with a non-terminal B, then First(A) includes
First(B) together with the symbols derived from the other productions of A. For example

A → B c d │ C e f │ g │ λ

B → g B │ i

C → j │ k │ e

Since productions of A begin with the non-terminals B and C, we need to calculate First(B) = { g, i } and
First(C) = { j, k, e }; these should be included in First(A).

First ( A ) = First ( B ) U First (C) U { g, λ }

= { g, i } U { j, k, e } U { g, λ }

= { g, i, j, k, e, λ }

Suppose in the above example B → g B │ i │ λ, i.e., a B → λ production is added. Then First (B) = { g, i, λ }.
As B → λ, First(A) should now include c also, because when B derives the null string, c (the symbol
following B in A → B c d) becomes the first symbol. Therefore

First (A) = { g, i, j, k, e, c, λ }

Example: Consider the following grammar and find First(S).

S → a S b │ B d e │ C f g │ h

B → i │ j │ k │ λ First (B) = { i, j, k, λ }

C → m │ n │ p First (C) = { m, n, p }

Solution:

First (S) = First (B) U First (C) U { a, h }

= { i, j, k } U { m, n, p } U { a, h }

Since First (B) includes Є, we must additionally include the symbol following B, i.e., d (when B derives
the empty string, d becomes the first symbol). Є itself is not included in First(S), since d always follows B.

First (S) = { i, j, k } U { m, n, p } U { a, h } U { d }

= { i, j, k, m, n, p, a, h, d }

Example: Consider the following grammar, wherein you need to apply the above rule, i.e., include all
symbols following a non-terminal if it has a λ production.

S → a S b │ B C d │ E F g │ h

B → i │ j │ λ First (B) = { i, j, λ }

C → k │ e │ λ First (C) = { k, e, λ }

E → m │ λ First (E) = { m, λ }

F → p │ q First (F) = { p, q }

Solution:

First (S) = First (B) U First (E) U { a, h }

Since First(B) includes λ, we must also include First(C); since First(C) includes λ as well, the symbol d
following C must be included. Similarly, since First(E) includes λ, First(F) must be included.

First (S) = First (B) U First (C) U { d } U First (E) U First (F) U { a, h }

= { i, j } U { k, e } U { d } U { m } U { p, q } U { a, h }

= { i, j, k, e, d, m, p, q, a, h }

(λ itself is not included in First(S): in every production of S some symbol always remains, so S never
derives the empty string.)

4.1.4 Calculation of Follow Set

Follow refers to the set of terminal symbols that can appear immediately after a non-terminal. Consider the
following grammar.

S → a S b │ a S d │e S f│g

Here the non-terminal S is followed by b, d and f, therefore

Follow (S) = { b, d, f, $ }

$ is included because the non-terminal is the start symbol.

Consider another example

S → a S A b│a S B d│c S D f│g

A → h│i

B → j│k│λ

D → l│m

Here S is followed by non-terminals A, B, D then we need to calculate First of A, B and D

First (A) = { h, i }

First (B) = { j, k, λ }

First (D) = { l, m }

We need to include First(A), First(B) and First(D). If any of these contains λ, then the First of the next
symbol (or the terminal itself) following that non-terminal should also be included. In this case only
First(B) contains λ, therefore the symbol d following B should be included.

Follow (S) = First (A) U First (B) U First (D) U { $ }

= { h, i } U { j, k } U { l, m } U { d } U { $ }

= { h, i, j, k, l, m, d, $ }

Note that λ itself is never included in a Follow set.

Follow (A) = { b }

Follow (B) = { d }

Follow (D) = { f }

Example:

Compute First and Follow symbols for the grammar given below.

S → a S B d │c A e │ f

B→g│h

A→i│j│k

Solution:

First (S) = { a, c, f }

First (B) = { g, h }

First (A) = { i, j, k }

Follow (B) = { d }

Follow (A) = { e }

Follow (S) = First (B) U { $ }

= { g, h, $ }

Example:

Compute First and Follow symbols for the grammar given below.

S → a B b│c D E d │F e │ f

B → g│h│λ

D → i│λ

E → j│k

F → i│λ

Solution:

First (S) = { a, c } U First (F) U { f }

= { a, c } U { i, λ } U { f }

Since First(F) contains λ, we should include ‘e’ (the symbol following F).

First (S) = { a, c, i, e, f }

Note: λ is not included in First(S), since e always follows F.

First (B) = { g, h, λ }

First (D) = { i, λ }

First (E) = { j, k }

First (F) = { i, λ }

Follow (F) = { e }

Follow (E) = { d }

Follow (D) = First (E)

= { j, k }

Follow (B) = { b }

Follow (S) = { $ }

4.2 Recursive descent parser design

Consider the following grammar, which defines L = { an c bn │ n ≥ 1 }

S → a S b│a c b

To derive recursive descent parser, we need to check

1. Existence of left recursion

2. Existence of Left Factoring

We observe that there is no left recursion, but a left factor exists, i.e., ‘a’. Therefore the grammar is
modified as follows.

S→aA

A → S b│c b

Now we are ready to write recursive routines for S and A:

Procedure S ( )
{ if input = ‘a’ then
  { get-nxt-token ( );
    A ( );
  }
  else error ( );
}

Procedure A ( )
{ if input = ‘c’ then
  { get-nxt-token ( );
    if input = ‘b’ then
    { get-nxt-token ( ); /* consume b */
      return;
    }
    else
      error ( );
  }
  else
  { S ( );
    if input = ‘b’ then
    { get-nxt-token ( ); /* consume b */
      return;
    }
    else
      error ( );
  }
}

/*
The grammar which generates strings of form {an c bn │ n>=1} is:

S→aSb│acb

There is no left recursion in the above grammar, but there is left factoring,


which is eliminated as follows:

S → a A '#'
A → S b │c b

The '#' at the end of the input is for easy recognition of end of input
and detection of extra b's.
*/

#include <stdio.h>
#include <stdlib.h> /* for exit() */

char next_token; /* the variable holds the next available token of input*/
char input[60]; /*the input expression taken as string from user*/
int i; /*count variable*/
int level; /*variable holds level of recursive depth for printing*/

void S();
void A(); /*the function declarations*/
void space(int);

void getnext_token() /*function returns the next available token in the input expression*/

{
next_token=input[i];

i++;
}

void enter_nonterminal(char name)
{
space(level++);
printf("%c: Entering, \t",name);
printf("Next_token == %c\n", next_token);
}

void leave_nonterminal(char name)


{
space(--level);
printf("%c: Leaving, \t", name);
printf("Next_token == %c\n", next_token);
}

void space(int local_level)


{
while (local_level-- > 0)
printf("│");
}

void S()/*root of the parse tree, recognizes the occurrence of one or more a*/
{
enter_nonterminal('S');
if(next_token=='a')
{
getnext_token();
A();
}
else
{
printf("\nError1,string must start with a");
exit(1);
}
leave_nonterminal('S');
}

void A() /*function resulting from elimination of left factoring, recognizes matching b for a*/
{
enter_nonterminal('A');
if(next_token=='c')
{
getnext_token();
if(next_token=='b'); /* matching b found; it is consumed by the caller's next getnext_token() */
}
else if(next_token=='a')
{
S();
getnext_token();
if(next_token=='b');
else
{

printf("\nError2,no matching b for a");
exit(2);
}
}
else
{
printf("\nError 3,invalid position of characters");
exit(3);
}
leave_nonterminal('A');
}

int main()
{
int ans;
do
{
printf("\nEnter a string of form(a^n c b^n) ending with '#':");
scanf("%s",input);
i=0;
level=0;
getnext_token();
S();
getnext_token();
if(next_token!='#')
printf("\nError4,too many b's or not ending #");
else
printf("\nSuccessful Parse");
printf("\ndo you want to continue?(0 or 1)");
scanf("%d",&ans);
}while(ans);
return 0;
}

/*
Sample Run:

Enter a string of form(a^n c b^n)ending with '#':aacbb#

S: Entering, Next_token == a
| A: Entering, Next_token == a
| | S: Entering, Next_token == a
| | | A: Entering, Next_token == c
| | | A: Leaving, Next_token == b
| | S: Leaving, Next_token == b
| A:Leaving, Next_token == b
S: Leaving, Next_token == b

Final token=#

Successful Parse

Do you want to continue(0 or 1):1

Enter a string of form(a^n c b^n)ending with '#':aacb#

S: Entering, Next_token == a
| A: Entering, Next_token == a
| | S: Entering, Next_token == a
| | | A: Entering, Next_token == c
| | | A: Leaving, Next_token == b
| | S: Leaving, Next_token == b

Final token=#

Error2,no matching b for a

Do you want to continue(0 or 1):1

Enter a string of form(a^n c b^n)ending with '#':abc#

S: Entering, Next_token == a
| A: Entering, Next_token == a

Final token=#

Error3, invalid position of characters
*/

/*
The grammar under consideration for arithmetic expression

E → E + T │T
T → T*F│F
F → ( E )│a│b│c

The grammar after elimination of left recursion:

E → T E'
E' → +TE'│λ
T → FT'
T' → * F T'│λ
F → (E)│a│b│c

where epsilon stands for "empty string";


a,b,c are the terminal symbols

In the program I=E',U=T';so the grammar would now look like

E → TI
I → + T I│epsilon
T → FU
U → * F U│epsilon
F → ( E )│a│b│c

*/
#include <stdio.h>
#include <stdlib.h> /* for exit() */

char next_token; /* the variable holds the next available token of input*/
char input[60]; /*the input taken as string from user*/
int i; /*count variable*/
int level=0; /*variable holds level of recursive depth for printing output*/

void E();
void T();
void U(); /*the function declarations*/
void F();
void I();
void enter_nonterminal(char);
void leave_nonterminal(char);
void space(int);

void getnext_token() /*function returns the next available token in the input expr*/
{
next_token=input[i];
i++;
}
void E() /*root for the parse tree,parsing begins here*/
{
enter_nonterminal('E');
T();
I();
leave_nonterminal('E');
}
void I() /*recognise the token '+' followed by an expression of type T
i.e +(a+b),+b etc..*/
{
enter_nonterminal('I');
if(next_token=='+')
{
getnext_token();
T();
I();
}
leave_nonterminal('I');
}
void T() /* recognises expression of the form a*b,(a)*(b) i.e paranthesised
expression with a '*' token*/
{
enter_nonterminal('T');
F();
U();
leave_nonterminal('T');

}
void U() /*recognises '*' token followed an expression of type F
i.e *(a+b),*c etc..*/
{
enter_nonterminal('U');
if(next_token=='*')
{
getnext_token();
F();
U();
}
leave_nonterminal('U');

}
void F() /*recognises parenthesised expressions and the operand
symbols a, b, c*/
{
enter_nonterminal('F');
if(next_token=='(')
{
getnext_token();
E();
if(next_token==')')
getnext_token();
else
{
printf("\nerror1,missing paranthesis");
exit(1);
}

}
else if(next_token=='a'||next_token=='b'||next_token=='c')
getnext_token();
else
{
printf("\nerror 2,undefined symbol or incomplete expression");
exit(2);;
}
leave_nonterminal('F');
}
void enter_nonterminal(char name)
{
space(level++);
printf("%c: Entering, \t", name);
printf("Next_token == %c\n", next_token);
}

void leave_nonterminal(char name)


{
space(--level);
printf("%c: Leaving, \t", name);
printf("Next_token == %c\n", next_token);
}

void space(int local_level)
{
while (local_level-- > 0)
printf("| ");
}

int main()
{
int ans=1;
do
{
printf("\nenter the expression with +,*:");
scanf("%s",input);
i=0;
level=0;
getnext_token();
E();
printf("\nsuccessful parse");
printf("\ndo you want to continue:");
scanf("%d",&ans);
}while(ans);
return 0;
}

/*
Sample run:

Enter an arithmetic expression with +,* :a+b


E: Entering, Next_token == a
| T: Entering, Next_token == a
| | F: Entering, Next_token == a
| | F: Leaving, Next_token == +
| | U: Entering, Next_token == +
| | U: Leaving, Next_token == +
| T: Leaving, Next_token == +
| I: Entering, Next_token == +
| | T: Entering, Next_token == b
| | | F: Entering, Next_token == b
| | | F: Leaving, Next_token ==
| | | U: Entering, Next_token ==
| | | U: Leaving, Next_token ==
| | T: Leaving, Next_token ==
| I: Leaving, Next_token ==
E: Leaving, Next_token ==

Successful Parse

do you want to continue(0 or 1):1

Enter a arithmetic expression with +,* :(a*b

E: Entering, Next_token == (

| T: Entering, Next_token == (
| | F: Entering, Next_token == (
| | |E: Entering, Next_token == a
| | | | T: Entering, Next_token == a
| | | | | F: Entering, Next_token == a
| | | | | F: Leaving, Next_token == *
| | | | | U: Entering, Next_token == *
| | | | | | F: Entering, Next_token == b
| | | | | |F: Leaving, Next_token ==
| | | | | U: Leaving, Next_token ==
| | | | T: Leaving, Next_token ==
| | | | I: Entering, Next_token ==
| | | | I: Leaving, Next_token ==
| | | E: Leaving, Next_token ==

error1,missing paranthesis

*/

/*
The grammar that generates the expressions of form { a b^n c d^m │ n,m >= 1 }

S → aAcD
A → A b│b
D → d D│d

Eliminating left recursion in the second production of the grammar:

S → aAcD
A → b A'
A' → b A'│epsilon
D → d D│d

Eliminating the left factoring in the third production:

S → aAcD
A → b A'
A' → b A' │epsilon
D → dE
E → D │epsilon

In the program, I = A' :

S → aAcD
A → bI
I → b I │epsilon
D → dE
E → D │epsilon

*/
#include <stdio.h>
#include <stdlib.h> /* for exit() */

char next_token; /* the variable holds the next available token of input*/
char input[60]; /*the input expression taken as string from user*/
int i; /*count variable*/
int level=0; /*variable holds level of recursive depth for printing output*/

void S();
void A();
void I();
void D();
void E();
void enter_nonterminal(char);
void leave_nonterminal(char);
void space(int);

void getnext_token() /*function returns the next available token in the input expression*/
{
next_token=input[i];
i++;
}

void S() /*root of the parse tree, recognises the entire expression*/
{
enter_nonterminal('S');
if(next_token=='a')
{
getnext_token();
A();
if(next_token=='c')
{
getnext_token();
D();
}
else
{
printf("error3,missing c after one or more b");
exit(3);
}
}
else
{
printf("error1,expr must start with a");
exit(1);
}
leave_nonterminal('S');
}

void A() /*recognises the first occurrence of b*/


{
enter_nonterminal('A');
if(next_token=='b')
{

getnext_token();
I();
}
else
{
printf("error 2,missing b after single a");
exit(2);
}
leave_nonterminal('A');
}

void I() /*recognises multiple b's, i.e., b^n*/


{
enter_nonterminal('I');
if(next_token=='b')
{
getnext_token();
I();
}
else
printf("\nNO MORE B\n");
leave_nonterminal('I');
}

void D() /*recognises the first d and calls E for recognizing the
remaining d's*/
{
enter_nonterminal('D');
if(next_token=='d')
{
getnext_token();
E();
}
else
{
printf("error 4,missing d after single c");
exit(3);
}
leave_nonterminal('D');
}

void E()
{
enter_nonterminal('E');
if(next_token=='d')
D();
else
printf("\n NO MORE D\n");
leave_nonterminal('E');
}

void enter_nonterminal(char name)
{
space(level++);
printf("%c: Entering, \t", name);
printf("next_token == %c\n", next_token);
}

void leave_nonterminal(char name)


{
space(--level);
printf("%c: Leaving, \t", name);
printf("next_token == %c\n", next_token);
}

void space(int local_level)


{
while (local_level-- > 0)
printf("| ");
}
int main()
{
int ans=1;
do
{
printf("\nenter the expression of form(a b^n c d^m):");
scanf("%s",input);
i=0;
level=0;
getnext_token();
S();
printf("\nsuccessful parse");
printf("\ndo you want to continue:");
scanf("%d",&ans);
}while(ans);
return 0;
}

/*
Sample Run:
enter the expression of form(a b^n c d^m):abcd

S: Entering, Next_token == a
| A: Entering, Next_token == b
| | I: Entering, Next_token == c
NO MORE B
| | I: Leaving, Next_token == c
| A: Leaving, Next_token == d
| D: Entering, Next_token ==
| | E: Entering, Next_token ==
NO MORE D
| | E: Leaving, Next_token ==
| D: Leaving, Next_token ==
S: Leaving, Next_token ==

Successful Parse

enter the expression of form(a b^n c d^m):abcc


S: Entering, Next_token == a

| A: Entering, Next_token == b
| | I: Entering, Next_token == c
NO MORE B
| | I: Leaving, Next_token == c
| A: Leaving, Next_token == c
| D: Entering, Next_token == c

error 4,missing d after single c
*/

Example:

Consider a grammar that generates L = { a bn c dm │ n, m ≥ 1 }

S→aAcB

A → A b │b

B → b B│b

First we need to check whether left recursion exists. We know A → Ab has left recursion; we remove it:

S→aAcB

A → b A1

A1 → b A1│λ

B→bB│b

Now we need to check for left factoring. We know that B → b B│b has the left factor b, so we modify it as
follows.

B→bC

C → B│λ

The pseudo code for the above will be:

Procedure S ( )
{ if input = ‘a’ then
  { get-nxt-token ( );
    A ( );
    if input = ‘c’ then
    { get-nxt-token ( );
      B ( );
    }
    else
      error ( );
  }
  else
    error ( );
}

Procedure A ( )
{ if input = ‘b’ then
  { get-nxt-token ( );
    A1 ( );
  }
  else error ( );
}

Procedure A1 ( )
{ if input = ‘b’ then
  { get-nxt-token ( );
    A1 ( );
  }
  /* A1 → λ : otherwise simply return */
}

Procedure B ( )
{ if input = ‘b’ then
  { get-nxt-token ( );
    C ( );
  }
  else
    error ( );
}

Procedure C ( )
{ if input = ‘b’ then
    B ( );
  /* C → λ : otherwise simply return */
}

Example:

Consider a grammar that recognizes the arithmetic expressions using symbols a, b or c as operands

E→E+T│T

T→T*F│F

F → ( E)│a│b│c

We need to remove the left recursion existing in the productions E → E + T and T → T * F. After
removal of left recursion:

E → T E1

E1 → + T E1│Є

T → F T1

T1→ * F T1│Є

F → ( E)│a│b│c

As there are no left factors, we proceed to write recursive routines for each non-terminal as follows.

Procedure E ( )
{ T ( );
  E1 ( );
}

Procedure E1 ( )
{ if input = ‘+’ then
  { get-nxt-token ( );
    T ( );
    E1 ( );
  }
}

Procedure T ( )
{ F ( );
  T1 ( );
}

Procedure T1 ( )
{ if input = ‘*’ then
  { get-nxt-token ( );
    F ( );
    T1 ( );
  }
}

Procedure F ( )
{ if (input = ‘a’ or input = ‘b’ or input = ‘c’) then
  { get-nxt-token ( );
    return;
  }
  else if input = ‘(’ then
  { get-nxt-token ( );
    E ( );
    if input = ‘)’ then
    { get-nxt-token ( );
      return;
    }
    else
      error ( );
  }
  else
    error ( );
}

4.3 LL (1) Parser:

LL(1) stands for Left-to-right scan producing a Leftmost derivation, with 1 look-ahead symbol. This is
also a top-down parser; it implements the recursive descent parser efficiently without using recursion. A
block diagram of the parser is given below.

Input to be parsed is put in the input buffer, terminated by $ as shown. The stack initially contains $ at
the bottom. The parser uses the parsing table T as its database. T can be visualized as a two-dimensional
matrix indexed by the stack symbol S (a non-terminal) and the input symbol a (the terminal pointed to
by the input buffer pointer). Four actions are possible depending on the stack symbol S and the input
symbol a:

1. If S = a = $ the parser halts, and the input sentence presently being parsed is declared successful.

2. If S = a ≠ $, i.e., the stack symbol and the input symbol are the same terminal, then the parser pops
the stack and the input buffer pointer is advanced to the next symbol.

3. In cases 1 and 2 the stack symbol was a terminal. If the stack symbol is a non-terminal,
there are two actions depending on the entry in the parsing table:

i. If T[S, a] is not empty, then the contents of the table entry are pushed on to the stack in
the reverse order.

For example, if T[S, a] = ABCD then D is pushed 1st, C is pushed 2nd, and so on;
therefore A will be on the top of the stack.

ii. If T[S, a] is empty then the parser has encountered an error, and it calls for error
recovery.

Before designing a parsing table, let us assume the table is given to us; we will concentrate on how to
parse and find out whether parsing of a given sentence is successful or not.

Assume a Grammar for L = { an c bn │n ≥ 0}

S → a S b │c

One parsing table T will be as follows.

Note:

T [ S, a ] = a S b

T[ S, b ] = Error (because entry is blank)

T [ S, c ] = c

T [ S, $ ] = Error (because entry is blank)

Using the above Table 4.5 let us parse a sentence “a a c b b”

First we need to push $ on the stack followed by the start symbol, and the input buffer should contain

aacbb$

Now let us find out how an input a b c is parsed. We know that this string is not in the language. A
minimal C sketch of the table-driven loop is given below.
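The following is a minimal C sketch of the table-driven LL(1) loop (not from the text; the hard-coded
table entries and all names are illustrative assumptions) for the grammar S → a S b │ c:

#include <stdio.h>

/* table-driven LL(1) parser: T[S,a] = aSb, T[S,c] = c, all else error */
int parse(const char *input)      /* input must end with '$' */
{
    char stack[128];
    int top = 0, ip = 0;
    stack[top++] = '$';           /* bottom of stack */
    stack[top++] = 'S';           /* push the start symbol */
    while (top > 0) {
        char X = stack[top - 1], a = input[ip];
        if (X == '$' && a == '$')
            return 1;             /* action 1: accept */
        if (X == a) {             /* action 2: matching terminals */
            top--;
            ip++;
        } else if (X == 'S' && a == 'a') {
            top--;                /* T[S,a] = aSb: push RHS in reverse */
            stack[top++] = 'b';
            stack[top++] = 'S';
            stack[top++] = 'a';
        } else if (X == 'S' && a == 'c') {
            top--;                /* T[S,c] = c */
            stack[top++] = 'c';
        } else {
            return 0;             /* blank table entry: error */
        }
    }
    return 0;
}

int main(void)
{
    printf("%d\n", parse("aacbb$")); /* 1: successful parse */
    printf("%d\n", parse("abc$"));   /* 0: error */
    return 0;
}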

Consider another example whose grammar is given below

S → a │^│(T)

T → S, T │ S

After left factoring

S → a│^│(T)

T → S K

K → , T│λ

Table 4.6 for the above grammar will be as follows

Follow (K) = { ) }

Let us parse ( a, ^)

Consider parsing of a sentence a ) ^

4.4 Error Recovery in Top-down Parsers

The principle on which a compiler is designed is to find all possible errors in one go, unlike an
interpreter, which stops execution at the first syntax or semantic error. For example, consider the
following C program.

main ( )

{ int x, y ;

x=;

x = x + y;

y=x+;

In the above program there are two errors: one in line number 3, i.e., x = ;, and one in the 5th line,
i.e., y = x + ;. The compiler is supposed to find both errors, and therefore it should not skip the syntax
checking of line number 4, which has no syntactic error. This is possible if the compiler, on
encountering the error in line number 3, corrects it syntactically by adding one fictitious operand after
the = sign; the statement then becomes syntactically correct. This is known as recovery from an error.

Various methods of error recovery exist, depending on the complexity we can afford. One
simple method, often implemented in simple compilers, is known as panic mode recovery.

4.4.1 Panic mode recovery

This is a simple method which can be implemented in all types of parsers, including top-down parsers.
In panic mode the parser ignores input symbols until a synchronizing symbol, such as a delimiter like
the semicolon in the ‘C’ language, is found.

The parser then deletes entries from the stack until it reaches a point from which parsing can correctly
continue at the synchronizing symbol, say ‘;’.

The advantage of this simple recovery method is that it never goes into an infinite loop. A small sketch
of the input-skipping part is given below.
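The following is a minimal C sketch of the input-skipping part of panic mode (not from the text; it
reuses the next_token / getnext_token convention of the earlier recursive descent programs, and the
function name panic_recover is an illustrative assumption):

extern char next_token;          /* as in the earlier programs */
extern void getnext_token(void);

/* skip tokens until the synchronizing symbol ';' (or end of input) */
void panic_recover(void)
{
    while (next_token != ';' && next_token != '\0')
        getnext_token();
    if (next_token == ';')
        getnext_token();         /* consume the synchronizing symbol */
}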

Assignments:

1. Construct a recursive descent parser for the following grammar

S → aAcBe

A → Ab│b

B → d

2. Explain the elimination of left recursion and left factoring with an example

3. Remove left recursion from the following grammar

E → E+T│T

T → T*F│F

F → (E)│a│b

4. Remove left recursion from the following grammar and left factor it later

A → A c │ A a d │ b d A │b d e

5. Explain working of a predictive parser

6. Calculate FIRST SET for the following grammar

E → T E1

E1 → + T E1 │ Є

T → F T1

T1 → * F T1 │ Є

F → (E) │ a │ b

7. Calculate the Follow Set for the above grammar

8. Design a predictive parser for the grammar given in Q. No. 6

9. Calculate the predictive parser for the grammar

S → a │ ^ │ (T)

T → T1 S │ S

10. Show the parsing for (a + b) * b for the grammar Q. No. 6

Chapter 5

Bottom up parsing

5.0 Introduction

Given a grammar and a sentence belonging to that grammar, if we have to show that the given
sentence belongs to the given grammar, there are two methods.

1. Leftmost – Derivation

2. Rightmost – Derivation

In leftmost derivation we start from the start symbol of the grammar and, by choosing productions
judiciously, we try to derive the given sentence. The parsing methods based on this are known as top-
down methods, which we discussed in the previous chapter.

In rightmost derivation (used in reverse) we start from the given sentence and, using the productions in
reverse, try to reach the start symbol. The parsing methods based on this concept are known as bottom-up
methods and are generally used in compilers generated using tools like LEX and YACC. The bottom-up
parsing methods are more general and efficient; they can find syntactic errors as soon as they occur.

We will be studying the following methods of parsing in this chapter

1. LR(0) Parsing

2. SLR (1) Parsing

3. LALR (1) Parsing Methods

Here LR stands for Left-to-right scan and Rightmost derivation.

SLR stands for Simple LR and LALR for Look-Ahead LR; the 1 in brackets indicates the number of
look-ahead symbols.

LR (0) stands for no look ahead and LR (1) for one look ahead symbol. In general there can be LR
(K) parsers which look ahead ‘K’ symbols.

5.1 Bottom up parsing

Bottom-up parsers in general use an explicit stack. They are also known as shift-reduce parsers.

Example: Consider the grammar S → a S b │ c and the sentence a a c b b. The parser puts the sentence
to be recognized in the input buffer, appended with the end-of-input symbol $, and the bottom of the
stack also has $.

The parser consults a table indexed by two parameters: what is on the top of the stack, and the input
character pointed to by the input buffer pointer. Assume for the time being that the table tells the
parser to do one of the following activities.

The table formation will be discussed later. From the figure above, after 3 shifts the table tells the
parser to reduce in step 4, that is, pop c and replace it by (i.e., push) S. In the 5th step again a shift is
executed. In the 6th step the parser is told to pop 3 symbols and replace them by S; in the 7th step the
parser is again told to shift. In the 8th step the parser pops 3 symbols and replaces them by S, i.e., a
reduce action takes place. When the top of the stack is the start symbol (S in this case) and the input
buffer is empty, i.e., the input buffer pointer points to $ (no more input is available), the parser is told
to carry out the accept action. The parser has thus recognized that aacbb is indeed a valid sentence of
the given grammar. We shall see later how the shift, reduce and accept actions are indicated.

Example:

Consider another example of parsing using the grammar

S→aAcBe

A→Ab│b

B→d

Recognize abbcde

Rightmost Derivation

S  aAcBe

 aAcde replace B → d

 aAbcde replace A → A b

 abbcde replace A → b

Reduction done in the shift reduce parses are exactly is the reverse order of rightmost derivation
replace statements.

5.1.1 LR (0) Items: An LR (0) item is any production with a dot somewhere on the right hand side of the
production.

Example 1:

S → a S b has the following LR (0) Items

S → •aSb

S → a•Sb

S → aS•b

S → aSb•

As we can see, there are 3 symbols on the right hand side of the production, therefore there are
four positions where a dot can be put; in general, if there are n symbols on the RHS of a production
there will be n + 1 LR (0) items from that production.

Example 2: Find the LR (0) Items

E → E+T

Solution: LR (0) Items are

E → •E+T

E → E•+T

E → E+•T

E → E+T•

5.1.2 Augmentation of grammar

A grammar must be augmented before an LR parser can be constructed. Augmentation is nothing but
adding a new production, not already present in the grammar, that derives the start symbol.

Example:

Given grammar S → a S b│c

Augmented grammar

S1 → S

S → a S b│c

Where S1 is a new start symbol which derives S, the original start symbol of the grammar. This is
required to uniquely identify the accept state.

There is a relation between LR (0) items and finite automata: it is possible to construct a DFA that
recognizes all viable prefixes of a given grammar. Viable prefixes are prefixes of right-sentential forms
that do not contain any symbols to the right of the handle. A handle is an occurrence of the RHS of a
production.

For Example:

S → a S b│c

Handles are aSb and c

Viable prefixes are – aac, because c is a handle

– aaSb, because aSb is a handle

– aaaSb, because aSb is a handle

To start the construction of DFA we need to augment the grammar as follows

S1 → S

S → a S b│c

Then list all the items:

S1 → • S

S→•aSb

S→a•Sb

S→aS•b

S→aSb•

S→ •c

S→c•

There are seven LR (0) items in the above grammar. These LR (0) items form the states of the DFA.
We always put S1 → • S in the start state, called state 0.

Here we need to take the closure of the LR (0) items: whenever a dot appears before a non-terminal,
we add to the set the items with the dot at the beginning of each production of that non-terminal, i.e.,
S → • a S b and S → • c. This is a recursive definition. Now dots appear only before the terminals a
and c, so no further closure is needed. Therefore state 0 of the DFA has 3 items.

Now we have to advance each of the LR (0) items in state 0. This is possible on the symbols S, a and
c (note that S cannot appear as input, only terminals can, but the DFA still has a transition on S). The
resulting states are
The resulting states are 1, 2 and 3. When the dot moves to the end of the RHS of a production and
there are no other LR (0) items, such a state is called a reduce state. Here states 1, 3 and 5 are reduce
states, because the dot has moved to the rightmost position and there are no other LR (0) items in those
states. States 0 and 2 have three LR (0) items each; we say that if ‘a’ occurs in state 0 then the action is
shift and the resulting state is 2. Similarly if ‘c’ occurs in state 0 then the action is shift and go to state 3.

Consider state 2. Here there are three LR (0) items because of the closure of production 1. To progress
from state 2 there are 3 possibilities, i.e., S, a and c. If ‘c’ occurs it goes to state 3, and if ‘a’ occurs it
goes to itself, as shown. When S occurs in state 2 it goes to state 4. By taking the closure in state 4, no
new items are added. In state 4, when b comes, it goes to state 5. Since the dot has reached the last
position in state 5, no further states are generated.

We conclude that with 6 states (i.e., state 0 to 5) we can recognize all viable prefixes i.e.,

Viable Prefixes are

1) S

2) a a* c

3) a a* S

4) a a* S b

5) c

Since we need to recognize five viable prefixes we need to have 6 states.

5.1.3 Classification of states of DFA

States of the DFA are classified as following:

Accept State: the state which recognizes S1 → S •. Note this is also a special reduce state.

Reduce State: where the LR (0) items have their dot placed in the extreme right position, for
example states 3 and 5.

Shift State: where all LR (0) items have their dot not in the extreme right position.

Shift/Reduce State: a mix of shift and reduce items as defined above.

5.1.4 Constructing parsing table SLR (1)

The parsing table has rows indexed by the states of the DFA, whereas the columns are indexed by
terminals (including $) and non-terminals (except S1). The portion of the table indexed by terminals is
known as the ACTION part and that indexed by non-terminals is known as the GOTO part, as shown below.

The filling of the table is simple. In state 0, on ‘a’ it goes to state 2; this is a shift action, hence the
entry S2. Similarly on ‘c’ it goes to state 3, again a shift action, hence S3. On ‘S’ it goes to state 1;
since ‘S’ is a non-terminal the entry is filled only with the state, without any action.

Similarly all the other states are filled. A reduce entry rn (reduce by production n) is placed in the row
of a reduce state under each terminal in the FOLLOW set of that production’s LHS. $ marks the end of
the input, and the state containing S1 → S • on $ is treated as the accept state, as this state is unique.

Example: Consider the following grammar

S1 → S 1) S → (S) S

S → (S) S 2) S → Є

S → Є

FOLLOW(S)={ ), $ }

Parsing using the above table. Input ( )

Example: Consider the following grammar.

A → (A) │ a , Augment the grammar

A1 → A

1) A → (A)

2) A→a

Follow (A) = { ), $ }

Example: Consider the grammar S → (L) │ a , L → L, S │ S, Augment the grammar

S1 → • S 1) S → (L)

S → • (L) │a 2) S → a

L → • L, S│S 3) L → L, S

4) L → S

Follow (L) = { ‘,’ , ‘)’ }

Follow (S) = { ‘,’ , ‘)’ , $ }

We have seen the construction of the SLR (1) parser. This parser is able to parse most programming
constructs.

5.1.5 Construction of LR(0) Parser:

The LR (0) parser is not a popular parser; it can be constructed following the procedure used for the
SLR (1) parser. The difference between LR (0) and SLR (1) is that the LR (0) parser’s actions do not
depend on the look-ahead.

Consider the LR (0) parser constructed for the simple grammar

S → a S b│c

The procedure followed earlier is used to construct the DFA shown in Fig 5.3. There are 6 states. The
action of each state depends on whether it is a reduce state or a shift state. In Fig. 5.3 we see that there are

1) Reduce states: 1, 3 and 5

2) Shift states: 0, 2 and 4

Here the shift or reduce actions are done without looking at the input symbol. The table will have the
following entries.

The construction of the table is the same as for the SLR (1) parser, except that there is no need to
calculate follow symbols for the reduce states.

Now let us use the above parsing table to check the input a c b.

To start with, put the starting state, i.e., state 0, in the parsing stack, and the input ‘acb’ appended by $
in the input buffer. The stack has state 0, which is a shift state, therefore shift a on to the stack. State 0
on a goes to state 2, as seen in the table, therefore push state 2 on to the stack. State 2 is again a shift
state, therefore shift the character c on to the stack. State 2 on the character ‘c’ goes to state 3, therefore
push state 3 on to the stack. This procedure continues, i.e., the state on the top of the stack decides the
action to be done, without looking at the input symbol. State 1 is a reduce state but it is special, because
it occurs only when we are ready to accept the input; the trace below reconstructs these moves. As we
see, there is not much change from SLR (1) parsing.

5.2 Limitations of SLR (1)

SLR (1), as we have seen, is simple yet powerful enough to parse almost all language constructs. But it
fails in two cases.

1. Shift/Reduce conflict: when a state has both shift and reduce items present and the table
indicates that both operations are to be done on the same symbol. At this point the parser
is not able to resolve the conflict, and it fails.

2. Reduce/Reduce conflict: when a particular state has two or more reduce items, i.e., items
in which the dot has reached the rightmost position, and both are to be reduced on the
same symbol.

Let us consider the parser to understand shift/reduce conflict.

S1 → S

S → L=R

S → R

L → * R

L → id

R → L

Consider the SLR(1) Parser

Let us analyze state 2. There are two items:

1. S → L • = R

2. R → L •

Item 1 is a shift item and Item 2 is a reduce item. If we calculate the follow symbols of R, we find that
= is in Follow(R).

There is thus a conflict on the symbol =: we are unable to decide whether to reduce using the production
R → L or to shift on S → L • = R to the next state. This is known as a shift/reduce conflict. There is no
solution, the parser cannot proceed, and SLR (1) fails.

5.3 Construction of LR (1) Parser:

LR (1) parsing, or canonical parsing, is resorted to when SLR (1) fails. Here we need to calculate
LR (1) items instead of LR (0) items, i.e., the states of the DFA will have LR (1) items.

LR (1) Items have two parts

1. First part, same as LR(0) Item

2. Second part, look ahead or follow symbols that can occur at that instant.

Consider the grammar given below to illustrate the calculation of LR (1) items. Once the LR (1) items
are calculated, the construction of the DFA is simple and straightforward.

S1 → S

S → DD

D → cD

D→d

For S1 → • S the follow symbol obviously is $, as S1 is the start symbol. Therefore the LR (1) item is
S1 → • S, $

If we take the closure of • S and replace S by • D D, the follow symbol remains $. Therefore

S → • D D, $

Now take the closure of the first D, replacing it by • c D and • d. Since it is the 1st D being expanded,
all the First symbols of the second D become the look-ahead symbols:

D → • c D, c / d

D → • d, c / d

Note that we are calculating the follow symbols in context. If we had

S → D • D, $

when we take the closure of D, it is the 2nd D being expanded; therefore the follow symbol remains $, i.e.,

D → • c D, $ and D → • d, $

We complete the DFA as follows with LR (1) Items.

The parsing table is constructed in the same way as for the SLR (1) parser, but there is no need to
calculate follow symbols separately (the look-aheads serve that purpose).

The r1, r2, r3 are as follows

r1: S→ D D

r2: D → c D

r3: D → d

Let us now parse a sentence derived from the above grammar. Let the given sentence be

5.3.1 Limitations of LR(1) Parser:

In the previous example (Fig 6.2) we see that states 8 and 9 have the same first component; they
differ only in their look-ahead symbols. Similarly states 4 & 7 and 3 & 6. The LR (1) parser has more
states than the SLR (1) parser, and for a typical programming language it may have several thousand
states. This increases the memory requirement.

5.4 Construction LALR(1) Parser:

This can easily be constructed from the table of the LR (1) parser. The basic principle is that states
which have the same first component (called a common core) are combined to form new states, which
we name by their combined state numbers. For example,

states 3 & 6 are combined and the new state is called state 36.

Similarly we create states 89 and 47, as illustrated below.
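As an illustration (the item sets are inferred from the grammar; the text’s figure is not reproduced here),
states 8 and 9 share the common core D → c D • and differ only in their look-aheads, so they merge as:

state 8: D → c D • , c / d
state 9: D → c D • , $
merged state 89: D → c D • , c / d / $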

The ACTION and GOTO entries are modified accordingly, and the new LALR (1) parsing table is as follows.

The procedure for parsing is the same as in the SLR (1) parser.

LALR (1) parsers are the most common modern parsers; parser generators generally construct
LALR (1) parsers. These parsers have a number of states comparable to that of the SLR (1) parser.

5.5 Parser Generator : YACC:

/*Yacc program for checking the validity of an arithmetic expression*/

%{
#include<stdio.h>
int flag;
%}
%left '-' '+'
%left '*' '/'
%left '%'
%token DT OP /* DT → digits; OP → operands; these
are defined in the lex program below*/
%%

E:S {flag = 1; printf("\n valid expression \n");}


S:S '+' S /* S is being defined in this section;
|S '-' S S could be defined as S+S; or S-S;
|S '*' S S*S; or S/S or (S) or digit or
|S '/' S operand for checking the validity of
|'(' S ')' the expression */
|DT
|OP
;
%%

int main()
{
char ch;
flag = 0; /* initialization of flag */
printf("\n enter expression : ");
yyparse();

return 0;
}

int yyerror()
{
if(flag==0)
printf("\n invalid expression\n");
return 0;
}

/*Supporting Lex program for checking the validity of an arithmetic expression*/

%{
#include "y.tab.h"
%}
%%
[0-9]+ {return DT;}
[a-zA-Z]+ {return OP;}
[ \t] ;
\n {return 0;}
. {return yytext[0];}
%%

Output:

1. enter expression : (A+B)


valid expression

2. enter expression : a*b-c


valid expression

3. enter expression : a/c+b-e


valid expression

4. enter expression : 5+(8-4)


valid expression

5. enter expression : 5(8-4)


invalid expression

6. enter expression : (a*b-c


invalid expression

5.6 Error Recovery in Bottom-up parser:

As discussed earlier, error recovery aims to give a suitable error message and also make a correction so that parsing can continue.

Consider the following grammar

S→aSb│c

and the corresponding parsing table (Table 5.14). In this table blanks indicate errors. Each blank is to be connected to a suitable error routine which issues an error message and makes an appropriate correction. The table is modified as follows:

e1: Error routine: error message "missing a"; push a and cover it with state 2.

e2: Error routine: error message "unbalanced b"; remove the input symbol.

Now let us see how the parser recovers while parsing a syntactically wrong sentence and also gives suitable messages.

The above recovery shows that with error routines e1 and e2 not only were relevant error messages given, but the parser was also able to recover from the errors and accept the syntactically wrong input 'b c b'.

In general, error routine design requires careful consideration. The above error recovery only illustrates a typical case; we have left the other error places (blanks) without any action. By carefully considering all the error places we can suggest suitable recovery actions.

Assignment

1. Construct LR (0) items for grammar

E→(L)/a

L→ L,E/E

2. Construct LR (0) items for grammar

S →A a │ b A c│B c│b B a

A →d

B →d

3. Consider the grammar Q no 1, construct the SLR parsing table for the same

4. Show that the grammar

S → A a Ab │ B b B a

A→ Є

B→Є

is not SLR

5. Construct LR (0) items for grammar

6. Construct SLR parsing table for the grammar in Q. No.5

7. Show the parsing stack and the actions of an SLR parser for the input string ((a),a,(a,a))
considering the grammar of Q no 1

8. Construct DFA of LALR(1) items for the grammar in Q no 1

9. Construct LALR parsing table for the grammar in Q no 1

10. Show that the following grammar is not LR(1)

A → a A a │ Є

Chapter 6

Semantic Analysis

6.0 Introduction

Semantic analysis deals with constructs that are grammatically correct. Most programming language constructs have the basic characteristic that if the syntax is correct, the semantics is fixed. However, this does not hold for all constructs, and we need to look at the language specification given by the language designer to implement them correctly.

For example: Consider a construct a + b * c. We know that the meaning implied here is that 'b * c' should be computed first and the result added to 'a'. This can be easily comprehended if we construct the abstract syntax tree as follows.

As the tree is evaluated from the leaves upward, the meaning is correctly brought out, i.e., 'b * c' is calculated first and 'a' is added to it as the left operand.

Consider another example, where we will not be able to construct an abstract syntax tree and infer the meaning (semantics) from it as in the above example:

b = x[i];

Here it is important to know the data types of b, x and i; otherwise we will not be able to know the semantics of this statement. We need to refer to the symbol table to know the types of b, x and i. This example requires information not contained in the statement itself, which makes semantic analysis difficult; it is a consequence of the non-context-free features of programming languages. In general semantic analysis is more involved, i.e., unlike lexical and syntax analysis we do not have tools like LEX and YACC which can generate the analyzer automatically.

An attribute is any property of a programming language construct, like data type, value, location, etc., whereas an attribute grammar tells us how to compute the semantics for each syntactic rule.

6.1 Attribute Grammars

In syntax-directed translation attributes are associated with each grammar symbol which defines the
syntax of the language.

Consider a grammar for unsigned numbers

num → num digit │ digit

digit → 0│1│2│ - - - - - - │9

With the above grammar rules we can ascertain whether a given input is a correct unsigned number or not, but we will not be able to calculate its value. This is where an attribute grammar becomes important.

If input number is 385 we can draw syntax tree as follows to check its syntax.

But we are interested in its semantics, i.e., its value, 385. This can be calculated by devising an attribute grammar wherein each syntax rule has an associated semantic rule or attribute equation. All the semantic rules together constitute the attribute grammar, as illustrated below.
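A sketch of such an attribute grammar, using a synthesized attribute val, would be:

num → num1 digit     num.val = num1.val * 10 + digit.val
num → digit          num.val = digit.val
digit → 0            digit.val = 0
digit → 1            digit.val = 1
 - - -
digit → 9            digit.val = 9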

By executing the semantic rules as we traverse the syntax tree, we end up calculating the semantics, which in this case happens to be the value. The following figure shows how the value is calculated as we traverse the tree; the execution of the semantic rules is shown in brackets.
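For the input 385 the evaluation proceeds bottom-up: digit.val = 3 gives num.val = 3; then num.val = 3 * 10 + 8 = 38; finally num.val = 38 * 10 + 5 = 385.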

Example 3:

Consider the grammar for integer arithmetic expressions.

expn → expn + term│term

term → term * factor│factor

factor → (expn)│number

Consider the parse tree for expression 24 * 3 + 5

But in semantic analysis our interest is to compute the value of the expression, so we need to associate with each syntactic rule a semantic rule as follows.
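A sketch of the semantic rules, again with a synthesized attribute val:

expn → expn1 + term      expn.val = expn1.val + term.val
expn → term              expn.val = term.val
term → term1 * factor    term.val = term1.val * factor.val
term → factor            term.val = factor.val
factor → (expn)          factor.val = expn.val
factor → number          factor.val = value of number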

Now, executing the semantic rule as and when we apply the syntactic rule, we will be able to calculate the 'attribute'.

Example 4: Evaluation of expression: 15+4*6*5
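Applying the same rules bottom-up: 4 * 6 = 24, 24 * 5 = 120, and 15 + 120 = 135, so expn.val = 135.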

Example: Consider the following grammar for assignment expression.

Assign → ID = E

E → E+E

E → E*E

E → ID

ID → a│b

Consider the parse tree for b = a + b * a

ABSTRACT SYNTAX TREE

Using this abstract syntax tree, if the attribute to be calculated is, say, intermediate code, then at each node we need one temporary variable to store the result of the intermediate calculation. Therefore each node is given a label t1, t2, etc., and the intermediate code generated is as follows:

t1 = b * a

t2 = a + t1

b = t2

If we need to generate this code using an attribute grammar, we need to define semantic rules for each syntactic rule. Before we do, consider one more attribute that stores the name holding the value; we call this attribute place. code is the main attribute. createtemp( ) is a function that generates a temporary variable, which can be used to store intermediate results.
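A sketch of the semantic rules (|| denotes concatenation of code strings; place and code are the attributes just described):

Assign → ID = E    Assign.code = E.code || (ID.place '=' E.place)
E → E1 + E2        E.place = createtemp();
                   E.code = E1.code || E2.code || (E.place '=' E1.place '+' E2.place)
E → E1 * E2        E.place = createtemp();
                   E.code = E1.code || E2.code || (E.place '=' E1.place '*' E2.place)
E → ID             E.place = ID.name; E.code = '' (empty)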

Example: a = b * a + a

t1 = b * a

t2 = t1 + a

a = t2

Example 2: a = b + a

t1 = b + a

a = t1

6.2 The Symbol table

This is an important data structure which keeps information about identifiers and function names. The table interacts with all the stages of the compiler, which look up information in it or insert information into it.

The main operations involved are

- Insert into symbol table

- Lookup for a variable

- Delete a variable.

6.2.1 Insert operation

The insert operation is done during scanning: if the scanner identifies a token as an identifier, it is inserted into the table. For example

int count;

When the scanner looks at the above statement, three tokens are encountered, namely the reserved word int, the variable name count and the delimiter ';'. The scanner inserts count in the symbol table along with the data type of count and its scope.

6.2.2 Lookup operation

When we encounter a statement count = count + 1;

we need to check whether count is declared or not; if it is declared we must find it in the symbol table. The semantic analysis phase looks up the variable count in the symbol table.

6.2.3 Delete operation:

The delete operation is required as the scope of a variable changes. For example

{   int i ;
    :
    :           /* Block 1 */
    { float i ;
      :
      :         /* Block 2 */
    }
}

When the semantic analyzer is in Block-1, the symbol table should allow it to see i as an integer variable with its related information. When the semantic analyzer is in Block-2, the symbol table should allow it to see i as float type, i.e., the integer variable i should be deleted. In practice the integer variable i is not actually deleted, but it is made invisible while Block-2 is active.

6.3 The structure of symbol table:

We need to design the symbol table such that all three operations, namely insert, lookup and delete, are performed efficiently. Depending on the size of the compiler we may have the following types of symbol table organizations.

6.3.1 List

List is a simple and easy to implement data structure for a symbol table. The list can be implemented
as a single array or linked list.

If the table has n entries, then to insert a new entry we need to search all the n entries; therefore the complexity is O(n).

Similarly for the lookup operation, i.e., when we want to know whether an identifier is present in the table, we may have to search the entire table. Again the complexity is O(n). The delete operation is discussed later.

We conclude that as the size of the table increases, the insert and lookup times are going to increase.

6.3.2 Tree

Symbol table can be constructed as a binary tree, as follows, assuming we have encountered the
following variables.

t, z, a, f, c, g

A binary tree is constructed such that all the names to the left of a node have a lesser value and those to the right have a greater value; the value could be their sequence number or alphabetical order. Here the values of a, f, c, g are less than t and are therefore found on the left link of t. The insert time is O(log n) on average. This organization is not commonly used: though its complexity is lower, it has memory overhead for the links, and a skewed tree may be formed in some cases.

6.3.3 Hash Table Organization

This is the most common type of organization for symbol table construction. It performs the insert, lookup and delete operations efficiently, with an average complexity of O(1).

A hash table is an array of entries, called buckets, indexed by an integer in the range 0 to (table size – 1). Normally the table size is a prime number; this makes the hashing function behave better. For example, if the table size were 10 then 11, 21, 31 would give the same remainder, whereas if the table size were 11 then 11, 21, 31 would give remainders 0, 10 and 9, which is a better distribution.

6.3.3.1 Formation of Hash

The hash is generally formed by adding all the characters' ASCII values and dividing by the table size; the remainder is termed the hash, an integer from 0 to (table size – 1). This integer is used as the index for all the symbol table operations (a C sketch of this step is given after the list below). It is possible that two or more identifiers give the same hash value; a collision is then said to occur. When collisions occur, there are two ways of resolving them:

- Open addressing

- Chaining
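As an illustration of the hashing step itself, a minimal C sketch (the names hash and TABLE_SIZE are assumptions) could be:

#define TABLE_SIZE 211          /* a prime table size gives a better distribution */

/* add up the ASCII values of the characters and take the
   remainder on division by the table size */
unsigned int hash(const char *name)
{
    unsigned int sum = 0;
    while (*name != '\0')
        sum += (unsigned char) *name++;
    return sum % TABLE_SIZE;
}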

6.3.3.2 Open Addressing

In this method, when a collision happens, say while inserting an identifier name (that is, the location is already filled), we check whether the next location is free, and so on till we find a free location.

The problem with open addressing is that clusters are formed, and the operations take longer and longer as the table fills up. This significantly slows the operations. A further problem is that the delete operation does not improve the performance.

For example, assume that id1, id2, id3 have the hash values 10, 10, 11; then the table is filled as follows. Although id3 has the hash value 11, it is stored in bucket 12.

6.3.3.3 Chaining

In this case each bucket has a list, which can be dynamic: for each hash value there is a list, and all the identifiers having the same hash value are entered in the same list. In case of collision we need to search only the list corresponding to the hash value; the search is limited to that list and does not spill into the next hash value as in the open addressing scheme. For the example above

As we can see, searching for id3 requires only one probe, whereas in open addressing 2 probes are required. If n identifiers are uniformly distributed in a table of size m, we need to search at most about n/m entries per lookup. This is far superior compared to open addressing.
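A minimal C sketch of chained lookup, building on the hash( ) sketch above (the struct layout and names are assumptions):

#include <string.h>

struct entry {
    char *name;                 /* identifier lexeme */
    struct entry *next;         /* next entry with the same hash value */
    /* type, scope and other attributes would follow */
};

static struct entry *bucket[TABLE_SIZE];   /* one list head per hash value */

/* search only the list for hash(name); other buckets are never touched */
struct entry *lookup(const char *name)
{
    struct entry *p;
    for (p = bucket[hash(name)]; p != NULL; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;                /* not found */
}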

6.4 Data types checking

Data types: We know that data types are important in programming languages as they define

- The range of values

- The operations defined on them

- Memory allocation

For example, when we define int c; the range of values that c can take gets fixed. Since this is not an array, this data type cannot be indexed. Depending on the computer, this data type may occupy either 2 or 4 memory locations.

6.4.1 Data type checking Procedure

Data type checking is required to implement the correct semantics as per the programming language specification. For example:

struct { float r ;
         int i ;
       } x ;

int y ;

With the above declaration

x = y; is not semantically correct, though it is syntactically correct. The parser says x = y; is syntactically correct because it does not violate any grammar rule. But x is a structure and y is an integer; these two types are not equivalent, and meaningful code cannot be generated for this nonsense statement. This fact is ascertained during type checking, by checking type equivalence.

6.4.1.1 Type Equivalence

Type equivalence can be ascertained by generating a syntax tree for each type. For example, x defined above has the following syntax tree.

When we assign x = y; type checking determines whether x and y have the same structure in their respective syntax trees. It will be found that x and y do not have the same syntax tree structure, and this statement will be flagged as an error.

6.4.1.2 Name Equivalence

Name equivalence can also be used, provided the variables are declared with the same type name. For example

int x, y ;

x=y;

Here x and y are of the same type, i.e., they have the same type name, int. Name equivalence can be applied to structures also:

struct typesame
{ double r ;
  int i ;
};

struct typesame x, y;

When x = y; is assigned, since x and y have the same type name, typesame, the assignment is declared valid by the type checking program.

Assignment:

1. Given the following grammar for unsigned binary number. Write semantic rule to calculate the
decimal value of the binary number

bnum → bnum bdigit │ bdigit

bdigit → 0│1

and also for the number 1011 Draw 1) Parse tree, 2) Apply semantic rules to calculate decimal
value.

2 Given the following grammar for unsigned octal number. Write semantic rules to calculate the
decimal value of the octal number

onum → onum odigit │ odigit

odigit → 0│1│2│- - - │7

And also for the number 356 draw 1) Parse tree, 2) Apply Semantic rules to calculate decimal
value.

3. Same as problem 2 but calculate binary value.

4. For the grammar

expn → expn + term│term

term → term * factor │ factor

factor → (expn) │ number

using semantic rules for the expression (3 + 5) * 8 calculate the value

5. Same as problem 4 but for the expression (3 +5) * (2 + 1) calculate the value

6. Consider the grammar

A → ID = E

E→E+E

E → ID

ID → a │b│c

Write a parse tree for a = a + b + c and using semantic rules generate intermediate code

7. Same as problem 6 write a parse tree for b = a + b + c + c

8. Discuss the importance of symbol table and various ways how symbol table can be
organized.

9. If the input symbols are a, b, aa, b1, dd, da, db, aaa, construct the symbol table tree for the
above symbols. (Items follow dictionary order.)

10. Describe the hash symbol table organization and different method of collision resolving.

Chapter 7

Intermediate Code Generation

7.1 Introduction

In the first pass of the compiler, the source program is converted into intermediate code. The second pass converts the intermediate code to target code. The intermediate code generation is done by the intermediate code generation phase. It takes input from the front end, which consists of lexical analysis, syntax analysis and semantic analysis, and generates intermediate code for the code generator. Fig 7.1 shows the position of the intermediate code generator in the compiler. Although source code could be directly converted to target code, intermediate code has some advantages:

1. Target code can be generated for any machine just by attaching a new back end. This is
called retargeting.

2. It is possible to apply machine independent code optimizations, which helps in producing
faster code.

3. Language dependent optimizations can be done only at this stage.

This chapter deals with the intermediate representation in the form of three address code. Other forms
of intermediate representations are syntax tree, postfix notation or Directed Acyclic Graph (DAG).
The semantic rule for syntax tree and three address code are almost similar.

7.2.1 Graphical & Linear representation

Intermediate representation can be either in linear or graphical form. Graphical forms include the syntax tree and the DAG, whereas linear representations include postfix notation and three address code. Fig 7.2 shows the syntax tree for the expression a = -b * c.

Syntax tree

Directed Acyclic Graph (DAG)

For example, a = -b * c:

Linear representation: Postfix and three address code are the two forms of linear representation of
any expression.

Postfix notation for the expression a = -b * c is b-c*a=

Three address code for the expression a = - b * c is

t1= -b

t2 = t1 * c

a = t2

Intermediate code can be of many forms. They can be either

• Language specific like P-code for Pascal, byte code for Java etc or

• Specific to machine on which implementation is performed or

• Independent of language being implemented and target machine

The three address code that we consider here for intermediate representation is independent of the language being implemented and of the target machine.

7.2.2 Three address code

Most instructions of three address code are of the form

a = b op c

where b and c are operands and op is an operator. The result of applying operator op to b and c is stored in a. The operator op can be +, -, *, / etc., and is assumed here to be binary. The operands b and c represent addresses in memory, or may be constants or literal values with no runtime address. The result a can be an address in memory or a temporary variable.

Example: a = b * c + 10

The three address code will be

t1 = b * c

t2 = t1 + 10

a = t2

Here t1 and t2 are temporary variables used to store the intermediate results.

7.2.2.1 Types of three address code

There are different types of statements in source program to which three address code has to be
generated. Along with operands and operators, three address code also use labels to provide flow of
control for statements like if-then-else, for and while. The different types of three address code
statements are:

1. Assignment statement a = b op c

In the above case b and c are operands, while op is binary or logical operator. The result of
applying op on b and c is stored in a.

2. Unary operation a = op b

This is used for unary minus or logical negation.

Example: a = b * (- c) + d

Three address code for the above example will be

t1 = -c

t2 = t1 * b

t3 = t2 + d

a = t3

3. Copy Statement a=b

The value of b is stored in variable a.

4. Unconditional jump goto L

Creates label L and generates three-address code ‘goto L’

5. Conditional jump if exp goto L

Creates a label L and generates code for the expression exp. If exp evaluates to true, control transfers to the statement labelled L; if exp evaluates to false, control passes to the statement immediately following the if statement.

6. Function call For a function fun with n arguments a1,a2,a3….an ie.,

fun(a1, a2, a3,…an),

the three address code will be

param a1

param a2

...

param an

call fun, n

where param defines the arguments to the function.

7. Array indexing – In order to access the elements of an array, either single dimension or multidimension, three address code requires a base address and an offset value. The base address is the address of the first element of the array; the other elements are accessed using the base address and the offset value.

Example: x = y[i]

Memory location m = Base address of y + Displacement i

x = contents of memory location m

similarly x[i] = y

Memory location m = Base address of x + Displacement i

The value of y is stored in memory location m

8. Pointer assignment

x = &y — x gets the address of memory location y

x = *y — y is a pointer whose r-value is a location; x gets the value stored at that location

*x = y — sets the r-value of the object pointed to by x to the r-value of y

Intermediate representation should have an operator set which is rich to implement most of the
operations of source language. It should also help in mapping to restricted instruction set of target
machine.

7.3 Data Structure

Three address code is represented as record structure with fields for operator and operands. These
records can be stored as array or linked list. Most common implementations of three address code are-
Quadruples, Triples and Indirect triples.

Quadruples – Quadruples consist of four fields in the record structure: one field to store the operator op, two fields to store the operands or arguments arg1 and arg2, and one field to store the result res: res = arg1 op arg2

Example: a = b + c

b is represented as arg1, c is represented as arg2, + as op and a as res.

Unary operators like '-' do not use arg2. Operators like param use neither arg2 nor res. For conditional and unconditional jumps res is a label. arg1, arg2 and res are pointers into the symbol table or literal table for the names.

Example: a = -b * d + c + (-b) * d

Three address code for the above statement is as follows

t1 = - b

t2 = t1 * d

t3 = t2 + c

t4 = - b

t5 = t4 * d

t6 = t3 + t5

a = t6

Quadruples for the above example is as follows
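Reconstructed from the three address code above (a sketch, writing unary minus as uminus):

        op      arg1    arg2    res
(1)     uminus  b               t1
(2)     *       t1      d       t2
(3)     +       t2      c       t3
(4)     uminus  b               t4
(5)     *       t4      d       t5
(6)     +       t3      t5      t6
(7)     =       t6              a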

Triples – Triples use only three fields in the record structure: one field for the operator and two fields for the operands, named arg1 and arg2. The value of a temporary variable is referred to by the position of the statement that computes it, not by a location as in quadruples.

Example: a = -b * d + c + (-b) * d

Triples for the above example is as follows
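A sketch of the triples, with (i) denoting a reference to the result of statement i:

        op      arg1    arg2
(1)     uminus  b
(2)     *       (1)     d
(3)     +       (2)     c
(4)     uminus  b
(5)     *       (4)     d
(6)     +       (3)     (5)
(7)     =       a       (6)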

Arg1 and arg2 may be pointers to symbol table for program variables or literal table for constant or
pointers into triple structure for intermediate results.

Example: Triples for statement x[i] = y which generates two records is as follows

Triples for statement x = y[i] which generates two records is as follows
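One common way of writing these two-record triples (the operator notation varies):

For x[i] = y:
(1)     []=     x       i
(2)     assign  (1)     y

For x = y[i]:
(1)     =[]     y       i
(2)     assign  x       (1)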

Triples are an alternative way of representing the syntax tree or directed acyclic graph over program defined names.

Indirect Triples – Indirect triples achieve indirection by listing pointers to triples rather than listing the triples themselves.

Example: a = -b * d + c + (-b) * d

When target code is generated, each program variable and temporary variable is assigned a memory location, and this address is stored in the symbol table. Quadruples use this address for output generation. If an assignment statement for b is moved from one location to another (interchanging instructions), no change is required to regenerate the intermediate code, because the symbol table entry for each program or temporary variable can be accessed directly. In the case of triples, temporary variables are not stored in the symbol table and all references are to statement positions rather than locations, so the compiler would have to change all references in arg1 and arg2; thus triples are not very convenient for optimizing compilers. In the case of indirect triples no pointer refers to a temporary variable directly, hence no change of pointers is required when instructions are interchanged: statements can be moved simply by reordering the statement list.

Both indirect triples and quadruples give almost the same performance with respect to space and reordering of code. However, indirect triples can save space if temporary variables are reused: for the previous example, statement (13) could be removed and the value of statement (1) reused.

7.4 Basic Intermediate Code Generation Technique

A program consists of assignment statements like a = b op c and control statements like if-then-else, while and for statements. This section deals with the generation of three address code for assignment statements and control statements.

7.4.1 Assignment statement

This section deals with the generation of intermediate code for assignment statements. It describes the way the symbol table is searched for an identifier. Identifiers can be simple variables, single or multidimensional arrays, or constant values (stored in the literal table). The next step is the generation of three address code for the program statement.

• Searching in symbol table

During the generation of intermediate code the symbol table has to be searched for identifiers. The lexeme of the identifier is stored in the variable id.name. Searching for the identifier in the symbol table is done by the function search( ), which returns a pointer to the identifier's symbol table entry if id.name has one; if the search fails it returns null, indicating id.name was not found.

• Generate code

The intermediate code generator uses a function produce( ) to generate three address code and store it in the output file. It also uses an attribute E.value to store the name that holds the value of E. All intermediate results are stored in temporary variables; the function newtemp( ) generates a new temporary variable (t1, t2, …) every time it is called.

Example: Consider the following grammar for assignment statement

S → id=E

E → E1 + E2

E → E1 * E2

E → -E1

E → (E1)

E → id

Translation scheme to produce three address code is as follows
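A sketch of the scheme, using the functions search( ), produce( ) and newtemp( ) described above:

S → id = E      { p = search(id.name);
                  if p == null then error else produce(p '=' E.value) }
E → E1 + E2     { E.value = newtemp(); produce(E.value '=' E1.value '+' E2.value) }
E → E1 * E2     { E.value = newtemp(); produce(E.value '=' E1.value '*' E2.value) }
E → - E1        { E.value = newtemp(); produce(E.value '=' '-' E1.value) }
E → (E1)        { E.value = E1.value }
E → id          { p = search(id.name);
                  if p == null then error else E.value = p }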

Example: Generate three address code for the following arithmetic expression

a=-b*c

Consider Fig 7.4, which shows the syntax tree for the expression a = - b * c.

1. Using the production E → id, i.e., E → b: check for an entry b in the symbol table; if it is not present display an error message, if present set E.value = b (a pointer to the symbol table entry for b).

2. Create a temporary variable t1 with production 4 (E → -E1); this produces the intermediate code

t1 = - b

3. As in step 1, search for the entry c in the symbol table and assign E.value = c (a pointer to the symbol table entry for c).

4. Using the production E → E1 * E2, create a temporary variable t2 using newtemp( ). The semantic rule E.value = E1.value * E2.value generates

t2 = t1 * c

5. Using the production S → id = E, search for a in the symbol table and, assuming it is stored, produce the code a = E.value:

a = t2

7.4.2 Reusing temporary variables

The newtemp( ) function creates a new temporary variable to store each intermediate result. If the number of temporary variables grows large, it becomes difficult to maintain them, and in some cases the values of temporaries are not needed until the end, so the memory reserved for these variables goes unused. In order to use memory efficiently and reduce the number of temporary variables, temporary variables are reused. The newtemp( ) function has to be modified to achieve this: instead of always generating a new temporary variable on every call, it should find a temporary variable whose use is already completed, so that intermediate results can be stored in such variables, i.e., they are reused.

The other significance of reusing temporaries is that some sub-expressions may be repeated in the whole expression. Instead of creating a new temporary variable for each occurrence, the earlier one can be reused.

Example: a = -b * c + (-b) * c

The three address code will be as follows:

(1) t1 = -b

(2) t2 = t1 * c

(3) t3 = t2 + t2

(4) a = t3

After statement (2), t1 is not used again; hence instead of creating a new temporary t3, t1 can be reused. This generates the following code:

(1) t1 = - b

(2) t2 = t1 * c

(3) t1 = t2 + t2

(4) a = t1

The number of temporary variables used is 2 instead of 3.

Example:

(1) t1 = - b                          (1) t1 = - b
(2) t2 = t1 * c                       (2) t2 = t1 * c
(3) t3 = - b        can be changed to (3) t1 = t2 + t2
(4) t4 = t3 * c                       (4) a = t1
(5) t5 = t2 + t4
(6) a = t5

reducing the number of temporary variables from 5 to 2 and reusing the value of t2.

7.4.3 Addressing array elements

Elements of an array are stored in consecutive memory locations. If the array is A and the size of each element is s, then the ith element of the array can be accessed at base + (i – low) * s, where base is the base address of the array (the address of the 1st element) and low is the lower bound of the array (the index of the first element).

Example: Let A[10] be an array of 10 elements. Let size of each element is 2 ie., s=2 and the array is
stored from memory location 1000 ie base address=1000.

The expression can be written as i * s + (base – low * s), where part I of the expression is (i * s) and part II is (base – low * s).

In part II all the components are known before compilation, hence part II can be pre-computed and stored. This reduces the time taken to generate the address of the ith element.
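For instance, with base = 1000 and s = 2 as in the example above, and assuming low = 0, part II is 1000 – 0 * 2 = 1000, so the address of A[i] is simply i * 2 + 1000; e.g., A[3] is at 1006.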

In case of multi-dimension array like matrix, elements are either stored as Row Major or Column
Major.

Example: Consider Array A[3,3] with elements

It can be stored as

C language and Pascal uses row major storage where as Fortran language uses column major storage.

The address of element A[i, j] in row major storage is given by the expression

A[i, j] = base + ((i – low1) * n2 + j – low2) * s (Exp 1)

where low1 and low2 are the lower bounds of i and j, n2 is the number of columns, and s is the size of each element. Expression (Exp 1) can be rewritten as

A[i, j] = ((i * n2) + j) * s + (base – ((low1 * n2) + low2) * s) (Exp 2)

The second part of expression (Exp 2) can be pre-computed from base, low1, low2 and s. This helps in faster generation of the address of A[i, j].

Arrays may have various dimensions, hence the generalized productions are as follows:

A → Alist ] │ id

Alist → Alist, E │ id [ E

The translation scheme for arrays, with the complete definition, is as follows:

S → A = E

E → A

A → Alist ]

A → id

Alist → Alist, E │ id [ E

A can be a simple name (which has only a base address and no offset) or an indexed name (which has a base address and an offset).

The l-value of A has two attributes, A.value and A.offset. If A is a simple name then A.value points to the symbol table and A.offset = null.

If A is an array then A.offset is a temporary variable which stores the first part of expression (Exp 2), and A.value stores the second part of (Exp 2).

c denotes the second component of expression (Exp 2), and m denotes the dimension of the array (m = 1 indicates a single dimension array, m = 2 a 2-dimension array, etc.).

The function lmt(j) gives the maximum number of elements in the jth dimension of the array, and width( ) gives the size of an element.

Example: Let A be a two-dimension matrix of size 20 * 10, with the lower bounds of both dimensions equal to one, i.e., low1 = low2 = 1, n1 = 20, n2 = 10, and let the size of each element be 4, i.e., s = 4. Generate the address for x = A[y, z].

Fig 7.5 shows the annotated parse tree for the array assignment statement x = A[y,z]
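Under these assumptions, since n2 = 10 and the precomputed constant is base – ((1 * 10) + 1) * 4 = base – 44, the translation would produce three address code along the following lines (a sketch; t1–t3 are temporaries):

t1 = y * 10
t1 = t1 + z
t2 = base – 44      /* constant part, computed at compile time */
t3 = 4 * t1
x = t2[t3]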

7.4.4 Logical Expression

An expression may consist not only of arithmetic operators like +, -, * etc.; it may also have boolean/logical operators like and, or and not. Expressions with relational operators like <, >, <=, >= etc., along with logical operators, are mainly used in flow control statements like if-then-else, while-do and repeat-until. not has the highest precedence, followed by and, and or has the least precedence.

Logical expressions always evaluate to either true or false. True can be treated as a non-zero (typically 1) value, whereas false is represented by 0. The following is the translation scheme using a numerical representation for logical expressions.

Example:

1. a or b and not c

Three address code for the above expression will be as follows

t1 = not c

t2 = b and t1

t3 = a or t2

2. if a < b then 1 else 0

Three address code for the above statement is as follows

10: if a < b go to 13

11: t1 = 0

12: go to 14

13: t1 = 1

14:

3. a < b or c < d and e < f

10: if a < b go to 13

11: t1 = 0

12: go to 14

13: t1 = 1

14: if c < d go to 17

15: t2 = 0

16: go to 18

17: t2 = 1

18: if e < f go to 21

19: t3 = 0

20: go to 22

21: t3 = 1

22: t4 = t2 and t3

23: t5 = t1 or t4

7.4.5 Flow control statements

Control statements are used to alter the sequential flow of execution. Some of the control statements are the if-then-else statement and the while statement. The following is a pictorial representation of flow control statements.

Control flow for the if-then statement is as shown in Fig 7.6.

Control flow for while statement is as shown in Fig 7.7

Fig 7.8 Code layout for (a) if-then, (b) if-then-else and (c) while-do statements: E.code jumps to E.true or E.false; S1.code is placed at label E.true (for if-then-else it ends with goto S.next, and S2.code follows at label E.false); for while-do, E.code is placed at label S.begin and S1.code ends with goto S.begin, with E.false labelling the exit.

Code for if-then and if-then-else can be generated using the following translation rules.
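A sketch of these rules, following the layout of Fig 7.8 (newlabel( ) generates a fresh label, and || denotes concatenation of code):

S → if E then S1
      E.true = newlabel(); E.false = S.next
      S.code = E.code || label(E.true) || S1.code

S → if E then S1 else S2
      E.true = newlabel(); E.false = newlabel()
      S.code = E.code || label(E.true) || S1.code || goto S.next
               || label(E.false) || S2.code

S → while E do S1
      S.begin = newlabel(); E.true = newlabel(); E.false = S.next
      S.code = label(S.begin) || E.code || label(E.true) || S1.code
               || goto S.begin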

Example:

Generate three address code for the following statement

while a<b do

if c< d then

x=y+z

else

x=y-z

Solution: the three address code will be as follows

L1: if a<b then GOTO L2

GOTO LNEXT

L2: if c<d then GOTO L3

GOTO L4

L3: t1 = y + z

x = t1

GOTO L1

L4: t1= y - z

x = t1

GOTO L1

LNEXT:

7.5 Code Generation

The code generator phase generates the target code, taking intermediate code as input. The output of the intermediate code generator may be given directly to code generation, or may pass through code optimization before code is generated.

7.5.1 Issues in Design of Code generation:

Target code mainly depends on available instruction set and efficient usage of registers. The main
issues in design of code generation are

• Intermediate representation: a linear representation such as postfix notation or three address code (quadruples), or a graphical representation such as a syntax tree or DAG. We assume that type checking has been done and the input is free of errors. This chapter deals only with three address code as the intermediate representation.

• Target Code: The target code may be absolute code, re-locatable machine code or assembly
language code. Absolute code can be executed immediately as the addresses are fixed. But in
case of re-locatable it requires linker and loader to place the code in appropriate location and
map (link) the required library functions. If it generates assembly level code then assemblers
are needed to convert it into machine level code before execution. Re-locatable code provides
great deal of flexibilities as the functions can be compiled separately before generation of
object code.

• Address mapping: Address mapping defines the mapping between intermediate


representations to address in the target code. These addresses are based on the runtime
environment used like static, stack or heap. The identifiers are stored in symbol table during
declaration of variables or functions, along with type. Each identifier can be accessed in
symbol table based on width of each identifier and offset. The address of the specific
instruction (in three address code) can be generated using back patching

• Instruction Set: The instruction set should be complete in the sense that all operations can be implemented. Sometimes a single operation may be implemented by several different instruction sequences; the code generator should choose the most appropriate one. The instructions should be chosen in such a way that execution time, or the utilization of other machine resources, is minimized.

Example: Consider the set of statements

a = b * c
d = a * e

The three address code will be as follows:

t1 = b * c
a = t1
t2 = a * e
d = t2

The final code generated will be as follows:

MOV b, R0      /* load b into register R0 */
MUL c, R0
MOV R0, a      /* storing a and immediately reloading it can be eliminated */
MOV a, R0
MUL e, R0
MOV R0, d

Redundant instructions should be eliminated. Also, replace a sequence of instructions by a single instruction where possible; for example, for x = x + 1:

MOV x, R0
ADD # 1, R0        can be replaced by INC x
MOV R0, x

7.5.2 Register allocation: If the operands are in registers, execution is faster; hence the set of variables whose values are required at a point in the program should be retained in registers.

Familiarities with the target machine and its instruction set are a pre-requisite for designing a good
code generator.

7.6 Target Machine: Consider a hypothetical byte addressable machine as the target machine. It has n general purpose registers R1, R2, …, Rn. The machine instructions are two address instructions of the form

op-code source address destination address

Example:

MOV R0, R1

ADD R1, R2

Target Machine supports for the following addressing modes

1. Absolute addressing mode

Example: MOV R0, M where M is the address of memory location of one of the operands. MOV
R0, M moves the contents of register R0 to memory location M.

2. Register addressing mode where both the operands are in register.

Example: ADD R0, R1

3. Immediate addressing mode – The operand value appears in the instruction.

Example: ADD # 1, R0

4. Indexed addressing mode – this is of the form c(R), where the address of the operand is c + contents(R).

Example: MOV 4(R0), M — the source operand is located at address 4 + contents(R0).

The cost of an instruction is defined as the cost of execution plus the number of memory accesses.

Example:

MOV R0, R1, the cost = 1 as there are no memory access.

Where as MOV R0, M cost = 2.

7.6.1 Register and address descriptor

Register descriptor gives the details of which values are stored in which registers and the list of
registers which are free.

Address descriptor gives the location of the current value can be in register, memory location or in
stack is based on runtime environment.

7.7 Code generation algorithm

Consider the simple three address code for which the target code to be generated.

Example: a = b op c

1. Consult the address descriptor for b to find out whether b is in a register or in a memory location. If b is in a memory location, generate the code

MOV b, Ri

where Ri is one of the free registers as per the register descriptor. Update the address descriptor of b and the register descriptor.

2. Generate the code op c, where c can be in a memory location or in a register.

3. Store the result a in location L, where L can be a memory location M or a register R, based on the availability of free registers and the further usage of a. Update the register descriptor and the address descriptor for a accordingly.

Example: x = y + z

Check for location of y,

Case 1: If y is in register R0 and z may be in register or memory. The instructions will be

ADD z, R0

MOV R0, x

In this case the result x has to be stored in memory location x.

Case2: If y is in memory, fetch y to register, update address and register descriptor

MOV y, R0

ADD z, R0

MOV R0, x

Example: P = (x – y) + (x – z) + (x – z)

The three address code is:

t1 = x – y

t2 = x – z

t3 = t1 + t2

t4 = t3 + t2

P = t4
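Applying the code generation algorithm with two free registers R0 and R1, one possible target code sequence (a sketch; note that t2, held in R1, is used twice without being reloaded) is:

MOV x, R0
SUB y, R0      /* R0 holds t1 */
MOV x, R1
SUB z, R1      /* R1 holds t2 */
ADD R1, R0     /* R0 holds t3 */
ADD R1, R0     /* R0 holds t4 */
MOV R0, P      /* store the result in P */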

Example: Generate code for the instructions x = y[i] and x[i] = y, where i may be in a register Ri, in memory location Mi, or on the stack at offset Si (A denotes the pointer to the activation record):

Stmt        i in register Ri        i in memory Mi          i in stack (offset Si)
            Code           Cost     Code           Cost     Code            Cost
x = y[i]    MOV y(Ri), R   2        MOV Mi, R      4        MOV Si(A), R    4
                                    MOV y(R), R             MOV y(R), R
x[i] = y    MOV y, x(Ri)   3        MOV Mi, R      5        MOV Si(A), R    5
                                    MOV y, x(R)             MOV y, x(R)

7.8 Code generation for function call

Code generation for function calls is based on the runtime storage, which can be static allocation or stack allocation. In the case of static allocation the position of the activation record in memory is fixed at compile time. To recollect, whenever a function is called an activation record is created; it stores the parameters passed to the function, local data, temporaries, results and some machine status information, along with the return address. In the case of stack allocation, every time a function is called a new activation record is generated and pushed onto the stack; once the function completes, the activation record is popped from the stack. The three address code for function calls consists of the following statements:

1. Call.

2. Return

3. end

4. action

The call statement is used for a function call; it has to pass control to the function while saving the status of the current function. The return statement gives control back to the calling function. action denotes other operations or instructions, such as assignments or flow control statements. end indicates the completion of the operations of the called function.

7.8.1 Static allocation: This section describes the final code generation for function calls, where
static allocation is used as runtime environment.

• Call statement : The code generated for call stmt is as follows.

MOV # current + 20, function.static_area

GOTO function.code_area

#current + 20 is the address of the instruction to which the function returns, i.e., the instruction of the calling function to be executed after the called function completes execution; 20 accounts for the size of this MOV and the GOTO statement following it.

function.static_area is the address of the activation record of the function, and function.code_area is the address of the 1st instruction of the called function.

• Return statement: the code generated for the return statement is

GOTO * function.static_area

This transfers control back to the calling function.

Example:

/* code for main */

action 1

call fun

action 2

end

/* code for fun */

action 3

return

Three address code that will be generated for the above set of statements is as follows.

10: action 1

20: MOV # 40, 200 /* Save return address 40 at location 200 */

30: GOTO 100

40: action 2

50: end

/* code for function */

100: action 3

110: GOTO * 200

200: 40 /* return address */

7.8.2 Stack allocation: Whenever the function is called the activation record of called fun c is
stored on Stack, once the function returns, it is removed from Stack. Final code that will be generated
for stack area for initialize the Stack is

MOV # Stack.begin, SP /* initialize the Stack Pointer */


SP denotes Stack Pointer.
The code for the call statement is as follows:

ADD # main.recordsize, SP /* main.recordsize refers to the
record size of the calling function */
MOV # current + 16, *SP /* save the return address */
GOTO function.code_area

The return statement has the following target code:

GOTO *0(SP)

SUB # main.recordsize, SP

Example: For the below three address code

/* code for a */

action1

call c

action 2

end

/* code for b */

action 3

return

/* code for c */

action 4

call b

action 5

call c

action 6

call c

return

The final code generated will be as follows:

/* code for a */

100: MOV # 600, SP // initialize stack

110: action 1

120: ADD # a_size, SP

130: MOV # 150, *SP

140: GOTO 300

150: SUB # a_size, SP

160: action 2

170: end

/* code for b */

200: action 3

210: GOTO *0(SP)

/* code for c */

300: action 4

310: ADD # c_size, SP

320: MOV # 340, *SP

330: GOTO 200

340: SUB # c_size, SP

350: action 5

360: ADD # c_size, SP

370: MOV # 390, *SP

380: GOTO 300

390: SUB # c_size, SP

400: action 6

410: ADD # c_size, SP

420: MOV # 440, *SP

430: GOTO 300

440: SUB # c_size, SP

450: GOTO *0(SP)

600: stack starts here

Assignment

1. Convert the following statement to quadruples and triples

(a + b) * (c + d) + ( a + b + c)

2. Generate three address code for the following C code

i = 1;

while ( i <= 10)

{ a[i] = 0;

i = i + 1; }

3. Convert the following statement to quadruples and triples.

( x + y) * (( y + z ) / P – q/r ) +x + y

4. Generate target code for the following. Consider only three registers are available.

a=b+c*d

x = a / (b + c) – d * (e + f)

5. Generate three address code for the following C code

a[i] += b[c[i]]

6. Generate three address code for the following C code

for (i=0; i<10; i++)

A[i]=0;

7. Considering stack allocation generate code for

x = f (a) + f (a)

x = f(a) + g(b,c)

8. Generate target code and compute cost

x = a [i]

a[j] = y

z = a[i]

9. Compute the cost of set of instruction

MOV # C, R

MOV a, R1

MOV R1, a

MOV R2, * R1

MOV C (Rj), R1

ADD C (R2), R1

ADD R2, R1

INC R1

10. Generate target code for three address code generated for the Q no 5. Consider only three registers
are available.

Chapter 8

Runtime Environment

8.1 Introduction

In the previous chapters the phases of the compiler up to scanning, parsing and intermediate code generation were studied; the study was completely independent of the target language. This chapter deals with the runtime environment, which concerns the target computer's memory structure and the maintenance of memory. There are mainly three types of environments: the static environment used in FORTRAN 77; the stack based environment used in languages like Pascal, C and C++; and the fully dynamic environment used in languages like LISP. There can also be hybrid environments. This chapter explains the relationship between language features and the environment. The environment takes care of properties like scoping, procedure calls and parameter passing mechanisms. The allocation and de-allocation of memory (data) objects is managed by the runtime support package, which consists of routines loaded with the generated target code.

8.2 Memory organization during program execution

Memory for program execution is broadly divided into two areas: one for storing user data, called the data area, and the other for storing the program, called the program area. Normally the contents of the program area do not change during the execution of the program. The data area stores global or static variables, constants and literals.

Example:

Printf (“ The solution is = % d”, 426);

In the above example the value 426 and the string "The solution is =" are constants to be stored in the global area. Besides global variables, there are local variables whose values change during execution; these are stored in the local area of the particular function, for which stacks are used. The runtime memory is divided into the following parts.

1. Code area to store target code

2. Static data area – to store global variables or literals

3. Stack area – to store activation record during procedure calls and return. Stack Operates in LIFO
fashion [Last In First Out]

4. Heap – This is used for dynamic memory allocation.

Stack and heap may have separate memory blocks or they may share the same memory area.

8.2.1 Activation Record

An important unit of memory allocation is the activation record for a function/procedure call. The execution of a function is referred to as an activation of the function. A function has a name, formal and actual parameters, a body and a return value. When the function is called, the activation record for that function is created and stored on the stack. The components of the activation record are

• Return value - Return value is used to store the value that the function returns to
called function after its execution.

• Actual parameters - Actual parameters are those which are used for sending input to
functions from caller function.

• Optional control link - This points to the activation record of the caller. This is very useful
in case of recursion.

• Optional access link - This is used to access non-local data, it can point to caller data area
to access global data.

• Machine status - Machine status consists of values of program counter, machine registers
etc.,

• Local data - stores the local data of the called function

• Temporaries - used to store the intermediate results during execution of large expressions.

The runtime environment determines the sequence of operations that must be performed when a function is called and when it returns; these are called the call sequence and return sequence respectively. The caller is responsible for computing the arguments and placing them in the activation record during the call sequence. The callee takes care of the control and access links, along with temporaries and local data. Any additional bookkeeping may be done either by the callee or by the caller.

8.3 Fully Static Runtime Environment:

In a static runtime environment, data is stored in fixed memory locations during the execution of the program. A language that uses a static environment has no pointer variables, no dynamic memory allocation and no recursive functions. All variables are allocated statically, and the locations of the activation records are fixed before execution. It does not require any runtime support, as names are bound to storage at compile time.

The memory organization for a static environment is shown in the figure above. The code area consists of the code for the main function followed by the code for function 1 to function n. This is followed by the global data area and the activation records for function 1 to function n.

When a function is called, each argument is stored in its activation record and the return address is saved; then a jump to the first instruction of the called function is made. On return, a simple jump is made to the return address.

Example:

int a = 10;

main ( )
{
    int x, y;
    y = a;
    x = fun (y);
    printf ("x = %d", x);
}

int fun (int a)
{
    int i;
    i = a + 10;
    return i;
}

The memory allocation for the above program is as follows

8.4 Stack based runtime environment

A machine that supports stack allocation for runtime management stores the activation records for procedure calls on a stack rather than in static locations. This kind of allocation works well with languages that support recursion. A procedure may have several activation records on the call stack at the same time. A new activation record is pushed onto the top of the stack for every function call and popped when the function returns.

A stack based environment requires three pointers:

1. A pointer to the current activation record, to access local variables; this is called the current activation pointer (CAP).

2. A pointer to the previous (caller's) activation record (the control link).

3. The stack pointer (SP), which points to the top of the stack.

Example:

Consider a simple recursive function to compute the GCD of 2 positive integers:

int x, y;

int gcd (int a, int b)
{
    if (b == 0)
        return a;
    else
        return gcd (b, a % b);
}

main ( )
{
    scanf ("%d %d", &x, &y);
    printf ("GCD of %d and %d = %d", x, y, gcd (x, y));
    return 0;
}

If the inputs are 15 and 10, then x = 15 and y = 10, and main calls gcd(15, 10).

The contents of the stack are as in Fig 8.1. Each time gcd( ) is called, an activation record for gcd is pushed onto the stack; the control link of the ith activation record points to the activation record of the (i – 1)th call. Once a call completes its execution, its activation record is popped from the stack.

x = 15 Global area
y = 10
Activation Record of main

a = 15
b = 10
Activation Record of GCD
control link
return address
a = 10
b=5
Activation Record of GCD
control link
return address
a=5
b=0
Activation Record of GCD
control link
CAP return address

SP

Fig 8.1 stack contents on recursive function call

8.4.1 Access to names

Parameters and local variables can be accessed by the offset from the starting point of activation
record. As the declaration of a function is fixed at compile time and the memory size to be allocated
for each declaration is fixed by its data type, the offset can be statically computed.

Example: Consider the C function

void fun (int a, char b)
{
    double y;
    …..
}

Assume two bytes for an integer, one byte for a character, eight bytes for a double precision floating point number and four bytes for an address. Assume that the stack grows from higher to lower addresses.

The following offsets exist:

Variable   Offset

a          +5    (the control link is an address, hence 4 bytes, plus 1 byte for char b)

b          +4

y          –4

Non-local and static names have fixed locations and can be accessed directly. This mechanism is called static scoping.

8.4.2 Variable length data

Sometimes compilers are required to deal with variable length data, in cases where the number of data objects for a function call may vary and the size of each object may also change.

Example: printf (“ Hello”);

This has only one argument

Printf (“%s%d%f”, a,b,c); has three arguments.

The number of arguments to printf is defined by format string. Number of arguments may vary from
call to call. C Compiler pushes the arguments in the reverse order onto stack. The first parameter is
always located at a fixed offset from CAP.

8.4.3 Local Temporaries

The stack based environment should take care of storing intermediate results during function calls. Consider the following example:

a[i, j] = (x + y) * (p / q) * fun (z)

There will be three partial results:

1) x + y  2) p / q  3) the address of a[i, j]

These results could be stored on the stack before the call to fun, or stored in registers. If stored on the stack, the stack will be as follows.

Compiler can easily locate stack top from CAP.

8.4.4 Nested declaration

Consider the function declaration which has nested blocks.

int fun (int x, int y)
{
    char a;
    double b;
    int z;

B1: {
        int i;
        float j;
        ----
    }

B2: {
        int p;
        double q;
        ----
    }

    return z;
}

Function fun has two blocks, B1 and B2. The variables of each block are local within the block, and their values do not exist outside it.

One way of implementing this is to treat each block as a function, creating an activation record on entering the block and removing it on exit. This is not efficient, as blocks do not have parameters or return values. A simpler way is to allocate stack space on entering a block and de-allocate it on exit.

Example: The stack for the above example is as follows.

As B1 and B2 do not exist at the same time, the area on the stack can be reused as follows.

Such an implementation allows the locations of variables to be computed as offsets from CAP at compile time.

8.5 Dynamic Memory

Though the stack based runtime environment is efficient compared to the static environment, it has the problem of dangling references.

Example: Consider the following C code.

main ( )
{
    int *a;
    a = dangle ( );
}

int *dangle ( )
{
    int i = 20;
    return &i;
}

a is a dangling reference, because i cannot be accessed after the activation record of dangle is freed. The stack cannot be used if the values of locals are to be retained after the activation ends; hence a stack based environment is not sufficient in general. The alternative is a fully dynamic environment, in which the deallocation of activation records happens at arbitrary times during execution. A fully dynamic environment is much more complex than a stack based system, as it has to keep track of references and deallocate unwanted areas of memory at arbitrary times; this process is called garbage collection. The basic structure of the activation record remains the same, i.e., space is allocated during procedure calls for parameters, local variables, control link and access links; the only difference is that the space is deallocated at a later time.

8.5.1 Dynamic Memory in Object Oriented Language

Object oriented languages support objects, methods, inheritance and dynamic binding. An object in memory is a combination of a record and an activation record, with instance variables as the fields of the record. One way of implementing dynamic binding is the virtual function table, which consists of a list of pointers to the methods of each class. The virtual table has an advantage in computing offsets, as each object points to the virtual table rather than to the class structure.

Example:

class x
{ public:
    int a, b;
    void f1 ( );
    virtual void g ( );
};

class y : public x
{ public:
    int i;
    void f1 ( );
    virtual void h ( );
};

An object of class x in memory is represented as follows.

8.5.2 Heap management

The heap is a linear block of memory used to handle pointer allocation and deallocation. The heap supports two operations, allocate and free. The allocate operation takes a size in bytes as input and returns a pointer to a block of memory of the requested size; if no memory is available it returns a null pointer. The free operation frees an allocated block. Pascal uses new and dispose, whereas C++ uses new and delete for the allocate and free operations respectively. The C language uses malloc and free, part of the standard library stdlib.h, for allocating and deallocating memory. The prototypes of these functions are as follows:

void *malloc (unsigned size);

void free (void *ptr);
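For example, a typical allocate/free pair in C:

int *p = (int *) malloc (10 * sizeof(int));   /* request space for 10 ints from the heap */
if (p != NULL) {
    p[0] = 1;                                 /* use the block */
    free (p);                                 /* return the block to the heap */
}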

One way of implementing the heap is to maintain a circular list of free blocks, from which memory is drawn through the malloc function and to which it is returned through the free function. Though this is very simple to implement and maintain, it has a few disadvantages. One disadvantage is that the pointer passed to free may not be one previously returned by malloc; a user passing an invalid pointer corrupts the heap. Secondly, small fragments of free blocks may develop; these have to be compacted so that large blocks of contiguous memory remain available for malloc.

A more efficient way of implementing the heap is a circular linked list which keeps track of both allocated and free blocks. The heap consists of nodes (blocks), each of which holds the size of the used area and the size of the free area, followed by the user space and free space, as shown below.

Each block also has a next pointer which points to the next block in heap memory. The heap also
maintains one more pointer, called memptr, which points to a block that has some free space. This free
space is always initialized to a null value.
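A sketch of such a block header as a C structure (the field names are illustrative, not from any particular
implementation):

typedef struct block
{   unsigned used_size;     /* bytes of this block currently in use */
    unsigned free_size;     /* bytes of this block still free */
    struct block *next;     /* next block in the circular list */
    /* ... user space followed by free space ... */
} Block;

Block *memptr;              /* points to some block with free space */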

8.5.3 Automated management of Heap

malloc and free are explicitly called in the program for dynamic management of memory. In the case of
the runtime stack, memory management is done automatically by the calling sequence. A fully dynamic
runtime environment automatically reclaims previously allocated blocks which are no longer used,
without an explicit free call. This process is called garbage collection. Garbage collection can be
achieved by any of the following methods:

• mark and sweep

• stop and copy

• generational garbage collection

Mark and sweep: In this method no memory is freed until a malloc call fails for lack of memory. At that
point, the mark phase marks the memory blocks that are still reachable and hence still in use. In the
sweep phase the unmarked memory blocks are reclaimed and put into the free list. Sometimes memory
compaction may also be required in order to obtain a large free block.
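A minimal mark-and-sweep sketch in C, assuming every allocated object is chained on a heap list and
holds at most one outgoing reference (both are simplifications for illustration; the names Obj, heap_list,
mark and sweep are hypothetical):

#include <stdlib.h>

typedef struct Obj
{   int marked;
    struct Obj *next;       /* chains all allocated objects */
    struct Obj *ref;        /* single outgoing reference (simplification) */
} Obj;

Obj *heap_list = NULL;      /* head of the chain of all objects */

void mark (Obj *root)       /* mark everything reachable from root */
{   while (root != NULL && !root->marked)
    {   root->marked = 1;
        root = root->ref;
    }
}

void sweep (void)           /* free every object left unmarked */
{   Obj **p = &heap_list;
    while (*p != NULL)
    {   if (!(*p)->marked)
        {   Obj *dead = *p;
            *p = dead->next;    /* unlink the unreachable object */
            free (dead);
        }
        else
        {   (*p)->marked = 0;   /* clear the mark for the next collection */
            p = &(*p)->next;
        }
    }
}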

Stop and copy: In this method the memory is divided into two halves, and storage is allocated from only
one half at a time. During the copying process all the reachable blocks are copied into the second half,
which performs memory compaction automatically. Once all live blocks in the used half have been
copied, the used and unused halves of memory are interchanged and processing continues.

Generational garbage collection: The aim of this method is to reduce the collection delay. To achieve
this, allocated objects that survive for a long time are copied into a permanent space and are not
examined during reclamation. This reduces the search space to the newer storage and hence reduces the
time spent searching.

8.6 Parameter passing mechanism

During a function call the activation record is filled with the parameters sent by the caller to the called
function. These act as input to the called function. The process of mapping actual parameters to formal
parameters is referred to as the binding of parameters to arguments.

Argument values are interpreted by the function based on the parameter passing mechanism. Most
common parameter passing mechanisms are

• Pass by value

• Pass by reference

• Pass by value result

• Pass by name

Pass by value: In this method each actual parameter is evaluated and its r-value is passed to the function.
The formal parameter is treated just like a local variable, so it can be stored in the activation record. The
caller evaluates the actual parameters and places their values in the storage of the formals. Operations on
the formal parameters do not affect the values in the activation record of the caller.

Pass by reference: In this method the caller passes to the called function a pointer to the storage
address of each actual parameter. If an actual parameter is a name or an expression having an l-value,
then that l-value itself is passed. A reference to a formal parameter in the called procedure becomes,
in the target code, an indirect reference through the pointer passed to the called function.
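C passes all parameters by value, but pass by reference can be simulated with explicit pointers; a
minimal sketch:

#include <stdio.h>

void by_value (int x)
{   x = x + 1;              /* changes only the local copy */
}

void by_reference (int *x)
{   *x = *x + 1;            /* indirect reference through the passed pointer */
}

int main (void)
{   int a = 5;
    by_value (a);           /* a is still 5 */
    by_reference (&a);      /* a becomes 6 */
    printf ("%d\n", a);
    return 0;
}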

Pass by value result: This method is a hybrid of pass by value and pass by reference. Before passing
control to the called function, the caller copies the values of the actual parameters into the activation
record of the called function. Control then moves to the function, which is executed. After execution the
values of the formals are copied back into the activation record of the caller.

Pass by name: In this method the function is treated as a macro. Its body is substituted for the call in
the caller, with the actual parameters literally substituted for the formals. The local names of the called
function are kept distinct from the names of the calling function, i.e., each local of the called function is
systematically renamed to a distinct new name before the macro expansion is done. The actual
parameters are surrounded by parentheses if necessary to preserve their integrity.

Assignment:

1. Describe the advantages of using a stack as the runtime environment compared with static
allocation.

2. What are the steps involved in a calling sequence?

3. Give the output of the following program using the four parameter passing methods

P (x, y, z)
{   y = y + 1;
    z = z + x;
}

main ( )
{   int a = 2, b = 3;
    P (a + b, a, a);
    printf ("%d", a);
}

4. Give the output of the following program using the four parameter passing methods

int i = 0;

void p (int x, int y)
{   x += 1;
    i += 1;
    y += 1;
}

main ( )
{   int a[2] = {1, 1};
    p (a[i], a[i]);
    printf ("%d %d\n", a[0], a[1]);
    return 0;
}

5. What are the components of an activation record and what is their significance?

6. Draw the runtime environment for the following C program

a) after entry into block A in function f1

b) after entry into block B in function f2

int a[10];

char *s = "hello";

int f1 (int j, int b[ ])
{   int i = j;
    A: {   int i = j;
           char c = b[i];
           ...
       }
    return 0;
}

void f2 (char *s)
{   char c = s[0];
    B: {   int a[5];
           ...
       }
}

main ( )
{   int x = 1;
    x = f1 (x, a);
    f2 (s);
    return 0;
}

7. Draw the stack of activation records for the following program

main ( )
{   int x = 2;
    P (x);
}

void P (int i)
{   i = i * 10;
    q (i);
}

void q (int j)
{   j = j * 10;
}

8. Write the stack of activation records for the program in Q no 5.

9. Write the stack of activation records for the program in Q no 2.

10. Explain how heap can be used for dynamic memory environment.

Chapter 9

Code Optimization

9.1 Introduction

The code optimization phase is mainly used to optimize the code for better utilization of memory and to
reduce the time taken for execution. Code optimization takes input from the intermediate code generator
and performs machine independent optimization. The code optimizer may also take input from the code
generator and perform machine dependent optimization. Compilers that apply code optimizing
transformations are called optimizing compilers. Code optimization does not consider target machine
properties (like register allocation and memory management) if the input is from the intermediate code
generator.

Code optimization tries to optimize those parts of the code which are executed most often, such as
statements within the body of a for statement or a while statement. This is because most programs spend
the maximum execution time on only a few statements. Code optimization analyses programs at two
levels, namely control flow analysis and data flow analysis. In control flow analysis, code optimization
concentrates more on improving the code of inner loops than outer statements, as inner loops are
executed more often than outer ones. A detailed data flow analysis is required for debugging optimized
code. Data flow analysis collects statistics about which statements are executed more often. This
information is used in the process of optimization. Code optimization should be such that the best results
are achieved with minimum effort.

Code optimization has to achieve two main goals:

1. Preserve the meaning of the code – the output generated without code optimization should be
the same as the output of the optimized code.

2. Reduce the cost of execution considerably – the effort spent on code optimization should be
worth it.

This implies that the amount of time taken for optimization should be small compared to the reduction
in overall execution time. Generally, a fast non-optimizing compiler is preferred for debugging
programs.

Code improvement need not always be done in the code optimization phase. It can be incorporated in the
source program, in the intermediate code or on the target code. In the source program, say for a sorting
program, the user can choose a different algorithm based on a cost function like minimum space or
minimum time. Each algorithm can be efficient in its own way: quick sort is very fast on an
unsorted/random array, whereas a sort like bubble sort is efficient on a partially sorted array.
Intermediate code can be improved by improving loops, and efficient address calculation may give
better results. In the final code generation phase, optimized code can be generated by selecting
appropriate instructions, using registers efficiently and applying some instruction transformations.
Example: keeping the most used variables in registers, which avoids frequent fetching from and storing
to memory locations. This chapter deals with optimization of intermediate code represented as three
address code. Intermediate code is relatively independent of the target machine, so the optimization is
machine independent.

Programs are represented as flow graphs to study control flow, and the temporary variables used to store
intermediate results help in data flow analysis. Since compilation speed is proportional to the size of the
program being compiled, the amount of time taken for code optimization should be relatively small.

9.2 Principles of code optimization

This section deals with identifying the parts of the program where optimization is required. By using the
concepts of proper register allocation, elimination of dead code and the cost of instructions, it is possible
to improve the efficiency of program statements.

9.3 Unnecessary Operations

In a program there may be some parts of the code which never execute. It would be wasteful to generate
code for these statements. It may also happen that some values of temporary variables are never used.
Such statements are called dead code and have to be removed. There can also be a sub-expression whose
value is computed many times. This can be optimized by calculating the value of the sub-expression
only once; other statements can simply use this value.

Example:

x=1

while (x != 1)

{ …}

The statements inside the while loop are never executed, hence no code should be generated for them.

Example:

x=y+z

a = x + 10

p=y+z

b = p + 20

Both x and p compute the same sub-expression; hence code to compute it is generated only once for x,
and p uses the value of x instead of re-computing it from y & z.
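After this common sub-expression elimination the code could look as follows (p simply copies the
value of x):

x = y + z

a = x + 10

p = x

b = p + 20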

After intermediate code generation it may happen that there is a jump statement whose target is the next
statement itself. In this case the jump statement should be eliminated, which reduces the generated code.

9.4 Constant Folding: If the right hand side of an assignment statement consists only of constants,
then the value of the expression can be pre-computed at compile time.

Example: y = 2 * 5 + 6

The value of y can be computed as 16 and stored. The three address code generated would then be
y = 16 instead of

t1 = 2 * 5

t2 = t1 + 6

y = t2

This helps in constant propagation, i.e., from the above example, if y is used in any other expression, it
can be substituted with y = 16 instead of y = 2 * 5 + 6.

Example:

y = 2 * 5 + 6

x=y+z

without optimization

x=2*5+6+z

with optimization

y = 16

x = 16 + z

Some operations, like procedure calls, are very expensive, especially recursive procedure calls. In order
to reduce this cost, recursive procedures may be converted to iterative ones by providing labels. The
issue with a procedure call is that before control is transferred to the procedure, the status of the caller
has to be saved in registers, and it has to be restored after the procedure returns. This increases the
number of load and store instructions.
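As an illustration (a sketch not taken from the text), a recursive summation and an equivalent iterative
version that avoids the per-call save and restore overhead:

int sum (int n)          /* recursive: one activation record per call */
{   if (n == 0)
        return 0;
    return n + sum (n - 1);
}

int sum_iter (int n)     /* iterative: a single activation record */
{   int s = 0;
    while (n > 0)
    {   s = s + n;
        n = n - 1;
    }
    return s;
}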

9.5 Predicting program behavior

In order to generate more optimized code, the code optimization phase has to find out the number of
variables used, their value sets, and the expressions which are used many times. It should also perform
some statistical analysis, such as identifying the parts of the code that are never reached, the parts of
the code that will be executed many times, and the procedures that are likely to be called. This
information helps in adjusting loop structure and procedure code to minimize execution time.

9.6 Other Methods of Optimizations

Some of the optimization techniques are used to improve the loop statements. These are code motion
and reduction in strength of expression.

9.6.1 Code Motion:

Optimization is done for those statements which are executed frequently. Hence computations inside a
loop whose values do not change from one iteration to the next (loop invariant computations) should be
moved out of the loop.

Example:

a = 1;

while (a != 10)

{   b = x + 100;

    a = a + 1;

    printf ("%d", a);
}

In the above example the assignment to b inside the while loop is independent of the loop variable a,
and the value of x does not change inside the loop; hence b = x + 100 can be moved out of the loop and
executed before it.

b = x + 100;

a = 1;

while (a != 10)

{   a = a + 1;

    printf ("%d", a);
}

9.6.2 Reduction in strength of expressions: If the intermediate code contains a multiplication or
division, it can often be replaced by an addition or subtraction; this reduces the strength of the
expression.

Example:

while (i < 10)

{   i = i + 1;

    t1 = 4 * i;
}

The statements within the while loop are executed as long as i is less than 10. Initially, if i = 0, after the
first iteration i = 1 and t1 = 4; after the second iteration i = 2 and t1 = 4 * 2 = 8. In general, the next
value is

t1 = 4 * (i + 1) = 4 * i + 4

and since the previous value of t1 is 4 * i, the statement t1 = 4 * i can be replaced by

t1 = t1 + 4

As the expression for evaluating t1, which required a multiplication, is reduced to an addition, its
execution is faster.
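The transformed loop would then look as follows (a sketch, with t1 initialized once before the loop):

i = 0;

t1 = 4 * i;          /* establish t1 = 4 * i once */

while (i < 10)

{   i = i + 1;

    t1 = t1 + 4;     /* replaces t1 = 4 * i */
}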

9.7 Local, Global & Inter-Procedural Optimization:

In local optimization, straight line code within a basic block is optimized. The basic block consists of
only assignment statements, with no jumps or loops. Some of the optimization techniques that can be
used for local optimization are constant folding, constant propagation and algebraic transformations.

Optimization considering many basic blocks of a single procedure is called global optimization. It uses
optimization techniques like code motion, elimination of induction variables and reduction in strength of
expressions. Global optimization requires data flow analysis to detect jump boundaries before
optimization.

Inter-procedural optimization deals with optimization of the entire program as a whole. This is very
difficult to achieve, as it has to take care of the different parameter passing mechanisms and non-local
variable accesses. The advantage of inter-procedural optimization is that each procedure can be
optimized independently and linked together at the end, with the linker performing further optimization
later on.

9.8 Machine dependent optimization

Some of the optimizations are machine dependent, like register allocation and cost of instruction.

9.8.1 Register Allocation:

The number of variables in each block of a program may vary, but there is a fixed number of registers in
the system. Hence these registers have to be used efficiently. As far as possible, temporary variables and
intermediate values should be kept in registers; this reduces loads from and stores to memory.

Example:

x=y+z

a = x + 10

b = x + 20

Since the value of x is used after it has been assigned, the value of x should be retained in a register to
avoid storing it to memory and reloading it.
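A possible register-level rendering of this code (hypothetical two-address instructions; the register
names are illustrative):

MOV R1, y       ; load y
ADD R1, z       ; R1 = y + z, i.e., x is kept in R1
MOV x, R1       ; store x once
MOV R2, R1      ; a = x + 10 reuses R1, no reload of x
ADD R2, 10
MOV a, R2
MOV R3, R1      ; b = x + 20 also reuses R1
ADD R3, 20
MOV b, R3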

9.8.2 Cost of Instructions:

Each instruction takes a certain number of machine cycles to perform its operation. The optimization
strategy should reduce the number of machine cycles; in other words, the strength of the instructions
should be reduced for better optimization.

Example:

x², for example, can be replaced by the expression x * x, which needs only a single multiplication.

Operations like adding 0 or multiplying by 1 can be removed, as they do not change the value of the
variable.

Example:

1) x = x + 0

2) a = a * 1

These instructions can be eliminated as they do not change the value of x and a.

This is called algebraic transformation.

9.9 Data Structure:

Syntax trees can be used for some optimization techniques like constant folding, constant propagation,
etc., but for optimizations like eliminating loop invariants or dead code elimination they are not very
efficient. Especially for global optimization, the syntax tree is not suitable, as global optimization
requires the study of control flow. Hence flow graphs are used. Flow graphs consist of basic blocks as
nodes, with edges connecting basic blocks to indicate the control flow. A sequence of three address
statements is converted to a flow graph using the following steps.

1. Construct basic block

2. Generate flow graph

1. Construction of Basic Blocks

a) Determine the set of header statements. A header statement is the first statement of a basic
block.

b) The first statement of the program is a header statement.

c) Any statement which is the target of a conditional or unconditional jump is a header statement.

d) Any statement following a conditional or unconditional jump is a header statement.

2. Construct flow graph

Construct the graph with B1 as the starting node, where B1 is the basic block containing the first
statement of the program. Generate an edge from Bi to Bj if control can flow from block Bi to block Bj.
Entry to any block Bk is only through its first statement, and exit from Bk is only from its last statement.
No intermediate jump or return can occur within a basic block.

Example: Consider the following statements

for i = 1 to n do

for j = 1 to n do

C[i, j] = 0;

Three address code generated will be as follows

1) i=1

2) if i < n go to 4

3) go to 15

4) j=1

5) if j < n go to 7

6) go to 13

7) t1 = i * 10

8) t2 = t1 + j

9) t3 = 4 * t2

10) C[t3] = 0

11) j=j+1

12) go to 5

13) i=i+1

14) go to 2

15)

Basic blocks will be as follows
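Applying the header rules, the header statements are 1, 2, 3, 4, 5, 6, 7, 13 and 15 (statement 1 is the first
statement; 2, 4, 5, 7, 13 and 15 are jump targets; 3, 4, 6, 7, 13 and 15 follow jumps), giving the blocks
below (reconstructed here, as the original figure is not reproduced):

B1: statement 1
B2: statement 2
B3: statement 3
B4: statement 4
B5: statement 5
B6: statement 6
B7: statements 7 – 12
B8: statements 13 – 14
B9: statement 15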

Flow graph for the basic blocks is as follows in Fig 9.1

9.10 Directed Acyclic Graph

Flow graphs are mainly used for global optimization. They are not very efficient for local optimizations
on basic blocks. Hence the Directed Acyclic Graph (DAG) is used. The leaves of a DAG represent
variable names or constants. The interior nodes and the root of the DAG represent operator symbols.
Nodes carry labels which denote the variables holding the most recent value of the node.

For any statement a = b op c the DAG is in Fig 9.2

b and c, the leaves, represent variables. The interior node op represents the operator op, and a is the label
of the op node, which gives the value of b op c.

For an expression like x = y, no node is created for x. The label x is simply added to the node which has
the label y.

Example: Consider the following code

t1 = a + b

t2 = t1

The DAG for the three address code is shown in Fig 9.3. For the second statement no new node is
created; it uses the same + node. Initially t1 is the label of the + node; after the second statement t2 is
also added as a label of the + node.

Example: Consider the following statements

a=b+c

b=a–d

Fig 9.4 shows the DAG for the above three address code

Example: Consider the following statements

c=c+d

e=b+c

Fig 9.5 shows the DAG for the above three address code

Example: Consider the following expression

a = -b * c + d

The three address code will be

t1 = –b

t2 = t1 * c

t3 = t2 + d

a = t3

Fig 9.6 shows the DAG for the expression a = -b * c + d represented as three address code

Example: Consider the following expression

a=b*d+b*d+c

The three address code will be

t1 = b * d

t2 = b * d

t3 = t1 + t2

t4 = t3 + c

a = t4

Fig 9.7 shows the DAG for the expression a = b * d + b * d + c represented as three address code.
From the DAG it can be seen that the * node has two labels, t1 & t2. Hence there is no need to generate
code twice for the same expression. Final code can be generated from the DAG by topological sorting, a
traversal from the leaves to the root in which children are visited before their parents. As there can be
multiple topological orders, there can be many code sequences for a single DAG.

Example:

Consider intermediate code

t1 = a + b

a = t1

t2 = b – 1

b = t2

t3 = b + 5

After topological sorting

t2 = b – 1

t1 = a + b

a = t1

b = t2

t3 = b + 5

Reordering of code helps in eliminating unnecessary use of temporaries. Hence the code would be as
follows.

a=a+b

b=b–1

t3 = b + 5

The DAG gives information about how many references exist for each node. This helps in good register
allocation. If a value has many references it can be retained in a register for a long time; if a value has no
remaining references it can be removed from the register.

Assignment:

1. Consider the following C statements

for (i = 0; i < 10; i++)

for (j = 0; j < 10; j++)

{   C[i, j] = 0;

    for (k = 1; k <= 10; k++)

        C[i, j] = C[i, j] + A[i, k] * B[k, j];
}

a) Produce three address code

b) Generate target code from three address code

2. For the three address code generated for Q no 1

a. Construct flow graph

b. Perform optimization of flow graph

3. Consider the following code

i = 2;

for j = 2 * j to n by i do

a[j] = 1;

a) Generate three address code

b) Construct basic blocks and flow graph

4. For the three address code generated in Q no 3

a) Optimize code &

b) Generate target code

5. Construct DAG for

e=a+b

f=e–c

g=f*d

h=a+b

i=i–c

j=i+g

6. Construct DAG for the following set of statements

a=b+c

b=a–d

c=b+c

d=a–d

7. Generate optimal code for x = a + (b + c / d * e) / ( f * g – h * i )

8. Optimize the three address code for the expression a [ i, j] + b [i, j] – c [ a [k, l]]

9. Construct DAG for the following set of statements

d=b*c

e=a+b

b=b*c

a=e–d

10. Optimize the following flow graph
