
PRINCIPLES OF COMPILER DESIGN

UNIT – I : Lexical Analysis

Part-A
1. Define Compiler.
A compiler is a program that reads a program written in one language (the source language) and
translates it into an equivalent program in another language (the target language). During this
translation, the compiler reports to its user the presence of errors in the source program.
2. What are the classifications of a compiler?
The classifications of compiler are: Single-pass compiler, Multi-pass compiler, Load and go
compiler, Debugging compiler, Optimizing compiler.
3. Define linker.
A linker is a program that combines multiple object files to make an executable. It converts
names of variables and functions into numbers (machine addresses).
4. Define Loader.
A loader is a program that performs the functions of loading and link editing. The process of
loading consists of taking relocatable machine code, altering the relocatable addresses, and
placing the altered instructions and data in memory at the proper locations.
5. Define interpreter.
An interpreter is a language processor that translates and executes source code directly,
without compiling it to machine code.
6. Define editor.
Editors may operate on plain text, or they may be wired into the rest of the compiler, highlighting
syntax errors as you go or allowing you to insert or delete entire syntactic constructs at a time.
7. What is meant by semantic analysis?
The semantic analysis phase checks the source program for semantic errors and gathers type
information for the subsequent code-generation phase. It uses the hierarchical structure determined
by the syntax-analysis phase to identify the operators and operands of expressions and statements.
8. What are the phases of a compiler? or How will you group the phases of a compiler? (Nov/Dec 2013)
 Lexical analysis
 Syntax analysis
 Semantic analysis
 Intermediate code generation
 Code optimization
 Code generation
9. Mention a few cousins of the compiler. (May/June-2012)
 Preprocessors.
 Assemblers.
 Two pass assembly.
 Loader and link editors.
10. What is front-end and back-end of the compiler?
Often the phases of a compiler are collected into a front end and a back end.
The front end consists of those phases that depend primarily on the source language and are largely
independent of the target machine. The back end consists of those phases that depend on the
target-machine language and generally do not depend on the source language, only on the
intermediate language. The back end includes code optimization and code generation, along with
error handling and symbol-table operations.

11. List of some compiler construction tools.
 Parser generators
 Scanner generators
 Syntax-directed translation engines
 Automatic code generators
 Data flow engines.
12. Define Preprocessor. What are its functions?
Preprocessors are programs that allow the user to use macros in the source program. A macro is a
set of instructions that can be used repeatedly in the source program. The output of a
preprocessor may be given as input to a compiler.
The functions of a preprocessor are:
 Macro processing.
 File inclusion.
 Rational preprocessors
 Language extensions
13. What is linear analysis?
The stream of characters making up the source program is read from left-to-right and grouped into
tokens that are sequence of characters having a collective meaning.
14. What is cross complier? (May/June 2014)
There may be a complier which run on one machine and produces the target code for another
machine. Such a complier is called cross complier. Thus by using cross complier technique
platform independency can be achieved
15. What is Symbol Table or Write the purpose of symbol table.(May/June 2014)
A symbol table is a data structure containing a record for each identifier, with fields for the
attributes of the identifier. The data structure allows us to find the record for each identifier
quickly and to store or retrieve data from that record quickly.
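As an illustrative sketch (not from the source), a symbol table can be modeled as a hash table keyed by identifier name; the names SymbolTable, insert, and lookup below are hypothetical:

```python
# A minimal symbol-table sketch: one record per identifier,
# with fields for its attributes (type, line number, ...).
# All names here (SymbolTable, insert, lookup) are illustrative.

class SymbolTable:
    def __init__(self):
        self._records = {}          # identifier -> attribute record

    def insert(self, name, **attrs):
        # Store or update the record for this identifier.
        self._records.setdefault(name, {}).update(attrs)

    def lookup(self, name):
        # Fast retrieval of the record (None if absent).
        return self._records.get(name)

table = SymbolTable()
table.insert("counter", type="int", line=3)
print(table.lookup("counter"))    # {'type': 'int', 'line': 3}
print(table.lookup("missing"))    # None
```

A hash table gives the quick store/retrieve behaviour the definition asks for; real compilers add scope handling on top of this.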
16. Define Passes.
In an implementation of a compiler, portions of one or more phases are combined into a module
called a pass. A pass reads the source program or the output of the previous pass, makes the
transformations specified by its phases, and writes output into an intermediate file, which is read
by a subsequent pass.
17. Difference Between Assembler, Compiler and Interpreter.
Assembler: It is a computer program which takes assembly-language instructions and converts them
into the bits that the computer can understand and act on.
Compiler:
 It converts a high-level language to a low-level language.
 It considers the entire code, converts it into executable code, and runs the code.
 C and C++ are compiler-based languages.
Interpreter:
 It converts a high-level language to a low-level or assembly-level language, i.e., to the
language of 0's and 1's.
 It considers a single line of code at a time, converts it to binary, and runs it on the machine.
 If it finds an error, the program needs to be run again from the beginning.
 BASIC is an interpreter-based language.
18. Mention the issues in a lexical analyzer.(May/June-2013)
There are several reasons for separating the analysis phase of compiling into lexical analysis and
parsing: 1. Simpler design is perhaps the most important consideration; the separation of lexical
analysis from syntax analysis often allows us to simplify one or the other of these phases.
2. Compiler efficiency is improved. 3. Compiler portability is enhanced.

19. What is the role of lexical analysis phase?
The lexical analyzer breaks a sentence into a sequence of words or tokens and ignores white
spaces and comments. It generates a stream of tokens from the input. This is modeled through
regular expressions and the structure is recognized through finite state automata.
20. What are the possible error recovery actions in a lexical analyzer?(May/june-2012)
i. Panic mode recovery:
Delete successive characters from the remaining input until the lexical analyzer can find a
well-formed token. This technique may confuse the parser.
ii. Other possible error-recovery actions:
 Deleting an extraneous character.
 Inserting a missing character.
 Replacing an incorrect character by a correct character.
 Transposing two adjacent characters.
21. Define Tokens, Patterns and Lexemes (or) Define Lexeme.(May/June 2013&2014)
Tokens: A token is a syntactic category. Sentences consist of a string of tokens. For example,
number, identifier, keyword, and string are tokens.
Lexemes: A lexeme is the sequence of characters forming a token. For example, 100.01, counter,
const, "How are you?" are lexemes.
Patterns: A rule of description is a pattern. For example, letter (letter | digit)* is a pattern
symbolizing the set of strings that consist of a letter followed by letters or digits.
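As a hedged sketch (not from the source), the pattern letter (letter | digit)* can be written as a Python regular expression; each matched character sequence is then a lexeme of the identifier token:

```python
# The pattern letter(letter|digit)* as a Python regular expression.
# Every string it matches is a lexeme of the token "id".
import re

ID_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

source = "counter1 = counter1 + rate2"
lexemes = ID_PATTERN.findall(source)
print(lexemes)   # ['counter1', 'counter1', 'rate2']
```

Note how one pattern describes one token class but matches many different lexemes.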
22. Write short note on input buffering. (or).Why is buffering used in lexical analysis ?
What are the commonly used buffering methods?(May/June 2014)
The lexical analyzer scans the source program line by line. For storing the input string to be
read, the lexical analyzer makes use of an input buffer. It maintains two pointers,
lexeme_beginning and forward, for scanning the input string. There are two input-buffering
schemes: the one-buffer scheme and the two-buffer scheme.
23. What are the drawbacks of using buffer pairs?
i. This buffering scheme works quite well most of the time, but the amount of lookahead is limited.
ii. Limited lookahead makes it impossible to recognize tokens in situations where the distance the
forward pointer must travel is more than the length of the buffer.
24. Define regular expressions.
A regular expression is used to define precisely the statements and expressions in the source
language. For example, in Pascal identifiers are denoted by the regular expression
letter (letter | digit)*.
25. What are algebraic properties of regular expressions?
The algebraic law obeyed by regular expressions are called algebraic properties of regular
expression. The algebraic properties are used to check equivalence of two regular expressions.

S.No  Property                         Meaning

1     r1|r2 = r2|r1                    | is commutative
2     r1|(r2|r3) = (r1|r2)|r3          | is associative
3     (r1 r2)r3 = r1(r2 r3)            concatenation is associative
4     r1(r2|r3) = r1 r2 | r1 r3        concatenation distributes over |
      (r2|r3)r1 = r2 r1 | r3 r1
5     εr = rε = r                      ε is the identity for concatenation
6     r* = (r|ε)*                      relation between ε and *
7     r** = r*                         * is idempotent
26. What is Regular Expressions?

A regular expression is an important notation for specifying patterns. A regular expression is
built up out of simpler regular expressions using a set of defining rules. Let Σ be a set of
characters. A language over Σ is a set of strings of characters belonging to Σ.
 A regular expression r denotes a language L(r)
 Rules that define the regular expressions over Σ:
- ε is a regular expression that denotes {ε}, the set containing the empty string
- If a is a symbol in Σ, then a is a regular expression that denotes {a}
27. List the operations on Language.
 The various operations on languages are:
 Union of two languages L and M, written as L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M, written as LM = {st | s is in L and t is in M}
 The Kleene closure of a language L, written as L* = L0 U L1 U L2 U …, the set of all strings
formed by concatenating zero or more strings of L

28. List the rules for constructing regular expressions?


The rules are divided into two major classifications(i) Basic rules (ii) Induction rules
Basic rules:
i) ε is a regular expression that denotes {ε} (i.e.) the language contains only an empty string.
ii) For each a in ∑, then a is a regular expression denoting {a} , the language with only one
string , which consists of a single symbol a.
Induction rules:
If R and S are regular expressions denoting the languages LR and LS respectively, then:
i) (R)|(S) is a regular expression denoting LR U LS
ii) (R).(S) is a regular expression denoting LR . LS
iii) (R)* is a regular expression denoting LR*.
29. How are tokens recognized?
Consider the following grammar and try to construct an analyzer that will return <token, attribute>
pairs.
relop → < | <= | = | <> | > | >=
id → letter (letter | digit)*
num → digit+ ('.' digit+)? (E ('+' | '-')? digit+)?
delim → blank | tab | newline
ws → delim+
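The grammar above can be turned into a small scanner. The sketch below is illustrative (it uses Python's re module, and the name tokenize is ours): it builds one master regular expression from the same patterns and returns <token, attribute> pairs, discarding ws:

```python
import re

# Token patterns mirroring the grammar above; inner groups are
# non-capturing so that m.lastgroup names the token class.
TOKEN_SPEC = [
    ("num",   r"\d+(?:\.\d+)?(?:E[+-]?\d+)?"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("relop", r"<=|<>|>=|<|=|>"),
    ("ws",    r"[ \t\n]+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    # Return <token, attribute> pairs; whitespace (ws) is discarded.
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(text)
            if m.lastgroup != "ws"]

print(tokenize("count <= 10.5E+2"))
# [('id', 'count'), ('relop', '<='), ('num', '10.5E+2')]
```

Listing longer alternatives such as <= before < inside the relop pattern ensures the longest relational operator is matched.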
30. What is a recognizer?
Recognizers are machines that accept the strings belonging to a certain language. If the valid
strings of the language are accepted by the machine, then the language is said to be accepted by
that machine; otherwise it is rejected.
For example:
1. Finite state machines are recognizers for regular expressions (regular languages).
2. Push-down automata are recognizers for context-free languages.
31. What is a finite automaton?
A finite automaton is a mathematical model that consists of:
i) an input alphabet Σ
ii) a finite set of states S
iii) a transition function to move from one state to another
iv) a special state called the start state
v) a special set of states called accepting (final) states.
A finite automaton can be represented diagrammatically by a directed graph called a transition
graph.
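As an illustrative sketch (the example language is ours, not from the source), a finite automaton can be simulated directly from its transition function stored as a table; this DFA accepts strings over {a, b} that end in "ab":

```python
# Simulate a DFA given its transition table, start state, and
# accepting states. A missing entry means "no move": reject.
def run_dfa(transitions, start, accepting, text):
    state = start
    for ch in text:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in accepting

# states: 0 = start, 1 = just saw 'a', 2 = just saw 'ab' (accepting)
T = {(0, "a"): 1, (0, "b"): 0,
     (1, "a"): 1, (1, "b"): 2,
     (2, "a"): 1, (2, "b"): 0}

print(run_dfa(T, 0, {2}, "aab"))   # True
print(run_dfa(T, 0, {2}, "abb"))   # False
```

The dictionary keyed by (state, symbol) plays the role of the transition graph's labeled edges.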

32. What is the need for separating the analysis phase into lexical analysis and parsing?
i) Separation of lexical analysis from syntax analysis allows a simpler design.
ii) Compiler efficiency is increased. Reading the input source file is a time-consuming process,
and if it is done once in the lexical analyzer, efficiency is increased. The lexical analyzer uses
buffering techniques to improve performance.
iii) Input-alphabet peculiarities and other device-specific anomalies can be restricted to the
lexical analyzer.
33. What are sentinels ? What is its usage ?
Sentinels are special characters used to indicate the end of a buffer. The sentinel is not part of
the source language. Using sentinels at the end of each buffer, one can avoid a separate test for
the end of the buffer. For example: eof.
34. Draw a transition diagram to represent relational operators.
The relational operators are: <, <=, >, >=, =, !=

35. Write one example in operations on Language.


Let L be the set of letters, L = {a, b, .., z}, and D be the set of all digits,
D = {0, 1, 2, .., 9}.
 L U D is the set of letters and digits
 LD is the set of strings consisting of a letter followed by a digit
 L* is the set of all strings of letters, including ε
 L(L U D)* is the set of all strings of letters and digits beginning with a letter
 D+ is the set of strings of one or more digits
36. Define LEX.

LEX is a tool for automatically generating lexical analyzers. A LEX source program is a
specification of a lexical analyzer, consisting of a set of regular expressions together with an
action for each regular expression. The output of LEX is a lexical analyzer program.
37. State any reasons as to why phases of compiler should be grouped.(May/June 2014)
1. By keeping the same front end and attaching different back ends, one can produce a compiler
for the same source language on different machines.
2. By keeping different front ends and the same back end, one can compile several different
languages on the same machine.

PART-B
1. (a)Describe the following software tools i. Structure Editors ii. Pretty printers
iii. Interpreters (8)

Software tools used in Analysis part:


1) Structure editor:
 Takes as input a sequence of commands to build a source program.
 The structure editor not only performs the text-creation and modification functions of
an ordinary text editor, but it also analyzes the program text, putting an appropriate
hierarchical structure on the source program.
 For example, it can supply keywords automatically: while …. do and begin ….. end.
2) Pretty printers :
 A pretty printer analyzes a program and prints it in such a way that the structure of the
program becomes clearly visible.
 For example, comments may appear in a special font.
3) Static checkers :
 A static checker reads a program, analyzes it, and attempts to discover potential
bugs without running the program.
 For example, a static checker may detect that parts of the source program can never be
executed.
4) Interpreters :
 Executes programs written in a high-level language (BASIC, FORTRAN, etc.) directly, translating as it goes.
 An interpreter might build a syntax tree and then carry out the operations at the nodes as it
walks the tree.
 Interpreters are frequently used to execute command languages, since each operator executed in
a command language is usually an invocation of a complex routine such as an editor or compiler.

(b) Write in detail about the analysis of the source program. (8)

Analysis consists of 3 phases:


Linear/Lexical Analysis :
 It is also called scanning. It is the process of reading the characters from left to right and
grouping into tokens having a collective meaning.
 For example, in the assignment statement a=b+c*2, the characters would be grouped into
the following tokens:
i) The identifier1 'a'
ii) The assignment symbol (=)
iii) The identifier2 'b'
iv) The plus sign (+)
v) The identifier3 'c'
vi) The multiplication sign (*)
vii) The constant '2'
Syntax Analysis :
 It is called parsing or hierarchical analysis. It involves grouping the tokens of the source
program into grammatical phrases that are used by the compiler to synthesize output.
 They are represented using a syntax tree.
 A syntax tree is the tree generated as a result of syntax analysis in which the interior nodes
are the operators and the exterior nodes are the operands.
 This analysis shows an error when the syntax is incorrect.
Semantic Analysis :
 It checks the source programs for semantic errors and gathers type information for
the subsequent code generation phase. It uses the syntax tree to identify the operators
and operands of statements.
 An important component of semantic analysis is type checking. Here the compiler checks
that each operator has operands that are permitted by the source language
specification.

2. Write in detail about the cousins of the compiler. (16) (May/June-2013)

Cousins Of Compiler
1. Preprocessor
2. Assembler
3. Loader and Link-editor

Preprocessor:
A pre-processor is a program that processes its input data to produce output that is used as input to
another program. The output is said to be a pre-processed form of the input data, which is often
used by some subsequent programs like compilers.

They may perform the following functions :


1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension

Assembler:
Assembler creates object code by translating assembly instruction mnemonics into machine code.

There are two types of assemblers:

One-pass assemblers.
Two-pass assemblers.

Linker and Loader:


A linker or link editor is a program that takes one or more objects generated by a
compiler and combines them into a single executable program.

3. What are the phases of the compiler? Explain the phases in detail.(May/June-2012)(16)

PHASES OF COMPILER
A Compiler operates in phases, each of which transforms the source program from one
representation into another. The following are the phases of the compiler:

Main phases:
1) Lexical analysis
2) Syntax analysis
3) Semantic analysis
4) Intermediate code generation
5) Code optimization
6) Code generation

Sub-Phases:
1) Symbol table management
2) Error handling

LEXICAL ANALYSIS:
 It is the first phase of the compiler. It gets input from the source program and produces
tokens as output.
 It reads the characters one by one, starting from left to right and forms the tokens.
 Token: It represents a logically cohesive sequence of characters such as keywords,
operators, identifiers, special symbols, etc.
Example: a + b = 20
Here, a,b,+,=,20 are all separate tokens.
Group of characters forming a token is called the Lexeme.
 The lexical analyser not only generates a token but also enters the lexeme into the symbol
table if it is not already there.

SYNTAX ANALYSIS:
 It is the second phase of the compiler. It is also known as parser.
 It gets the token stream as input from the lexical analyser of the compiler and
generates syntax tree as the output.
Syntax tree:
It is a tree in which interior nodes are operators and exterior nodes are operands.
Example: For a=b+c*2, syntax tree is

         =
        / \
       a   +
          / \
         b   *
            / \
           c   2
SEMANTIC ANALYSIS:
 It is the third phase of the compiler.
 It gets the parse tree from the syntax-analysis phase as input and checks whether the program
is semantically consistent with the language definition.
 It performs type checking and inserts the necessary type conversions (for example, converting
an integer operand to real where required).

INTERMEDIATE CODE GENERATION:


 It is the fourth phase of the compiler.
 It gets input from the semantic analysis and converts the input into output as intermediate
code such as three-address code.
 The three-address code consists of a sequence of instructions, each of which has
at most three operands.
Example: t1=t2+t3

CODE OPTIMIZATION:
 It is the fifth phase of the compiler.
 It gets the intermediate code as input and produces optimized intermediate code as output.
 This phase reduces the redundant code and attempts to improve the intermediate code
so that faster-running machine code will result.
 During the code optimization, the result of the program is not affected.
 To improve the code generation, the optimization involves
- detection and removal of dead code (unreachable code).
- calculation of constants in expressions and terms (constant folding).
- collapsing of repeated (common) subexpressions into a temporary variable.
- loop unrolling.
- moving loop-invariant code outside the loop.
- removal of unwanted temporary variables.

CODE GENERATION:
 It is the final phase of the compiler.
 It gets input from code optimization phase and produces the target code or object code as
result.
 Intermediate instructions are translated into a sequence of machine instructions
that perform the same task.
The code generation involves
- allocation of register and memory
- generation of correct references
- generation of correct data types
- generation of missing code

SYMBOL TABLE MANAGEMENT:


 Symbol table is used to store all the information about identifiers used in the program.
 It is a data structure containing a record for each identifier, with fields for the attributes of
the identifier.
 It allows to find the record for each identifier quickly and to store or retrieve data from that
record.
 Whenever an identifier is detected in any of the phases, it is stored in the symbol table.

ERROR HANDLING:
 Each phase can encounter errors. After detecting an error, a phase must handle the error so
that compilation can proceed.
 In lexical analysis, errors occur in separation of tokens.
 In syntax analysis, errors occur during construction of syntax tree.
 In semantic analysis, errors occur when the compiler detects constructs with right syntactic
structure but no meaning and during type conversion.
 In code optimization, errors occur when the result is affected by the optimization.
 In code generation, it shows error when code is missing etc.
To illustrate the translation of source code through each phase, consider the
statement a=b+c*2.
The figure shows the representation of this statement after each phase:

4.Elaborate on grouping of phases in a compiler.(6)

Compiler can be grouped into front and back ends:


- Front end: analysis (machine independent)
These normally include lexical and syntactic analysis, the creation of the symbol table, semantic
analysis and the generation of intermediate code. It also includes error handling that goes along
with each of these phases.
- Back end: synthesis (machine dependent)
It includes code optimization phase and code generation along with the necessary error handling
and symbol table operations.

Compiler passes
A collection of phases is done only once (single pass) or multiple times (multi pass)
 Single pass: usually requires everything to be defined before being used in source
program.
 Multi pass: compiler may have to keep entire program representation in memory.

Several phases can be grouped into one single pass and the activities of these phases are
interleaved during the pass. For example, lexical analysis, syntax analysis, semantic analysis and
intermediate code generation might be grouped into one pass.

5.Explain in detail about the compiler construction tools.(May/June-2012)

The following are the compiler construction tools:


1) Parser Generators:
-These produce syntax analyzers, normally from input that is based on a context-free grammar.
-It consumes a large fraction of the running time of a compiler.
-Example-YACC (Yet Another Compiler-Compiler).
2) Scanner Generator:
-These generate lexical analyzers, normally from a specification based on regular expressions.
-The basic organization of the resulting lexical analyzer is based on finite automata.
3) Syntax-Directed Translation Engines:
-These produce routines that walk the parse tree and as a result generate intermediate code.
-Each translation is defined in terms of translations at its neighbor nodes in the tree.
4) Automatic Code Generators:
-It takes a collection of rules to translate intermediate language into machine language. The
rules must include sufficient details to handle different possible access methods for data.
5) Data-Flow Engines:
-It does code optimization using data-flow analysis, that is, the gathering of information about
how values are transmitted from one part of a program to each other part.

6.Explain about input buffering.

We often have to look one or more characters beyond the next lexeme before we can be
sure we have the right lexeme. As characters are read from left to right, each character is stored in
the buffer to form a meaningful token as shown below:

We introduce a two-buffer scheme that handles large look aheads safely. We then
consider an improvement involving "sentinels" that saves time checking for the ends of buffers.

BUFFER PAIRS
 A buffer is divided into two N-character halves, as shown below

 Each buffer is of the same size N, and N is usually the number of characters on one disk
block. E.g., 1024 or 4096 bytes.
 Using one system read command we can read N characters into a buffer.
 If fewer than N characters remain in the input file, then a special character, represented by
eof, marks the end of the source file.
Two pointers to the input are maintained:

1. Pointer lexeme_beginning, marks the beginning of the current lexeme, whose extent we
are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found. Once the next lexeme is
determined, forward is set to the character at its right end.
 The string of characters between the two pointers is the current lexeme. After the lexeme
is recorded as an attribute value of a token returned to the parser, lexeme_beginning is set
to the character immediately after the lexeme just found.
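The two-pointer idea can be sketched in a few lines. The sketch below is illustrative (buffer reloading is omitted, and the name next_identifier is ours): lexeme_beginning stays fixed while forward scans ahead to the end of the identifier, and the text between the two pointers is the lexeme:

```python
# Two-pointer lexeme scanning: lexeme_beginning marks the start of
# the current lexeme; forward advances until the identifier ends.
def next_identifier(buf, lexeme_beginning):
    forward = lexeme_beginning
    while forward < len(buf) and buf[forward].isalnum():
        forward += 1                 # advance the forward pointer
    lexeme = buf[lexeme_beginning:forward]
    # The caller sets lexeme_beginning for the next token to 'forward'.
    return lexeme, forward

lexeme, nxt = next_identifier("count1+rate", 0)
print(lexeme, nxt)   # count1 6
```

In a real scanner the buffer is the paired N-character halves described above, and advancing forward may trigger a reload of the other half.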

Advancing forward pointer:


Advancing forward pointer requires that we first test whether we have reached the end of
one of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. If the end of second buffer is reached, we must again
reload the first buffer with input and the pointer wraps to the beginning of the buffer.

Code to advance forward pointer:


if forward at end of first half then begin
reload second half;
forward := forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half
end
else forward := forward + 1;

SENTINELS
 For each character read, we make two tests: one for the end of the buffer, and one to
determine what character is read. We can combine the buffer-end test with the test for the
current character if we extend each buffer to hold a sentinel character at the end.
 The sentinel is a special character that cannot be part of the source program, and a natural
choice is the character eof .
 The sentinel arrangement is as shown below:

Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other
than at the end of a buffer means that the input is at an end.

Code to advance forward pointer:


forward := forward + 1;
if forward ↑ = eof then begin
if forward at end of first half then begin
reload second half;
forward := forward + 1
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half

end
else /* eof within a buffer signifying end of input */
terminate lexical analysis
end

7. Explain in detail about the role of Lexical analyzer with the possible error recovery
actions. (May/June-2013) (16)

Lexical analysis is the process of converting a sequence of characters into a sequence of


tokens. A program or function which performs lexical analysis is called a lexical analyzer or
scanner. A lexer often exists as a single function which is called by a parser or another function.

THE ROLE OF THE LEXICAL ANALYZER


 The lexical analyzer is the first phase of a compiler.
 Its main task is to read the input characters and produce as output a sequence of tokens that
the parser uses for syntax analysis.

 Upon receiving a "get next token" command from the parser, the lexical analyzer reads
input characters until it can identify the next token.

ISSUES OF LEXICAL ANALYZER


There are three issues in lexical analysis:
 To make the design simpler.
 To improve the efficiency of the compiler.
 To enhance compiler portability.

TOKENS
A token is a string of characters, categorized according to the rules as a symbol (e.g.,
IDENTIFIER, NUMBER, COMMA). The process of forming tokens from an input stream of
characters is called tokenization.
A token can look like anything that is useful for processing an input text stream or text file.
Consider this expression in the C programming language: sum=3+2;

LEXEME:
Collection or group of characters forming tokens is called Lexeme.

PATTERN:

A pattern is a description of the form that the lexemes of a token may take. In the case of a
keyword as a token, the pattern is just the sequence of characters that form the keyword. For
identifiers and some other tokens, the pattern is a more complex structure that is matched by many
strings.

Attributes for Tokens


Some tokens have attributes that can be passed back to the parser. The lexical analyzer
collects information about tokens into their associated attributes. The attributes influence the
translation of tokens.
i) Constant : value of the constant
ii) Identifiers: pointer to the corresponding symbol table entry.

ERROR RECOVERY STRATEGIES IN LEXICAL ANALYSIS:


The following are the error-recovery actions in lexical analysis:
1) Deleting an extraneous character.
2) Inserting a missing character.
3) Replacing an incorrect character by a correct character.
4) Transposing two adjacent characters.
5) Panic mode recovery: Deletion of successive characters from the token until error
is resolved.

8.Describe the specification of tokens and how to recognize the tokens (16) (May/June-2013)

SPECIFICATION OF TOKENS
There are 3 specifications of tokens:
1) Strings
2) Language
3) Regular expression

Strings and Languages


An alphabet or character class is a finite set of symbols.
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
A language is any countable set of strings over some fixed alphabet.
In language theory, the terms "sentence" and "word" are often used as synonyms for "string." The
length of a string s, usually written |s|, is the number of occurrences of symbols in s.
For example, banana is a string of length six. The empty string, denoted ε, is the string of length
zero.

Operations on strings
The following string-related terms are commonly used:
1. A prefix of string s is any string obtained by removing zero or more symbols from the end
of string s.
For example, ban is a prefix of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols from the
beginning of s.
For example, nana is a suffix of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from
s. For example, nan is a substring of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes,
and substrings, respectively, of s that are not ε and not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily
consecutive positions of s.
For example, baan is a subsequence of banana.
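The banana examples above can be checked directly; this sketch (not from the source) uses Python's built-in string operations, plus a small helper we call is_subsequence:

```python
# Checking the string-operation examples on "banana".
s = "banana"
print(s.startswith("ban"))    # prefix:    True
print(s.endswith("nana"))     # suffix:    True
print("nan" in s)             # substring: True

# 'baan' is a subsequence of 'banana': its characters appear in
# order, but not necessarily consecutively.
def is_subsequence(t, s):
    it = iter(s)
    return all(ch in it for ch in t)

print(is_subsequence("baan", "banana"))   # True
```

The iterator trick in is_subsequence consumes characters of s left to right, so each character of t must be found after the previous one.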

Operations on languages:
The following are the operations that can be applied to languages:
1.Union
2.Concatenation
3.Kleene closure
4.Positive closure
The following example shows the operations on languages:
Let L={0,1} and S={a,b,c}
1. Union : L U S={0,1,a,b,c}
2. Concatenation : L.S={0a,1a,0b,1b,0c,1c}
3. Kleene closure : L*={ ε,0,1,00….}
4. Positive closure : L+={0,1,00….}
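These operations can be computed on the example sets with Python set comprehensions (a sketch, not from the source; since Kleene closure is infinite, we enumerate strings only up to length 2):

```python
# Language operations on L = {0, 1} and S = {a, b, c}.
L = {"0", "1"}
S = {"a", "b", "c"}

union = L | S                                 # L U S
concat = {s + t for s in L for t in S}        # L.S
# Kleene closure is infinite; enumerate members up to length 2.
l_star_upto2 = {""} | L | {s + t for s in L for t in L}

print(sorted(union))
print(sorted(concat))
print(sorted(l_star_upto2))
```

The empty string "" stands for ε, which belongs to L* but not to L+.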

Regular Expressions
Each regular expression r denotes a language L(r).

Here are the rules that define the regular expressions over some alphabet Σ and the languages
that those expressions denote:

1. ε is a regular expression, and L(ε) is { ε }, that is, the language whose sole member is the
empty string.

2. If 'a' is a symbol in Σ, then 'a' is a regular expression, and L(a) = {a}, that is, the
language with one string, of length one, with 'a' in its one position.

3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,

a) (r)|(s) is a regular expression denoting the language L(r) U L(s).


b) (r)(s) is a regular expression denoting the language L(r)L(s).
c) (r)* is a regular expression denoting (L(r))*.
d) (r) is a regular expression denoting L(r).

4. The unary operator * has highest precedence and is left associative.

5. Concatenation has second highest precedence and is left associative.

6. | has lowest precedence and is left associative.

Regular set
A language that can be defined by a regular expression is called a regular set. If two
regular expressions r and s denote the same regular set, we say they are equivalent and write r = s.
There are a number of algebraic laws for regular expressions that can be used to
manipulate into equivalent forms. For instance, r|s = s|r is commutative; r|(s|t)=(r|s)|t is
associative.

Regular Definitions

Giving names to regular expressions is referred to as a Regular definition. If Σ is an
alphabet of basic symbols, then a regular definition is a sequence of definitions of the form
dl → r 1
d2 → r2
………
dn → rn

1. Each di is a distinct name.


2. Each ri is a regular expression over the alphabet Σ U {dl, d2,. . . , di-l}.

Example: Identifiers is the set of strings of letters and digits beginning with a letter. Regular
definition for this set:
letter → A | B | …. | Z | a | b | …. | z |
digit → 0 | 1 | …. | 9
id → letter ( letter | digit ) *
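This regular definition translates directly into a regular expression that can be checked with Python's re module (the anchors and the test strings below are chosen for illustration):

```python
import re

# The regular definition id -> letter (letter | digit)* as a Python regex;
# \A and \Z anchor the pattern so the whole string must match.
ID = re.compile(r"\A[A-Za-z][A-Za-z0-9]*\Z")

for s in ["count", "x1", "A9b", "9lives", ""]:
    print(s, bool(ID.match(s)))  # the last two are rejected
```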

Shorthands
Certain constructs occur so frequently in regular expressions that it is convenient to
introduce notational shorthands for them.
1. One or more instances (+):
- The unary postfix operator + means "one or more instances of".
- If r is a regular expression that denotes the language L(r), then ( r )+ is a regular expression that
denotes the language (L(r))+.
- Thus the regular expression a+ denotes the set of all strings of one or more a's.
- The operator + has the same precedence and associativity as the operator *.
2. Zero or one instance ( ?):
- The unary postfix operator ? means "zero or one instance of".
- The notation r? is a shorthand for r | ε.
- If r is a regular expression, then ( r )? is a regular expression that denotes the language L(r) U
{ ε }.
3. Character Classes:
- The notation [abc] where a, b and c are alphabet symbols denotes the regular expression a | b | c.
- Character class such as [a – z] denotes the regular expression a | b | c | d | ….|z.
- We can describe identifiers as being strings generated by the regular expression, [A–Za–z][A–
Za–z0–9]*

Non-regular Set
A language which cannot be described by any regular expression is a non-regular set.
Example: The set of all strings of balanced parentheses and repeating strings cannot be described
by a regular expression. This set can be specified by a context-free grammar.

RECOGNITION OF TOKENS

Consider the following grammar fragment:

stmt → if expr then stmt


| if expr then stmt else stmt

expr → term relop term

| term

term → id
| num
where the terminals if , then, else, relop, id and num generate sets of strings given by the
following regular definitions:

if → if
then → then
else → else
relop → <|<=|=|<>|>|>=
id → letter(letter|digit)*
num → digit+ (.digit+)?(E(+|-)?digit+)?

For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well
as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are
reserved; that is, they cannot be used as identifiers.
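A lexical analyzer for this fragment can be sketched with one pattern per token class; the TOKEN_SPEC table and the master-regex dispatch below are an illustrative implementation, not the book's:

```python
import re

# One pattern per token class from the regular definitions above.
# Order matters: keyword patterns are tried before id, so "if" is not
# returned as an identifier (i.e., keywords are reserved).
TOKEN_SPEC = [
    ("num",   r"\d+(\.\d+)?(E[+-]?\d+)?"),
    ("if",    r"if\b"),
    ("then",  r"then\b"),
    ("else",  r"else\b"),
    ("id",    r"[A-Za-z][A-Za-z0-9]*"),
    ("relop", r"<=|<>|>=|<|=|>"),
    ("ws",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    tokens = []
    for m in MASTER.finditer(text):
        if m.lastgroup != "ws":          # discard whitespace
            tokens.append((m.lastgroup, m.group()))
    return tokens

print(tokenize("if x1 <= 60 then y else z"))
```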

Transition diagrams
It is a diagrammatic representation to depict the action that will take place when a lexical
analyzer is called by the parser to get the next token. It is used to keep track of information about
the characters that are seen as the forward pointer scans the input.

9. Describe the language for specifying lexical Analyzer.(16)

There is a wide range of tools for constructing lexical analyzers.


 Lex (a lexical-analyzer generator)
 YACC (a parser generator, commonly used together with Lex)

LEX
Lex is a computer program that generates lexical analyzers. Lex is commonly used with
the yacc parser generator.
Creating a lexical analyzer

 First, a specification of a lexical analyzer is prepared by creating a program lex.l in the Lex
language. Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
 Finally, lex.yy.c is run through the C compiler to produce an object program a.out,
which is the lexical analyzer that transforms an input stream into a sequence of tokens.

Lex Specification
A Lex program consists of three parts:
{ definitions }
%%
{ rules }
%%
{ user subroutines }
 Definitions include declarations of variables, constants, and regular definitions
 Rules are statements of the form
p1 {action1}
p2 {action2}
…
pn {action n}
where pi is regular expression and action i describes what action the lexical analyzer should take
when pattern pi matches a lexeme. Actions are written in C code.
 User subroutines are auxiliary procedures needed by the actions. These can
be compiled
separately and loaded with the lexical analyzer.

YACC- YET ANOTHER COMPILER-COMPILER


Yacc provides a general tool for describing the input to a computer program. The Yacc
user specifies the structures of his input, together with code to be invoked as each such structure is
recognized. Yacc turns such a specification into a subroutine that handles the input process;
frequently, it is convenient and appropriate to have most of the flow of control in the
user's application handled by this subroutine.

10. (a) Explain the various phases of a compiler in detail. Also write down the output for the
following expression after each phase a:=b*c-d. (8) (May/June-2013)

Refer QNo:3

11. (a)Explain the functions of the Lexical Analyzer with its implementation.
(10) (b)Elaborate specification of tokens.(6)

Refer QNo:8

12. Describe in detail about input buffering. What are the tools used for constructing
a compiler?(16)

Refer QNo:6

13. Explain in detail about the phases of compiler and translate the statement pos:=init
+rate*60.(Nov/Dec-2012).

Refer QNo:3
14.(i).What are the issues in lexical analysis? (May/June-2012) (4)

There are three issues in lexical analysis:


 To make the design simpler.
 To improve the efficiency of the compiler.
 To enhance compiler portability.

(ii).Elaborate in detail the recognition of tokens. (May/June-2012) (12).

Refer QNo:8

Unit-II : Syntax analyzer and Run-time Environments


Part-A
1. Define parser.
A parsing or syntax analysis is a process which takes the input string w and produces either a
parse tree(syntactic structure) or generates the syntactic errors.
2. Write notes on syntax analysis? Also discuss various issues in parsing.
Syntax analysis is also called parsing. It involves grouping the tokens of the source program into
grammatical phrases that are used by the compiler to synthesize output.
There are two important issues in parsing:
1) Specification of syntax.
2) Representation of input after parsing.
3. What is parsing? What are the types of parsing?
Parsing is the process of determining if a string of tokens can be generated by a grammar. A parser
can be constructed for any grammar.
Parsing can be of two types
 Top down parsing
 Bottom up parsing
4. What are the error handling options done by the parser?
 It should report the presence of error clearly and accurately.
 It should recover from each error quickly enough.
 It should not significantly slow down the process of correcting programs.
5. Why lexical and syntax analyzers are separated out?
The reason for separating the lexical and syntax analyzer.
 Simpler design.
 Compiler efficiency is improved.
 Compiler portability is enhanced.
6. Define a context free grammar.
A context free grammar G is a collection of the following
 V is a set of non terminals
 T is a set of terminals
 S is a start symbol
 P is a set of production rules
G can be represented as G = (V,T,S,P)

7. Define ambiguous grammar.
A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of
language L(G), i.e. there is more than one leftmost derivation (or more than one rightmost derivation) for that sentence.
8. What are the various error-recovery strategies?
a) Panic mode - On discovering an error, the parser discards input symbols one at a
time until one of a designated set of synchronizing tokens is found.
b) Phrase level – On discovering an error, a parser performs local correction on the remaining
input; that is, it may replace a prefix of the remaining input by some string that allows the parser
to continue.
c) Error productions - If we have a good idea of the common errors, we can augment the grammar with productions that generate the erroneous constructs and recover from them.
d) Global correction – Use algorithms that make as few changes as possible when processing an
incorrect input string, yielding a globally least-cost correction.
9. What are the functions of parser?
i. Representation of the tokens by codes (integers)
ii. Grouping of tokens together into syntactic structure
10. What is meant by left most derivation?
In derivation, only the leftmost non-terminal in any sentential form is replaced at each step. Such
derivations are termed leftmost derivation.
11. Define ambiguous grammar.
A grammar G is said to be ambiguous if it generates more than one parse tree for some sentence of
language L(G), i.e. there is more than one leftmost derivation (or more than one rightmost derivation) for that sentence.
12. List various types of grammars.
i. Phrase structure grammar ii. Context sensitive grammar iii. Context-free grammar iv. Regular
grammar
13. What are the advantages of Grammar?
 Easy to understand.
 Imparts a structure to a programming language.
 Acquiring new constructs.
14. When a grammar is said to be ambiguous?
When a grammar produces more than one parse tree for a same sentence then the grammar is said
to be ambiguous
15. Write a grammar to define simple arithmetic expression?
expr → expr op expr
expr → (expr)
expr → - expr
expr → id
op → + | - | * | / | ^
16. Define context-free language.
Given a grammar G with start symbol S, we can use the derivation relation ⇒ to define L(G), the language
generated by G. We say a string of terminals w is in L(G) if and only if S ⇒* w. The string w is called a sentence of G. A language that can be
generated by a context-free grammar is said to be a context-free language.
17. What is meant by left recursion?
A grammar is left recursive if it has a non-terminal A such that there is a derivation A ⇒+ Aα for
some string α. Top-down parsing methods cannot handle left-recursive grammars, so a
transformation that eliminates left recursion is needed.
Ex:-
E → E + T | T
T → T * F | F
F → (E) | id

18. Write the algorithm to eliminate left recursion from a grammar?
1. Arrange the non terminals in some order A1,A2… .An
2 for i := 1 to n do begin
for j := 1 to i-1 do
begin
replace each production of the form Ai →AjƳ by
the productions
Aiδ1γ| δ2γ| …….| δkγ,
where Ajδ1| δ2|…….| δk
are all the current Aj –productions;
end
eliminate the immediate left recursion among the Ai-- productions
end.
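The inner step of this algorithm, eliminating immediate left recursion, can be sketched as follows; the grammar encoding (lists of symbol lists, with "eps" for ε) and the helper name are assumptions for illustration:

```python
# Sketch of immediate left-recursion elimination (the final step inside
# the loop above). A grammar maps a nonterminal to a list of alternatives,
# each alternative being a list of symbols; "eps" marks the empty string.
def eliminate_immediate(nt, alts):
    recursive = [a[1:] for a in alts if a and a[0] == nt]   # A -> A alpha
    others    = [a for a in alts if not a or a[0] != nt]    # A -> beta
    if not recursive:
        return {nt: alts}                                   # nothing to do
    new = nt + "'"
    return {
        nt:  [b + [new] for b in others],                   # A  -> beta A'
        new: [r + [new] for r in recursive] + [["eps"]],    # A' -> alpha A' | eps
    }

# E -> E + T | T  becomes  E -> T E',  E' -> + T E' | eps
g = eliminate_immediate("E", [["E", "+", "T"], ["T"]])
print(g)   # {'E': [['T', "E'"]], "E'": [['+', 'T', "E'"], ['eps']]}
```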
19. What is meant by left factoring?
Left factoring is a grammar transformation that is useful for producing a grammar suitable for
predictive parsing. The basic idea is that when it is not clear which of two alternative productions
to use to expand a non terminal A, we may be able to rewrite the A production to defer the
decision until we have seen enough of the input to make the right choice.
20. Define parser.
A parser for grammar G is a program that takes as input a string w and produces as output either a
parse tree for w, if w is a sentence of G, or an error message indicating that w is not a sentence of
G.
21.Mention the properties of parse tree?
 The root is labeled by the start symbol.
 Each leaf is labeled by a token.
 Each interior node is labeled by a non-terminal .
 If A is the non-terminal labeling some interior node and x1, x2, x3, …, xn are the labels of
its children from left to right, then A → x1 x2 … xn is a production.
22.Define Handles.
A handle of a right-sentential form γ is a production A → β and a position of γ where the string
β may be found and replaced by A to produce the previous right-sentential form in a rightmost
derivation of γ.
23.What are the problems in top down parsing?
a) Left recursion.
b) Backtracking.
c) The order in which alternates are tried can affect the language accepted.
24.Define recursive-descent parser.
A parser that uses a set of recursive procedures to recognize its input with no
backtracking is called a recursive-descent parser. The recursive procedures can be
quite easy to write.
25.Define predictive parsers.
A predictive parser is an efficient way of implementing recursive-descent parsing by
handling the stack of activation records explicitly. The predictive parser has an input, a stack ,
a parsing table and an output
26.What is non recursive predictive parsing?
A non-recursive predictive parser can be maintained by a stack explicitly, rather than implicitly via
recursive calls. The key problem during predictive parsing is determining the production to be
applied for a non-terminal. The non-recursive parser looks up the production to be applied in a
parsing table.
27.Write the algorithm for FIRST? (Or) Define FIRST in predictive parsing.

1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X) if for some i,
a is in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1).
28.Write the algorithm for FOLLOW? (Or) Define FOLLOW in predictive parsing.
1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except for ε is placed in
FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then
everything in FOLLOW(A) is in FOLLOW(B).
29. Write the algorithm for the construction of a predictive parsing table?
Input : Grammar G
Output : Parsing table M
Method :
a) For each production A → α of the grammar, do steps b and c.
b) For each terminal a in FIRST(α), add A → α to M[A, a].
c) If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in
FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
d) Make each undefined entry of M be error.
30. What is LL(1) grammar?
A grammar whose parsing table has no multiply-defined entries is said to be LL(1).
31. Define Handle.
A handle of a string is a substring that matches the right side of a production and whose reduction
to the non-terminal on the left side of the production represents one step along the reverse of a
rightmost derivation. (or)
A handle of a right-sentential form γ is a production A → β and a position of γ where
the string β may be found and replaced by A to produce the previous right-sentential form
in a rightmost derivation of γ.
32. Define handle pruning.
A technique to obtain the rightmost derivation in reverse (called the canonical reduction sequence) is
known as handle pruning, i.e. starting with a string of terminals w to be parsed. If w is a
sentence of the grammar, then w = γn, where γn is the nth right-sentential form of an as-yet
unknown rightmost derivation.
33. What are the four possible action of a shift reduce parser?
a) Shift action – the next input symbol is shifted to the top of the stack.
b) Reduce action – replace handle.
c) Accept action – successful completion of parsing.
d) Error action- find syntax error
34. What is shift reduce parsing?
The bottom up style of parsing is called shift reduce parsing. This parsing method is bottom up
because it attempts to construct a parse tree for an input string beginning at the leaves and working
up towards the root.
35. Define viable prefixes.
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser
are called viable prefixes. An equivalent definition of a viable prefix is that it is a prefix of a right
sentential form that does not continue past the right end of the rightmost handle of that sentential
form
36. List down the conflicts during shift-reduce parsing.
 Shift/reduce conflict.-Parser cannot decide whether to shift or reduce.
 Reduce/reduce conflict-Parser cannot decide which of the several reductions to make.

37. What is LR Parsers?
LR(k) parsers scan the input from (L) left to right and construct a (R) rightmost
derivation in reverse. LR parsers consist of a driver routine and a parsing table. The k is for
the number of input symbols of look ahead that are used in making parsing decisions. When k
is omitted , k is assumed to be 1
38. Define LR grammar.
A grammar for which we can construct a parsing table in which every entry is uniquely
defined is said to be an LR grammar
39. Why LR parsing is good and attractive?
* LR parsers can be constructed for all programming language constructs for which a CFG can be
written. * LR parsing is non-backtracking shift-reduce parsing. * The class of grammars parsed by LR
parsers is a proper superset of the class of grammars handled by predictive parsers. * An LR parser can
detect a syntactic error as soon as possible on a left-to-right scan of the input.
40. Define LR (0) item.
An LR(0) item of a grammar G is a production of G with a dot at some position of the right side.
Eg: A → .XYZ    A → X.YZ    A → XY.Z    A → XYZ.
41. What is augmented grammar? (or) How will you change the given grammar to an
augmented grammar?
If G is a grammar with start symbol S, then G', the augmented grammar for G, is G with a
new start symbol S' and production S' → S. It is to indicate to the parser when it should stop and
announce acceptance of the input.
42. Explain the need of augmentation.
To indicate to the parser when it should stop parsing and announce acceptance of the
input. Acceptance occurs when the parser is about to reduce by S' → S.
43. What are the rules for "Closure operation" in SLR parsing?
If I is a set of items for grammar G, then closure(I) is the set of items constructed from I
by the following two rules.
1. Initially, every item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production, then add the item B → .γ to I, if it is
not already there. Apply this until no more new items can be added to closure(I).
44. How to make an ACTION and GOTO entry in SLR parsing table?
i. If [A → α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must
be a terminal.
ii. If [A → α.] is in Ii, then set action[i, a] to "reduce A → α" for all a in FOLLOW(A); here A
may not be S'.
iii. If [S' → S.] is in Ii, then set action[i, $] to "accept".
45.What is type checking?
Type checker verifies that the type of a construct (constant, variable, array, list, object) matches
what is expected in its usage context.
46.Write Static vs. Dynamic Type Checking.
Static: Done at compile time (e.g., Java). Dynamic: Done at run time (e.g., Scheme).
A sound type system is one where any program that passes the static type checker cannot
contain run-time type errors. Such languages are said to be strongly typed.
47.Define procedure definition.
A procedure definition is a declaration that, in its simplest form, associates an identifier with a
statement. The identifier is the procedure name, and the statement is the procedure body. Some of the
identifiers appearing in a procedure definition are special and are called formal
parameters of the procedure. Arguments, known as actual parameters, may be passed to a
called procedure; they are substituted for the formals in the body.
48.Define activation trees.
A recursive procedure p need not call itself directly; p may call another procedure q,
which may then call p through some sequence of procedure calls. We can use a tree
called an activation tree, to depict the way control enters and leaves activation. In an
activation tree
a) Each node represents an activation of a procedure,
b) The root represents the activation of the main program
c) The node for a is the parent of the node for b if and only if control flows from
activation a to b, and
d) The node for a is to the left of the node for b if and only if the lifetime of a occurs before the
lifetime of b.
49.Write notes on control stack?
A control stack is to keep track of live procedure activations. The idea is to push the
node for activation onto the control stack as the activation begins and to pop the node when
the activation ends.
50.Write the scope of a declaration?
A portion of the program to which a declaration applies is called the scope of that declaration. An
occurrence of a name in a procedure is said to be local to the procedure if it is in the
scope of a declaration within the procedure; otherwise, the occurrence is said to be nonlocal.
51.Define binding of names.
When an environment associates storage location s with a name x, we say that x is bound
to s; the association itself is referred to as a binding of x. A binding is the dynamic
counterpart of a declaration.
52.What is the use of run time storage?
The run time storage might be subdivided to hold: a) The generated target code b) Data objects,
and c) A counterpart of the control stack to keep track of procedure activation.
53. What is an activation record? (or) What is frame?
Information needed by a single execution of a procedure is managed using a contiguous block
of storage called an activation record or frame, consisting of the collection of fields such as
a) Return value b) Actual parameters c) Optional control link d) Optional access link e) Saved
machine status f) Local data g) Temporaries.
54.What are the storage allocation strategies?
a) Static allocation lays out storage for all data objects at compile time.
b) Stack allocation manages the run-storage as a stack.
c) Heap allocation allocates and de allocates storage as needed at run time from a data area
known as heap.
55.What is static allocation?
In static allocation, names are bound to storage as the program is compiled, so there is no need for
a run-time support package. Since the bindings do not change at run time, every time a
procedure is activated, its names are bound to the same storage location.
56.What is stack allocation?
Stack allocation is based on the idea of a control stack; storage is organized as a stack,
and activation records are pushed and popped as activations begin and end respectively.
57. What are the limitations of static allocation?
a) The size of a data object and constraints on its position in memory must be known at compile
time.
b) Recursive procedure is restricted.
c) Data structures cannot be created dynamically.

58.What is dangling references?
Whenever storage can be de-allocated, the problem of dangling references arises. A
dangling reference occurs when there is a reference to storage that has been de-allocated.
59.What is heap allocation?
Heap allocation parcels out pieces of contiguous storage, as needed for activation records
or other objects. Pieces may be de allocated in any order, so over time the heap will consist
of alternate areas that are free and in use.
60. What is type system?
A type system is a collection of rules for assigning type expressions to the various parts of a
program. Type checker implements a type system.
61. Eliminate left recursion from the following grammar: A → Ac | Aad | bd | ε. (May/June 2013)
A → bdA' | A'
A' → cA' | adA' | ε
62. Give examples for static check. (May/June 2013)
The integer 2 is converted to a float in the code for the expression 2*3.14:
t1 = (float) 2
t2 = t1 * 3.14
We can extend such examples to consider integer and float versions of the operators; we introduce
another attribute E.type, whose value is either integer or float. The rule associated with
E → E1 + E2 builds on the pseudocode:
if (E1.type = integer and E2.type = integer) E.type = integer;
else if (E1.type = float and E2.type = integer) E.type = float;

63. Write the regular expression for identifier and whitespace. (Nov/Dec 2013)
An Identifier (or Name): /[a-zA-Z_][0-9a-zA-Z_]*/
1. Begins with a letter or underscore, followed by zero or more digits, letters
and underscores.
2. You can use meta character \w (word character) for [a-zA-Z0-9_]; \d (digit) for [0-9].
Hence, it can be written as /[a-zA-Z_]\w*/.
3. To include dash (-) in the identifier, use /[a-zA-Z_][\w-]*/. Nonetheless, dash conflicts
with subtraction and is often excluded from identifier.
Whitespace
/\s\s/ # Matches two whitespaces
/\S\S\s/ # Two non-whitespaces followed by a whitespace
/\s+/ # one or more whitespaces
/\S+\s\S+/ # two words (non-whitespaces) separated by a whitespace
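These patterns can be exercised directly with Python's re module (the test strings below are chosen for illustration):

```python
import re

identifier = re.compile(r"\A[a-zA-Z_]\w*\Z")   # letter/underscore, then word chars
whitespace = re.compile(r"\s+")                # one or more whitespace characters

print(bool(identifier.match("_tmp1")))   # True
print(bool(identifier.match("9lives")))  # False: cannot begin with a digit
print(whitespace.findall("a  b\tc"))     # ['  ', '\t']
```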
64. Eliminate the left recursion for the grammar (Nov/Dec 2013)
SAa|b
AAc|Sd|є
Sol:
Let's use the ordering S, A (S = A1, A = A2).
• When i = 1, we skip the "for j" loop and remove immediate left recursion from the S productions
(there is none).
• When i = 2 and j = 1, we substitute the S-productions in A → Sd to obtain the A-productions
A → Ac | Aad | bd | ε
• Eliminating immediate left recursion from the A productions yields the grammar:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε
65. Compare the features of DFA and NFA. (May/June 2014)

1. Both are transition systems of finite automata. In a DFA the next state is uniquely determined, while in
an NFA each pair of state and input symbol can have many possible next states.
2. An NFA can use empty-string (ε) transitions while a DFA cannot.
3. An NFA is easier to construct, while it is more difficult to construct a DFA.
4. Backtracking is allowed in a DFA, while in an NFA it may or may not be possible.
5. A DFA requires more space, while an NFA requires less space.
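The first difference can be illustrated by simulating both machine styles; the two example machines below (both accepting strings over {a, b} that end in "ab") are made up for the sketch:

```python
# A DFA transition yields a single next state; an NFA transition yields
# a *set* of next states. Both machines accept strings ending in "ab".
DFA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
       (2, "a"): 1, (2, "b"): 0}

NFA = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}

def run_dfa(s):
    state = 0
    for ch in s:
        state = DFA[(state, ch)]        # exactly one next state
    return state == 2

def run_nfa(s):
    states = {0}
    for ch in s:                        # track every reachable state
        states = set().union(*(NFA.get((q, ch), set()) for q in states))
    return 2 in states

for w in ["ab", "aab", "abb", "ba"]:
    print(w, run_dfa(w), run_nfa(w))    # both machines agree on every input
```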
66. What is dangling else problem?
The dangling-else ambiguity arises in the grammar shown below: in a sentence with two thens and one else, the else can be matched with either then.
stmt → if expr then stmt
| if expr then stmt else stmt
| other
67. List the fields in activation record .(Nov/Dec 2014)
 Actual parameters
 Returned Values
 Control link
 Access link
 Saved machine status
 Local data
 Temporaries
PART – B
1.(i) Construct Predictive Parser for the following grammar:(May/June-2012&13)(10)
S  (L) / a
L  L, S/S
Soln:
The grammar has 4 productions:
S → (L)
S → a
L → L, S
L → S
The 3rd production possesses left recursion; the left-recursive nature can be eliminated as listed
below:
S → (L)
S → a
L → SL'
L' → , SL'
L' → ε

Computation of FIRST:

FIRST(S) = { ( , a }
FIRST(L) = FIRST(S)
         = { ( , a }
FIRST(L') = { "," , ε }

Computation of FOLLOW:

Non-terminals on the right side: S, L, L'

FOLLOW(S) = { $ } ∪ (FIRST(L') − {ε}) ∪ FOLLOW(L')
          = { $ , "," , ) }
FOLLOW(L) = { ) }
FOLLOW(L') = FOLLOW(L)
           = { ) }

Construction of parsing table:

It is a 2-D array having the non-terminals as rows and the terminals as columns; $ is included in the
last column.
Non-terminals: S, L, L'
Terminals: a , "," , ( , ) , $   [ε is not included in the table]

        a           ,            (           )          $
S       S → a                    S → (L)
L       L → SL'                  L → SL'
L'                  L' → ,SL'                L' → ε

(ii) Describe the conflicts that may occur during shift reduce parsing. (may/june-2012).(6)

Shift-reduce parsing construct a parse tree for an input string beginning at the leaves
(bottom) and working up towards the root (the top). Here a string w is reduced to the start symbol
of the grammar. At each reduction step a particular sub string matching the right side of the
production is replaced by the symbol on the left of that production.
Conflicts during Shift-Reduce Parsing
The general shift-reduce technique
 Perform shift action when there is no handle on the stack
 Perform reduce action when there is a handle on the top of the stack
There are two problem that a shift-reduce parser faces,
Shift-reduce conflict:
What actions to take in case both shift and reduce action are valid?
Example:
Consider the grammar
E E + E | E * E | id
and input id + id * id.

Stack    Input    Action              |  Stack      Input   Action
$E+E     *id$     Reduce by E → E+E   |  $E+E       *id$    Shift
$E       *id$     Shift               |  $E+E*      id$     Shift
$E*      id$      Shift               |  $E+E*id    $       Reduce by E → id
$E*id    $        Reduce by E → id    |  $E+E*E     $       Reduce by E → E*E
$E*E     $        Reduce by E → E*E   |  $E+E       $       Reduce by E → E+E
$E       $                            |  $E         $

Reduce-reduce conflict:
Which rule to use for reduction if reduction is possible by more than one rule?
Example:
Consider the grammar
M → R + R | R + c | R
R → c
and input c+c

Stack   Input   Action             |  Stack   Input   Action
$       c+c$    Shift              |  $       c+c$    Shift
$c      +c$     Reduce by R → c    |  $c      +c$     Reduce by R → c
$R      +c$     Shift              |  $R      +c$     Shift
$R+     c$      Shift              |  $R+     c$      Shift
$R+c    $       Reduce by R → c    |  $R+c    $       Reduce by M → R+c
$R+R    $       Reduce by M → R+R  |  $M      $
$M      $

These conflicts come either because of ambiguous grammar or parsing method is not powerful.
2. What is FIRST and FOLLOW? Explain in detail with an example. Write down the
necessary algorithm.(16)

FIRST
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set.
 If X is a terminal, then FIRST(X) is {X}.
 If X → ε is a production, then add ε to FIRST(X).
 If X is a non-terminal and X → Y1Y2…Yk is a production, then place a in FIRST(X)
if for some i, a is in FIRST(Yi), and ε is in all of FIRST(Y1), …, FIRST(Yi-1);
that is, Y1…Yi-1 ⇒* ε. If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to
FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not
derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2),
and so on.

FOLLOW
To compute FOLLOW(A) for all non-terminals A, apply the following rules until nothing can be
added to any FOLLOW set.
 Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
 If there is a production A → αBβ, then everything in FIRST(β) except for ε is placed in
FOLLOW(B).
 If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε
(i.e., β ⇒* ε), then everything in FOLLOW(A) is in FOLLOW(B).

Example:
Consider the following grammar and compute FIRST and FOLLOW for all.
E  TE’
E’  +TE’|Є
T  FT’
T  *FT’|Є
F  (E)|id
Solution:

3.i)Explain detail about the specification of a simple type checker.(may/june-2012&13).


ii)List all LR(0) items for the following grammar.
S → AS | b
A → SA | a

Specification of a Simple Type Checker


Consider a language (e.g., Java) that requires that an identifier be declared with a type
before the variable is used.

For simplicity, we'll declare all identifiers before using them in a single expression:
P → D ; E
D → D ; D | T id
T → char | int | T [ num ]
E → id | E mod E | E [ E ]

Add syntax-directed definitions to type-check expressions:

D → T id        { addtype(id.entry, T.type) }
T → char        { T.type := char }
T → int         { T.type := integer }
T → T1 [ num ]  { T.type := array(0..num.val-1, T1.type) }
E → id          { E.type := lookup(id.entry) }
E → E1 mod E2   { E.type := if E1.type = integer and E2.type = integer then integer else type_error }
E → E1 [ E2 ]   { E.type := if E2.type = integer and E1.type = array(s,t) then t else type_error }
where addtype and lookup are symbol-table methods, and array is a type constructor.

ii) LR(0) items


Soln:
The grammar productions are:
S → AS
S → b
A → SA
A → a
We will construct a collection of canonical sets of LR(0) items for the given
grammar, augmented with S' → S.
I0: S' → .S
    S → .AS
    S → .b
    A → .SA
    A → .a
I1: goto(I0, S)
    S' → S.
    A → S.A
    A → .SA
    A → .a
    S → .AS
    S → .b
I2: goto(I0, A)
    S → A.S
    S → .AS
    S → .b
    A → .SA
    A → .a
I3: goto(I0, b)
    S → b.
I4: goto(I0, a)
    A → a.
I5: goto(I1, A)
    A → SA.
    S → A.S
    S → .AS
    S → .b
    A → .SA
    A → .a
I6: goto(I1, S)
    A → S.A
    A → .SA
    A → .a
    S → .AS
    S → .b
I7: goto(I2, S)
    S → AS.
    A → S.A
    A → .SA
    A → .a
    S → .AS
    S → .b
We will list out the grammar rules:
1. S → AS        FOLLOW(S) = {a, b, $}
2. S → b         FOLLOW(A) = {a, b}
3. A → SA
4. A → a

The parsing table will be


        Action                     goto
        a        b        $        S    A
0       s4       s3                1    2
1       s4       s3      accept    6    5
2       s4       s3                7    2
3       r2       r2      r2
4       r4       r4
5       s4/r3    s3/r3             7    2
6       s4       s3                6    5
7       s4/r1    s3/r1   r1        6    5

This shows that the given grammar is not SLR: states 5 and 7 contain shift/reduce conflicts.
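The closure and goto operations used to build these item sets can be sketched as follows; the item encoding (head, right side, dot position) is an assumption for illustration:

```python
# LR(0) closure() and goto(); an item is (head, rhs_tuple, dot_position).
# The grammar is augmented with S1 -> S (S1 plays the role of S').
GRAMMAR = {
    "S1": [("S",)],
    "S":  [("A", "S"), ("b",)],
    "A":  [("S", "A"), ("a",)],
}

def closure(items):
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot before a nonterminal
                for prod in GRAMMAR[rhs[dot]]:
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return frozenset(items)

def goto(items, sym):
    # advance the dot over sym, then take the closure
    moved = {(h, r, d + 1) for h, r, d in items if d < len(r) and r[d] == sym}
    return closure(moved)

I0 = closure({("S1", ("S",), 0)})
print(len(I0))            # 5 items: S1->.S, S->.AS, S->.b, A->.SA, A->.a
print(len(goto(I0, "S")))  # 6 items, matching I1 above
```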

4.(i)What are different storage allocation strategies? Explain.(8)(may/june-2013)


A different storage-allocation strategy is used in each of the data areas. They are,
 Static allocation lays out storage for all data objects at compile time.
 Stack allocation manages the run-time storage as a stack.
 Heap allocation allocates and de-allocates storage as needed at run time from a data area
known as a heap.

Static Allocation
In static allocation, names are bound to storage as the program is compiled, so there is no need for
a run-time support package. Since the bindings do not change at run time, every time a procedure
is activated, its names are bound to the same storage locations. This property allows the values of
local names to be retained across activations of a procedure. That is, when control returns to a
procedure, the values of the locals are the same as they were when control left the last time.
However, some limitations go along with using static allocation alone.
 The size of a data object and constraints on its position in memory must be known at
compile time.
 Recursive procedures are restricted, because alt activations of a procedure use the same
bindings for local names.
 Data structures cannot be created dynamically.
Stack Allocation
Stack allocation is based on the idea of a control stack; storage is organized as a stack, and
activation records are pushed and popped as activations begin and end, respectively. Storage for
the locals in each call of a procedure is contained in the activation record for that call. Thus locals
are bound to fresh storage in each activation, because a new activation record is pushed onto the
stack when a call is made. Furthermore, the values of locals are deleted when the activation ends;
that is, the values are lost because the storage for locals disappears when the activation record is
popped.
Suppose that register top marks the top of the stack. At run time, an activation record can be
allocated and de-allocated by incrementing and decrementing top, respectively, by the size of the
record.
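The top-of-stack arithmetic described above can be sketched in a few lines. This is a toy model, not a real run-time system; the record sizes and the class name are assumptions of the sketch:

```python
# Toy model of stack allocation: allocating or deallocating an activation
# record is just moving 'top' by the size of the record.
class ControlStack:
    def __init__(self):
        self.top = 0

    def push_record(self, size):
        base = self.top       # the new record begins at the current top
        self.top += size      # allocate on procedure call
        return base

    def pop_record(self, size):
        self.top -= size      # deallocate on procedure return

s = ControlStack()
s.push_record(24)             # caller's activation record (assumed 24 bytes)
s.push_record(16)             # callee's activation record (assumed 16 bytes)
s.pop_record(16)              # callee returns; its locals disappear
```

After the callee returns, top is back to 24: the storage for the callee's locals is gone, which is exactly why their values are lost when an activation ends.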
Calling Sequences
Procedure calls are implemented by generating what are known as calling sequences in the
target code, a call sequence allocates an activation record and enters information into its fields. A
return sequence restores the state of the machine so the calling procedure can continue execution.
The code in a calling sequence is often divided between the calling procedure (the caller) and the
procedure it calls (the callee).
The call sequence is:
 The caller evaluates the actual parameters.
 The caller stores a return address and the old value of top-sp into the callee's activation
record.
 The callee saves register values and other status information.
 The callee initializes its local data and begins execution.

The division of tasks between caller and callee is shown in following figure.

Fig: Division of tasks between caller and callee


The return sequence is:
 The callee places a return value next to the activation record of the caller

 Using the information in the status field, the callee restores top-sp and other registers and
branches to a return address in the caller‘s code.
 Although top-sp has been decremented, the caller can copy the returned value into its own
activation record and use it to evaluate an expression.

Variable-Length Data
A common strategy for handling variable-length data is shown in the following figure, where
procedure "p" has three local arrays.

Fig: Access to dynamically allocated arrays


The storage for these arrays is not part of the activation record for p; only a pointer to the
beginning of each array appears in the activation record. The relative addresses of these pointers
are known at compile time, so the target code can access array elements through the pointers.
Also in above figure, procedure q called by p. The activation record for q begins after the arrays
of p, and the variable-length arrays of q begin beyond that. Access to data on the stack is through
two pointers, top and top-sp. The first of these marks the actual top of the stack; it points to the
position at which the next activation record will begin. The second is used to find local data.
Dangling References
Whenever storage can be de-allocated, the problem of dangling references arises. A
dangling reference occurs when there is a reference to storage that has been de-allocated. It is a
logical error to use dangling references, since the value of de-allocated storage is undefined
according to the semantics of most languages.
Heap Allocation
The stack allocation strategy discussed above cannot be used if either of the following is possible.
 The values of local names must be retained when activation ends
 A called activation outlives the caller. This possibility cannot occur for those languages
where activation trees correctly depict the flow of control between procedures.
Heap allocation parcels out pieces of contiguous storage, as needed for activation records or other
objects. Pieces may be de-allocated in any order, so over time the heap will consist of alternate
areas that are free and in use.

(ii)Specify a type checker which can handle expressions, statements and
functions.(8)(may/june-2013)

5. Consider the grammar given below:


E  E+T
ET
T  T*F
TF
F  (E)
F  id
Construct an LR parsing table for the above grammar. Give the moves of the LR parser on
id*id+id.
Augmented grammar:
E' → E
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id

closure({E' → .E}) =
{ E' → .E        (kernel item)
  E → .E + T
  E → .T
  T → .T * F
  T → .F
  F → .(E)
  F → .id }

The Canonical LR(0) Collection

Parsing Table

State |           Action            |   Goto
      | id    +    *    (    )   $  |  E   T   F
  0   | s5             s4           |  1   2   3
  1   |      s6                 acc |
  2   |      r2   s7        r2   r2 |
  3   |      r4   r4        r4   r4 |
  4   | s5             s4           |  8   2   3
  5   |      r6   r6        r6   r6 |
  6   | s5             s4           |      9   3
  7   | s5             s4           |         10
  8   |      s6            s11      |
  9   |      r1   s7        r1   r1 |
 10   |      r3   r3        r3   r3 |
 11   |      r5   r5        r5   r5 |

Moves of the LR parser on id*id+id:

Stack            Input        Action              Output
0                id*id+id$    shift 5
0 id 5           *id+id$      reduce by F → id    F → id
0 F 3            *id+id$      reduce by T → F     T → F
0 T 2            *id+id$      shift 7
0 T 2 * 7        id+id$       shift 5
0 T 2 * 7 id 5   +id$         reduce by F → id    F → id
0 T 2 * 7 F 10   +id$         reduce by T → T*F   T → T*F
0 T 2            +id$         reduce by E → T     E → T
0 E 1            +id$         shift 6
0 E 1 + 6        id$          shift 5
0 E 1 + 6 id 5   $            reduce by F → id    F → id
0 E 1 + 6 F 3    $            reduce by T → F     T → F
0 E 1 + 6 T 9    $            reduce by E → E+T   E → E+T
0 E 1            $            accept
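The moves above come from a purely table-driven loop. The sketch below reproduces the same sequence of reductions on id*id+id; the ACTION/GOTO entries are transcribed from the parsing table, with the state and rule numbers assumed as given there:

```python
# Minimal table-driven LR parser sketch for the expression grammar
# E' -> E, E -> E+T | T, T -> T*F | F, F -> (E) | id.
PRODS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3), 4: ("T", 1),
         5: ("F", 3), 6: ("F", 1)}  # rule number -> (head, body length)

ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc", None),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def parse(tokens):
    stack, tokens = [0], tokens + ["$"]   # stack holds states only
    i, output = 0, []
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            return output + ["error"]
        kind, arg = act
        if kind == "s":                    # shift: push state, advance input
            stack.append(arg); i += 1
        elif kind == "r":                  # reduce: pop |body| states, then GOTO
            head, n = PRODS[arg]
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], head)])
            output.append(arg)
        else:
            return output + ["accept"]
```

Running `parse(["id", "*", "id", "+", "id"])` yields the reduction sequence 6, 4, 6, 3, 2, 6, 4, 1 followed by accept, matching the trace above.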

6. Consider the following grammar(16). (Nov/Dec-2012)


E→E+T | T
T→TF | F
F→F* | a | b
construct the SLR parsing table for this grammar. Also parse the input a*b+a .

Solution:
Let us number the production rules in the grammar
1) EE+T
2) ET
3) TTF
4) TF
5) FF*
6) Fa
7) Fb
Augmented Grammar
I0: E‘.E
E.E+T
E.T
T.TF
T.F
F.F*
F.a
F.b

I1: goto(I0, E)
E' → E.
E → E. + T
I2: goto(I0, T)
E → T.
T → T.F
F → .F*
F → .a
F → .b
I3: goto(I0, F)
T → F.
F → F.*
I4: goto(I0, a)
F → a.
I5: goto(I0, b)
F → b.
I6: goto(I1, +)
E → E + .T
T → .TF
T → .F
F → .F*
F → .a
F → .b
I7: goto(I2, F)
T → TF.
F → F.*
I8: goto(I3, *)
F → F*.
I9: goto(I6, T)
E → E + T.
T → T.F
F → .F*
F → .a
F → .b
FOLLOW(E) = {+, $}
FOLLOW(T) = {+, a, b, $}
FOLLOW(F) = {+, *, a, b, $}

State |         Action            |  Goto
      |  +    *    a    b    $    | E   T   F
  0   |           s4   s5         | 1   2   3
  1   | s6                  acc   |
  2   | r2        s4   s5   r2    |         7
  3   | r4   s8   r4   r4   r4    |
  4   | r6   r6   r6   r6   r6    |
  5   | r7   r7   r7   r7   r7    |
  6   |           s4   s5         |     9   3
  7   | r3   s8   r3   r3   r3    |
  8   | r5   r5   r5   r5   r5    |
  9   | r1        s4   s5   r1    |         7

Now we parse the input a*b+a using the table above:

Stack           Input Buffer   Action
$0              a*b+a$         shift 4
$0 a 4          *b+a$          reduce by F → a
$0 F 3          *b+a$          shift 8
$0 F 3 * 8      b+a$           reduce by F → F*
$0 F 3          b+a$           reduce by T → F
$0 T 2          b+a$           shift 5
$0 T 2 b 5      +a$            reduce by F → b
$0 T 2 F 7      +a$            reduce by T → TF
$0 T 2          +a$            reduce by E → T
$0 E 1          +a$            shift 6
$0 E 1 + 6      a$             shift 4
$0 E 1 + 6 a 4  $              reduce by F → a
$0 E 1 + 6 F 3  $              reduce by T → F
$0 E 1 + 6 T 9  $              reduce by E → E+T
$0 E 1          $              accept
Thus the input string is parsed.

7. Construct SLR parsing table for the following grammar (10). (Nov/Dec-2012,April/May-
04,Marks 8)
S L=R|R L*R| id RL
Solution:We will first build a canonical set of items.Hence a collection of LR(0) items is as given
below.
I0: S' → .S
S → .L = R
S → .R
L → .*R
L → .id
R → .L
I1: goto(I0, S)
S' → S.
I2: goto(I0, L)
S → L. = R
R → L.
I3: goto(I0, R)
S → R.
I4: goto(I0, *)
L → *.R
R → .L
L → .*R
L → .id
I5: goto(I0, id)
L → id.
I6: goto(I2, =)
S → L = .R
R → .L
L → .*R
L → .id
I7: goto(I4, R)
L → *R.
I8: goto(I4, L)
R → L.
I9: goto(I6, R)
S → L = R.
Numbering the productions: 1) S → L=R  2) S → R  3) L → *R  4) L → id  5) R → L.
With FOLLOW(S) = {$} and FOLLOW(L) = FOLLOW(R) = {=, $}, the SLR table is:

State |       Action           |  Goto
      |   =     *    id    $   | S   L   R
  0   |        s4   s5         | 1   2   3
  1   |                  acc   |
  2   | s6/r5            r5    |
  3   |                  r2    |
  4   |        s4   s5         |     8   7
  5   |  r4              r4    |
  6   |        s4   s5         |     8   9
  7   |  r3              r3    |
  8   |  r5              r5    |
  9   |                  r1    |

We get multiple entries in Action[2, =] = s6 and r5, i.e., both shift and reduce: a
shift/reduce conflict occurs on input symbol =. Hence the given grammar is not SLR(1).

(ii) Write down the rules for FIRST and FOLLOW (6). (Nov/Dec-2012)
FIRST
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or Є can be added to any FIRST set.
 If X is a terminal, then FIRST(X) is {X}.
 If X → Є is a production, then add Є to FIRST(X).
 If X is a non-terminal and X → Y1 Y2 … Yk is a production, then place a in FIRST(X)
if for some i, a is in FIRST(Yi), and Є is in all of FIRST(Y1), …, FIRST(Yi-1);
that is, Y1 … Yi-1 derive Є. If Є is in FIRST(Yj) for all j = 1, 2, …, k, then add Є to
FIRST(X). For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not
derive Є, then we add nothing more to FIRST(X), but if Y1 derives Є, then we add FIRST(Y2),
and so on.

FOLLOW
To compute FOLLOW (A) for all non-terminals A, apply the following rules until nothing can be
added lo any FOLLOW set.
 Place $ in FOLLOW(S), where S is the start symbol and $ is the input right end marker

 If there is a production A → αBβ, then everything in FIRST(β) except Є is placed in
FOLLOW(B).
 If there is a production A → αB, or a production A → αBβ where FIRST(β) contains Є
(i.e., β derives Є), then everything in FOLLOW(A) is in FOLLOW(B).
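The two fixed-point computations above can be sketched together. The grammar encoding (a dict of heads to lists of symbol tuples, with the empty tuple for Є) is an assumption of this sketch:

```python
# Fixed-point computation of FIRST and FOLLOW, following the rules above.
EPS = ""  # stands for the empty string Є

def first_follow(grammar, start):
    nonterms = set(grammar)
    first = {a: set() for a in nonterms}
    follow = {a: set() for a in nonterms}
    follow[start].add("$")                 # rule: $ goes in FOLLOW(start)

    def first_of(seq):
        """FIRST of a sentential form, using the current FIRST sets."""
        out = set()
        for x in seq:
            fx = first[x] if x in nonterms else {x}
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        out.add(EPS)                       # every symbol can derive Є
        return out

    changed = True
    while changed:                         # iterate until nothing can be added
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                f = first_of(body)
                if not f <= first[head]:
                    first[head] |= f; changed = True
                for i, x in enumerate(body):
                    if x not in nonterms:
                        continue
                    trailer = first_of(body[i + 1:])
                    add = (trailer - {EPS}) | (follow[head] if EPS in trailer else set())
                    if not add <= follow[x]:
                        follow[x] |= add; changed = True
    return first, follow

# The left-factored grammar from question 9: S -> iEtSS' | a, S' -> eS | Є, E -> b
G = {"S": [("i", "E", "t", "S", "S'"), ("a",)],
     "S'": [("e", "S"), ()],
     "E": [("b",)]}
first, follow = first_follow(G, "S")
```

On this grammar the sketch reproduces the sets computed in question 9: FIRST(S) = {i, a}, FOLLOW(S) = {e, $}, FOLLOW(E) = {t}.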

8. Explain the role of parser in detail.(Nov/Dec-06,Marks 6,Nov/Dec-07,Marks 4)


The parser or syntactic analyzer obtains a string of tokens from the lexical analyzer and
verifies that the string can be generated by the grammar for the source language. It reports any
syntax errors in the program. It also recovers from commonly occurring errors so that it can
continue processing its input.

Functions of the parser :


1. It verifies the structure generated by the tokens based on the grammar.
2. It constructs the parse tree.
3. It reports the errors.
4. It performs error recovery.
Issues :
Parser cannot detect errors such as:
1. Variable re-declaration
2. Variable initialization before use.
3. Data type mismatch for an operation.
The above issues are handled by Semantic Analysis phase.
Syntax error handling :
Programs can contain errors at many different levels. For example :
1. Lexical, such as misspelling a keyword.
2. Syntactic, such as an arithmetic expression with unbalanced parentheses.
3. Semantic, such as an operator applied to an incompatible operand.
4. Logical, such as an infinitely recursive call.
Functions of error handler :
1. It should report the presence of errors clearly and accurately.
2. It should recover from each error quickly enough to be able to detect subsequent errors.
3. It should not significantly slow down the processing of correct programs.

9. Check whether the following grammar is a LL(1) grammar


S  iEtS | iEtSeS | a
Eb
Also define the FIRST and FOLLOW procedures. (16)
Solution :
Let,
S → iEtS | iEtSeS | a
Since the productions for S share a common prefix, we left factor the grammar and rewrite it as
S → iEtSS' | a
S' → eS | є
E → b
Now we compute FIRST and FOLLOW for S, S' and E.

FIRST(S) = {i, a}
FIRST(S') = {e, є}
FIRST(E) = {b}

FOLLOW(S) = {e, $}
FOLLOW(S') = {e, $}
FOLLOW(E) = {t}

The predictive parsing table can be constructed as:

     |  a       b       e                 i            t   $
S    | S → a                              S → iEtSS'
S'   |                 S' → є / S' → eS                    S' → є
E    |        E → b

As we have got multiple entries in M[S', e], this grammar is not an LL(1) grammar.
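The conflict can be confirmed mechanically: build M[A, a] from the FIRST and FOLLOW sets computed above and look for any cell holding more than one production. The string encoding of productions below is an assumption for brevity:

```python
# Build the predictive table M[A, a]; a cell with two productions means
# the grammar is not LL(1).  FIRST is given per production, FOLLOW per head,
# transcribed from the sets computed above ("eps" stands for є).
FIRST = {("S", "iEtSS'"): {"i"}, ("S", "a"): {"a"},
         ("S'", "eS"): {"e"}, ("S'", "eps"): {"eps"},
         ("E", "b"): {"b"}}
FOLLOW = {"S": {"e", "$"}, "S'": {"e", "$"}, "E": {"t"}}

table = {}
for (head, body), f in FIRST.items():
    terms = f - {"eps"}
    if "eps" in f:                    # A -> є is entered under FOLLOW(A)
        terms |= FOLLOW[head]
    for a in terms:
        table.setdefault((head, a), []).append(body)

conflicts = {cell: prods for cell, prods in table.items() if len(prods) > 1}
# M[S', e] receives both S' -> eS and S' -> є, so the grammar is not LL(1).
```

The only conflicting cell is M[S', e], exactly the dangling-else ambiguity noted above.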

10. Construct a non-recursive predictive parsing table for the following grammar:
E → E or E | E and E | not E | (E) | 0 | 1. (Dec-12, Marks 16)

Solution:
The given grammar needs to be left factored; also the ambiguity of this grammar needs to be
removed. The modified grammar will be as follows.
 Eliminating ambiguity (layering the precedence or < and < not):
E → E or T
E → T
T → T and F
T → F
F → not G
F → G
G → (E) | 0 | 1
 Eliminating left recursion:
E → TE'
E' → or TE' | є
T → FT'
T' → and FT' | є
F → not G
F → G
G → (E) | 0 | 1
The FIRST and FOLLOW sets for the above grammar are:
FIRST(E) = {not, (, 0, 1}
FIRST(E') = {or, є}
FIRST(T) = {not, (, 0, 1}
FIRST(T') = {and, є}
FIRST(F) = {not, (, 0, 1}
FIRST(G) = {(, 0, 1}

FOLLOW(E) = {), $}
FOLLOW(E') = {), $}
FOLLOW(T) = {or, ), $}
FOLLOW(T') = {or, ), $}
FOLLOW(F) = {and, or, ), $}
FOLLOW(G) = {and, or, ), $}

The predictive parsing table will be:

    | or          and            not         (         )        0        1        $
E   |                            E → TE'     E → TE'            E → TE'  E → TE'
E'  | E' → orTE'                                       E' → є                     E' → є
T   |                            T → FT'     T → FT'            T → FT'  T → FT'
T'  | T' → є      T' → andFT'                          T' → є                     T' → є
F   |                            F → not G   F → G              F → G    F → G
G   |                                        G → (E)            G → 0    G → 1

11. Explain the organization of runtime storage in detail.


 The executing target program runs in its own logical address space in which each program
value has a location.
 The management and organization of this logical address space is shared between the
complier, operating system and target machine. The operating system maps the
logical address into physical addresses, which are usually spread throughout memory.
Typical subdivision of run-time memory:

 Run-time storage comes in blocks, where a byte is the smallest unit of addressable
memory. Four bytes form a machine word. Multibyte objects are stored in
consecutive bytes and given the address of first byte.
 The storage layout for data objects is strongly influenced by the addressing constraints of
the target machine.
 A character array of length 10 needs only enough bytes to hold 10 characters, but a compiler
may allocate 12 bytes to get alignment, leaving 2 bytes unused.
 This unused space due to alignment considerations is referred to as padding.
 The size of some program objects may be known at compile time; these may be placed in an
area called static.
 The dynamic areas used to maximize the utilization of space at run time are stack
and heap.

Activation records:
 Procedure calls and returns are usually managed by a run time stack called the control
stack.
 Each live activation has an activation record on the control stack, with the root of the
activation tree at the bottom; the latest activation has its record at the top of the stack.
 The contents of the activation record vary with the language being implemented.
 The diagram below shows the contents of activation record.

 Temporary values such as those arising from the evaluation of expressions.


 Local data belonging to the procedure whose activation record this is.
 A saved machine status, with information about the state of the machine just before the
call to procedures.
 An access link may be needed to locate data needed by the called procedure but found
elsewhere.
 A control link pointing to the activation record of the caller.
 Space for the return value of the called functions, if any. Again, not all called procedures
return a value, and if one does, we may prefer to place that value in a register for
efficiency.
 The actual parameters used by the calling procedure. These are not placed in activation
record but rather in registers, when possible, for greater efficiency.

12. Explain about static and stack allocation in storage allocation strategies.
STATIC ALLOCATION
 In static allocation, names are bound to storage as the program is compiled, so there is no
need for a run-time support package.
 Since the bindings do not change at run-time, everytime a procedure is activated, its
names are bound to the same storage locations.
 Therefore values of local names are retained across activations of a procedure. That is,
when control returns to a procedure the values of the locals are the same as they were
when control left the last time.
 From the type of a name, the compiler decides the amount of storage for the name and
decides where the activation records go. At compile time, we can fill in the addresses at
which the target code can find the data it operates on.
STACK ALLOCATION OF SPACE
 All compilers for languages that use procedures, functions or methods as units of
user-defined actions manage at least part of their run-time memory as a stack.
 Each time a procedure is called , space for its local variables is pushed onto a stack, and
when the procedure terminates, that space is popped off the stack.

Calling sequences:
 Procedures called are implemented in what is called as calling sequence, which consists
of code that allocates an activation record on the stack and enters information into its
fields.
 A return sequence is similar: code to restore the state of the machine so the
calling procedure can continue its execution after the call.
 The code in calling sequence is often divided between the calling procedure (caller) and
the procedure it calls (callee).
 When designing calling sequences and the layout of activation records, the following
principles are helpful:
 Values communicated between caller and callee are generally placed at the beginning
of the callee‘s activation record, so they are as close as possible to the caller‘s activation
record.
 Fixed length items are generally placed in the middle. Such items typically include the
control link, the access link, and the machine status fields.
 Items whose size may not be known early enough are placed at the end of the activation
record. The most common example is dynamically sized array, where the value of one of
the callee‘s parameters determines the length of the array.
 We must locate the top-of-stack pointer judiciously. A common approach is to have it
point to the end of fixed-length fields in the activation record. Fixed-length data can then
be accessed by fixed offsets, known to the intermediate-code generator,relative to the top-
of-stack pointer.

The calling sequence and its division between caller and callee are as follows.
 The caller evaluates the actual parameters.
 The caller stores a return address and the old value of top_sp into the callee‘s
 activation record. The caller then increments the top_sp to the respective positions.
 The callee saves the register values and other status information.
 The callee initializes its local data and begins
execution. A suitable, corresponding return sequence is:
 The callee places the return value next to the parameters.
 Using the information in the machine-status field, the callee restores top_sp and other
registers, and then branches to the return address that the caller placed in the status
field.
 Although top_sp has been decremented, the caller knows where the return value
is, relative to the current value of top_sp; the caller therefore may use that value.
Variable length data on stack:

 The run-time memory management system must deal frequently with the allocation of
space for objects, the sizes of which are not known at the compile time, but which are
local to a procedure and thus may be allocated on the stack.
 The reason to prefer placing objects on the stack is that we avoid the expense of garbage
collecting their space.
 The same scheme works for objects of any type if they are local to the procedure called
and have a size that depends on the parameters of the call.

 Procedure p has three local arrays, whose sizes cannot be determined at compile time.
 The storage for these arrays is not part of the activation record for p.
 Access to the data is through two pointers, top and top-sp. Here the top marks the actual
top of stack; it points the position at which the next activation record will begin.
 The second top-sp is used to find local, fixed-length fields of the top activation record.
 The code to reposition top and top-sp can be generated at compile time, in terms of sizes
that will become known at run time.
13. Eliminate left recursion for the grammar.
EE+T/T, TT*F/F,F(E)/id (April/May-08,Marks 2)
Solution:
AAα/β then convert it to
AβA’
A‘αA’
A‘є

ETE’
E‘+TE’/є
TFT’
T‘*FT’/є
F(E)/id

UNIT III : Intermediate Code Generation


PART-A
1. Name some variety of intermediate forms. (or) Write the various three address code form
if intermediate code.(May/June 2014)
o Postfix (polish) notation
o Syntax tree
o Three-address code
o Quadruple
o Triple
2. Write the syntax for a three-address code statement, and mention its properties.
Syntax: A = B op C
 A three-address instruction has at most one operator in addition to the assignment symbol;
the compiler has to decide the order in which operations are to be done.

 The compiler must generate a temporary name to hold the value computed by each
instruction.
 Some three-address instructions can have less than three operands.
3. What are the two representations to express intermediate languages?
Graphical representation and linear representation.
4. Draw syntax tree for the expression a=b*-c+b*-c (Nov/Dec 2012)

5. Explain postfix notation.


It is the linearized representation of the syntax tree: a list of the nodes of the syntax tree in
which a node appears immediately after its children. E.g., for the above syntax tree:
a b c uminus * b c uminus * + =
6. What are the different types of three address statements?
 Assignment statements of the form x=y op z.
 Assignment statements of the form x= op y.
 Copy statements of the form x=y.
 An unconditional jump, goto L.
 Conditional jumps, if x relop y goto L.
 Procedure call statements, param x and call p , n.
 Indexed assignments of the form, x=y[i] , x[i]=y.
 Address and pointer assignments.
7. Define quadruple and give one example.
A quadruple is a data structure with 4 fields like operator, argument-1 ,argument-2 and result.

8. Define triple and give one example.


A triple is a data structure with 3 fields like operator, argument-1, argument-2.
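As an illustration, the statement a = b * (-c) could be represented in both forms roughly as follows; the temporary names t1 and t2 and the tuple encoding are hypothetical:

```python
# Quadruples: (operator, arg1, arg2, result) for t1 = -c; t2 = b * t1; a = t2
quads = [
    ("uminus", "c", None, "t1"),
    ("*", "b", "t1", "t2"),
    ("=", "t2", None, "a"),
]

# Triples drop the result field; an operand that is the result of an earlier
# triple is written as that triple's position.
triples = [
    ("uminus", "c", None),   # (0)
    ("*", "b", 0),           # (1)  second operand is the result of triple (0)
    ("=", "a", 1),           # (2)  a gets the result of triple (1)
]
```

The position references in the triples are what make statement reordering hard, which is exactly the merit of quadruples noted in question 9.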

9. What is the merit of quadruples?


If we move a statement computing a variable A, then the statements using A
require no change. That is, A can be computed anywhere before the utilization of A.
10. What are the two methods of translation to find three-address statements in Boolean
expression?
 Numerical method
 Control-flow method
11. Generate three address code for “ if A<B then 1 else 0” , using numerical method.

1. if A < B goto (4)
2. T1 = 0
3. goto (5)
4. T1 = 1
5. …
12. How the value of synthesized attribute is computed?
It was computed from the values of attributes at the children of that node in the parse tree.
13. How the value of inherited attribute is computed?
It was computed from the value of attributes at the siblings and parent of that node.
14. Define annotated parse tree and give one example.
A parse tree showing the values of attributes at each node is called an annotated parse tree. Ex:
3*5+4

15. Define back patching. (Nov/Dec 2012)


To generate three address code, 2 passes are necessary. In first pass, labels are not specified. These
statements are placed in a list. In second pass, these labels are properly filled, is called back
patching.
16.Explain Back patching with an example of your own.(May/June 2012)
Backpatching is the activity of filling in unspecified label information using appropriate
semantic actions during the code generation process.
For example: if the expression is A<B OR C<D AND P<Q, then the generated code begins as
100 if A<B goto _
17. What are the three functions used for back patching?
(i) Makelist (ii) Merge (p1, p2) (iii)Backpatch (p, i)
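A minimal sketch of the three helpers, assuming quadruples are kept as strings with '_' marking the unfilled label slot (the representation is an assumption of this sketch):

```python
# 'code' is the growing list of three-address instructions; each jump is
# emitted with an unfilled target written as '_'.
code = []

def makelist(i):
    return [i]                 # a new list holding only quadruple index i

def merge(p1, p2):
    return p1 + p2             # concatenation of two lists of indices

def backpatch(plist, label):
    for i in plist:            # fill in the target label of each listed quad
        code[i] = code[i].replace("_", str(label))

# Pass 1: emit 'if A < B goto _' with its target still unknown.
code.append("if A < B goto _")
truelist = makelist(0)
# Pass 2: the target becomes known (say quadruple 104) and is patched in.
backpatch(truelist, 104)
```

After the call to backpatch, the quadruple reads "if A < B goto 104", which is exactly the second-pass filling described in question 15.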
18.When procedure call occurs, what are the steps to be takes placed?
 The state of the calling procedure must be saved, so that it can resume after completion of
the called procedure.
 The return address is saved; the called routine must transfer control to this location after
it completes.
19. What are the demerits in generating 3-address code for procedure calls?
Consider b = abc(I, J). It is very difficult to identify whether abc(I, J) is an array reference
or a call to the procedure abc.
20. What information is entered into the symbol table?
 The string of characters denoting the name
 Attributes of the name
 Parameters
 An offset describing the position in storage to be allocated for the name
21.Define three-address code.
Three address code is a sequence of statements of the general form. x : = y op z where x, y, z –
names, constants or compiler generated temporaries .op – operator i.e., fixed or floating point
arithmetic operator or logical operator.
22. Define Boolean expression.
Boolean expressions are composed of the boolean operators (and, or, not) applied to elements.
That are boolean variables or relational expressions.
23. Define short-circuit code.
It translates a Boolean expression into three-address code without generating code for any of
the Boolean operators and without necessarily evaluating the entire expression. This style of
evaluation is called short-circuit or jumping code.
24.What is meant by coercion ? (Nov/Dec 2013)

A coercion occurs if the type of an operand is automatically converted to the type expected by the
operator. In an expression like 2*3.14, the usual transformation is to convert the integer 2 into an
equivalent floating –point number, 2.0 and then perform a floating-point operation on the resulting
pair of floating-point operands. The language definition specifies the allowable coercions.
25.Give the syntax-directed definition for if-else statement.
1. S → if E then S1
   E.true := new_label()
   E.false := S.next
   S1.next := S.next
   S.code := E.code || gen_code(E.true, ':') || S1.code
2. S → if E then S1 else S2
   E.true := new_label()
   E.false := new_label()
   S1.next := S.next
   S2.next := S.next
   S.code := E.code || gen_code(E.true, ':') || S1.code || gen_code('goto', S.next) ||
             gen_code(E.false, ':') || S2.code
26. Translate the arithmetic expression a*-(b+c) into syntax tree and postfix
notation.(Nov/Dec 2013)
i) Syntax tree: * at the root, with left child a and right child uminus; uminus has child +,
whose children are b and c.
ii) Postfix notation:
a b c + uminus *

PART-B
1. How would you generate the intermediate code for the flow of control statements? Explain with
examples. (16)
2. (a)What are the various ways of calling procedures? Explain in detail.(8)
(b)What is a three-address code? Mention its types. How would you implement the three address
statements? Explain with examples. (8)
3. How would you generate intermediate code for the flow of control statements? Explain with
examples. (16)
4. (a)Describe the method of generating syntax-directed definition for Control statements.(8)
(b)Give the semantic rules for declarations in a procedure. (8)
5. (a) How Back patching can be used the generate code for Boolean expressions and flow of
control statements. (8)
(b)Explain how the types and relative addresses of declared names are computed and how scope
information is dealt with. (8)
6. (a) Describe in detail the syntax-directed translation of case statements.(8)
(b) Explain in detail the translation of assignment statements. (8)

UNIT IV: Code Generation

PART-A
1. Define code generation.
The code generation is the final phase of the compiler. It takes an intermediate representation of
the source program as the input and produces an equivalent target program as the output.
2. Define Target machine.
The target computer is a byte-addressable machine with four bytes to a word and n general-purpose
registers R0, R1, …, Rn-1. It has two-address instructions of the form op source, destination,
in which op is an op-code, and source and destination are data fields.
3. How do you calculate the cost of an instruction?
The cost of an instruction can be computed as one plus cost associated with the source and
destination addressing modes given by added cost.
MOV R0,R1 cost is 1
MOV R1,M cost is 2
SUB 5(R0),*10(R1) cost is 3
4.Define basic block. (Nov/Dec 2013)
A basic block contains sequence of consecutive statements, which may be entered only at the
beginning and when it is entered it is executed in sequence without halt or possibility of branch.
5. What are the rules to find “ leader” in basic block?
 The first statement of the program is a leader.
 Any statement that is the target of a conditional or unconditional goto is a leader.
 Any statement that immediately follows a conditional or unconditional goto is a leader.
6. Define flow graph.(Nov/Dec 2013)
Relationships between basic blocks are represented by a directed graph called flow graph.
7. What do you mean by DAG?
It is Directed Acyclic Graph. In this common sub expressions are eliminated. So it is a compact
way of representation.
8. list the advantages of DAGs (Nov/Dec 2012)
 It automatically detects common subexpressions.
 We can determine which identifiers have their values used in the block.
 We can determine which statements compute values that could be used outside the block.
 It reconstructs a simplified list of quadruples, taking advantage of common subexpressions
and not performing assignments of the form a = b unless necessary.
9. What is meant by registers and address descriptor?
Register descriptor: keeps track of (1) what registers are currently in use and (2) what
registers are empty.
Address descriptor: keeps track of the location where the current value of
a name can be found at run time.
10. Explain heuristic ordering for DAGs.
It is based on the node-listing algorithm:

while unlisted interior nodes remain do
begin
    select an unlisted node n, all of whose parents have been listed;
    list n;
    while the leftmost child m of n has no unlisted parents and is not a leaf do
    begin
        /* since n was just listed, surely m is not yet listed */
        list m;
        n := m;
    end
end
11. Write the labeling algorithm.
if n is a leaf then
    if n is the leftmost child of its parent then
        LABEL(n) := 1
    else
        LABEL(n) := 0
else
begin
    let n1, n2, …, nk be the children of n ordered by LABEL, so that
    LABEL(n1) >= LABEL(n2) >= … >= LABEL(nk);
    LABEL(n) := max over 1 <= i <= k of (LABEL(ni) + i - 1)
end
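For binary expression trees the algorithm reduces to a short recursion; the tuple encoding of interior nodes as (left, right) pairs and strings as leaves is an assumption of this sketch:

```python
# Labeling (Ershov-number) sketch for binary expression trees.
# A node is (left, right); a leaf is a string. Only the leftmost leaf of the
# whole expression gets label 1, every other leaf gets 0.
def label(node, leftmost=True):
    if isinstance(node, str):
        return 1 if leftmost else 0
    l = label(node[0], leftmost)       # left child keeps the leftmost flag
    r = label(node[1], False)          # right child is never leftmost
    # children ordered by label: LABEL(n) = max_i (LABEL(n_i) + i - 1),
    # which for two children is max(larger, smaller + 1)
    hi, lo = max(l, r), min(l, r)
    return max(hi, lo + 1)
```

For a single operation like a + b the label is 1; for a balanced tree such as (a + b) * (c + d) it is 2, the minimum number of registers needed to evaluate the tree without spills.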
12. Explain multi-register operations.
If an operation requires more than two registers, the labeling formula can be modified
accordingly.

13. Define live variable.(Nov/Dec 2012)


A variable is live at a point in a program if its value can be used subsequently.
14. What are the different storage allocation strategies?
1) Static allocation.-It lays out storage for all data objects at compile time.
2) Stack allocation.-It manages the runtime storage as a stack.
3) Heap allocation.-It allocates and de-allocates storage as needed at runtime from a data
area.
15.Define formal parameters.
The identifiers appearing in procedure definitions are special and are called formal parameters.
16.Define l-value and r-value.
The l-value refers to the storage represented by an expression and r-value refers to the value
contained in the storage.
17.State the problems in code generation.
The three main problems in code generation are: deciding what machine instructions to generate,
deciding in what order the computations should be done, and deciding which registers to use.
18. Define actual parameters
The arguments passed to a called procedure are known as actual parameters.
19. What is meant by an activation of the procedure?
An execution of a procedure is referred to as an activation of then procedure.
20. Define activation tree.
An activation tree depicts the way control enters and leaves activations.
21. What is the use of control stack?
It keeps track of live procedure activations. Push the node for activation
onto the control stack as the activation begins and to pop the node when the activation ends.
22. What is meant by scope of declaration?

The portion of the program to which a declaration applies is called the scope of that declaration.
An occurrence of a name in a procedure is said to be local to the procedure if it is in the scope of
declaration within the procedure; otherwise the occurrence is said to be nonlocal.
23. What does the runtime storage hold?
Runtime storage holds: 1) the generated target code, 2) data objects, and 3) a counterpart of
the control stack to keep track of procedure activations.
24. Define activation record
Information needed by a single execution of a procedure is managed using
a contiguous block of storage called an activation record.
25. Define access link
It refers to nonlocal data held in other activation records.
26.What is the use of Next-use information?(Nov/Dec 2013)
 If a register contains a value for a name that is no longer needed, we should re-use that
register for another name (rather than using a memory location)
 So it is useful to determine whether/when a name is used again in a block
 Definition: Given statements i, j, and variable x,
– If i assigns a value to x, and
– j has x as an operand, and
– No intervening statement assigns a value to x,
– Then j uses the value of x computed at i

PART-B
Unit 4
1. (a) Explain the issues in design of code generator. (8)
ISSUES IN THE DESIGN OF A CODE GENERATOR
The following issues arise during the code generation phase:
1. Input to code generator
2. Target program
3. Memory management
4. Instruction selection
5. Register allocation
6. Evaluation order
1. Input to code generator:
The input to the code generation consists of the intermediate representation of the source
program produced by front end , together with information in the symbol table to determine run-
time addresses of the data objects denoted by the names in the intermediate representation.
 Intermediate representation can be:
a. Linear representation such as postfix notation
b. Three-address representation such as quadruples
c. Virtual machine representation such as stack machine code
d. Graphical representations such as syntax trees and DAGs

Prior to code generation, the source program must have been scanned, parsed and translated into
intermediate representation, along with the necessary type checking. Therefore, the input to
code generation is assumed to be error-free.
2. Target program:
The output of the code generator is the target program.
The output may be:
a. Absolute machine language - it can be placed in a fixed memory location and can be executed
immediately.
b. Relocatable machine language - it allows subprograms to be compiled separately.
c. Assembly language - code generation is made easier.
3. Memory management:
Names in the source program are mapped to addresses of data objects in run-time memory
by the front end and code generator.
It makes use of symbol table, that is, a name in a three-address statement refers to a symbol-table
entry for the name.
Labels in three-address statements have to be converted to addresses of instructions. For example,
j: goto i generates a jump instruction as follows:
If i < j, a backward jump instruction with target address equal to the location of code for quadruple i is
generated.
If i > j, the jump is forward. We must store on a list for quadruple i the location of the first machine
instruction generated for quadruple j. When i is processed, the machine locations for all instructions
that jump forward to i are filled in.
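The backpatching of forward jumps described above can be sketched as follows (an illustrative Python rendering; the quadruple and instruction formats are assumptions, not from the text):

```python
# Forward jumps to quadruples not yet translated are emitted with a
# placeholder target; the placeholder is filled in once the machine
# location of the target quadruple becomes known.

def assemble(quads):
    code = []                 # machine instructions, e.g. ("GOTO", addr)
    loc_of = {}               # quad index -> machine location
    pending = {}              # quad index -> [code locations awaiting that address]
    for i, (op, target) in enumerate(quads):
        loc_of[i] = len(code)
        for spot in pending.pop(i, []):      # patch jumps waiting on quad i
            code[spot] = ("GOTO", loc_of[i])
        if op == "goto":
            if target in loc_of:             # backward jump: address known
                code.append(("GOTO", loc_of[target]))
            else:                            # forward jump: patch later
                pending.setdefault(target, []).append(len(code))
                code.append(("GOTO", None))
        else:
            code.append((op, target))
    return code

quads = [("op", "a"), ("goto", 3), ("op", "b"), ("goto", 0)]
code = assemble(quads)
```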
4. Instruction selection:
The instructions of the target machine should be complete and uniform.
Instruction speeds and machine idioms are important factors when the efficiency of the target program is
considered.
The quality of the generated code is determined by its speed and size.
For example, the three-address statement a := b + c can be translated into the sequence
MOV b, R0
ADD c, R0
MOV R0, a
5. Register allocation
Instructions involving register operands are shorter and faster than those involving
operands in memory.
The use of registers is subdivided into two subproblems :
Register allocation – the set of variables that will reside in registers at a point in the program is
selected.

Register assignment – the specific register that a variable will reside in is picked.
Certain machines require even-odd register pairs for some operands and results. For example,
consider the division instruction of the form D x, y where x, the dividend, occupies the even register
of an even/odd register pair and y is the divisor; after division, the even register holds the remainder
and the odd register holds the quotient.
6. Evaluation order
The order in which the computations are performed can affect the efficiency of the target
code. Some computation orders require fewer registers to hold intermediate results than others.
1.(b)Explain peephole optimization. (8)
A simple but effective technique for locally improving the target code is peephole optimization, a
method for trying to improve the performance of the target program by examining a short
sequence of target instructions (called the peephole) and replacing these instructions by a shorter
or faster sequence, whenever possible.
A statement-by-statement code-generation strategy often produces target code that contains
redundant instructions and suboptimal constructs.
The quality of such target code can be improved by applying "optimizing" transformations to the
target program. The term "optimizing" is somewhat misleading because there is no guarantee that
the resulting code is optimal under any mathematical measure.
The characteristics of peephole optimization are,
 Redundant instruction elimination.
 Flow of control optimization
 Algebraic simplifications
 Use of machine idioms

Redundant- instruction elimination


If we see the instruction sequence
(1) MOV R0, a
(2) MOV a, R0
we can delete instruction (2) because whenever (2) is executed, (1) will ensure that the value of a
is already in register R0. Note that if (2) had a label we could not be sure that (1) was always
executed immediately before (2) and so we could not remove (2). Put another way, (1) and (2)
have to be in the same basic block for this transformation to be safe.
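The transformation just described can be sketched as a one-pass scan (a hedged Python sketch; the (label, op, src, dst) instruction tuples are an assumed representation):

```python
# Drop a "MOV a, R" that immediately follows "MOV R, a", provided the
# second instruction carries no label (so both are in the same basic block).

def eliminate_redundant_moves(code):
    out = []
    for instr in code:
        label, op, src, dst = instr
        if (out and label is None and op == "MOV"
                and out[-1][1] == "MOV"
                and out[-1][2] == dst and out[-1][3] == src):
            continue                      # value is already in place: drop it
        out.append(instr)
    return out

code = [(None, "MOV", "R0", "a"), (None, "MOV", "a", "R0"), (None, "ADD", "b", "R0")]
optimized = eliminate_redundant_moves(code)
```

A labeled second instruction is kept, matching the safety condition in the text.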
Unreachable Code
Another opportunity for peephole optimization is the removal of unreachable instructions. An
unlabeled instruction immediately following an unconditional jump may be removed. This
operation can be repeated to eliminate a sequence of instructions.
Example:
In C, the source code might look like:
#define debug 0
……
if ( debug )
{
print debugging information
}

Therefore the statements printing debugging information are unreachable and can be eliminated
one at a time.
Flow-of-Control Optimizations
If we have jumps to jumps, then these unnecessary jumps can be eliminated in either intermediate
code or the target code as follows.
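One way to carry this out can be sketched in Python (an illustrative rendering; the (op, operand) instruction encoding is an assumption): a jump whose target is itself an unconditional jump is retargeted to the end of the chain.

```python
# Collapse chains of jumps-to-jumps: goto L1 where L1: goto L2 becomes goto L2.

def collapse_jump_chains(code):
    """code: list of ('goto', label), ('label', name) or other (op, operand) pairs."""
    target = {}                           # label -> instruction that follows it
    for i, instr in enumerate(code):
        if instr[0] == "label" and i + 1 < len(code):
            target[instr[1]] = code[i + 1]
    def final(label, seen=()):
        nxt = target.get(label)
        if nxt is not None and nxt[0] == "goto" and nxt[1] not in seen:
            return final(nxt[1], seen + (label,))  # follow the chain, avoiding cycles
        return label
    return [("goto", final(t)) if op == "goto" else (op, t)
            for (op, t) in code]

code = [("goto", "L1"), ("add", "x"), ("label", "L1"),
        ("goto", "L2"), ("label", "L2"), ("ret", None)]
shortened = collapse_jump_chains(code)
```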

Algebraic Simplification
There is no end to the amount of algebraic simplification that can be attempted through
peephole optimization. However, only a few algebraic identities occur frequently enough that it is
worth considering implementing them. For example, statements such as
x := x + 0

or
x := x * 1
are often produced by straightforward intermediate code-generation algorithms, and they can be
eliminated easily through peephole optimization.
Reduction in Strength
Reduction in strength replaces expensive operations by equivalent cheaper ones on the target
machine. The exponentiation operator in the statement
x := y ** 2
usually requires a function call to implement. Using an algebraic transformation, this statement
can be replaced by the cheaper but equivalent statement
x := y * y
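The identities above, together with the strength reduction just shown, can be applied by simple pattern matching over three-address statements (an illustrative Python sketch; the tuple format is an assumption):

```python
# Algebraic simplification and strength reduction over statements
# (result, op, arg1, arg2): remove x := x + 0 and x := x * 1, and
# rewrite x := y ** 2 as x := y * y.

def simplify(stmt):
    result, op, a, b = stmt
    if op == "+" and b == "0" and a == result:
        return None                       # x := x + 0  -> eliminated
    if op == "*" and b == "1" and a == result:
        return None                       # x := x * 1  -> eliminated
    if op == "**" and b == "2":
        return (result, "*", a, a)        # x := y ** 2 -> x := y * y
    return stmt

block = [("x", "+", "x", "0"), ("x", "**", "y", "2"), ("z", "*", "z", "1")]
optimized = [s for s in (simplify(st) for st in block) if s is not None]
```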
Use of Machine Idioms
The target machine may have hardware instructions to implement certain specific operations
efficiently. Detecting situations that permit the use of these instructions can reduce execution time
significantly. For example, some machines have auto-increment and auto-decrement addressing
modes. The use of these modes greatly improves the quality of code when pushing or popping a
stack, as in parameter passing. These modes can also be used in code for statements like i := i + 1.

2. (a)Discuss run time storage management of a code generator.(8)


Information needed during an execution of a procedure is kept in a block of storage called
an activation record; storage for names local to the procedure also appears in the activation record.
Since run-time allocation and de-allocation of activation records occurs as part of the procedure
call and return sequences, we focus on the following three-address statements:
 call,
 return,
 halt, and
 action, a placeholder for other statements.
There are two standard storage allocation
strategies:
 Static allocation
 Stack allocation
Static Allocation
An activation record for a procedure has fields to hold parameters, results, machine-status
information, local data, temporaries and the like.
Consider the code needed to implement static allocation. A call statement in the intermediate code
is implemented by a sequence of two target-machine instructions. A MOV instruction saves the
return address, and a GOTO transfers control to the target code for the called procedure:

The attributes “callee.static_area” and “callee.code_area” are constants referring to the address
of the activation record and the first instruction for the called procedure, respectively. The source
#here + 20 in the MOV instruction is the literal return address; it is the address of the instruction
following the GOTO instruction.
A return from procedure "callee" is implemented by
GOTO *callee.static-area
which transfers control to the address saved at the beginning of the activation record.

Fig: Static Allocation

Target code for the above Three-address code is,

Stack Allocation

Static allocation can become stack allocation by using relative addresses for storage in activation
records. The position of the record for an activation of a procedure is not known until run time. In
stack allocation, this position is usually stored in a register, so words in the activation record can
be accessed as offsets from the value in this register.
When a procedure call occurs, the calling procedure increments SP and transfers control to the
called procedure. After control returns to the caller, it decrements SP, thereby de-allocating the
activation record of the called procedure.

The attribute "callee.recordsize" represents the size of an activation record, so the ADD
instruction leaves SP pointing to the beginning of the next activation record. The source #here+16
in the MOV instruction is the address of the instruction following the GOTO; it is saved in the
address pointed to by SP.

 The return sequence consists of two parts. The called procedure transfers control to the
return address using GOTO *0(SP) /* return to caller */
The reason for using *0(SP) in the GOTO instruction is that we need two levels of
indirection: 0(SP) is the address of the first word in the activation record and *0(SP) is the
return address saved there.

 The second part of the return sequence is in the caller, which decrements SP, thereby
restoring SP to its previous value. That is, after the subtraction, SP points to the beginning
of the activation record of the caller: SUB #caller.recordsize, SP

Three address code to illustrate stack allocation

Target code of the input

2. (b)Explain DAG representation of the basic blocks with an example.(8)
Directed acyclic graphs (DAGs) are useful data structures for implementing transformations on
basic blocks. A dag gives a picture of how the value computed by each statement in a basic block
is used in subsequent statements of the block.
A dag for a basic block (or just dag) is a directed acyclic graph with the following labels on nodes:
 Leaves are labeled by unique identifiers, either variable names or constants.
 Interior nodes are labeled by an operator symbol.
 Nodes are also optionally given a sequence of identifiers for labels. The intention is that
interior nodes represent computed values, and the identifiers labeling a node are deemed to
have that value.
Dag Construction
Algorithm: Constructing a dag.
Input: A basic block,
Output: A dag for the basic block containing the following information:

 A label for each node. For leaves the label is an identifier (constants permitted), and for
interior nodes, an operator symbol.
 For each node a (possibly empty) list of attached identifiers (constants not permitted here).
Method:
Initially, we assume there are no nodes, and node() is undefined for all arguments. Suppose
the "current" three-address statement is either (i) x := y op z, (ii) x := op y, or (iii) x := y. We refer
to these as cases (i), (ii), and (iii). We treat a relational operator like if i <= 20 goto as case (i),
with x undefined.
 If node(y) is undefined, create a leaf labeled y, and let node(y) be this node. In case (i), if
node(z) is undefined, create a leaf labeled z and let that leaf be node(z).

 In case (i), determine if there is a node labeled op, whose left child is node(y) and whose
right child is node(z). (This check is to catch common sub-expressions.) If not, create such
a node. In either event, let n be the node found or created. In case (ii), determine whether
there is a node labeled op, whose lone child is node(y). If not, create such a node, and let n
be the node found or created. In case (iii),let n be node(y).

 Delete x from the list of attached identifiers for node(x). Append x to the list of attached
identifiers for the node n found in step (2), and set node(x) to n.
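The method above can be rendered as a short Python sketch (case (i) statements only, with constants allowed as leaves; the representation is illustrative, not from the text):

```python
# DAG construction for a basic block of statements x := y op z.

class DagNode:
    def __init__(self, label, children=()):
        self.label = label          # identifier/constant for leaves, operator inside
        self.children = tuple(children)
        self.ids = []               # attached identifiers

def build_dag(block):
    node = {}                       # name -> DagNode currently holding its value
    nodes = []
    def leaf(name):                 # step (1): create a leaf if node(name) undefined
        if name not in node:
            n = DagNode(name)
            node[name] = n
            nodes.append(n)
        return node[name]
    for x, op, y, z in block:
        ny, nz = leaf(y), leaf(z)
        n = next((m for m in nodes  # step (2): catch common sub-expressions
                  if m.label == op and m.children == (ny, nz)), None)
        if n is None:
            n = DagNode(op, (ny, nz))
            nodes.append(n)
        if x in node and x in node[x].ids:   # step (3): move x to node n
            node[x].ids.remove(x)
        n.ids.append(x)
        node[x] = n
    return node, nodes

# Both statements compute 4 * i, so they share one interior node.
node, nodes = build_dag([("t1", "*", "4", "i"), ("t2", "*", "4", "i")])
```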

Example:
Consider the Following three address code,

The first statement, t1:= 4 * i


By step (1), create leaves labeled 4 and i
By step (2), create a node labeled *
By step (3), attach identifier t1 to it
The DAG at this stage is,

For the second statement, t2:=a[t1]


Create a new leaf labeled a
Find the previously created node (t1)
Create a new node labeled [] and attach the node for a and t1 as children to it.

The final DAG for the given three address code is,

Applications of DAGs
There are several pieces of useful information that we can obtain from a dag. They are,
 We automatically detect common sub-expressions.
 We can determine which identifiers have their values used in the block
 We can determine which statements compute values that could be used outside the block.

3. Explain the simple code generator with a suitable example. (8)


The code-generation strategy in this section generates target code for a sequence of three-address
statement. It considers each statement in turn, remembering if any of the operands of the statement
are currently in registers, and taking advantage of that fact if possible.
Register and Address Descriptors
The code-generation algorithm uses descriptors to keep track of register contents and addresses
for names.
A register descriptor keeps track of what is currently in each register. It is consulted whenever a
new register is needed. We assume that initially the register descriptor shows that all registers are
empty.
An address descriptor keeps track of the location (or locations) where the current value of the
name can be found at run time. The location might be a register, a stack location, a memory
address, or some set of these.
A Code-Generation Algorithm
The code-generation algorithm takes as input a sequence of three-address statements constituting
a basic block. For each three-address statement of the form x := y op z we perform the following
actions:

Function getreg()
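The body of getreg is not reproduced above; the usual strategy can be sketched as follows (a hedged, simplified rendering; the descriptor representations are assumptions): for x := y op z, prefer the register holding only y when y has no next use, then an empty register, then a register whose values can be safely discarded.

```python
def getreg(y, reg_desc, addr_desc, y_dead_after):
    """reg_desc: register -> set of names it holds;
    addr_desc: name -> set of locations ('mem', registers) holding its value."""
    # 1. If y's value will not be needed again, reuse the register holding only y.
    if y_dead_after:
        for r, names in reg_desc.items():
            if names == {y}:
                return r
    # 2. Otherwise pick an empty register, if one exists.
    for r, names in reg_desc.items():
        if not names:
            return r
    # 3. Otherwise choose a victim whose values also live in memory,
    #    so it can be reused without emitting stores.
    for r, names in reg_desc.items():
        if all("mem" in addr_desc.get(n, set()) for n in names):
            return r
    # Last resort: any register; the caller must spill its contents first.
    return next(iter(reg_desc))
```

The register and address descriptors here correspond to the descriptors introduced just above.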

Example:

The assignment d:= (a-b) + (a-c) + (a-c) might be translated into the following three-
address code sequence

with d live at the end.

Code sequence for the example is,

Conditional Statements
 Machines implement conditional jumps in one of two ways. One way is to branch if the
value of a designated register meets one of the six conditions: negative, zero, positive,
nonnegative, nonzero, and non-positive.
 A second approach, common to many machines, uses a set of condition codes to indicate
whether the last quantity computed or loaded into a register is negative, zero, or positive.
Often a compare instruction (CMP in our machine) has the desirable property that it sets
the condition code without actually computing a value.

4. What are the different storage allocation strategies? Explain.

A different storage-allocation strategy is used in each of the data areas. They are,
 Static allocation lays out storage for all data objects at compile time.
 Stack allocation manages the run-time storage as a stack.
 Heap allocation allocates and de-allocates storage as needed at run time from a data area
known as a heap.
Static Allocation
In static allocation, names are bound to storage as the program is compiled, so there is no need for
a run-time support package. Since the bindings do not change at run time, every time a procedure
is activated, its names are bound to the same storage locations. This property allows the values of
local names to be retained across activations of a procedure. That is, when control returns to a
procedure, the values of the locals are the same as they were when control left the last time.
However, some limitations go along with using static allocation alone.
 The size of a data object and constraints on its position in memory must be known at
compile time.
 Recursive procedures are restricted, because all activations of a procedure use the same
bindings for local names.
 Data structures cannot be created dynamically.
Stack Allocation
Stack allocation is based on the idea of a control stack; storage is organized as a stack, and
activation records are pushed and popped as activations begin and end, respectively. Storage for
the locals in each call of a procedure is contained in the activation record for that call. Thus locals
are bound to fresh storage in each activation, because a new activation record is pushed onto the
stack when a call is made. Furthermore, the values of locals are deleted when the activation ends;
that is, the values are lost because the storage for locals disappears when the activation record is
popped.
Suppose that register top marks the top of the stack. At run time, an activation record can be
allocated and de-allocated by incrementing and decrementing top, respectively, by the size of the
record.
Calling Sequences
Procedure calls are implemented by generating what are known as calling sequences in the
target code. A call sequence allocates an activation record and enters information into its fields. A
return sequence restores the state of the machine so the calling procedure can continue execution.
The code in a calling sequence is often divided between the calling procedure (the caller) and the
procedure it calls (the callee).
The call sequence is:
 The caller evaluates the actual parameters.
 The caller stores a return address and the old value of top-sp into the callee's activation
record.
 The callee saves register values and other status information.
 The callee initializes its local data and begins execution.

The division of tasks between caller and callee is shown in following figure.

Fig: Division of tasks between caller and callee
The return sequence is:
 The callee places a return value next to the activation record of the caller
 Using the information in the status field, the callee restores top-sp and other registers and
branches to a return address in the caller‘s code.
 Although top-sp has been decremented, the caller can copy the returned value into its own
activation record and use it to evaluate an expression.

Variable-Length Data
A common strategy for handling variable-length data is shown in the following figure, where
procedure p has three local arrays.

Fig: Access to dynamically allocated arrays

The storage for these arrays is not part of the activation record for p; only a pointer to the
beginning of each array appears in the activation record. The relative addresses of these pointers
are known at compile time, so the target code can access array elements through the pointers.
Also in the above figure, procedure q is called by p. The activation record for q begins after the
arrays of p, and the variable-length arrays of q begin beyond that. Access to data on the stack is
through two pointers, top and top-sp. The first of these marks the actual top of the stack; it points
to the position at which the next activation record will begin. The second is used to find local data.
Dangling References
Whenever storage can be de-allocated, the problem of dangling references arises. A
dangling reference occurs when there is a reference to storage that has been de-allocated. It is a
logical error to use dangling references, since the value of de-allocated storage is undefined
according to the semantics of most languages.
Heap Allocation
The stack allocation strategy discussed above cannot be used if either of the following is possible.
 The values of local names must be retained when activation ends
 A called activation outlives the caller. This possibility cannot occur for those languages
where activation trees correctly depict the flow of control between procedures.
Heap allocation parcels out pieces of contiguous storage, as needed for activation records or other
objects. Pieces may be de-allocated in any order, so over time the heap will consist of alternate
areas that are free and in use.

5. Write detailed notes on Basic blocks and flow graphs.(8)

Loops
A loop is a collection of nodes in a flow graph such that
1. All nodes in the collection are strongly connected.
2. The collection of nodes has a unique entry.
A loop that contains no other loops is called an inner loop.
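Flow graphs are built over basic blocks, which are formed by the leader rule: the first statement is a leader, any target of a jump is a leader, and any statement immediately following a jump is a leader; a block runs from a leader up to, but not including, the next leader. A minimal sketch (the (op, target) statement encoding, with targets as statement indices, is an assumption):

```python
# Partition a list of three-address statements into basic blocks via leaders.

def basic_blocks(code):
    """code: list of (op, target); target is a statement index for jumps."""
    jumps = {"goto", "if_goto"}
    leaders = {0}                           # the first statement is a leader
    for i, (op, target) in enumerate(code):
        if op in jumps:
            leaders.add(target)             # the jump's target starts a block
            if i + 1 < len(code):
                leaders.add(i + 1)          # so does the statement after the jump
    starts = sorted(leaders)
    return [code[s:e] for s, e in zip(starts, starts[1:] + [len(code)])]

code = [("assign", None), ("if_goto", 4), ("assign", None),
        ("goto", 1), ("assign", None)]
blocks = basic_blocks(code)
```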

6. Define a Directed Acyclic Graph. Construct a DAG and write the sequence of instructions
for the expression a+a*(b-c)+(b-c)*d. (May/June 2014)

Directed acyclic graphs (DAGs) are useful data structures for implementing transformations on
basic blocks. A dag gives a picture of how the value computed by each statement in a basic
block is used in subsequent statements of the block.

 Leaves correspond to atomic operands, and interior nodes correspond to operators.
 A node N with multiple parents represents a common subexpression.
(a + a * (b - c)) + ((b - c) * d)
The corresponding sequence of instructions, with the common sub-expression b - c computed once, is:
t1 := b - c
t2 := a * t1
t3 := a + t2
t4 := t1 * d
t5 := t3 + t4

 There is a node in the DAG for each of the initial values of the variables appearing
in the basic block.
 There is a node N associated with each statement s within the block. The children
of N are those nodes corresponding to statements that are the last definitions, prior
to s, of the operands used by s.
 Node N is labeled by the operator applied at s, and also attached to N is the list of
variables for which it is the last definition within the block.
 Certain nodes are designated output nodes. These are the nodes whose variables are
live on exit from the block;

UNIT V : Code Optimization

PART-A
1. What is meant by optimization?
It is a program transformation that makes the code produced by the compiler run faster or take
less space.
2.What are the principle sources of optimization?
Optimization consists of detecting patterns in the program and replacing these patterns by
equivalent but more efficient constructs. The richest source of optimization is the efficient
utilization of the registers and instruction set of a machine.
3.Mention some of the major optimization techniques.
 Local optimization
 Loop optimization
 Data flow analysis
 Function preserving transformations
 Algorithm optimization.
4. What are the methods available in loop optimization?

Code motion, strength reduction, loop-test replacement, and induction-variable elimination.
5. What is the step takes place in peephole optimization?
It improves the performance of the target program by examining a short sequence of target
instructions, called the peephole, and replacing these instructions by a shorter or faster sequence
whenever possible. It can also be applied to the intermediate representation.
6. What are the characteristics of peephole optimization?
a. Redundant instruction elimination. b. Flow of control optimization c. Algebraic simplifications
d. Use of machine idioms
7. What are the various types of optimization?
The various types of optimization are: 1) Local optimization, 2) Loop optimization, 3) Data-flow
analysis, 4) Global optimization.
8. List the criteria for selecting a code optimization technique.
The criteria for selecting a good code optimization technique are: it should capture most of the
potential improvement without an unreasonable amount of effort; it should preserve the meaning
of the program; and it should reduce the time or space taken by the object program.
9.What is meant by U-D chaining?
It is use-definition chaining: the process of gathering information, through global data-flow
analysis, about which definitions can reach each use of a name.
10. What do you mean by induction variable elimination?
It is the process of eliminating all but one of the induction variables when there are two or more
induction variables available in a loop.
11.List any two structure preserving transformations adopted by the optimizer?
The structure-preserving transformations adopted by the optimizer are common sub-expression elimination and dead-code elimination.
12. What are dominators?
A node d of a flow graph is said to dominate node n (written d dom n) if every path from the
initial node of the flow graph to n goes through d.
13. What is meant by constant folding?
Constant folding is the process of replacing expressions by their value if the value can be
computed at compile time.
14. Define optimizing compilers.(Nov/Dec 2013)
Compilers that apply code-improving transformations are called optimizing compilers.
15. When do you say a transformation of a program is local?
A transformation of a program is called local, if it can be performed by looking only at the
statement in a basic block.
16. Write a note on function preserving transformation.
A compiler can improve a program by transformations that do not change the function it
computes.
17.List the function –preserving transformation.
1) Common subexpression elimination, 2) Copy propagation, 3) Dead-code elimination,
4) Constant folding.
18. Define common subexpression.
An occurrence of an expression E is called a common subexpression if E was previously
computed and the values of variables in E have not changed since the previous computation.
19.What is meant by loop optimization?
The running time of a program may be improved if we decrease the number of instructions in an
inner loop, even if we increase the amount of code outside that loop.
20. What do you mean by data flow equations?
A typical equation has the form out[s] = gen[s] U (in[s]-kill[s])

It can be read as information at the end of a statement is either generated within the statement or
enters at the beginning and is not killed as control flows through the statement.
21. State the meaning of in[s], out[s], kill[s], gen[s].
in[S] - the set of definitions reaching the beginning of S; out[S] - the set of definitions reaching
the end of S; gen[S] - the set of definitions generated by S; kill[S] - the set of definitions that
never reach the end of S.
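The equation out[s] = gen[s] U (in[s]-kill[s]) from question 20 can be solved iteratively over a flow graph. A minimal sketch (the block names and precomputed gen/kill sets are illustrative assumptions):

```python
# Iterative reaching-definitions analysis: in[B] is the union of out[P]
# over predecessors P, and out[B] = gen[B] U (in[B] - kill[B]).

def reaching_definitions(preds, gen, kill):
    """preds: block -> list of predecessor blocks; gen/kill: block -> set."""
    blocks = list(gen)
    IN = {b: set() for b in blocks}
    OUT = {b: set(gen[b]) for b in blocks}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for b in blocks:
            IN[b] = set().union(*(OUT[p] for p in preds[b]))
            new_out = gen[b] | (IN[b] - kill[b])
            if new_out != OUT[b]:
                OUT[b], changed = new_out, True
    return IN, OUT

preds = {"B1": [], "B2": ["B1", "B2"]}      # B2 is a self-loop entered from B1
IN, OUT = reaching_definitions(
    preds,
    gen={"B1": {"d1"}, "B2": {"d2"}},
    kill={"B1": {"d2"}, "B2": {"d1"}},
)
```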
22.What is data flow analysis? (Nov/Dec 2012)
The data flow analysis is the transmission of useful relationships from all parts of the program to
the places where the information can be of use.
23. Define code motion and loop-variant computation.
Code motion: It is the process of taking a computation that yields the same result independent of
the number of times through the loop and placing it before the loop.
Loop-variant computation: It is eliminating all the induction variables except one, when there are
two or more induction variables available in a loop.
24.Define loop unrolling with example.(Nov/Dec 2013)
Loop overhead can be reduced by reducing the number of iterations and replicating the body of
the loop.
Example:
In the code fragment below, the body of the loop can be replicated once and the number of
iterations can be reduced from 100 to 50.
for (i = 0; i < 100; i++)
g ();
Below is the code fragment after loop unrolling.
for (i = 0; i < 100; i += 2)
{
g ();
g ();
}
25.What is constant folding?(May/June 2013)
Constant folding is the process of replacing expressions by their value if the value can be
computed at compile time.
26.How would you represent the dummy blocks with no statements indicated in global data
flow analysis?(May/June 2013)
Dummy blocks with no statements are used as a technical convenience (indicated as open circles).
27.What is meant by copy propagation or variable propagation?
One concerns assignments of the form f := g, called copy statements, or copies for short. Copy
propagation means the use of g in place of f after such a copy.
Eg:
1. V1 := V2
2. f := V1 + f
3. g := V2 + f - 6
In statement 2, V2 can be used in place of V1 by copy propagation. So,
1. V1 := V2
2. f := V2 + f
3. g := V2 + f - 6
This leads further to common sub-expression elimination (V2 + f is a common sub-expression).
28. Define busy expressions.
An expression E is busy at a program point if and only if

 An evaluation of E exists along some path P1, P2, ..., Pn starting at program point P1, and
 No definition of any operand of E exists before its evaluation along the path.

PART – B

1. Explain the principle sources of optimization in detail (8)


The principle sources of optimization introduce some of the most useful code-improving
transformations. Techniques for implementing these transformations are presented in subsequent
sections. A transformation of a program is called local if it can be performed by looking only at
the statements in a basic block; otherwise, it is called global.
Function-Preserving Transformations
There are number of function preserving transformations. They are,
 Common sub-expression elimination
 Copy propagation
 Dead-code elimination
 Constant folding

 Common sub-expression elimination


An occurrence of an expression E is called a common sub-expression if E was previously
computed, and the values of variables in E have not changed since the previous computation.
We can avoid re-computing the expression if we can use the previously computed value. For
example, the assignments to t7 and t10 have the common sub-expressions 4*i and 4*j,
respectively, on the right side in Fig.(a). They have been eliminated in Fig.(b), by using t6
instead of t7 and t8 instead of t10.

After local common sub-expressions are eliminated, B5 still evaluates 4*i and 4*j, as shown
in Fig. (b). Both are common sub-expressions; in particular, the three statements,

using t4 computed in block B3.
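Local common-sub-expression elimination can be sketched as a single pass with a table of available expressions (an illustrative Python sketch; the (result, op, arg1, arg2) statement format is an assumption, and reassigning an operand invalidates expressions built from its old value):

```python
def eliminate_cse(block):
    available = {}              # (op, arg1, arg2) -> name currently holding the value
    out = []
    for result, op, a, b in block:
        key = (op, a, b)
        if key in available and available[key] != result:
            out.append((result, "=", available[key], None))   # reuse earlier value
            computed = False
        else:
            out.append((result, op, a, b))
            computed = True
        # result was reassigned: drop expressions built from its old value
        available = {k: v for k, v in available.items()
                     if result not in (k[1], k[2]) and v != result}
        if computed and result not in (a, b):
            available[key] = result
    return out

block = [("t1", "-", "a", "b"), ("t2", "-", "a", "c"), ("t3", "-", "a", "c")]
optimized = eliminate_cse(block)
```

The third statement recomputes a - c, so it becomes a copy from t2, mirroring the t6/t7 replacement described above.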

Fig: B5 and B6 after common sub-expression elimination

 Copy Propagation
Block B5 in above figure can be further improved by eliminating x using two new
transformations. One concerns assignments of the form f :=g called copy statements, or
copies for short.
For example, when the common sub-expression in c:=d+e is eliminated in following
figure,

Fig: Copies introduced during common sub-expression elimination.

 Dead-Code Elimination
A variable is live at a point in a program if its value can be used subsequently; otherwise,
it is dead at that point. Statements that compute values that never get used are called dead code
or useless code. The programmer is unlikely to introduce any dead code intentionally; it
may appear as a result of previous transformations.
if (debug) print
By a data-flow analysis, it may be possible to deduce that each time the program reaches
this statement, the value of debug is false. Usually, it is because there is one particular
statement debug := false

If copy propagation replaces debug by false, then the print statement is dead because it
cannot be reached. More generally, deducing at compile time that the value of an
expression is a constant and using the constant instead is known as constant folding.
Copy propagation followed by dead-code elimination removes the assignment to x and
transforms into;
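The chain just described (the constant is propagated, after which the guarded statement is unreachable and dropped) can be sketched as follows (a hedged Python rendering; the statement encoding is an invented illustration):

```python
# Constant propagation followed by dead-code elimination: after
# debug := 0 is propagated, "if debug" can never be true, so the
# guarded statement is removed.

def fold_and_prune(stmts, env=None):
    env = dict(env or {})           # known constant values of names
    out = []
    for s in stmts:
        if s[0] == "assign":                    # ("assign", name, value)
            _, name, value = s
            env[name] = env.get(value, value)   # propagate copies of constants
            out.append(s)
        elif s[0] == "if":                      # ("if", cond_name, body_stmt)
            cond = env.get(s[1], s[1])
            if cond == 0:
                continue                        # condition known false: dead code
            out.append(s)
        else:
            out.append(s)
    return out

stmts = [("assign", "debug", 0), ("if", "debug", ("print", "info"))]
pruned = fold_and_prune(stmts)
```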

 Loop Optimizations
The running time of a program may be improved if we decrease the number of instructions
in an inner loop, even if we increase the amount of code outside that loop. Three
techniques are important for loop optimization;
 code motion, which moves code outside a loop;
 induction-variable elimination, which we apply to eliminate i and j from the
inner loops
 Reduction in strength, which replaces an expensive operation by a cheaper
one.
 Code Motion

 Induction Variables and Reduction in Strength


Loops are usually processed inside out. For example, consider the loop around B 3. Only
the portion of the flow graph relevant to the transformations on B 3 is shown in following
figure.

Fig: Strength reduction applied to 4* j in block B3.
Note that the values of j and t4 remain in lock-step; every time the value of j decreases by
1, that of t4 decreases by 4 because 4* j is assigned to t4. Such identifiers are called
induction variables.
After reduction in strength is applied to the inner loops around B2 and B3, the only use of i
and j is to determine the outcome of the test in block B4. We know that the values of i and
t2 satisfy the relationship t2 = 4*i, while those of j and t4 satisfy the relationship t4 = 4*j,
so the test t2 >= t4 is equivalent to i >= j. Once this replacement is made, i in block B2 and j
in block B3 become dead variables and the assignments to them in these blocks become
dead code that can be eliminated, resulting in the flow graph shown in the following figure.

Fig: Flow graph after induction-variable elimination.

2. (a) Discuss about the following:
i) Copy Propagation

ii) Dead-code Elimination and
iii) Code motion (6)
 Copy Propagation
Block B5 in above figure can be further improved by eliminating x using two new
transformations. One concerns assignments of the form f :=g called copy statements, or
copies for short.
For example, when the common sub-expression in c:=d+e is eliminated in following
figure,

Fig: Copies introduced during common sub-expression elimination.

 Dead-Code Elimination
A variable is live at a point in a program if its value can be used subsequently; otherwise,
it is dead at that point. Statements that compute values that never get used are called dead code
or useless code. The programmer is unlikely to introduce any dead code intentionally; it
may appear as a result of previous transformations.
if (debug) print
By data-flow analysis, it may be possible to deduce that each time the program reaches
this statement, the value of debug is false. Usually, this is because there is one particular
statement, debug := false, that sets debug.
If copy propagation replaces debug by false, then the print statement is dead because it
cannot be reached. More generally, deducing at compile time that the value of an
expression is a constant, and using the constant instead, is known as constant folding.
Copy propagation followed by dead-code elimination removes the assignment to x and
transforms the block accordingly.
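A minimal sketch of dead-code elimination for straight-line code, assuming a backward scan with a set of live variables (the statement representation is illustrative):

```python
# Dead-code elimination over straight-line three-address code: scanning
# backwards, a statement is dead if its destination is not live (never used
# later and not live on exit from the block).
def eliminate_dead(stmts, live_on_exit):
    live = set(live_on_exit)
    kept = []
    for dest, op, a1, a2 in reversed(stmts):
        if dest in live:
            kept.append((dest, op, a1, a2))
            live.discard(dest)                          # killed by this definition
            live.update(a for a in (a1, a2) if a is not None)
        # otherwise the statement computes a value never used: dead code
    kept.reverse()
    return kept
```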

 Loop Optimizations
The running time of a program may be improved if we decrease the number of instructions
in an inner loop, even if we increase the amount of code outside that loop. Three
techniques are important for loop optimization:
 code motion, which moves code outside a loop;
 induction-variable elimination, which we apply to eliminate i and j from the
inner loops.
 Reduction in strength, which replaces an expensive operation by a cheaper
one.
 Code Motion
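As an illustrative sketch (the variable names are assumptions, not taken from the original figure), code motion hoists a loop-invariant computation such as limit - 2 out of the loop test:

```python
# Code motion: the loop-invariant expression limit - 2 is computed once
# before the loop instead of in every evaluation of the loop test.
def before_motion(xs, limit):
    i, total = 0, 0
    while i <= limit - 2:      # limit - 2 re-evaluated on every test
        total += xs[i]
        i += 1
    return total

def after_motion(xs, limit):
    t = limit - 2              # hoisted: computed once outside the loop
    i, total = 0, 0
    while i <= t:
        total += xs[i]
        i += 1
    return total
```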

 Induction Variables and Reduction in Strength
Loops are usually processed inside out. For example, consider the loop around B3. Only
the portion of the flow graph relevant to the transformations on B3 is shown in the
following figure.

Fig: Strength reduction applied to 4*j in block B3.

Note that the values of j and t4 remain in lock-step; every time the value of j
decreases by 1, that of t4 decreases by 4, because 4*j is assigned to t4. Such identifiers are
called induction variables.

After reduction in strength is applied to the inner loops around B2 and B3, the only
use of i and j is to determine the outcome of the test in block B4. We know that the values
of i and t2 satisfy the relationship t2 = 4*i, while those of j and t4 satisfy the relationship t4
= 4*j, so the test t2 >= t4 is equivalent to i >= j. Once this replacement is made, i in block
B2 and j in block B3 become dead variables, and the assignments to them in these blocks
become dead code that can be eliminated, resulting in the flow graph shown in the following
figure.

Fig: Flow graph after induction-variable elimination.

3. Describe the various storage allocation strategies. (8)


There are two standard storage allocation strategies:
 Static allocation
 Stack allocation
Static Allocation
An activation record for a procedure has fields to hold parameters, results, machine-status
information, local data, temporaries and the like.
Consider the code needed to implement static allocation. A call statement in the intermediate code
is implemented by a sequence of two target-machine instructions. A MOV instruction saves the
return address, and a GOTO transfers control to the target code for the called procedure:

The attributes “callee.static_area” and “callee.code_area” are constants referring to the address
of the activation record and the first instruction for the called procedure, respectively. The source

#here + 20 in the MOV instruction is the literal return address; it is the address of the instruction
following the GOTO instruction.
A return from procedure "callee" is implemented by
GOTO *callee.static_area
which transfers control to the address saved at the beginning of the activation record.

Fig: Static Allocation

Target code for the above three-address code is,

Stack Allocation

Static allocation can become stack allocation by using relative addresses for storage in activation
records. The position of the record for an activation of a procedure is not known until run time. In
stack allocation, this position is usually stored in a register, so words in the activation record can
be accessed as offsets from the value in this register.
When a procedure call occurs, the calling procedure increments SP and transfers control to the
called procedure. After control returns to the caller, it decrements SP, thereby de-allocating the
activation record of the called procedure.

The attribute "caller.recordsize" represents the size of an activation record, so the ADD
instruction leaves SP pointing to the beginning of the next activation record. The source #here+16
in the MOV instruction is the address of the instruction following the GOTO; it is saved in the
address pointed to by SP.
 The return sequence consists of two parts. The called procedure transfers control to the
return address using
GOTO *0(SP) /* return to caller */
The reason for using *0(SP) in the GOTO instruction is that we need two levels of
indirection: 0(SP) is the address of the first word in the activation record, and *0(SP) is the
return address saved there.
 The second part of the return sequence is in the caller, which decrements SP, thereby
restoring SP to its previous value. That is, after the subtraction SP points to the beginning
of the activation record of the caller:
SUB #caller.recordsize, SP

Three address code to illustrate stack allocation

Target code of the input

4. Explain in detail access to nonlocal names.(8)
When one procedure calls another, the usual method of communication between them is through
nonlocal names and through parameters of the called procedure.
There are different parameters passing methods that associate actual and formal parameters.
There are,
 Call-by-Value
 Call-by-Reference
 Copy-Restore
 Call-by-Name

Call-by-value
It is the simplest possible method of passing parameter. The actual parameters are evaluated and
their r-values are passed to the called procedure. Call-by-value is used in С and Pascal parameters
are usually passed this way. All the programs so far in this chapter have relied on this method for
passing parameters. Call-by-value can be implemented as follows.
 A formal parameter is treated just like a local name, so the storage for the formals is in the
activation record of the called procedure.
 The caller evaluates the actual parameters and places their r-values in the storage for the
formals.
A distinguishing feature of call-by-value is that operations on the formal parameters do not affect
values in the activation record of the caller. A procedure called by value can affect its caller either
through nonlocal names or through pointers that are explicitly passed as values.
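A short sketch of call-by-value behaviour (rebinding a Python parameter mirrors it for this purpose; the names are illustrative):

```python
# Call-by-value: the formal x is local storage in the callee's activation
# record; assigning to it leaves the caller's variable untouched.
def increment(x):
    x = x + 1                  # changes only the callee's copy
    return x

a = 5
increment(a)                   # a is still 5 afterwards
```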
Call-by-Reference
When parameters are passed by reference (also known as call-by-address or call-by-location), the
caller passes to the called procedure a pointer to the storage address of each actual parameter.
 If an actual parameter is a name or an expression having an l-value, then that l-value
itself is passed.
 However, if the actual parameter is an expression, like a+b or 2, that has no l-value,
then the expression is evaluated in a new location, and the address of that location is
passed.
A reference to a formal parameter in the called procedure becomes, in the target code, an indirect
reference through the pointer passed to the called procedure.
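Call-by-reference can be sketched by letting a one-element list stand in for the pointer to the actual's storage (an illustrative simulation, not target code):

```python
# Call-by-reference, simulated: a one-element list plays the role of a
# pointer to the actual parameter's storage cell.
def increment_ref(p):
    p[0] = p[0] + 1            # indirect reference through the "pointer"

cell = [5]                     # storage for the actual parameter
increment_ref(cell)            # the callee updates the caller's storage
```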
Copy-Restore
A hybrid between call-by-value and call-by-reference is copy-restore linkage, also known as
copy-in copy-out, or value-result.
 Before control flows to the called procedure, the actual parameters are evaluated. The r-
values of the actuals are passed to the called procedure as in call-by-value. In addition, the
l-values of those actual parameters having l-values are determined before the call.
 When control returns, the current r-values of the formal parameters are copied back into
the l-values of the actuals, using the l-values computed before the call. Only actuals having
l-values are copied, of course.
The first step "copies in" the values of the actuals into the activation record of the called
procedure (into the storage for the formals). The second step "copies out" the final values of the
formals into the activation record of the caller (into l-values computed from the actuals before the
call).
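The two copy-restore steps can be made explicit in a small simulation (the environment dictionary and names are illustrative assumptions):

```python
# Copy-restore (copy-in copy-out): the actual's r-value is copied into the
# callee's storage for the formal; on return, the final value of the formal
# is copied back to the actual's l-value saved before the call.
def callee_body(formal):
    formal = formal * 2        # the callee works on its private copy
    return formal              # final value of the formal, to be copied out

env = {"a": 5}                 # caller's environment: name (l-value) -> r-value
lvalue = "a"                   # l-value of the actual, determined before the call
copy_in = env[lvalue]          # step 1: copy in the r-value of the actual
env[lvalue] = callee_body(copy_in)   # step 2: copy out when control returns
```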
Call-by-Name
Call-by-name is traditionally defined by the copy-rule, which is:
 The procedure is treated as if it were a macro; that is, its body is substituted for the call in
the caller, with the actual parameters literally substituted for the formals. Such a literal
substitution is called macro-expansion or inline expansion.
 The local names of the called procedure are kept distinct from the names of the calling
procedure. We can think of each local of the called procedure being systematically
renamed into a distinct new name before the macro-expansion is done.
 The actual parameters are surrounded by parentheses if necessary to preserve their
integrity.

5. (a) Elaborate storage organization.(8)


Subdivision of Run-Time Memory
The compiler obtains a block of storage from the operating system for the compiled
program to run in. This run-time storage might be subdivided to hold:
 the generated target code
 data objects
 a counterpart of the control stack to keep track of procedure activations
The size of the generated target code is fixed at compile time, so the compiler can place it in a
statically determined area, perhaps in the low end of memory. Similarly, the size of some of the
data objects may also be known at compile time.
Languages like Pascal and C use extensions of the control stack to manage activations of
procedures. When a call occurs, execution of an activation is interrupted and information about the
status of the machine, such as the value of the program counter and machine registers, is saved on
the stack. When control returns from the call, this activation can be restarted after restoring the
values of relevant registers and setting the program counter to the point immediately after the call.
Data objects whose lifetime is contained in that of an activation can be allocated on the stack, along
with other information associated with the activation.
A separate area of run-time memory, called a heap, holds all other information.
The sizes of the stack and the heap can change as the program executes, so we show these at
opposite ends of memory in the following figure.

Fig: Typical subdivision of run-time memory into code and data areas.
Activation Records
Information needed by a single execution of a procedure is managed using a contiguous block of
storage called an activation record or frame, consisting of the collection of fields shown in the
following figure.

Fig: A general activation record


For languages like Pascal and C it is customary to push the activation record of a procedure on the
run-time stack when the procedure is called and to pop the activation record off the stack when
control returns to the caller.
The purpose of the fields of an activation record is as follows, starting from the field for
temporaries.

 Temporary values, such as those arising in the evaluation of expressions, are stored in the
field for temporaries.
 The field for local data holds data that is local to an execution of a procedure. The layout
of this field is discussed below.
 The field for saved machine status holds information about the state of the machine just
before the procedure is called. This includes the values of the program counter and
machine registers that have to be restored when control returns from the procedure.
 The optional access link is used to refer to nonlocal data held in other activation records.
For a language like FORTRAN, access links are not needed because nonlocal data is kept
in a fixed place. Access links, or the related "display" mechanism, are needed for Pascal.
 The optional control link points to the activation record of the caller.

 The field for actual parameters is used by the calling procedure to supply parameters to the
called procedure. We show space for parameters in the activation record, but in practice
parameters are often passed in machine registers for greater efficiency.
 The field for the returned value is used by the called procedure to return a value to the
calling procedure. Again, in practice this value is often returned in a register for greater
efficiency.
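A sketch of how a compiler might assign compile-time offsets to these fields (the field sizes in words are illustrative assumptions, not fixed by the text):

```python
# Assigning compile-time offsets to activation-record fields, laid out
# contiguously in the order described above.
FIELD_SIZES = {
    "returned_value": 1,
    "actual_parameters": 2,
    "control_link": 1,
    "access_link": 1,
    "saved_machine_status": 4,
    "local_data": 3,
    "temporaries": 2,
}

def field_offsets(sizes):
    """Lay fields out contiguously; return the offsets and the record size."""
    offsets, off = {}, 0
    for name, size in sizes.items():
        offsets[name] = off
        off += size
    return offsets, off
```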

Compile-Time Layout of Local Data


The amount of storage needed for a name is determined from its type. An elementary data type,
such as a character, integer, or real, can usually be stored in an integral number of bytes. Storage
for an aggregate, such as an array or record, must be large enough to hold all its components. For
easy access to the components, storage for aggregates is typically allocated in one contiguous
block of bytes.
The field for local data is laid out as the declarations in a procedure are examined at compile time.
Variable-length data is kept outside this field.
The following figure is a simplification of the data layout used by C compilers for two machines.

Fig: Data layouts used by two C compilers

(b) Write detailed notes on parameter passing.(8)


When one procedure calls another, the usual method of communication between them is through
nonlocal names and through parameters of the called procedure.
There are different parameters passing methods that associate actual and formal parameters.
There are,
 Call-by-value
 Call –by-Reference
 Copy-restore
 Call by name

Call-by-value
It is the simplest method of passing parameters. The actual parameters are evaluated and
their r-values are passed to the called procedure. Call-by-value is used in C, and Pascal
parameters are usually passed this way. All the programs so far in this chapter have relied
on this method for passing parameters. Call-by-value can be implemented as follows.

 A formal parameter is treated just like a local name, so the storage for the formals is in the
activation record of the called procedure.
 The caller evaluates the actual parameters and places their r-values in the storage for the
formals.
A distinguishing feature of call-by-value is that operations on the formal parameters do not affect
values in the activation record of the caller. A procedure called by value can affect its caller either
through nonlocal names or through pointers that are explicitly passed as values.
Call-by-Reference
When parameters are passed by reference (also known as call-by-address or call-by-location), the
caller passes to the called procedure a pointer to the storage address of each actual parameter.
 If an actual parameter is a name or an expression having an l-value, then that l-value
itself is passed.
 However, if the actual parameter is an expression, like a+b or 2, that has no l-value,
then the expression is evaluated in a new location, and the address of that location is
passed.
A reference to a formal parameter in the called procedure becomes, in the target code, an indirect
reference through the pointer passed to the called procedure.
Copy-Restore
A hybrid between call-by-value and call-by-reference is copy-restore linkage, also known as
copy-in copy-out, or value-result.
 Before control flows to the called procedure, the actual parameters are evaluated. The r-
values of the actuals are passed to the called procedure as in call-by-value. In addition, the
l-values of those actual parameters having l-values are determined before the call.
 When control returns, the current r-values of the formal parameters are copied back into
the l-values of the actuals, using the l-values computed before the call. Only actuals having
l-values are copied, of course.
The first step "copies in" the values of the actuals into the activation record of the called
procedure (into the storage for the formals). The second step "copies out" the final values of the
formals into the activation record of the caller (into l-values computed from the actuals before the
call).
Call-by-Name
Call-by-name is traditionally defined by the copy-rule, which is:
 The procedure is treated as if it were a macro; that is, its body is substituted for the call in
the caller, with the actual parameters literally substituted for the formals. Such a literal
substitution is called macro-expansion or inline expansion.
 The local names of the called procedure are kept distinct from the names of the calling
procedure. We can think of each local of the called procedure being systematically
renamed into a distinct new name before the macro-expansion is done.
 The actual parameters are surrounded by parentheses if necessary to preserve their
integrity.

6. (a) Explain optimization of basic blocks.


The code-improving transformations for basic blocks include the following structure-preserving
transformations:
 Common sub-expression elimination
 Dead code elimination
 Algebraic transformation
The structure-preserving transformations can be implemented by constructing a DAG for a basic
block. A DAG gives a picture of how the value computed by each statement in a basic block is
used in subsequent statements of the block.

Dag Construction
A dag for a basic block (or just dag) is a directed acyclic graph with the following labels on nodes:
 Leaves are labeled by unique identifiers, either variable names or constants.
 Interior nodes are labeled by an operator symbol.
 Nodes are also optionally given a sequence of identifiers for labels. The intention is that
interior nodes represent computed values, and the identifiers labeling a node are deemed to
have that value.

Each node of a flow graph can be represented by a dag, since each node of the flow graph stands
for a basic block.
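A minimal sketch of DAG construction (the statement representation is illustrative): nodes are keyed on their (op, left, right) triple, so a repeated triple maps to an existing node, which is exactly how a common sub-expression is detected.

```python
# DAG construction for a basic block: interior nodes are shared by keying
# them on (op, left_id, right_id); leaves are keyed on their name.
def build_dag(stmts):
    nodes = {}      # (op, left_id, right_id) -> node id
    var_node = {}   # variable -> id of the node holding its current value

    def node_for(name):
        # a use refers to the variable's current node, else a fresh leaf
        if name not in var_node:
            key = ("leaf", name, None)
            nodes.setdefault(key, len(nodes))
            var_node[name] = nodes[key]
        return var_node[name]

    for dest, op, a1, a2 in stmts:
        key = (op, node_for(a1), node_for(a2))
        nodes.setdefault(key, len(nodes))   # reuse an existing node if any
        var_node[dest] = nodes[key]         # dest now labels that node
    return var_node
```

When two statements compute the same triple, both destinations end up labeling the same node.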

The final DAG for the given three address code is,

Consider the block,

A dag for the above block is,

The Use of Algebraic Identities
Algebraic identities represent another important class of optimizations on basic blocks.
For example, we may apply arithmetic identities, such as

Another class of algebraic optimizations includes reduction in strength, which is, replacing a more
expensive operator by a cheaper one as in

A third class of related optimizations is constant folding. Here we evaluate constant expressions at
compile time and replace the constant expressions by their values.
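A hedged sketch combining constant folding with two identities and a strength reduction (the statement representation and the particular rules chosen are illustrative):

```python
# Constant folding plus x+0 = x, x*1 = x, and the strength reduction
# x*2 = x+x, applied to one statement (dest, op, arg1, arg2).
def simplify(dest, op, a1, a2):
    if isinstance(a1, int) and isinstance(a2, int):      # constant folding
        value = {"+": a1 + a2, "*": a1 * a2}[op]
        return (dest, "copy", value, None)
    if op == "+" and a2 == 0:                            # x + 0 = x
        return (dest, "copy", a1, None)
    if op == "*" and a2 == 1:                            # x * 1 = x
        return (dest, "copy", a1, None)
    if op == "*" and a2 == 2:                            # x * 2 = x + x
        return (dest, "+", a1, a1)
    return (dest, op, a1, a2)
```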
The more general algebraic transformations are,
Associative law

Commutative law
For example, suppose * is commutative; that is, x*y = y*x. Before we create a new node
labeled * with left child m and right child n, we check whether such a node already exists.
We then check for a node having operator *, left child n and right child m.
