
III Year-V Semester: B.Tech.

Computer Science and Engineering


5CS4-02: Compiler Design

UNIT-1

1 Introduction: Objective, scope and outcome of the course. Compiler, Translator, Interpreter definition, Phases of compiler, Bootstrapping, Review of Finite automata, Lexical analyzer, Input, Recognition of tokens, Idea about LEX: A lexical analyzer generator, Error handling.
2 Review of CFG, Ambiguity of grammars: Introduction to parsing. Top-down parsing, LL grammars & parsers, Error handling of LL parser, Recursive descent parsing, Predictive parsers, Bottom-up parsing, Shift-reduce parsing, LR parsers, Construction of SLR, Canonical LR & LALR parsing tables, Parsing with ambiguous grammar, Operator precedence parsing, Introduction to automatic parser generator: YACC, Error handling in LR parsers.
3 Syntax-directed definitions; Construction of syntax trees, S-attributed definitions, L-attributed definitions, Top-down translation. Intermediate code forms using postfix notation, DAG, Three-address code, TAC for various control structures, Representing TAC using triples and quadruples, Boolean expressions and control structures.
4 Storage organization; Storage allocation strategies, Activation records, Accessing local and non-local names in a block-structured language, Parameter passing, Symbol table organization, Data structures used in symbol tables.
5 Definition of basic block, Control flow graphs; DAG representation of basic blocks, Advantages of DAG, Sources of optimization, Loop optimization, Idea about global data flow analysis, Loop invariant computation, Peephole optimization, Issues in design of a code generator, A simple code generator, Code generation from DAG.

1.1 What is a Translator? Different types of translators

A program written in a high-level language is called source code. To convert source code into machine code, translators are needed.
A translator takes a program written in a source language as input and converts it into a program in a target language as output.
It also detects and reports errors during translation.
The roles of a translator are:
• Translating the high-level language program into an equivalent machine language program.
• Providing diagnostic messages wherever the programmer violates the specification of the high-level language.
Different types of translators

The different types of translators are as follows:


Compiler

A compiler is a translator used to convert programs written in a high-level language into a low-level language. It translates the entire program at once and reports any errors in the source program encountered during the translation.

Interpreter

An interpreter is a translator used to convert programs written in a high-level language into a low-level language. An interpreter translates line by line and reports an error as soon as one is encountered during the translation process.
It directly executes the operations specified in the source program on the input given by the user.
It gives better error diagnostics than a compiler.

Differences between compiler and interpreter

SI. Compiler Interpreter


No

1 Performs the translation of a program as Performs statement by statement


a whole. translation.

2 Execution is faster. Execution is slower.

3 Requires more memory as linking is Memory usage is efficient as no


needed for the generated intermediate intermediate object code is generated.
object code.

4 Debugging is hard as the error messages It stops translation when the first error
are generated after scanning the entire is met. Hence, debugging is easy.
program only.
5 Programming languages like C, C++ Programming languages like Python,
uses compilers. BASIC, and Ruby uses interpreters.

1.2 What are the Phases of Compiler Design?

A compiler operates in various phases; each phase transforms the
source program from one representation to another. Every phase
takes input from the previous stage and feeds its output to the
next phase of the compiler.

There are 6 phases in a compiler. Each of these phases helps in
converting the high-level language into machine code. The phases
of a compiler are:

1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Intermediate code generator
5. Code optimizer
6. Code generator

Lexical Analysis:
Lexical analysis is the first phase of the compilation process. It takes source code as input, reads the source program one character at a time, and converts it into meaningful lexemes. The lexical analyzer represents these lexemes in the form of tokens.

Syntax Analysis
Syntax analysis is the second phase of the compilation process. It takes tokens as input and generates a parse tree as output. In this phase, the parser checks whether the expression formed by the tokens is syntactically correct.

Semantic Analysis
Semantic analysis is the third phase of the compilation process. It checks whether the parse tree follows the rules of the language. The semantic analyzer keeps track of identifiers, their types, and expressions. The output of the semantic analysis phase is the annotated syntax tree.

Intermediate Code Generation

In intermediate code generation, the compiler translates the source code into intermediate code. Intermediate code sits between the high-level language and the machine language, and should be generated in such a way that it can easily be translated into the target machine code.
Code Optimization
Code optimization is an optional phase. It is used to improve the intermediate code so that the program runs faster and takes less space. It removes unnecessary lines of code and rearranges the sequence of statements to speed up program execution.

Code Generation
Code generation is the final stage of the compilation process. It takes the optimized intermediate
code as input and maps it to the target machine language. Code generator translates the
intermediate code into the machine code of the specified computer.
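As a minimal sketch of what the early phases produce, the Python snippet below lexes a small assignment statement; the closing comments show three-address code that the later phases might emit for it. The token classes, patterns, and helper names are illustrative, not taken from any particular compiler.

```python
import re

# Source statement used throughout the sketch.
SRC = "a = b + c * 60"

# Phase 1 - lexical analysis: characters -> (token-type, lexeme) pairs.
TOKEN_SPEC = [
    ("NUM", r"\d+"),                     # integer constants
    ("ID",  r"[A-Za-z_][A-Za-z0-9_]*"),  # identifiers
    ("OP",  r"[=+*]"),                   # only the operators this sketch needs
    ("WS",  r"\s+"),                     # whitespace: discarded, not tokenized
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(src):
    return [(m.lastgroup, m.group())
            for m in MASTER.finditer(src) if m.lastgroup != "WS"]

print(lex(SRC))
# Later phases (syntax analysis, semantic analysis, intermediate code
# generation) would turn this token stream into three-address code such as:
#   t1 = c * 60
#   t2 = b + t1
#   a  = t2
```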


1.3 Bootstrapping
o Bootstrapping is widely used in compiler development.
o Bootstrapping is used to produce a self-hosting compiler. A self-hosting compiler is a compiler that can compile its own source code.
o A bootstrap compiler is used to compile the compiler itself; the compiled compiler can then be used to compile everything else, as well as future versions of itself.

A compiler can be characterized by three languages:


1. Source Language
2. Target Language
3. Implementation Language

The T-diagram notation SCIT denotes a compiler C for source language S and target language T, implemented in language I.

Follow these steps to produce a compiler for a new language L on machine A:

1. Create a compiler SCAA for a subset S of the desired language L, written in language A, that runs on machine A.

2. Create a compiler LCSA for the full language L, written in the subset S.

3. Compile LCSA using the compiler SCAA to obtain LCAA. LCAA is a compiler for language L that runs on machine A and produces code for machine A.

The process described by the T-diagrams is called bootstrapping.
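The three steps above can be sketched as T-diagram composition. The Python illustration below is a hedged model: the `Compiler` dataclass and `run_through` helper are invented names, where `Compiler(source, target, impl)` stands for the T-diagram SCIT.

```python
from dataclasses import dataclass

# Compiler(source, target, impl) models a T-diagram: a compiler from
# `source` to `target`, written in `impl`. Names here are illustrative.
@dataclass(frozen=True)
class Compiler:
    source: str
    target: str
    impl: str

def run_through(impl_lang: str, comp: Compiler) -> str:
    """Compiling a program written in comp.source re-expresses it in
    comp.target; here the "program" is another compiler's implementation."""
    assert impl_lang == comp.source, "compiler cannot read this language"
    return comp.target

# Step 1: compiler for a subset S of L, written in A, producing A code (SCAA).
step1 = Compiler(source="S", target="A", impl="A")
# Step 2: compiler for the full language L, written in the subset S (LCSA).
step2 = Compiler(source="L", target="A", impl="S")
# Step 3: compile step2 with step1 - its implementation language S becomes A.
step3 = Compiler(step2.source, step2.target, run_through(step2.impl, step1))
print(step3)  # Compiler(source='L', target='A', impl='A'), i.e. LCAA
```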

1.4 Review of Finite Automata

A finite automaton is a state machine that takes a string of symbols as input and changes its state accordingly. Finite automata act as recognizers for regular expressions. When a string is fed into a finite automaton, it changes state for each symbol. If the input string is successfully processed and the automaton reaches a final state, the string is accepted, i.e., it is a valid token of the language in hand.
The mathematical model of a finite automaton consists of:

 Finite set of states (Q)
 Finite set of input symbols (Σ)
 One start state (q0)
 Set of final states (qf)
 Transition function (δ)

The transition function δ maps a state and an input symbol to a next state: δ : Q × Σ → Q.

Finite Automata Construction


Let L(r) be a regular language recognized by some finite automaton (FA).
 States : States of the FA are represented by circles, with state names written inside them.
 Start state : The state from which the automaton starts is known as the start state; it has an arrow pointing towards it.
 Intermediate states : All intermediate states have at least two arrows: one pointing to them and another pointing out from them.
 Final state : If the input string is successfully parsed, the automaton is expected to be in this state. A final state is represented by double circles.
 Transition : A transition from one state to another happens when a desired symbol is found in the input. Upon a transition, the automaton can either move to the next state or stay in the same state. Movement from one state to another is shown as a directed arrow pointing to the destination state; if the automaton stays in the same state, an arrow from the state to itself is drawn.
Example : We assume the FA accepts any three-digit binary value ending in the digit 1. FA = {Q(q0, qf), Σ(0,1), q0, qf, δ}
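This example FA can be simulated with a transition table. The Python sketch below is illustrative: the state names q0–qf and the dead-state convention for undefined moves are assumptions of the sketch, not part of the formal definition above.

```python
# Simulate the example FA: accept exactly the three-digit
# binary strings that end in 1.
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q1",   # first digit: either symbol
    ("q1", "0"): "q2", ("q1", "1"): "q2",   # second digit: either symbol
    ("q2", "1"): "qf",                      # third digit must be 1
}
START, FINAL = "q0", {"qf"}

def accepts(s: str) -> bool:
    state = START
    for ch in s:
        state = DELTA.get((state, ch), "dead")  # undefined move -> dead state
    return state in FINAL

print([w for w in ["101", "111", "100", "11", "0011"] if accepts(w)])
# -> ['101', '111']
```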

1.5 Compiler Design - Lexical Analysis

Lexical analysis is the first phase of a compiler. It takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments in the source code.
If the lexical analyzer finds an invalid token, it generates an error. The lexical analyzer works closely with the syntax analyzer. It reads character streams from the source code, checks for legal tokens, and passes the data to the syntax analyzer when demanded.
1.6 Tokens

A lexeme is a sequence of characters in the source program that forms a token. There are predefined rules for every lexeme to be identified as a valid token. These rules are defined by grammar rules, by means of a pattern. A pattern explains what can be a token, and these patterns are defined by means of regular expressions.
In a programming language, keywords, constants, identifiers, strings, numbers, operators and punctuation symbols can be considered tokens.
For example, in C language, the variable declaration line
int value = 100;
contains the tokens:
int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).
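A hedged sketch of how such a line might be tokenized in Python follows; the token-class names mirror the list above, while the patterns and the small keyword table are illustrative assumptions.

```python
import re

# Token classes from the text: keyword, identifier, operator, constant, symbol.
KEYWORDS = {"int", "char", "float", "return"}   # illustrative keyword table
SPEC = [
    ("constant",   r"\d+"),
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("operator",   r"="),
    ("symbol",     r";"),
    ("ws",         r"\s+"),
]
PATTERN = re.compile("|".join(f"(?P<{n}>{p})" for n, p in SPEC))

def tokenize(line):
    out = []
    for m in PATTERN.finditer(line):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "ws":
            continue                       # whitespace separates tokens only
        if kind == "identifier" and lexeme in KEYWORDS:
            kind = "keyword"               # reserved words win over identifiers
        out.append((kind, lexeme))
    return out

print(tokenize("int value = 100;"))
# -> [('keyword', 'int'), ('identifier', 'value'), ('operator', '='),
#     ('constant', '100'), ('symbol', ';')]
```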

Specifications of Tokens

Let us understand the following terms from language theory:
Alphabets
An alphabet is any finite set of symbols: {0,1} is the binary alphabet, {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is the hexadecimal alphabet, and {a-z, A-Z} is the set of English letters.
Strings
Any finite sequence of alphabet symbols is called a string. The length of a string is the total number of occurrences of symbols in it; e.g., the length of the string tutorialspoint is 14, denoted |tutorialspoint| = 14. A string of zero length is known as the empty string and is denoted by ε (epsilon).
Special Symbols
A typical high-level language contains the following symbols:-

Arithmetic Symbols Addition(+), Subtraction(-), Modulo(%), Multiplication(*), Division(/)

Punctuation Comma(,), Semicolon(;), Dot(.), Arrow(->)

Assignment =
Special Assignment +=, /=, *=, -=

Comparison ==, !=, <, <=, >, >=

Preprocessor #

Location Specifier &

Logical &, &&, |, ||, !

Shift Operator >>, >>>, <<, <<<

Language
A language is a set of strings over some finite alphabet. Computer languages can be treated as sets, and mathematical set operations can be performed on them. Regular languages can be described by means of regular expressions.

Longest Match Rule

When the lexical analyzer reads the source code, it scans the code letter by letter; when it encounters a whitespace, operator symbol, or special symbol, it decides that a word is completed.
For example:
int intvalue;
While scanning up to ‘int’, the lexical analyzer cannot yet determine whether it is the keyword int or the initial characters of the identifier intvalue.
The Longest Match Rule states that the lexeme scanned should be determined based on the longest match among all available tokens.
The lexical analyzer also follows rule priority: a reserved word, e.g., a keyword, of a language is given priority over user-defined names. That is, if a scanned lexeme exactly matches an existing reserved word, the lexical analyzer generates the token for that reserved word rather than an identifier.
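Both rules together can be sketched in Python as follows; the `scan` helper and its single identifier pattern are illustrative. The scanner takes the longest identifier-like run first (maximal munch), and only then consults the keyword table (rule priority).

```python
import re

KEYWORDS = {"int"}                         # illustrative keyword table
ID = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def scan(src):
    tokens, i = [], 0
    while i < len(src):
        if src[i].isspace():
            i += 1
            continue
        m = ID.match(src, i)
        if m:                              # longest identifier-like run wins
            lexeme = m.group()
            kind = "keyword" if lexeme in KEYWORDS else "identifier"
            tokens.append((kind, lexeme))
            i = m.end()
        else:
            tokens.append(("symbol", src[i]))
            i += 1
    return tokens

print(scan("int intvalue;"))
# -> [('keyword', 'int'), ('identifier', 'intvalue'), ('symbol', ';')]
```

Note that `intvalue` is scanned as one identifier: the longest match prevents it from being split into the keyword `int` followed by `value`.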

1.7 Recognition of Tokens in Compiler Design

Recognition of Tokens
Tokens can be recognized by finite automata.

A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken from some character set (or alphabet) C. The job of an FA is to accept or reject an input depending on whether the pattern defined by the FA occurs in the input.

There are two notations for representing finite automata:
Transition Diagram
Transition Table

A transition diagram is a directed labeled graph containing nodes and edges.

Nodes represent the states, and edges represent the transitions between states.

Every transition diagram has exactly one initial state, marked by an arrow (-->), and zero or more final states, represented by double circles.
Example:

Where state 1 is the initial state and state 3 is the final state.


Finite Automata for recognizing identifiers

Finite Automata for recognizing keywords

Finite Automata for recognizing numbers

Finite Automata for relational operators

Finite Automata for recognizing white spaces
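As a sketch of one such recognizer, the transition-table form of the identifier automaton (a letter followed by letters or digits) might look like this in Python; the state names, the character-class helper, and the table layout are illustrative assumptions.

```python
# Transition-table recognizer for identifiers: letter (letter|digit)*
def char_class(ch):
    if ch.isalpha() or ch == "_":
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# Rows are states, columns are input classes; a missing entry means
# there is no transition (the input is rejected).
TABLE = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"):  "in_id",
}
ACCEPTING = {"in_id"}

def is_identifier(s):
    state = "start"
    for ch in s:
        state = TABLE.get((state, char_class(ch)))
        if state is None:
            return False
    return state in ACCEPTING

print([w for w in ["value", "x1", "1x", "_tmp", ""] if is_identifier(w)])
# -> ['value', 'x1', '_tmp']
```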


1.8 Error Handling in Compiler Design

Types or Sources of Error – There are two types of errors: run-time and compile-time:

1. A run-time error is an error that takes place during the execution of a program, and usually happens because of adverse system parameters or invalid input data. The lack of sufficient memory to run an application, or a memory conflict with another program, are examples of this. Logic errors occur when executed code does not produce the expected result; they are best handled by meticulous program debugging.
2. Compile-time errors arise at compile time, before execution of the program. A syntax error or a missing file reference that prevents the program from successfully compiling are examples of this.

Classification of Compile-time error –

1. Lexical : This includes misspellings of identifiers, keywords or operators


2. Syntactical : missing semicolon or unbalanced parenthesis
3. Semantical : incompatible value assignment or type mismatches between operator and
operand
4. Logical : code not reachable, infinite loop.

Finding or reporting an error – Viable prefix is the property of a parser that allows early detection of syntax errors.

 Goal: detect an error as soon as possible, without consuming unnecessary further input.
 How: detect an error as soon as the prefix of the input no longer matches a prefix of any string in the language.
 Example: for(;) is reported as an error, since a for statement requires two semicolons inside the parentheses.

Error Recovery –
The minimal requirement for a compiler is to simply detect an error, issue a message, and cease compilation. There are some common recovery methods that go further, as follows.

1. Panic mode recovery: This is the easiest way of error recovery, and it also prevents the parser from falling into infinite loops while recovering from an error. The parser discards input symbols one at a time until one of a designated set of synchronizing tokens (typically statement or expression terminators, such as end or the semicolon) is found. This is adequate when the presence of multiple errors in the same statement is rare. Example: consider the erroneous expression (1 + + 2) + 3. Panic-mode recovery: skip ahead to the next integer and then continue. Bison uses the special terminal error to describe how much input to skip.

E -> int | E + E | ( E ) | error int | ( error )

2. Phrase level recovery: Perform local correction on the input to repair the error. However, error correction is difficult in this strategy.
3. Error productions: Compiler designers know some common errors that may occur in the code. The grammar can be augmented with productions that generate these erroneous constructs, so they are recognized when encountered. Example: writing 5x instead of 5*x.
4. Global correction: The aim is to make as few changes as possible while converting an incorrect input string to a valid string. This strategy is costly to implement.
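Panic-mode recovery in particular is easy to sketch. The toy Python parser below recognizes statements of the form id = num ; and, on an error, discards tokens up to the next semicolon before resuming; all names and the token encoding are invented for illustration.

```python
SYNC = {";"}   # synchronizing tokens: statement terminators

def parse_statements(tokens):
    """Toy parser: a statement is the token sequence  id = num ;
    On any mismatch, skip to the next synchronizing token (panic mode)."""
    ok, errors, i = [], 0, 0
    while i < len(tokens):
        if tokens[i:i + 3] == ["id", "=", "num"] and \
           i + 3 < len(tokens) and tokens[i + 3] == ";":
            ok.append(tokens[i:i + 4])
            i += 4
        else:
            errors += 1
            while i < len(tokens) and tokens[i] not in SYNC:
                i += 1                # panic: discard until a ';' is seen
            i += 1                    # consume the ';' itself, then resume
    return ok, errors

stmts, errs = parse_statements(
    ["id", "=", "num", ";", "id", "+", "num", ";", "id", "=", "num", ";"])
print(len(stmts), errs)  # -> 2 1  (the middle statement is skipped)
```

Because the parser always resumes just past a synchronizing token, it makes progress on every error and cannot loop forever, which is the key property of panic mode.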
