
CST-402(T): Language Processors

• Course Outcomes:
• On successful completion of the course, students will be able to:
1. Exhibit role of various phases of compilation, with understanding of
types of grammars and design complexity of compiler.
2. Design various types of parsers and perform operations like string
parsing and error handling.
3. Demonstrate syntax directed translation schemes, their
implementation for different programming language constructs.
4. Implement different code optimization and code generation
techniques using standard data structures.
M.B.Chandak, CSE-RCOEM, NAGPUR
Course: Gradation
• Three tests: T1, T2, T3 [all three may be attempted]
• Assignment: Types:
• 02 Marks : Class Assignment
• 02 Marks : Programming Assignment
• 02 Marks : Quiz/Programming Assignment
• 01 Mark : Class participation
• 03 Marks : Attendance
• Total : 30+10 = 40 Marks
• End Semester Question paper : Generally two questions with choice and 4
Questions with internal choice.
UNIT – I: Introduction [CO1]

Outcomes:
1. To understand the design complexity of language
processor.
2. To understand the functions of various phases of
compilation.
3. To understand allied concepts like cross
compilation, bootstrapping etc.



Motivation
• In the early days, software was written in assembly language and was
machine specific.
• No portability.
• Separate module for separate task [Assembler, Linker,
Loader].
• Software cost for operation increased.
• First compiler: FORTRAN, delivered in 1957
• Total 18 person-years to build.



Typical Compilation Process
Source program with macros
→ Preprocessor → Source program
→ Compiler → Target assembly program
→ Assembler → Relocatable machine code
→ Linker → Absolute machine code


Compiler
• A compiler acts as a translator, transforming human-oriented programming
languages into computer-oriented machine languages. [English – Machine]
• Hides machine-dependent details from the programmer
• A program that reads a program written in one language (source language) and
translates it into an equivalent program in another language (target language).
• Two components
• Understand the program (make sure it is correct)
• Rewrite the program in the target language.
• Traditionally, the source language is a high level language and the target
language is a low level language (machine code).
Source program → Compiler → Target program (the compiler also reports error messages)



Compilation process
• Compilation of a program proceeds through a fixed series of phases
• Each phase uses an (intermediate) form of the program produced by an earlier
phase. [Cascading effect]
• Subsequent phases operate on lower-level code representations. [Close to
system]
• Each phase may consist of a number of passes over the program
representation
• Pascal, FORTRAN, and C were designed for one-pass compilation, which
explains the need for function prototypes
• Single-pass compilers need less memory to operate
• Java, C++, and Ada are multi-pass



Two major operations
• Any compiler must perform two major tasks:
• Analysis of the source program
• Synthesis of a machine-language program



Block Schematic: Modern Compilers

[Figure: block schematic. Source program (character stream) → Scanner → tokens → Parser → syntactic structure → Semantic Routines → intermediate representation; then Intermediate Code Generator, Code Optimizer, and Code Generator → target machine code. Error recovery and the symbol and attribute tables are used by all phases of the compiler.]


Block Schematic

[Figure: compiler block schematic, repeated for the Scanner (LA) phase]

Scanner
• The scanner begins the analysis of the source program by reading the input,
character by character, and grouping characters into individual words and
symbols (tokens)
• RE (Regular Expression)
• NFA (Non-deterministic Finite Automaton)
• DFA (Deterministic Finite Automaton)
• LEX
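A scanner specification written as regular expressions is ultimately run as a DFA. The sketch below hand-codes one small DFA, for strings over {a, b} that end in abb, and simulates it one character at a time. The transition table and the names `TRANSITIONS` and `accepts` are illustrative assumptions, not the output of LEX.

```python
# Transition table of a DFA over the alphabet {a, b}.
# State 3 is reached exactly when the input read so far ends in "abb".
TRANSITIONS = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}
START_STATE = 0
ACCEPTING_STATES = {3}


def accepts(text):
    """Run the DFA over text; return True if it halts in an accepting state."""
    state = START_STATE
    for symbol in text:
        if symbol not in TRANSITIONS[state]:
            return False  # symbol outside the alphabet
        state = TRANSITIONS[state][symbol]
    return state in ACCEPTING_STATES


for sample in ["abb", "aabb", "babb", "ab", "abba"]:
    print(sample, accepts(sample))
```

Each input character costs one table lookup, which is why scanners are usually driven by a DFA rather than by an NFA.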


Application of Lexical Analyzer



Block Schematic

[Figure: compiler block schematic, repeated for the Parser phase]

Parser
• Given a formal syntax specification (typically as a context-free grammar
[CFG]), the parser reads tokens and groups them into units as specified by the
productions of the CFG being used.
• As syntactic structure is recognized, the parser either calls corresponding
semantic routines directly or builds a syntax tree.
• CFG (Context-Free Grammar)
• BNF (Backus-Naur Form)
• GAA (Grammar Analysis Algorithms)
• LL, LR, SLR, LALR Parsers
• YACC


Block Schematic

[Figure: compiler block schematic, repeated for the Semantic Routines phase]

Semantic Routines
• Perform two functions:
• Check the static semantics of each construct
• Do the actual translation
• The heart of a compiler
• Syntax Directed Translation
• Semantic Processing Techniques
• IR (Intermediate Representation)


Block Schematic

[Figure: compiler block schematic, repeated for the Optimizer phase]

Optimizer
• The IR code generated by the semantic routines is analyzed and transformed
into functionally equivalent but improved IR code
• This phase can be very complex and slow
• Peephole optimization
• Loop optimization, register allocation, code scheduling
• Local Optimization
• Register and Temporary Management


Block Schematic

[Figure: compiler block schematic, repeated for the Code Generator phase]

Code Generator
• Interpretive Code Generation
• Generating Code from Tree/DAG
• Grammar-Based Code Generator
• Generally in machine language, for better understanding of the course


Example:1
Scanner [Lexical Analyzer] → Tokens
→ Parser [Syntax Analyzer] → Parse tree
→ Semantic Process [Semantic Analyzer] → Abstract Syntax Tree w/ Attributes
→ Code Generator [Intermediate Code Generator] → Non-optimized Intermediate Code
→ Code Optimizer → Optimized Intermediate Code
→ Code Generation → Target machine code


Compiler Front-end / Back-end
Front end (analysis):
Source program (character stream)
→ Scanner (lexical analysis) → Tokens
→ Parser (syntax analysis) → Parse tree
→ Semantic Analysis → Abstract syntax tree or other intermediate form

Back end (synthesis):
Abstract syntax tree or other intermediate form
→ Intermediate Code Generation → Modified intermediate form
→ Code Optimization → Assembly or object code
→ Machine Specific Code Generation → Modified assembly or object code
Block diagram of compilation phases



Differences between Compiler and Interpreter
Compiler | Interpreter
Takes the entire program as input | Takes a single instruction as input
Intermediate object code is generated | No intermediate object code is generated
Optimization is possible and implementable | Optimization is very difficult
Conditional control statements execute faster | Conditional control statements execute slower
Memory requirement is more (since object code is generated) | Memory requirement is less
Program need not be compiled every time | The higher-level program is converted into a lower-level program every time
Errors are displayed after the entire program is checked | Errors are displayed for every instruction interpreted (if any)

Refer to services.msc to check what happens when compiler and interpreter services are disabled,
both while the program is being coded and after it has been coded and compiled.



Phases Functionalities



Lexical Analyzer
• Lexical analysis breaks up a program into tokens (lexemes)
• Grouping characters into non-separable units (tokens)
• Changing a stream of characters into a stream of tokens

Source program:
program gcd (input, output);
var i, j : integer;
begin
  read (i, j);
  while i <> j do
    if i > j then i := i - j else j := j - i;
  writeln (i)
end.

Token stream:
program  gcd  (  input  ,  output  )  ;
var  i  ,  j  :  integer  ;  begin
read  (  i  ,  j  )  ;  while  i  <>  j  do
if  i  >  j  then  i  :=  i  -  j
else  j  :=  j  -  i  ;  writeln  (  i  )  end  .

Comment on the kinds of errors reported by the lexical analyzer.
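The grouping of characters into tokens shown above can be sketched with Python's `re` module. This is a minimal illustration under stated assumptions: the token classes in `TOKEN_SPEC`, the `KEYWORDS` set, and the function name `tokenize` are hypothetical choices for this example, not a complete Pascal scanner.

```python
import re

# Hypothetical token classes for the Pascal-like gcd program.
# Order matters: ":=" and "<>" must be tried before ":", "<" and ">".
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OP", r":=|<>|<=|>=|[-+*/<>=]"),
    ("PUNCT", r"[();:,.]"),
    ("SKIP", r"\s+"),
    ("ERROR", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))
KEYWORDS = {"program", "var", "begin", "end", "while", "do", "if", "then", "else"}


def tokenize(source):
    """Return a list of (kind, lexeme) pairs; raise on an illegal character."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, lexeme = match.lastgroup, match.group()
        if kind == "SKIP":
            continue  # whitespace separates tokens but is not a token
        if kind == "ERROR":
            raise SyntaxError(f"illegal character {lexeme!r} at offset {match.start()}")
        if kind == "IDENT" and lexeme in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens


GCD_SOURCE = """program gcd (input, output);
var i, j : integer;
begin
  read (i, j);
  while i <> j do
    if i > j then i := i - j else j := j - i;
  writeln (i)
end.
"""

for kind, lexeme in tokenize(GCD_SOURCE):
    print(kind, lexeme)
```

The only error this sketch can report is a character that starts no token (for example `?`), which is typical of the errors a lexical analyzer can detect on its own.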



Syntax Analyzer
• Grammatical check of tokens.
• A syntax error is produced by the compiler when the program does not meet
the grammatical specification.
• For grammatically correct program, this phase generates an internal
representation that is easy to manipulate in later phases
• Typically a syntax tree (also called a parse tree).
• The grammar of a programming language is typically described by a
context-free grammar, which can be used to define the structure of the parse
tree.



Syntax Analyzer: Parser: Parse Tree
• The syntax defines the syntactic categories for language constructs
• Statements
• Expressions
• Declarations
• Categories are subdivided into more detailed categories
• A Statement is a
• For-statement
• If-statement
• Assignment
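To make the idea of syntactic categories concrete, here is a minimal recursive-descent parser for a single category, the assignment statement, which builds a syntax tree as nested tuples. The grammar (assignment → identifier = expr; expr → term {(+|-) term}; term → factor {(*|/) factor}; factor → identifier | number | ( expr )) and the names `Parser` and `tokenize` are assumptions made for this sketch.

```python
import re


def tokenize(text):
    # Identifiers, integers, and single-character symbols; other characters are skipped.
    return re.findall(r"[A-Za-z_]\w*|\d+|[-+*/=()]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self, expected=None):
        token = self.peek()
        if token is None or (expected is not None and token != expected):
            raise SyntaxError(f"expected {expected!r}, found {token!r}")
        self.pos += 1
        return token

    def assignment(self):
        target = self.take()
        if not (target[0].isalpha() or target[0] == "_"):
            raise SyntaxError(f"cannot assign to {target!r}")
        self.take("=")
        tree = ("=", target, self.expr())
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()!r}")
        return tree

    def expr(self):
        node = self.term()
        while self.peek() in ("+", "-"):
            operator = self.take()
            node = (operator, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "/"):
            operator = self.take()
            node = (operator, node, self.factor())
        return node

    def factor(self):
        token = self.take()
        if token == "(":
            node = self.expr()
            self.take(")")
            return node
        if token[0].isalpha() or token[0] == "_" or token.isdigit():
            return token
        raise SyntaxError(f"unexpected token {token!r}")


def parse(text):
    return Parser(tokenize(text)).assignment()


print(parse("pos = init + rate * 60"))
```

Because `term` is called from `expr`, multiplication binds tighter than addition, so `rate * 60` becomes a subtree of the `+` node.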



Semantic Analysis/SDTS
• Semantic analysis is applied by a compiler to discover the meaning of
a program by analyzing its parse tree or abstract syntax tree.
• A program without grammatical errors may not always be correct
program.
• pos = init + rate * 60
• What if pos is a char while init and rate are integers?
• Errors of this kind are not reported by the parser.
• Semantic analysis reports such errors and ensures that the program has a
defined meaning.
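The `pos = init + rate * 60` case can be sketched as a static type check over a syntax tree. The tree encoding (nested tuples), the symbol table contents, and the strict rule that both sides of an operator or assignment must have the same type are assumptions for illustration; a real language such as C would instead apply implicit conversions here.

```python
def expression_type(node, symbols):
    """Return the type of an expression tree, or raise on a static semantic error."""
    if isinstance(node, tuple):
        operator, left, right = node
        left_type = expression_type(left, symbols)
        right_type = expression_type(right, symbols)
        if left_type != right_type:
            raise TypeError(f"operands of {operator!r} have types {left_type} and {right_type}")
        return left_type
    if node.isdigit():
        return "int"  # integer literal
    if node not in symbols:
        raise NameError(f"identifier {node!r} used before declaration")
    return symbols[node]


def check_assignment(target, expression, symbols):
    """Check 'target = expression' against the declared types in symbols."""
    if target not in symbols:
        raise NameError(f"identifier {target!r} used before declaration")
    value_type = expression_type(expression, symbols)
    if symbols[target] != value_type:
        raise TypeError(f"cannot assign {value_type} to {target!r} of type {symbols[target]}")


# Syntax tree for: pos = init + rate * 60
expression = ("+", "init", ("*", "rate", "60"))
symbols = {"pos": "char", "init": "int", "rate": "int"}

try:
    check_assignment("pos", expression, symbols)
except TypeError as error:
    print("semantic error:", error)
```

The statement is grammatically valid, so a parser accepts it; only the walk over the tree with the symbol table exposes the mismatch.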

• C++: Semantically strong language?



Types of Semantic Checks
• Static semantic checks (done by the compiler) are performed at compile time
• Type checking
• Every variable is declared before it is used
• Identifiers are used in appropriate contexts
• Check subroutine call arguments
• Check labels
• Dynamic semantic checks are performed at run time, and the compiler
produces code that performs these checks
• Array subscript values are within bounds
• Arithmetic errors, e.g. division by zero
• Pointers are dereferenced only when pointing to a valid object
• Variables are not used before they have been initialized
• When a check fails at run time, an exception is raised



Semantic Analysis
• A language is “strongly typed” if (type) errors are always detected.
• Errors are either detected at compile time or at run time
• Languages that are strongly typed are Ada, Java, ML, Haskell
• Languages that are not strongly typed are Fortran, Pascal, C/C++, Lisp
• Strong typing makes language safe and easier to use, but potentially slower
because of dynamic semantic checks
• In some languages, most (type) errors are detected late at run time which is
detrimental to reliability e.g. early Basic, Lisp, Prolog, some script languages
• Role of Semantic Analysis in Search Engine Optimization
• Book the ticket
• Reading book



Intermediate Code Generator
• Conversion of the parse tree into intermediate code.
• Various forms of intermediate code: quadruples, triples, indirect
triples, etc.
• Temporary variables are used in the representation.
• Proper use of data structures is a key factor.
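As an illustration of the quadruple form, the sketch below walks a syntax tree for `pos = init + rate * 60` and emits one quadruple (operator, argument 1, argument 2, result) per operator, introducing temporaries `t1`, `t2`, … for intermediate results. The tuple-based tree, the temporary naming, and the function name `to_quadruples` are assumptions for this example.

```python
from itertools import count


def to_quadruples(target, expression):
    """Translate 'target = expression' into a list of quadruples."""
    quads = []
    temp_ids = count(1)

    def emit(node):
        # A leaf (identifier or literal) is its own "place".
        if not isinstance(node, tuple):
            return node
        operator, left, right = node
        left_place = emit(left)
        right_place = emit(right)
        result = f"t{next(temp_ids)}"  # fresh temporary for this operator
        quads.append((operator, left_place, right_place, result))
        return result

    quads.append(("=", emit(expression), None, target))
    return quads


for quad in to_quadruples("pos", ("+", "init", ("*", "rate", "60"))):
    print(quad)
```

A list of fixed-width tuples is easy to scan, reorder, and rewrite, which is what the later optimization phase relies on.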



Code Optimization
• Purpose:
• To improve efficiency of code.
• To reduce time required for execution.
• Types
• Local Optimization
• Loop Optimization
• Peep-hole Optimization
• Role of data structures and their memory implementation is important
[Trees/Graphs]
• Optimization:
• Machine independent
• Machine dependent
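One simple machine-independent local optimization is constant folding: an operation whose operands are both known at compile time is evaluated by the compiler and removed from the code. The sketch below applies it to a list of quadruples. It assumes that temporaries are named `t1`, `t2`, … and are assigned exactly once; the function names are hypothetical.

```python
import operator

ARITHMETIC = {"+": operator.add, "*": operator.mul}


def is_int_literal(operand):
    return isinstance(operand, str) and operand.isdigit()


def is_temporary(name):
    # Assumed naming convention: compiler temporaries are t1, t2, ...
    return name.startswith("t") and name[1:].isdigit()


def fold_constants(quads):
    """Evaluate constant sub-expressions and substitute their values."""
    known = {}      # temporary name -> constant value (as a string)
    optimized = []
    for op, arg1, arg2, result in quads:
        arg1 = known.get(arg1, arg1)
        arg2 = known.get(arg2, arg2)
        foldable = op in ARITHMETIC and is_int_literal(arg1) and is_int_literal(arg2)
        if foldable and is_temporary(result):
            known[result] = str(ARITHMETIC[op](int(arg1), int(arg2)))
        else:
            optimized.append((op, arg1, arg2, result))
    return optimized


# Quadruples for: pos = init + rate * (6 * 10)
before = [
    ("*", "6", "10", "t1"),
    ("*", "rate", "t1", "t2"),
    ("+", "init", "t2", "t3"),
    ("=", "t3", None, "pos"),
]
for quad in fold_constants(before):
    print(quad)
```

The multiplication `6 * 10` disappears from the generated code, so it costs nothing at run time.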



Code Generation
• Purpose:
• To convert optimized code into machine code.
• Depends upon machine architecture.
• For learning purposes, assembly language code will be used.
• Example:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
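The listing above can be reproduced from a syntax tree by a very small code generator. This sketch assumes a two-address instruction format in which `OP src, dst` means `dst := dst OP src`, an unlimited supply of registers R1, R2, …, and the tree `id1 = id2 + id3 * 60.0`; the function names and the opcode table beyond MOVF, MULF, and ADDF are hypothetical.

```python
OPCODES = {"+": "ADDF", "-": "SUBF", "*": "MULF", "/": "DIVF"}


def operand(leaf):
    """Format a leaf: numeric literals become immediates (#60.0), names stay as-is."""
    try:
        float(leaf)
    except ValueError:
        return leaf
    return f"#{leaf}"


def gen_expr(node, reg, code):
    """Emit code that leaves the value of node in register R<reg>.

    Only registers numbered reg or higher are written.
    """
    if not isinstance(node, tuple):
        code.append(f"MOVF {operand(node)}, R{reg}")
        return
    op, left, right = node
    if not isinstance(right, tuple):
        gen_expr(left, reg, code)
        code.append(f"{OPCODES[op]} {operand(right)}, R{reg}")
        return
    if isinstance(left, tuple):
        gen_expr(left, reg, code)        # left may use registers >= reg
        gen_expr(right, reg + 1, code)   # right stays clear of R<reg>
    else:
        gen_expr(right, reg + 1, code)
        gen_expr(left, reg, code)        # a single MOVF; R<reg+1> is untouched
    code.append(f"{OPCODES[op]} R{reg + 1}, R{reg}")


def gen_assignment(target, expression):
    code = []
    gen_expr(expression, 1, code)
    code.append(f"MOVF R1, {target}")
    return code


for line in gen_assignment("id1", ("+", "id2", ("*", "id3", "60.0"))):
    print(line)
```

A real code generator must also cope with a limited number of registers, which is where register allocation and assignment come in.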



Summary
• Compiler front-end: lexical analysis, syntax analysis, semantic analysis
• Tasks: understanding the source code, making sure the source code is written
correctly
• Compiler back-end: Intermediate code generation/improvement, and
Machine code generation/improvement.
• Tasks: translating the program into a semantically equivalent program (in a
different language).



Questions
• Explain the various phases of compilation.
• List open source tools for the various phases of compilation.
• Name the file in which the keywords of the “C” language are stored, with its location and
structure.
• Is C++ a semantically strong language? Justify.
• State any five rules for designing a lexical analyzer. For example: “?” and “!” symbols are not
considered valid tokens.
• List the advantages of late and early binding approaches.
• How can one decide whether a compiler is one-pass or two-pass? Give any two rules.
• How are the front-end and back-end components of a software product classified?
• Role of Lexical Analyzer in image processing.
• Role of Semantic Analyzer in Search engine optimization.
Course Curriculum
UNIT-I
Introduction to Compilers- Compilers and translators, Phases of compiler design, cross compiler, Bootstrapping, Design of Lexical analyser, LEX.
UNIT-II
Syntax Analysis- Specification of syntax of programming languages using CFG, Top-down parser, design of LL(1) parser, bottom up parsing technique,
LR parsing, Design of SLR, CLR, LALR parsers, YACC.
UNIT-III
Syntax directed translation- Study of syntax directed definitions & syntax directed translation schemes, implementation of SDTS, intermediate
notations- postfix, syntax tree, TAC, translation of expressions, controls structures, declarations, procedure calls, Array reference.
UNIT-IV
Storage allocation & Error Handling- Run time storage administration stack allocation, symbol table management, Error detection and recovery- lexical,
syntactic and semantic.
UNIT-V
Code optimization- Important code optimization techniques, loop optimization, control flow analysis, data flow analysis, Loop invariant computation,
Induction variable removal, Elimination of Common sub expression.
UNIT-VI
Code generation – Problems in code generation, Simple code generator, Register allocation and assignment, Code generation from DAG, Peephole
optimization.
Web resource: www.mbchandak.com
TEXTBOOKS
Aho, Lam, Sethi, and Ullman; Compilers: Principles, Techniques, and Tools; Second Edition, Pearson Education, 2008.
Alfred V. Aho and Jeffrey D. Ullman; Principles of Compiler Design; Narosa Pub. House, 1977.
Vinu V. Das; Compiler Design using Flex and Yacc; PHI Publication, 2008.
TOC concepts for Compiler Design
Design of Lexical Analyzer



Concepts and Notations
• Set: An unordered collection of unique elements
S1 = { a, b, c }   S2 = { 0, 1, …, 19 }   empty set: ∅
membership: x ∈ S   union: S1 ∪ S2 = { a, b, c, 0, 1, …, 19 }
universe of discourse: U   subset: S1 ⊂ U
complement: if U = { a, b, …, z }, then S1' = { d, e, …, z } = U - S1
• Alphabet: A finite set of symbols
• Examples:
• Character sets: ASCII, ISO-8859-1, Unicode
• Σ1 = { a, b }   Σ2 = { Spring, Summer, Autumn, Winter }
• String: A sequence of zero or more symbols from an alphabet
• The empty string: ε
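The set notation above maps directly onto Python's built-in `set` type, and strings over an alphabet can be enumerated with `itertools.product`. The variable names follow the slide; the helper `strings_up_to` is a hypothetical name introduced for this sketch.

```python
import string
from itertools import product

S1 = {"a", "b", "c"}
S2 = set(range(20))                # { 0, 1, ..., 19 }
U = set(string.ascii_lowercase)    # universe of discourse { a, b, ..., z }

print("a" in S1)                   # membership
print(len(S1 | S2))                # size of the union
print(S1 < U)                      # proper subset
print(sorted(U - S1))              # complement of S1 with respect to U


def strings_up_to(alphabet, max_length):
    """All strings over alphabet with at most max_length symbols, shortest first."""
    result = []
    for length in range(max_length + 1):
        for symbols in product(alphabet, repeat=length):
            result.append("".join(symbols))
    return result


# The first entry is the empty string.
print(strings_up_to("ab", 2))
```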
Concepts and Notations
• Language: A set of strings over an alphabet
• Also known as a formal language; may not bear any resemblance to a natural language, but
could model a subset of one.
• The language comprising all strings over an alphabet Σ is written as: Σ*
• Graph: A set of nodes (or vertices), some or all of which may be connected by edges.
• Examples: [Figures: a graph on nodes 1, 2, 3 and a directed graph on nodes a, b, c]
Regular Expression
• A regular expression (RE) is a tool for describing a language as an expression.
• An RE uses primitive operators to describe the language.
• The three operators used are: Union (+), Concatenation, Kleene Star (*).
• Examples:
• (0 + 10*): L = { 0, 1, 10, 100, 1000, 10000, … }
• (0*10*): L = { 1, 01, 10, 010, 0010, … }
• (aa)*(bb)*b: set of strings consisting of an even number of a’s followed by an odd number of b’s,
so L = { b, aab, aabbb, aabbbbb, aaaab, aaaabbb, … }
• (a + b)*abb: set of strings of a’s and b’s ending with the string abb,
so L = { abb, aabb, babb, aaabb, ababb, … }
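The example languages can be checked mechanically by enumerating short strings and testing each one with `re.fullmatch`. In Python's regex syntax the union operator is written `|` rather than `+`; the helper name `language` and the length bounds are assumptions for this sketch.

```python
import re
from itertools import product


def language(pattern, alphabet, max_length):
    """Strings over alphabet, up to max_length symbols, that the pattern matches entirely."""
    regex = re.compile(pattern)
    matches = []
    for length in range(max_length + 1):
        for symbols in product(alphabet, repeat=length):
            candidate = "".join(symbols)
            if regex.fullmatch(candidate):
                matches.append(candidate)
    return matches


print(language(r"0|10*", "01", 3))        # 0 + 10*
print(language(r"0*10*", "01", 2))        # 0*10*
print(language(r"(aa)*(bb)*b", "ab", 3))  # even a's followed by odd b's
print(language(r"(a|b)*abb", "ab", 4))    # ends with abb
```

`fullmatch` is used instead of `search` because membership in a language requires the whole string to match, not just a substring.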



Regular Expression in Python: Examples
Tokenization example
>>> import re
>>> string = 'India scored 225 runs and Sachin scored 125'
>>> p = re.compile(r'\w+')   # one or more word characters
>>> p.findall(string)
['India', 'scored', '225', 'runs', 'and', 'Sachin', 'scored', '125']
How to find multiple numeric values from the string
>>> string = 'India scored 225 runs and Sachin scored 125'
>>> p = re.compile(r'\d+')   # one or more digits
>>> p.findall(string)
['225', '125']
Regular Expression in Python: Examples
Matching single digits and pairs of digits
>>> string = 'India scored 225 runs and Sachin scored 125'
>>> p = re.compile(r'\d')    # exactly one digit
>>> p.findall(string)
['2', '2', '5', '1', '2', '5']
>>> p = re.compile(r'\d\d')  # two consecutive digits, non-overlapping
>>> p.findall(string)
['22', '12']
Finding named-entity candidates (capitalized words)
>>> p = re.compile(r'[A-Z][a-z]*')
>>> p.findall(string)
['India', 'Sachin']
Regular Expression in Python: Examples
Split function:
>>> s = 'One 1, Two 2, Three 3'
>>> re.split(r'\d+', s)   # split on runs of digits
['One ', ', Two ', ', Three ', '']
>>> re.split(r'\D+', s)   # split on runs of non-digits
['', '1', '2', '3']
Split email address
>>> string = 'chandakmb@gmail.com'
>>> re.split('@', string)
['chandakmb', 'gmail.com']



Regular Expression in Python: Examples
Regular expression for first name and last name
>>> string = 'manoj-chandak'
>>> re.split('-', string)
['manoj', 'chandak']
>>> # (?<=-) is a lookbehind: match only text that follows a hyphen
>>> m = re.search(r'(?<=-)\d+', '0712-2580011')
>>> m.group(0)
'2580011'
>>> m = re.search(r'(?<=-)\w+', 'CSE-RKNEC')
>>> m.group(0)
'RKNEC'
Regular Expression in Python: Examples
Substitute function
>>> import re
>>> s = 'North East Road'
>>> re.sub(r'Road$', 'RD', s)   # $ anchors the match at the end of the string
'North East RD'

