UNIT - I

OVERVIEW OF LANGUAGE PROCESSING SYSTEM
Preprocessor
• A preprocessor produces input to compilers.
• It may perform the following functions.
1. Macro processing: A preprocessor may allow a user to
define macros that are short hands for longer constructs.
2. File inclusion: A preprocessor may include header files
into the program text.
3. Language Extensions: A preprocessor may attempt to add
capabilities to the language by means of built-in macros.
ASSEMBLER
• Programmers found it difficult to write or read programs in
machine language.
• They began to use a mnemonic (symbol) for each machine
instruction, which they would subsequently translate into
machine language. Such a mnemonic machine language is now
called an assembly language.
• Programs known as assemblers were written to automate the
translation of assembly language into machine language. The
input to an assembler is called the source program; the
output is a machine language translation (object program).
• A compiler may generate assembly language as
its target language, and an assembler then finishes
the translation into object code.
Phases of a Compiler – Example
• Input statement : x = y + z * 10
Lexical Analysis
Lexeme    Token
x         Identifier – id1
=         Assignment Operator
y         Identifier – id2
+         Addition Operator
z         Identifier – id3
*         Multiplication Operator
10        Number – num
Output : id1, assign opr, id2, add opr, id3, mul opr, num – Tokens
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
Code Optimization
t1 = id3 * 10        – Optimized Code
id1 = id2 + t1
Code Generation
MOV id3, r1          – Target Program (Mnemonic Instructions)
MUL #10, r1
MOV id2, r2
ADD r1, r2
MOV r2, id1
Error Handler
• All the phases of the compiler report errors to the
error handler.
• The major part of the errors is reported by the
analysis phases.
1. Lexical Analysis – Lexical Errors
2. Syntax Analysis – Syntax Errors
3. Semantic Analysis – Semantic Errors
• The error handler displays the appropriate
error message to the user.
Major Data structures used in Compilers
• THE SYMBOL TABLE
– Keeps information associated with identifiers: functions, variables,
constants, and data types
– Interacts with almost every phase of the compiler
– Access operations need to be constant-time
– Used to store and retrieve data in the form of attribute values
– Collection of records and fields
• TEMPORARY FILES
– Hold the products of intermediate steps during compiling
– Solve the problem of memory constraints, or of back-patching
addresses during code generation
Other Issues in Compiler
Structure
Analysis and Synthesis
• The analysis part of the compiler analyzes the source program
to compute its properties
– Lexical analysis, syntax analysis and semantics analysis, as
well as optimization
– More mathematical and better understood
• The synthesis part of the compiler produces the translated
codes
– Code generation, as well as optimization
– More specialized
• The two parts can be changed independently of each other
Front End and Back End
• The operations of the front end depend on the source language
– The scanner, parser, and semantic analyzer, as well as
intermediate code synthesis
• The operations of the back end depend on the target language
– Code generation, as well as some optimization analysis
• The intermediate representation is the medium of
communication between them
• This structure is important for compiler portability
Passes
• The repetitions needed to process the entire source program before
generating code are referred to as passes.
• Passes may or may not correspond to phases
– A pass often consists of several phases
– A compiler can be one pass, which results in efficient
compilation but less efficient target code
– Most compilers with optimization use more than one pass
• One Pass for scanning and parsing
• One Pass for semantic analysis and source-level
optimization
• The third Pass for code generation and target-level
optimization
Bootstrapping and Porting
A compiler is usually available in
• Machine language
– the compiler can execute immediately;
• Another language with an existing compiler on the same target
machine (First Scenario)
– compile the new compiler with the existing compiler;
• Another language with an existing compiler on a different machine
(Second Scenario)
– compilation produces a cross compiler
T-Diagram Describing Complex Situations
[T-diagram : a compiler written in implementation language H translating
source language S to target language T]
• Original compiler
• Compiler source code retargeted to K
• Result : a cross compiler
Lexical Analysis
• Example : input string/statement :
int amt = 10;
Then,
Lexeme    Token                  Pattern
int       Keyword                Data type
amt       Identifier             letter(letter|digit)*
=         Assignment Operator    Operator
10        Number                 digit(digit)*
• Lexeme : It is a sequence of characters in the source
program that is matched by the pattern for a token.
Ex : int, a, printf
• Token : A lexical token is a sequence of characters that
can be treated as a unit in the grammar of the
programming language.
• A Token is a pair consisting of token name and an
optional attribute value.
Ex : identifier
• Pattern : Tokens are described by a rule (regular
expression) called a pattern.
Ex : identifier – id = letter(letter|digit)*
• Regular Expressions are usually used to specify patterns.
Specification of Tokens
• Lexical analysis produces tokens for the strings
in the input and passes them as output to syntax
analysis.
• Ex : if (EXP)
{
STMT
}
• Here, “ if ” is a string in the above input
code/program, and the token “ Keyword ” will be
produced for it.
• Alphabet Set : It is a finite non-empty set of symbols,
denoted by ‘ Σ ’.
Ex : Σ = {0,1} – the binary alphabet; Σ = {a,b}
• Symbol : It is a smallest abstract entity that cannot be
divided further, that can be represented but cannot be
defined.
Ex : 0, 1, a, b
• String : A string is a finite sequence of symbols on an
alphabet set denoted by ‘ w’.
Ex : Binary alphabet set – {0,1}
Strings : w = 001, 101, 110
Length of string - | w | = | 001 | = 3
• Prefix : A collection of leading symbols of a string is called a prefix.
Ex : w = 001, then its prefixes are 0, 00, 001
• Proper prefix : Any prefix of a string other than the string itself is
called a proper prefix.
Ex : w = 001, then its proper prefixes are 0, 00
• Suffix : A collection of trailing symbols of a string is called a suffix.
Ex : w = 001, then its suffixes are 1, 01, 001
• Proper suffix : Any suffix of a string other than the string itself is
called a proper suffix.
Ex : w = 001, then its proper suffixes are 1, 01
• Substring : It is a contiguous sequence of characters from a
string – part of a string.
Ex : w = 00101, substrings are 0, 01, 101, 010
• Subsequence : A subsequence of a string is a part of the string
that need not be contiguous.
• Language : It is a set of strings over some alphabet set.
Ex : L = { 0, 01, 001, 00001.....} over { 0,1}
Operations on Languages :
• Union : If L1 and L2 are languages, then L1 ∪ L2 is also a language.
L1 = {00, 01}, L2 = {10, 11}, then L1 ∪ L2 = {00, 01, 10, 11}
• Concatenation : L1L2 = {0010, 0011, 0110, 0111}
• Closure (Kleene) : L* = L^0 ∪ L^1 ∪ L^2 ∪ ... = ∪ L^i, i ≥ 0
• Regular Grammar describes the lexical syntax (the tokens) of
programming languages.
• Regular Grammar and Regular Language are represented using
finite automata.
• Regular Expression : It is used to describe/denote the Regular Language
and Regular Set.
• Regular Expressions are used to describe tokens of a programming
language.
• The language denoted by a RE is called a Regular Set.
• An algebraic way of expressing this notation is called a RE.
• A RE is defined as follows :
1. Ø is a RE denoting the empty language.
2. ε (epsilon) is a RE denoting the language containing only the empty string.
3. If a ∈ Σ, then a is a RE denoting the language {a}.
4. If R and S are REs denoting the languages LR and LS, then
i. R+S is a RE for the language LR ∪ LS
ii. RS is a RE for the language LRLS
iii. R* is a RE for the language (LR)*
Algebraic Laws on Regular Expressions
Let R, S, T denote Regular Expressions.
• Commutative Law : R|S = S|R
• Associative Law : R|(S|T) = (R|S)|T
R(ST) = (RS)T
• Distributive Law : R(S|T) = RS|RT
(S|T)R=SR|TR
• Identity Law : εR = Rε = R
• R* = (R|ε)*
• Idempotent Law : (R*)* = R*
• Regular Definition : For notational
convenience, Regular Expressions are denoted
by ‘Regular Definitions’.
• Regular Expressions are used to specify Tokens
Token/Regular Definition    Regular Expression
letter                      A|B|...|Z|a|b|...|z  or  [A-Za-z]
digit                       0|1|...|9  or  [0-9]
id                          letter(letter|digit)*
Recognition of Tokens :
Ex : Piece of code in a high-level language for an ‘if - else’
statement :
if ( EXP )
{
STMT;
}
else
{
STMT;
}
• Grammar behind it :
S -> iEtS | iEtSeS | ε
E -> T relop T | T
T -> id | num
• Terminals are i, t, e, relop, id, num – strings.
• Strings represent tokens and tokens are represented by Regular
Expressions / Patterns.
• Regular Expressions are represented by Finite Automata
digit  -> [0-9]
digits -> digit+
number -> digits(.digits)?(E[+-]?digits)?
letter -> [A-Za-z_]
id     -> letter(letter|digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
We also need to handle whitespace:
ws -> (blank | tab | newline)+
Transition diagram for token ‘relop’
Transition diagram for reserved words and identifiers
Transition diagram for unsigned numbers
Transition diagram for whitespace
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) { /* repeat character processing until a
                   return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: …
        …
        case 8:
            retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}
Role of Lexical Analyzer
[Diagram : source program -> Lexical Analyzer -> token -> Parser -> to
semantic analysis; the Parser requests each token via getNextToken;
both the Lexical Analyzer and the Parser consult the Symbol table]
Tokens, Patterns, Lexemes
• A token is a pair of a token name and an optional token
(attribute) value.
• A token is a sequence of characters that can be treated as a
single logical entity. Typical tokens are
1) identifiers 2) keywords 3) operators 4) special symbols
5) constants
• A pattern is a rule that describes the token.
• A lexeme is a sequence of characters in the source program
that matches the pattern for a token
Ex :
Token         Informal description                 Sample lexemes
if            characters i, f                      if
else          characters e, l, s, e                else
comparison    < or > or <= or >= or == or !=       <=, !=
• Two pointers track the current lexeme in the input " C >= B ; " :
lexeme-begin (lb) marks its start and forward (fp) scans ahead.
text = "C>" – no token matched, so fp retracts
text = "C"  – lexeme matched; its token (variable) is passed to the
parser, and lb moves up to fp for the next lexeme
• Sometimes the lexical analyzer needs to look ahead
some symbols to decide which token to return.
• In the C language we need to look after -, = or < to decide
what token to return.
• We need to introduce a two-buffer scheme to handle large
look-aheads safely.
| E = M * C * * 2 | eof |
Sentinels (an eof marks the end of each buffer half) :
| E = M | eof | * C * * 2 eof | eof |
Lexical Analyzer Generator - LEX
Lex source program (lex.l)  ->  Lex compiler  ->  lex.yy.c
lex.yy.c                    ->  C compiler    ->  a.out
Input stream                ->  a.out         ->  sequence of tokens
Syntax – Structure of a LEX Program
declarations
%%
translation rules   – each rule : Regular Expression (pattern)  { Action }
%%
auxiliary functions
Example
%{
/* definitions of manifest constants
LT, LE, EQ, NE, GT, GE,
IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
…

%%

int installID() {/* function to install the lexeme, whose first
character is pointed to by yytext, and whose length is yyleng,
into the symbol table and return a pointer thereto */
}

int installNum() {/* similar to installID, but puts numerical
constants into a separate table */
}
References
1. Kenneth C. Louden, “Compiler Construction: Principles and
Practice”, Cengage Learning.
2. John R. Levine, “Lex & Yacc”, O’Reilly Publishers.
3. Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey D. Ullman,
“Compilers: Principles, Techniques & Tools”, Pearson Education,
2nd Edition, 2013.