INTRODUCTION TO COMPILERS
TRANSLATORS-COMPILATION AND INTERPRETATION
TRANSLATOR
A translator is a program that takes as input a program written in one
language and produces as output a program in another language. Besides
program translation, the translator performs another very important role:
error detection. Any violation of the HLL (High Level Language)
specification is detected and reported to the programmer.
Important roles of a translator are:
Translating the HLL program input into an equivalent machine language program.
Providing diagnostic messages wherever the programmer violates the
specification of the HLL.
A translator or language processor is a program that translates an
input program written in a programming language into an equivalent
program in another language.
[Figure: Execution in a translator — Source Code → Translator → Target Code]
Types of Translators:
a. Interpreter
b. Assembler
c. Compiler
1.1.1.a INTERPRETER
An interpreter is a program that appears to execute a source program
as if it were machine language. It is one of the translators that translate
high level language to low level language.
[Figure: Execution in an interpreter — Source Program and Data → Interpreter → Program Output]
During execution, the interpreter checks the program line by line for
errors. Languages such as BASIC, SNOBOL and LISP can be translated using
interpreters. Java also uses an interpreter. The process of interpretation
can be carried out in the following phases.
1. Lexical analysis
2. Syntax analysis
3. Semantic analysis
4. Direct Execution
Example: BASIC , Lower Version of Pascal, SNOBOL, LISP & JAVA
Advantages:
Modification of the user program can easily be made and implemented as
execution proceeds.
The types of objects that a variable denotes may change dynamically.
Debugging a program and finding errors is a simpler task for an
interpreted program.
The interpreter for the language makes it machine independent.
Disadvantages:
The execution of the program is slower.
Memory consumption is more.
1.1.1.b. ASSEMBLER
Programmers found it difficult to write or read programs in machine
language. They began to use a mnemonic (symbol) for each machine
instruction, which they would subsequently translate into machine
language. Such a mnemonic machine language is now called an assembly
language. Programs known as assemblers were written to automate the
translation of assembly language into machine language. The input to an
assembler program is called the source program; the output is a machine
language translation (object program).
It translates assembly level language to machine code.
[Figure: Assembly Language → Assembler → Machine Code]

1.1.1.c COMPILER
A compiler is a program that converts a source program written in a high-level
language into an equivalent target program in a lower-level language, typically
machine code. Widely used compilers include:
Ada compilers
ALGOL compilers
BASIC compilers
C# compilers
C compilers
C++ compilers
COBOL compilers
D compilers
Common Lisp compilers
Fortran compilers
Java compilers
Pascal compilers
PL/I compilers
Python
Difference between Compiler and Interpreter

1. Compiler: works on the complete program at once; it takes the entire program as input.
   Interpreter: works line-by-line; it takes one statement at a time as input.
2. Compiler: generates intermediate code, called object code or machine code.
   Interpreter: does not generate intermediate object code or machine code.
3. Compiler: executes conditional control statements (like if-else and switch-case) and logical constructs faster than an interpreter.
   Interpreter: executes conditional control statements at a much slower speed.
4. Compiler: compiled programs take more memory because the entire object code has to reside in memory.
   Interpreter: does not generate intermediate object code; as a result, interpreted programs are more memory efficient.
5. Compiler: compile once and run anytime; a compiled program does not need to be compiled every time.
   Interpreter: interpreted programs are interpreted line-by-line every time they are run.
6. Compiler: does not allow a program to run until it is completely error-free.
   Interpreter: runs the program from the first line and stops execution only if it encounters an error.
7. Compiler: compiled languages are more efficient but difficult to debug.
   Interpreter: interpreted languages are less efficient but easier to debug; this makes such languages an ideal choice for new students.
8. Compiler examples: C, C++, COBOL.
   Interpreter examples: BASIC, Visual Basic, Python, Ruby, PHP, Perl, MATLAB, Lisp.
2. Compiler :
It converts the source program (HLL) into target program (LLL).
3. Assemblers :
It converts an assembly language (LLL) into machine code. Some
compilers produce assembly for further processing. Other compilers perform
the job of the assembler, producing relocatable machine code that can be
passed directly to the loader/link-editor.
Assembly code is a mnemonic version of the machine code, in which
names are used instead of binary codes for operation and names are also
given to memory addresses. A typical sequence of assembly instructions
might be
MOV a, R1      ; load a into register R1
ADD #2, R1     ; add the constant 2 to R1
MOV R1, b      ; store the result in b (computes b := a + 2)
4. Loader and Link Editors :
Loader :
The process of loading consists of taking relocatable machine code,
altering the relocatable addresses and placing the altered instructions and
data in memory at the proper locations. The Link-editor allows us to make a
single program from several files of relocatable machine code. These files
may have been the result of several different compilations, and one or more
may be library files of routines provided by the system and available to any
program that needs them.
Link Editor :
It allows us to make a single program from several files of relocatable
machine code.
Software Tools
Many software tools that manipulate source program first perform some
kind of analysis. Some examples of such tools include:
1. STRUCTURE EDITORS: A structure editor takes as input a sequence
of commands to build a source program. The structure editor not only
performs the text creation and modification functions of an ordinary
text editor but it also analyses the program text. For example, it can
check that the input is correctly formed, can supply keywords
automatically, etc. The output of such an editor is often similar to the
output of the analysis phase of a compiler.
2. PRETTY PRINTERS: A Pretty Printer analyzes a program and prints
it in such a way that the structure of the program becomes clearly
visible. For example, comments may appear in a special font, and
statements may appear with the amount of indentation proportional
to the depth of the nesting.
3. STATIC CHECKERS: A Static Checker reads a program, analyzes it
and attempts to discover potential bugs without running the program.
The analysis portion is often similar to that found in optimizing
compilers. For example, a static checker may detect that parts of the
source program can never be executed or that a certain variable might
be used before being defined. In addition it can catch logical errors.
4. INTERPRETERS: Instead of producing a target program as a
translation, an interpreter performs the operation implied by the
source program. For an assignment statement, for example, an
interpreter builds a tree and then carries out the operations at the
nodes.
The first three phases form the analysis portion of a compiler, and the
last three phases form the synthesis portion of a compiler. Two other
activities, symbol-table management and error handling, are shown
interacting with the six phases of lexical analysis, syntax analysis,
semantic analysis, intermediate code generation, code optimization, and
code generation. Informally, we shall also call the symbol-table manager
and the error handler “phases”.
Symbol-Table Management
An essential function of a compiler is to record the identifiers used in
the source program and collect information about various attributes of each
identifier. These attributes may provide information about the storage
allocated for an identifier, its type, its scope (where in the program it is valid),
and in the case of procedure names, such things as the number and types of
its arguments, the method of passing each argument, and the type returned,
if any.
[Figure: Phases of a compiler — Source Program → Lexical Analyzer → Syntax Analyzer → Semantic Analyzer → Intermediate Code Generator → Code Optimizer → Code Generator → Target Program]
ANALYSIS OF THE SOURCE PROGRAM:
Analysis consists of three phases.
1. LINEAR ANALYSIS: The stream of characters making up the source
program is read from left-to-right and grouped into tokens that are
sequence of characters having a collective meaning.
2. HIERARCHICAL ANALYSIS: Characters or tokens are grouped
hierarchically into nested collections with collective meaning.
3. SEMANTIC ANALYSIS: Certain checks are performed to ensure that
the components of the program fit together meaningfully.
LEXICAL ANALYSIS:
In a compiler, Linear analysis is called lexical analysis or scanning.
For example, in lexical analysis the characters in the assignment statement
position := initial + rate * 60
would be grouped into the following tokens:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The plus sign
5. The identifier rate
6. The multiplication sign
7. The number 60
The blanks are usually eliminated during lexical analysis.
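To make the grouping concrete, here is a minimal sketch in C (not the book's code; the token names are illustrative) that scans this statement and prints the seven tokens listed above:

#include <ctype.h>
#include <stdio.h>

int main(void) {
    const char *p = "position := initial + rate * 60";
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }  /* blanks are eliminated */
        if (isalpha((unsigned char)*p)) {                   /* an identifier */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("<id, %.*s>\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {            /* a number */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("<number, %.*s>\n", (int)(p - start), start);
        } else if (p[0] == ':' && p[1] == '=') {            /* the assignment symbol */
            printf("<assign_op>\n");
            p += 2;
        } else {                                            /* + or * */
            printf("<op, %c>\n", *p);
            p++;
        }
    }
    return 0;
}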
SYNTAX ANALYSIS:
Hierarchical analysis is called parsing or syntax analysis. It involves
grouping the tokens of the source program into grammatical phrases that
are used by the compiler to synthesize output.
The hierarchical structure of a program is usually expressed by
recursive rules. The rules are
1. Any identifier is an expression
2. Any number is an expression
3. If expression1 and expression2 are expressions, then so are
expression1 + expression2
expression1 * expression2
(expression1 )
Rules (1) and (2) are basis rules, while (3) defines expressions in terms
of operators applied to other expressions. Thus by rule (1), initial and rate
are expressions. By rule (2), 60 is an expression, while by rule (3), we can
first infer that rate*60 is an expression and finally that initial + rate * 60 is
an expression.
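As an illustration, such recursive rules translate almost directly into recursive functions. Below is a minimal C sketch (an assumption for illustration, not from the text): single letters stand for identifiers, single digits stand for numbers, and rule (3) is recognized with a loop instead of left recursion:

#include <ctype.h>
#include <stdio.h>

static const char *p;            /* cursor into the input string */
static int expression(void);

static int primary(void) {       /* rules (1), (2) and ( expression ) */
    if (isalpha((unsigned char)*p) || isdigit((unsigned char)*p)) { p++; return 1; }
    if (*p == '(') {
        p++;
        if (!expression() || *p != ')') return 0;
        p++;
        return 1;
    }
    return 0;
}

static int expression(void) {    /* rule (3): expression + expression, expression * expression */
    if (!primary()) return 0;
    while (*p == '+' || *p == '*') {
        p++;
        if (!primary()) return 0;
    }
    return 1;
}

int main(void) {
    p = "i+r*6";                 /* stands for initial + rate * 60 */
    printf("%s\n", expression() && *p == '\0' ? "an expression" : "not an expression");
    return 0;
}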
Similarly, many languages define statements recursively by rules
such as:
1. If identifier1 is an identifier and expression1 is an expression, then
identifier1 := expression1 is a statement.
2. If expression1 is an expression and statement2 is a statement,
then
while ( expression1 ) do statement2
if ( expression1 ) then statement2
are statements.
SEMANTIC ANALYSIS:
The semantic analysis checks the source program for semantic errors
and gathers type information for the subsequent code-generation phase. It
uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here
the compiler checks that each operator has operands that are permitted by
the source language specification. For example, many programming
language definitions require a compiler to report an error every time a real
number is used to index an array.
[Figure: Translation of position := initial + rate * 60 through the remaining phases]

Semantic analyzer output — the syntax tree with the conversion inttoreal inserted:
:= ( id1, + ( id2, * ( id3, inttoreal(60) ) ) )

Code optimizer output:
temp1 := id3 * 60.0
id1 := id2 + temp1

Code generator output:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
Code Optimization
The code optimization phase attempts to improve the intermediate
code, so that faster-running machine code results. Some optimizations
are trivial. There is a great variation in the amount of code optimization
different compilers perform. In those that do the most, called ‘optimizing
compilers’, a significant fraction of the time of the compiler is spent on this
phase.
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generation
The final phase of the compiler is the generation of target code,
consisting normally of relocatable machine code or assembly code. Memory
locations are selected for each of the variables used by the program. Then,
intermediate instructions are each translated into a sequence of machine
instructions that perform the same task. A crucial aspect is the assignment
of variables to registers.
For example, using registers 1 and 2, the translation of the code in
code optimizer becomes,
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
The first and second operands of each instruction specify a source
and a destination, respectively. The F in each instruction tells us that
the instruction deals with floating point numbers.
Exercises: trace the following statements through the phases of the compiler:
1. c = a + b * d - 4
2. c = (b + c) * (b + c) * 2
3. b = b^2 - 4ac
4. result = (height * width) + (rate * 2)
LEXICAL ANALYZER
1. NEED AND ROLE OF LEXICAL ANALYZER
◦ First phase of a compiler or Scanner
◦ To identify the tokens we need some method of describing the possible tokens that can
appear in the input stream. For this purpose we introduce regular expression, a notation that
can be used to describe essentially all the tokens of programming language.
◦ Secondly, having decided what the tokens are, we need some mechanism to recognize these
in the input stream. This is done by the token recognizers, which are designed using
transition diagrams and finite automata.
Main Task:
To read input characters and produce output as a sequence of tokens that the parser uses
for syntax analysis (token identification).
Upon receiving a getNextToken command from the parser, the lexical analyzer reads
input characters until it can identify the next token.
Secondary Task:
⚫ It produces a stream of tokens, since all the basic elements in a language must be
turned into tokens
⚫ Stripping out comments and whitespace while creating the tokens
⚫ It generates the symbol table, which stores information about the identifiers and
constants encountered in the input
⚫ It keeps track of line numbers
⚫ If any error is present, the lexical analyzer correlates the error with the source file and
line number, and reports the error encountered while generating the tokens
⚫ If the source language supports a macro preprocessor (e.g. #define pi 3.14), expansion of
macros may be performed by the lexical analyzer
Some lexical analyzers are divided into a cascade of two phases:
1. Scanning – responsible for the simple task of scanning the source program to
recognize the tokens
2. Lexical analysis – responsible for the more complex tasks, performing all the secondary tasks.
Issues in Lexical analysis: (Lexical Analysis vs. Parsing)
Reasons for separating the analysis phase of compiling into lexical analysis and parsing
are as follows:
Simplicity of design
◦ Separating lexical analysis from syntax analysis simplifies at least one of the tasks
◦ e.g. a parser that had to deal with white space directly would be more complex
Improved compiler efficiency
◦ Speedup reading input characters using specialized buffering techniques
Enhanced compiler portability
Input device peculiarities are restricted to the lexical analyzer
Tokens, Patterns, Lexemes:
Token: a sequence of characters having a collective meaning.
Example: keywords, identifiers, operators, special characters, constants, etc.
Pattern: the set of rules by which a set of strings is associated with a single token.
Example:
for a keyword, the pattern is the character sequence forming that keyword;
for identifiers, the pattern is a letter followed by any number of letters
or digits
Lexeme: a sequence of characters in the source program matching a pattern for a token
Examples of Tokens:
Token | Informal Description | Sample Lexemes
if | the characters i, f | if
case | the characters c, a, s, e | case
comparison | < or > or <= or >= or == or != | <=, !=
identifier | letter followed by letters and digits | area, result, m1
Example 2 : E = M * C ** 2
◦ <id, pointer to symbol table entry for E>
◦ <assign_op>
◦ <id, pointer to symbol-table entry for M>
◦ <mult_op>
◦ <id, pointer to symbol-table entry for C>
◦ <exp_op>
◦ <number, integer value 2>
Symbol Table (The Data will be stored in symbol table as follows)
Symbol Token Data type Initialized
E id1 int Yes
M id2 int Yes
C id3 int Yes
Lexical Errors
Few errors are visible at the lexical level alone, because a lexical analyzer has a localized
view of the source program.
For instance, suppose the string whlle occurs in a source program.
Example: whlle ( a <= 7 ). A lexical analyzer cannot tell whether whlle is a misspelling
of the keyword while or an undeclared function identifier.
Since whlle is a valid lexeme for the token id, the lexical analyzer must return the token
to the parser and let some other phase of the compiler handle the error due to the misspelled
letters. However, suppose a circumstance arises in which the lexical analyzer is
unable to proceed because none of the patterns for tokens matches any prefix of the
remaining input. The simplest recovery strategy is "panic mode" recovery.
Cases the lexical analyzer cannot resolve by itself:
o misspelling of the keyword while
o an undeclared function identifier
Error-recovery actions
1. Delete successive characters from the remaining input, until the lexical analyzer can find a
well-formed token at the beginning of what input is left.
2. Delete one character from the remaining input.
3. Insert a missing character into the remaining input.
4. Replace a character by another character.
5. Transpose two adjacent characters.
INPUT BUFFERING
We often have to look one or more characters beyond the next lexeme before we can be sure we
have the right lexeme. As characters are read from left to right, each character is stored in the
buffer to form a meaningful token.
We introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
BUFFER PAIRS
A buffer is divided into two N-character halves, as shown below
Each buffer is of the same size N, and N is usually the number of characters on one disk block.
E.g., 1024 or 4096 bytes.
Using one system read command we can read N characters into a buffer.
If fewer than N characters remain in the input file, then a special character, represented
by eof, marks the end of the source file.
For each character read, we make two tests: one for the end of the buffer, and one to determine
what character is read. We can combine the buffer-end test with the test for the current character
if we extend each buffer to hold a sentinel character at the end.
The sentinel is a special character that cannot be part of the source program, and a natural choice
is the character eof.
The sentinel arrangement is as shown below:
Note that eof retains its use as a marker for the end of the entire input. Any eof that appears other than
at the end of a buffer means that the input is at an end.
Code to advance the forward pointer:

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
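A C rendering of the same scheme may make the sentinel trick clearer. This is a sketch under stated assumptions ('\0' standing in for the eof character, a fixed half-buffer size), not the book's code:

#include <stdio.h>

#define N 4096            /* size of one half, e.g. one disk block */
#define SENTINEL '\0'     /* stands in for the special eof character */

static char buf[2 * N + 2];   /* two N-character halves, a sentinel after each */
static char *forward;
static FILE *src;

static void load(char *half) {        /* read up to N characters into one half */
    size_t n = fread(half, 1, N, src);
    half[n] = SENTINEL;               /* an early sentinel marks real end of input */
}

static void init(FILE *f) {
    src = f;
    load(buf);
    forward = buf;
}

/* Advance forward by one character; the sentinel test replaces a separate
   end-of-buffer test for every character read. */
static int advance(void) {
    int c = (unsigned char)*forward++;
    if (c == SENTINEL) {
        if (forward == buf + N + 1) {             /* at end of first half */
            load(buf + N + 1);                    /* reload second half */
            c = (unsigned char)*forward++;
        } else if (forward == buf + 2 * N + 2) {  /* at end of second half */
            load(buf);                            /* reload first half */
            forward = buf;
            c = (unsigned char)*forward++;
        } else {
            return EOF;   /* sentinel within a buffer: end of the entire input */
        }
        if (c == SENTINEL) return EOF;            /* input ended exactly at a reload */
    }
    return c;
}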
TERM | DEFINITION
Prefix of s | A string obtained by removing zero or more trailing symbols of string s; e.g., cs is a prefix of cs6660.
Suffix of s | A string formed by deleting zero or more of the leading symbols of s; e.g., 660 is a suffix of cs6660.
Proper prefix, suffix, or substring of s | Any nonempty string x that is a prefix, suffix or substring of s such that s ≠ x.
Subsequence of s | Any string formed by deleting zero or more not necessarily contiguous symbols from s; e.g., c60 is a subsequence of cs6660.
Operations on Languages :
There are several operations that can be applied to languages:
Definitions of operations on languages L and M:
OPERATION | DEFINITION
Union of L and M, written L ∪ M | L ∪ M = { s | s is in L or s is in M }
Concatenation of L and M, written LM | LM = { st | s is in L and t is in M }
Kleene closure of L, written L* | L* = ∪ (i = 0 to ∞) L^i, i.e., zero or more concatenations of L
Positive closure of L, written L+ | L+ = ∪ (i = 1 to ∞) L^i, i.e., one or more concatenations of L
Example 1: Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}.
1. L ∪ D is the set of letters and digits — strictly speaking the language with 62 strings of length
one, each of which strings is either one letter or one digit.
2. LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
3. L^5 is the set of all 5-letter strings.
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.
Example 2
Let W be the set of characters {c,o,m,p,i,l,e,r} and let N be the set of digits {1,2,3}
Operations:
1. W ∪ N is the set of characters and digits — the language with 11 strings of length one, each of
which strings is either one character or one digit.
2. WN is the set of 24 strings of length two, each consisting of one character followed by one
digit.
3. W^3 is the set of all 3-character strings.
4. W* is the set of all strings of characters, including ε, the empty string.
5. W(W ∪ N)* is the set of all strings of characters and digits beginning with a character.
6. N+ is the set of all strings of one or more digits.
6. N+ is the set of all strings of one or more digits.
Regular Expressions
Regular expressions allow us to define the sets of strings that form tokens.
For example, a Pascal identifier is formed by a letter followed by zero or more
letters or digits: letter ( letter | digit )*
A regular expression is formed using a set of defining rules.
Each regular expression r denotes a language L(r).
The rules that define the regular expressions over alphabet ∑. Associated with each rule
is a specification of the language denoted by the regular expression being defined
BASIS
Rule 1 ε is a regular expression that denotes {ε}, i.e. the set containing the
empty string.
Rule 2 If a is a symbol in ∑, then a is a regular expression that denotes {a},
i.e. the set containing the string a.
INDUCTION (Rule 3). Suppose r and s are regular expressions denoting the languages
L(r) and L(s). Then
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s).
(r)(s) is a regular expression denoting the language L(r)L(s).
(r)* is a regular expression denoting the language (L(r))*.
(r) is a regular expression denoting the language L(r).
• A language denoted by a regular expression is said to be a regular set.
• The specification of a regular expression is an example of a recursive definition.
Regular Definition
For notational convenience, we need to give names for regular expressions and to define
regular expressions using these names as if they were symbols.
Identifiers are the string of letters and digits beginning with a letter. The following regular
definition provides clear specification for the string.
If ∑ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the
form
d1 → r1
d2 → r2
…
dn → rn
Where each di is a distinct name, and each ri is a regular expression over the symbols in ∑ U {d1,
d2, … , di-1}, i.e., the basic symbols and the previously defined names.
Example 14 C identifiers are strings of letters, digits, and underscores. Here is a regular
definition for the language of C identifiers.
letter_ → A | B | ... | Z | a | b | ... | z | _
digit → 0 | 1 | ... | 9
id → letter_ ( letter_ | digit )*
Example :Unsigned numbers (integer or floating point) are strings such as 5280, 0.01234,
6.336E4, or 1.89E-4. The regular definition
digit → 0 | 1 | ... | 9
digits → digit digit*
optionalFraction → . digits | ε
optionalExponent → ( E ( + | - | ε ) digits ) | ε
number → digits optionalFraction optionalExponent
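A hand-coded recognizer for this definition can be written as one small C function per nonterminal. The following is a sketch under the definition above (the function names are illustrative, not from the text):

#include <ctype.h>
#include <stdio.h>

static int digits(const char **p) {          /* digits → digit digit* */
    if (!isdigit((unsigned char)**p)) return 0;
    while (isdigit((unsigned char)**p)) (*p)++;
    return 1;
}

int is_number(const char *s) {               /* number → digits optionalFraction optionalExponent */
    const char *p = s;
    if (!digits(&p)) return 0;
    if (*p == '.') {                         /* optionalFraction → . digits | ε */
        p++;
        if (!digits(&p)) return 0;
    }
    if (*p == 'E') {                         /* optionalExponent → ( E ( + | - | ε ) digits ) | ε */
        p++;
        if (*p == '+' || *p == '-') p++;
        if (!digits(&p)) return 0;
    }
    return *p == '\0';                       /* the whole string must be consumed */
}

int main(void) {
    const char *tests[] = { "5280", "0.01234", "6.336E4", "1.89E-4", "1.2E" };
    for (int i = 0; i < 5; i++)
        printf("%-8s %s\n", tests[i], is_number(tests[i]) ? "number" : "not a number");
    return 0;
}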
Example :
Write a regular definition to represent a date in the following format: JAN 5th 2016
Date format → Month Date Year
Month → JAN | FEB | ... | DEC
Date → [0-3][0-9] ( st | nd | rd | th )
Year → [12] [0-9] [0-9] [0-9]
Extensions of Regular Expressions
• Notational Shorthand:
– This shorthand is used in certain constructs that occur frequently in regular expressions.
1. One or more instances. The unary, postfix operator + represents the positive closure of a
regular expression and its language. That is, if r is a regular expression, then (r) + denotes
the language (L(r)) + . The operator + has the same precedence and associativity as the
operator *. Two useful algebraic laws, r* = r+ | ε and r+ = rr* = r*r, relate the Kleene
closure and positive closure.
2. Zero or one instance. The unary postfix operator ? means "zero or one occurrence." That
is, r? is equivalent to r|ε, or put another way, L(r?) = L(r) U { ε }. The ? operator has the
same precedence and associativity as * and +.
3. Character classes. A regular expression a1 | a2 | ... | an, where the ai's are each symbols of
the alphabet, can be replaced by the shorthand [a1 a2 ... an]. For consecutive uppercase
letters, lowercase letters, or digits, we can write a1-an, that is, just the first and
last separated by a hyphen. Thus, [abc] is shorthand for a|b|c, and [a-z] is shorthand for
a|b|...|z.
Example Using these shorthands, we can rewrite the regular definition of Example 3.4.1 as:
letter_ → [A-Za-z_]
digit → [0-9]
id → letter_ ( letter_ | digit )*
the regular definition of Example 3.4.2 as:
digit → [0-9]
digits → digit+
optionalFraction → ( . digits )?
optionalExponent → ( E [+-]? digits )?
number → digits ( . digits )? ( E [+-]? digits )?
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now we must study how to take the
patterns for all the needed tokens and build a piece of code that examines the input
string and finds a prefix that is a lexeme matching one of the patterns.
The following grammar fragment describes simple branching statements and conditional expressions:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is "equals"
and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the names of tokens
as far as the lexical analyzer is concerned; the patterns for these tokens are described using regular
definitions.
Patterns for tokens of above grammar
digit → [0-9]
digits → digit+
number → digits ( . digits )? ( E [+-]? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | <= | = | <> | > | >=
In addition, we assign the lexical analyzer the job of stripping out white space, by recognizing the
“token” ws defined by:
ws → ( blank | tab | newline )+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII characters of
the same names. Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character that
follows the white space. It is the following token that gets returned to the parser.
Transition Diagram
A transition diagram consists of a collection of nodes or circles, called states. Each state
represents a condition that may occur during the process of reading the input looking for a
lexeme that matches one of many patterns.
Edges are directed from one state to another. Each edge is labeled by a symbol or set of symbols.
If we are in state s and the next input symbol is a, we look for an edge out of state s labeled
by a. If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.
One state is designated the start state, or initial state. It is indicated by an edge labeled “start”
entering from nowhere. The transition diagram always begins in the start state, before any input
symbols have been read.
Transition Diagram for relop
Relational operators: < | > | <= | >= | = | <>
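A transition diagram maps naturally onto code: each state becomes a position in a function, and each labeled edge becomes a test on the next character. Here is a C sketch for relop (the token names are chosen for illustration, not from the text):

#include <stdio.h>

typedef enum { LT, LE, NE, EQ, GT, GE, NONE } Relop;

/* Recognize a relational operator at *pp; on success, advance *pp past it. */
Relop relop(const char **pp) {
    const char *p = *pp;
    Relop t = NONE;
    if (*p == '<') {                      /* states reached from start on '<' */
        p++;
        if (*p == '=')      { p++; t = LE; }
        else if (*p == '>') { p++; t = NE; }
        else                t = LT;      /* "other" edge: retract, return LT */
    } else if (*p == '=') {
        p++; t = EQ;
    } else if (*p == '>') {               /* states reached from start on '>' */
        p++;
        if (*p == '=') { p++; t = GE; }
        else           t = GT;            /* "other" edge: retract, return GT */
    }
    if (t != NONE) *pp = p;
    return t;
}

int main(void) {
    const char *s = "<=";
    printf("%d\n", relop(&s));            /* prints 1, i.e. LE */
    return 0;
}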
We call the recognizer of the tokens a finite automaton. A finite automaton can be
deterministic (DFA) or non-deterministic (NFA). Both deterministic and non-deterministic
finite automata recognize regular sets.
4.1 Model of NFA:
• An NFA (Non-deterministic Finite Automaton) is a 5-tuple (S, Σ, δ, s0, F):
S: a set of states;
Σ: the symbols of the input alphabet;
δ: a set of transition functions;
move(state, symbol) → a set of states;
s0 ∈ S, the start state;
F ⊆ S, a set of final or accepting states.
• Finite automata can be represented using transition diagram.
• Corresponding to FA definition, a transition diagram has:
– States represented by circles;
– An Alphabet (Σ) represented by labels on edges;
– Transitions represented by labeled directed edges between states. The label is the
input symbol;
– One Start State shown as having an arrow head;
– One or more Final State(s) represented by double circles.
– Example transition diagram to recognize (a|b)*abb
o S - a set of states
o Σ - a set of input symbols (alphabet)
o move - a transition function move to map state-symbol pairs to sets of states.
o s0 - a start (initial) state
o F- a set of accepting states (final states)
• ε-transitions are allowed in NFAs. In other words, we can move from one state to
another without consuming any symbol.
• An NFA accepts a string x if and only if there is a path from the start state to one of the
accepting states such that the edge labels along this path spell out x.
Transition table
• A transition table is a good way to implement an FSA
– One row for each state, S
– One column for each symbol, A
– Entry in cell (S,A) gives the state or set of states that can be reached from state S on
input A.
Example:
• In a DFA, for each symbol a and state s, there is at most one edge labeled a leaving s; i.e.,
the transition out of each state is unique for each input symbol.
Example 4.2
4.2 From Regular Expressions to Automata
Construction of an NFA from a Regular Expression
Simulation of an NFA
Conversion of an NFA to a DFA
4.3. From Regular Expressions to Automata
A regular expression describes the tokens recognized by lexical analyzers and other
pattern-processing software; using it implies simulation of a DFA or an NFA. Since NFA
simulation is less straightforward, there are techniques:
◦ to convert an NFA to a DFA (the subset construction technique);
◦ to simulate the NFA directly, when NFA-to-DFA conversion is too time consuming;
◦ to convert a regular expression to an NFA and then to a DFA.
Algorithm 4.1
Input
◦ a regular expression r over an alphabet Σ
Output
◦ an NFA accepting L(r)
Method
◦ parse r into its constituent sub expressions;
◦ apply the basis rules for handling sub expressions with no operators;
◦ apply the inductive rules for creating larger NFAs from the NFAs of sub expressions,
using union, concatenation and closure (see the C sketch below).
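Here is a compact C sketch of these rules (a simplified fragment representation chosen for illustration; malloc results are left unchecked to keep the sketch short). Each fragment has one start state and one accepting state, and ε-edges glue fragments together:

#include <stdlib.h>

#define EPS 0                       /* label 0 stands for an ε-transition */

typedef struct State {
    int label;                      /* input symbol, or EPS */
    struct State *out1, *out2;      /* at most two outgoing transitions */
} State;

typedef struct { State *start, *accept; } Frag;

static State *state(int label, State *o1, State *o2) {
    State *s = malloc(sizeof *s);
    s->label = label; s->out1 = o1; s->out2 = o2;
    return s;
}

/* Basis rule: NFA for a single symbol a */
Frag symbol(int a) {
    State *acc = state(EPS, NULL, NULL);
    return (Frag){ state(a, acc, NULL), acc };
}

/* Induction: concatenation rs — r's accepting state feeds s's start */
Frag concat(Frag r, Frag s) {
    r.accept->out1 = s.start;
    return (Frag){ r.start, s.accept };
}

/* Induction: union r|s — a new start state ε-branches to both fragments */
Frag alternate(Frag r, Frag s) {
    State *acc = state(EPS, NULL, NULL);
    r.accept->out1 = acc; s.accept->out1 = acc;
    return (Frag){ state(EPS, r.start, s.start), acc };
}

/* Induction: closure r* — ε-loop back around r, plus an ε-bypass */
Frag star(Frag r) {
    State *acc = state(EPS, NULL, NULL);
    State *start = state(EPS, r.start, acc);
    r.accept->out1 = r.start; r.accept->out2 = acc;
    return (Frag){ start, acc };
}

int main(void) {                    /* e.g. build the NFA for (a|b)*abb */
    Frag f = concat(concat(concat(star(alternate(symbol('a'), symbol('b'))),
                    symbol('a')), symbol('b')), symbol('b'));
    return f.start->label == EPS ? 0 : 1;
}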
• For regular expression r1 | r2:
Example 4.3
1.
2. a|b
3. (a|b)*
4. abb
5. (a|b)+bcd
Example 4.4
Convert the R.E (0 + 1)* 1(0 + 1) to NFA
Step 1 : (0 + 1)
Step 2: (0 + 1)*
Step 3: (0 + 1)*1
Step 4: (0 + 1)*1(0 + 1)
Example 4.5
Convert the R.E 01* to NFA
Step 1 : 1*
Step 2 : 01*
Example 4.6
Convert the R.E (0 + 1) 01 to NFA
Step 1 : (0 + 1)
Step 2 : 01
Step 3 : (0 + 1)01
Example 4.7
Convert the R.E aa (a | b)* to NFA
Step 1 : (a | b)*
Step 2 : aa (a | b)*
Example 4.8
Convert the R.E (a|b)* (aa | bb) to NFA
Step 1 : (a | b)*
Step 3 : (a|b)* (aa | bb)
Every DFA defines a unique language, but in general there may be many DFAs for a given
language; these DFAs accept the same language. Minimization of a DFA obtains the DFA
with the minimal number of states. The added advantages of minimization of a DFA are as follows:
Uses less memory
Uses less hardware (flip-flops)
• From the point of view of the input, any two states that are connected by an
ε-transition may as well be the same, since we can move from one to the other without
consuming any character. Thus states which are connected by an ε-transition will be
represented by the same state in the DFA.
• If it is possible to have multiple transitions based on the same symbol, then we can regard a
transition on a symbol as moving from a state to a set of states (i.e. the union of all those states
reachable by a transition on the current symbol). Thus these states will be combined into a single
DFA state.
• The ε-closure function takes a state and returns the set of states reachable from it based on (one
or more) ε-transitions. Note that this will always include the state itself. We should be able to get
from a state to any state in its ε-closure without consuming any input.
• The function move takes a state and a character, and returns the set of states reachable by one
transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of the
application to the individual states.
ε-closure(T): the set of NFA states reachable from some NFA state s in set
T on ε-transitions alone.
move(T,a): the set of NFA states to which there is a transition on input symbol a from some
state s in T.
The Subset Construction
Input
◦ an NFA N
Output
◦ a DFA D accepting the same language as N
Transitions
◦ The start state of D is ε-closure(s0): before reading any input, N can be in any state of
ε-closure(s0).
◦ If N can be in the set of states T after reading input string x, then after reading a further
input symbol a, N can be in any state of ε-closure(move(T, a)).
◦ The accepting states of D are all sets of N's states that include at least one accepting
state of N.
Computing ε-closure(T)

push all states of T onto stack;
initialize ε-closure(T) to T;
while (stack is not empty)
{
    pop t, the top element, off stack;
    for (each state u with an edge from t to u labeled ε)
        if (u is not in ε-closure(T))
        {
            add u to ε-closure(T);
            push u onto stack;
        }
}
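The pseudocode above can be transcribed into C almost line for line. This sketch assumes a small NFA stored as a boolean ε-edge matrix (a representation chosen here for brevity, not from the text):

#include <stdio.h>

#define MAXS 32

int eps[MAXS][MAXS];    /* eps[t][u] = 1 if there is an edge t --ε--> u */
int nstates;

/* closure[] is set to the ε-closure of the state set in_T[] */
void eps_closure(const int in_T[MAXS], int closure[MAXS]) {
    int stack[MAXS], top = 0;
    for (int s = 0; s < nstates; s++) {
        closure[s] = in_T[s];               /* initialize ε-closure(T) to T */
        if (in_T[s]) stack[top++] = s;      /* push all states of T onto stack */
    }
    while (top > 0) {                       /* while stack is not empty */
        int t = stack[--top];               /* pop t, the top element */
        for (int u = 0; u < nstates; u++)
            if (eps[t][u] && !closure[u]) { /* edge t --ε--> u, u not yet in closure */
                closure[u] = 1;             /* add u to ε-closure(T) */
                stack[top++] = u;           /* push u onto the stack */
            }
    }
}

int main(void) {
    nstates = 3;
    eps[0][1] = eps[1][2] = 1;              /* 0 --ε--> 1 --ε--> 2 */
    int T[MAXS] = { [0] = 1 };              /* T = {0} */
    int C[MAXS];
    eps_closure(T, C);
    for (int s = 0; s < nstates; s++)
        if (C[s]) printf("%d ", s);         /* prints: 0 1 2 */
    printf("\n");
    return 0;
}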
Example 4.9
By using subset construction algorithm convert the following NFA (a|b)*abb to DFA
We need to remove:
1. ε-transitions
Example 4.10
First step: construct ε – closure (s)
State ε – closure (s)
{0}= {0,1,2,4,7}
{1}= {1,2,4}
{2}= {2}
{3}= {1,2,3,4,6,7}
{4}= {4}
{5}= {1,2,4,5,6,7}
{6}= {1,2,4,6,7}
{7}= {7}
{8}= {8}
{9}= {9}
{10}= {10}
Second step: Looking for the start state for the DFA
• First determine the input alphabet; here the input alphabet = {a, b}
• Second compute:
1.Dtran[A,a] = ε -closure(move(A,a))
2.Dtran[A,b] = ε -closure(move(A,b))
1. Dtran[A,a] = ε -closure(move(A,a))
So we conclude:
2. Dtran[A,b] = ε -closure(move(A,b))
• Third compute:
3.Dtran[B,a] = ε -closure(move(B,a))
Among the states in B, only 2 and 7 have a transition on a, and they go to 3 and 8 respectively
Thus:
4.Dtran[B,b] = ε -closure(move(B,b))
Among the states in B, only 4 and 8 have a transition on b, and they go to 5 and 9 respectively
Thus:
•Fourth compute:
5.Dtran[C,a] = ε -closure(move(C,a))
Among the states in C, only 2 and 7 have a transition on a, and they go to 3 and 8 respectively
Thus:
6. Dtran[C,b] = ε -closure(move(C,b))
Among the states in C, only 4 has a transition on b, and it goes to 5
Thus:
7. Dtran[D,a] = ε -closure(move(D,a))
Among the states in D, only 2 and 7 have a transition on a, and they go to 3 and 8 respectively
Thus:
8. Dtran[D,b] = ε -closure(move(D,b))
Among the states in D, only 4 and 9 have a transition on b, and they go to 5 and 10 respectively
Thus:
9. Dtran[E,a] = ε -closure(move(E,a))
Among the states in E, only 2 and 7 have a transition on a, and they go to 3 and 8 respectively
Thus:
Among the states in E, only 4 has a transition on b, and it goes to 5
Thus:
Dtran[E,b] = ε-closure(move(E,b)) = ε-closure({5}) = {1,2,4,5,6,7} = C
Minimizing the Number of States of a DFA
Distinguishable States
A string x distinguishes state s from state t if exactly one of the states reached from s and t
by following the path labeled x is an accepting state. State s is distinguishable from state t if
there exists some string that distinguishes them. The empty string distinguishes any accepting
state from any non-accepting state.
Algorithm 4.3
Input
◦ DFA D with set of states S, input alphabet Σ, start state s0, accepting states F
Output
◦ DFA D’ accepting the same language as D and having as few states as possible
Method:
1 Start with an initial partition Π with two groups F and S-F
2 Apply the procedure
for(each group G of Π)
{
partition G into subgroups such that states s and t are in the same subgroup if and only if,
for all input symbols a, states s and t have transitions on a to states in the same group of Π
}
3 If Πnew = Π, let Πfinal = Π and continue with step 4; otherwise repeat step 2 with Πnew
instead of Π
4 choose one state in each group of Πfinal as the representative
State | a | b
A | B | A
B | B | D
D | B | E
E | B | A

Minimum State DFA Construction
◦ the start state of D’ is the representative of the group containing the start state of D
◦ the accepting states of D’ are the representatives of those groups that contain an
accepting state of D
◦ if s is the representative of a group G of Πfinal, and there exists a transition from s on
input a to t in group H, and r is the representative of H, then in D’ there is a transition
from s to r on input a
◦ on input b: A,C → {A,C}; B → {D}; D → {E}; E → {A,C}
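Once minimized, the DFA is just a small table, and running it takes a few lines of C. The sketch below hard-codes the minimized table above for (a|b)*abb (E is the accepting state; inputs are assumed to contain only a and b):

#include <stdio.h>

enum { A, B, D, E };                       /* states of the minimized DFA */

/* rows = states; column 0 = input 'a', column 1 = input 'b' */
static const int delta[4][2] = {
    /* A */ { B, A },
    /* B */ { B, D },
    /* D */ { B, E },
    /* E */ { B, A },
};

int matches(const char *s) {
    int state = A;
    for (; *s; s++)
        state = delta[state][*s == 'b'];   /* 'a' selects column 0, 'b' column 1 */
    return state == E;                     /* accept iff we end in state E */
}

int main(void) {
    printf("%d %d\n", matches("aabb"), matches("abab"));  /* prints: 1 0 */
    return 0;
}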
Example 4.11 Construct the minimized DFA for the regular expression (0+1)*(0+1) 10
Step 1: By using Thompson construction algorithm construct NFA from regular expression
(0+1)*(0+1)10
Move (A,0) = ε – closure(δ(A,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9,10,11,13,14} B
Move (A,1) = ε – closure(δ(A,1))
= ε – closure(5,12)
= {1,2,4,5,6,7,8,9,11,12,13,14} C
Move (B,0) = ε – closure(δ(B,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9, 10,11,13,14} B
Move (B,1) = ε – closure(δ(B,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (C,0) = ε – closure(δ(C,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9,10,11,13,14} B
Move (C,1) = ε – closure(δ(C,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (D,0) = ε – closure(δ(D,0))
= ε – closure(3,10,16)
= {1,2,3,4,6,7,8,9,10,11,13,14,16} E
Move (D,1) = ε – closure(δ(D,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Move (E,0) = ε – closure(δ(E,0))
= ε – closure(3,10)
= {1,2,3,4,6,7,8,9, 10,11,13,14} B
Move (E,1) = ε – closure(δ(E,1))
= ε – closure(5,12,15)
= {1,2,4,5,6,7,8,9,11,12,13,14,15} D
Since there are no new states, we stop here.
Step 5: Transition Diagram for DFA
Checking pairs of states for equivalence:
1. (A,B):
δ(A,0) = B δ(B,0) = B
δ(A,1) = C δ(B,1) = D
(A,B) are equivalent for a single input.
2. (A,C):
δ(A,0) = B δ(C,0) = B
δ(A,1) = C δ(C,1) = D
(A,C) are equivalent for a single input.
3. (A,D):
δ(A,0) = B δ(D,0) = E
δ(A,1) = C δ(D,1) = D
(A,D) are not equivalent, since E is accepting.
4. (B,C):
δ(B,0) = B δ(C,0) = B
δ(B,1) = D δ(C,1) = D
(B,C) are equivalent.
5. (B,D):
δ(B,0) = B δ(D,0) = E
δ(B,1) = D δ(D,1) = D
(B,D) are not equivalent, since E is accepting.
6. (C,D):
δ(C,0) = B δ(D,0) = E
δ(C,1) = D δ(D,1) = D
(C,D) are not equivalent, since E is accepting.
Partition refinement:
{E} {A B C D}
{E} {A} {B C} {D}

Minimized DFA transition table:
State | 0 | 1
A | B,C | B,C
B,C | B,C | D
D | E | D
E | B,C | D
This method produces a more compact representation of transition tables than the standard
two-dimensional ones. In the syntax tree of the augmented regular expression:
o All leaves are alphabet symbols (plus # and the empty string)
o All inner nodes are operators.
The functions nullable, firstpos, lastpos and followpos are defined on the nodes of the syntax tree:
nullable(n)
◦ true for a syntax tree node n if the sub expression represented by n
has ε in its language, i.e. it can be made the empty string even if it can also
represent other strings
firstpos(n)
◦ set of positions in the n rooted subtree that correspond to the first symbol of at
least one string in the language of the subexpression rooted at n
lastpos(n)
◦ set of positions in the n rooted subtree that correspond to the last symbol of at
least one string in the language of the subexpression rooted at n
followpos(p)
◦ for a position p, followpos(p) is the set of positions q such that there is some string
x = a1a2…an in L((r)#) and some i such that there is a way to explain the membership
of x in L((r)#) by matching ai to position p of the syntax tree and ai+1 to position q
Example 4.12 (a|b)*abb# augmented regular expression
Computing Followpos
A position of a regular expression can follow another position in two ways:
◦ if n is a cat-node c1c2 (rule 1): for every position i in lastpos(c1), all positions in
firstpos(c2) are in followpos(i)
◦ if n is a star-node (rule 2): if i is a position in lastpos(n), then all positions in
firstpos(n) are in followpos(i)
◦ Applying rule 1
◦ followpos(1) incl. {3}
◦ followpos(2) incl. {3}
◦ followpos(3) incl. {4}
◦ followpos(4) incl. {5}
◦ followpos(5) incl. {6}
◦ Applying rule 2
◦ followpos(1) incl. {1,2}
◦ followpos(2) incl. {1,2}
Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {4}
4 {5}
5 {6}
6 {φ}
Find Position for a & b
a position = 1,3
b position = 2,4,5
firstpos(n0)={1,2,3} =…..A
Dtran[A,a]= followpos(1) U followpos(3)= {1,2,3,4}=…..B
Dtran[A,b]= followpos(2)={1,2,3}=…..A
Dtran[B,a]= followpos(1) U followpos(3)=……B
Dtran[B,b]= followpos(2) U followpos(4)={1,2,3,5}=…..C
Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {4}
4 {φ}
After we calculate the follow positions and find Dtran, we are ready to create the DFA
for the regular expression.
States/Input a b
A B A
B B A
Example 4.14 (a*|b)*#
Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {φ}
Step 2:- Compute Firstpos and Lastpos
Compute Followpos
Position(Node) Followpos
1 {1,2,3}
2 {1,2,3}
3 {φ}
States/Input a b
A A A
Step 4:- Optimized DFA Transition Diagram
Example 4.15 abb(a|b)*#
Step 1:- Syntax tree
Compute Followpos
Position(Node) Followpos
1 {2}
2 {3}
3 {3,4,5,6}
4 {4,5,6}
5 {4,5,6}
6 {φ}
Find Positions for a & b
a positions = 1,4
b positions = 2,3,5
Step 3:- Find Dtran
Firstpos(n0)={1} =….. A
Dtran[A,a]= followpos(1) ={2}=…..B
Dtran[B,b]= followpos(2) ={3}=…..C
Dtran[C,b]= followpos(3)={3,4,5,6} =…..D
Dtran[D,a]= followpos(4)={4,5,6}=…..E
Dtran[D,b]= followpos(3) U followpos(5) ={3,4,5,6} =…..D
Dtran[E,a]= followpos(4)={4,5,6}=…..E
Dtran[E,b]= followpos(5)={4,5,6}=…..E
Step 4:- Optimized DFA Transition table
States/Input a b
A B φ
B φ C
C φ D
D E D
E E E
5. LANGUAGE FOR SPECIFYING LEXICAL ANALYZER –LEX
Lex is a lexical analyzer generator developed by Lesk and Schmidt of AT&T Bell Lab,
written in C, running under UNIX.
Lex produces an entire scanner module that can be compiled and linked with other
compiler modules.
Lex associates regular expressions with arbitrary code fragments. When an expression is
matched, the code segment is executed.
A typical lex program contains three sections separated by %% delimiters.
Role of LEX
regular definitions
%%
translation rules
%%
auxiliary procedures
1. Declarations section
It includes declarations of variables, manifest constants (A manifest constant is an
identifier that is declared to represent a constant e.g. # define PIE 3.14), the files to be included
and definitions of regular expressions.
BEGIN {return 1}
END {return 2}
IF {return 3}
THEN {return 4}
ELSE {return 5}
letter(letter|digit)* {LEX VAL:= INSTALL( ); return 6}
digit+ {LEX VAL:= INSTALL( ); return 7}
< {LEX VAL := 1; return 8}
<= {LEX VAL := 2; return 8}
= {LEX VAL := 3; return 8}
<> {LEX VAL := 4; return 8}
> {LEX VAL := 5; return 8}
>= {LEX VAL := 6; return 8}
How does this Lexical analyzer work?
The lexical analyzer created by Lex behaves in concert with a parser in the following
manner. When activated by the parser, the lexical analyzer begins reading its remaining
input, one character at a time, until it has found the longest prefix of the input that is
matched by one of the regular expressions p.
Then it executes the corresponding action. Typically the action will return control to the
parser. However, if it does not, then the lexical analyzer proceeds to find more lexemes,
until an action causes control to return to the parser. The repeated search for lexemes
until an explicit return allows the lexical analyzer to process white space and comments
conveniently.
The lexical analyzer returns a single quantity, the token, to the parser. To pass an attribute
value with information about the lexeme, we can set the global variable yylval.
•e.g. Suppose the lexical analyzer returns a single token for all the relational operators, in which
case the parser won’t be able to distinguish between ”<=”,”>=”,”<”,”>”,”==” etc. We can set
yylval appropriately to specify the nature of the operator.
LEX Actions
1. yytext is a variable that is a pointer to the first character of the lexeme.
2. yywrap() is called when the lexical analyzer reaches the end of file. If yywrap
returns 0, the lexical analyzer continues scanning; when yywrap returns 1, it means
the end of file has been encountered.
3. yyleng is an integer telling how long the lexeme is.
4. yyin is the file from which the source program is read.
Sample LEX Program
1. Program to Find the Capital letters in some string using LEX
%{
%}                                  Declarations section
%%
[A-Z] {printf("%s", yytext);}
. ;                                 Translation rules section
%%
main()
{
    printf("Enter Some string");
    yylex();
}                                   Auxiliary procedures
int yywrap()
{
    return 1;
}
Input
Enter Some string
Panimalar Engineering College
Output
PEC
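To try this with the widely available flex implementation of Lex (the file name capital.l is chosen here for illustration), save the program as capital.l, generate the scanner with flex capital.l, compile with cc lex.yy.c -lfl, and run ./a.out. With the classic AT&T lex, the corresponding commands are lex capital.l and cc lex.yy.c -ll.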
An NFA for Lex program
• Create an NFA for each regular expression
• Combine all the NFAs into one
• Introduce a new start state
• Connect it with ε- transitions to the start states of the NFAs
Pattern Matching with DFA
1. Convert the NFA for all the patterns into an equivalent DFA. For each DFA state with more
than one accepting NFA state, choose the pattern that is defined earliest as the output of the
DFA state.
2. Simulate the DFA until there is no next state.
3. Trace back to the nearest accepting DFA state, and perform the associated action
53