
Hawassa University

Department of Computer Science
Compiler Design

Course code: CoSc4072


By: Mekonen M.
Unit Two
Lexical Analysis
Topics to be covered
• Interaction of scanner & parser
• Token, Pattern & Lexemes
• Input buffering
• Specification of tokens
• Regular expression & Regular definition
• Transition diagram
• Finite automata
• Regular expression to NFA using Thompson's rule
• Conversion from NFA to DFA using subset construction method
• DFA optimization
Interaction of scanner & parser
[Figure: the source program is read by the lexical analyzer; the parser sends "Get next token" to the lexical analyzer, which returns a token; the symbol table is shared.]

• Upon receiving a “Get next token” command from the parser, the lexical analyzer reads input characters until it can identify the next token.
• The lexical analyzer also strips out comments and white space (blanks, tabs, and newline characters) from the source program.

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 5


Why separate lexical analysis & parsing?
1. It simplifies the design.
2. It improves compiler efficiency.
3. It enhances compiler portability.



Lexical Analyzer
• Lexical analysis is the process of converting a sequence of characters
into a sequence of tokens.
• A program or function which performs lexical analysis is called a
lexical analyzer, lexer, or scanner.
• A lexer often exists as a single function which is called by a parser
(syntax Analyzer) or another function.
• The tasks done by the scanner are:
  • Grouping input characters into tokens
  • Stripping out comments, white space and other separators
  • Correlating error messages with the source program
  • Macro expansion
  • Passing tokens to the parser



Token, Pattern & Lexemes
Token
• A sequence of characters having a collective meaning is known as a token.
• Categories of tokens:
  1. Identifier
  2. Keyword
  3. Operator
  4. Special symbol
  5. Constant
Pattern
• The set of rules associated with a token is called its pattern.
• Example: “non-empty sequence of digits”, “letter followed by letters and digits”
Lexemes
• The sequence of characters in a source program that matches the pattern for a token is called a lexeme.
• Example: position, =, initial, rate, *, …
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
Lexeme    Token
total     Identifier1
=         Operator1
sum       Identifier2
+         Operator2
45        Constant1
Lexemes:
• Lexemes of identifier: total, sum
• Lexemes of operator: =, +
• Lexemes of constant: 45
Token Attributes
• More than one lexeme can match a pattern, so the lexical analyzer must provide information about the particular lexeme that matched.
• For example, the pattern for token number matches 0, 1, 934, …
• But the code generator must know which lexeme was found in the source program.
• Thus, the lexical analyzer returns to the parser a token name and an attribute value.
• For each lexeme, output of the following form is produced:
(token-name, attribute-value)
• attribute-value points to an entry in the symbol table for this token.
Example
• E = M * C ** 2
• <id, pointer to symbol table entry for E>
• <assign_op>
• <id, pointer to symbol table entry for M>
• <mult_op>
• <id, pointer to symbol table entry for C>
• <exp_op>
• <number, integer value 2>
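The token stream above can be reproduced with a small scanner. The following is only a sketch: the token names come from the slide, while `TOKEN_SPEC`, `tokenize` and the use of a list index as the "pointer" into the symbol table are assumptions made for illustration.

```python
import re

# Hypothetical token specification for this one example.
TOKEN_SPEC = [
    ("number", r"\d+"),
    ("id", r"[A-Za-z][A-Za-z0-9]*"),
    ("exp_op", r"\*\*"),       # listed before mult_op so "**" is not split
    ("mult_op", r"\*"),
    ("assign_op", r"="),
    ("ws", r"[ \t\n]+"),       # white space is recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Return (tokens, symbol_table); each token is (token-name, attribute-value)."""
    symbol_table = []          # identifier lexemes; the list index plays the pointer role
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "ws":
            continue
        if kind == "id":
            if lexeme not in symbol_table:
                symbol_table.append(lexeme)
            tokens.append(("id", symbol_table.index(lexeme)))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:
            tokens.append((kind, None))   # operators carry no attribute
    return tokens, symbol_table

tokens, table = tokenize("E = M * C ** 2")
for tok in tokens:
    print(tok)
```

Here `("id", 0)` stands for <id, pointer to symbol table entry for E>, since E is entry 0 of `table`.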



Input buffering
• Processing a large number of characters during the compilation of a large source program takes time, because of the overhead required to process each input character.
• How can reading of the source program be sped up?
• There are two main techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer pairs
• The lexical analyzer scans the input string from left to right, one character at a time.
• The buffer is divided into two N-character halves, where N is the number of characters on one disk block.
[Figure: a buffer of two halves holding E = M * C ** 2 eof]
Two pointers to the input are maintained:
• Pointer lexemeBegin marks the beginning of the current lexeme.
• Pointer forward scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to the character at its right end.
• lexemeBegin is set to the character immediately after the lexeme just found.
• If the forward pointer is at the end of the first buffer half, the second half is filled with N input characters.
• If the forward pointer is at the end of the second buffer half, the first half is filled with N input characters.
Code to advance forward pointer
if forward at end of first half then begin
    reload second half;
    forward := forward + 1;
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half;
end
else forward := forward + 1;
Sentinels
[Figure: the same buffer with an eof sentinel stored at the end of each half, plus the eof that marks the end of the input]
• In buffer pairs, each time we move the forward pointer we must check that we have not moved off one of the buffers.
• Thus, for each character read, we make two tests: one for the end of the buffer and one for the character itself.
• We can combine the buffer-end test with the test for the current character.
• We can reduce the two tests to one if we extend each buffer to hold a sentinel character at the end.
• The sentinel is a special character that cannot be part of the source program; a natural choice is the character eof.
Code to advance forward pointer (with sentinels)
forward := forward + 1;
if character at forward = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1;
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half;
    end
    else terminate lexical analysis;
end
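The sentinel scheme can be tried out with a short program. This is a sketch under stated assumptions: the class name `TwoBufferInput`, the helper `read_all`, the tiny half size `N = 4` and the use of "\0" as the eof sentinel are all illustrative choices, and the lexemeBegin pointer is left out.

```python
import io

EOF = "\0"   # sentinel; assumed never to occur in the source text
N = 4        # half size, kept tiny so that reloads actually happen

class TwoBufferInput:
    def __init__(self, text):
        self.src = io.StringIO(text)
        self.buf = [EOF] * (2 * N + 2)   # slots N and 2N+1 always hold sentinels
        self._load(0)
        self.forward = -1                # the first advance() moves to slot 0

    def _load(self, start):
        """Fill one half with up to N characters, padding with EOF."""
        chunk = self.src.read(N)
        for i in range(N):
            self.buf[start + i] = chunk[i] if i < len(chunk) else EOF

    def advance(self):
        """Advance forward by one character; return it, or None at end of input."""
        self.forward += 1
        if self.buf[self.forward] == EOF:          # the single test per character
            if self.forward == N:                  # end of first half
                self._load(N + 1)
                self.forward += 1
            elif self.forward == 2 * N + 1:        # end of second half
                self._load(0)
                self.forward = 0
            else:
                return None                        # eof inside a half: input is done
            if self.buf[self.forward] == EOF:
                return None                        # the reloaded half was empty
        return self.buf[self.forward]

def read_all(text):
    """Collect every character delivered by advance()."""
    buf, out = TwoBufferInput(text), []
    ch = buf.advance()
    while ch is not None:
        out.append(ch)
        ch = buf.advance()
    return "".join(out)

print(read_all("E=M*C**2"))
```

In the common case only the comparison with the sentinel is executed; the checks for the end of a half run only when a sentinel is actually seen.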



Specification of tokens
Specifying tokens
• Two issues in lexical analysis:
  • How to specify tokens?
  • How to recognize the tokens given a token specification (i.e. how to implement the nexttoken( ) routine)?
• How to specify tokens:
  • Tokens are specified by regular expressions.



Alphabets, String and Language
• An alphabet is a finite set of symbols, e.g. {a, b, c}.
• A string (sentence, word) over an alphabet is a finite sequence of symbols drawn from that alphabet.
• Terms related to parts of a string:
  • prefix vs. suffix
  • proper prefix vs. proper suffix
  • substring
  • subsequence
• A language is any countable set of strings over some fixed alphabet.
Alphabet    Language
{0,1}       {0, 10, 100, 1000, 10000, …}
{a,b,c}     {abc, aabbcc, aaabbbccc, …}
Strings and languages
Term: Definition
• Prefix of s: a string obtained by removing zero or more trailing symbols of string s, e.g., ban is a prefix of banana.
• Suffix of s: a string obtained by removing zero or more leading symbols of string s, e.g., nana is a suffix of banana.
• Substring of s: a string obtained by removing a prefix and a suffix from s, e.g., nan is a substring of banana.
• Proper prefix, suffix and substring of s: any nonempty string x that is, respectively, a prefix, suffix or substring of s such that s ≠ x.
• Subsequence of s: a string obtained by removing zero or more not necessarily contiguous symbols from s, e.g., baaa is a subsequence of banana.
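The definitions above translate directly into a few helper functions. This is a minimal sketch; the function names are chosen here for illustration only.

```python
def prefixes(s):
    """All prefixes of s, from the empty string up to s itself."""
    return [s[:i] for i in range(len(s) + 1)]

def suffixes(s):
    """All suffixes of s, from s itself down to the empty string."""
    return [s[i:] for i in range(len(s) + 1)]

def substrings(s):
    """All substrings of s: remove some prefix and some suffix."""
    return sorted({s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)})

def proper(parts, s):
    """Keep only the nonempty parts that differ from s."""
    return [x for x in parts if x and x != s]

def is_subsequence(x, s):
    """True if x can be obtained from s by deleting symbols (not necessarily contiguous)."""
    remaining = iter(s)
    return all(ch in remaining for ch in x)

print(proper(prefixes("banana"), "banana"))
print(is_subsequence("baaa", "banana"))
```

The same functions can be used to check answers to the exercise on the string Compiler.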



Exercise
• Write prefix, suffix, substring, proper prefix, proper suffix and subsequence of
following string:
String: Compiler



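Operations on languages
Operation: Definition
• Union of L and M, written L ∪ M: L ∪ M = { s | s is in L or s is in M }
• Concatenation of L and M, written LM: LM = { st | s is in L and t is in M }
• Kleene closure of L, written L*: L* = L⁰ ∪ L¹ ∪ L² ∪ …, i.e. zero or more concatenations of L
• Positive closure of L, written L+: L+ = L¹ ∪ L² ∪ …, i.e. one or more concatenations of L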
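For finite languages the four operations can be tried out with Python sets. This is only a sketch: the closures are infinite in general, so the hypothetical helpers `kleene_upto` and `positive_upto` cut them off after k concatenations.

```python
def union(L, M):
    return L | M

def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L concatenated with itself i times; power(L, 0) is {""}, i.e. {epsilon}."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result

def kleene_upto(L, k):
    """Finite approximation of L*: union of L^0 .. L^k."""
    out = set()
    for i in range(0, k + 1):
        out |= power(L, i)
    return out

def positive_upto(L, k):
    """Finite approximation of L+: union of L^1 .. L^k."""
    out = set()
    for i in range(1, k + 1):
        out |= power(L, i)
    return out

L, M = {"a", "b"}, {"0", "1"}
print(sorted(union(L, M)))
print(sorted(concat(L, M)))
print(sorted(kleene_upto({"a"}, 3)))
```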



Regular Expression & Regular Definition
Regular expression
• A regular expression is a sequence of characters that defines a pattern.
Notational shorthands
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instance: ?
4. Alphabet: Σ



Rules to define regular expression
1. ε is a regular expression that denotes {ε}, the set containing the empty string.
2. If a is a symbol in Σ, then a is a regular expression that denotes {a}.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then,
a. (r)|(s) is a regular expression denoting L(r) ∪ L(s)
b. (r)(s) is a regular expression denoting L(r)L(s)
c. (r)* is a regular expression denoting (L(r))*
d. (r) is a regular expression denoting L(r)

The language denoted by a regular expression is said to be a regular set.



Regular Expressions: Algebraic Properties
• Associativity:
  • L + (M + N) = (L + M) + N and
  • L.(M.N) = (L.M).N
• Commutativity:
  • L + M = M + L. However,
  • L.M ≠ M.L in general.
• Identity: ∅ + L = L + ∅ = L and ε.L = L.ε = L
• Distributivity:
  • left distributivity: L.(M + N) = L.M + L.N
  • right distributivity: (M + N).L = M.L + N.L



Regular expression
• L = zero or more occurrences of a = a*
a* = { ε, a, aa, aaa, aaaa, aaaaa, … } (an infinite set)



Regular expression
• L = one or more occurrences of a = a+
a+ = { a, aa, aaa, aaaa, aaaaa, … } (an infinite set)



Precedence and associativity of operators
Operator Precedence Associative
Kleene * 1 left
Concatenation 2 left
Union | 3 left



Regular expression examples
1. 0 or 1
   Strings: 0, 1    R.E. = 0|1
2. 0 or 11 or 111
   Strings: 0, 11, 111    R.E. = 0|11|111
3. String having zero or more a.
   Strings: ε, a, aa, aaa, aaaa, …    R.E. = a*
4. String having one or more a.
   Strings: a, aa, aaa, aaaa, …    R.E. = a+
5. Regular expression over Σ = {a, b, c} that represents all strings of length 3.
   Strings: abc, bca, bbb, cab, aba, …    R.E. = (a|b|c)(a|b|c)(a|b|c)
6. All binary strings
   Strings: 0, 11, 101, 10101, 1111, …    R.E. = (0|1)+



Regular expression examples
7. 0 or more occurrences of either a or b or both
   Strings: ε, a, aa, abab, bab, …    R.E. = (a|b)*
8. 1 or more occurrences of either a or b or both
   Strings: a, aa, abab, bab, bbbaaa, …    R.E. = (a|b)+
9. Binary no. ends with 0
   Strings: 0, 10, 100, 1010, 11110, …    R.E. = (0|1)*0
10. Binary no. ends with 1
   Strings: 1, 101, 1001, 10101, …    R.E. = (0|1)*1
11. Binary no. starts and ends with 1
   Strings: 11, 101, 1001, 10101, …    R.E. = 1(0|1)*1
12. String starts and ends with same character
   Strings: 00, 101, aba, baab, …



Regular expression examples
13. All strings of a and b starting with a
   R.E. = a(a|b)*
14. String of 0 and 1 ends with 00
   R.E. = (0|1)*00
15. String ends with abb
   R.E. = (a|b)*abb
16. String starts with 1 and ends with 0
   R.E. = 1(0|1)*0
17. All binary strings with at least 3 characters whose 3rd character is zero
   R.E. = (0|1)(0|1)0(0|1)*
18. Language which consists of exactly two b’s over the set {a, b}
   R.E. = a*ba*ba*
Regular expression examples
19. The language over {a, b} such that the 3rd character from the right end of the string is always a.
   R.E. = (a|b)*a(a|b)(a|b)
20. Any no. of a followed by any no. of b followed by any no. of c
   R.E. = a*b*c*
21. String should contain at least three 1
   R.E. = (0|1)*1(0|1)*1(0|1)*1(0|1)*
22. String should contain exactly two 1
   R.E. = 0*10*10*
23. Length of string should be at least 1 and at most 3
   R.E. = (0|1) | (0|1)(0|1) | (0|1)(0|1)(0|1)
24. No. of zeros should be a multiple of 3
   R.E. = 1*(01*01*01*)*
Regular expression examples
25. The language over {a, b, c} where the no. of a should be a multiple of 3
   Strings: aaa, baaa, bacaba, aaaaaa, …
   R.E. = (b|c)*(a(b|c)*a(b|c)*a(b|c)*)*
26. Even no. of 0
   R.E. = 1*(01*01*)*
27. String should have odd length
   R.E. = (0|1)((0|1)(0|1))*
28. String should have even length
   R.E. = ((0|1)(0|1))*
29. String starts with 0 and has odd length
   R.E. = 0((0|1)(0|1))*
30. String starts with 1 and has even length
   R.E. = 1(0|1)((0|1)(0|1))*
31. All strings that begin or end with 00 or 11
   Strings: 00101, 10100, 110, 01011, …
   R.E. = (00|11)(0|1)* | (0|1)*(00|11)
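Several of these examples can be checked mechanically, because the operators |, *, + and parentheses used here have the same meaning in Python's `re` module. The following is a sketch: the helper `check` and the sample strings are chosen here for illustration, and `re.fullmatch` is used so that the whole string must match.

```python
import re

def check(pattern, accept, reject):
    """True if pattern matches every string in accept and none in reject."""
    ok_accept = all(re.fullmatch(pattern, s) is not None for s in accept)
    ok_reject = all(re.fullmatch(pattern, s) is None for s in reject)
    return ok_accept and ok_reject

EXAMPLES = [
    # (description, pattern, strings that must match, strings that must not)
    ("ends with abb", r"(a|b)*abb", ["abb", "aabb", "babb"], ["ab", "abba", ""]),
    ("starts with 1, ends with 0", r"1(0|1)*0", ["10", "1010", "110"], ["1", "0", "01"]),
    ("exactly two b", r"a*ba*ba*", ["bb", "abab", "baab"], ["b", "bbb", "aaa"]),
    ("odd length", r"(0|1)((0|1)(0|1))*", ["0", "101", "11111"], ["", "10", "1100"]),
    ("3rd character is 0", r"(0|1)(0|1)0(0|1)*", ["000", "110", "01011"], ["00", "011", "1"]),
    ("even number of 0", r"1*(01*01*)*", ["", "1", "00", "1001", "0101"], ["0", "000", "10"]),
]

results = {desc: check(pat, acc, rej) for desc, pat, acc, rej in EXAMPLES}
all_ok = all(results.values())
for desc, ok in results.items():
    print(f"{desc}: {'ok' if ok else 'MISMATCH'}")
```

Adding your own accept and reject strings to `EXAMPLES` is a quick way to test an answer before drawing its automaton.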
Regular definition
• A regular definition gives names to certain regular expressions and uses those names in other regular expressions.
• A regular definition is a sequence of definitions of the form:
d1 → r1
d2 → r2
……
dn → rn
where each di is a distinct name and each ri is a regular expression.

• Example: Regular definition for identifier
letter → A|B|C|…|Z|a|b|…|z
digit → 0|1|…|9
id → letter (letter | digit)*



Regular definition example
• Example: Unsigned numbers (integer or floating point) are strings such as
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit → 0|1|…|9
digits → digit digit*
optional_fraction → .digits | ε
optional_exponent → (E(+|-|ε)digits) | ε
num → digits optional_fraction optional_exponent
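The regular definition for num can be written almost line by line with Python's `re` module, building each named pattern from the earlier ones. This is a sketch; the variable names mirror the definition above, and `is_unsigned_number` is a helper introduced only for this example.

```python
import re

# Each name is defined in terms of the names before it, as in the regular definition.
digit = r"[0-9]"
digits = rf"{digit}{digit}*"
optional_fraction = rf"(\.{digits})?"          # .digits | epsilon
optional_exponent = rf"(E[+-]?{digits})?"      # (E(+|-|epsilon)digits) | epsilon
num = re.compile(rf"{digits}{optional_fraction}{optional_exponent}")

def is_unsigned_number(text):
    """True if the whole text is a lexeme of token num."""
    return num.fullmatch(text) is not None

for sample in ["3", "5280", "39.37", "6.336E4", "1.894E-4", "2.56E+7", ".5", "1.", "E4"]:
    print(sample, is_unsigned_number(sample))
```

The last three samples are rejected because the definition requires at least one digit before the fraction and after the point.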
Regular grammar
• Regular grammars are used to represent regular languages.
• Regular grammars are either left linear or right linear:
  • Right regular grammars:
    • Rules of the forms A → ε, A → a, A → aB, where A, B are variables and a is a terminal
  • Left regular grammars:
    • Rules of the forms A → ε, A → a, A → Ba, where A, B are variables and a is a terminal
• Example: S → aS | bA, A → cA | ε
  • This grammar generates the language denoted by the regular expression a*bc*



Recognition of tokens
• Recognition of tokens is the second issue in lexical analysis.
• How to recognize the tokens given a token specification?
• Given the grammar of a branching statement:

• The terminals of the grammar, which are if, then, else, relop, id, and
number, are the names of tokens
• The patterns for the given tokens are:

• The lexical analyzer also has the job of stripping out whitespace, by
recognizing the "token" ws defined by:

• The table shows which token name is returned to the parser and what
attribute value for each lexeme or family of lexemes.



Transition Diagram
• A stylized flowchart is called a transition diagram.
• A circle is a state.
• An arrow between circles is a transition.
• An arrow labelled "start" points to the start state.
• A double circle is a final state.



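Transition Diagram : Relational operator
State 0 (start):
  on <      go to state 1
  on =      go to state 5, return (relop,EQ)
  on >      go to state 6
State 1:
  on =      go to state 2, return (relop,LE)
  on >      go to state 3, return (relop,NE)
  on other  go to state 4, return (relop,LT)
State 6:
  on =      go to state 7, return (relop,GE)
  on other  go to state 8, return (relop,GT)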
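A transition diagram like this one maps naturally onto a loop over a state variable. The sketch below follows the state numbers of the diagram; the function name `relop`, the return shape and the convention that the "other" character is not consumed (the pointer is left on it) are assumptions made for this example.

```python
def relop(text, pos=0):
    """Recognize a relational operator starting at text[pos].

    Returns ((token, attribute), next_pos), or None if no relop starts here.
    """
    state = 0
    while True:
        ch = text[pos] if pos < len(text) else ""   # "" stands for end of input
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("relop", "EQ"), pos + 1      # state 5
            elif ch == ">":
                state = 6
            else:
                return None
        elif state == 1:
            if ch == "=":
                return ("relop", "LE"), pos + 1      # state 2
            if ch == ">":
                return ("relop", "NE"), pos + 1      # state 3
            return ("relop", "LT"), pos              # state 4: "other" is not consumed
        elif state == 6:
            if ch == "=":
                return ("relop", "GE"), pos + 1      # state 7
            return ("relop", "GT"), pos              # state 8: "other" is not consumed
        pos += 1

for sample in ["<=", "<>", "<x", "=", ">=", ">1"]:
    print(sample, relop(sample))
```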



Transition diagram : Unsigned number
[Figure: transition diagram for unsigned numbers with states 1 to 8; edges are labelled digit, ".", E, "+ or -" and other]
Example inputs: 3, 5280, 39.37, 1.894 E - 4, 2.56 E + 7, 45 E + 6, 96 E 2
Finite Automata
• Finite automata are recognizers.
• A finite automaton simply says “Yes” or “No” about each possible input string.
• A finite automaton is a mathematical model consisting of:
1. Set of states
2. Set of input symbols
3. A transition function move
4. Initial state
5. Final states or accepting states



Types of finite automata
• Types of finite automata are:
  • Deterministic finite automata (DFA): each state has exactly one outgoing edge for each symbol.
  • Nondeterministic finite automata (NFA): there are no restrictions on the edges leaving a state. There can be several edges with the same symbol as label, and some edges can be labeled with ε.
[Figure: a DFA and an NFA, each drawn with states 1 to 4 and edges labelled a and b]
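Regular expression to NFA using Thompson's rule
1. For ε, construct the NFA
   start → i --ε--> f
2. For a in Σ, construct the NFA
   start → i --a--> f
3. For regular expression st, construct the NFA in which N(s) is followed by N(t): the start state of N(s) is the start state i, the final state of N(s) is merged with the start state of N(t), and the final state of N(t) is the final state f.
   Ex: ab
   start → 1 --a--> 2 --b--> 3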



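Regular expression to NFA using Thompson's rule
4. For regular expression s|t, construct the NFA with a new start state i and a new final state f: ε-edges lead from i to the start states of N(s) and N(t), and from the final states of N(s) and N(t) to f.
   Ex: (a|b)
   1 --ε--> 2 --a--> 3 --ε--> 6
   1 --ε--> 4 --b--> 5 --ε--> 6
5. For regular expression s*, construct the NFA with a new start state i and a new final state f: ε-edges lead from i to the start state of N(s) and to f, and from the final state of N(s) back to its start state and to f.
   Ex: a*
   1 --ε--> 2 --a--> 3 --ε--> 4
   3 --ε--> 2
   1 --ε--> 4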
Regular expression to NFA using Thompson's rule
• a*b
  1 --ε--> 2 --a--> 3 --ε--> 4 --b--> 5
  3 --ε--> 2
  1 --ε--> 4
• b*ab
  1 --ε--> 2 --b--> 3 --ε--> 4 --a--> 5 --b--> 6
  3 --ε--> 2
  1 --ε--> 4
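Thompson's rules can be coded as one small function per rule, each returning the (start, final) pair of the NFA fragment it builds. The sketch below is an illustration, not the slides' own implementation: the class name `ThompsonNFA` and its methods are invented here, and concatenation links the two fragments with an ε-edge instead of merging states, which gives a slightly larger NFA that accepts the same language.

```python
from itertools import count

class ThompsonNFA:
    def __init__(self):
        self.edges = {}            # state -> list of (label, target); label None means epsilon
        self._ids = count()

    def new_state(self):
        s = next(self._ids)
        self.edges[s] = []
        return s

    def symbol(self, a):           # rule 2: i --a--> f
        i, f = self.new_state(), self.new_state()
        self.edges[i].append((a, f))
        return i, f

    def concat(self, s, t):        # rule 3 (simplified): final of N(s) --eps--> start of N(t)
        self.edges[s[1]].append((None, t[0]))
        return s[0], t[1]

    def union(self, s, t):         # rule 4: new i and f around N(s) and N(t)
        i, f = self.new_state(), self.new_state()
        self.edges[i] += [(None, s[0]), (None, t[0])]
        self.edges[s[1]].append((None, f))
        self.edges[t[1]].append((None, f))
        return i, f

    def star(self, s):             # rule 5: skip edge and loop-back edge
        i, f = self.new_state(), self.new_state()
        self.edges[i] += [(None, s[0]), (None, f)]
        self.edges[s[1]] += [(None, s[0]), (None, f)]
        return i, f

    def closure(self, states):
        """States reachable from the given states on epsilon edges alone."""
        stack, seen = list(states), set(states)
        while stack:
            for label, target in self.edges[stack.pop()]:
                if label is None and target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    def accepts(self, fragment, text):
        current = self.closure({fragment[0]})
        for ch in text:
            moved = {t for s in current for label, t in self.edges[s] if label == ch}
            current = self.closure(moved)
        return fragment[1] in current

# Build the NFA for (a|b)*abb from the inside out.
nfa = ThompsonNFA()
ab_star = nfa.star(nfa.union(nfa.symbol("a"), nfa.symbol("b")))
abb = nfa.concat(nfa.concat(nfa.symbol("a"), nfa.symbol("b")), nfa.symbol("b"))
regex = nfa.concat(ab_star, abb)
print(nfa.accepts(regex, "babb"), nfa.accepts(regex, "abba"))
```

The exercise expressions can be built the same way by nesting `symbol`, `concat`, `union` and `star`.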



Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10. (0+1)*010(0+1)*
11. (010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA using subset construction method
Subset construction algorithm
Input: An NFA N.
Output: A DFA D accepting the same language.
Method: The algorithm constructs a transition table Dtran for D. We use the following operations:
OPERATION: DESCRIPTION
• ε-closure(s): set of NFA states reachable from NFA state s on ε-transitions alone.
• ε-closure(T): set of NFA states reachable from some NFA state s in T on ε-transitions alone.
• move(T, a): set of NFA states to which there is a transition on input symbol a from some NFA state s in T.



Subset construction algorithm
initially, ε-closure(s0) is the only state in Dstates and it is unmarked;
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        U := ε-closure(move(T, a));
        if U is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] := U;
    end
end



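Conversion from NFA to DFA
NFA for (a|b)*abb (start state 0, accepting state 10):
  0 --ε--> 1, 0 --ε--> 7
  1 --ε--> 2, 1 --ε--> 4
  2 --a--> 3
  4 --b--> 5
  3 --ε--> 6, 5 --ε--> 6
  6 --ε--> 1, 6 --ε--> 7
  7 --a--> 8
  8 --b--> 9
  9 --b--> 10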



Conversion from NFA to DFA: step by step
Start state:
  ε-closure(0) = {0, 1, 7, 2, 4} = {0,1,2,4,7} ---- A

A = {0,1,2,4,7}
  Move(A,a) = {3,8}
  ε-closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8} = {1,2,3,4,6,7,8} ---- B
  Move(A,b) = {5}
  ε-closure(Move(A,b)) = {5, 6, 7, 1, 2, 4} = {1,2,4,5,6,7} ---- C

B = {1,2,3,4,6,7,8}
  Move(B,a) = {3,8}
  ε-closure(Move(B,a)) = {1,2,3,4,6,7,8} ---- B
  Move(B,b) = {5,9}
  ε-closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9} = {1,2,4,5,6,7,9} ---- D

C = {1,2,4,5,6,7}
  Move(C,a) = {3,8}
  ε-closure(Move(C,a)) = {1,2,3,4,6,7,8} ---- B
  Move(C,b) = {5}
  ε-closure(Move(C,b)) = {1,2,4,5,6,7} ---- C

D = {1,2,4,5,6,7,9}
  Move(D,a) = {3,8}
  ε-closure(Move(D,a)) = {1,2,3,4,6,7,8} ---- B
  Move(D,b) = {5,10}
  ε-closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10} = {1,2,4,5,6,7,10} ---- E

E = {1,2,4,5,6,7,10}
  Move(E,a) = {3,8}
  ε-closure(Move(E,a)) = {1,2,3,4,6,7,8} ---- B
  Move(E,b) = {5}
  ε-closure(Move(E,b)) = {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
Transition Table
States                  a    b
A = {0,1,2,4,7}         B    C
B = {1,2,3,4,6,7,8}     B    D
C = {1,2,4,5,6,7}       B    C
D = {1,2,4,5,6,7,9}     B    E
E = {1,2,4,5,6,7,10}    B    C
[Figure: the DFA drawn from this table, with start state A and edges labelled a and b]
Note:
• The accepting state in the NFA is 10.
• 10 is an element of E.
• So, E is the accepting state in the DFA.
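The whole worked example can be reproduced in a few lines. The sketch below encodes the NFA for (a|b)*abb as a dictionary (an assumed representation, with None standing for ε) and follows the subset construction algorithm, processing unmarked states in the order they were found so that the names A to E come out as on the slides.

```python
import string

EPS = None   # label used for epsilon edges
NFA = {
    0: [(EPS, 1), (EPS, 7)],
    1: [(EPS, 2), (EPS, 4)],
    2: [("a", 3)],
    3: [(EPS, 6)],
    4: [("b", 5)],
    5: [(EPS, 6)],
    6: [(EPS, 1), (EPS, 7)],
    7: [("a", 8)],
    8: [("b", 9)],
    9: [("b", 10)],
    10: [],
}
ACCEPT = 10

def eps_closure(states):
    """NFA states reachable from the given states on epsilon edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        for label, target in NFA[stack.pop()]:
            if label is EPS and target not in seen:
                seen.add(target)
                stack.append(target)
    return frozenset(seen)

def move(states, symbol):
    """NFA states reachable from the given states on one edge labelled symbol."""
    return {t for s in states for label, t in NFA[s] if label == symbol}

def subset_construction(start, alphabet):
    first = eps_closure({start})
    dstates, unmarked, dtran = [first], [first], {}
    while unmarked:
        T = unmarked.pop(0)                  # "mark" T by removing it from the list
        for a in alphabet:
            U = eps_closure(move(T, a))
            if U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            dtran[T, a] = U
    return dstates, dtran

dstates, dtran = subset_construction(0, "ab")
names = {T: string.ascii_uppercase[i] for i, T in enumerate(dstates)}
table = {names[T]: tuple(names[dtran[T, a]] for a in "ab") for T in dstates}
accepting = [names[T] for T in dstates if ACCEPT in T]

for name, (on_a, on_b) in table.items():
    print(name, on_a, on_b)
```

For an NFA in which some move is empty, this sketch would add the empty set as an extra (dead) DFA state; that case does not arise in this example.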



Exercise
• Convert following regular expression to DFA using subset construction method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a



DFA optimization
Input: a DFA M with set of states S, accepting states F and start state s0. Output: a reduced DFA M'.
1. Construct an initial partition Π of the set of states with two groups: the accepting states F and the non-accepting states S − F.
2. Apply the repartition procedure to Π to construct a new partition Πnew:
for each group G of Π do begin
    partition G into subgroups such that two states s and t
    of G are in the same subgroup if and only if for all
    input symbols a, states s and t have transitions on a
    to states in the same group of Π.
    replace G in Πnew by the set of all subgroups formed.
end
3. If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat step (2) with Π := Πnew.
4. Choose one state in each group of the partition Πfinal as the representative for that group. The representatives will be the states of M'. Let s be a representative state, and suppose on input a there is a transition of M from s to t. Let r be the representative of t's group. Then M' has a transition from s to r on a. Let the start state of M' be the representative of the group containing the start state s0 of M, and let the accepting states of M' be the representatives that are in F.
5. If M' has a dead state d, then remove d from M'. Also remove any state not reachable from the start state.



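DFA optimization
Transition table of the DFA:
States  a  b
A       B  C
B       B  D
C       B  C
D       B  E
E       B  C

• Initial partition of {A, B, C, D, E}: nonaccepting states {A, B, C, D} and accepting states {E}.
• On input b, D goes to E while A, B and C stay inside {A, B, C, D}, so {A, B, C, D} splits into {A, B, C} and {D}.
• On input b, B goes to D while A and C go to C, so {A, B, C} splits into {A, C} and {B}.
• Now no more splitting is possible.
• If we choose A as the representative for group (AC), then we obtain the reduced transition table.

Optimized Transition Table
States  a  b
A       B  A
B       B  D
D       B  E
E       B  A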
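The partition refinement can be checked with a short program. This is a sketch under assumptions: the DFA is written as a nested dictionary, the function name `minimize` is invented here, and the smallest state name in each group is taken as its representative, which picks A for the group (AC) as on the slide.

```python
DFA = {
    "A": {"a": "B", "b": "C"},
    "B": {"a": "B", "b": "D"},
    "C": {"a": "B", "b": "C"},
    "D": {"a": "B", "b": "E"},
    "E": {"a": "B", "b": "C"},
}
ACCEPTING = {"E"}

def minimize(dfa, accepting, alphabet):
    """Return (groups, reduced_table) using repeated repartitioning."""
    # Step 1: accepting and non-accepting states form the initial partition.
    partition = [g for g in (set(dfa) - set(accepting), set(accepting)) if g]
    while True:
        def group_of(state):
            return next(i for i, g in enumerate(partition) if state in g)

        # Step 2: split each group by the groups its transitions lead to.
        new_partition = []
        for group in partition:
            buckets = {}
            for s in sorted(group):
                key = tuple(group_of(dfa[s][a]) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        # Step 3: groups are only ever split, so an equal count means no change.
        if len(new_partition) == len(partition):
            break
        partition = new_partition

    # Step 4: pick a representative per group and redirect transitions to it.
    rep = {s: min(g) for g in partition for s in g}
    reduced = {
        rep[s]: {a: rep[dfa[s][a]] for a in alphabet}
        for s in sorted(dfa) if rep[s] == s
    }
    return sorted(sorted(g) for g in partition), reduced

groups, reduced = minimize(DFA, ACCEPTING, "ab")
print(groups)
for state, row in reduced.items():
    print(state, row["a"], row["b"])
```

Step 5 (removing dead and unreachable states) is not implemented here, since this DFA has neither.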
Thank You
