Unit 2 - Lexical Anlaysis

Hawassa University
Department of Computer
Science
Compiler Design
Course code: CoSc4072

By: Mekonen M.
Unit Two
Lexical Analysis
Topics to be covered
 Looping
• Interaction of scanner & parser
• Token, Pattern & Lexemes
• Input buffering
• Specification of tokens
• Regular expression & Regular definition
• Transition diagram
• Finite automata
• Regular expression to NFA using Thompson's rule
• Conversion from NFA to DFA using subset construction
method
• DFA optimization
Interaction with Scanner &
Parser
Interaction of scanner & parser
Token
Source Lexical
Parser
Program Analyzer
Get next token
Symbol Table
• Upon receiving a “Get next token” command from parser, the lexical analyzer reads the input
character until it can identify the next token.
• Lexical analyzer also stripping out comments and white space in the form of blanks, tabs, and
newline characters from the source program.
Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 5

Why to separate lexical analysis & parsing?
1. Simplicity in design.
2. Improves compiler efficiency.
3. Enhance compiler portability.

Lexical Analyzer
• Lexical analysis is the process of converting a sequence of characters
into a sequence of tokens.
• A program or function which performs lexical analysis is called a
lexical analyzer, lexer, or scanner.
• A lexer often exists as a single function which is called by a parser
(syntax Analyzer) or another function.
• The tasks done by scanner are
• Grouping input characters into tokens
• Stripping out comments and white spaces and other separators
• Correlating error messages with the source program
• Macros expansion
• Pass token to parser

Token, Pattern & Lexemes
Token, Pattern & Lexemes
Token Pattern
The set of rules called pattern associated
Sequence of character having a
with a token.
collective meaning is known as
Example: “non-empty sequence of digits”,
token. “letter followed by letters and digits”
Categories of Tokens:
1.Identifier Lexemes
2.Keyword The sequence of character in a source

program matched with a pattern for a token
3.Operator is called lexeme.
4.Special symbol Example: Position, =, intial, rate, *,
5.Constant
Example: Token, Pattern & Lexemes
Example: total = sum + 45
Tokens:
total Identifier1
Operator1
=
sum Identifier2 Tokens
+ Operator2
45 Constant1
Lexemes
Lexemes of identifier: total, sum
Lexemes of operator: =, +
Lexemes of constant: 45
Token \Attributes
• More than one lexeme can match a pattern
• the lexical analyzer must provide the information about the particular
lexeme that matched.
• For example, the pattern for token number matches both 0, 1, 934, …
• But code generator must know which lexeme was found in the source
program.
• Thus, the lexical analyzer returns to the parser a token name and an
attribute value
• For each lexeme the following type of output is produced
(token-name, attribute-value)
• attribute-value points to an entry in the symbol table for this token
Example
• E = M * C ** 2
• <id, pointer to symbol table entry for E>
• < assign_op>
• <id, pointer to symbol table entry for M>
• <mult_op>
• <id, pointer to symbol table entry for C>
• <exp_op>
• <number, integer value 2>

Input buffering
Input buffering
• Processing large number of characters during the compilation of a large source
program consumes time- due to overhead required to process a single input
character
• How to speed the reading of source program ?
• There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer Pair
• The lexical analysis scans the input string from left to right one character at a time.
• Buffer divided into two N-character halves, where N is the number of character on
one disk block. : : : E : : = : : Mi : * : : C: * : * : 2 : eof : : :
:
Buffer pairs
Two pointers to the input are
maintained : : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :
lexemeBegin – marks the
beginning of the current forward forward
lexeme
lexeme_beginnig
Forward – scans ahead until a
pattern match is found
• Pointer Lexeme Begin, marks the beginning of the current lexeme.

• Pointer Forward, scans ahead until a pattern match is found.
• Once the next lexeme is determined, forward is set to character at its right end.
• Lexeme Begin is set to the character immediately after the lexeme just found.
• If forward pointer is at the end of first buffer half then second is filled with N input
character.
• If forward pointer is at the end of second buffer half then first is filled with N input
character.
Buffer pairs
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :
forward forward forward

lexeme_beginnig
Code to advance forward pointer
if forward at end of first half then begin
reload second half;
forward := forward + 1;
end
else if forward at end of second half then begin
reload first half;
move forward to beginning of first half;
end
else forward := forward + 1;
Sentinels
: : E : : = : : Mi : * : eof : C: * : * : 2 : eof : : eof
forward
lexeme_beginnig
• In buffer pairs we must check, each time we move the forward pointer that we
have not moved off one of the buffers.
• Thus, for each character read, we make two tests.
• We can combine the buffer-end test with the test for the current character.
• We can reduce the two tests to one if we extend each buffer to hold a sentinel
character at the end.
• The sentinel is a special character that cannot be part of the source program, and a
natural choice is the character EOF.
Sentinels
: : E : : = : : Mi : * : eof : C: * : * : 2 : eof : : eof
forward forward forward

lexeme_beginnig
if forward = eof then begin
if forward at end of first half then begin
reload second half;
end
else if forward at the second half then begin
reload first half;
move forward to beginning of first half;
end
else terminate lexical analysis;
end

Specification of tokens
Specifying tokens
• Two issues in lexical analysis.
• How to specify tokens?
• How to recognize the tokens giving a token specification?
(i.e. how to implement the nexttoken( ) routine)?
• How to specify tokens:
• Tokens are specified by regular expressions.

Alphabets, String and Language
• alphabet : a finite set of symbols. E.g. {a, b, c}
• A string/sentence/word over an alphabet is a finite sequence of
symbols drawn from that alphabet
• Do you remember words related to parts of a string:
• prefix VS suffix
• proper prefix VS proper suffix,
• substring
• subsequence
• A language is any countable set of strings over some fixed alphabet.
Alphabet Language
{0,1} {0,10,100,1000,10000,…}
{a,b,c} {abc,aabbcc,aaabbbccc,…}
Strings and languages
Term Definition
Prefix of s A string obtained by removing zero or more trailing symbol of
string S.
e.g., ban is prefix of banana.
Suffix of S A string obtained by removing zero or more leading symbol of
string S.
e.g., nana is suffix of banana.
Sub string of S A string obtained by removing prefix and suffix from S.
e.g., nan is substring of banana
Proper prefix, suffix Any nonempty string x that is respectively proper prefix, suffix or
and substring of S substring of S, such that s≠x.
Subsequence of S A string obtained by removing zero or more not necessarily
contiguous symbol from S.
e.g., baaa is subsequence of banana.

Exercise
• Write prefix, suffix, substring, proper prefix, proper suffix and subsequence of
following string:
String: Compiler

Operations on languages
Operation Definition
Union of L and M
Written L U M
Concatenation of L
and M
Written LM
Kleene closure of L
Written L∗
Positive closure of L
Written L+

Regular Expression & Regular
Definition
Regular expression
• A regular expression is a sequence of characters that define a pattern.
Notational shorthand's
1. One or more instances: +
2. Zero or more instances: *
3. Zero or one instances: ?
4. Alphabets: Σ

Rules to define regular expression
1. is a regular expression that denotes , the set containing empty string.
2. If is a symbol in then is a regular expression,
3. Suppose and are regular expression denoting the languages and . Then,
a. is a regular expression denoting
b. is a regular expression denoting
c. * is a regular expression denoting
d. is a regular expression denoting
The language denoted by regular expression is said to be a regular set.

Regular Expressions: Algebraic Properties
• Associativity:
• L + (M + N) = (L + M) + N and
• L.(M.N) = (L.M).N.
• Commutativity:
• L + M = M + L. However,
• L.M = M.L in general.
• Identity: –∅ + L = L + ∅ = L and ε.L = L.ε = L
• Distributivity:
• left distributivity L.(M + N) = L.M + L.N.
• right distributivity (M + N).L = M.L + N.L.

Regular expression
• L = Zero or More Occurrencesa*of a =
*
𝜖
a
aa
aaa Infinite …..
aaaa
aaaaa….
.

Regular expression
a +
• L = One or More Occurrences of a =
+
a
aa
aaa Infinite …..
aaaa
aaaaa….
.

Precedence and associativity of operators
Operator Precedence Associative
Kleene * 1 left
Concatenation 2 left
Union | 3 left

Regular expression examples
1. 0 or 1
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 :𝟎 ,𝟏𝐑 . 𝐄 .=𝟎∨𝟏
2. 0 or 11 or 111
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 :𝟎 ,𝟏𝟏, 𝟏𝟏𝟏 𝐑 . 𝐄 .=𝟎|𝟏𝟏|𝟏𝟏𝟏
3. String having zero or more a.
𝐑 . 𝐄 .= 𝐚 ∗
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 : 𝛜 , 𝐚 , 𝐚𝐚 , 𝐚𝐚𝐚 , 𝐚𝐚𝐚𝐚 …..
4. String having one or more a.
𝐑 . 𝐄 .= 𝐚 +¿
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 : 𝐚 , 𝐚𝐚 , 𝐚𝐚𝐚 , 𝐚𝐚𝐚𝐚 …..
5. Regular expression over that represent all string of length 3.
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 : 𝐚𝐛𝐜 , 𝐛𝐜𝐚 , 𝐛𝐛𝐛 ,𝐜𝐚𝐛 ,𝐚𝐛𝐚 …. 𝐑 . 𝐄 .= ( 𝐚|𝐛|𝐜 )( 𝐚|𝐛|𝐜 ) (𝐚|𝐛|𝐜)
6. All binary string
𝐒𝐭𝐫𝐢𝐧𝐠𝐬 :𝟎,𝟏𝟏,𝟏𝟎𝟏,𝟏𝟎𝟏𝟎𝟏,𝟏𝟏𝟏𝟏… +

7. 0 or more occurrence of either a or b or both
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝝐,𝒂,𝒂𝒂,𝒂𝒃𝒂𝒃,𝒃𝒂𝒃… 𝑹. 𝑬 .=(𝒂∨𝒃)∗
8. 1 or more occurrence of either a or b or both
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝒂,𝒂𝒂,𝒂𝒃𝒂𝒃,𝒃𝒂𝒃,𝒃𝒃𝒃𝒂𝒂𝒂… +
9. Binary no. ends with 0

𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝟎,𝟏𝟎,𝟏𝟎𝟎,𝟏𝟎𝟏𝟎,𝟏𝟏𝟏𝟏𝟎… *
10.Binary no. ends with 1

𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝟏,𝟏𝟎𝟏,𝟏𝟎𝟎𝟏,𝟏𝟎𝟏𝟎𝟏,… 𝑹. 𝑬 .=(𝟎∨𝟏)∗𝟏
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝟏𝟏,𝟏𝟎𝟏,𝟏𝟎𝟎𝟏,𝟏𝟎𝟏𝟎𝟏,…
11.Binary no. starts and ends with 1
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝟎𝟎,𝟏𝟎𝟏,𝒂𝒃𝒂,𝒃𝒂𝒂𝒃…
12.String starts and ends with same character

13.All string of a and b starting with a
… *
14.String of 0 and 1 ends with 00

… 𝑹. 𝑬 .=(𝟎∨𝟏)∗𝟎𝟎
15.String ends with abb
… 𝑹. 𝑬 .=(𝒂∨𝒃)∗𝒂𝒃𝒃
16.String starts with 1 and ends with 0
… 𝑹. 𝑬 .=𝟏(𝟎∨𝟏)∗𝟎
17.All binary string with at least 3 characters and 3 rd character should be zero
… 𝑹.𝑬.=( 𝟎|𝟏 )( 𝟎|𝟏) 𝟎(𝟎∨𝟏)∗
18.Language which consist of exactly two b’s over the set
…
Mekonen M.
𝑹. 𝑬 .=𝒂∗𝒃 𝒂∗𝒃𝒂∗
# CoSc4072  Unit 2 – Lexical Analysis 34
19. The language with such that 3rd character from right end of the string is always a.
… 𝑹.𝑬.=(𝒂∨𝒃)∗𝒂(𝒂∨𝒃)(𝒂∨𝒃)
20. Any no. of followed by any no. of followed by any no. of
… 𝑹. 𝑬 .=𝒂∗𝒃∗𝒄
21. String should contain at least three
∗
∗ ∗ ∗ ∗
…. 𝑹.𝑬.=(𝟎∨𝟏) 𝟏(𝟎∨𝟏) 𝟏(𝟎∨𝟏) 𝟏(𝟎∨𝟏)
22. String should contain exactly two
∗ ∗ ∗
….
𝑹 . 𝑬 .=𝟎 𝟏 𝟎 𝟏 𝟎
23. Length of string should be at least 1 and at most 3
…. 𝑹.𝑬.=( 𝟎∨𝟏)|( 𝟎∨𝟏)( 𝟎∨𝟏)|( 𝟎∨𝟏)( 𝟎∨𝟏 )( 𝟎∨𝟏)
24. No. of zero should be multiple of 3
∗ ∗ ∗ ∗ ∗
…. 𝑹. 𝑬 .=(𝟏 𝟎𝟏 𝟎𝟏 𝟎𝟏 )
24. The language with where should be multiple of 3
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝒂𝒂𝒂,𝒃𝒂𝒂𝒂,𝒃𝒂𝒄𝒂𝒃𝒂,𝒂𝒂𝒂𝒂𝒂𝒂.. ∗ ∗ ∗
𝑹.𝑬.=( ( 𝒃∨𝒄 ) 𝒂 ( 𝒃∨𝒄 ) 𝒂 ( 𝒃∨𝒄 ) 𝒂 ( 𝒃∨𝒄 ) )
∗∗
25. Even no. of 0 ∗ ∗ ∗ ∗

…. 𝑹 . 𝑬 .=(𝟏 𝟎 𝟏 𝟎 𝟏 )
26. String should have odd length ∗
…. 𝑹. 𝑬 .=( 𝟎∨𝟏 ) (( 𝟎|𝟏 ) (𝟎∨𝟏))
27. String should have even length ∗
…. 𝑹 . 𝑬 .=( ( 𝟎|𝟏 ) ( 𝟎∨𝟏))
28. String start with 0 and has odd length ∗
…. 𝑹. 𝑬 .=( 𝟎 ) ( ( 𝟎|𝟏 ) (𝟎∨𝟏))
30. String start with 1 and has even length ∗
…. 𝑹. 𝑬 .=𝟏(𝟎∨𝟏)(( 𝟎|𝟏 ) (𝟎∨𝟏))
𝑺𝒕𝒓𝒊𝒏𝒈𝒔:𝟎𝟎𝟏𝟎𝟏,𝟏𝟎𝟏𝟎𝟎,𝟏𝟏𝟎,𝟎𝟏𝟎𝟏𝟏…
31. All string begins or ends with𝑹.𝑬.=(𝟎𝟎∨𝟏𝟏)(𝟎∨𝟏)∗∨
00 or 11 ( 𝟎|𝟏 ) ∗(𝟎𝟎∨𝟏𝟏)
Regular definition
• A regular definition gives names to certain regular expressions and uses those
names in other regular expressions.
• Regular definition is a sequence of definitions of the form:
……
Where is a distinct name & is a regular expression.

 Example: Regular definition for identifier
letter  A|B|C|………..|Z|a|b|………..|z
digit  0|1|…….|9|
id letter (letter | digit)*

Regular definition example
• Example: Unsigned numbers(integer or floating point) are strings such as
3
5280
39.37
6.336E4
1.894E-4
2.56E+7
Regular Definition
digit  0|1|…..|9
digits  digit digit*
optional_fraction  .digits | 𝜖
optional_exponent  (E(+|-|𝜖)digits)|𝜖
num  digits optional_fraction optional_exponent
Regular grammar
• Basically regular grammars are used to represent regular languages
• The Regular Grammars are either left or right linear:
• Right Regular Grammars:
• Rules of the forms: A → ε, A → a, A → aB, A,B: variables and a:
terminal
• Left Regular Grammars:
• Rules of the forms: A → ε, A → a, A → Ba; A,B: variables and A:
terminal
• Example: S → aS | bA, A → cA | ε
• This grammar produces the language produced by the regular expression
a*bc*

Recognition of tokens
• Recognition of tokens is the second issue in lexical analysis
• How to recognize the tokens giving a token specification?
• Given the grammar of branching statement:
• The terminals of the grammar, which are if, then, else, relop, id, and
number, are the names of tokens
• The patterns for the given tokens are:
Recognition of tokens …
• The lexical analyzer also has the job of stripping out whitespace, by
recognizing the "token" ws defined by:

Recognition of tokens …
• The table shows which token name is returned to the parser and what
attribute value for each lexeme or family of lexemes.

Transition Diagram
Transition Diagram
• A stylized flowchart is called transition diagram.
is a state
is a transition
is a start state
is a final state

Transition Diagram : Relational operator
<
0 1
=
2 return (relop,LE)
>
3 return (relop,NE)
=
other
5
4 return (relop,LT)
return (relop,EQ)
>
6 =
7 return (relop,GE)
other
8 return (relop,GT)

Transition diagram : Unsigned number
digit digit digit
start digit . digit other

1 2 3 digit E +or -
4 5 6 7 8
E digit
3
5280
39.37
1.894 E - 4
2.56 E + 7
45 E + 6
96 E 2
Finite Automata
Finite Automata
• Finite Automata are recognizers.
• FA simply say “Yes” or “No” about each possible input string.
• Finite Automata is a mathematical model consist of:
1. Set of states
2. Set of input symbol
3. A transition function move
4. Initial state
5. Final states or accepting states

Types of finite automata
• Types of finite automata are:
DFA
b
 Deterministic finite automata (DFA): have
for each state exactly one edge leaving out a b b
1 2 3 4
for each symbol.
a
a
b a
NFA DFA
 Nondeterministic finite automata (NFA): a
There are no restrictions on the edges
leaving a state. There can be several with a b b
1 2 3 4
the same symbol as label and some edges
can be labeled with .
b NFA
Regular expression to NFA
using Thompson's rule
Regular expression to NFA using Thompson's rule
1. For , construct the NFA 3. For regular expression
𝑖 N(s) 𝑓
start
start N(t)
𝑖 𝑓
𝜖
Ex: ab
2. For in , construct the NFA
𝑖 𝑓
start a a b
1 2 3

4. For regular expression 5. For regular expression *
𝜖
N(s) 𝜖
𝜖
𝜖 𝜖
𝑖 𝑓
start
𝑖 𝑓
start N(s)
𝜖 N(t) 𝜖 𝜖
Ex: a*
𝜖
Ex: (a|b) a
2 3
𝜖 𝜖 𝑎 𝜖
1
𝜖
6
1 2 3 4
𝜖 𝜖 𝜖
4 5
b
• a*b
𝜖 𝑎 𝜖 𝑏
1 2 3 4 5
𝜖
𝜖
• b*ab
𝜖 𝑏 𝜖 𝑎 𝑏
1 2 3 4 5 6
𝜖

Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10. (0+1)*010(0+1)*
11. (010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA
using subset construction
method
Subset construction algorithm
Input: An NFA .
Output: A DFA D accepting the same language.
Method: Algorithm construct a transition table for D. We use the following
operation:
OPERATION DESCRIPTION
Set of NFA states reachable from NFA state on –
transition alone.
Set of NFA states reachable from some NFA state
in on – transition alone.
Set of NFA states to which there is a transition on
input symbol from some NFA state in .

Subset construction algorithm
initially be the only state in and it is unmarked;
while there is unmarked states T in do begin
mark ;
for each input symbol do begin
if is not in then
add as unmarked state to
end
end

(a|b)*abb 𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b

𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
{0, 1, 7, 2,
𝜖- Closure(0)=
4}
=
---- A
{0,1,2,4,7}

𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B=
{1,2,3,4,6,7,8}
𝜖 𝜖
4 5
b
𝜖
A= {0, 1, 2, 4, 7}
Move(A,a)= {3,8}
𝜖-
Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B=
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5
b
𝜖
A= {0, 1, 2, 4, 7}
Move(A,b) ={5}
𝜖- Closure(Move(A,b)) ={5, 6, 7, 1, 2, 4}
----
= {1,2,4,5,6,7}
C
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5
b
𝜖
B = {1, 2, 3, 4, 6, 7, 8}
Move(B,a)= {3,8}
𝜖-
Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
𝜖 𝜖
4 5 D=
b {1,2,4,5,6,7,9}
𝜖
B= {1, 2, 3, 4, 6, 7,
8}
Move(B,b)= {5,9}
𝜖-
= {5, 6, 7, 1, 2, 4, 9}
Closure(Move(B,b)) ----
= {1,2,4,5,6,7,9}
D
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B
𝜖 𝜖
4 5 D=
b {1,2,4,5,6,7,9}
C= {1, 2, 4, 5, 6 ,7}
Move(C,a)= {3,8}
𝜖-
Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D=
b {1,2,4,5,6,7,9}
𝜖
C= {1, 2, 4, 5, 6,
7}
Move(C,b)
{5}
=
𝜖-
Closure(Move(C,b))= {5, 6, 7, 1, 2, 4}
----
= {1,2,4,5,6,7}
C
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D= B
b {1,2,4,5,6,7,9}
𝜖
D= {1, 2, 4, 5, 6, 7,
9}
Move(D,a)= {3,8}
𝜖-
Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D= B E
b {1,2,4,5,6,7,9}
E=
{1,2,4,5,6,7,10}
𝜖
D= {1, 2, 4, 5, 6, 7,
9}
Move(D,b)= {5,10}
𝜖- = {5, 6, 7, 1, 2, 4,
Closure(Move(D,b)) 10}
= {1,2,4,5,6,7,10}---- E
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D= B E
b {1,2,4,5,6,7,9}
E= B
{1,2,4,5,6,7,10}
𝜖
E= {1, 2, 4, 5, 6, 7,
10}
Move(E,a)= {3,8}
𝜖-
Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
𝜖
a
2 3
States a b
𝜖 𝜖 A = {0,1,2,4,7} B C
𝜖 𝜖 a b b
0 1 6 7 8 9 10 B= B D
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7} B C
𝜖 𝜖
4 5 D= B E
b {1,2,4,5,6,7,9}
E= B C
{1,2,4,5,6,7,10}
𝜖
E= {1, 2, 4, 5, 6, 7,
10}
Move(E,b)={5}
𝜖-
{5,6,7,1,2,4}
Closure(Move(E,b))=
----
= {1,2,4,5,6,7}
C
a
b
States
A = {0,1,2,4,7}
a
B
b
C
a B a
D
B=
{1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
B
B
D
C A a a b
D=
{1,2,4,5,6,7,9}
E=
B
B
E
C
b
C b
E
{1,2,4,5,6,7,10}
Transition
Table
b
Note:
• Accepting state in NFA is 10 DFA
• 10 is element of E
• So, E is acceptance state in DFA

Exercise
• Convert following regular expression to DFA using subset construction method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a

DFA optimization
DFA optimization
1. Construct an initial partition of the set of states with two groups: the accepting
states and the non-accepting states .
2. Apply the repartition procedure to to construct a new partition .
3. If , let and continue with step (4). Otherwise, repeat step (2) with .
for each group of do begin
partition into subgroups such that two states and
of are in the same subgroup if and only if for all
input symbols , states and have transitions on
to states in the same group of .
replace in by the set of all subgroups formed.
end
DFA optimization
4. Choose one state in each group of the partition as the representative for that
group. The representatives will be the states of . Let s be a representative state,
and suppose on input a there is a transition of from to . Let be the
representative of s group. Then has a transition from to on . Let the start state
of be the representative of the group containing start state of , and let the
accepting states of be the representatives that are in .
5. If has a dead state , then remove from . Also remove any state not reachable
from the start state.

DFA optimization
States a b
{ 𝐴, 𝐵,𝐶, 𝐷, 𝐸} A B C
B B D
Accepting States
Nonaccepting States C B C
D B E
{𝐷} E B C
States a b
A B A
B B D
• Now no more splitting is possible. D B E

E B A
• If we chose A as the representative for
Optimized
group (AC), then we obtain reduced Transition Table
transition table
Thank You

Unit 2 - Lexical Anlaysis

Uploaded by

Document Information

Original Description:

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Unit 2 - Lexical Anlaysis

Uploaded by

Copyright:

Available Formats

Hawassa University

Course code: CoSc4072

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 5

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 6

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 7

2.Keyword The sequence of character in a source

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 12

• Pointer Lexeme Begin, marks the beginning of the current lexeme.

forward forward forward

forward forward forward

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 18

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 20

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 22

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 23

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 24

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 26

The language denoted by regular expression is said to be a regular set.

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 27

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 28

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 29

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 30

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 31

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 32

9. Binary no. ends with 0

10.Binary no. ends with 1

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 33

14.String of 0 and 1 ends with 00

25. Even no. of 0 ∗ ∗ ∗ ∗

Where is a distinct name & is a regular expression.

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 37

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 39

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 41

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 42

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 44

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 45

digit digit digit

start digit . digit other

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 48

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 51

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 53

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 56

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 57

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 58

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 59

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 70

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 71

Mekonen M. # CoSc4072  Unit 2 – Lexical Analysis 74

• Now no more splitting is possible. D B E

You might also like