Professional Documents
Culture Documents
Token
Source Lexical
Parser Parse Tree
Program Analyzer
Get next token
Symbol Table
Main Duty
• Generate Tokens from Source Program
Secondary Duties
• Remove / Stripping single and Multi-line comments
• Remove white spaces in the form of blank, tab and new line
• Keep track of line number for error message
Why to separate lexical analysis & parsing?
1. Simplicity in design.
2. Improves compiler efficiency.
3. Enhance compiler portability.
Token, Pattern & Lexemes
Token Pattern
Sequence of character having a The set of rules called pattern associated
collective meaning is known as with a token.
token. Example: “non-empty sequence of digits”,
Categories of Tokens: “letter followed by letters and digits”
1. Identifier Lexemes
2. Keyword The sequence of character in a source
3. Operator program matched with a pattern for a
4. Special symbol token is called lexeme.
5. Constant Example: Sum, count, Average
Token, Pattern & Lexemes (Example)
Example: Position = rate * 9
Tokens
Identifier1
Position
Operator1
=
Rate Identifier2 Tokens
* Operator2
Constant1
9
Lexemes
Lexemes of identifier: Position, Rate
Lexemes of operator: =, *
Lexemes of constant: 9
Example
if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=
print = token1
( = token2
"i=%d%x" = token3 ////anything inside " " is considered as 1 token
, = token4
i = token5
, = token6
& = token7
i = token8
) = token9
; = token10
so total 10 tokens
Input buffering
Input Buffering
• How to speed the reading of source program ?
• to look one additional character ahead
• e.g.
• to see the end of an identifier you must see a character
• which is not a letter or a digit
• not a part of the lexeme for id
• in C
- ,= , <
->, ==, <=
• two buffer scheme that handles large lookaheads
Safely
• sentinels – improvement which saves time checking
• buffer ends
Input buffering
There are mainly two techniques for input buffering:
1. Buffer pairs
2. Sentinels
Buffer pairs
The lexical analysis scans the input string from left to right one
character at a time.
Buffer divided into two N-character halves, where N is the number
of character on one disk block.
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :
Buffer pairs
: : : E : : = : : Mi : * : : : C: * : * : 2 : eof : : :
forward forward
lexeme_beginnig
forward
lexeme_beginnig
In buffer pairs we must check, each time we move the forward pointer
that we have not moved off one of the buffers.
Thus, for each character read, we make two tests.
We can combine the buffer-end test with the test for the current
character.
We can reduce the two tests to one if we extend each buffer to hold a
sentinel character at the end.
The sentinel is a special character that cannot be part of the source
program, and a natural choice is the character EOF.
Sentinels
eof : C: * : * : 2 : eof : : eof
: : E : : = : : Mi : * : eof
forward := forward + 1;
if forward = eof then begin
if forward at end of first half then begin
reload second half;
forward := forward + 1;
end
else if forward at the second half then begin
reload first half;
move forward to beginning of first half;
end
else terminate lexical analysis;
end
Specification of tokens
Strings and languages
Term Definition
Prefix of s A string obtained by removing zero or more trailing symbol of
string S.
e.g., ban is prefix of banana.
Proper prefix, suffix Any nonempty string x that is respectively proper prefix, suffix or
and substring of S substring of S, such that s≠x.
*
𝜖
a
aa
aaa Infinite …..
aaaa
aaaaa…..
Regular expression
+
a+
L = One or More Occurrences of a =
a
aa
aaa Infinite …..
aaaa
aaaaa…..
Precedence and associativity of operators
Operator Precedence Associative
Kleene * 1 left
Concatenation 2 left
Union | 3 left
Regular expression examples
Regular expression examples
7. 0 or more occurrence of either a or b or both
is a state
is a transition
is a start state
is a final state
Transition diagram example: Relational operator
< =
2 return (relop,LE)
>
3 return (relop,NE)
=
other
5
4 return (relop,LT)
return (relop,EQ)
>
=
7 return (relop,GE)
other
8 return (relop,GT)
Transition diagram example: Unsigned number
other
start digit . digit E +or - digit
8
E digit
3
5280
39.37
1.894 E - 4
2.56 E + 7
45 E + 6
96 E 2
Hard coding & automatic
generation lexical analyzers
Hard coding and automatic generation lexical analyzers
Lexical analysis is about identifying the pattern from the input.
To recognize the pattern, transition diagram is constructed.
It is known as hard coding lexical analyzer.
Example: to represent identifier in ‘C’, the first character must be
letter and other characters are either letter or digits.
To recognize this pattern, hard coding lexical analyzer will work
with a transition diagram.
Letter or digit
Start Letter
1 2 3
Hard coding and automatic generation lexical analyzers
The automatic generation lexical analyzer takes special notation as
input.
For example, lex compiler tool will take regular expression as
input and finds out the pattern matching to that regular expression.
Finite automata
Finite automata
Types of finite automata
b
a
a b b a b b
1 2 3 4 1 2 3 4
a
a
b a
b
DFA NFA
Regular expression to NFA
using Thompson's rule
Regular expression to NFA (Thompson’s construction)
start 𝜖
start a
Regular expression to NFA (Thompson’s construction)
N(s) 𝜖
𝜖
start
𝜖 N(t) 𝜖
a
2 3
𝜖 𝜖
1 6
𝜖 𝜖
4 5
b
Regular expression to NFA (Thompson’s construction)
start
N(s) N(t)
a b
1 2 3
Regular expression to NFA (Thompson’s construction)
𝜖
start 𝜖 𝜖
N(s)
𝜖 𝜖
1 2 3
𝜖
Regular expression to NFA examples
𝜖
• a*b
𝜖 𝜖
1 2 3
• b*ab
𝜖
𝜖 𝜖
1 2 3 5
𝜖
Regular expression to NFA examples
•(c|d) c
2 3
𝜖 𝜖
1 6
𝜖 𝜖
4 5
d
• (c|d)* 𝜖
c
2 3
𝜖 𝜖
𝜖 𝜖
0 1 6 7
𝜖 𝜖
4 5
d
𝜖
Exercise
Convert following regular expression to NFA:
1. abba
2. bb(a)*
3. (a|b)*
4. a* | b*
5. a(a)*ab
6. aa*+ bb*
7. (a+b)*abb
8. 10(0+1)*1
9. (a+b)*a(a+b)
10. (0+1)*010(0+1)*
11. (010+00)*(10)*
12. 100(1)*00(0+1)*
Conversion from NFA to DFA
using subset construction
method
Subset construction algorithm
OPERATION DESCRIPTION
Subset construction algorithm
Conversion from NFA to DFA
(a|b)* abb 𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
𝜖
Conversion from NFA to DFA
(a|b)* abb 𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
𝜖- Closure(0)= {0, 1, 7, 2, 4}
= {0,1,2,4,7} ---- A
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B
A= {0, 1, 2, 4, 7} B = {1,2,3,4,6,7,8}
Move(A,a) = {3,8}
𝜖- Closure(Move(A,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
A= {0, 1, 2, 4, 7} B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
Move(A,b) = {5}
𝜖- Closure(Move(A,b)) = {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
B = {1, 2, 3, 4, 6, 7, 8} B = {1,2,3,4,6,7,8} B
C = {1,2,4,5,6,7}
Move(B,a) = {3,8}
𝜖- Closure(Move(B,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
B= {1, 2, 3, 4, 6, 7, 8} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7}
Move(B,b) = {5,9}
D = {1,2,4,5,6,7,9}
𝜖- Closure(Move(B,b)) = {5, 6, 7, 1, 2, 4, 9}
= {1,2,4,5,6,7,9} ---- D
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
C= {1, 2, 4, 5, 6 ,7} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B
Move(C,a) = {3,8}
D = {1,2,4,5,6,7,9}
𝜖- Closure(Move(C,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
C= {1, 2, 4, 5, 6, 7} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
Move(C,b) = {5}
D = {1,2,4,5,6,7,9}
𝜖- Closure(Move(C,b))= {5, 6, 7, 1, 2, 4}
= {1,2,4,5,6,7} ---- C
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
D= {1, 2, 4, 5, 6, 7, 9} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
Move(D,a) = {3,8}
D = {1,2,4,5,6,7,9} B
𝜖- Closure(Move(D,a)) = {3, 6, 7, 1, 2, 4, 8}
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
D= {1, 2, 4, 5, 6, 7, 9} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
Move(D,b)= {5,10}
D = {1,2,4,5,6,7,9} B E
𝜖- Closure(Move(D,b)) = {5, 6, 7, 1, 2, 4, 10}
E = {1,2,4,5,6,7,10}
= {1,2,4,5,6,7,10} ---- E
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
E= {1, 2, 4, 5, 6, 7, 10} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
Move(E,a) = {3,8}
D = {1,2,4,5,6,7,9} B E
𝜖- Closure(Move(E,a)) = {3, 6, 7, 1, 2, 4, 8}
E = {1,2,4,5,6,7,10} B
= {1,2,3,4,6,7,8} ---- B
Conversion from NFA to DFA
𝜖
a
2 3
𝜖 𝜖
𝜖 𝜖 a b b
0 1 6 7 8 9 10
𝜖 𝜖
4 5
b
States a b
𝜖 A = {0,1,2,4,7} B C
E= {1, 2, 4, 5, 6, 7, 10} B = {1,2,3,4,6,7,8} B D
C = {1,2,4,5,6,7} B C
Move(E,b)= {5}
D = {1,2,4,5,6,7,9} B E
𝜖- Closure(Move(E,b))= {5,6,7,1,2,4}
E = {1,2,4,5,6,7,10} B C
= {1,2,4,5,6,7} ---- C
DFA
a
b
States a b
a
A = {0,1,2,4,7} B C a
B = {1,2,3,4,6,7,8} B D
a a b
C = {1,2,4,5,6,7} B C
D = {1,2,4,5,6,7,9} B E b
E = {1,2,4,5,6,7,10} B C b
Transition Table
b
Note:
• Accepting state in NFA is 10
DFA
• 10 is element of E
• So, E is acceptance state in DFA
Exercise
Convert following regular expression to DFA using subset
construction method:
1. (a+b)*a(a+b)
2. (a+b)*ab*a
DFA optimization
DFA Optimization Algorithm
DFA Optimization Algorithm
DFA Optimization
States a b
A B C
B B D
C B C
D B E
E B C
States a b
A B A
B B D
• Now no more splitting is possible.
D B E
• If we chose A as the representative for group E B A
(AC), then we obtain reduced transition table Optimized
Transition Table
Conversion from regular
expression to DFA
Function computed from the syntax tree
Rules to compute nullable, firstpos, lastpos
Node n nullable(n) firstpos(n) lastpos(n)
true
false
n
if (nullable(c1)) if (nullable(c2))
nullable(c1)
thenfirstpos(c1) then lastpos(c1)
and
c1 c2 nullable(c2) firstpos(c2) lastpos(c2)
else firstpos(c1) else lastpos(c2)
n
true firstpos(c1) lastpos(c1)
c1
Rules to compute followpos
1. If n is concatenation node with left child c1 and right
child c2 and i is a position in lastpos(c1), then all
position in firstpos(c2) are in followpos(i)
n
firstpos(c1)
c1
n if (nullable(c1))
firstpos(c1) firstpos(c2)
c1 c2 else firstpos(c1)
Conversion from regular expression to DFA
Step 3: Calculate lastpos
Lastpos
.
.
.
n
lastpos(c1) lastpos(c2)
.
c1 c2
n
lastpos(c1)
c1
n if (nullable(c2))
lastpos(c1) lastpos(c2)
c1 c2 else lastpos(c2)
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos .
Lastpos
.
.
. .
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos . 4 5
Lastpos
.
.
. .
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos . 4 5
Lastpos
. 3 4
.
. .
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos . 4 5
Lastpos
. 3 4
2 3
.
1 3
. .
Conversion from regular expression to DFA
Step 4: Calculate followpos Position followpos
5 6
Firstpos . 4 5
Lastpos
. 3 4
2 1,2, 3
.
1 1,2, 3
.
*
Construct DFA
Position followpos
5 6
4 5
3 4
2 1,2,3
1 1,2,3
States a b
A={1,2,3} B A
B={1,2,3,4}
Construct DFA
State B Position followpos
=(1,2,3) ----- A
b States a b
a
A={1,2,3} B A
a b b B={1,2,3,4} B C
A B C D
C={1,2,3,5} B D
a
a D={1,2,3,6} B A
b
DFA
Construct DFA
Position followpos
5 6
b
a 4 5
3 4
a b b 2 1,2,3
A B C D
1 1,2,3
a
a
b
States a b
DFA
A={1,2,3} B A
B={1,2,3,4} B C
Note: Elements of E contains state 10 that is C={1,2,3,5} B D
acceptance state in NFA. So, State E is acceptance state D={1,2,3,6} B A
Exercise
Construct DFA for following regular expression:
1. (c | d)*c#
Question already asked in GTU
1
2
3
4
5
6
7
8
9
10
11
End of Unit-2