
Prof. Radu Prodan

LEXICAL ANALYSIS

12.03.2024 R. Prodan, Compiler Construction, Summer Semester 2024 1


Phases of a Compiler
[Figure: compiler pipeline, divided into front-end and back-end. Phases in order: Lexical Analyser, Syntactic Analyser, Semantic Analyser, Source Code Optimiser, Code Generator, Target Code Optimiser. Data passed between phases: Source Code → Tokens → Syntax Tree → Annotated Tree → Intermediate Representation → Target Code → Target Code. Auxiliary components: Literal Table, Symbol Table, Error Handler.]



Agenda
▪ Introduction

▪ Regular expressions

▪ Deterministic Finite Automata (DFA)

▪ Automatic DFA generation

▪ Conclusions



Overview
▪ Lexical analysis or scanning
– Input: source program (ASCII file)
– Output: sequence of tokens

▪ Token
– Corresponds to a natural language word
– Keywords: if, while, for, int, float
– Identifiers: user-defined, variable length, beginning with a letter
– Special symbols: arithmetic or logical operators: +, *, /, <, >, =, <>

▪ Scanning is a special case of pattern matching
– Specified using regular expressions
– Implemented using finite automata

▪ Must be as efficient and fast as possible
– Other compiler phases have a much higher time overhead



Token
▪ Usually defined as an enumerated type
– typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID, … }
TokenType;

▪ Keywords (reserved words: fixed strings of characters)
– IF, THEN, ELSE represent the strings of characters “if”, “then”, “else”

▪ Special symbols
– PLUS, MINUS represent the strings of characters “+”, “–”

▪ Identifiers
– ID can represent many user-defined variables or identifiers (“a”, “b”, “c”)

▪ Numbers
– NUM can represent many numbers or values (1, 2, 3, ...)
Attribute
▪ Attribute
– A token has one or more associated attributes

▪ Lexeme or string value


– String matched by token

▪ NUM token type


– Lexeme: “32767” (string value)
– Attribute: 32767 (integer)

▪ PLUS token type


– Lexeme: “+” (string value)
– Attribute: + (arithmetic operator)

▪ ID token type
– Lexeme: “x”, “i”, “j”, “tmp”, “var” (string value)



Token Record
typedef struct
{ TokenType tokenval;
  char *stringval;
  int intval;
  float floatval;
} TokenRecord;

Alternative with a union:

typedef struct
{ TokenType tokenval;
  union
  { char *stringval;
    int intval;
    float floatval;
  } attribute;
} TokenRecord;
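As a minimal sketch of how such a record might be filled for the NUM example on the previous slide, the code below keeps the lexeme next to a union of attributes. The helper make_num, the opval member and the choice to store the lexeme outside the union are assumptions made for illustration; they are not part of the slides.

```c
#include <stdlib.h>

typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID } TokenType;

typedef struct {
    TokenType tokenval;      /* kind of token that was recognised */
    const char *stringval;   /* lexeme: the characters that were matched */
    union {                  /* attribute; which member is valid depends on tokenval */
        long intval;         /* used for NUM */
        char opval;          /* used for PLUS and MINUS (hypothetical member) */
    } attribute;
} TokenRecord;

/* Hypothetical helper: builds the record for a NUM lexeme such as "32767". */
TokenRecord make_num(const char *lexeme) {
    TokenRecord t;
    t.tokenval = NUM;
    t.stringval = lexeme;
    t.attribute.intval = strtol(lexeme, NULL, 10);  /* string value -> integer attribute */
    return t;
}
```

Calling make_num("32767") yields a record whose lexeme is the string “32767” and whose attribute is the integer 32767.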



Operation
▪ Parser
– Controls scanner operation
– Function call that returns next token on demand
– TokenType getToken(void);
– Compute additional token attributes

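▪ Example: a[index] = 4 + 2

▪ First call to getToken
– Returns token ID with string value “a”
– Input position advances to the next character, [

[Figure: source program → Scanner → token → Parser → semantic analysis; the parser requests each token through getToken; the Symbol Table is shown alongside]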



Goal
▪ Complete automation of lexical analysis phase

▪ Regular expressions
– Specify program tokens

▪ Automatic lexical analysis generation tool





Regular Expressions
▪ Alphabet Σ
– Set of legal symbols

▪ Language L(r)
– Generated by regular expression r

▪ Basic regular expression a
– a ∈ Σ
– L(a) = { a }
– Empty string: L(ε) = { ε }
– Empty set: L(∅) = { }

▪ OR operator: r|s
– L(r|s) = L(r) ∪ L(s)

▪ Concatenation: rs
– L(rs) = L(r)L(s)

▪ Repetition: r*
– L(r*) = L(r)*

▪ Sub-expressions: (r)
– L((r)) = L(r)

▪ Precedence order: *, concatenation, |
Examples of Regular Expressions
▪ Σ = { a, b, c }

▪ Strings that contain exactly one b
– (a|c)*b(a|c)*

▪ Set of strings that contain at most one b
– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*

▪ Set of strings that contain no consecutive b’s
– notb = a | c
– (notb | b notb)*(b|ε)

▪ Strings containing an even number of b’s
– (notb* b notb* b)* notb*
Extensions to Regular Expressions
▪ + describes one or more repetitions

▪ . describes any character

▪ Character classes
– [a-z] instead of a|b|...|z
– [0-9] instead of 0|1|...|9
– Multiple ranges: [a-zA-Z]

▪ ~ describes any character not in a given set

▪ ? describes optional sub-expressions

▪ Binary numbers
– (0|1)(0|1)*
– (0|1)+

▪ All strings containing at least one b
– .*b.*

▪ Any character but a, b, and c
– ~(a|b|c)
– [^abc] in Lex

▪ Number with optional leading sign
– nat = [0-9]+
– signedNat = nat | + nat | - nat
– signedNat = (+|-)? nat
Programming Language Tokens
▪ Numbers: 2.71E-2 = 0.0271
– nat = [0-9]+
– signedNat = (+|-)? nat
– number = signedNat(“.” nat)?(E signedNat)?

▪ Keywords
– keyword = if | while | do | ...

▪ Identifiers
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*



Comments
▪ { this is a Pascal comment }
– {(~})*}

▪ -- this is an Ada comment


– --(~newline)*

▪ /* this is a C comment */
– ba(~(ab))*ab, writing b for / and a for *
• Not valid because “not” operator is restricted to single characters
– (~(ab))* could be written as: b*(a*~(a|b)b*)*a*

▪ In some programming languages, comments can be nested
– (* this is (* a Modula-2 *) comment *)
– (* this is (* illegal in a Modula-2 *)
• Impossible to specify with regular expressions, since a scanner based on them cannot count comment separators
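
Nested comments are therefore usually handled by hand-written code that keeps a counter. The following is a sketch of that idea for Modula-2 style comments; the function name skip_nested_comment and its return convention are assumptions made for this example.

```c
#include <stddef.h>

/* Returns the length of the (possibly nested) comment at the start of s,
   or 0 if s does not begin with a complete comment. */
size_t skip_nested_comment(const char *s) {
    if (s[0] != '(' || s[1] != '*')
        return 0;
    size_t i = 2;
    int depth = 1;                       /* number of comments currently open */
    while (s[i] != '\0') {
        if (s[i] == '(' && s[i + 1] == '*') {
            depth++;                     /* nested comment opens */
            i += 2;
        } else if (s[i] == '*' && s[i + 1] == ')') {
            depth--;                     /* innermost comment closes */
            i += 2;
            if (depth == 0)
                return i;                /* outermost comment is complete */
        } else {
            i++;
        }
    }
    return 0;                            /* input ended inside a comment */
}
```

The counter depth is exactly the piece of state that a finite automaton cannot provide for arbitrary nesting levels.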



Ambiguity
▪ Strings may be matched by several regular expressions
– Are if, while, for keywords or identifiers?
– Are strings <>, ==, <=, >= one token or two?

▪ Regular expressions cannot solve ambiguities

▪ Disambiguating rules
– Keywords (reserved words) first
– Principle of longest substring

▪ Token delimiters or separators


– Indicate that a longer substring cannot represent a token

▪ Lookahead problem
– Token delimiters must not be consumed but returned to input stream
– Often one single lookahead character is enough, but sometimes not
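
Both disambiguating rules can be sketched in a few lines: scan the longest run of letters and digits first, then check the result against a keyword table before falling back to an identifier. The names Tok, KEYWORDS and scan_word are illustrative assumptions, and the keyword list is deliberately short.

```c
#include <ctype.h>
#include <string.h>

typedef enum { TOK_IF, TOK_WHILE, TOK_FOR, TOK_ID, TOK_ERROR } Tok;

static const struct { const char *text; Tok tok; } KEYWORDS[] = {
    { "if", TOK_IF }, { "while", TOK_WHILE }, { "for", TOK_FOR },
};

/* Scans one word at *p, moves *p just past it and returns its token type. */
Tok scan_word(const char **p) {
    const char *start = *p;
    if (!isalpha((unsigned char)*start))
        return TOK_ERROR;
    const char *end = start;
    while (isalnum((unsigned char)*end))     /* principle of longest substring */
        end++;
    *p = end;                                /* the delimiter is not consumed */
    size_t len = (size_t)(end - start);
    for (size_t i = 0; i < sizeof KEYWORDS / sizeof KEYWORDS[0]; i++) {
        if (strlen(KEYWORDS[i].text) == len &&
            strncmp(start, KEYWORDS[i].text, len) == 0)
            return KEYWORDS[i].tok;          /* keywords first */
    }
    return TOK_ID;
}
```

With this sketch, “if x” yields the keyword if, whereas “ifx=1” yields an identifier because the longest substring is ifx; in both cases the delimiter stays in the input.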



Token Delimiters
▪ Whitespaces
– Discarded token delimiters
– Free format languages
– whitespace = (newline | blank | tab | comment)+

▪ while x …
– Keyword while, identifier x
– Space as token delimiter

▪ xtemp=ytemp
– Identifiers xtemp and ytemp
– = as token delimiter

▪ Comments usually serve as delimiters


– do/**/if
– Two reserved words (do and if) rather than identifier doif
– /**/ as token delimiter



FORTRAN Example
▪ Fixed format language
– Whitespaces are not delimiters, but removed by a preprocessor before scanning

▪ I F ( X 2 .EQ. 0 ) THE N
– Equivalent to IF(X2.EQ.0) THEN

▪ No reserved words in FORTRAN (all keywords can also be identifiers)


– Character position in each line of input is important

▪ IF(IF.EQ.0)THENTHEN=1.0
– First IF and THEN are keywords
– Second IF and THEN are variables (identifiers)

▪ Backtrack to arbitrary positions in a code line
– DO99I=1.10 → Assign value 1.1 to variable DO99I
– DO99I=1,10 → Loop, like for (i = 1; i <= 10; i++) in C





Finite Automaton
▪ Mathematical method of recognising patterns in input strings

▪ identifier = letter(letter|digit)*
– Transition graph with two states
– Start state: 1
– Accepting state: 2
– Transitions: arrowed lines

[Figure: 1 → 2 on letter; 2 → 2 on letter; 2 → 2 on digit]

▪ Process of recognising xtemp as identifier

     x   t   e   m   p
   1 → 2 → 2 → 2 → 2 → 2



Deterministic Finite Automaton (DFA)
▪ Deterministic Finite Automata (DFA)
– Next state is uniquely given by current state and current input character
– A symbol can label one single transition out of a state

▪ A DFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × Σ → S
– A start state s0 ∈ S
– A set of accepting states A ⊆ S

▪ L(M): Language accepted by M

▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ, ∀ i ∈ [1..n] ∧ ∃ s1=T(s0,c1), s2=T(s1,c2), …, sn=T(sn-1,cn) ∧ sn ∈ A

[Figure: s0 → s1 on c1, s1 → s2 on c2, ..., sn-1 → sn on cn]
Error Transitions
▪ letter = [a-zA-Z]

▪ T(start, c) defined only if c is a letter

▪ T(in_id, c) defined only if c is letter or digit

▪ Error transitions
– Not drawn but assumed to exist

[Figure: start → in_id on letter; in_id → in_id on letter or digit; start → error on other1; in_id → error on other2; error → error on any]

other1 = ~letter
other2 = ~(letter|digit)
DFA Examples
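▪ Set of strings that contain exactly one b
– (a|c)*b(a|c)*

[Figure: two states, each looping on notb, joined by a transition on b from the start state to the accepting state]

▪ Set of strings that contain at most one b
– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*

[Figure: two states, each looping on notb, joined by a transition on b]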



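Numeric Constants
digit = [0-9]
nat = digit+
[Figure: DFA for nat: one transition on digit into an accepting state that loops on digit]

signedNat = (+|-)? nat
[Figure: DFA for signedNat: optional + or –, then at least one digit]

number = signedNat(“.” nat)?(E signedNat)?
[Figure: DFA for number: optional sign and digits, optionally followed by “.” and digits, optionally followed by E, an optional sign and digits]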
Unnested Comments

▪ Pascal comments { }
– {(~})*}
– other = ~}
[Figure: DFA entering on {, looping on other, leaving on }]

▪ C comments
– 1 – start
– 2 – entering comment
– 3 – inside comment
– 4 – exiting comment
– 5 – finish
[Figure: 1 → 2 on /; 2 → 3 on *; 3 → 4 on *; 4 → 5 on /; 3 → 3 on other; 4 → 4 on *; 4 → 3 on other]

▪ Easier to write DFA than regular expression



DFA Implementation with State Variables
[Figure: 1 → 2 on letter; 2 → 2 on letter or digit; 2 → 3 on [other]]

state := 1; { start }
while state = 1 or 2 do
  case state of
  1: case input character of
       letter: advance input;
               state := 2;
       else state := . . .; { error or other }
     end case;
  2: case input character of
       letter, digit: advance input;
                      state := 2; { unnecessary }
       else state := 3;
     end case;
  end case;
end while;

if state = 3 then accept else error;
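
The pseudocode above translates almost line by line into C. In this sketch the function name scan_identifier, the out-parameter len and the use of state 4 for the error case (left open as “. . .” on the slide) are assumptions.

```c
#include <ctype.h>
#include <stdbool.h>

/* Recognises letter(letter|digit)* at the start of `input`.
   Stores the number of consumed characters in *len. */
bool scan_identifier(const char *input, int *len) {
    int state = 1;                          /* start */
    int pos = 0;
    while (state == 1 || state == 2) {
        unsigned char ch = (unsigned char)input[pos];
        switch (state) {
        case 1:
            if (isalpha(ch)) { pos++; state = 2; }   /* advance input */
            else state = 4;                 /* error (assumed state number) */
            break;
        case 2:
            if (isalnum(ch)) pos++;         /* letter or digit: stay in state 2 */
            else state = 3;                 /* [other]: accept without consuming */
            break;
        }
    }
    *len = pos;
    return state == 3;
}
```

On the input “xtemp=1” the function accepts with len equal to 5, leaving = unconsumed for the next token.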



C Comments with State Variables
state := 1; { start }

while state = 1, 2, 3 or 4 do
  case state of
  1: case input character of
       “/”: state := 2;
       else state := 6; { error or other }
     end case;
  2: case input character of
       “*”: state := 3;
       else state := 6; { error or other }
     end case;
  3: case input character of
       “*”: state := 4;
       else { stay in state 3 }
     end case;
  4: case input character of
       “/”: state := 5;
       “*”: { stay in state 4 }
       else state := 3;
     end case;
  end case;
  advance input;
end while;

if state = 5 then accept else error;

[Figure: 1 → 2 on /; 2 → 3 on *; 3 → 4 on *; 4 → 5 on /; 3 → 3 on other; 4 → 4 on *; 4 → 3 on other]
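
A possible C rendering of this comment scanner is sketched below. The function name scan_c_comment, the return convention (length of the comment, or 0) and the treatment of end of input as an error are assumptions added here; the state numbers follow the slide.

```c
#include <stddef.h>

/* Returns the length of the C comment at the start of `input`,
   or 0 if no complete comment starts there. */
size_t scan_c_comment(const char *input) {
    int state = 1;                               /* start */
    size_t pos = 0;
    while (state >= 1 && state <= 4) {
        char ch = input[pos];
        if (ch == '\0') { state = 6; break; }    /* input ended early: error */
        switch (state) {
        case 1: state = (ch == '/') ? 2 : 6; break;
        case 2: state = (ch == '*') ? 3 : 6; break;
        case 3: if (ch == '*') state = 4; break; /* otherwise stay in state 3 */
        case 4:
            if (ch == '/') state = 5;            /* finish */
            else if (ch != '*') state = 3;       /* back inside the comment */
            break;                               /* on '*' stay in state 4 */
        }
        pos++;                                   /* advance input */
    }
    return state == 5 ? pos : 0;
}
```

For example, scan_c_comment("/* hi */ x") consumes the eight characters of the comment and stops before the blank.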
Transition Table
▪ Transition table
– Two-dimensional array indexed by state and input character
– Expresses values of transition function T
– Extra column indicates accepting states
– Square brackets indicate “noninput-consuming” transitions

[Figure: 1 → 2 on letter; 2 → 2 on letter or digit; 2 → 3 on [other]]

State | letter | digit | other | Accepting
1     | 2      |       |       | No
2     | 2      | 2     | [3]   | No
3     |        |       |       | Yes



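C Comments
DFA Transition Table
[Figure: 1 → 2 on /; 2 → 3 on *; 3 → 4 on *; 4 → 5 on /; 3 → 3 on other; 4 → 4 on *; 4 → 3 on other]

State | / | * | other | Accepting
1     | 2 |   |       | No
2     |   | 3 |       | No
3     | 3 | 4 | 3     | No
4     | 5 | 4 | 3     | No
5     |   |   |       | Yes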



Table Driven DFA Implementation
state := 1;
ch := next input character;
while ∃ T[state, ch] ∧ ¬ error(state) do
  newstate := T[state, ch];
  if Advance[state, ch] then
    ch := next input char;
  state := newstate;
end while;
if Accept[state] then
  accept;

▪ Advantage
– Small code size
– Easy to maintain and change

▪ Disadvantage
– Tables can get very large
– Table compression methods slow and rarely used
– Sparse array representations
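
A table-driven scanner for the identifier DFA could look as follows in C. This is a sketch under several assumptions: input characters are first mapped to the three classes letter, digit and other, state 0 is introduced as an explicit error state for the empty table cells, and the loop simply stops once an accepting or error state is reached. The names T, Advance and Accept mirror the pseudocode; char_class and table_scan_identifier are invented here.

```c
#include <ctype.h>
#include <stdbool.h>

enum { LETTER, DIGIT, OTHER, NCLASSES };
enum { ERR = 0, NSTATES = 4 };      /* states 1..3 as in the table; 0 = error */

static const int T[NSTATES][NCLASSES] = {
    /* 0 */ { ERR, ERR, ERR },
    /* 1 */ { 2,   ERR, ERR },
    /* 2 */ { 2,   2,   3   },
    /* 3 */ { ERR, ERR, ERR },
};
/* false marks the bracketed, noninput-consuming transition [3] */
static const bool Advance[NSTATES][NCLASSES] = {
    { false, false, false },
    { true,  false, false },
    { true,  true,  false },
    { false, false, false },
};
static const bool Accept[NSTATES] = { false, false, false, true };

static int char_class(char c) {
    if (isalpha((unsigned char)c)) return LETTER;
    if (isdigit((unsigned char)c)) return DIGIT;
    return OTHER;
}

/* Runs the DFA on `input`; stores the number of consumed characters in *len. */
bool table_scan_identifier(const char *input, int *len) {
    int state = 1, pos = 0;
    while (state != ERR && !Accept[state]) {
        int cls = char_class(input[pos]);
        int newstate = T[state][cls];
        if (Advance[state][cls]) pos++;
        state = newstate;
    }
    *len = pos;
    return state != ERR && Accept[state];
}
```

Changing the recognised language only requires editing the three tables, which illustrates the maintainability advantage listed above.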





Single DFA Generation
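▪ Automatically construct DFAs for all regular expressions

[Figure: three separate DFAs: < followed by = returns LE; < followed by > returns NE; < alone returns LT]

▪ Combine all DFAs in one single DFA

[Figure: combined DFA: after <, the character = returns LE, the character > returns NE, and [other] returns LT without consuming it]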
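
The combined DFA can be hand-coded with one character of lookahead, as in the sketch below. The enum RelOp, the extra value NONE and the function scan_relop are illustrative names, not taken from the slides.

```c
typedef enum { LT, LE, NE, NONE } RelOp;

/* Recognises <, <= or <> at *p and moves *p past the recognised operator. */
RelOp scan_relop(const char **p) {
    const char *s = *p;
    if (s[0] != '<')
        return NONE;                         /* not one of the three operators */
    if (s[1] == '=') { *p = s + 2; return LE; }
    if (s[1] == '>') { *p = s + 2; return NE; }
    *p = s + 1;                              /* [other]: lookahead is not consumed */
    return LT;
}
```

Because all three operators share the prefix <, a single pass over the input is enough; the separate DFAs would each have to restart from the same position.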
Automatic DFA Generation
▪ Nondeterministic Finite Automaton (NFA)
– Constructed for each regular expression
– Thompson’s construction

▪ ε-transitions
– Connect NFAs of all tokens
– “Spontaneous” transition without consuming any input characters
– “Match” of empty string

[Figure: NFAs for :=, <= and = connected to a common start state through ε-transitions]

▪ Convert NFA into DFA using a fast and direct algorithm
– Subset construction



Automatic Scanner Generation

▪ Regular Expressions
– Thompson’s Construction

▪ Nondeterministic Finite Automaton
– Subset Construction

▪ Deterministic Finite Automaton
– Table Driven DFA Implementation



NFA Definition
▪ NFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × (Σ ∪ { ε }) → ℘(S)
– A start state s0 ∈ S
– A set of accepting states A ⊆ S

▪ L(M): language accepted by M

▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ ∪ { ε }, ∀ i ∈ [1..n] ∧ ∃ s1 ∈ T(s0,c1), s2 ∈ T(s1,c2), …, sn ∈ T(sn-1,cn) ∧ sn ∈ A

▪ Observations
– Any ci may be ε
– c1c2…cn may have fewer than n characters (ε removed)
– Sequence of states s1s2…sn, chosen from sets of states T(s0,c1), T(s1,c2), …, T(sn-1,cn), is not always uniquely determined
– Arbitrary number of ε in input stream corresponding to any number of NFA ε-transitions



NFA Example
[Figure: NFA with states 1 to 4, start state 1, accepting state 4: 1 → 2 on a; 1 → 3 on a; 1 → 4 on ε; 2 → 4 on b; 3 → 4 on ε; 4 → 2 on ε]

▪ String abb can be accepted by any of following transition sequences

     a   b   ε   b
   1 → 2 → 4 → 2 → 4

     a   ε   ε   b   ε   b
   1 → 3 → 4 → 2 → 4 → 2 → 4

▪ NFA matches multiple regular expressions
– ab: T(1,a)=2, T(2,b)=4
– ab+: T(1,a)=2, T(2,b)=4, T(4,ε)=2, T(2,b)=4, …
– ab*: T(1,a)=3, T(3,ε)=4, T(4,ε)=2, T(2,b)=4, …
– b*: T(1,ε)=4, T(4,ε)=2, T(2,b)=4, …

▪ NFA accepts same language as regular expression: ab*|b*
– Equivalent to (a|ε)b*

[Figure: a simpler automaton for (a|ε)b*]
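
One way to check such an NFA by machine is to track the set of all states it could be in after each input character. The sketch below encodes the four-state example as bit masks; the encoding, the type StateSet and the function names are assumptions of this example.

```c
#include <stdbool.h>

enum { NSTATES = 5 };            /* NFA states 1..4; index 0 is unused */
typedef unsigned StateSet;       /* bit i is set when state i is in the set */

/* Transitions of the example NFA: EPS[s], ON_A[s], ON_B[s] are target sets. */
static const StateSet EPS[NSTATES]  = { 0, 1u << 4, 0, 1u << 4, 1u << 2 };
static const StateSet ON_A[NSTATES] = { 0, (1u << 2) | (1u << 3), 0, 0, 0 };
static const StateSet ON_B[NSTATES] = { 0, 0, 1u << 4, 0, 0 };

/* Adds every state reachable through ε-transitions until nothing changes. */
static StateSet closure(StateSet set) {
    StateSet prev;
    do {
        prev = set;
        for (int s = 1; s < NSTATES; s++)
            if (set & (1u << s)) set |= EPS[s];
    } while (set != prev);
    return set;
}

/* Returns true if the example NFA accepts `input` (a string over a and b). */
bool nfa_accepts(const char *input) {
    StateSet current = closure(1u << 1);             /* start state 1 */
    for (; *input != '\0'; input++) {
        StateSet next = 0;
        for (int s = 1; s < NSTATES; s++) {
            if (!(current & (1u << s))) continue;
            if (*input == 'a') next |= ON_A[s];
            else if (*input == 'b') next |= ON_B[s];
        }
        current = closure(next);
    }
    return (current & (1u << 4)) != 0;               /* accepting state 4 */
}
```

The function accepts abb, a and the empty string, and rejects ba and aa, in line with the language ab*|b*.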



Thompson’s Construction
▪ Construct an NFA from a regular expression

▪ Basic regular expression
– a [Figure: start state → accepting state on a]
– ε [Figure: start state → accepting state on ε]

▪ To do
– Concatenation
– Or
– Repetition



Concatenation
▪ Input
– Two regular expressions r and s
– Two NFAs (of r and of s)

▪ Goal
– NFA for regular expression rs

▪ Connect accepting state of r with start state of s through one ε-transition

▪ L(rs) = L(r) L(s)

[Figure: NFA of r, then an ε-transition from its accepting state to the start state of the NFA of s]


Or
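▪ Input
– Two regular expressions r and s
– Two NFAs (of r and of s)

▪ Goal
– NFA for regular expression r|s

▪ Two new states: start and accepting

▪ Connected with ε-transitions

▪ L(r|s) = L(r) ∪ L(s)

[Figure: new start state with ε-transitions to the start states of the NFAs of r and of s; ε-transitions from their accepting states to the new accepting state]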
Repetition
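▪ Input
– One regular expression r
– One NFA (of r)

▪ Goal
– NFA for regular expression r*

▪ Two new states: start and accepting, connected through ε-transitions

▪ Repetition through ε-transition from accepting to start state of r

▪ Empty string is accepted by ε-transition from start to accepting state

[Figure: new start state → NFA of r → new accepting state through ε-transitions, with an ε-transition back from the accepting state of r to its start state and an ε-transition from the new start state to the new accepting state]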
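
The three constructions (concatenation, or, repetition) can be sketched in C as functions that combine NFA fragments. Everything named here (Frag, atom, concat, alt, star, nfa_match, the fixed-size state array) is an assumption made for illustration, and the sketch does no overflow checking on the number of states.

```c
#include <stdbool.h>

enum { MAX_STATES = 32, EPS = 0 };      /* label 0 marks an ε-transition */

/* In this construction every state has at most two outgoing edges. */
static struct { int label[2]; int to[2]; int nedges; } nfa[MAX_STATES];
static int nstates = 0;

typedef struct { int start, accept; } Frag;   /* an NFA fragment */

static int new_state(void) { nfa[nstates].nedges = 0; return nstates++; }

static void add_edge(int from, int label, int to) {
    int k = nfa[from].nedges++;
    nfa[from].label[k] = label;
    nfa[from].to[k] = to;
}

Frag atom(char c) {                     /* basic regular expression c */
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, c, f.accept);
    return f;
}

Frag concat(Frag r, Frag s) {           /* rs: one ε-transition from r to s */
    add_edge(r.accept, EPS, s.start);
    return (Frag){ r.start, s.accept };
}

Frag alt(Frag r, Frag s) {              /* r|s: two new states */
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, EPS, r.start);
    add_edge(f.start, EPS, s.start);
    add_edge(r.accept, EPS, f.accept);
    add_edge(s.accept, EPS, f.accept);
    return f;
}

Frag star(Frag r) {                     /* r*: two new states */
    Frag f;
    f.start = new_state();
    f.accept = new_state();
    add_edge(f.start, EPS, r.start);
    add_edge(f.start, EPS, f.accept);   /* accepts the empty string */
    add_edge(r.accept, EPS, r.start);   /* repetition */
    add_edge(r.accept, EPS, f.accept);
    return f;
}

/* Adds state s and everything reachable from it through ε-edges. */
static void add_closure(bool *set, int s) {
    if (set[s]) return;
    set[s] = true;
    for (int k = 0; k < nfa[s].nedges; k++)
        if (nfa[s].label[k] == EPS) add_closure(set, nfa[s].to[k]);
}

/* Simulates fragment f on `input` by tracking the set of reachable states. */
bool nfa_match(Frag f, const char *input) {
    bool cur[MAX_STATES] = { false };
    add_closure(cur, f.start);
    for (; *input != '\0'; input++) {
        bool next[MAX_STATES] = { false };
        for (int s = 0; s < nstates; s++) {
            if (!cur[s]) continue;
            for (int k = 0; k < nfa[s].nedges; k++)
                if (nfa[s].label[k] == *input) add_closure(next, nfa[s].to[k]);
        }
        for (int s = 0; s < nstates; s++) cur[s] = next[s];
    }
    return cur[f.accept];
}
```

Building alt(concat(atom('a'), atom('b')), atom('a')) creates eight states for ab | a, and star(atom('a')) adds four more for a*.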


Example: ab | a
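▪ a
[Figure: two states joined by a transition on a]

▪ b
[Figure: two states joined by a transition on b]

▪ ab
[Figure: four states in sequence, joined by transitions on a, ε and b]

▪ ab | a
[Figure: new start state with ε-transitions to the NFA of ab and to the NFA of a; ε-transitions from both accepting states to a new accepting state]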



letter (letter | digit)*
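▪ letter
[Figure: two states joined by a transition on letter]

▪ digit
[Figure: two states joined by a transition on digit]

▪ letter | digit
[Figure: new start state with ε-transitions to the NFAs of letter and of digit; ε-transitions from both to a new accepting state]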



letter (letter | digit)*
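▪ (letter | digit)*
[Figure: NFA of letter | digit wrapped in a new start and a new accepting state through ε-transitions, with an ε-transition back for repetition and an ε-transition from the new start state to the new accepting state]

▪ letter (letter | digit)*
[Figure: NFA of letter, followed through an ε-transition by the NFA of (letter | digit)*]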

Subset Construction
▪ Convert an NFA into a DFA

▪ We need some method for eliminating
– ε-transitions
– Multiple transitions from a state on same input character

▪ ε-closure of a state s is
– Set of states reachable from s by a series of zero or more ε-transitions
– Denoted as s̄ (s with an overline)

▪ DFA has as states sets of states of original NFA


Example: a*
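[Figure: NFA for a* with states 1 to 4: 1 → 2 on ε; 2 → 3 on a; 3 → 4 on ε; 1 → 4 on ε; 3 → 2 on ε]

▪ DFA start state
– ε-closure of start state

▪ For each DFA state S, compute state ε-closure(Sa), ∀ a ∈ Σ
– Sa = { t | ∃ s ∈ S ∧ T(s,a) = t }

▪ Add new transition T(S,a) = ε-closure(Sa)

▪ DFA accepting states
– All states that contain accepting NFA states

▪ Computation
– ε-closure({ 1 }) = { 1, 2, 4 }
– { 1, 2, 4 }a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }
– T({ 1, 2, 4 }, a) = { 2, 3, 4 }
– { 2, 3, 4 }a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }
– T({ 2, 3, 4 }, a) = { 2, 3, 4 }

[Figure: DFA: {1,2,4} → {2,3,4} on a; {2,3,4} → {2,3,4} on a]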
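
The same computation can be sketched in C for the a* NFA, using bit masks as state sets. The encoding and all names (StateSet, closure, move_a, find_or_add, subset_construction) are assumptions of this example, and the sketch handles only the single input symbol a.

```c
enum { NFA_STATES = 5, MAX_DFA = 16 };  /* NFA states 1..4; index 0 unused */
typedef unsigned StateSet;              /* bit i is set when NFA state i is in the set */

/* NFA for a*: 1 → 2 and 1 → 4 on ε, 2 → 3 on a, 3 → 2 and 3 → 4 on ε */
static const StateSet EPS[NFA_STATES] =
    { 0, (1u << 2) | (1u << 4), 0, (1u << 2) | (1u << 4), 0 };
static const StateSet ON_A[NFA_STATES] = { 0, 0, 1u << 3, 0, 0 };

StateSet dfa_states[MAX_DFA];   /* each DFA state is a set of NFA states */
int dfa_on_a[MAX_DFA];          /* index of the target DFA state on a, or -1 */
int dfa_count = 0;

/* ε-closure of a set: add ε-successors until nothing changes. */
StateSet closure(StateSet set) {
    StateSet prev;
    do {
        prev = set;
        for (int s = 1; s < NFA_STATES; s++)
            if (set & (1u << s)) set |= EPS[s];
    } while (set != prev);
    return set;
}

/* Sa: all NFA states reachable from `set` by one transition on a. */
StateSet move_a(StateSet set) {
    StateSet out = 0;
    for (int s = 1; s < NFA_STATES; s++)
        if (set & (1u << s)) out |= ON_A[s];
    return out;
}

/* Returns the index of `set` among the DFA states, adding it if it is new. */
static int find_or_add(StateSet set) {
    for (int i = 0; i < dfa_count; i++)
        if (dfa_states[i] == set) return i;
    dfa_states[dfa_count] = set;
    return dfa_count++;
}

void subset_construction(void) {
    dfa_count = 0;
    find_or_add(closure(1u << 1));              /* DFA start state */
    for (int i = 0; i < dfa_count; i++) {       /* dfa_count grows as sets are found */
        StateSet target = closure(move_a(dfa_states[i]));
        dfa_on_a[i] = target ? find_or_add(target) : -1;
    }
}
```

After subset_construction() there are two DFA states, the bit masks for {1,2,4} and {2,3,4}, and both have their transition on a leading to the second one.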

ab|a
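[Figure: NFA for ab|a with states 1 to 8: 1 → 2 on ε; 2 → 3 on a; 3 → 4 on ε; 4 → 5 on b; 5 → 8 on ε; 1 → 6 on ε; 6 → 7 on a; 7 → 8 on ε]

ε-closure({ 1 }) = { 1, 2, 6 }
{ 1, 2, 6 }a = { 3, 7 }, ε-closure({ 3, 7 }) = { 3, 4, 7, 8 }
T({ 1, 2, 6 }, a) = { 3, 4, 7, 8 }
{ 3, 4, 7, 8 }b = { 5 }, ε-closure({ 5 }) = { 5, 8 }
T({ 3, 4, 7, 8 }, b) = { 5, 8 }

[Figure: DFA: {1, 2, 6} → {3, 4, 7, 8} on a; {3, 4, 7, 8} → {5, 8} on b]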



letter(letter|digit)*
[Figure: NFA with states 1 to 10: 1 → 2 on letter; 2 → 3 on ε; 3 → 4 on ε; 3 → 10 on ε; 4 → 5 on ε; 4 → 7 on ε; 5 → 6 on letter; 7 → 8 on digit; 6 → 9 on ε; 8 → 9 on ε; 9 → 4 on ε; 9 → 10 on ε]

ε-closure({ 1 }) = { 1 }
{ 1 }letter = { 2 }, ε-closure({ 2 }) = { 2, 3, 4, 5, 7, 10 }
T({ 1 }, letter) = { 2, 3, 4, 5, 7, 10 }
{ 2, 3, 4, 5, 7, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
T({ 2, 3, 4, 5, 7, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
{ 2, 3, 4, 5, 7, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
T({ 2, 3, 4, 5, 7, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
{ 4, 5, 6, 7, 9, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
T({ 4, 5, 6, 7, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
{ 4, 5, 6, 7, 9, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
T({ 4, 5, 6, 7, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
{ 4, 5, 7, 8, 9, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
T({ 4, 5, 7, 8, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
{ 4, 5, 7, 8, 9, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
T({ 4, 5, 7, 8, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }

[Figure: DFA: {1} → {2,3,4,5,7,10} on letter; {2,3,4,5,7,10} → {4,5,6,7,9,10} on letter and → {4,5,7,8,9,10} on digit; {4,5,6,7,9,10} loops on letter and goes to {4,5,7,8,9,10} on digit; {4,5,7,8,9,10} loops on digit and goes to {4,5,6,7,9,10} on letter]



Conclusions
▪ Lexical analysis or scanning

▪ Operates under parser control

▪ Tokens specified through regular expressions

▪ Regular expressions implemented as finite automata


– Automatic NFA generation through Thompson construction
– NFA conversion into DFA through Subset construction

▪ Automatic lexical analyser generator (e.g., Lex)

