2 - Scanner

Prof. Radu Prodan
LEXICAL ANALYSIS
12.03.2024, R. Prodan, Compiler Construction, Summer Semester 2024

Phases of a Compiler
▪ [Figure: compiler phases, grouped into front-end and back-end]
– Source Code → Lexical Analyser → Tokens → Syntactic Analyser → Syntax Tree → Semantic Analyser → Annotated Tree → Source Code Optimiser → Intermediate Representation → Code Generator → Target Code → Target Code Optimiser → Target Code
– Literal Table, Symbol Table and Error Handler are shared by the phases

Agenda
▪ Introduction
▪ Regular expressions
▪ Deterministic Finite Automata (DFA)
▪ Automatic DFA generation
▪ Conclusions

Overview
▪ Lexical analysis or scanning
– Input: source program (ASCII file)
– Output: sequence of tokens
▪ Token
– Corresponds to a natural language word
– Keywords: if, while, for, int, float
– Identifiers: user-defined, variable length, beginning with a letter
– Special symbols: arithmetic or logical operations: +, *, /, <, >, =, <>
▪ Scanning is a special case of pattern matching

– Specified using regular expressions
– Implemented using finite automata
▪ Must be as efficient and fast as possible

– Other compiler phases have a much higher time overhead

Token
▪ Usually defined as an enumerated type
– typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID, … } TokenType;
▪ Keywords (fixed strings of characters, also called reserved words)
– IF, THEN, ELSE represent the strings “if”, “then”, “else”
▪ Special symbols
– PLUS, MINUS represent the strings “+”, “-”
▪ Identifiers
– ID can represent many user-defined variables or identifiers (“a”, “b”, “c”)
▪ Numbers
– NUM can represent many numbers or values (1, 2, 3, ...)
Attribute
▪ Attribute
– A token has one or more associated attributes
▪ Lexeme or string value

– String matched by token
▪ NUM token type

– Lexeme: “32767” (string value)
– Attribute: 32767 (integer)
▪ PLUS token type

– Lexeme: “+” (string value)
– Attribute: + (arithmetic operator)
▪ ID token type
– Lexeme: “x”, “i”, “j”, “tmp”, “var” (string value)

Token Record
▪ Variant with separate attribute fields
typedef struct
{ TokenType tokenval;
  char *stringval;
  int intval;
  float floatval;
} TokenRecord;
▪ Variant with the attributes sharing storage in a union
typedef struct
{ TokenType tokenval;
  union
  { char *stringval;
    int intval;
    float floatval;
  } attribute;
} TokenRecord;
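The token record can also be sketched in Python. This is only an illustration of the idea, not the course's implementation: the field names `lexeme` and `attribute` are assumptions, and a single optionally typed field plays the role of the C union.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Union


class TokenType(Enum):
    IF = auto()
    THEN = auto()
    ELSE = auto()
    PLUS = auto()
    MINUS = auto()
    NUM = auto()
    ID = auto()


@dataclass(frozen=True)
class TokenRecord:
    tokenval: TokenType
    lexeme: str                                      # string matched by the token
    attribute: Union[str, int, float, None] = None   # stands in for the C union


# NUM keeps both the lexeme "32767" and the computed integer attribute.
num = TokenRecord(TokenType.NUM, "32767", int("32767"))
plus = TokenRecord(TokenType.PLUS, "+")
ident = TokenRecord(TokenType.ID, "tmp", "tmp")
```

As with the union variant, only one attribute value is stored per token, which keeps the record small.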

Operation
▪ Parser
– Controls scanner operation
– Function call that returns next token on demand
– TokenType getToken(void);
– Compute additional token attributes
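▪ Example: a[index] = 4 + 2
– Input buffer before the call: a [ i n d e x ] = 4 + 2
– getToken is called
– Input buffer after the call: a [ i n d e x ] = 4 + 2 (the a has been consumed)
– Returned token: ID
– String value: “a”
▪ [Figure: source program → Scanner → Parser → semantic analysis; the Parser calls getToken and the Scanner returns a token; Symbol Table]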
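A minimal sketch of a scanner that hands out tokens on demand for the example a[index] = 4 + 2. The class name `Scanner`, the method `get_token` and the token names other than NUM and ID are assumptions made for this illustration; Python's `re` module does the matching here, not a hand-built automaton.

```python
import re

# Hypothetical token specification for this one example.
TOKEN_SPEC = [
    ("NUM", r"[0-9]+"),
    ("ID", r"[a-zA-Z][a-zA-Z0-9]*"),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("ASSIGN", r"="),
    ("PLUS", r"\+"),
    ("SKIP", r"[ \t\n]+"),      # whitespace: discarded delimiter
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0            # position of the next unread character

    def get_token(self):
        """Return (token type, lexeme) for the next token, or ("EOF", "")."""
        while self.pos < len(self.text):
            m = MASTER.match(self.text, self.pos)
            if m is None:
                raise SyntaxError(f"illegal character {self.text[self.pos]!r}")
            self.pos = m.end()
            if m.lastgroup != "SKIP":
                return m.lastgroup, m.group()
        return "EOF", ""
```

The parser would call `get_token()` repeatedly; the first call on the example input yields the ID token with string value "a".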

Goal
▪ Complete automation of lexical analysis phase
– Specify program tokens
▪ Automatic lexical analysis generation tool

Agenda
▪ Introduction
▪ Regular expressions
▪ Deterministic Finite Automata (DFA)
▪ Automatic DFA generation
▪ Conclusions

Regular Expressions
▪ Alphabet Σ
– Set of legal symbols
▪ Language L(r)
– Generated by regular expression r
▪ Basic regular expression a
– a ∈ Σ
– L(a) = { a }
– Empty string: L(ε) = { ε }
– Empty set: L(∅) = { }
▪ OR operator: r|s
– L(r|s) = L(r) ∪ L(s)
▪ Concatenation: rs
– L(rs) = L(r)L(s)
▪ Repetition: r*
– L(r*) = L(r)*
▪ Sub-expressions: (r)
– L((r)) = L(r)
▪ Precedence order (highest first): *, concatenation, |

Examples of Regular Expressions
▪ Σ = { a, b, c }
▪ Set of strings that contain exactly one b
– (a|c)*b(a|c)*
▪ Set of strings that contain at most one b
– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*
▪ Set of strings that contain no consecutive b’s
– notb = a | c
– (notb | b notb)*(b|ε)
▪ Set of strings that contain an even number of b’s
– (notb* b notb* b)* notb*
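The example expressions can be checked by brute force. The sketch below translates them into Python `re` syntax (an assumption of this illustration: `[ac]` stands for notb and `b?` for (b|ε)) and compares each against a direct test of the property on all short strings over { a, b, c }.

```python
import re
from itertools import product

NOTB = "[ac]"
PATTERNS = {
    "exactly one b": f"{NOTB}*b{NOTB}*",
    "at most one b": f"{NOTB}*b?{NOTB}*",
    "no consecutive b's": f"(?:{NOTB}|b{NOTB})*b?",
    "even number of b's": f"(?:{NOTB}*b{NOTB}*b)*{NOTB}*",
}
PROPERTIES = {
    "exactly one b": lambda s: s.count("b") == 1,
    "at most one b": lambda s: s.count("b") <= 1,
    "no consecutive b's": lambda s: "bb" not in s,
    "even number of b's": lambda s: s.count("b") % 2 == 0,
}


def mismatches(max_len=6):
    """Return every (name, string) where regex and property disagree."""
    bad = []
    for n in range(max_len + 1):
        for chars in product("abc", repeat=n):
            s = "".join(chars)
            for name, pattern in PATTERNS.items():
                if bool(re.fullmatch(pattern, s)) != PROPERTIES[name](s):
                    bad.append((name, s))
    return bad
```

An empty result from `mismatches()` only shows agreement up to the chosen length; it is a sanity check, not a proof.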
Extensions to Regular Expressions
▪ + describes one or more repetitions
– Binary numbers: (0|1)(0|1)*
– (0|1)+
▪ . describes any character
– All strings containing at least one b: .*b.*
▪ Character classes
– [a-z] instead of a|b|...|z
– [0-9] instead of 0|1|...|9
– Multiple ranges: [a-zA-Z]
▪ ~ describes any character not in a given set
– Any character but a, b, and c: ~(a|b|c)
– [^abc] in Lex
▪ ? describes optional sub-expressions
– Number with optional leading sign
– nat = [0-9]+
– signedNat = nat | + nat | - nat
– signedNat = (+|-)? nat

Programming Language Tokens
▪ Numbers: 2.71E-2 = 0.0271
– nat = [0-9]+
– signedNat = (+|-)? nat
– number = signedNat(“.” nat)?(E signedNat)?
▪ Keywords
– keyword = if | while | do | ...
▪ Identifiers
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
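The number and identifier definitions translate almost directly into Python `re` patterns. A sketch, assuming that `[+-]?` is an acceptable rendering of (+|-)? and that the whole string must match; the helper names `is_number` and `is_identifier` are made up for this example.

```python
import re

NAT = r"[0-9]+"
SIGNED_NAT = rf"[+-]?{NAT}"
# number = signedNat("." nat)?(E signedNat)?
NUMBER = re.compile(rf"{SIGNED_NAT}(?:\.{NAT})?(?:E{SIGNED_NAT})?")
# identifier = letter(letter|digit)*
IDENTIFIER = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def is_number(s):
    return NUMBER.fullmatch(s) is not None


def is_identifier(s):
    return IDENTIFIER.fullmatch(s) is not None
```

Under this definition 2.71E-2 is a number, while 1. and .5 are not, because both the integer part and the fraction need at least one digit.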

Comments
▪ { this is a Pascal comment }
– {(~})*}
▪ -- this is an Ada comment

– --(~newline)*
▪ /* this is a C comment */
– Writing b for / and a for * to avoid confusion with the repetition operator: ba(~(ab))*ab
• Not valid because “not” operator is restricted to single characters
– (~(ab))* could be written as: b*(a*~(a|b)b*)*a*
▪ In some programming languages, comments can be nested
– (* this is (* a Modula-2 *) comment *)
– (* this is (* illegal in a Modula-2 *)
• Cannot be described by a regular expression, since a scanner based on finite automata cannot count comment delimiters
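Nested comments need a counter, which is exactly what a finite automaton lacks. The following sketch shows the usual workaround of hand-written code with a depth variable; the function name and the return convention (index past the comment, or -1) are assumptions of this example.

```python
def nested_comment_end(text, start=0):
    """Return the index just past a well-nested (* ... *) comment that
    begins at `start`, or -1 if there is no such comment."""
    if not text.startswith("(*", start):
        return -1
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("(*", i):
            depth += 1            # one more comment to close
            i += 2
        elif text.startswith("*)", i):
            depth -= 1
            i += 2
            if depth == 0:
                return i          # outermost comment closed
        else:
            i += 1
    return -1                     # input ended with unclosed comments
```

The depth can grow without bound, so no fixed number of DFA states can replace the counter.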

Ambiguity
▪ Strings may be matched by several regular expressions
– Are if, while, for keywords or identifiers?
– Are strings <>, ==, <=, >= one or two tokens?
▪ Regular expressions cannot solve ambiguities
▪ Disambiguating rules
– Keyword (or reserved words) first
– Principle of longest substring
▪ Token delimiters or separators

– Indicate that a longer substring cannot represent a token
▪ Lookahead problem
– Token delimiters must not be consumed but returned to input stream
– Often one single lookahead character is enough, but sometimes not
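Both disambiguating rules fit in a few lines. A sketch under stated assumptions: the token names in `OPERATORS` and the function `next_token` are invented for this example, and only ASCII letters and digits are considered.

```python
KEYWORDS = {"if", "while", "for"}
# Hypothetical token names for the operators.
OPERATORS = {"<": "LT", "<=": "LE", "<>": "NE", ">": "GT", ">=": "GE", "=": "EQ"}


def is_letter(c):
    return "a" <= c <= "z" or "A" <= c <= "Z"


def is_digit(c):
    return "0" <= c <= "9"


def next_token(text, pos):
    """Return (token type, lexeme, new position) for the token at pos."""
    ch = text[pos]
    if is_letter(ch):
        end = pos + 1
        # Longest substring: keep extending while letters or digits follow.
        while end < len(text) and (is_letter(text[end]) or is_digit(text[end])):
            end += 1
        lexeme = text[pos:end]
        # Keywords first: a lexeme that is a reserved word is never an ID.
        kind = lexeme.upper() if lexeme in KEYWORDS else "ID"
        return kind, lexeme, end
    # Operators: try the two-character candidate before the one-character one.
    for length in (2, 1):
        lexeme = text[pos:pos + length]
        if len(lexeme) == length and lexeme in OPERATORS:
            return OPERATORS[lexeme], lexeme, pos + length
    raise SyntaxError(f"unexpected character {ch!r}")
```

The character that ends an identifier is only inspected, never consumed: the returned position still points at it, which is the one-character lookahead mentioned above.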

Token Delimiters
▪ Whitespaces
– Discarded token delimiters
– Free format languages
– whitespace = (newline | blank | tab | comment)+
▪ while x …
– Keyword while, identifier x
– Space as token delimiter
▪ xtemp=ytemp
– Identifiers xtemp and ytemp
– = as token delimiter
▪ Comments usually serve as delimiters

– do/**/if
– Two reserved words (do and if) rather than identifier doif
– /**/ as token delimiter

FORTRAN Example
▪ Fixed format language
– Whitespaces are not delimiters, but removed by a preprocessor before scanning
▪ I F ( X 2 .EQ. 0 ) THE N
– Equivalent to IF(X2.EQ.0) THEN
▪ No reserved words in FORTRAN (all keywords can also be identifiers)

– Character position in each line of input is important
▪ IF(IF.EQ.0)THENTHEN=1.0
– First IF and THEN are keywords
– Second IF and THEN are variables (identifiers)
▪ Scanner may have to backtrack to arbitrary positions in a code line

– DO99I=1.10 → assigns value 1.1 to variable DO99I
– DO99I=1,10 → loop, like for(i=1; i<=10; i++) in C

Agenda
▪ Introduction
▪ Regular expressions
▪ Deterministic Finite Automaton (DFA)
▪ Automatic DFA generation
▪ Conclusions

Finite Automaton
▪ Mathematical method of recognising patterns in input strings
▪ identifier = letter(letter|digit)*
– Transition graph with two states
– Start state: 1
– Accepting state: 2
– Transitions: arrowed lines
– [Figure: 1 –letter→ 2; state 2 loops on letter and on digit]
▪ Process of recognising xtemp as identifier
– 1 –x→ 2 –t→ 2 –e→ 2 –m→ 2 –p→ 2

Deterministic Finite Automaton (DFA)
– Next state is uniquely given by current state and current input character
– A symbol can label one single transition out of a state
▪ A DFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × Σ → S
– A start state s0 ∈ S
– A set of accepting states A ⊆ S
▪ L(M): language accepted by M
▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ, ∀ i ∈ [1..n] ∧ ∃ s1=T(s0,c1), s2=T(s1,c2), …, sn=T(sn-1,cn) ∧ sn ∈ A
– s0 –c1→ s1 –c2→ s2 –c3→ ... –cn-1→ sn-1 –cn→ sn
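The acceptance condition can be executed directly. A minimal sketch, assuming the transition function is stored as a dictionary keyed by (state, character) and that a missing entry counts as an error transition; the example DFA is the one for strings with exactly one b.

```python
def dfa_accepts(transitions, start, accepting, string):
    """Compute s1 = T(s0, c1), ..., sn = T(sn-1, cn) and test sn in A."""
    state = start
    for ch in string:
        key = (state, ch)
        if key not in transitions:    # undefined transition: reject
            return False
        state = transitions[key]
    return state in accepting


# DFA for (a|c)*b(a|c)*: state 1 = no b seen yet, state 2 = one b seen.
ONE_B = {
    (1, "a"): 1, (1, "c"): 1, (1, "b"): 2,
    (2, "a"): 2, (2, "c"): 2,
}
```

Because T is a function, each character leads to at most one next state, so the loop never has to choose or backtrack.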
Error Transitions
▪ letter = [a-zA-Z]
▪ T(start, c) defined only if c is a letter
▪ T(in_id, c) defined only if c is letter or digit
▪ Error transitions
– Not drawn but assumed to exist
– other1 = ~letter
– other2 = ~(letter|digit)
▪ [Figure: start –letter→ in_id; in_id loops on letter and digit; start –other1→ error; in_id –other2→ error; error loops on any]
DFA Examples
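▪ Set of strings that contain exactly one b
– (a|c)*b(a|c)*
– [Figure: two states; the start state loops on notb and moves on b to the accepting state, which loops on notb]
▪ Set of strings that contain at most one b
– (a|c)*|(a|c)*b(a|c)*
– (a|c)*(b|ε)(a|c)*
– [Figure: same graph with both states accepting]
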

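Numeric Constants
▪ digit = [0-9]
▪ nat = digit+
– [Figure: DFA with a digit transition into an accepting state that loops on digit]
▪ signedNat = (+|-)? nat
– [Figure: as for nat, with an optional + or - transition first]
▪ number = signedNat(“.” nat)?(E signedNat)?
– [Figure: as for signedNat, followed by an optional “.” with digits and an optional E with an optionally signed sequence of digits]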
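Unnested Comments
▪ Pascal comments { }
– {(~})*}
– other = ~}
– [Figure: { leads from the start state to a state that loops on other; } leads from there to the accepting state]
▪ C comments
– [Figure: 1 –/→ 2 –*→ 3 –*→ 4 –/→ 5; state 3 loops on other; state 4 loops on * and returns to 3 on other]
– 1 – start
– 2 – entering comment
– 3 – inside comment
– 4 – exiting comment
– 5 – finish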
▪ Easier to write DFA than regular expression

DFA Implementation with State Variables
[Figure: 1 –letter→ 2; state 2 loops on letter and digit; 2 –[other]→ 3]
state := 1; { start }
while state = 1 or 2 do
  case state of
  1: case input character of
     letter: advance input;
             state := 2;
     else state := . . .; { error or other }
     end case;
  2: case input character of
     letter, digit: advance input;
                    state := 2; { unnecessary }
     else state := 3;
     end case;
  end case;
end while;
if state = 3 then accept else error;

C Comments with State Variables
state := 1; { start }
while state = 1, 2, 3 or 4 do
  case state of
  1: case input character of
     “/”: state := 2;
     else state := 6; { error or other }
     end case;
  2: case input character of
     “*”: state := 3;
     else state := 6; { error or other }
     end case;
  3: case input character of
     “*”: state := 4;
     else { stay in state 3 }
     end case;
  4: case input character of
     “/”: state := 5;
     “*”: { stay in state 4 }
     else state := 3;
     end case;
  end case;
  advance input;
end while;
if state = 5 then accept else error;
[Figure: 1 –/→ 2 –*→ 3 –*→ 4 –/→ 5; state 3 loops on other; state 4 loops on * and returns to 3 on other]
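The same state-variable technique as a runnable sketch. One deliberate difference from the pseudocode is an assumption of this example: the function requires the whole input to be a single comment, so any character after state 5 is rejected.

```python
def is_c_comment(text):
    """Recognise one unnested C comment with an explicit state variable."""
    state = 1                               # 1 = start
    for ch in text:
        if state == 1:
            state = 2 if ch == "/" else 6   # 6 = error
        elif state == 2:
            state = 3 if ch == "*" else 6
        elif state == 3:
            if ch == "*":
                state = 4                   # otherwise stay inside the comment
        elif state == 4:
            if ch == "/":
                state = 5                   # comment closed
            elif ch != "*":
                state = 3                   # the * belonged to the comment text
        else:
            return False                    # input left over in state 5 or 6
    return state == 5
```

The states of the DFA appear as branches of the `if` chain, so the automaton is encoded in the control flow instead of in a data structure.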
Transition Table
▪ Transition table
– Two-dimensional array indexed by state and input character
– Expresses values of transition function T
– Extra column indicates accepting states
– Square brackets indicate “noninput-consuming” transitions
[Figure: 1 –letter→ 2; state 2 loops on letter and digit; 2 –[other]→ 3]
T       Input character            Accepting
State   letter   digit   other
1       2                          No
2       2        2       [3]       No
3                                  Yes

C Comments DFA Transition Table
[Figure: 1 –/→ 2 –*→ 3 –*→ 4 –/→ 5; state 3 loops on other; state 4 loops on * and returns to 3 on other]
T       Input character        Accepting
State   /     *     other
1       2                      No
2             3                No
3       3     4     3          No
4       5     4     3          No
5                              Yes

Table Driven DFA Implementation
state := 1;
ch := next input character;
while T[state, ch] is defined and not error(state) do
  newstate := T[state, ch];
  if Advance[state, ch] then
    ch := next input char;
  state := newstate;
end while;
if Accept[state] then accept;
▪ Advantage
– Small code size
– Easy to maintain and change
▪ Disadvantage
– Tables can get very large
– Table compression methods slow and rarely used
– Sparse array representations
Agenda
▪ Introduction
▪ Regular expressions
▪ Deterministic Finite Automata (DFA)
▪ Automatic scanner generation
▪ Conclusions

Single DFA Generation
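▪ Automatically construct DFAs for all regular expressions
– <= : return LE
– <> : return NE
– < : return LT
▪ Combine all DFAs in one single DFA
– Joining the three DFAs at a common start state gives three transitions on <, which is not deterministic
– Single DFA: one transition on <, then = → return LE, > → return NE, [other] → return LT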
Automatic DFA Generation
▪ Nondeterministic Finite Automaton (NFA)
– Constructed for each regular expression
– Thompson’s construction
▪ ε-transitions
– Connect NFAs of all tokens
– “Spontaneous” transition without consuming any input characters
– “Match” of empty string
▪ Convert NFA into DFA using a fast and direct algorithm
– Subset construction
▪ [Figure: new start state with ε-transitions into the NFAs for :=, <= and =]

Automatic Scanner Generation
▪ Regular Expressions → Nondeterministic Finite Automaton → Deterministic Finite Automaton
– Regular Expressions to NFA: Thompson’s Construction
– NFA to DFA: Subset Construction
– DFA to scanner: Table Driven DFA Implementation

NFA Definition
▪ NFA M consists of
– An alphabet Σ
– A set of states S
– A transition function T: S × (Σ ∪ { ε }) → ℘(S)
– A start state s0 ∈ S
– A set of accepting states A ⊆ S
▪ L(M): language accepted by M
▪ c1c2…cn ∈ L(M) ⇔ ci ∈ Σ ∪ { ε }, ∀ i ∈ [1..n] ∧ ∃ s1 ∈ T(s0,c1), s2 ∈ T(s1,c2), …, sn ∈ T(sn-1,cn) ∧ sn ∈ A
▪ Observations
– Any ci may be ε
– c1c2…cn may have fewer than n characters (ε removed)
– Sequence of states s1s2…sn chosen from sets of states T(s0,c1), T(s1,c2), …, T(sn-1,cn) is not always uniquely determined
– Arbitrary number of ε in input stream corresponding to any number of NFA ε-transitions

NFA Example
▪ [Figure: NFA with start state 1 and accepting state 4; transitions 1 –a→ 2, 1 –a→ 3, 1 –ε→ 4, 2 –b→ 4, 3 –ε→ 4, 4 –ε→ 2]
▪ String abb can be accepted by any of the following transition sequences
– 1 –a→ 2 –b→ 4 –ε→ 2 –b→ 4
– 1 –a→ 3 –ε→ 4 –ε→ 2 –b→ 4 –ε→ 2 –b→ 4
▪ NFA matches multiple regular expressions
– ab: T(1,a)=2, T(2,b)=4
– ab+: T(1,a)=2, T(2,b)=4, T(4,ε)=2, T(2,b)=4, …
– ab*: T(1,a)=3, T(3,ε)=4, T(4,ε)=2, T(2,b)=4, …
– b*: T(1,ε)=4, T(4,ε)=2, T(2,b)=4, …
▪ NFA accepts same language as regular expression: ab*|b*
– Equivalent to (a|ε)b*
– [Figure: simpler automata for ab*|b* and (a|ε)b*]
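One way to run this NFA is to track the set of all states it could be in. A sketch, assuming the transitions 1 –a→ 2, 1 –a→ 3, 1 –ε→ 4, 2 –b→ 4, 3 –ε→ 4, 4 –ε→ 2 listed on the slide and using the empty string as a stand-in label for ε.

```python
EPS = ""   # assumption: the empty string labels an ε-transition

NFA = {
    (1, "a"): {2, 3},
    (1, EPS): {4},
    (2, "b"): {4},
    (3, EPS): {4},
    (4, EPS): {2},
}
START, ACCEPTING = 1, {4}


def eps_closure(states):
    """All states reachable from `states` by zero or more ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def nfa_accepts(string):
    current = eps_closure({START})
    for ch in string:
        moved = set()
        for s in current:
            moved |= NFA.get((s, ch), set())   # every possible next state
        current = eps_closure(moved)
    return bool(current & ACCEPTING)
```

Following all choices at once removes the need to guess which of the transition sequences for abb to take.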

Thompson’s Construction
▪ Construct an NFA from a regular expression
▪ Basic regular expression

– a: [Figure: start state –a→ accepting state]
– ε: [Figure: start state –ε→ accepting state]

▪ To do
– Concatenation
– Or
– Repetition

Concatenation
▪ Input
– Two regular expressions r and s
– Two NFAs (of r and of s)
▪ Goal
– NFA for regular expression rs
▪ Connect accepting state of r with start state of s through one ε-transition
▪ L(rs) = L(r) L(s)
▪ [Figure: ...r... –ε→ ...s...]

Or
▪ Input
– Two regular expressions r and s
▪ Goal
– NFA for regular expression r|s
▪ Two new states: start and accepting
▪ Connected with ε-transitions
▪ L(r|s) = L(r) ∪ L(s)
▪ [Figure: new start state –ε→ ...r... –ε→ new accepting state; new start state –ε→ ...s... –ε→ new accepting state]
Repetition
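▪ Input
– One regular expression r
▪ Goal
– NFA for regular expression r*
▪ Two new states: start and accepting, connected to the NFA of r through ε-transitions
▪ Repetition through ε-transition from accepting to start state of r
▪ Empty string is accepted by ε-transition from start to accepting state
▪ [Figure: new start state –ε→ ...r... –ε→ new accepting state; ε-transition back from the accepting to the start state of r; ε-transition from the new start state to the new accepting state]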
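The four construction rules can be sketched as methods that return a (start, accept) pair for each fragment. The class and method names are assumptions of this example, `None` stands in for ε, and the small `accepts` helper is only there to try the result out.

```python
from collections import defaultdict
from itertools import count

EPS = None   # assumption: None labels an ε-transition


class NFA:
    """All fragments share one transition table; a fragment is (start, accept)."""

    def __init__(self):
        self.edges = defaultdict(set)   # (state, label) -> set of next states
        self._ids = count(1)

    def new_state(self):
        return next(self._ids)

    def symbol(self, a):
        s, t = self.new_state(), self.new_state()
        self.edges[(s, a)].add(t)
        return s, t

    def concat(self, r, s):
        # ε-transition from the accepting state of r to the start state of s
        self.edges[(r[1], EPS)].add(s[0])
        return r[0], s[1]

    def alt(self, r, s):
        start, accept = self.new_state(), self.new_state()
        for frag in (r, s):
            self.edges[(start, EPS)].add(frag[0])
            self.edges[(frag[1], EPS)].add(accept)
        return start, accept

    def star(self, r):
        start, accept = self.new_state(), self.new_state()
        self.edges[(start, EPS)].update({r[0], accept})   # enter r or skip it
        self.edges[(r[1], EPS)].update({r[0], accept})    # repeat r or leave
        return start, accept

    def closure(self, states):
        seen, stack = set(states), list(states)
        while stack:
            for t in self.edges.get((stack.pop(), EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def accepts(self, frag, string):
        current = self.closure({frag[0]})
        for ch in string:
            step = set()
            for s in current:
                step |= self.edges.get((s, ch), set())
            current = self.closure(step)
        return frag[1] in current


n = NFA()
ab_or_a = n.alt(n.concat(n.symbol("a"), n.symbol("b")), n.symbol("a"))
a_star = n.star(n.symbol("a"))
```

Each fragment should be used in only one larger expression, since combining it adds ε-transitions to its states.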

Example: ab | a
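▪ a
– [Figure: one a-transition]
▪ b
– [Figure: one b-transition]
▪ ab
– [Figure: a-transition, ε-transition, b-transition]
▪ ab | a
– [Figure: new start state with ε-transitions into the NFAs of ab and of a, whose accepting states have ε-transitions into a new accepting state]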

letter (letter | digit)*
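▪ letter
– [Figure: one letter-transition]
▪ digit
– [Figure: one digit-transition]
▪ letter | digit
– [Figure: new start state with ε-transitions into the NFAs of letter and of digit, whose accepting states have ε-transitions into a new accepting state]
▪ (letter | digit)*
– [Figure: NFA of letter | digit wrapped in a new start and a new accepting state, with ε-transitions for entering, repeating, leaving and skipping]
▪ letter (letter | digit)*
– [Figure: letter-transition followed by an ε-transition into the NFA of (letter | digit)*]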
Subset Construction
▪ Convert an NFA into a DFA
▪ We need some method for eliminating
– ε-transitions
– Multiple transitions from a state on same input character
▪ ε-closure of a state s is
– Set of states reachable by a series of zero or more ε-transitions
– Denoted as s̄
▪ DFA has as states sets of states of original NFA

Example: a*
▪ DFA start state
– ε-closure of start state
▪ For each DFA state S and each a ∈ Σ, compute
– Sa = { t | ∃ s ∈ S ∧ T(s,a)=t }
– The ε-closure of Sa
▪ Add new transition T(S,a) = ε-closure(Sa)
▪ DFA accepting states
– All states that contain accepting NFA states
▪ [Figure: NFA for a*: 1 –ε→ 2 –a→ 3 –ε→ 4, 3 –ε→ 2, 1 –ε→ 4]
▪ Construction
– ε-closure({ 1 }) = { 1, 2, 4 }
– { 1, 2, 4 }a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }
– T({ 1, 2, 4 }, a) = { 2, 3, 4 }
– { 2, 3, 4 }a = { 3 }, ε-closure({ 3 }) = { 2, 3, 4 }
– T({ 2, 3, 4 }, a) = { 2, 3, 4 }
▪ [Figure: DFA: {1,2,4} –a→ {2,3,4}; {2,3,4} loops on a]

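ab|a
▪ [Figure: NFA: 1 –ε→ 2 –a→ 3 –ε→ 4 –b→ 5 –ε→ 8; 1 –ε→ 6 –a→ 7 –ε→ 8]
▪ Construction
– ε-closure({ 1 }) = { 1, 2, 6 }
– { 1, 2, 6 }a = { 3, 7 }, ε-closure({ 3, 7 }) = { 3, 4, 7, 8 }
– T({ 1, 2, 6 }, a) = { 3, 4, 7, 8 }
– { 3, 4, 7, 8 }b = { 5 }, ε-closure({ 5 }) = { 5, 8 }
– T({ 3, 4, 7, 8 }, b) = { 5, 8 }
▪ [Figure: DFA: {1, 2, 6} –a→ {3, 4, 7, 8} –b→ {5, 8}]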
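The subset construction for ab|a can be reproduced mechanically. A sketch, assuming the NFA state numbers 1 to 8 of this example, `None` as the label for ε, and a worklist of DFA states still to be processed; the function names are made up for the illustration.

```python
EPS = None   # assumption: None labels an ε-transition

# NFA for ab|a with the state numbers of the example.
NFA = {
    (1, EPS): {2, 6},
    (2, "a"): {3},
    (3, EPS): {4},
    (4, "b"): {5},
    (5, EPS): {8},
    (6, "a"): {7},
    (7, EPS): {8},
}
NFA_START, NFA_ACCEPT, ALPHABET = 1, {8}, "ab"


def eps_closure(states):
    closure, stack = set(states), list(states)
    while stack:
        for t in NFA.get((stack.pop(), EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)


def subset_construction():
    """Return (DFA start state, transition table, accepting DFA states)."""
    start = eps_closure({NFA_START})
    table, todo = {}, [start]
    while todo:
        S = todo.pop()
        if S in table:
            continue                     # already processed
        table[S] = {}
        for a in ALPHABET:
            moved = set()
            for s in S:
                moved |= NFA.get((s, a), set())
            if moved:                    # empty set = error transition
                target = eps_closure(moved)
                table[S][a] = target
                todo.append(target)
    accepting = {S for S in table if S & NFA_ACCEPT}
    return start, table, accepting
```

DFA states are frozensets of NFA states, so they can serve directly as dictionary keys in the transition table.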

letter(letter|digit)*
▪ [Figure: NFA: 1 –letter→ 2 –ε→ 3 –ε→ 4; 4 –ε→ 5 –letter→ 6 –ε→ 9; 4 –ε→ 7 –digit→ 8 –ε→ 9; 9 –ε→ 4; 9 –ε→ 10; 3 –ε→ 10]
▪ Construction
– ε-closure({ 1 }) = { 1 }
– { 1 }letter = { 2 }, ε-closure({ 2 }) = { 2, 3, 4, 5, 7, 10 }
– T({ 1 }, letter) = { 2, 3, 4, 5, 7, 10 }
– { 2, 3, 4, 5, 7, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 2, 3, 4, 5, 7, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 2, 3, 4, 5, 7, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 2, 3, 4, 5, 7, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
– { 4, 5, 6, 7, 9, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 4, 5, 6, 7, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 4, 5, 6, 7, 9, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 4, 5, 6, 7, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
– { 4, 5, 7, 8, 9, 10 }letter = { 6 }, ε-closure({ 6 }) = { 4, 5, 6, 7, 9, 10 }
– T({ 4, 5, 7, 8, 9, 10 }, letter) = { 4, 5, 6, 7, 9, 10 }
– { 4, 5, 7, 8, 9, 10 }digit = { 8 }, ε-closure({ 8 }) = { 4, 5, 7, 8, 9, 10 }
– T({ 4, 5, 7, 8, 9, 10 }, digit) = { 4, 5, 7, 8, 9, 10 }
▪ [Figure: DFA: {1} –letter→ {2,3,4,5,7,10}; from each of {2,3,4,5,7,10}, {4,5,6,7,9,10} and {4,5,7,8,9,10}, letter leads to {4,5,6,7,9,10} and digit leads to {4,5,7,8,9,10}]

Agenda
▪ Introduction
▪ Regular expressions
▪ Deterministic Finite Automata (DFA)
▪ Automatic scanner generation
▪ Conclusions

Conclusions
▪ Lexical analysis or scanning
▪ Operates under parser control
▪ Tokens specified through regular expressions
▪ Regular expressions implemented as finite automata

– Automatic NFA generation through Thompson construction
– NFA conversion into DFA through Subset construction
▪ Automatic lexical analyser generator (e.g., Lex)

