CC 02 Lexical Analysis

Grundlagen des Compilerbaus (Fundamentals of Compiler Construction)
Summer semester 2021
Prof. Christoph Bockisch

Stefan Schulz
(Programmiersprachen und –werkzeuge)
Prof. Christoph Bockisch (bockisch@mathematik.uni-marburg.de) | Programmiersprachen und –werkzeuge

LEXICAL ANALYSIS

Scanner
• Syntax of programming language
• Made up of symbols
• Can be complex
• Symbols, or lexemes
• Sequences of characters
• Recognized by scanner
• Rules for lexemes also called micro syntax

• What is the form of literals or identifiers?
• Which keywords exist?
• Recognized by scanner

Scanner
• Definitions
• Σ: set of characters that may appear in input (alphabet), e.g., ASCII, Unicode
• Σ*: Kleene closure (German: kleenesche Hülle) of Σ, the set of all words that can be formed by concatenating the characters of Σ in any way
• p ∈ Σ*: source code program to be scanned (input)
• L ∈ P(Σ*): a (formal) language is a set of programs over Σ that are accepted; P denotes the powerset (German: Potenzmenge)
• Task
• Split source program into sequence of lexemes
• Transform sequence of lexemes to sequence of tokens
• Many different lexemes play the same role
• Group into classes of lexemes: tokens
• Token: <token-name, attribute(s)>

Typical token kinds

• Identifiers: sequence of letters and digits, starting with a letter
• Numbers: sequence of digits, possibly with a sign
• Keywords
• Special characters: operators (+, *, …), parentheses ((, ), {, }, …), …
• Compound symbols: operators (==, <=, ++, …)
• Whitespace: space, tab, newline, etc.
• Special symbols: comments, pragmas

Towards implementing a scanner

• Restrict lexemes to regular language
• Definition regular expression

• Constructive definition of regular language
• Given an alphabet Σ
• Then RE(Σ), the set of regular expressions over Σ, defined by
• Λ ∈ RE(Σ): the empty expression is an RE
• ∀a ∈ Σ. a ∈ RE(Σ) : all characters of Σ are an RE
• Given 𝛼,β ∈ RE(Σ) then
1. (𝛼 | β) ∈ RE(Σ) (“or”)
2. (𝛼 · β) ∈ RE(Σ) (“sequence”)
3. (𝛼*) ∈ RE(Σ) (“repetition”)
• Closed under regular composition
• Order of precedence: *, ·, |

Towards implementing a scanner

• Definition [[𝛼]] for 𝛼 ∈ RE(Σ): language generated by a regular expression
1. [[Λ]] := ∅
2. [[a]] := {a} for all a ∈ Σ
3. [[𝛼 | β]] := [[𝛼]] ∪ [[β]]
4. [[𝛼 · β]] := [[𝛼]] · [[β]] (element-wise concatenation)
5. [[𝛼*]] := [[𝛼]]* (closure)
• [[Λ*]] = {ε}, since ∅* = {ε}
• Hence ε can be used as shorthand for the regular expression Λ*
• Example: identifier
• 𝛼 = (_ | a | … | z) · (_ | a | … | z | 0 | … | 9)*
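
The semantic rules above can be tried out with a small Python sketch. The tuple encoding of regular expressions, the helper `alt_of`, and the length bound `max_len` are assumptions made for illustration only; the bound keeps the generated language finite.

```python
# Regular expressions as nested tuples (illustrative encoding, not from the slides):
#   ("empty",)      Λ
#   ("chr", a)      a single character a of the alphabet
#   ("alt", r, s)   r | s
#   ("seq", r, s)   r · s
#   ("star", r)     r*

def lang(r, max_len):
    """Return all words of [[r]] whose length is at most max_len."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "chr":
        return {r[1]} if max_len >= 1 else set()
    if kind == "alt":
        return lang(r[1], max_len) | lang(r[2], max_len)
    if kind == "seq":
        left, right = lang(r[1], max_len), lang(r[2], max_len)
        return {u + v for u in left for v in right if len(u + v) <= max_len}
    if kind == "star":
        base = lang(r[1], max_len)
        words, frontier = {""}, {""}
        # Append words of the base language until nothing new fits the bound
        while frontier:
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - words
            words |= frontier
        return words
    raise ValueError(f"unknown node: {kind}")


def alt_of(chars):
    """Build (c1 | c2 | ...) from a non-empty string of characters."""
    r = ("chr", chars[0])
    for c in chars[1:]:
        r = ("alt", r, ("chr", c))
    return r


# Identifier example over the reduced alphabet {_, a, b, 0, 1}
ident = ("seq", alt_of("_ab"), ("star", alt_of("_ab01")))
```

For instance, `lang(ident, 2)` contains `a0` but not `0a`, and `lang(("star", ("empty",)), 3)` is the set containing only the empty word.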

Lab A
• Define regular expressions for the following languages
• Specify the alphabet for each language
• Languages
1. Positive integers without leading zeroes
pint = (1 | … | 9) · (0 | … | 9)* | 0
2. Term:
• Product of coefficient and ’x’ raised to the power of exponent
• Coefficient: possibly negative integer
• Exponent: positive integer
• Coefficient may be left out (meaning it is 1)
• Exponent may be left out (meaning it is 1)
• e.g.: x^2, 2x, x
(pint | ε) x ^

Lab A
• 1. Positive integers without leading zeroes
• Alphabet:
• Regular Expression:

Lab A
• 2. Term
• Alphabet:
• Regular Expression:

Word Problem
• Given
• A regular expression 𝛼 ∈ RE(Σ)
(respectively a language L = [[𝛼]])
• A word w ∈ Σ*
• Is w ∈ [[𝛼]]?
• Can this question always be answered?

• Yes, for regular languages
• There is an algorithm to answer this question:
construct a finite state machine (finite automaton)

Non-Deterministic Finite Automaton

• State transition diagram with finitely many states
• Definition Non-Deterministic Finite Automaton (NFA)

• A = (Q, Σ, 𝝳, q0, F)
• Q: finite set of states
• Σ: alphabet
• q0 ∈ Q: initial state
• F ⊆ Q: set of final states
• 𝝳: state transition function
• 𝝳: Q ⨯ (Σ∪{ε}) → P(Q)


• Example (state diagram not reproduced)
A = (Q, Σ, 𝝳, q0, F)
Q = {q1, q2, q3, q4}
Σ = {0,1}
q0 = q1
F = {q4} ⊆ Q
Transition table 𝝳:
q    0     1     ε
q1   {q3}  ∅     ∅
q2   ∅     {q4}  ∅
q3   {q4}  ∅     {q2}
q4   ∅     ∅     ∅
• 00 ∈ L(A)? ✅ (q1 →0 q3 →0 q4)
• 01 ∈ L(A)? ✅ (q1 →0 q3 →ε q2 →1 q4)

Thompson’s construction
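• Construct NFA from regular expression r ∈ RE(Σ) (Ken Thompson, CACM, 1968)
• One NFA fragment per case (diagrams not reproduced):
• r = a, a ∈ Σ∪{ε}: a single transition labeled a
• r = 𝛼 | β, 𝛼,β ∈ RE(Σ): the fragments for 𝛼 and β as alternatives
• r = 𝛼 · β, 𝛼,β ∈ RE(Σ): the fragment for 𝛼 followed by the fragment for β
• r = 𝛼*, 𝛼 ∈ RE(Σ): the fragment for 𝛼 with ε-transitions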
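
A minimal Python sketch of such a construction follows. It is one common textbook variant that glues fragments together with extra ε-transitions, so the fragments may differ in detail from the diagrams on the slide. The tuple encoding of regular expressions and the names `thompson`, `closure` and `accepts` are assumptions for illustration.

```python
import itertools

_fresh = itertools.count()  # source of fresh state numbers


def thompson(r, delta):
    """Add an NFA fragment for r to delta and return its (start, end) states.

    delta maps (state, label) to a set of successors; the label "" stands for ε.
    r is a nested tuple: ("empty",), ("chr", a), ("alt", r, s), ("seq", r, s), ("star", r).
    """
    start, end = next(_fresh), next(_fresh)

    def edge(p, label, q):
        delta.setdefault((p, label), set()).add(q)

    kind = r[0]
    if kind == "empty":
        pass                                  # no path from start to end
    elif kind == "chr":
        edge(start, r[1], end)                # ("chr", "") encodes ε
    elif kind == "alt":
        for sub in r[1:]:
            s, e = thompson(sub, delta)
            edge(start, "", s)
            edge(e, "", end)
    elif kind == "seq":
        s1, e1 = thompson(r[1], delta)
        s2, e2 = thompson(r[2], delta)
        edge(start, "", s1)
        edge(e1, "", s2)
        edge(e2, "", end)
    elif kind == "star":
        s, e = thompson(r[1], delta)
        edge(start, "", s)                    # enter the loop body
        edge(start, "", end)                  # zero repetitions
        edge(e, "", s)                        # repeat
        edge(e, "", end)                      # leave the loop
    else:
        raise ValueError(f"unknown node: {kind}")
    return start, end


def closure(states, delta):
    """All states reachable from `states` via ε-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        for q in delta.get((stack.pop(), ""), ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def accepts(r, word):
    """Build the NFA for r and simulate it on word."""
    delta = {}
    start, end = thompson(r, delta)
    current = closure({start}, delta)
    for ch in word:
        step = set()
        for q in current:
            step |= delta.get((q, ch), set())
        current = closure(step, delta)
    return end in current


# a · (a | b)*
example = ("seq", ("chr", "a"), ("star", ("alt", ("chr", "a"), ("chr", "b"))))
```

With this encoding, `accepts(example, "abba")` holds, while the empty word and `ba` are rejected.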

Extension to Regular Expressions and Thompson’s construction

• Often extensions to regular expressions are used as shortcut:
• 𝛼+, 𝛼 ∈ RE(Σ), means 𝛼·𝛼*
• 𝛼?, 𝛼 ∈ RE(Σ), means 𝛼|ε
• We can extend specific construction rules to minimize ε-transitions
• r = 𝛼+, 𝛼 ∈ RE(Σ): fragment for 𝛼 plus an ε-transition (diagram not reproduced)
• r = 𝛼?, 𝛼 ∈ RE(Σ): (diagram not reproduced)

Lab B
• Construct an NFA for the regular expressions
1. RE({0,1}) := 1 · 0 · (0 | 1)* | (0 | 1)* · 0 · 0
2. RE({a,b}) := a · b* · a | b · a* · b


Non-Deterministic Finite Automaton

• Recognition of word
• Start at initial state
• Follow transitions defined by 𝝳 with the input (the word)
• If a final state can be reached, the word is recognized by NFA
• Set of all recognized words: recognized language of NFA
• Definition Epsilon-Closure of T ⊆ Q: ε̂(T)
• ε̂(T) is the smallest set with
1. T ⊆ ε̂(T)
2. if q ∈ ε̂(T) then 𝝳(q, ε) ⊆ ε̂(T)


• Definition extended transition function 𝝳̂: P(Q) ⨯ Σ* → P(Q)
• Given
• T ⊆ Q
• a ∈ Σ
• w ∈ Σ*
• 𝝳̂(T, ε) := ε̂(T)
• 𝝳̂(T, aw) := 𝝳̂(ε̂(⋃ q∈ε̂(T) 𝝳(q, a)), w)

• Can answer word problem:
If for a given input w the NFA can reach a final state, then
w is in the language
• Language recognized by NFA

L(A) := {w ∈ Σ* | 𝝳̂({q0}, w) ∩ F ≠ ∅}
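
As a sketch, the ε-closure and the word recognition can be simulated in Python for the example NFA with states q1 to q4 shown earlier. The dictionary encoding of 𝝳 and the function names are assumptions for illustration.

```python
EPS = ""  # label used for ε-transitions in this sketch

# Transition function of the example NFA; missing entries mean ∅
DELTA = {
    ("q1", "0"): {"q3"},
    ("q2", "1"): {"q4"},
    ("q3", "0"): {"q4"},
    ("q3", EPS): {"q2"},
}
START, FINAL = "q1", {"q4"}


def eps_closure(states, delta):
    """Smallest superset of `states` that is closed under ε-transitions."""
    closure, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for succ in delta.get((q, EPS), set()):
            if succ not in closure:
                closure.add(succ)
                stack.append(succ)
    return closure


def extended_delta(states, word, delta):
    """Set of states reachable from `states` by reading `word`."""
    current = eps_closure(states, delta)
    for a in word:
        step = set()
        for q in current:
            step |= delta.get((q, a), set())
        current = eps_closure(step, delta)
    return current


def accepts(word):
    """A word is accepted iff at least one reachable state is final."""
    return bool(extended_delta({START}, word, DELTA) & FINAL)
```

Tracking the whole set of reachable states avoids backtracking: `accepts("00")` and `accepts("01")` both hold, while `0` alone is rejected.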

Deterministic Finite Automaton

• Non-deterministic
• Multiple state transitions may be possible
• |𝝳(q,a)| may be larger than 1 for some q ∈ Q, a ∈ Σ
• To decide word problem: possibly need to backtrack
• Deterministic finite automaton

• At most one state transition possible
• No backtracking
• Answers word problem in linear time (i.e., proportional to the length of the input)

Deterministic Finite Automaton

• Definition deterministic finite automaton:
• A = (Q, Σ, 𝝳, q0, F)
• |𝝳(q,a)| = 1 for all q∈Q, a ∈Σ
• |𝝳(q,ε)| = 0 for all q∈Q
• A DFA is an NFA, whereby
• For each state and input character, there is exactly one state
transition
• There are no autonomous state transitions
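• Extended transition function 𝝳̂: Q ⨯ Σ* → Q (reading 𝝳(q, a) as its single element)
• 𝝳̂(q, ε) := q
• 𝝳̂(q, aw) := 𝝳̂(𝝳(q, a), w)
• Language recognized by DFA

L(A) := {w ∈ Σ* | 𝝳̂(q0, w) ∈ F}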
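
A DFA run is a single loop over the input. The sketch below uses a hypothetical two-state DFA over Σ = {0, 1} that accepts words with an even number of 1s; the automaton and all names are chosen for illustration and do not come from the slides.

```python
# Hypothetical DFA over {0, 1}: accepts words with an even number of 1s
DELTA = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}
START, FINAL = "even", {"even"}


def run(word, delta=DELTA, start=START):
    """Extended transition function: the state reached after reading word."""
    state = start
    for a in word:
        state = delta[(state, a)]  # exactly one successor per state and character
    return state


def accepts(word):
    return run(word) in FINAL
```

Each character is looked at once, so the running time grows linearly with the length of the word.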

Lab C
• Construct a deterministic finite state automaton for the
regular expression
• RE({0,1}) := (0|1)*·0·0
• Given the following finite automaton

• Is it deterministic or non-deterministic?
• What is the language it recognizes?
[Figure: finite automaton with states 0 to 4 and transitions labeled a, b, c; diagram not reproduced]

Subset Construction Method

• For every non-deterministic finite state automaton A,
an equivalent deterministic finite state automaton Ad
can be constructed.
• Theorem: subset construction

• Given an NFA A =(Q, Σ, 𝝳, q0, F)
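• There is an equivalent DFA Ad =(Qd, Σ, 𝝳d, q0d, Fd) with
• Qd := P(Q)
• q0d := ε̂({q0})
• Fd := {T ⊆ Q | T ∩ F ≠ ∅}
• 𝝳d(T, a) := ε̂(⋃ q∈T 𝝳(q, a)) for all T ⊆ Q, a ∈ Σ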

• Using Subset construction method to convert NFA to DFA
involves the following steps:
• For every state in the NFA, determine all reachable states for every
input symbol.
• The set of reachable states constitute a single state in the
converted DFA (Each state in the DFA corresponds to a subset of
states in the NFA).
• Find reachable states for each new DFA state, until no more new
states can be found.

• Example NFA: states 1 to 5, alphabet {a, b}, initial state 1, final state 5 (state diagram not reproduced)

Step 1
Construct a transition table showing all reachable states for every state for every input symbol.

q δ(q,a) δ(q,b)
1 {1,2,3,4,5} {4,5}
2 {3} {5}
3 ∅ {2}
4 {5} {4}
5 ∅ ∅

Step 2
The set of states resulting from every transition function constitutes a new state. Calculate all reachable states for every such state for every input symbol. Start with the initial state {1}.

Step 3
Repeat this process (step 2) until no more new states are reachable.

q δ(q,a) δ(q,b)
{1} {1,2,3,4,5} {4,5}
{1,2,3,4,5} {1,2,3,4,5} {2,4,5}
{4,5} {5} {4}
{2,4,5} {3,5} {4,5}
{5} ∅ ∅
{4} {5} {4}
{3,5} ∅ {2}
∅ ∅ ∅
{2} {3} {5}
{3} ∅ {2}

All sets containing original state 5 are now final states: {1,2,3,4,5}, {4,5}, {2,4,5}, {5}, {3,5}.

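
The worked example can be reproduced with a short Python sketch of the worklist procedure. It assumes an NFA without ε-transitions (as in the example); with ε-transitions, each computed set would additionally need its ε-closure. The dictionary encoding and the function name are assumptions for illustration.

```python
from collections import deque

# Transition table of the example NFA (states 1 to 5, alphabet {a, b})
NFA = {
    1: {"a": {1, 2, 3, 4, 5}, "b": {4, 5}},
    2: {"a": {3}, "b": {5}},
    3: {"a": set(), "b": {2}},
    4: {"a": {5}, "b": {4}},
    5: {"a": set(), "b": set()},
}
ALPHABET = "ab"


def subset_construction(nfa, start, finals, alphabet):
    """Return (DFA transition table, DFA final states) for an ε-free NFA."""
    table = {}
    queue = deque([frozenset({start})])
    while queue:
        current = queue.popleft()
        if current in table:
            continue                      # already processed
        table[current] = {}
        for a in alphabet:
            # Union of the NFA successors of all states in the current set
            target = frozenset(s for q in current for s in nfa[q][a])
            table[current][a] = target
            if target not in table:
                queue.append(target)
    dfa_finals = {t for t in table if t & finals}
    return table, dfa_finals


TABLE, DFA_FINALS = subset_construction(NFA, 1, {5}, ALPHABET)
```

Only the reachable subsets are generated: here 10 DFA states instead of the 32 elements of the full powerset.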
Lab D
• Construct a DFA from the following NFA using the subset
construction method
[Figure: NFA with states 1 to 5 and transitions labeled a, b, c; diagram not reproduced]


Error state
• For DFA, remember:
• |𝝳(q,a)| = 1 for all q∈Q, a ∈Σ
• That was not always the case in previous finite automata

• Extend FAs with error state:
• Add an error state
• For each state
• Add a transition
• With all characters a∈Σ that do not yet have a transition from that state
• Destination: error state
• Error state has a transition to itself labeled with Σ
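
The completion step can be sketched in Python as follows. The partial automaton for the word ab and the name `add_error_state` are hypothetical and serve only as an illustration.

```python
def add_error_state(delta, states, alphabet, err="err"):
    """Return a total copy of delta in which missing transitions lead to err."""
    total = dict(delta)
    for q in list(states) + [err]:
        for a in alphabet:
            # Keep existing transitions, fill in the missing ones
            total.setdefault((q, a), err)
    return total


# Hypothetical partial DFA for the word ab: 0 -a-> 1 -b-> 2
partial = {(0, "a"): 1, (1, "b"): 2}
total = add_error_state(partial, [0, 1, 2], "ab")
```

After completion every state, including the error state itself, has exactly one transition per character.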

Lab E
• Implement a Java program that solves the word problem
for the language recognized by the DFA given below
• Stub implementation: see ILIAS
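[Figure: DFA with states 0 to 4 and error state 5 (err), transitions labeled a, b, c; the error state has a transition to itself labeled Σ; diagram not reproduced]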

Word Problem (summary)

• Thompson’s Construction
• For each regular expression 𝛼∈RE(Σ),
• There is an automaton A(𝛼)∈NFA(Σ)
• Such that A(𝛼), starting from the initial state and consuming a word w∈Σ*, can reach a final state iff w∈[[𝛼]]
• From A(𝛼), an equivalent automaton Ad(𝛼)∈DFA(Σ) can be
constructed using the subset construction method
• The size of the automaton A(𝛼) is:
• O(|𝛼|)
• The size of the automaton Ad(𝛼) is:
• O(2^|𝛼|)
• Complexity of solving the word problem is:
• O(|w|), with Ad(𝛼)

Scanner Generators
• Scanner generators relieve us of the task of hand-coding lexical analysis
• Specify regular expressions for all tokens
• Specify skipped input
• Whitespace
• Comments
• Automatically generate a scanner that provides a stream of tokens

Tokenization
• Lexical analysis problem:
Scanner determines a tokenization (German: Analyse) of w∈Σ+ with respect
to 𝛼1, …, 𝛼n
• Assignment
• Given
• Σ = {a, b}
• Regular expressions: 𝛼1 = a, 𝛼2 = b, 𝛼3 = ab
• w = aab
• Δ := {S1, S2, S3}
• What is a correct tokenization of w?
• v = S1 · S2 · S3 ❌
• v = S1 · S1 · S2 ✅
• v = S1 · S3 ✅
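
All tokenizations of a short word can be enumerated by brute force. The following Python sketch writes 𝛼1, 𝛼2, 𝛼3 as patterns for the standard `re` module; the function name `tokenizations` is an assumption for illustration.

```python
import re

PATTERNS = {"S1": "a", "S2": "b", "S3": "ab"}  # 𝛼1, 𝛼2, 𝛼3 as Python regexes


def tokenizations(w, patterns=PATTERNS):
    """Return every token sequence whose lexemes concatenate to w."""
    if w == "":
        return [[]]
    results = []
    for cut in range(1, len(w) + 1):
        prefix, rest = w[:cut], w[cut:]
        for token, pattern in patterns.items():
            if re.fullmatch(pattern, prefix):
                # Combine this token with every tokenization of the rest
                results += [[token] + tail for tail in tokenizations(rest, patterns)]
    return results
```

For w = aab this yields exactly the two sequences S1 · S1 · S2 and S1 · S3.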

Tokenization
• Generally: tokenization is not unique
• Need additional rules

Longest Match
• Definition longest match
• Given
• A decomposition (German: Zerlegung) (w1, …, wk) of the word w with respect to 𝛼1, …, 𝛼n
• Then this is the longest-match decomposition iff
• For all j∈{1, …, k}, x, y∈Σ* and p, q ∈{1, …, n}
• w = w1·…·wj·x·y ∧ wj∈[[𝛼p]] ∧ wj·x∈[[𝛼q]]
⇒x=ε
• None of the wj, extended by one or more of the following characters, can
be matched by one of 𝛼1, …, 𝛼n

Longest Match
• There is at most one longest-match decomposition
• Proof
• Assume two different longest-match decompositions of the word w:
• (w1_1, …, w1_i) and (w2_1, …, w2_j) with i ≤ j
• w1_1 = w2_1, …, w1_k = w2_k for some k < i
• if w1_{k+1} ≠ w2_{k+1} then w2_{k+1} = w1_{k+1}·x, x ≠ ε (w.l.o.g., without loss of generality; German: ohne Beschränkung der Allgemeinheit)
• This is in contradiction to the definition
• Why does this match our intuition?
• E.g. identifiers may start with characters also forming a keyword
• Example: internal should be an identifier (although it starts with the keyword int)

Longest Match
• Longest-match decomposition is unique
• Is tokenization also unique with longest-match rule?

• No
• We did not require that p = q
• Regular expressions may have overlap: [[𝛼i]]∩[[𝛼j]] ≠ ∅
• For example: keywords typically would also be legal identifiers

Longest Match
• Is there always a longest-match decomposition if there is
any decomposition?
• No
• Proof by counterexample
• Given
• w = bba
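• 𝛼1 = b · a, 𝛼2 = b · b*
• (b, ba) is a decomposition of w, but not a longest-match one, since b·b ∈ [[𝛼2]]
• Starting with the longest match bb leaves a, which is matched by neither 𝛼1 nor 𝛼2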
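
The counterexample can be checked with a greedy Python sketch: at each position it takes the longest prefix of the remaining input that matches any expression, which is what the definition demands. The patterns are written for Python's `re` module; the function name is an assumption for illustration.

```python
import re


def longest_match_decomposition(w, patterns):
    """Return the longest-match decomposition of w as a list of lexemes, or None."""
    lexemes, pos = [], 0
    while pos < len(w):
        # Try the longest candidate first
        for end in range(len(w), pos, -1):
            if any(re.fullmatch(p, w[pos:end]) for p in patterns):
                lexemes.append(w[pos:end])
                pos = end
                break
        else:
            return None  # no prefix of the remaining input matches any expression
    return lexemes
```

For w = bba with the patterns `ba` and `bb*` the result is `None`, although the decomposition (b, ba) exists.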

First Longest Match

• Definition first longest match analysis
• Given
• A longest-match decomposition (w1, …, wk) of the word w with respect to 𝛼1, …, 𝛼n
• Thus: ∀ j∈{1, …, k}. ∃ i_j∈{1, …, n}. wj ∈ [[𝛼i_j]]
• Then
• v = Si_1·…·Si_k with Si_1,…,Si_k∈Δ is the first-longest-match analysis of w
iff
• i_j = min{ m | wj ∈ [[𝛼m]] ( 1 ≤ m ≤ n) } for all j∈{1, …, k}
• There is at most one first-longest-match analysis of w with respect to 𝛼1, …, 𝛼n
• An flm-analysis of w exists iff an lm decomposition exists
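
A direct, unoptimized Python sketch of the definition: take the longest match at each position and, among the expressions matching that lexeme, the one with the lowest index. It uses the `re` module on every candidate prefix instead of automata; the function name and the keyword example are assumptions for illustration.

```python
import re


def flm_analysis(w, patterns):
    """patterns: regexes in priority order (first entry is 𝛼1).

    Returns the list of 1-based indices i_1, ..., i_k, or None if there is
    no longest-match decomposition.
    """
    analysis, pos = [], 0
    while pos < len(w):
        for end in range(len(w), pos, -1):       # longest candidate first
            matching = [i for i, p in enumerate(patterns, start=1)
                        if re.fullmatch(p, w[pos:end])]
            if matching:
                analysis.append(min(matching))   # first-match rule
                pos = end
                break
        else:
            return None
    return analysis
```

With a keyword pattern `int` before an identifier pattern `[a-z]+`, the word `int` is analyzed as token 1 and `internal` as token 2.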

Implementation of FLM-Analysis
• Procedure
1. Generate DFAs Ai for all regular expressions 𝛼i (1 ≤ i ≤ n), with L(Ai) = [[𝛼i]]
2. Generate a product automaton
• Consumes input
• Advances all Ai at once
3. Generate a backtrack DFA
• Uses product automaton
• Advance until one DFA reaches final state (choose the one with highest
priority)
• Remember final state
• Advance until
• Longer match is found (continue as above)
• No more match possible (backtrack to remembered state)

Product Automaton
• Does at least one of multiple DFAs match a word?
• Advance DFAs simultaneously
• Accept if at least one is in a final state
• Example: w = aa (state diagrams not reproduced)
• 𝛼1 = aa*, A1: 1_1 →a 1_2, 1_2 →a 1_2, final state 1_2, all other transitions lead to err1
• 𝛼2 = ab, A2: 2_1 →a 2_2, 2_2 →b 2_3, final state 2_3, all other transitions lead to err2
• 𝛼3 = bb*, A3: 3_1 →b 3_2, 3_2 →b 3_2, final state 3_2, all other transitions lead to err3
• After reading w = aa: A1 is in 1_2, A2 in err2, A3 in err3
• Word is accepted because A1 is in a final state.

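Product Automaton
• Combining multiple DFAs Ai = (Qi, Σ, 𝝳i, q0(i), Fi), 1 ≤ i ≤ n, into
one product automaton A = (Q, Σ, 𝝳, q0, F) with
• Q := Q1 ⨯ … ⨯ Qn
• q0 := (q0(1), …, q0(n))
• 𝝳((q(1),…,q(n)), a) := (𝝳1(q(1),a), …, 𝝳n(q(n),a)) for all a ∈ Σ
• F := {(q(1),…,q(n)) | q(i) ∈ Fi for some 1 ≤ i ≤ n}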

Product Automaton
• Example: product automaton A of A1, A2, A3 (state diagram not reproduced)

state 𝝳(state,a) 𝝳(state,b)
(1_1,2_1,3_1) (1_2,2_2,err3) (err1,err2,3_2)
(1_2,2_2,err3) (1_2,err2,err3) (err1,2_3,err3)
(1_2,err2,err3) (1_2,err2,err3) (err1,err2,err3)
(err1,2_3,err3) (err1,err2,err3) (err1,err2,err3)
(err1,err2,3_2) (err1,err2,err3) (err1,err2,3_2)
(err1,err2,err3) (err1,err2,err3) (err1,err2,err3)

• Do you notice something?

Constructing a Product Automaton

• The formal definition of the product automaton contains
unreachable states
• The product automaton on the previous slide did not
contain unreachable states
• We can construct an equivalent product automaton
without unreachable states by simulating the component
automata


• Construction by simulation for A1, A2, A3 (state diagrams not reproduced)
1. Start with the tuple of initial states: (1_1,2_1,3_1)
2. Add its successors for a and b: (1_2,2_2,err3) and (err1,err2,3_2)
3. Add their successors: (1_2,err2,err3), (err1,2_3,err3) and (err1,err2,err3)
4. Add the remaining transitions; no further new states are reachable
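
The simulation can be sketched in Python as a worklist over state tuples. The component DFAs are encoded as functions with an implicit error state; this encoding and the names `dfa` and `product` are assumptions for illustration.

```python
from collections import deque


def dfa(transitions, err):
    """Total transition function: missing entries lead to the error state."""
    return lambda q, a: transitions.get((q, a), err)


A1 = dfa({("1_1", "a"): "1_2", ("1_2", "a"): "1_2"}, "err1")   # aa*
A2 = dfa({("2_1", "a"): "2_2", ("2_2", "b"): "2_3"}, "err2")   # ab
A3 = dfa({("3_1", "b"): "3_2", ("3_2", "b"): "3_2"}, "err3")   # bb*


def product(components, starts, alphabet):
    """Transition table of the product automaton, reachable states only."""
    table, queue = {}, deque([tuple(starts)])
    while queue:
        state = queue.popleft()
        if state in table:
            continue
        table[state] = {}
        for a in alphabet:
            # Advance all component automata at once
            succ = tuple(step(q, a) for step, q in zip(components, state))
            table[state][a] = succ
            queue.append(succ)
    return table


TABLE = product([A1, A2, A3], ["1_1", "2_1", "3_1"], "ab")
```

The simulation visits 6 state tuples, whereas the full cross product of the three state sets has 3 · 4 · 3 = 36 elements.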

Product Automaton
• Product automaton is in a final state iff at least one
component DFA is in a final state
• Tokenization:
must know index of matching component DFA
• Multiple DFAs may match
• Pick DFA with highest priority (first-match rule) (i.e., lowest index)
• The DFA with highest priority determines membership of the final state
• (i.e., split F into F(1) ⨄ … ⨄ F(n)) with:
(q(1),…,q(n)) ∈ F(i) ⇔ q(i) ∈ Fi and q(j) ∉ Fj for all 1 ≤ j < i
• ⨄: disjoint union (German: disjunkte Vereinigung)

Product Automaton
• Final states of the example product automaton, labeled by component (state diagram not reproduced):
• (1_2,2_2,err3) ∈ F(1)
• (1_2,err2,err3) ∈ F(1)
• (err1,2_3,err3) ∈ F(2)
• (err1,err2,3_2) ∈ F(3)

Backtrack DFA
• Goal: FLM analysis
• Product automaton provides token of matched word
(final states are labeled)
• Need to ensure longest match
• Sometimes continued reading possible after reaching a final state
• Possibly continued reading does not lead to another final state
• After matching
• Continue reading next lexeme
• Remember previous tokens
• Construct a backtrack DFA

• Regular DFA maintains read head in input
• Backtrack DFA additionally maintains backtrack head

Backtrack DFA
• State of backtrack automaton: triple of
• Mode: ({N}∪Δ), either normal mode (N) or the candidate token for which a final state was already reached
• Input band: (Σ*QΣ*), the beginning of the band marks the backtrack head, the state (Q) marks the read head
• Output: (Δ*·{ε,lexerr})
• Initial state: (N,q0w,ε)
• Start in normal mode
• DFA in initial state and full input still to be read
• No output yet
• Definition productive states: P := {q ∈ Q | 𝝳̂(q, w) ∈ F for some w ∈ Σ*} (states from which a final state is reachable)

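Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈Δ*
• Normal mode
• (N, qaw, 𝝈) ⊢ (N, q’w, 𝝈) if q’ ∈ P \ F (successor state productive but not a final state)
• (N, qaw, 𝝈) ⊢ (Si, q’w, 𝝈) if q’ ∈ F(i) (successor state is a final state of component i)
• (N, qaw, 𝝈) ⊢ Output: 𝝈·lexerr if q’ ∉ P (no final state reachable)

Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈Δ*
• Lookahead mode (S ∈ Δ is the remembered token, v ∈ Σ* the lookahead read so far)
• (S, vqaw, 𝝈) ⊢ (S, vaq’w, 𝝈) if q’ ∈ P \ F (successor state productive but not a final state)
• (S, vqaw, 𝝈) ⊢ (Sj, q’w, 𝝈) if q’ ∈ F(j) (successor state is a final state of component j: update remembered token)
• (S, vqaw, 𝝈) ⊢ (N, q0vaw, 𝝈·S) if q’ ∉ P (no final state reachable: add remembered token to output and backtrack)

Backtrack DFA
• Let
• 𝝈 ∈Δ*
• End of input
• (N, q, 𝝈) ⊢ Output: 𝝈·lexerr if q ∈ P \ F (state not a final state)
• (S, q, 𝝈) ⊢ Output: 𝝈·S if q ∈ F (in a final state)
• (S, vaq, 𝝈) ⊢ (N, q0va, 𝝈·S) if q ∈ P \ F (not in a final state, but can backtrack)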

Backtrack DFA
• Two possible outcomes
• (N,q0w,ε) ⊢* Output: 𝝈∈Δ*
⇔𝝈 is a flm-analysis of w with respect to 𝛼1, …, 𝛼n
• (N,q0w,ε) ⊢* Output: 𝝈·lexerr (𝝈∈Δ*)
⇔there is no flm-analysis of w with respect to 𝛼1, …, 𝛼n

Backtrack DFA
• Example: product automaton with states 1 to 6 (state diagram not reproduced)
• Initial state: 1
• Final states: 2 ∈ F(1), 4 ∈ F(1), 5 ∈ F(2), 3 ∈ F(3)
• Transitions:

q 𝝳(q,a) 𝝳(q,b)
1 2 3
2 4 5
3 6 3
4 4 6
5 6 6
6 6 6

1. Analyze: w = aab
(N,1aab,ε) ⊢
(S1,2ab,ε) ⊢
(S1,4b,ε) ⊢
(N,1b,S1) ⊢
(S3,3,S1) ⊢
Output: S1·S3
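
The run on w = aab can be replayed with a compact Python sketch of the backtracking scan. Instead of the input band it keeps two indices (`start` as backtrack head, `pos` as read head), and the mode is represented by the remembered token (`None` for normal mode). The transition table is read off the example automaton; all names are assumptions for illustration.

```python
# Example product automaton: states 1 to 6, state 6 is the error state
DELTA = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 4, (2, "b"): 5,
    (3, "a"): 6, (3, "b"): 3,
    (4, "a"): 4, (4, "b"): 6,
    (5, "a"): 6, (5, "b"): 6,
    (6, "a"): 6, (6, "b"): 6,
}
TOKEN = {2: "S1", 4: "S1", 5: "S2", 3: "S3"}  # final states labeled by component
PRODUCTIVE = {1, 2, 3, 4, 5}                  # a final state is reachable from these


def analyze(w, q0=1):
    """Return the token sequence of w, ending in "lexerr" if the scan fails."""
    output = []
    start = 0                                  # backtrack head
    while start < len(w):
        state, pos = q0, start                 # read head starts at the backtrack head
        token, token_end = None, None          # None corresponds to normal mode N
        while pos < len(w) and DELTA[(state, w[pos])] in PRODUCTIVE:
            state = DELTA[(state, w[pos])]
            pos += 1
            if state in TOKEN:                 # remember the latest final state
                token, token_end = TOKEN[state], pos
        if token is None:
            return output + ["lexerr"]         # no final state was reached
        output.append(token)
        start = token_end                      # backtrack behind the remembered lexeme
    return output
```

`analyze("aab")` yields S1 followed by S3, matching the configuration sequence of the example.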
