CC 02 Lexical Analysis
Summer semester 2021
LEXICAL ANALYSIS
Scanner
• Syntax of programming language
• Made up of symbols
• Can be complex
• Symbols, or lexemes
• Sequences of characters
• Recognized by scanner
Scanner
• Definitions
• Σ: set of characters that may appear in input (alphabet)
e.g., ASCII, Unicode
• p ∈ Σ*: source code program to be scanned (input)
• L ∈ P(Σ*): a (formal) language is a set of programs over Σ that are
accepted
• Task
• Split source program into sequence of lexemes
• Transform sequence of lexemes to sequence of tokens
Lab A
• Define regular expressions for the following languages
• Specify the alphabet for each language
• Languages
1. Positive integers without leading zeroes
pint = (1|…|9) · (0|…|9)* | 0
2. Term:
• Product of coefficient and ’x’ raised to the power of exponent
• Coefficient: possibly negative integer
• Exponent: positive integer
• Coefficient may be left out (meaning it is 1)
• Exponent may be left out (meaning it is 1)
• e.g.: x^2, 2x, x
(pint | epsilon) x ^
Prof. Christoph Bockisch (bockisch@mathematik.uni-marburg.de) | Programmiersprachen und –werkzeuge
Grundlagen des Compilerbaus 10
Lab A
• 1. Positive integers without leading zeroes
• Alphabet:
• Regular Expression:
Lab A
• 2. Term
• Alphabet:
• Regular Expression:
Word Problem
• Given
• A regular expression 𝛼 ∈ RE(Σ)
(respectively a language L = [[𝛼]])
• A word w ∈ Σ*
• Is w ∈ [[𝛼]]?
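The word problem can be tried out with Java's built-in regular expressions. The following is a minimal sketch, not the construction discussed on the slides: `java.util.regex` uses its own matching engine, and the class name `WordProblem` and the translation of Lab A's `pint` into Java regex syntax are assumptions made for illustration.

```java
import java.util.regex.Pattern;

// Sketch: decide the word problem "is w in [[alpha]]?" with java.util.regex.
class WordProblem {
    // Lab A's pint = (1|...|9)·(0|...|9)* | 0 in Java regex syntax (assumed translation).
    static final Pattern PINT = Pattern.compile("[1-9][0-9]*|0");

    // matches() succeeds only if the whole word matches, which corresponds to w ∈ [[alpha]].
    static boolean inLanguage(Pattern alpha, String w) {
        return alpha.matcher(w).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLanguage(PINT, "2021")); // true
        System.out.println(inLanguage(PINT, "007"));  // false: leading zeroes
    }
}
```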
• Example: NFA A with states q1, …, q4; transition table rows for q2 to q4:
 q2   ∅      {q4}   ∅
 q3   {q4}   ∅      {q2}
 q4   ∅      ∅      ∅
• 00 ∈ L(A)? ✅
• 01 ∈ L(A)? ✅
Thompson’s construction
• Construct NFA from regular expression r ∈ RE(Σ) (Ken Thompson, CACM, 1968)
• One NFA fragment per case:
• r = a, a ∈ Σ ∪ {ε}
• r = 𝛼 | β, 𝛼, β ∈ RE(Σ)
• r = 𝛼 · β, 𝛼, β ∈ RE(Σ)
• r = 𝛼*, 𝛼 ∈ RE(Σ)
• r = 𝛼?, 𝛼 ∈ RE(Σ)
[Figures: NFA fragment for each case]
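The five cases can be sketched in Java as follows. This is an illustrative variant under stated assumptions, not the exact fragments drawn on the slides: concatenation links the two fragments with an ε-edge instead of merging states, the class and method names (`Thompson`, `Frag`, `symbol`, `alt`, `concat`, `star`, `opt`) are hypothetical, and records require Java 16 or later. The expression is built by hand (no regex parser), and each fragment must be used only once.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Thompson {
    record Edge(int from, Character label, int to) {} // label == null encodes ε
    record Frag(int start, int accept) {}             // fragment: one start, one accepting state

    final List<Edge> edges = new ArrayList<>();
    int nextState = 0;

    private Frag fresh() { return new Frag(nextState++, nextState++); }
    private void eps(int from, int to) { edges.add(new Edge(from, null, to)); }

    // r = a, a ∈ Σ ∪ {ε} (pass null for ε)
    Frag symbol(Character a) {
        Frag f = fresh();
        edges.add(new Edge(f.start(), a, f.accept()));
        return f;
    }
    // r = α | β
    Frag alt(Frag x, Frag y) {
        Frag f = fresh();
        eps(f.start(), x.start()); eps(f.start(), y.start());
        eps(x.accept(), f.accept()); eps(y.accept(), f.accept());
        return f;
    }
    // r = α · β
    Frag concat(Frag x, Frag y) {
        eps(x.accept(), y.start());
        return new Frag(x.start(), y.accept());
    }
    // r = α*
    Frag star(Frag x) {
        Frag f = fresh();
        eps(f.start(), x.start()); eps(x.accept(), f.accept());
        eps(f.start(), f.accept()); // zero repetitions
        eps(x.accept(), x.start()); // one more repetition
        return f;
    }
    // r = α?
    Frag opt(Frag x) { return alt(x, symbol(null)); }

    // ε-closure of a set of states.
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Edge e : edges) {
                if (e.label() == null && result.contains(e.from()) && result.add(e.to())) {
                    changed = true;
                }
            }
        }
        return result;
    }
    // Simulates the NFA on w by tracking the set of currently possible states.
    boolean accepts(Frag nfa, String w) {
        Set<Integer> current = closure(Set.of(nfa.start()));
        for (char c : w.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (Edge e : edges) {
                if (e.label() != null && e.label() == c && current.contains(e.from())) {
                    next.add(e.to());
                }
            }
            current = closure(next);
        }
        return current.contains(nfa.accept());
    }

    public static void main(String[] args) {
        Thompson t = new Thompson();
        // 1·0·(0|1)* | (0|1)*·0·0
        Frag left = t.concat(t.concat(t.symbol('1'), t.symbol('0')),
                             t.star(t.alt(t.symbol('0'), t.symbol('1'))));
        Frag right = t.concat(t.concat(t.star(t.alt(t.symbol('0'), t.symbol('1'))),
                                       t.symbol('0')), t.symbol('0'));
        Frag nfa = t.alt(left, right);
        System.out.println(t.accepts(nfa, "1011")); // true
        System.out.println(t.accepts(nfa, "0101")); // false
    }
}
```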
Lab B
• Construct an NFA for the regular expressions
1. RE({0,1}) := 1 · 0 · (0 | 1)* | (0 | 1)* · 0 · 0
2. RE({a,b}) := a · b* · a | b · a* · b
• Given
• T⊆Q Power set
• a∈Σ
• w ∈ Σ*
Lab C
• Construct a deterministic finite state automaton for the
regular expression
• RE({0,1}) := (0|1)*·0·0
[Figure: NFA with states 1 to 5 over the alphabet {a, b}]
Step 1
Construct a transition table showing all reachable states for every state and every input symbol.
 q   δ(q,a)        δ(q,b)
 1   {1,2,3,4,5}   {4,5}
 2   {3}           {5}
 3   ∅             {2}
 4   {5}           {4}
 5   ∅             ∅
Step 2
The set of states resulting from each transition constitutes a new state. Calculate all reachable states for every such state and every input symbol.
 T     δ(T,a)   δ(T,b)
 ∅     ∅        ∅
 {2}   {3}      {5}
 {3}   ∅        {2}
 …
Step 3
Repeat this process (step 2) until no more new states are reachable.
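The three steps can be sketched as a worklist algorithm in Java. The NFA table below is the one from Step 1; that state 1 is the start state is an assumption (the figure is not available), and the names `SubsetConstruction`, `row`, `step` and `build` are hypothetical. The NFA has no ε-transitions, so no ε-closure is needed here.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    static Map<Character, Set<Integer>> row(Set<Integer> onA, Set<Integer> onB) {
        return Map.of('a', onA, 'b', onB);
    }

    // NFA transition table from Step 1: state -> (symbol -> successor set).
    static final Map<Integer, Map<Character, Set<Integer>>> NFA = Map.of(
        1, row(Set.of(1, 2, 3, 4, 5), Set.of(4, 5)),
        2, row(Set.of(3), Set.of(5)),
        3, row(Set.of(), Set.of(2)),
        4, row(Set.of(5), Set.of(4)),
        5, row(Set.of(), Set.of()));

    // Successor of DFA state T (a set of NFA states) on c: union of δ(q, c) for all q ∈ T.
    static Set<Integer> step(Set<Integer> t, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int q : t) result.addAll(NFA.get(q).get(c));
        return result;
    }

    // Steps 2 and 3: explore new state sets until no more new ones appear.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(int start) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        work.add(new TreeSet<>(Set.of(start)));
        while (!work.isEmpty()) {
            Set<Integer> t = work.remove();
            if (dfa.containsKey(t)) continue; // already expanded
            Map<Character, Set<Integer>> transitions = new LinkedHashMap<>();
            for (char c : "ab".toCharArray()) {
                Set<Integer> succ = step(t, c);
                transitions.put(c, succ);
                work.add(succ);
            }
            dfa.put(t, transitions);
        }
        return dfa;
    }

    public static void main(String[] args) {
        // Assumption: 1 is the start state of the NFA.
        build(1).forEach((state, tr) -> System.out.println(state + " -> " + tr));
    }
}
```

Under the start-state assumption the exploration ends with ten DFA states, including the empty set, which plays the role of the error state.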
Lab D
• Construct a DFA from the following NFA using the subset
construction method
[Figure: NFA with states 1 to 5 over the alphabet {a, b, c}]
Error state
• For DFA, remember:
• |𝝳(q,a)| = 1 for all q∈Q, a ∈Σ
Lab E
• Implement a Java program that solves the word problem
for the language recognized by the DFA given below
• Stub implementation: see ILIAS
[Figure: DFA with states 0 to 4 and error state 5 (err) over Σ = {a, b, c}]
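A table-driven DFA simulation is one possible shape for such a program. The sketch below is not the stub from ILIAS and does not encode the DFA of this lab, whose transitions are only given in the figure. Instead it uses a hypothetical three-state DFA for Lab C's expression (0|1)*·0·0, where the state counts the trailing zeroes read so far (capped at 2). The class name `DfaWordProblem` is an assumption.

```java
class DfaWordProblem {
    // DELTA[state][symbol]: column 0 is input '0', column 1 is input '1'.
    static final int[][] DELTA = {
        {1, 0}, // state 0: word does not end in 0
        {2, 0}, // state 1: word ends in exactly one 0
        {2, 0}, // state 2: word ends in at least two 0s
    };
    static final int START = 0;
    static final boolean[] FINAL = {false, false, true};

    // Solves the word problem: follow exactly one transition per input symbol.
    static boolean accepts(String w) {
        int q = START;
        for (char c : w.toCharArray()) {
            if (c != '0' && c != '1') return false; // symbol outside Σ: reject
            q = DELTA[q][c - '0'];
        }
        return FINAL[q];
    }

    public static void main(String[] args) {
        System.out.println(accepts("100"));  // true
        System.out.println(accepts("1001")); // false
    }
}
```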
Scanner Generators
• Scanner generators relieve us of the task of hand-coding
lexical analysis
Tokenization
• Lexical analysis problem:
Scanner determines a tokenization of w∈Σ+ with respect
to 𝛼1, …, 𝛼n
• Assignment
• Given
• Σ = {a, b}
• Regular expressions: 𝛼1 = a , 𝛼2 = b, 𝛼3 = ab
• w = aab
• Δ := {S1, S2, S3}
• Terminology: "tokenization" (English) = "Analyse" (German)
• What is a correct tokenization of w?
• v = S1 · S2 · S3 ❌
• v = S1 · S1 · S2 ✅
• v = S1 · S3 ✅
Tokenization
• Generally: tokenization is not unique
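The non-uniqueness can be made visible by enumerating every tokenization. The following sketch does this by brute force for the assignment's expressions 𝛼1 = a, 𝛼2 = b, 𝛼3 = ab, written in Java regex syntax. The class name `Tokenizations` and its methods are hypothetical, and the exhaustive search is only meant for small examples.

```java
import java.util.ArrayList;
import java.util.List;

class Tokenizations {
    // α1 = a, α2 = b, α3 = ab; token Si belongs to ALPHA[i - 1].
    static final String[] ALPHA = {"a", "b", "ab"};

    // Returns all token sequences v such that w splits into lexemes matching v's expressions.
    static List<List<String>> all(String w) {
        List<List<String>> out = new ArrayList<>();
        search(w, 0, new ArrayList<>(), out);
        return out;
    }

    private static void search(String w, int pos, List<String> soFar, List<List<String>> out) {
        if (pos == w.length()) {          // whole word consumed: one complete tokenization
            out.add(new ArrayList<>(soFar));
            return;
        }
        for (int end = pos + 1; end <= w.length(); end++) {
            for (int i = 0; i < ALPHA.length; i++) {
                if (w.substring(pos, end).matches(ALPHA[i])) {
                    soFar.add("S" + (i + 1));
                    search(w, end, soFar, out);
                    soFar.remove(soFar.size() - 1); // undo and try the next split
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(all("aab")); // [[S1, S1, S2], [S1, S3]]
    }
}
```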
Longest Match
• Definition: longest match
• Terminology: "decomposition" (English) = "Zerlegung" (German)
• Given
• A decomposition (w1, …, wk) of the word w with respect to 𝛼1, …, 𝛼n
• Then this is the longest-match decomposition iff
• For all j ∈ {1, …, k}, x, y ∈ Σ* and p, q ∈ {1, …, n}:
  w = w1·…·wj·x·y ∧ wj ∈ [[𝛼p]] ∧ wj·x ∈ [[𝛼q]] ⇒ x = ε
Longest Match
• There is at most one longest-match decomposition
• Proof
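• Assume two different longest-match decompositions of the word w:
• (w1_1, …, w1_i) and (w2_1, …, w2_j) with i ≤ j
• Let k < i be the length of their longest common prefix: w1_1 = w2_1, …, w1_k = w2_k
• Both continue at the same position of w but w1_(k+1) ≠ w2_(k+1), so one is a proper prefix of the other: w2_(k+1) = w1_(k+1)·x, x ≠ ε (w.l.o.g.)
• This contradicts the definition: w1_(k+1) can be extended by x to a longer match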
Longest Match
• Longest-match decomposition is unique
Longest Match
• Is there always a longest-match decomposition if there is
any decomposition?
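• No
• Proof by counterexample
• Given
• w = bba
• 𝛼1 = b·a, 𝛼2 = b·b*
• (b, ba) is a decomposition: b ∈ [[𝛼2]], ba ∈ [[𝛼1]]
• It is not a longest-match decomposition: w1 = b can be extended to bb ∈ [[𝛼2]]
• No decomposition starts with bb: the remaining a matches neither 𝛼1 nor 𝛼2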
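The counterexample can be replayed with a greedy procedure that always takes the longest matching prefix. Because a longest-match decomposition has to take the longest matching prefix at every position, this greedy search finds it whenever it exists. The sketch assumes the expressions are given in Java regex syntax; `LongestMatch` and `decompose` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

class LongestMatch {
    // Returns the longest-match decomposition of w, or empty if there is none.
    static Optional<List<String>> decompose(String w, String... alphas) {
        List<String> parts = new ArrayList<>();
        int pos = 0;
        while (pos < w.length()) {
            int best = -1; // end of the longest prefix of the rest that matches some alpha
            for (int end = pos + 1; end <= w.length(); end++) {
                for (String alpha : alphas) {
                    if (w.substring(pos, end).matches(alpha)) best = end;
                }
            }
            if (best < 0) return Optional.empty(); // no expression matches here
            parts.add(w.substring(pos, best));
            pos = best;
        }
        return Optional.of(parts);
    }

    public static void main(String[] args) {
        System.out.println(decompose("bba", "ba", "bb*"));    // Optional.empty
        System.out.println(decompose("aab", "a", "b", "ab")); // Optional[[a, ab]]
    }
}
```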
Implementation of FLM-Analysis
• Procedure
1. Generate DFAs Ai for all regular expressions 𝛼i (i = 1, …, n)
2. Generate a product automaton
• Consumes input
• Advances all Ai at once
3. Generate a backtrack DFA
• Uses product automaton
• Advance until one DFA reaches final state (choose the one with highest
priority)
• Remember final state
• Advance until
• Longer match is found (continue as above)
• No more match possible (backtrack to remembered state)
Product Automaton
• Does at least one of multiple DFAs match a word?
• Advance DFAs simultaneously
• Accept if at least one is in a final state
• Example: w = aa
• 𝛼1 = aa*, A1 = [Figure: DFA with states 1_1, 1_2 and error state err1]
• 𝛼2 = ab, A2 = [Figure: DFA with states 2_1, 2_2, 2_3 and error state err2]
• 𝛼3 = bb*, A3 = [Figure: DFA with states 3_1, 3_2 and error state err3]
• Word is accepted because A1 is in a final state.
Product Automaton
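• Combining multiple DFAs (Ai) into
one product automaton A
• States of A are tuples of component states; every component advances on the same symbol:
  δ((q1, …, qn), a) := (δ1(q1, a), …, δn(qn, a)) for all a ∈ Σ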
Product Automaton
• Example: product automaton A of A1, A2, A3 (initial state (1_1, 2_1, 3_1))
 state                 a                     b
 (1_1, 2_1, 3_1)       (1_2, 2_2, err3)      (err1, err2, 3_2)
 (1_2, 2_2, err3)      (1_2, err2, err3)     (err1, 2_3, err3)
 (1_2, err2, err3)     (1_2, err2, err3)     (err1, err2, err3)
 (err1, 2_3, err3)     (err1, err2, err3)    (err1, err2, err3)
 (err1, err2, 3_2)     (err1, err2, err3)    (err1, err2, 3_2)
 (err1, err2, err3)    (err1, err2, err3)    (err1, err2, err3)
Product Automaton
• Product automaton is in a final state iff at least one
component DFA is in a final state
• Tokenization:
must know index of matching component DFA
• Multiple DFAs may match
• Pick DFA with highest priority (first-match rule), i.e., lowest index
• I.e., split F into F(1), …, F(n): the DFA with highest priority determines
which F(k) a final state belongs to
Product Automaton
• Example: product automaton A with labeled final states
• (1_2, 2_2, err3) ∈ F(1)
• (1_2, err2, err3) ∈ F(1)
• (err1, 2_3, err3) ∈ F(2)
• (err1, err2, 3_2) ∈ F(3)
• (1_1, 2_1, 3_1) and (err1, err2, err3) are not final states
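Advancing the component DFAs simultaneously and labeling the reached state by priority can be sketched as follows. The component tables are derived from 𝛼1 = aa*, 𝛼2 = ab and 𝛼3 = bb*. The integer encoding is an assumption made for illustration (state k of Ai stands for i_k, 0 stands for the error state of Ai), as are the names `ProductAutomaton`, `run` and `label`. Input is assumed to be over Σ = {a, b}.

```java
import java.util.Arrays;

class ProductAutomaton {
    // DELTA[i][state][symbol] for component A(i+1); symbol 0 = a, 1 = b; state 0 = error state.
    static final int[][][] DELTA = {
        { {0, 0}, {2, 0}, {2, 0} },         // A1 for aa*: 1_1 -a-> 1_2 -a-> 1_2
        { {0, 0}, {2, 0}, {0, 3}, {0, 0} }, // A2 for ab:  2_1 -a-> 2_2 -b-> 2_3
        { {0, 0}, {0, 2}, {0, 2} },         // A3 for bb*: 3_1 -b-> 3_2 -b-> 3_2
    };
    static final int[] FINAL = {2, 3, 2};   // the single final state of each component

    // Advances all components at once and returns the product state reached on w.
    static int[] run(String w) {
        int[] q = {1, 1, 1}; // initial product state (1_1, 2_1, 3_1)
        for (char c : w.toCharArray()) {
            for (int i = 0; i < q.length; i++) q[i] = DELTA[i][q[i]][c - 'a'];
        }
        return q;
    }

    // Returns k with q ∈ F(k), i.e. the lowest-index component in a final state, or 0 if none.
    static int label(int[] q) {
        for (int i = 0; i < q.length; i++) {
            if (q[i] == FINAL[i]) return i + 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        int[] q = run("aa");
        System.out.println(Arrays.toString(q) + " in F(" + label(q) + ")"); // [2, 0, 0] in F(1)
    }
}
```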
Backtrack DFA
• Goal: FLM analysis
• Product automaton provides token of matched word
(final states are labeled)
• Need to ensure longest match
• Sometimes continued reading possible after reaching a final state
• Possibly continued reading does not lead to another final state
• After matching
• Continue reading next lexeme
• Remember previous tokens
Backtrack DFA
• State of backtrack automaton: triple of
• Mode: ({N}∪Δ)
  Either normal mode (N) or candidate token for which a final state was already reached
• Input tape: (Σ*QΣ*)
  Beginning of tape marks backtrack head, state (Q) marks read head
• Output: (Δ*·{ε,lexerr})
• Initial state: (N,q0w,ε)
• Start in normal mode
• DFA in initial state and full input still to be read
• No output yet
Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈ Δ*
• Normal mode: (N, qaw, 𝝈) ⊢
• (N, q’w, 𝝈) if the successor state q’ is productive but not a final state
• (Si, q’w, 𝝈) if q’ ∈ F(i), i.e., q’ is a final state of component i
• Output: 𝝈·lexerr if no final state is reachable from q’
• Lookahead mode, with remembered token T ∈ Δ and lookahead v ∈ Σ*: (T, vqaw, 𝝈) ⊢
• (T, vaq’w, 𝝈) if q’ is productive but not a final state
• (Sj, q’w, 𝝈) if q’ ∈ F(j), i.e., q’ is a final state of component j (update remembered token)
• (N, q0vaw, 𝝈·T) if no final state is reachable from q’ (add remembered token to output and backtrack)
Backtrack DFA
• Two possible outcomes
• (N,q0w,ε) ⊢* Output: 𝝈 (𝝈∈Δ*)
⇔ 𝝈 is an FLM analysis of w with respect to 𝛼1, …, 𝛼n
• (N,q0w,ε) ⊢* Output: 𝝈·lexerr (𝝈∈Δ*)
⇔ there is no FLM analysis of w with respect to 𝛼1, …, 𝛼n
Backtrack DFA
• Example:
1. Analyze: w = aab
[Figure: product DFA with states 1 to 6, initial state 1; 2, 4 ∈ F(1), 5 ∈ F(2), 3 ∈ F(3)]
(N,1aab,ε) ⊢
(S1,2ab,ε) ⊢
(S1,4b,ε) ⊢
(N,1b,S1) ⊢
(S3,3,S1) ⊢
Output: S1·S3
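The run above can be reproduced with a small backtracking scanner. This is a minimal sketch under stated assumptions, not the formal automaton: it keeps plain indices instead of the configuration triples. The transition table for states 1 to 6 is derived from 𝛼1 = aa*, 𝛼2 = ab, 𝛼3 = bb*, with state 6 as the only state from which no final state is reachable. Treating symbols outside {a, b} like a move to state 6 is an added assumption, and the names `BacktrackScanner` and `scan` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BacktrackScanner {
    static final int DEAD = 6; // no final state reachable from here
    // DELTA[state][symbol], symbol 0 = a, 1 = b; row 0 is unused (states start at 1).
    static final int[][] DELTA = {
        {}, {2, 3}, {4, 5}, {6, 3}, {4, 6}, {6, 6}, {6, 6},
    };
    // TOKEN[q] = k if q ∈ F(k), 0 if q is not final: 2, 4 ∈ F(1), 5 ∈ F(2), 3 ∈ F(3).
    static final int[] TOKEN = {0, 0, 1, 3, 1, 2, 0};

    static List<String> scan(String w) {
        List<String> out = new ArrayList<>();
        int start = 0; // backtrack head: beginning of the current lexeme
        while (start < w.length()) {
            int q = 1, pos = start;          // read head
            int token = 0, tokenEnd = start; // remembered token; 0 means normal mode
            while (pos < w.length() && q != DEAD) {
                int sym = w.charAt(pos) - 'a';
                q = (sym == 0 || sym == 1) ? DELTA[q][sym] : DEAD;
                pos++;
                if (TOKEN[q] != 0) {         // final state: update remembered token
                    token = TOKEN[q];
                    tokenEnd = pos;
                }
            }
            if (token == 0) {                // no match for this lexeme at all
                out.add("lexerr");
                return out;
            }
            out.add("S" + token);
            start = tokenEnd;                // backtrack to the end of the remembered match
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("aab")); // [S1, S3]
    }
}
```

For w = aab the scanner reads aa, runs into state 6 on b, emits the remembered S1 and restarts at b, which mirrors the step from (S1,4b,ε) to (N,1b,S1).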