
Grundlagen des Compilerbaus

Summer Semester 2021

Prof. Christoph Bockisch


Stefan Schulz
(Programmiersprachen und –werkzeuge)

Prof. Christoph Bockisch (bockisch@mathematik.uni-marburg.de) | Programmiersprachen und –werkzeuge



LEXICAL ANALYSIS


Scanner
• Syntax of programming language
• Made up of symbols
• Can be complex

• Symbols, or lexemes
• Sequences of characters
• Recognized by scanner

• Rules for lexemes also called micro syntax


• What is the form of literals or identifiers?
• Which keywords exist?
• Recognized by scanner


Scanner
• Definitions
• Σ: set of characters that may appear in input (alphabet), e.g., ASCII, Unicode
• Σ*: Kleene closure (German: kleenesche Hülle) of Σ, the set of all words that can be formed by concatenating the characters of Σ in any way
• p ∈ Σ*: source code program to be scanned (input)
• L ∈ P(Σ*): a (formal) language is a set of programs over Σ that are accepted; P denotes the power set (German: Potenzmenge)

• Task
• Split source program into sequence of lexemes
• Transform sequence of lexemes to sequence of tokens

• Many different lexemes play the same role
• Group into classes of lexemes: tokens
• Token: <token-name, attribute(s)>


Typical token kinds

• Identifiers
• Sequence of letters and digits
• Start with letter
• Numbers
• Sequence of digits
• Possibly with a sign
• Keywords
• Special characters
• Operators: +, *, …
• Parentheses: (, ), {, }, …
• …
• Compound symbols
• Operators: ==, <=, ++, …
• Whitespace
• Space, tab, newline, etc.
• Special symbols
• Comments
• Pragmas
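
The token kinds above can be written down as regular expressions. The following is a minimal sketch using Python's `re` module; the token names, the keyword set and the function `classify` are assumptions made for illustration and are not part of the lecture.

```python
import re

# Hypothetical token kinds, loosely following the list above.
# Order matters: keywords are listed before identifiers so they win ties.
TOKEN_KINDS = [
    ("KEYWORD",    r"if|else|while"),          # assumed keyword set
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),   # a letter, then letters or digits
    ("NUMBER",     r"[+-]?[0-9]+"),            # digits, possibly with a sign
    ("COMPOUND",   r"==|<=|\+\+"),             # compound symbols
    ("SPECIAL",    r"[+*(){}]"),               # single special characters
    ("WHITESPACE", r"[ \t\n]+"),
]

def classify(lexeme):
    """Return the first token kind whose pattern matches the whole lexeme."""
    for name, pattern in TOKEN_KINDS:
        if re.fullmatch(pattern, lexeme):
            return name
    return None
```

Listing keywords first is one simple way to resolve the overlap between keywords and identifiers.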


Towards implementing a scanner


• Restrict lexemes to regular language

• Definition regular expression


• Constructive definition of regular language
• Given an alphabet Σ
• Then RE(Σ), the set of regular expressions over Σ, defined by
• Λ ∈ RE(Σ): the empty expression is an RE
• ∀a ∈ Σ. a ∈ RE(Σ) : all characters of Σ are an RE
• Given 𝛼,β ∈ RE(Σ) then
1. (𝛼 | β) ∈ RE(Σ) (“or”)
2. (𝛼 · β) ∈ RE(Σ) (“sequence”)
3. (𝛼*) ∈ RE(Σ) (“repetition”)
• Closed under regular composition
• Order of precedence: *, ·, |


Towards implementing a scanner


• Definition [[𝛼]]: language generated by a regular expression 𝛼 ∈ RE(Σ)
1. [[Λ]] := ∅
2. [[a]] := {a} for all a ∈ Σ
3. [[𝛼 | β]] := [[𝛼]] ∪ [[β]]
4. [[𝛼 · β]] := [[𝛼]] · [[β]] (element-wise concatenation)
5. [[𝛼*]] := [[𝛼]]* (closure)
• [[Λ*]] = {ε}
• So ε can be used as an abbreviation for Λ*
• Example: identifier
• 𝛼 = (_ | a | … | z) · (_ | a | … | z | 0 | … | 9)*
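
The semantic rules can be tried out by enumerating [[𝛼]] up to a length bound. This is a minimal sketch; the tuple encoding of expressions and the function name `language` are assumptions made for this example.

```python
from itertools import product

# Regular expressions as nested tuples (an assumed encoding, not from the slides):
#   ("empty",)      the empty expression Λ
#   ("chr", a)      a single character a
#   ("or", r, s)    r | s
#   ("seq", r, s)   r · s
#   ("star", r)     r*

def language(r, max_len):
    """Return all words of [[r]] with length <= max_len."""
    kind = r[0]
    if kind == "empty":
        return set()
    if kind == "chr":
        return {r[1]} if max_len >= 1 else set()
    if kind == "or":
        return language(r[1], max_len) | language(r[2], max_len)
    if kind == "seq":
        left, right = language(r[1], max_len), language(r[2], max_len)
        return {u + v for u, v in product(left, right) if len(u + v) <= max_len}
    if kind == "star":
        base = language(r[1], max_len)
        words = {""}                      # [[r*]] always contains ε
        frontier = {""}
        while frontier:                   # append one more word of [[r]] per round
            frontier = {u + v for u in frontier for v in base
                        if len(u + v) <= max_len} - words
            words |= frontier
        return words
    raise ValueError(f"unknown expression kind: {kind}")
```

For example, `language(("star", ("empty",)), 3)` yields only the empty word, which mirrors [[Λ*]] = {ε}.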


Lab A
• Define regular expressions for the following languages
• Specify the alphabet for each language

• Languages
1. Positive integers without leading zeroes
pint = (1|..|9) . (0|..|9)* | 0
2. Term:
• Product of coefficient and ’x’ raised to the power of exponent
• Coefficient: possibly negative integer
• Exponent: positive integer
• Coefficient may be left out (meaning it is 1)
• Exponent may be left out (meaning it is 1)
• e.g.: x^2, 2x, x

(pint | epsilon) x ^

Lab A
• 1. Positive integers without leading zeroes
• Alphabet:

• Regular Expression:


Lab A
• 2. Term
• Alphabet:

• Regular Expression:


Word Problem
• Given
• A regular expression 𝛼 ∈ RE(Σ)
(respectively a language L = [[𝛼]])
• A word w ∈ Σ*

• Is w ∈ [[𝛼]]?

• Can this question always be answered?


• Yes, for regular languages
• There is an algorithm to answer this question:
construct a finite state machine (finite automaton)


Non-Deterministic Finite Automaton


• State transition diagram with finitely many states

• Definition Non-Deterministic Finite Automaton (NFA)


• A = (Q, Σ, 𝝳, q0, F)
• Q: finite set of states
• Σ: alphabet
• q0 ∈ Q: initial state
• F ⊆ Q: set of final states
• 𝝳: state transition function
• 𝝳: Q ⨯ (Σ∪{ε}) → P(Q)


Non-Deterministic Finite Automaton

• Example: A = (Q, Σ, δ, q0, F)
• Q = {q1, q2, q3, q4}
• Σ = {0, 1}
• q0 = q1
• F = {q4} ⊆ Q
• Transition table for δ (the state diagram shows the same transitions):

q     0       1       ε
q1    {q3}    ∅       ∅
q2    ∅       {q4}    ∅
q3    {q4}    ∅       {q2}
q4    ∅       ∅       ∅

• 00 ∈ L(A)? ✅
• 01 ∈ L(A)? ✅


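Thompson’s construction
• Construct NFA from Regular Expression r ∈ RE(Σ) (Ken Thompson, CACM, 1968)
• One construction rule per case (construction diagrams not reproduced):
• r = a, a ∈ Σ∪{ε}: a single transition labeled a
• r = 𝛼 | β, 𝛼, β ∈ RE(Σ): the automata for 𝛼 and β as alternative branches
• r = 𝛼 · β, 𝛼, β ∈ RE(Σ): the automaton for 𝛼 followed by the automaton for β
• r = 𝛼*, 𝛼 ∈ RE(Σ): the automaton for 𝛼 with ε-transitions for skipping and repeating it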
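
The construction rules can be sketched in code as follows. This is a simplified variant that connects sub-automata with more ε-transitions than the original construction needs; the tuple encoding of expressions and the names `thompson` and `accepts` are assumptions made for this example.

```python
from itertools import count

# Expressions as nested tuples (assumed encoding):
#   ("chr", a), ("eps",), ("or", r, s), ("seq", r, s), ("star", r)

def thompson(r, trans=None, fresh=None):
    """Build an NFA fragment for r; returns (start, accept, transitions).

    transitions is a list of (source, label, target); label None stands for ε.
    """
    if trans is None:
        trans, fresh = [], count()
    start, accept = next(fresh), next(fresh)
    kind = r[0]
    if kind == "chr":                       # r = a
        trans.append((start, r[1], accept))
    elif kind == "eps":                     # r = ε
        trans.append((start, None, accept))
    elif kind == "or":                      # r = α | β: branch into both
        for sub in r[1:]:
            s, a, _ = thompson(sub, trans, fresh)
            trans.append((start, None, s))
            trans.append((a, None, accept))
    elif kind == "seq":                     # r = α · β: chain the fragments
        s1, a1, _ = thompson(r[1], trans, fresh)
        s2, a2, _ = thompson(r[2], trans, fresh)
        trans += [(start, None, s1), (a1, None, s2), (a2, None, accept)]
    elif kind == "star":                    # r = α*: skip or repeat α
        s, a, _ = thompson(r[1], trans, fresh)
        trans += [(start, None, s), (a, None, accept),
                  (start, None, accept), (a, None, s)]
    else:
        raise ValueError(f"unknown expression kind: {kind}")
    return start, accept, trans

def accepts(r, word):
    """Simulate the constructed NFA on word by tracking sets of states."""
    start, accept, trans = thompson(r)

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for p, label, t in trans:
                if p == q and label is None and t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in word:
        current = closure({t for p, label, t in trans
                           if p in current and label == ch})
    return accept in current
```

The number of states grows linearly with the size of the expression, since every case adds exactly two fresh states.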


Extension to Regular Expressions and Thompson’s construction
• Often extensions to regular expressions are used as shortcuts:
• 𝛼+, 𝛼 ∈ RE(Σ), means 𝛼 · 𝛼*
• 𝛼?, 𝛼 ∈ RE(Σ), means 𝛼 | ε

• We can add specific construction rules for r = 𝛼+ and r = 𝛼? to minimize ε-transitions (construction diagrams not reproduced)


Lab B
• Construct an NFA for the regular expressions
1. RE({0,1}) := 1 · 0 · (0 | 1)* | (0 | 1)* · 0 · 0
2. RE({a,b}) := a · b* · a | b · a* · b


Non-Deterministic Finite Automaton


• Recognition of word
• Start at initial state
• Follow transitions defined by 𝝳 with the input (the word)
• If a final state can be reached, the word is recognized by NFA
• Set of all recognized words: recognized language of NFA

• Definition Epsilon-Closure of T ⊆ Q: ε̂(T)
• ε̂(T) is the smallest set with
1. T ⊆ ε̂(T)
2. If q ∈ ε̂(T), then 𝝳(q, ε) ⊆ ε̂(T)


Non-Deterministic Finite Automaton


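• Definition extended transition function δ̂: P(Q) ⨯ Σ* → P(Q) (P(Q): power set of Q)

• Given
• T ⊆ Q
• a ∈ Σ
• w ∈ Σ*
• Then
• δ̂(T, ε) := ε̂(T)
• δ̂(T, w·a) := ε̂(⋃ { 𝝳(q, a) | q ∈ δ̂(T, w) })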


Non-Deterministic Finite Automaton


• Can answer word problem:
If for a given input w the NFA can reach a final state, then
w is in the language

• Language recognized by NFA


L(A) := { w ∈ Σ* | δ̂({q0}, w) ∩ F ≠ ∅ }, where δ̂ is the extended transition function
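
The word problem for an NFA can be decided without backtracking by tracking the set of all states the automaton may be in. The following is a minimal sketch for the example NFA with states q1 to q4; encoding ε as the empty string and the names `DELTA`, `eps_closure` and `accepts` are assumptions made for this example.

```python
# The example NFA from the slides; missing table entries mean the empty set.
DELTA = {
    ("q1", "0"): {"q3"},
    ("q2", "1"): {"q4"},
    ("q3", "0"): {"q4"},
    ("q3", ""):  {"q2"},      # "" encodes an ε-transition (assumed encoding)
}
START, FINAL = "q1", {"q4"}

def eps_closure(states):
    """Smallest superset of states that is closed under ε-transitions."""
    closure, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for t in DELTA.get((q, ""), set()):
            if t not in closure:
                closure.add(t)
                todo.append(t)
    return closure

def accepts(word):
    """Track the set of all states reachable after each input character."""
    current = eps_closure({START})
    for ch in word:
        step = set()
        for q in current:
            step |= DELTA.get((q, ch), set())
        current = eps_closure(step)
    return bool(current & FINAL)
```

Both `accepts("00")` and `accepts("01")` hold, the second one only because of the ε-transition from q3 to q2.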


Deterministic Finite Automaton


• Non-deterministic
• Multiple state transitions may be possible
• |𝝳(q,a)| may be larger than 1
• To decide word problem: possibly need to backtrack

• Deterministic finite automaton


• At most one state transition possible
• No backtracking
• Answers word problem in linear time (i.e., proportional to the length of the input)


Deterministic Finite Automaton


• Definition deterministic finite automaton:
• A = (Q, Σ, 𝝳, q0, F)
• |𝝳(q,a)| = 1 for all q∈Q, a ∈Σ
• |𝝳(q,ε)| = 0 for all q∈Q
• A DFA is an NFA, whereby
• For each state and input character, there is exactly one state
transition
• There are no autonomous state transitions
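• Extended transition function δ̂: Q ⨯ Σ* → Q
• δ̂(q, ε) := q
• δ̂(q, w·a) := q’ with 𝝳(δ̂(q, w), a) = {q’}

• Language recognized by DFA
L(A) := { w ∈ Σ* | δ̂(q0, w) ∈ F }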
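
Running a DFA amounts to one table lookup per input character. The following is a minimal sketch; the automaton (words over {a, b} that end in ab) and all names are hypothetical and chosen only for illustration.

```python
# Hypothetical DFA over Σ = {a, b} accepting all words that end in "ab".
# Every state has exactly one successor per input character (a total function).
DELTA = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "s1", ("s1", "b"): "s2",
    ("s2", "a"): "s1", ("s2", "b"): "s0",
}
START, FINAL = "s0", {"s2"}

def accepts(word):
    """One table lookup per character, so the run time is linear in len(word)."""
    state = START
    for ch in word:
        state = DELTA[(state, ch)]
    return state in FINAL
```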


Lab C
• Construct a deterministic finite state automaton for the
regular expression
• RE({0,1}) := (0|1)*·0·0

• Given the following finite automaton


• Is it deterministic or non-deterministic?
• What is the language it recognizes?
[State diagram: finite automaton with states 0 to 4 and transitions labeled a, b and c]


Subset Construction Method


• For every non-deterministic finite state automaton A,
an equivalent deterministic finite state automaton Ad
can be constructed.

• Theorem: subset construction


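• Given an NFA A = (Q, Σ, 𝝳, q0, F)
• There is an equivalent DFA Ad = (Qd, Σ, 𝝳d, q0d, Fd) with
• Qd := P(Q)
• q0d := ε̂({q0})
• Fd := { T ⊆ Q | T ∩ F ≠ ∅ }
• 𝝳d(T, a) := ε̂(⋃ { 𝝳(q, a) | q ∈ T }) for all T ⊆ Q, a ∈ Σ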

Subset Construction Method


• Using the subset construction method to convert an NFA to a DFA involves the following steps:
• For every state in the NFA, determine all reachable states for every input symbol.
• The set of reachable states constitutes a single state in the converted DFA (each state in the DFA corresponds to a subset of states in the NFA).
• Find reachable states for each new DFA state, until no more new states can be found.


Subset Construction Method

Step 1
Construct a transition table showing all reachable states for every state for every input symbol.

[State diagram: NFA with states 1 to 5 over the alphabet {a, b}; its transitions are listed in the table]

q    δ(q,a)         δ(q,b)
1    {1,2,3,4,5}    {4,5}
2    {3}            {5}
3    ∅              {2}
4    {5}            {4}
5    ∅              ∅


Subset Construction Method

Step 2
Every set of states that results from the transition function constitutes a new state. Calculate all reachable states for every such state for every input symbol.


Subset Construction Method

Step 3
Repeat this process (step 2) until no more new states are reachable. The construction starts with the initial state {1}.

q              δ(q,a)         δ(q,b)
{1}            {1,2,3,4,5}    {4,5}
{1,2,3,4,5}    {1,2,3,4,5}    {2,4,5}
{4,5}          {5}            {4}
{2,4,5}        {3,5}          {4,5}
{5}            ∅              ∅
{4}            {5}            {4}
{3,5}          ∅              {2}
∅              ∅              ∅
{2}            {3}            {5}
{3}            ∅              {2}

All sets containing original state 5 are now final states.

[State diagram of the resulting DFA; its transitions are those of the table]
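
The three steps can be sketched in code for the example NFA with states 1 to 5. This is a minimal sketch that assumes an NFA without ε-transitions; the names `NFA`, `ALPHABET` and `subset_construction` are chosen for this example.

```python
# NFA from the slides (states 1..5, no ε-transitions); missing entries mean ∅.
NFA = {
    (1, "a"): {1, 2, 3, 4, 5}, (1, "b"): {4, 5},
    (2, "a"): {3},             (2, "b"): {5},
    (3, "b"): {2},
    (4, "a"): {5},             (4, "b"): {4},
}
ALPHABET = ("a", "b")

def subset_construction(start, final):
    """Explore only the subsets that are reachable from {start}."""
    initial = frozenset({start})
    table, todo = {}, [initial]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue                       # this DFA state was already expanded
        table[subset] = {}
        for ch in ALPHABET:
            target = frozenset(t for q in subset for t in NFA.get((q, ch), ()))
            table[subset][ch] = target
            todo.append(target)
    # A DFA state is final if it contains at least one final NFA state.
    dfa_final = {s for s in table if s & final}
    return initial, table, dfa_final
```

Calling `subset_construction(1, {5})` explores ten DFA states, including the empty set, which plays the role of an error state.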

Lab D
• Construct a DFA from the following NFA using the subset
construction method

[State diagram: NFA with states 1 to 5 and transitions labeled a, b and c]


Error state
• For DFA, remember:
• |𝝳(q,a)| = 1 for all q∈Q, a ∈Σ

• That was not always the case in previous finite automata


• Extend FAs with error state:
• Add an error state
• For each state
• Add a transition
• With all characters a ∈ Σ that do not yet have a transition from that state
• Destination: error state
• Error state has a self-loop labeled with Σ
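
The extension with an error state can be sketched as a function that completes a partial transition table. The function name `add_error_state`, its parameters and the default state name "err" are assumptions made for this example.

```python
def add_error_state(states, alphabet, delta, error="err"):
    """Return a total copy of the partial transition table delta.

    delta maps (state, character) to a successor state; the name of the
    error state is a parameter chosen for this sketch.
    """
    total = dict(delta)
    for q in list(states) + [error]:
        for ch in alphabet:
            # Missing transitions, and all transitions of err itself, go to err.
            total.setdefault((q, ch), error)
    return total
```

The original table is left unchanged, so the partial and the completed automaton can be compared.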


Lab E
• Implement a Java program that solves the word problem
for the language recognized by the DFA given below
• Stub implementation: see ILIAS

[State diagram: DFA with states 0 to 4 and error state 5 (err), transitions labeled a, b and c; the error state has a self-loop labeled Σ]


Word Problem (summary)


• Thompson’s Construction
• For each regular expression 𝛼∈RE(Σ),
• There is an automaton A(𝛼)∈NFA(Σ)
• Such that,
• when A(𝛼) consumes a word w∈Σ* and, starting from the initial state,
reaches an end state, w∈L(𝛼), and
• w∉ L(𝛼), otherwise
• From A(𝛼), an equivalent automaton Ad(𝛼)∈DFA(Σ) can be
constructed using the subset construction method
• The size of the automaton A(𝛼) is:
• O(|𝛼|)
• The size of the automaton Ad(𝛼) is:
• O(2^|𝛼|)
• Complexity of solving the word problem is:
• O(|w|), with Ad(𝛼)


Scanner Generators
• Scanner generators relieve us of the task of hand-coding lexical analysis

• Specify regular expressions for all tokens

• Specify skipped input
• Whitespace
• Comments

• Automatically generate a scanner that provides a stream of tokens


Tokenization
• Lexical analysis problem:
Scanner determines a tokenization (German: Analyse) of w ∈ Σ+ with respect to 𝛼1, …, 𝛼n

• Assignment
• Given
• Σ = {a, b}
• Regular expressions: 𝛼1 = a, 𝛼2 = b, 𝛼3 = ab
• w = aab
• Δ := {S1, S2, S3}
• What is a correct tokenization of w?
• v = S1 · S2 · S3 ❌ (a · b · ab ≠ aab)
• v = S1 · S1 · S2 ✅ (a · a · b)
• v = S1 · S3 ✅ (a · ab)
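
All tokenizations of a word can be enumerated by trying every prefix against every expression. This is a minimal sketch using Python's `re` module; the names `PATTERNS` and `tokenizations` are assumptions made for this example.

```python
import re

PATTERNS = [("S1", "a"), ("S2", "b"), ("S3", "ab")]   # 𝛼1, 𝛼2, 𝛼3 from the slide

def tokenizations(w):
    """Return every token sequence whose lexemes concatenate to w."""
    if w == "":
        return [[]]
    results = []
    for end in range(1, len(w) + 1):
        for symbol, pattern in PATTERNS:
            if re.fullmatch(pattern, w[:end]):
                # The prefix w[:end] is a lexeme; tokenize the rest recursively.
                for rest in tokenizations(w[end:]):
                    results.append([symbol] + rest)
    return results
```

For w = aab the function returns two token sequences, S1 · S1 · S2 and S1 · S3.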


Tokenization
• Generally: tokenization is not unique

• Need additional rules


Longest Match
• Definition longest match
• Given
• A decomposition (German: Zerlegung) (w1, …, wk) of the word w with respect to 𝛼1, …, 𝛼n
• Then this is the longest-match decomposition iff
• For all j ∈ {1, …, k}, x, y ∈ Σ* and p, q ∈ {1, …, n}
• w = w1·…·wj·x·y ∧ wj ∈ [[𝛼p]] ∧ wj·x ∈ [[𝛼q]] ⇒ x = ε

• None of the wj, extended by characters that follow it, can be matched by one of 𝛼1, …, 𝛼n


Longest Match
• There is at most one longest-match decomposition

• Proof
• Assume two different longest-match decompositions of the word w:
• (w1_1, …, w1_i) and (w2_1, …, w2_j) with i ≤ j
• Let k < i be such that w1_1 = w2_1, …, w1_k = w2_k
• If w1_(k+1) ≠ w2_(k+1), then w2_(k+1) = w1_(k+1)·x with x ≠ ε (w.l.o.g., without loss of generality; German: ohne Beschränkung der Allgemeinheit), because both are prefixes of the same remaining input
• This contradicts the definition: w1_(k+1) can be extended to a longer match

• Why does this match our intuition?
• E.g., identifiers may start with characters that also form a keyword
• Example: internal should be scanned as one identifier


Longest Match
• Longest-match decomposition is unique

• Is tokenization also unique with longest-match rule?


• No
• We did not require that p = q
• Regular expressions may have overlap: [[𝛼i]]∩[[𝛼j]] ≠ ∅
• For example: keywords typically would also be legal identifiers


Longest Match
• Is there always a longest-match decomposition if there is
any decomposition?
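• No

• Proof by counterexample
• Given
• w = bba
• 𝛼1 = b · a, 𝛼2 = b · b*
• The only decomposition is (b, ba), but it is not a longest match: b can be extended to bb ∈ [[𝛼2]]
• After the longest match bb, the remaining a matches neither 𝛼1 nor 𝛼2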


First Longest Match


• Definition first longest match analysis
• Given
• A longest-match decomposition (w1, …, wk) of the word w with respect to 𝛼1, …, 𝛼n
Thus: for every j ∈ {1, …, k} there is an i_j ∈ {1, …, n} with wj ∈ [[𝛼i_j]]
• Then
• v = Si_1·…·Si_k with Si_1, …, Si_k ∈ Δ is the first-longest-match analysis of w iff
• i_j = min{ m | wj ∈ [[𝛼m]], 1 ≤ m ≤ n } for all j ∈ {1, …, k}

• There is at most one first-longest-match (flm) analysis of w with respect to 𝛼1, …, 𝛼n
• An flm-analysis of w exists iff a longest-match (lm) decomposition exists
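
In a longest-match decomposition every wj must be the longest matching prefix of the remaining input, so a greedy scan is enough to compute the flm-analysis. The following is a minimal, deliberately inefficient sketch based on Python's `re` module; the function name `flm_analysis` and the use of Python regex syntax for the 𝛼i are assumptions made for this example.

```python
import re

def flm_analysis(w, patterns):
    """Greedy first-longest-match analysis of w.

    patterns is a list of regular expressions in Python syntax, 𝛼1 first.
    The result uses the token names S1, S2, ... and is None if no
    longest-match decomposition exists.
    """
    tokens, pos = [], 0
    while pos < len(w):
        best_end, best_index = None, None
        for end in range(len(w), pos, -1):          # longest candidate first
            for index, pattern in enumerate(patterns, start=1):
                if re.fullmatch(pattern, w[pos:end]):
                    best_end, best_index = end, index   # lowest index wins
                    break
            if best_end is not None:
                break
        if best_end is None:
            return None                             # no expression matches here
        tokens.append(f"S{best_index}")
        pos = best_end
    return tokens
```

With the expressions ba and bb* the call `flm_analysis("bba", ["ba", "bb*"])` returns None, which reproduces the counterexample: the longest match bb leaves an a that no expression matches.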


Implementation of FLM-Analysis
• Procedure
1. Generate DFAs A1, …, An for all regular expressions 𝛼1, …, 𝛼n with L(Ai) = [[𝛼i]]
2. Generate a product automaton
• Consumes input
• Advances all Ai at once
3. Generate a backtrack DFA
• Uses product automaton
• Advance until one DFA reaches final state (choose the one with highest priority)
• Remember final state
• Advance until
• Longer match is found (continue as above)
• No more match possible (backtrack to remembered state)

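Product Automaton
• Does at least one of multiple DFAs match a word?
• Advance DFAs simultaneously
• Accept if at least one is in a final state
• Example: w = aa
• 𝛼1 = aa*, A1: 1_1 –a→ 1_2, 1_2 –a→ 1_2; initial state 1_1, final state 1_2; all other transitions lead to err1
• 𝛼2 = ab, A2: 2_1 –a→ 2_2, 2_2 –b→ 2_3; initial state 2_1, final state 2_3; all other transitions lead to err2
• 𝛼3 = bb*, A3: 3_1 –b→ 3_2, 3_2 –b→ 3_2; initial state 3_1, final state 3_2; all other transitions lead to err3
• Word is accepted because A1 is in a final state.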

Product Automaton
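• Combining multiple DFAs Ai = (Qi, Σ, 𝝳i, q0(i), Fi), 1 ≤ i ≤ n, into one product automaton A = (Q, Σ, 𝝳, q0, F) with
• Q := Q1 ⨯ … ⨯ Qn
• q0 := (q0(1), …, q0(n))
• F := { (q(1), …, q(n)) | q(i) ∈ Fi for at least one i }
• 𝝳((q(1), …, q(n)), a) := (𝝳1(q(1), a), …, 𝝳n(q(n), a)) for all (q(1), …, q(n)) ∈ Q, a ∈ Σ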


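Product Automaton
• Example: product automaton A of A1, A2 and A3 with initial state (1_1, 2_1, 3_1); the state diagram corresponds to this transition table:

state                 a                     b
(1_1, 2_1, 3_1)       (1_2, 2_2, err3)      (err1, err2, 3_2)
(1_2, 2_2, err3)      (1_2, err2, err3)     (err1, 2_3, err3)
(err1, err2, 3_2)     (err1, err2, err3)    (err1, err2, 3_2)
(1_2, err2, err3)     (1_2, err2, err3)     (err1, err2, err3)
(err1, 2_3, err3)     (err1, err2, err3)    (err1, err2, err3)
(err1, err2, err3)    (err1, err2, err3)    (err1, err2, err3)

Do you notice something?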


Constructing a Product Automaton

• The formal definition of the product automaton contains unreachable states

• The product automaton on the previous slide did not contain unreachable states

• We can construct an equivalent product automaton without unreachable states by simulating the component automata
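
Simulating the component automata can be sketched as a worklist algorithm that only creates product states reachable from the initial tuple. The helper names `complete`, `product_automaton` and `token_index` are assumptions made for this example; `token_index` returns the index of the first component that is in a final state, which is the priority rule used when the final states are split by component.

```python
def complete(partial, error):
    """Component DFA as a function; undefined transitions go to its error state."""
    return lambda q, ch: partial.get((q, ch), error)

# Component DFAs of the running example (state names as on the slides).
A1 = complete({("1_1", "a"): "1_2", ("1_2", "a"): "1_2"}, "err1")   # aa*
A2 = complete({("2_1", "a"): "2_2", ("2_2", "b"): "2_3"}, "err2")   # ab
A3 = complete({("3_1", "b"): "3_2", ("3_2", "b"): "3_2"}, "err3")   # bb*
COMPONENTS = [A1, A2, A3]
FINALS = [{"1_2"}, {"2_3"}, {"3_2"}]

def product_automaton(initial, alphabet="ab"):
    """Build only the product states reachable from the initial tuple."""
    delta, todo = {}, [initial]
    while todo:
        state = todo.pop()
        if state in delta:
            continue                       # already expanded
        delta[state] = {}
        for ch in alphabet:
            # Advance every component DFA by one step on the same character.
            target = tuple(step(q, ch) for step, q in zip(COMPONENTS, state))
            delta[state][ch] = target
            todo.append(target)
    return delta

def token_index(state):
    """Index of the first component in a final state, or None if there is none."""
    for i, (q, final) in enumerate(zip(state, FINALS), start=1):
        if q in final:
            return i
    return None
```

For the example, the full product has 3 · 4 · 3 = 36 states, while `product_automaton(("1_1", "2_1", "3_1"))` creates only the six reachable ones.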


Product Automaton
• Product automaton is in a final state iff at least one component DFA is in a final state

• Tokenization: must know index of matching component DFA
• Multiple DFAs may match
• Pick DFA with highest priority (first-match rule) (i.e., lowest index)
• The DFA with highest priority determines membership of the final state
• (i.e., split F into F(1), …, F(n)) with F = F(1) ⨄ … ⨄ F(n) and
(q(1), …, q(n)) ∈ F(i) ⇔ q(i) ∈ Fi and q(j) ∉ Fj for all 1 ≤ j < i
• ⨄: disjoint union (German: disjunkte Vereinigung)

Product Automaton
• Example: product automaton A with labeled final states
• (1_2, 2_2, err3) ∈ F(1)
• (1_2, err2, err3) ∈ F(1)
• (err1, 2_3, err3) ∈ F(2)
• (err1, err2, 3_2) ∈ F(3)
• (1_1, 2_1, 3_1) and (err1, err2, err3) are not final states


Backtrack DFA
• Goal: FLM analysis
• Product automaton provides token of matched word
(final states are labeled)
• Need to ensure longest match
• Sometimes continued reading possible after reaching a final state
• Possibly continued reading does not lead to another final state
• After matching
• Continue reading next lexeme
• Remember previous tokens

• Construct a backtrack DFA


• Regular DFA maintains read head in input
• Backtrack DFA additionally maintains backtrack head


Backtrack DFA
Either normal mode (N) or
• State of backtrack candidate token for which final
automaton: triple of state was already reached.
• Mode: ({N}∪Δ) Beginning of band marks
• Input band: (Σ*QΣ*) backtrack head, state (Q) marks
read head.
• Output: (Δ*·{ε,lexerr})
• Initial state: (N,q0w,ε)
• Start in normal mode
• DFA in initial state and full input still to be read
• No output yet

• Definition productive states


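Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈ Δ*
• P ⊆ Q: set of productive states
• Normal mode
• (N, q a w, 𝝈) ⊢ (N, q’ w, 𝝈) if q’ ∈ P \ F (successor state productive but not a final state)
• (N, q a w, 𝝈) ⊢ (Si, q’ w, 𝝈) if q’ ∈ F(i) (successor state is a final state of component i)
• (N, q a w, 𝝈) ⊢ Output: 𝝈·lexerr if q’ ∉ P (no final state reachable)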


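Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈ Δ*
• P ⊆ Q: set of productive states, S ∈ Δ: remembered token, v ∈ Σ*: input read since the remembered final state
• Lookahead mode
• (S, v q a w, 𝝈) ⊢ (S, v a q’ w, 𝝈) if q’ ∈ P \ F (successor state productive but not a final state)
• (S, v q a w, 𝝈) ⊢ (Sj, q’ w, 𝝈) if q’ ∈ F(j) (successor state is a final state of component j; update remembered token)
• (S, v q a w, 𝝈) ⊢ (N, q0 v a w, 𝝈·S) if q’ ∉ P (no final state reachable; add remembered token to output and backtrack)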


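Backtrack DFA
• Let
• q’ := 𝝳(q,a)
• 𝝈 ∈ Δ*
• P ⊆ Q: set of productive states, S ∈ Δ: remembered token, v ∈ Σ*

• End of input
• (N, q, 𝝈) ⊢ Output: 𝝈·lexerr if q ∈ P \ F (state not a final state)
• (S, q, 𝝈) ⊢ Output: 𝝈·S if q ∈ F (in a final state)
• (S, v a q, 𝝈) ⊢ (N, q0 v a, 𝝈·S) if q ∈ P \ F (not in a final state, but can backtrack)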


Backtrack DFA
• Two possible outcomes
• (N, q0 w, ε) ⊢* Output: 𝝈 with 𝝈 ∈ Δ*
⇔ 𝝈 is the flm-analysis of w with respect to 𝛼1, …, 𝛼n
• (N, q0 w, ε) ⊢* Output: 𝝈·lexerr with 𝝈 ∈ Δ*
⇔ there is no flm-analysis of w with respect to 𝛼1, …, 𝛼n


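Backtrack DFA
• Example: product DFA with initial state 1, final states 2 ∈ F(1), 4 ∈ F(1), 5 ∈ F(2), 3 ∈ F(3), and error state 6

state   a   b
1       2   3
2       4   5
3       6   3
4       4   6
5       6   6
6       6   6

1. Analyze: w = aab
(N, 1aab, ε) ⊢
(S1, 2ab, ε) ⊢
(S1, 4b, ε) ⊢
(N, 1b, S1) ⊢
(S3, 3, S1) ⊢
Output: S1·S3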
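
The behavior of the backtrack DFA on this example can be sketched as a scanner loop that remembers the last final state and moves the read head back to it. This is a minimal sketch, assuming a non-empty input over {a, b} and the example product DFA with states 1 to 6; the names `DELTA`, `TOKEN`, `PRODUCTIVE` and `scan` are chosen for this example.

```python
# Product DFA of the example: state 1 is initial, state 6 is the error state.
DELTA = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 4, (2, "b"): 5,
    (3, "a"): 6, (3, "b"): 3,
    (4, "a"): 4, (4, "b"): 6,
    (5, "a"): 6, (5, "b"): 6,
    (6, "a"): 6, (6, "b"): 6,
}
TOKEN = {2: "S1", 4: "S1", 5: "S2", 3: "S3"}   # final states with their F(i) label
PRODUCTIVE = {1, 2, 3, 4, 5}                   # states that can still reach a final state

def scan(w):
    """Return the flm-analysis of w, ending in "lexerr" if none exists.

    For this particular DFA every non-empty word over {a, b} has an
    analysis, so the lexerr branch is shown only for completeness.
    """
    output, start = [], 0
    while start < len(w):
        state, pos = 1, start
        remembered = None                      # (token, backtrack position)
        while pos < len(w) and state in PRODUCTIVE:
            state = DELTA[(state, w[pos])]
            pos += 1
            if state in TOKEN:                 # longer match found: update candidate
                remembered = (TOKEN[state], pos)
        if remembered is None:
            return output + ["lexerr"]
        token, start = remembered              # emit token, move read head back
        output.append(token)
    return output
```

For w = aab the scanner first remembers S1 after a, updates the candidate after aa, fails on b, backtracks to the position after aa and then recognizes b as S3.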
