You are on page 1of 16

Outline

Introduction to Parsing • Regular languages revisited


Ambiguity and Syntax Errors • Parser overview

• Context-free grammars (CFG’s)

• Derivations

• Ambiguity

• Syntax errors

Compiler Design 1 (2011) 2

Languages and Automata Limitations of Regular Languages

• Formal languages are very important in CS Intuition: A finite automaton that runs long
– Especially in programming languages enough must repeat states
• A finite automaton cannot remember # of
• Regular languages times it has visited a particular state
– The weakest formal languages widely used • because a finite automaton has finite memory
– Many applications – Only enough to store in which state it is
– Cannot count, except up to a finite limit

• We will also study context-free languages • Many languages are not regular
• E.g., language of balanced parentheses is not
regular: { (i )i | i ≥ 0}

Compiler Design 1 (2011) 3 Compiler Design 1 (2011) 4


The Functionality of the Parser Example

• Input: sequence of tokens from lexer • If-then-else statement


if (x == y) then z =1; else z = 2;
• Output: parse tree of the program • Parser input
IF (ID == ID) THEN ID = INT; ELSE ID = INT;
• Possible parser output

IF-THEN-ELSE

== = =

ID ID ID INT ID INT
Compiler Design 1 (2011) 5 Compiler Design 1 (2011) 6

Comparison with Lexical Analysis The Role of the Parser

• Not all sequences of tokens are programs . . .


• . . . Parser must distinguish between valid and
Phase Input Output invalid sequences of tokens

Lexer Sequence of Sequence of


characters tokens • We need
– A language for describing valid sequences of tokens
Parser Sequence of Parse tree
– A method for distinguishing valid from invalid
tokens
sequences of tokens

Compiler Design 1 (2011) 7 Compiler Design 1 (2011) 8


Context-Free Grammars CFGs (Cont.)

• Many programming language constructs have a • A CFG consists of


recursive structure – A set of terminals T
– A set of non-terminals N
• A STMT is of the form – A start symbol S (a non-terminal)
if COND then STMT else STMT , or – A set of productions
while COND do STMT , or
… Assuming X ∈ N the productions are of the form
• Context-free grammars are a natural notation X→ε , or
for this recursive structure X → Y1 Y2 ... Yn where Yi ∈ N ∪T

Compiler Design 1 (2011) 9 Compiler Design 1 (2011) 10

Notational Conventions Examples of CFGs

• In these lecture notes A fragment of our example language (simplified):


– Non-terminals are written upper-case
– Terminals are written lower-case STMT → if COND then STMT else STMT
– The start symbol is the left-hand side of the first
production
⏐ while COND do STMT
⏐ id = int

Compiler Design 1 (2011) 11 Compiler Design 1 (2011) 12


Examples of CFGs (cont.) The Language of a CFG

Grammar for simple arithmetic expressions: Read productions as replacement rules:

E→ E*E X → Y1 ... Yn
Means X can be replaced by Y1 ... Yn
⏐ E+E X→ε
⏐ (E) Means X can be erased (replaced with empty string)

⏐ id

Compiler Design 1 (2011) 13 Compiler Design 1 (2011) 14

Key Idea The Language of a CFG (Cont.)

(1) Begin with a string consisting of the start More formally, we write
symbol “S”
X1 L X i L X n → X1 L X i−1Y1 LYm X i+1 L X n
(2) Replace any non-terminal X in the string by
a right-hand side of some production if there is a production
X → Y1 LYn X i → Y1 LYm
(3) Repeat (2) until there are no non-terminals in
the string

Compiler Design 1 (2011) 15 Compiler Design 1 (2011) 16


The Language of a CFG (Cont.) The Language of a CFG

Write Let G be a context-free grammar with start



X 1 L X n → Y1 LYm symbol S. Then the language of G is:
if
X 1 L X n → L → L → Y1 LYm
{ ∗
a1 K an | S → a1 K an and every ai is a terminal }
in 0 or more steps

Compiler Design 1 (2011) 17 Compiler Design 1 (2011) 18

Terminals Examples

• Terminals are called so because there are no L(G) is the language of the CFG G
rules for replacing them

• Once generated, terminals are permanent


Strings of balanced parentheses {( ) | i ≥ 0}
i i

Two grammars:
• Terminals ought to be tokens of the language
S → (S ) S → (S )
OR
S → ε | ε

Compiler Design 1 (2011) 19 Compiler Design 1 (2011) 20


Example Example (Cont.)

A fragment of our example language (simplified): Some elements of the our language

STMT → if COND then STMT id = int


⏐ if COND then STMT else STMT if (id == id) then id = int else id = int
⏐ while COND do STMT while (id != id) do id = int
⏐ id = int while (id == id) do while (id != id) do id = int
COND → (id == id) if (id != id) then if (id == id) then id = int else id = int
⏐ (id != id)

Compiler Design 1 (2011) 21 Compiler Design 1 (2011) 22

Arithmetic Example Notes

Simple arithmetic expressions: The idea of a CFG is a big step.


But:
E → E+E | E ∗ E | (E) | id
Some elements of the language: • Membership in a language is just “yes” or “no”;
we also need the parse tree of the input
id id + id
(id) id ∗ id • Must handle errors gracefully

(id) ∗ id id ∗ (id) • Need an implementation of CFG’s (e.g., yacc)

Compiler Design 1 (2011) 23 Compiler Design 1 (2011) 24


More Notes Derivations and Parse Trees

• Form of the grammar is important A derivation is a sequence of productions


– Many grammars generate the same language
– Parsing tools are sensitive to the grammar
S →L →L →L
A derivation can be drawn as a tree
Note: Tools for regular languages (e.g., lex/ML-Lex) – Start symbol is the tree’s root
are also sensitive to the form of the regular
expression, but this is rarely a problem in practice – For a production X → Y1 LYn add children Y1 LYn
to node X

Compiler Design 1 (2011) 25 Compiler Design 1 (2011) 26

Derivation Example Derivation Example (Cont.)

• Grammar
E
E → E+E | E ∗ E | (E) | id E
• String → E+E
E + E
id ∗ id + id → E ∗ E+E
→ id ∗ E + E E * E id
→ id ∗ id + E
id id
→ id ∗ id + id
Compiler Design 1 (2011) 27 Compiler Design 1 (2011) 28
Derivation in Detail (1) Derivation in Detail (2)

E E

E + E
E
E
→ E+E

Compiler Design 1 (2011) 29 Compiler Design 1 (2011) 30

Derivation in Detail (3) Derivation in Detail (4)

E E

E
E E + E E + E
→ E+E
→ E+E
E * E → E ∗ E+E E * E
→ E ∗ E+E
→ id ∗ E + E
id

Compiler Design 1 (2011) 31 Compiler Design 1 (2011) 32


Derivation in Detail (5) Derivation in Detail (6)

E E
E
E
→ E+E
→ E+E E + E E + E
→ E ∗ E+E
→ E ∗ E+E
E * E → id ∗ E + E E * E id
→ id ∗ E + E
→ id ∗ id + E
→ id ∗ id + E id id id id
→ id ∗ id + id
Compiler Design 1 (2011) 33 Compiler Design 1 (2011) 34

Notes on Derivations Left-most and Right-most Derivations

• A parse tree has • The example is a left-most


– Terminals at the leaves derivation
– At each step, replace the E
– Non-terminals at the interior nodes left-most non-terminal
→ E+E
• An in-order traversal of the leaves is the • There is an equivalent → E+id
original input notion of a right-most
derivation → E ∗ E + id
• The parse tree shows the association of → E ∗ id + id
operations, the input string does not
→ id ∗ id + id
Compiler Design 1 (2011) 35 Compiler Design 1 (2011) 36
Right-most Derivation in Detail (1) Right-most Derivation in Detail (2)

E E

E + E
E
E
→ E+E

Compiler Design 1 (2011) 37 Compiler Design 1 (2011) 38

Right-most Derivation in Detail (3) Right-most Derivation in Detail (4)

E E

E
E E + E E + E
→ E+E
→ E+E
id → E+id E * E id
→ E+id
→ E ∗ E + id

Compiler Design 1 (2011) 39 Compiler Design 1 (2011) 40


Right-most Derivation in Detail (5) Right-most Derivation in Detail (6)

E E
E
E
→ E+E
→ E+E E + E E + E
→ E+id
→ E+id
E * E id → E ∗ E + id E * E id
→ E ∗ E + id
→ E ∗ id + id
→ E ∗ id + id id id id
→ id ∗ id + id
Compiler Design 1 (2011) 41 Compiler Design 1 (2011) 42

Derivations and Parse Trees Summary of Derivations

• Note that right-most and left-most • We are not just interested in whether
derivations have the same parse tree s ∈ L(G)
– We need a parse tree for s
• The difference is just in the order in which
branches are added • A derivation defines a parse tree
– But one parse tree may have many derivations

• Left-most and right-most derivations are


important in parser implementation

Compiler Design 1 (2011) 43 Compiler Design 1 (2011) 44


Ambiguity Ambiguity (Cont.)

• Grammar This string has two parse trees


E → E + E | E * E | ( E ) | int
E E
• String
E + E E * E
int * int + int
E * E int int E + E

int int int int

Compiler Design 1 (2011) 45 Compiler Design 1 (2011) 46

Ambiguity (Cont.) Dealing with Ambiguity

• A grammar is ambiguous if it has more than • There are several ways to handle ambiguity
one parse tree for some string
– Equivalently, there is more than one right-most or • Most direct method is to rewrite grammar
left-most derivation for some string
unambiguously
• Ambiguity is bad
– Leaves meaning of some programs ill-defined
E→T+E|T
• Ambiguity is common in programming languages
T → int * T | int | ( E )
– Arithmetic expressions
– IF-THEN-ELSE
• Enforces precedence of * over +

Compiler Design 1 (2011) 47 Compiler Design 1 (2011) 48


Ambiguity: The Dangling Else The Dangling Else: Example

• Consider the following grammar • The expression


if C1 then if C2 then S3 else S4
S → if C then S has two parse trees
| if C then S else S
if if
| OTHER
C1 if S4 C1 if
• This grammar is also ambiguous
C2 S3 C2 S3 S4

• Typically we want the second form


Compiler Design 1 (2011) 49 Compiler Design 1 (2011) 50

The Dangling Else: A Fix The Dangling Else: Example Revisited

• else matches the closest unmatched then • The expression if C1 then if C2 then S3 else S4
• We can describe this in the grammar
if if
S → MIF /* all then are matched */
| UIF /* some then are unmatched */
C1 if C1 if S4
MIF → if C then MIF else MIF
| OTHER
C2 S3 S4 C2 S3
UIF → if C then S
| if C then MIF else UIF • A valid parse tree • Not valid because the
(for a UIF) then expression is not
• Describes the same set of strings a MIF

Compiler Design 1 (2011) 51 Compiler Design 1 (2011) 52


Ambiguity Precedence and Associativity Declarations

• No general techniques for handling ambiguity • Instead of rewriting the grammar


– Use the more natural (ambiguous) grammar
• Impossible to convert automatically an – Along with disambiguating declarations
ambiguous grammar to an unambiguous one
• Most tools allow precedence and associativity
• Used with care, ambiguity can simplify the declarations to disambiguate grammars
grammar
– Sometimes allows more natural definitions • Examples …
– We need disambiguation mechanisms

Compiler Design 1 (2011) 53 Compiler Design 1 (2011) 54

Associativity Declarations Precedence Declarations

• Consider the grammar E → E + E | int • Consider the grammar E → E + E | E * E | int


• Ambiguous: two parse trees of int + int + int – And the string int + int * int

E E
E E
E * E E + E
E + E E + E

E E + E int int E * E
+ E int int E + E

int int int int int int int int


• Precedence declarations: %left +
• Left associativity declaration: %left + %left *
Compiler Design 1 (2011) 55 Compiler Design 1 (2011) 56
Error Handling Syntax Error Handling

• Purpose of the compiler is • Error handler should


– To detect non-valid programs – Report errors accurately and clearly
– To translate the valid ones – Recover from an error quickly
• Many kinds of possible errors (e.g. in C) – Not slow down compilation of valid code

Error kind Example Detected by …


Lexical …$… Lexer • Good error handling is not easy to achieve
Syntax … x *% … Parser
Semantic … int x; y = x(3); … Type checker
Correctness your favorite program Tester/User

Compiler Design 1 (2011) 57 Compiler Design 1 (2011) 58

Approaches to Syntax Error Recovery Error Recovery: Panic Mode

• From simple to complex • Simplest, most popular method


– Panic mode
– Error productions • When an error is detected:
– Automatic local or global correction – Discard tokens until one with a clear role is found
– Continue from there

• Not all are supported by all parser generators • Such tokens are called synchronizing tokens
– Typically the statement or expression terminators

Compiler Design 1 (2011) 59 Compiler Design 1 (2011) 60


Syntax Error Recovery: Panic Mode (Cont.) Syntax Error Recovery: Error Productions

• Consider the erroneous expression • Idea: specify in the grammar known common
(1 + + 2) + 3 mistakes
• Panic-mode recovery: • Essentially promotes common errors to
– Skip ahead to next integer and then continue alternative syntax
• Example:
• (ML)-Yacc: use the special terminal error to – Write 5 x instead of 5 * x
describe how much input to skip – Add the production E → … | E E
E → int | E + E | ( E ) | error int | ( error ) • Disadvantage
– Complicates the grammar

Compiler Design 1 (2011) 61 Compiler Design 1 (2011) 62

Syntax Error Recovery: Past and Present

• Past
– Slow recompilation cycle (even once a day)
– Find as many errors in one cycle as possible
– Researchers could not let go of the topic

• Present
– Quick recompilation cycle
– Users tend to correct one error/cycle
– Complex error recovery is needed less
– Panic-mode seems enough

Compiler Design 1 (2011) 63

You might also like