Lecture 3

Syntax Analysis
CS2210
Lecture 4
CS2210 Compiler Design 2004/05
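Parser
[Figure: source -> lexical analyzer -> token -> parser -> parse tree -> rest of front end -> IR; the parser asks the lexical analyzer for the next token ("get next token"); a symbol table is shared by the phases]
Parsing = determining whether a string of tokens can be generated by a grammar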

Grammars
Precise, easy-to-understand description of syntax
Context-free grammars -> efficient parsers (automatically!)
Help in translation and error detection
  E.g. attribute grammars
Easier language evolution
  Can add new constructs systematically
Syntax Errors
Many errors are syntactic or exposed by parsing
  E.g. unbalanced ()
Error handling goals:
  Report errors quickly & accurately
  Recover quickly (continue parsing after error)
  Little overhead on parse time

Error Recovery
Panic mode
  Discard tokens until synchronization token found (often ;)
Phrase level
  Local correction: replace a token by another and continue
Error productions
  Encode commonly expected errors in grammar
Global correction
  Find closest input string that is in L(G)
  Too costly in practice
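Panic mode is simple enough to sketch in a few lines. The helper name `panic_mode_skip`, the token list, and the choice of `;` as the only synchronization token are assumptions made for illustration.

```python
def panic_mode_skip(tokens, pos, sync_tokens=frozenset({";"})):
    """Return the position just past the next synchronization token.

    Hypothetical helper: tokens is a list of strings and pos is the index
    at which the parser detected the error.
    """
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1          # discard the offending token
    if pos < len(tokens):
        pos += 1          # consume the synchronization token itself
    return pos


tokens = ["x", "=", "(", "1", "+", ";", "y", "=", "2", ";"]
resume = panic_mode_skip(tokens, 4)   # error detected at the dangling "+"
print(tokens[resume:])                # parsing resumes with the next statement
```

A real parser would usually pick the synchronization set per nonterminal rather than using one global set.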
Context-free Grammars
Precise and easy way to specify the syntactic structure of a programming language
Efficient recognition methods exist
Natural specification of many recursive constructs:
  expr -> expr + expr | term

Context-free Grammar Definition
Terminals T
  Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
Nonterminals N
  Syntactic variables denoting sets of strings of L(G)
  Impose hierarchical structure (e.g., precedence rules)
Start symbol S (S ∈ N)
  Denotes the set of strings of L(G)
Productions P
  Rules that determine how strings are formed
  N -> (N|T)*

Example: Expression Grammar
expr -> expr op expr
expr -> (expr)
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^
Terminals: {id, +, -, *, /, ^, (, )}
Nonterminals: {expr, op}
Start symbol: expr
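One way to hold such a grammar in a program is a mapping from each nonterminal to its right-hand sides. The sketch below is an illustration only (the names `PRODUCTIONS` and `START` are assumptions); it recovers the terminal set as every symbol that has no productions of its own.

```python
# Expression grammar from the slide: nonterminal -> list of right-hand sides.
PRODUCTIONS = {
    "expr": [["expr", "op", "expr"], ["(", "expr", ")"], ["-", "expr"], ["id"]],
    "op": [["+"], ["-"], ["*"], ["/"], ["^"]],
}
START = "expr"

nonterminals = set(PRODUCTIONS)
# Every right-hand-side symbol without productions of its own is a terminal.
terminals = {
    sym
    for alternatives in PRODUCTIONS.values()
    for rhs in alternatives
    for sym in rhs
    if sym not in nonterminals
}

print(sorted(nonterminals))
print(sorted(terminals))
```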
Notational Conventions
Terminals
  a, b, c, ...
  +, -, ...
  , . ; etc.
  0..9
Nonterminals
  expr or <expr>
  A, B, C, ...
  S start symbol (if present) or first nonterminal in production list
Terminal strings
  u, v, ...
Grammar symbol strings
  α, β, ...
Productions
  A -> α
Shorthands & Derivations
E -> E + E | E * E | (E) | - E | <id>
E => -E    read: E derives -E
=>     derives in 1 step
=>*    derives in n (0..) steps
More Definitions
L(G): language generated by G = set of terminal strings derived from S
S =>+ w : w is a sentence of G (w string of terminals)
S =>* α : α is a sentential form of G (string can contain nonterminals)
G and G' are equivalent : L(G) = L(G')
A language generated by a grammar (of the form shown) is called a context-free language
Example
G = ({+, -, *, (, ), <id>}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> <id>})
Sentence: -(<id> + <id>)
Derivation: E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id>+<id>)
Leftmost derivation, i.e. always replace the leftmost nonterminal
Rightmost derivation analogously
Left/right sentential form
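The leftmost derivation above can be replayed mechanically. In this sketch (the function name and the encoding of choices as alternative indices are assumptions), each step rewrites the leftmost nonterminal with the chosen alternative.

```python
GRAMMAR = {
    "E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["-", "E"], ["<id>"]],
}


def leftmost_derivation(start, choices):
    """Apply the chosen alternatives, always to the leftmost nonterminal."""
    form = [start]
    steps = ["".join(form)]
    for choice in choices:
        # Index of the leftmost nonterminal in the current sentential form.
        i = next(k for k, sym in enumerate(form) if sym in GRAMMAR)
        form[i:i + 1] = GRAMMAR[form[i]][choice]
        steps.append("".join(form))
    return steps


# Alternatives used in order: -E, (E), E+E, <id>, <id>
print(" => ".join(leftmost_derivation("E", [3, 2, 0, 4, 4])))
# prints: E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id>+<id>)
```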
Parse Trees
E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id>+<id>)
Parse tree = graphical representation of a derivation ignoring replacement order
          E
        /   \
       -     E
           / | \
          (  E  )
           / | \
          E  +  E
          |     |
        <id>   <id>
Ambiguous Grammars
>= 2 different parse trees for some sentence
  Equivalently, >= 2 leftmost (or rightmost) derivations
Usually want to have unambiguous grammars
  E.g. want just one evaluation order: <id> + <id> * <id> should be parsed as <id> + (<id> * <id>), not (<id> + <id>) * <id>
To keep grammars simple, accept ambiguity and resolve it separately (outside of the grammar)
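Ambiguity can be observed by counting parse trees. The following sketch (a brute-force counter written for this one grammar, not a general parsing method) counts the trees that E -> E + E | E * E | (E) | - E | <id> admits for a token list.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E+E | E*E | (E) | -E | <id>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of trees deriving tokens[i:j] from E.
        if j - i < 1:
            return 0
        if j - i == 1:
            return int(tokens[i] == "<id>")
        total = 0
        if tokens[i] == "-":                           # E -> - E
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":  # E -> ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                  # E -> E op E, op at k
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees(["<id>", "+", "<id>", "*", "<id>"]))  # prints 2
```

The two trees correspond to <id> + (<id> * <id>) and (<id> + <id>) * <id>.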
Expressive Power
CFGs are more powerful than REs
  Can express matching () with CFGs
  Can express most properties desired for programming languages
CFGs cannot express:
  Identifiers declared before used: L = {wcw | w is in (a|b)*}
  Parameter checking (#formals = #actuals): L = {a^n b^m c^n d^m | n >= 1, m >= 1}
Eliminating Ambiguity (1)
Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous. Sentence: if E1 then if E2 then S1 else S2
Derivation 1 (else belongs to the inner if):
  stmt => if expr then stmt
       => if E1 then stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2
Derivation 2 (else belongs to the outer if):
  stmt => if expr then stmt else stmt
       => if E1 then stmt else stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2
Which one do we prefer?
Eliminating Ambiguity (2)
Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous. Sentence: if E1 then if E2 then S1 else S2
Usual rule: match each else with the closest previous unmatched then. Unambiguous grammar:
  stmt -> matched_stmt | unmatched_stmt
  matched_stmt -> if expr then matched_stmt else matched_stmt | other
  unmatched_stmt -> if expr then stmt | if expr then matched_stmt else unmatched_stmt
Left Recursion
If for grammar G there is a derivation A =>+ Aα, for some string α, then G is left recursive
Example:
  S -> Aa | b
  A -> Ac | Sd | ε
Parsing
= determining whether a string of tokens can be generated by a grammar
Two classes based on order in which parse tree is constructed:
Top-down parsing
  Start construction at root of parse tree
Bottom-up parsing
  Start at leaves and proceed to root
Recursive Descent Parsing
A top-down method based on recursive procedures (one for each nonterminal typically)
May have to backtrack when wrong production was picked
Predictive parsing = a recursive descent parsing approach that avoids backtracking

  More efficient
  Uses (limited) lookahead to decide what productions to use

Predictive Parser
Program with a (parsing) procedure for each nonterminal which
  Decides what production to use (based on lookahead in the input)
  Uses a production by mimicking the right side
Predictive Parser Example
type -> simple | ^id | array [simple] of type
simple -> integer | char | num dotdot num

procedure match(t: token);
begin
  if lookahead = t then lookahead := nexttoken
  else error
end;

procedure type;
begin
  if lookahead is in {integer, char, num} then simple
  else if lookahead = ^ then begin
    match(^); match(id)
  end
  else if lookahead = array then begin
    match(array); match([); simple; match(]); match(of); type
  end
  else error
end;
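The same parser can be written as a runnable sketch in Python. The class and function names, the string encoding of tokens, and the "$" end marker are assumptions; the procedure for simple is not shown on the slide and is filled in here following the same pattern.

```python
class ParseError(Exception):
    pass


class TypeParser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks the end of input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead != t:
            raise ParseError(f"expected {t!r}, found {self.lookahead!r}")
        self.pos += 1

    def type_(self):
        # type -> simple | ^id | array [simple] of type
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.lookahead!r} at start of type")

    def simple(self):
        # simple -> integer | char | num dotdot num
        if self.lookahead in ("integer", "char"):
            self.match(self.lookahead)
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.lookahead!r} at start of simple")


def parse_type(tokens):
    """Raise ParseError unless tokens form exactly one type."""
    parser = TypeParser(tokens)
    parser.type_()
    parser.match("$")


parse_type("array [ num dotdot num ] of integer".split())   # accepted
```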
Predictive Parsing Obstacles
Left recursion
  expr -> expr + term
  Procedure body "expr; match(+); term;" leads to infinite recursion
Common prefix
  stmt -> if expr then stmt else stmt | if expr then stmt
  Can't predict production
Solution
  Eliminate left recursion
  Left factoring

Eliminating Left Recursion (1)
Simple case: immediate left recursion:
Replace
  A -> Aα | β
with
  A -> βA'
  A' -> αA' | ε
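The rewrite for immediate left recursion is mechanical. A minimal sketch, assuming right-hand sides are lists of symbols, [] stands for ε, and the primed name A' is not already in use:

```python
def eliminate_immediate(a, alternatives):
    """Rewrite A -> A α | β as A -> β A', A' -> α A' | ε."""
    alphas = [rhs[1:] for rhs in alternatives if rhs[:1] == [a]]
    betas = [rhs for rhs in alternatives if rhs[:1] != [a]]
    if not alphas:                      # no immediate left recursion
        return {a: alternatives}
    a_prime = a + "'"
    return {
        a: [beta + [a_prime] for beta in betas],
        a_prime: [alpha + [a_prime] for alpha in alphas] + [[]],
    }


# expr -> expr + term | term
print(eliminate_immediate("expr", [["expr", "+", "term"], ["term"]]))
```

For the example this yields expr -> term expr' and expr' -> + term expr' | ε.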
Eliminating Left Recursion (2)
Order the nonterminals A1 .. An
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai -> Aj γ
    by the productions Ai -> δ1 γ | δ2 γ | ... | δk γ
    where Aj -> δ1 | δ2 | ... | δk are all current Aj productions
  end
  eliminate immediate left recursion among the Ai productions
end
Example Eliminating Left Recursion
S -> Aa | b
A -> Ac | Sd | ε
Order: S, A
i=1: S has no immediate left recursion, nothing to do
i=2, j=1: Eliminate A -> Sγ
  Replace A -> Sd with A -> Aad | bd, giving A -> Ac | Aad | bd | ε
Eliminate immediate left recursion:
  S -> Aa | b
  A -> bdA' | A'
  A' -> cA' | adA' | ε
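The general algorithm can be sketched as follows and checked against this example. The dictionary encoding ([] for ε) and the primed names are assumptions. The textbook guarantee for this algorithm assumes a grammar without cycles or ε-productions; the example contains A -> ε and still works.

```python
def eliminate_left_recursion(grammar, order):
    """grammar maps each nonterminal to its right-hand sides (lists of
    symbols, [] for ε); order lists the nonterminals A1 .. An."""
    g = {a: [list(rhs) for rhs in alts] for a, alts in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # Replace Ai -> Aj γ by Ai -> δ γ for every current Aj -> δ.
            expanded = []
            for rhs in g[ai]:
                if rhs[:1] == [aj]:
                    expanded.extend(delta + rhs[1:] for delta in g[aj])
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate immediate left recursion among the Ai productions.
        alphas = [rhs[1:] for rhs in g[ai] if rhs[:1] == [ai]]
        betas = [rhs for rhs in g[ai] if rhs[:1] != [ai]]
        if alphas:
            ai_prime = ai + "'"
            g[ai] = [beta + [ai_prime] for beta in betas]
            g[ai_prime] = [alpha + [ai_prime] for alpha in alphas] + [[]]
    return g


grammar = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
result = eliminate_left_recursion(grammar, ["S", "A"])
for head, alternatives in result.items():
    print(head, "->", " | ".join(" ".join(rhs) or "ε" for rhs in alternatives))
```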
Left Factoring
Find longest common prefix and turn the differing remainders into a new nonterminal
  stmt -> if expr then stmt stmt'
  stmt' -> else stmt | ε
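A single left-factoring step can be sketched like this. The function names are hypothetical, [] stands for ε, and only the first group of alternatives that share a leading symbol is factored per call.

```python
def common_prefix(sequences):
    """Longest common prefix of several symbol lists."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(a, alternatives):
    """Perform one left-factoring step on the alternatives of nonterminal a."""
    groups = {}
    for rhs in alternatives:
        groups.setdefault(tuple(rhs[:1]), []).append(rhs)
    for first, group in groups.items():
        if first and len(group) > 1:          # shared, non-empty first symbol
            prefix = common_prefix(group)
            a_prime = a + "'"
            rest = [rhs for rhs in alternatives if rhs not in group]
            return {
                a: [prefix + [a_prime]] + rest,
                a_prime: [rhs[len(prefix):] for rhs in group],
            }
    return {a: alternatives}                  # nothing to factor


stmt = [
    ["if", "expr", "then", "stmt", "else", "stmt"],
    ["if", "expr", "then", "stmt"],
]
print(left_factor("stmt", stmt))
```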
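Transition Diagrams
Create initial and final state
For each production A -> X1 X2 ... Xn create a path from the initial to the final state, with edges labeled X1, X2, ..., Xn
[Figure: transition diagram for nonterminal E, with states including 0, 3, 6 and edges labeled T and +]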
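Non-recursive Predictive Parsers
Avoid recursion for efficiency reasons
Typically built automatically by tools
[Figure: model of a table-driven predictive parser. Input: a + b $. Stack: X Y Z $ (X on top). The predictive parsing program reads the input and the stack, consults parsing table M, and produces the output.]
M[A,a] gives production
  A symbol on stack
  a input symbol (and $)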
Parsing Algorithm
X symbol on top of stack, a current input symbol
Stack contents and remaining input called parser configuration (initially $S on stack and complete input string)
1. If X = a = $: halt and announce success
2. If X = a ≠ $: pop X off stack, advance input to next symbol
3. If X is a nonterminal: use M[X,a], which contains production X -> rhs or error
   Replace X on stack with rhs or call error routine, respectively
   E.g. X -> UVW: replace X with WVU (U on top)
   Output the production (or augment parse tree)
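The three cases translate directly into a loop. In this sketch the table is written out by hand for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id; the encoding ([] for ε, missing entries meaning error) and the function name are assumptions.

```python
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def ll1_parse(tokens, start="E"):
    """Return the list of productions used, or raise SyntaxError."""
    tokens = list(tokens) + ["$"]
    stack = ["$", start]                 # top of stack is the end of the list
    pos, output = 0, []
    while True:
        x, a = stack[-1], tokens[pos]
        if x == a == "$":                # case 1: success
            return output
        if x not in NONTERMINALS:        # case 2: terminal on top of stack
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()
            pos += 1
        else:                            # case 3: consult the table
            rhs = TABLE.get((x, a))
            if rhs is None:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))  # leftmost rhs symbol ends up on top
            output.append((x, rhs))


for head, rhs in ll1_parse(["id", "+", "id", "*", "id"]):
    print(head, "->", " ".join(rhs) or "ε")
```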
Construction of Parsing Table Helpers (1)
First(α) := set of terminals that begin strings derived from α
First(X) = {X} for terminal X
If X -> ε is a production, add ε to First(X)
For X -> Y1 ... Yk, place a in First(X) if a is in First(Yi) and ε is in First(Yj) for j = 1 .. i-1
  If ε is in First(Yj) for all j = 1 .. k, add ε to First(X)
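These rules can be run to a fixed point. A sketch under assumptions: the grammar is the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id, encoded as a dictionary with [] for ε, and ε is represented by the string "ε".

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST of a symbol string, given the FIRST sets of the nonterminals."""
    result = set()
    for sym in symbols:
        sym_first = first.get(sym, {sym})   # First(a) = {a} for a terminal a
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)                         # every symbol can derive ε
    return result


def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:                          # repeat until nothing is added
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first


for nonterminal, terminals in compute_first(GRAMMAR).items():
    print(nonterminal, sorted(terminals))
```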
Construction of Parsing Table Helpers (2)
Follow(A) := set of terminals a that can appear immediately to the right of A in some sentential form, i.e., S =>* αAaβ for some α, β (a can include $)
Place $ in Follow(S), S start symbol, $ right end marker
If there is a production A -> αBβ, put everything in First(β) except ε in Follow(B)
If there is a production A -> αB, or A -> αBβ where ε is in First(β), then everything in Follow(A) is in Follow(B)
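The Follow rules can be iterated the same way. In this sketch the First sets are taken as given (hard-coded for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id); the encoding and names are assumptions.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}


def first_of_string(symbols):
    """First of a symbol string; the empty string yields {ε}."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, start):
    follow = {a: set() for a in grammar}
    follow[start].add("$")                        # rule 1
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, b in enumerate(rhs):
                    if b not in grammar:          # only nonterminals get Follow
                        continue
                    beta_first = first_of_string(rhs[i + 1:])
                    new = beta_first - {EPS}      # rule 2
                    if EPS in beta_first:         # rule 3
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


for nonterminal, terminals in compute_follow(GRAMMAR, "E").items():
    print(nonterminal, sorted(terminals))
```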
Construction Algorithm
Input: Grammar G
Output: Parsing table M
For each production A -> α do
  For each terminal a in FIRST(α), add A -> α to M[A, a]
  If ε is in FIRST(α), add A -> α to M[A, b] for each terminal b in FOLLOW(A) ($ counts as a terminal in this step)
Make each undefined entry in M an error
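Example
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}

Parsing table M (empty entries are errors):
      id          +             *             (           )          $
E     E -> TE'                                E -> TE'
E'                E' -> +TE'                              E' -> ε    E' -> ε
T     T -> FT'                                T -> FT'
T'                T' -> ε       T' -> *FT'                T' -> ε    T' -> ε
F     F -> id                                 F -> (E)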
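The table for this grammar can be built from the FIRST and FOLLOW sets listed above. A sketch under assumptions: the sets are hard-coded, [] stands for ε, and a second production landing in an already filled entry is recorded as a conflict (the later production overwrites the earlier one).

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})   # terminals are their own FIRST
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    table, conflicts = {}, []
    for a, alternatives in grammar.items():
        for rhs in alternatives:
            rhs_first = first_of_string(rhs)
            targets = rhs_first - {EPS}
            if EPS in rhs_first:
                targets |= FOLLOW[a]        # includes "$" where applicable
            for t in targets:
                if (a, t) in table:
                    conflicts.append((a, t))
                table[(a, t)] = rhs
    return table, conflicts


table, conflicts = build_table(GRAMMAR)
print(len(table), "entries;", "no conflicts" if not conflicts else conflicts)
```

An empty conflict list means no entry is multiply defined for this grammar.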
LL(1)
A grammar whose parsing table has no multiply defined entries is said to be LL(1)
  First L = left to right input scanning
  Second L = leftmost derivation
  (1) = 1 token lookahead
Not all grammars can be brought to LL(1) form, i.e., there are languages that do not fall into the LL(1) class
