Syntax Analysis

CS2210
Lecture 4

CS2210 Compiler Design 2004/05

Parser
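[Figure: source program -> lexical analyzer -> (token) -> parser -> (parse tree) -> rest of front end -> IR. The parser asks the lexical analyzer for tokens ("get next token"); both use the symbol table.]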

Parsing = determining whether a string of tokens can be generated by a grammar



Grammars

Precise, easy-to-understand description of syntax
Context-free grammars -> efficient parsers (automatically!)
Help in translation and error detection
  E.g., attribute grammars
Easier language evolution
  Can add new constructs systematically

Syntax Errors

Many errors are syntactic or exposed by parsing
  E.g., unbalanced ()
Error handling goals:
  Report errors quickly & accurately
  Recover quickly (continue parsing after error)
  Little overhead on parse time


Error Recovery

Panic mode
  Discard tokens until a synchronization token is found (often ;)
Phrase level
  Local correction: replace a token by another and continue
Error productions
  Encode commonly expected errors in the grammar
Global correction
  Find the closest input string that is in L(G)
  Too costly in practice

Context-free Grammars

Precise and easy way to specify the syntactic structure of a programming language
Efficient recognition methods exist
Natural specification of many recursive constructs:
  expr -> expr + expr | term

Context-free Grammar Definition

Terminals T
  Symbols which form strings of L(G), G a CFG (= tokens in the scanner), e.g. if, else, id
Nonterminals N
  Syntactic variables denoting sets of strings of L(G)
  Impose hierarchical structure (e.g., precedence rules)
Start symbol S (∈ N)
  Denotes the set of strings of L(G)
Productions P
  Rules that determine how strings are formed
  N -> (N|T)*

Example: Expression Grammar


expr -> expr op expr
expr -> (expr)
expr -> - expr
expr -> id
op -> +
op -> -
op -> *
op -> /
op -> ^

Terminals: {id, +, -, *, /, ^, (, )}
Nonterminals: {expr, op}
Start symbol: expr
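As a rough illustration, the 4-tuple (T, N, S, P) from the definition can be written down directly as data. This is only a sketch: the names `Grammar`, `is_well_formed`, and `EXPR` are made up for this example, and the parentheses are listed among the terminals because the production expr -> (expr) uses them.

```python
from typing import NamedTuple


class Grammar(NamedTuple):
    """A context-free grammar G = (T, N, S, P); illustrative container only."""
    terminals: frozenset
    nonterminals: frozenset
    start: str
    productions: tuple  # pairs (head, body); body is a tuple of grammar symbols


def is_well_formed(g: Grammar) -> bool:
    """Check: T and N are disjoint, S is in N, every production is N -> (N|T)*."""
    symbols = g.terminals | g.nonterminals
    if g.terminals & g.nonterminals or g.start not in g.nonterminals:
        return False
    return all(
        head in g.nonterminals and all(sym in symbols for sym in body)
        for head, body in g.productions
    )


# The expression grammar from the slide above.
EXPR = Grammar(
    terminals=frozenset({"id", "+", "-", "*", "/", "^", "(", ")"}),
    nonterminals=frozenset({"expr", "op"}),
    start="expr",
    productions=(
        ("expr", ("expr", "op", "expr")),
        ("expr", ("(", "expr", ")")),
        ("expr", ("-", "expr")),
        ("expr", ("id",)),
        ("op", ("+",)), ("op", ("-",)), ("op", ("*",)),
        ("op", ("/",)), ("op", ("^",)),
    ),
)
```

Storing each right side as a tuple of symbols keeps the empty string representable as `()`, which is convenient once ε-productions appear.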

Notational Conventions

Terminals
  a, b, c, ...
  Operator symbols: +, -, ...
  Punctuation: , . ; etc.
  Digits: 0..9
Nonterminals
  expr or <expr>
  A, B, C, ...
  S: start symbol (if present); otherwise the first nonterminal in the production list
Terminal strings
  u, v, ...
Grammar symbol strings
  α, β, ...
Productions
  A -> α

Shorthands & Derivations


Shorthand for several productions with the same left side:
  E -> E + E | E * E | (E) | - E | <id>
Derivations:
  E => -E    read "E derives -E"
  =>     derives in 1 step
  =>*    derives in n (0..) steps
  =>+    derives in n (1..) steps

More Definitions

L(G), the language generated by G = set of terminal strings derived from S
S =>+ w : w is a sentence of G (w a string of terminals)
S =>* α : α is a sentential form of G (the string can contain nonterminals)
G and G' are equivalent : L(G) = L(G')
A language generated by a grammar (of the form shown) is called a context-free language

Example
G = ({+,-,*,(,),<id>}, {E}, E, {E -> E + E, E -> E * E, E -> (E), E -> - E, E -> <id>})
Sentence: -(<id> + <id>)
Derivation: E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id> + <id>)

Leftmost derivation: always replace the leftmost nonterminal
Rightmost derivation: analogously
Left/right sentential form: a sentential form of a leftmost/rightmost derivation
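The leftmost derivation of -(<id> + <id>) can be replayed mechanically. The sketch below is illustrative only: `leftmost_derivation` and its list-of-right-sides input format are assumptions made for this example, not part of the lecture.

```python
# Grammar from the example: E -> E + E | E * E | ( E ) | - E | <id>
NONTERMINALS = {"E"}
BODIES = {("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("-", "E"), ("<id>",)}


def leftmost_derivation(start, choices):
    """Apply each chosen right side to the leftmost nonterminal.

    Returns the list of sentential forms, starting with the start symbol.
    """
    form = [start]
    steps = [" ".join(form)]
    for body in choices:
        if tuple(body) not in BODIES:
            raise ValueError(f"not a right side of an E production: {body}")
        # Position of the leftmost nonterminal in the current sentential form.
        i = next((k for k, sym in enumerate(form) if sym in NONTERMINALS), None)
        if i is None:
            raise ValueError("no nonterminal left to replace")
        form[i:i + 1] = body
        steps.append(" ".join(form))
    return steps


STEPS = leftmost_derivation("E", [
    ["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["<id>"], ["<id>"],
])
```

Joining `STEPS` with " => " reproduces the derivation on the slide, one sentential form per step. A rightmost derivation would differ only in picking the last nonterminal instead of the first.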

Parse Trees
E => -E => -(E) => -(E+E) => -(<id>+E) => -(<id> + <id>)

Parse tree = graphical representation of a derivation, ignoring replacement order

      E
     / \
    -   E
       /|\
      ( E )
       /|\
      E + E
      |   |
    <id> <id>

Ambiguous Grammars

Ambiguous: >= 2 different parse trees for some sentence
  Equivalently, >= 2 leftmost (or rightmost) derivations
Usually want to have unambiguous grammars
  E.g. want just one evaluation order: <id> + <id> * <id> should be parsed as <id> + (<id> * <id>), not (<id> + <id>) * <id>
To keep grammars simple, one can accept ambiguity and resolve it separately (outside of the grammar)
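One way to see the ambiguity concretely is to count parse trees. The sketch below assumes the grammar fragment E -> E + E | E * E | <id> (parentheses and unary minus left out for brevity); `count_parse_trees` is a hypothetical helper that tries every operator as the root of the tree.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E + E | E * E | <id>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "<id>" else 0
        total = 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # tokens[k] is the operator at the root: E -> E op E.
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


TREES = count_parse_trees(["<id>", "+", "<id>", "*", "<id>"])
```

For <id> + <id> * <id> the count is 2, matching the two groupings shown on the slide. Any count above 1 for some sentence means the grammar is ambiguous.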

Expressive Power

CFGs are more powerful than REs
  Can express matching () with CFGs
  Can express most properties desired for programming languages
CFGs cannot express:
  Identifiers declared before used
    L = {wcw | w is in (a|b)*}
  Parameter checking (#formals = #actuals)
    L = {a^n b^m c^n d^m | n >= 1, m >= 1}

Eliminating Ambiguity (1)


Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous.
Sentence: if E1 then if E2 then S1 else S2

Derivation 1 (else belongs to the inner if):
  stmt => if expr then stmt
       => if E1 then stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2

Derivation 2 (else belongs to the outer if):
  stmt => if expr then stmt else stmt
       => if E1 then stmt else stmt
       => if E1 then if expr then stmt else stmt
       => if E1 then if E2 then stmt else stmt
       => if E1 then if E2 then S1 else stmt
       => if E1 then if E2 then S1 else S2

Which one do we prefer?

Eliminating Ambiguity (2)


Grammar
  stmt -> if expr then stmt
        | if expr then stmt else stmt
        | other
is ambiguous.
Sentence: if E1 then if E2 then S1 else S2

Unambiguous grammar (each else matches the closest unmatched then):
  stmt -> matched_stmt | unmatched_stmt
  matched_stmt -> if expr then matched_stmt else matched_stmt | other
  unmatched_stmt -> if expr then stmt | if expr then matched_stmt else unmatched_stmt

Left Recursion
If for grammar G there is a derivation A =>+ Aα for some string α, then G is left recursive
Example:
  S -> Aa | b
  A -> Ac | Sd | ε

Parsing

Parsing = determining whether a string of tokens can be generated by a grammar
Two classes based on the order in which the parse tree is constructed:
  Top-down parsing
    Start construction at root of parse tree
  Bottom-up parsing
    Start at leaves and proceed to root

Recursive Descent Parsing

A top-down method based on recursive procedures (one for each nonterminal typically)

May have to backtrack when wrong production was picked

Predictive parsing = a recursive descent parsing approach that avoids backtracking


  More efficient
  Uses (limited) lookahead to decide what productions to use

Predictive Parser

Program with a (parsing) procedure for each nonterminal which

  Decides what production to use (based on lookahead in the input)
  Uses a production by mimicking the right side

Predictive Parser Example


type -> simple | ^id | array [simple] of type
simple -> integer | char | num dotdot num

procedure match(t: token);
begin
  if lookahead = t then
    lookahead := nexttoken
  else
    error
end;

procedure type;
begin
  if lookahead is in {integer, char, num} then
    simple
  else if lookahead = ^ then begin
    match(^); match(id)
  end
  else if lookahead = array then begin
    match(array); match([); simple; match(]); match(of); type
  end
  else
    error
end;
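A runnable counterpart of the pseudocode might look like the sketch below. It assumes the input is already a list of token names, uses "$" as an end marker, and adds a procedure for `simple` in the same style. The class and function names (`TypeParser`, `accepts`, `ParseError`) are invented for this example.

```python
class ParseError(Exception):
    pass


class TypeParser:
    """Predictive parser for:
         type   -> simple | ^ id | array [ simple ] of type
         simple -> integer | char | num dotdot num
    """

    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input (assumption)
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.lookahead == t:
            self.pos += 1
        else:
            raise ParseError(f"expected {t!r}, found {self.lookahead!r}")

    def type_(self):
        # One branch per production; the lookahead token selects the branch.
        if self.lookahead in ("integer", "char", "num"):
            self.simple()
        elif self.lookahead == "^":
            self.match("^")
            self.match("id")
        elif self.lookahead == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")

    def simple(self):
        if self.lookahead in ("integer", "char"):
            self.match(self.lookahead)
        elif self.lookahead == "num":
            self.match("num")
            self.match("dotdot")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")


def accepts(tokens):
    """True if the whole token list is derivable from `type`."""
    parser = TypeParser(tokens)
    try:
        parser.type_()
        parser.match("$")
        return True
    except ParseError:
        return False
```

For instance, `accepts("array [ num dotdot num ] of integer".split())` goes through the array branch and then recurses into `type_` for the element type. No backtracking is needed because the first token of each alternative is distinct.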

Predictive Parsing Obstacles

Left recursion
  expr -> expr + term
  The procedure for expr would be: expr; match(+); term;
  Infinite recursion
Common prefix
  stmt -> if expr then stmt else stmt | if expr then stmt
  Can't predict production
Solution
  Eliminate left recursion
  Left factoring

Eliminating Left Recursion (1)

Simple case: immediate left recursion
Replace
  A -> Aα | β
with
  A -> βA'
  A' -> αA' | ε
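The rewrite rule for immediate left recursion is short enough to sketch in code. The representation is an assumption made for this example: each right side is a tuple of symbols, ε is the empty tuple, and the new nonterminal is named by appending a prime. The sketch also assumes there is no production A -> A.

```python
def eliminate_immediate(head, bodies):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
         A  -> b1 A' | ... | bn A'
         A' -> a1 A' | ... | am A' | ε
    Returns a dict mapping nonterminal names to lists of right sides.
    """
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # left-recursive
    betas = [b for b in bodies if not b or b[0] != head]     # all others
    if not alphas:
        return {head: list(bodies)}                           # nothing to do
    new = head + "'"
    return {
        head: [beta + (new,) for beta in betas],
        new: [alpha + (new,) for alpha in alphas] + [()],    # () stands for ε
    }


# expr -> expr + term | term
RESULT = eliminate_immediate("expr", [("expr", "+", "term"), ("term",)])
```

Here `RESULT` describes expr -> term expr' and expr' -> + term expr' | ε, which generates the same strings without left recursion.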

Eliminating Left Recursion (2)


Order the nonterminals A1 .. An
for i := 1 to n do begin
  for j := 1 to i-1 do begin
    replace each production of the form Ai -> Aj γ
    by the productions Ai -> δ1 γ | δ2 γ | ... | δk γ
    where Aj -> δ1 | δ2 | ... | δk are all current Aj productions
  end
  eliminate immediate left recursion among the Ai productions
end

Example Eliminating Left Recursion


S -> Aa | b
A -> Ac | Sd | ε
Order: S, A

i=1: S has no immediate left recursion; nothing to do
i=2, j=1: eliminate A -> Sγ
  Replace A -> Sd with A -> Aad | bd
  A -> Ac | Aad | bd | ε
Eliminate immediate left recursion among the A productions:
  S -> Aa | b
  A -> bdA' | A'
  A' -> cA' | adA' | ε
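The general algorithm can be sketched in the same style and checked against this example. As before, the dict-of-tuples grammar representation, the primed names, and the function names are assumptions for illustration. The algorithm is normally stated for grammars without cycles or ε-productions; the example happens to work despite its ε-production.

```python
def eliminate_immediate(head, bodies):
    """A -> A alpha | beta  becomes  A -> beta A',  A' -> alpha A' | ε."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]
    betas = [b for b in bodies if not b or b[0] != head]
    if not alphas:
        return {head: list(bodies)}
    new = head + "'"
    return {
        head: [beta + (new,) for beta in betas],
        new: [alpha + (new,) for alpha in alphas] + [()],  # () stands for ε
    }


def eliminate_left_recursion(grammar, order):
    """grammar maps each nonterminal to a list of right sides (tuples)."""
    g = {head: list(bodies) for head, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # Replace Ai -> Aj gamma by Ai -> delta gamma for each Aj -> delta.
            expanded = []
            for body in g[ai]:
                if body and body[0] == aj:
                    expanded += [delta + body[1:] for delta in g[aj]]
                else:
                    expanded.append(body)
            g[ai] = expanded
        g.update(eliminate_immediate(ai, g[ai]))
    return g


EXAMPLE = {
    "S": [("A", "a"), ("b",)],
    "A": [("A", "c"), ("S", "d"), ()],
}
RESULT = eliminate_left_recursion(EXAMPLE, ["S", "A"])
```

`RESULT` contains S -> Aa | b, A -> bdA' | A', and A' -> cA' | adA' | ε, the same grammar as derived by hand above.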

Left Factoring

Find the longest common prefix of the alternatives and move what follows it into a new nonterminal
  stmt -> if expr then stmt else stmt | if expr then stmt
becomes
  stmt -> if expr then stmt stmt'
  stmt' -> else stmt | ε
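A single left-factoring step can be sketched as follows. The function names and the tuple representation (with `()` for ε) are assumptions for this example. A full implementation would repeat the step until no two alternatives of any nonterminal share a prefix.

```python
def common_prefix(a, b):
    """Longest common prefix of two tuples of grammar symbols."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return a[:n]


def left_factor_once(head, bodies):
    """Factor out the longest prefix shared by two or more alternatives."""
    best = ()
    for i in range(len(bodies)):
        for j in range(i + 1, len(bodies)):
            prefix = common_prefix(bodies[i], bodies[j])
            if len(prefix) > len(best):
                best = prefix
    if not best:
        return {head: list(bodies)}          # no common prefix: unchanged
    new = head + "'"
    n = len(best)
    suffixes = [b[n:] for b in bodies if b[:n] == best]   # () stands for ε
    others = [b for b in bodies if b[:n] != best]
    return {head: [best + (new,)] + others, new: suffixes}


STMT = [
    ("if", "expr", "then", "stmt", "else", "stmt"),
    ("if", "expr", "then", "stmt"),
    ("other",),
]
RESULT = left_factor_once("stmt", STMT)
```

`RESULT` gives stmt -> if expr then stmt stmt' | other and stmt' -> else stmt | ε, so the decision about the else part is postponed until the common prefix has been read.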

Transition Diagrams

Create an initial and a final state
For each production A -> X1 X2 ... Xn create a path from the initial to the final state, with edges labeled X1, X2, ..., Xn

[Figure: transition diagram for nonterminal E, with edges labeled T and +]

Non-recursive Predictive Parsers


Avoid recursion for efficiency reasons
Typically built automatically by tools

[Figure: model of a table-driven predictive parser. The predictive parsing program reads the input (a + b $), manipulates a stack (X Y Z $, X on top), consults the parsing table M, and produces output.]

M[A,a] gives the production to apply
  A: symbol on the stack
  a: input symbol (or $)

Parsing Algorithm

X: symbol on top of stack, a: current input symbol
Stack contents and remaining input are called the parser configuration (initially $S on the stack and the complete input string)

1. If X = a = $: halt and announce success
2. If X = a ≠ $: pop X off the stack and advance the input to the next symbol
3. If X is a nonterminal: use M[X,a], which contains a production X -> rhs or error
   Replace X on the stack with rhs, or call the error routine, respectively
   E.g. for X -> UVW replace X with WVU (U on top)
   Output the production (or augment the parse tree)
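The three steps translate almost line for line into a loop. The sketch below hard-codes a parsing table for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id, which is the example grammar at the end of this lecture. The dictionary layout and the name `parse` are assumptions for illustration, and a terminal on the stack that does not match the input is treated as an error.

```python
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

# M[X, a] -> right side of the production to apply; () stands for ε.
TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "+"): (), ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}


def parse(tokens, start="E"):
    """Return the productions used, in order (a leftmost derivation)."""
    remaining = list(tokens) + ["$"]
    stack = ["$", start]           # top of stack is the end of the list
    pos = 0
    output = []
    while True:
        x, a = stack[-1], remaining[pos]
        if x == a == "$":
            return output                          # step 1: success
        if x not in NONTERMINALS:
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            stack.pop()                            # step 2: match terminal
            pos += 1
        else:
            rhs = TABLE.get((x, a))                # step 3: consult M[X, a]
            if rhs is None:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))            # leftmost rhs symbol on top
            output.append((x, rhs))


DERIVATION = parse(["id", "+", "id", "*", "id"])
```

Pushing the right side in reverse keeps its first symbol on top of the stack, which is what makes the output a leftmost derivation.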

Construction of Parsing Table Helpers (1)

First(α) := set of terminals that begin strings derived from α (plus ε if α =>* ε)

First(X) = {X} for terminal X
If X -> ε is a production, add ε to First(X)
For X -> Y1 ... Yk:
  place a in First(X) if a is in First(Yi) and ε is in First(Yj) for all j = 1..i-1
  if ε is in First(Yj) for all j = 1..k, add ε to First(X)
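These rules are usually applied repeatedly until no set changes. The sketch below does that for the expression grammar E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id. The grammar encoding (tuples, `()` for an ε right side) and the function names are assumptions for this example.

```python
EPS = "ε"

GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("id",)],
}
TERMINALS = {"+", "*", "(", ")", "id"}


def first_of_string(symbols, first):
    """First of a symbol string Y1 ... Yk, given First of every symbol."""
    result = set()
    for y in symbols:
        result |= first[y] - {EPS}
        if EPS not in first[y]:
            return result          # Yi cannot vanish: later symbols are hidden
    result.add(EPS)                # every Yi can derive ε (or the string is empty)
    return result


def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}           # First(X) = {X} for terminals
    first.update({head: set() for head in grammar})
    changed = True
    while changed:                                 # iterate to a fixed point
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                first[head] |= first_of_string(body, first)
                changed |= len(first[head]) != before
    return first


FIRST = compute_first(GRAMMAR, TERMINALS)
```

The loop stops because sets only grow and there are finitely many terminals.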

Construction of Parsing Table Helpers (2)

Follow(A) := set of terminals a that can appear immediately to the right of A in some sentential form, i.e., S =>* αAaβ for some α, β (a can include $)

Place $ in Follow(S), S start symbol, $ right end marker
If there is a production A -> αBβ, put everything in First(β) except ε in Follow(B)
If there is a production A -> αB, or A -> αBβ where ε is in First(β), then everything in Follow(A) is in Follow(B)
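The Follow rules can be run to a fixed point in the same way. To keep the sketch short, the First sets of the nonterminals of the expression grammar (E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id) are written in by hand. The names and data layout are assumptions for illustration.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("id",)],
}
# First sets of the nonterminals, entered by hand for this grammar.
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}


def first_of_string(symbols):
    result = set()
    for y in symbols:
        fy = FIRST.get(y, {y})     # First of a terminal is the terminal itself
        result |= fy - {EPS}
        if EPS not in fy:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, start):
    follow = {a: set() for a in grammar}
    follow[start].add(END)                         # rule 1: $ in Follow(S)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in grammar:
                        continue                   # only nonterminals have Follow
                    fb = first_of_string(body[i + 1:])
                    new = fb - {EPS}               # rule 2: First(beta) minus ε
                    if EPS in fb:
                        new |= follow[head]        # rule 3: beta can vanish
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, "E")
```

When B is the last symbol of a right side, `body[i + 1:]` is empty, its First set is {ε}, and rule 3 applies, which covers the A -> αB case.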

Construction Algorithm
Input: Grammar G
Output: Parsing table M

For each production A -> α do
  For each terminal a in FIRST(α) add A -> α to M[A,a]
  If ε is in FIRST(α) add A -> α to M[A,b] for each terminal b in FOLLOW(A) ($ counts as a terminal in this step)
Make each undefined entry in M an error entry

Example
E -> TE'
E' -> +TE' | ε
T -> FT'
T' -> *FT' | ε
F -> (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}

[Table: parsing table M with rows E, E', T, T', F and columns id, +, *, (, ), $]
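The table for this example can be filled in mechanically from the FIRST and FOLLOW sets listed above. The sketch below takes those sets as given and applies the construction algorithm. The dictionary layout, the list-valued entries (to make conflicts visible), and the function names are assumptions for illustration.

```python
EPS = "ε"

GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("id",)],
}
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    result = set()
    for y in symbols:
        fy = FIRST.get(y, {y})     # First of a terminal is the terminal itself
        result |= fy - {EPS}
        if EPS not in fy:
            return result
    result.add(EPS)
    return result


def build_table(grammar):
    """M[A, a] -> list of right sides; a missing key is an error entry."""
    table = {}
    for head, bodies in grammar.items():
        for body in bodies:
            fa = first_of_string(body)
            columns = fa - {EPS}
            if EPS in fa:
                columns |= FOLLOW[head]            # includes $ where applicable
            for a in columns:
                table.setdefault((head, a), []).append(body)
    return table


def is_ll1(table):
    """No multiply defined entries."""
    return all(len(entries) == 1 for entries in table.values())


M = build_table(GRAMMAR)
```

Keeping a list per entry means a grammar that is not LL(1) shows up as an entry with more than one right side instead of being silently overwritten.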

LL(1)

A grammar whose parsing table has no multiply defined entries is said to be LL(1)

  First L = left-to-right input scanning
  Second L = leftmost derivation
  (1) = 1 token lookahead

Not all grammars can be brought to LL(1) form, i.e., there are languages that do not fall into the LL(1) class
