
Topic #4: Syntactic Analysis

(Parsing)

INF 524 Compiler Construction


Spring 2011
Lexical Analyzer and Parser
[Figure: the lexical analyzer reads the source program and supplies tokens to the parser on demand]
Parser
• Accepts string of tokens from lexical
analyzer (usually one token at a time)
• Verifies whether or not string can be
generated by grammar
• Reports syntax errors (recovers if
possible)
Errors
• Lexical errors (e.g. misspelled word)
• Syntax errors (e.g. unbalanced
parentheses, missing semicolon)
• Semantic errors (e.g. type errors)
• Logical errors (e.g. infinite recursion)
Error Handling
• Report errors clearly and accurately
• Recover quickly if possible
• Poor error recovery may lead to an
avalanche of errors
Error Recovery
• Panic mode: discard tokens one at a time
until a synchronizing token is found
• Phrase-level recovery: Perform local
correction that allows parsing to continue
• Error Productions: Augment grammar to
handle predicted, common errors
• Global correction: Use a complex
algorithm to compute a least-cost sequence
of changes leading to parseable code
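The panic-mode strategy above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides; the function name and the choice of synchronizing tokens are my own.

```python
# Panic-mode recovery: on a syntax error, discard tokens one at a time
# until a synchronizing token appears. Which tokens synchronize is a
# design choice; statement terminators like ';' and '}' are typical.
def panic_mode_recover(tokens, pos, sync_tokens=(";", "}")):
    """Return the index of the first synchronizing token at or after pos
    (or len(tokens) if none is found)."""
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1  # discard one token
    return pos

# Skip a malformed expression and resume at the next ';'
tokens = ["x", "=", "@", "#", ";", "y"]
resume = panic_mode_recover(tokens, 2)
assert tokens[resume] == ";"
```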
Context Free Grammars
• CFGs can represent recursive constructs that
regular expressions cannot
• A CFG consists of:
– Tokens (terminals, symbols)
– Nonterminals (syntactic variables denoting sets of
strings)
– Productions (rules specifying how terminals and
nonterminals can combine to form strings)
– A start symbol (the set of strings it denotes is the
language of the grammar)
Derivations (Part 1)
• One definition of language: the set of
strings that have valid parse trees
• Another definition: the set of strings that
can be derived from the start symbol

E → E + E | E * E | (E) | -E | id
E => -E (read "E derives -E")
E => -E => -(E) => -(id)
Derivations (Part 2)
• αAβ => αγβ if A → γ is a production
and α and β are arbitrary strings of
grammar symbols
• If α1 => α2 => … => αn, we say that α1
derives αn
• => means derives in one step
• *=> means derives in zero or more steps
• +=> means derives in one or more steps
Sentences and Languages
• Let L(G) be the language generated by
the grammar G with start symbol S:
– Strings in L(G) may contain only tokens of G
– A string w is in L(G) if and only if S +=> w
– Such a string w is a sentence of G
• Any language that can be generated by a
CFG is said to be a context-free language
• If two grammars generate the same
language, they are said to be equivalent
Sentential Forms
• If S *=> α, where α may contain
nonterminals, we say that α is a sentential
form of G
• A sentence is a sentential form with no
nonterminals
Leftmost Derivations
• Only the leftmost nonterminal in any sentential
form is replaced at each step
• A leftmost step can be written as wAγ lm=> wδγ
– A → δ is a production
– w consists of only terminals
– γ is a string of grammar symbols
• If α derives β by a leftmost derivation, then we
write α lm*=> β
• If S lm*=> α then we say that α is a left-
sentential form of the grammar
• Analogous terms exist for rightmost derivations
Parse Trees
• A parse tree can be viewed as a graphical
representation of a derivation
• Every parse tree has a unique leftmost
derivation (not true of every sentence)
• An ambiguous grammar has:
– more than one parse tree for at least one
sentence
– more than one leftmost derivation for at least
one sentence
Capability of Grammars
• Can describe most programming language
constructs
• An exception: requiring that variables are
declared before they are used
– Therefore, grammar accepts superset of
actual language
– Later phase (semantic analysis) does type
checking
Regular Expressions vs. CFGs
• Every construct that can be described by
an RE can also be described by a CFG
• Why use REs at all?
– Lexical rules are simpler to describe this way
– REs are often easier to read
– More efficient lexical analyzers can be
constructed
Verifying Grammars
• A proof that a grammar generates a
language has two parts:
– Must show that every string generated by the
grammar is part of the language
– Must show that every string that is part of the
language can be generated by the grammar
• Rarely done for complete programming
languages!
Eliminating Ambiguity (1)

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

This grammar is ambiguous: in the statement below, the else could
match either then

if E1 then if E2 then S1 else S2


Eliminating Ambiguity (2)
[Figure: the two parse trees for the ambiguous if-then-else statement]
Eliminating Ambiguity (3)

stmt → matched
     | unmatched
matched → if expr then matched else matched
        | other
unmatched → if expr then stmt
          | if expr then matched else unmatched
Left Recursion
• A grammar is left recursive if there exists
a nonterminal A with a derivation
A +=> Aα for some string α
• Most top-down parsing methods cannot
handle left-recursive grammars
Eliminating Left Recursion (1)

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

becomes

A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε

Harder case:
S → Aa | b
A → Ac | Sd | ε
Eliminating Left Recursion (2)
• First arrange the nonterminals in some
order A1, A2, … An
• Apply the following algorithm:

for i = 1 to n {
    for j = 1 to i-1 {
        replace each production of the form Ai → Ajγ
        by the productions Ai → δ1γ | δ2γ | … | δkγ,
        where Aj → δ1 | δ2 | … | δk are the current Aj productions
    }
    eliminate the immediate left recursion among the Ai productions
}
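As an illustration, the algorithm can be implemented directly in Python. This is my own sketch, not from the slides; each production is a list of symbols, ε is the empty list, and the grammar is assumed to be suitable for the algorithm (no cycles).

```python
def eliminate_immediate(A, grammar):
    """Rewrite A -> A a | b  as  A -> b A' ; A' -> a A' | eps.
    Productions are lists of symbols; eps is the empty list."""
    alphas = [p[1:] for p in grammar[A] if p and p[0] == A]
    betas = [p for p in grammar[A] if not p or p[0] != A]
    if not alphas:
        return  # no immediate left recursion
    A2 = A + "'"
    grammar[A] = [b + [A2] for b in betas]
    grammar[A2] = [a + [A2] for a in alphas] + [[]]

def eliminate_left_recursion(grammar, order):
    """The two-loop algorithm from the slide, visiting nonterminals
    in the given order."""
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            # Replace Ai -> Aj gamma using the current Aj-productions
            new = []
            for p in grammar[Ai]:
                if p and p[0] == Aj:
                    new.extend(d + p[1:] for d in grammar[Aj])
                else:
                    new.append(p)
            grammar[Ai] = new
        eliminate_immediate(Ai, grammar)
    return grammar

# The "harder case":  S -> Aa | b ;  A -> Ac | Sd | eps
g = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
eliminate_left_recursion(g, ["S", "A"])
# S is unchanged; A becomes A -> bdA' | A'  with  A' -> cA' | adA' | eps
assert g["A"] == [["b", "d", "A'"], ["A'"]]
```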
Left Factoring
• Rewriting productions to delay decisions
• Helpful for predictive parsing
• Not guaranteed to remove ambiguity

A → αβ1 | αβ2

becomes

A → αA'
A' → β1 | β2
Limitations of CFGs
• Cannot verify repeated strings
– Example: L1 = {wcw | w is in (a|b)*}
– Abstracts checking that variables are declared
• Cannot verify repeated counts
– Example: L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
– Abstracts checking that the numbers of formal
and actual parameters are equal
• Therefore, some checks put off until
semantic analysis
Top Down Parsing
• Can be viewed two ways:
– Attempt to find leftmost derivation for input
string
– Attempt to create a parse tree, starting at
the root, creating nodes in preorder
• General form is recursive-descent parsing
– May require backtracking
– Backtracking parsers are used infrequently
because they are rarely needed
Predictive Parsing
• A special case of recursive-descent
parsing that does not require backtracking
• Must always know which production to use
based on current input symbol
• Can often create appropriate grammar:
– removing left-recursion
– left factoring the resulting grammar
Transition Diagrams
• For parser:
– One diagram for each nonterminal
– Edge labels can be tokens or nonterminals
• A transition on a token means we should take that
transition if token is next input symbol
• A transition on a nonterminal can be thought of as
a call to a procedure for that nonterminal
• As opposed to lexical analyzers:
– One (or more) diagrams for each token
– Labels are symbols of input alphabet
Creating Transition Diagrams
• First eliminate left recursion from grammar
• Then left factor grammar
• For each nonterminal A:
– Create an initial and final state
– For every production A → X1X2…Xn, create a
path from initial to final state with edges
labeled X1, X2, …, Xn
Using Transition Diagrams
• Predictive parsers:
– Start at start symbol of grammar
– From state s with edge to state t labeled with token
a, if next input token is a:
• State changes to t
• Input cursor moves one position right
– If edge labeled by nonterminal A:
• State changes to start state for A
• Input cursor is not moved
• If final state of A reached, then state changes to t
– If edge labeled by ε, state changes to t
• Can be recursive or non-recursive using stack
Transition Diagram Example

Original grammar:         After transformation:
E → E + T | T             E  → TE'
T → T * F | F             E' → +TE' | ε
F → (E) | id              T  → FT'
                          T' → *FT' | ε
                          F  → (E) | id

[Figure: transition diagrams for E, E', T, T', and F]
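The transformed grammar maps directly onto a recursive-descent predictive parser: one procedure per nonterminal, each following that nonterminal's transition diagram. An illustrative Python sketch (class and method names are mine, not from the slides):

```python
# Recursive-descent predictive parser for:
#   E -> T E'    E' -> + T E' | eps
#   T -> F T'    T' -> * F T' | eps
#   F -> ( E ) | id
class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]   # end marker
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t}, got {self.peek()}")
        self.pos += 1                # advance the input cursor

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):                # E' -> + T E' | eps
        if self.peek() == "+":
            self.match("+"); self.T(); self.Eprime()
        # otherwise take the eps edge: consume nothing

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):                # T' -> * F T' | eps
        if self.peek() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):                     # F -> ( E ) | id
        if self.peek() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

    def parse(self):
        self.E(); self.match("$")
        return True

assert Parser(["id", "+", "id", "*", "id"]).parse()
```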
Simplifying Transition Diagrams
[Figure: the transition diagrams for E' and E, combined and simplified]
Nonrecursive Predictive Parsing (1)

[Figure: model of a table-driven predictive parser (input buffer, stack, parsing table, output)]
Nonrecursive Predictive Parsing (2)
• Program considers X, the symbol on top of
the stack, and a, the next input symbol
• If X = a = $, parser halts successfully
• If X = a ≠ $, parser pops X off stack and
advances to next input symbol
• If X is a nonterminal, the program consults
M[X, a] (production or error entry)
Nonrecursive Predictive Parsing (3)
• Initialize stack with start symbol of
grammar
• Initialize input pointer to first symbol of
input
• After consulting parsing table:
– If entry is production, parser replaces top
entry of stack with right side of production
(leftmost symbol on top)
– Otherwise, an error recovery routine is called
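This parsing loop can be sketched directly in Python. An illustrative sketch, not from the slides; the table entries are those of the expression grammar used throughout, and ε-productions are stored as empty lists.

```python
# Table-driven predictive parsing for the running expression grammar.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens):
    """Return the list of productions used (a leftmost derivation)."""
    stack = ["$", "E"]                 # start symbol on top
    inp = tokens + ["$"]
    i, output = 0, []
    while stack:
        X, a = stack.pop(), inp[i]
        if X == a == "$":
            return output              # success
        if X not in NONTERMINALS:      # terminal: must match the input
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            i += 1
        else:                          # consult M[X, a]
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise SyntaxError(f"error entry M[{X}, {a}]")
            output.append((X, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
    raise SyntaxError("stack exhausted before input")

steps = predictive_parse(["id", "+", "id", "*", "id"])
assert steps[0] == ("E", ["T", "E'"])   # first step: E -> TE'
```

The trace this produces matches the stack/input/output table on the next slide.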
Predictive Parsing Table

Nonterminal  id        +           *           (         )        $
E            E → TE'                           E → TE'
E'                     E' → +TE'                         E' → ε   E' → ε
T            T → FT'                           T → FT'
T'                     T' → ε      T' → *FT'             T' → ε   T' → ε
F            F → id                            F → (E)
Using a Predictive Parsing Table

Stack     Input       Output
$E        id+id*id$
$E'T      id+id*id$   E → TE'
$E'T'F    id+id*id$   T → FT'
$E'T'id   id+id*id$   F → id
$E'T'     +id*id$
$E'       +id*id$     T' → ε
$E'T+     +id*id$     E' → +TE'
$E'T      id*id$
$E'T'F    id*id$      T → FT'
$E'T'id   id*id$      F → id
$E'T'     *id$
$E'T'F*   *id$        T' → *FT'
$E'T'F    id$
$E'T'id   id$         F → id
$E'T'     $
$E'       $           T' → ε
$         $           E' → ε

FIRST
• FIRST(α) is the set of all terminals that begin
any string derived from α
• Computing FIRST:
– If X is a terminal, FIRST(X) = {X}
– If X → ε is a production, add ε to FIRST(X)
– If X is a nonterminal and X → Y1Y2…Yn is a
production:
• For all terminals a, add a to FIRST(X) if a is a member of
some FIRST(Yi) and ε is a member of FIRST(Y1),
FIRST(Y2), …, FIRST(Yi-1)
• If ε is a member of FIRST(Y1), FIRST(Y2), …,
FIRST(Yn), add ε to FIRST(X)
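The FIRST rules can be computed as a fixed point, repeating until no set grows. An illustrative Python sketch (names and representation are mine: productions are lists of symbols, ε is the empty list in productions and the empty string in sets):

```python
# Fixed-point computation of FIRST for the running expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)

def compute_first(grammar, nonterminals):
    first = {A: set() for A in nonterminals}
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                before = len(first[A])
                all_eps = True              # can Y1...Yn derive eps?
                for Y in rhs:
                    f = first[Y] if Y in nonterminals else {Y}
                    first[A] |= f - {""}
                    if "" not in f:
                        all_eps = False
                        break
                if all_eps:                 # covers X -> eps as well
                    first[A].add("")
                if len(first[A]) != before:
                    changed = True
    return first

FIRST = compute_first(GRAMMAR, NTS)
assert FIRST["E"] == {"(", "id"}
```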
FOLLOW
• FOLLOW(A), for any nonterminal A, is the
set of terminals a that can appear
immediately to the right of A in some
sentential form
• More formally, a is in FOLLOW(A) if and
only if there exists a derivation of the form
S *=> αAaβ
• $ is in FOLLOW(A) if and only if there
exists a derivation of the form S *=> αA
Computing FOLLOW
• Place $ in FOLLOW(S)
• If there is a production A → αBβ, then
everything in FIRST(β) (except for ε) is
in FOLLOW(B)
• If there is a production A → αB, or a
production A → αBβ where FIRST(β)
contains ε, then everything in FOLLOW(A)
is also in FOLLOW(B)
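The three rules can likewise be applied repeatedly until nothing changes. An illustrative Python sketch (names and representation are mine; FIRST sets are taken as precomputed input, with ε written as the empty string):

```python
# Fixed-point computation of FOLLOW for the running expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", ""}, "T": {"(", "id"},
         "T'": {"*", ""}, "F": {"(", "id"}}

def first_of(symbols):
    """FIRST of a string of grammar symbols."""
    out = set()
    for Y in symbols:
        f = FIRST[Y] if Y in NTS else {Y}
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")     # every symbol can vanish (or the string is empty)
    return out

def compute_follow(start):
    follow = {A: set() for A in NTS}
    follow[start].add("$")                     # rule 1
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in NTS:
                        continue
                    beta = first_of(rhs[i + 1:])
                    before = len(follow[B])
                    follow[B] |= beta - {""}   # rule 2
                    if "" in beta:             # rule 3
                        follow[B] |= follow[A]
                    if len(follow[B]) != before:
                        changed = True
    return follow

FOLLOW = compute_follow("E")
assert FOLLOW["F"] == {"+", "*", ")", "$"}
```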
FIRST and FOLLOW Example
E  → TE'
E' → +TE' | ε
T  → FT'
T' → *FT' | ε
F  → (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}
Creating a Predictive Parsing Table
• For each production A → α:
– For each terminal a in FIRST(α), add A → α
to M[A, a]
– If ε is in FIRST(α), add A → α to M[A, b]
for every terminal b in FOLLOW(A)
– If ε is in FIRST(α) and $ is in FOLLOW(A),
add A → α to M[A, $]
• Mark each undefined entry of M as an
error entry (use some recovery strategy)
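Table construction is a direct transcription of these rules. An illustrative Python sketch (names are mine; FIRST and FOLLOW are hard-coded from the example above, ε is the empty string, and a multiply-defined entry is reported as non-LL(1)):

```python
# Building the predictive parsing table M from FIRST and FOLLOW.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", ""}, "T": {"(", "id"},
         "T'": {"*", ""}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of(symbols):
    out = set()
    for Y in symbols:
        f = FIRST[Y] if Y in NTS else {Y}
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")
    return out

def build_table():
    table = {}
    for A, prods in GRAMMAR.items():
        for rhs in prods:
            f = first_of(rhs)
            # Rule 1: terminals in FIRST(rhs).  Rules 2-3: if eps is in
            # FIRST(rhs), use FOLLOW(A) (which already contains $).
            targets = (f - {""}) | (FOLLOW[A] if "" in f else set())
            for a in targets:
                if (A, a) in table:
                    raise ValueError(f"not LL(1): M[{A}, {a}]")
                table[(A, a)] = rhs
    return table

M = build_table()
assert M[("E'", "$")] == []   # E' -> eps under $
```

Undefined (A, a) pairs are simply absent from the dict, which plays the role of the error entries.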
Multiply-Defined Entries Example
S  → iEtSS' | a
S' → eS | ε
E  → b

Nonterminal  a       b       i             t    e                  $
S            S → a           S → iEtSS'
S'                                              S' → ε, S' → eS    S' → ε
E                    E → b
LL(1) Grammars (1)
• Algorithm covered in class can be applied
to any grammar to produce a parsing table
• If parsing table has no multiply-defined
entries, grammar is said to be “LL(1)”
– First “L”, left-to-right scanning of input
– Second “L”, produces leftmost derivation
– “1” refers to the number of lookahead symbols
needed to make decisions
LL(1) Grammars (2)
• No ambiguous or left-recursive grammar
can be LL(1)
• Eliminating left recursion and left factoring
does not always lead to an LL(1) grammar
• Some grammars cannot be transformed
into an LL(1) grammar at all
• Although the example of a non-LL(1)
grammar we covered has a fix, there are
no universal rules to handle cases like this
