
Topic #4: Syntactic Analysis

(Parsing)

INF 524 Compiler Construction


Spring 2011
Lexical Analyzer and Parser
[Figure: the lexical analyzer reads the source program and supplies tokens to the parser on demand]
Parser
• Accepts string of tokens from lexical
analyzer (usually one token at a time)
• Verifies whether or not string can be
generated by grammar
• Reports syntax errors (recovers if
possible)
Errors
• Lexical errors (e.g. misspelled word)
• Syntax errors (e.g. unbalanced
parentheses, missing semicolon)
• Semantic errors (e.g. type errors)
• Logical errors (e.g. infinite recursion)
Error Handling
• Report errors clearly and accurately
• Recover quickly if possible
• Poor error recovery may lead to an
avalanche of errors
Error Recovery
• Panic mode: discard tokens one at a time
until a synchronizing token is found
• Phrase-level recovery: Perform local
correction that allows parsing to continue
• Error Productions: Augment grammar to
handle predicted, common errors
• Global correction: Use a complex
algorithm to compute a least-cost sequence
of changes leading to parseable code
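The panic-mode strategy above can be sketched in a few lines of Python. This is an illustrative sketch, not from the slides; the function name and the choice of synchronizing tokens are my own.

```python
# Panic-mode recovery: on a syntax error, discard tokens one at a time
# until a synchronizing token appears. Which tokens synchronize is a
# design choice; statement terminators like ';' and '}' are typical.
def panic_mode_recover(tokens, pos, sync_tokens=(";", "}")):
    """Return the index of the first synchronizing token at or after pos
    (or len(tokens) if none is found)."""
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1  # discard one token
    return pos

# Skip a malformed expression and resume at the next ';'
tokens = ["x", "=", "@", "#", ";", "y"]
resume = panic_mode_recover(tokens, 2)
assert tokens[resume] == ";"
```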
Context Free Grammars
• CFGs can represent recursive constructs that
regular expressions cannot
• A CFG consists of:
– Tokens (terminals, symbols)
– Nonterminals (syntactic variables denoting sets of
strings)
– Productions (rules specifying how terminals and
nonterminals can combine to form strings)
– A start symbol (the set of strings it denotes is the
language of the grammar)
Derivations (Part 1)
• One definition of language: the set of
strings that have valid parse trees
• Another definition: the set of strings that
can be derived from the start symbol

E → E + E | E * E | (E) | -E | id
E => -E (read "E derives -E")
E => -E => -(E) => -(id)
Derivations (Part 2)
• αAβ => αγβ if A → γ is a production
and α and β are arbitrary strings of
grammar symbols
• If α1 => α2 => … => αn, we say that α1
derives αn
• => means derives in one step
• *=> means derives in zero or more steps
• +=> means derives in one or more steps
Sentences and Languages
• Let L(G) be the language generated by
the grammar G with start symbol S:
– Strings in L(G) may contain only tokens of G
– A string w is in L(G) if and only if S +=> w
– Such a string w is a sentence of G
• Any language that can be generated by a
CFG is said to be a context-free language
• If two grammars generate the same
language, they are said to be equivalent
Sentential Forms
• If S *=> α, where α may contain
nonterminals, we say that α is a sentential
form of G
• A sentence is a sentential form with no
nonterminals
Leftmost Derivations
• Only the leftmost nonterminal in any sentential
form is replaced at each step
• A leftmost step can be written as wAγ lm=> wδγ
– A → δ is a production
– w consists of only terminals
– γ is a string of grammar symbols
• If α derives β by a leftmost derivation, then we
write α lm*=> β
• If S lm*=> α then we say that α is a left-
sentential form of the grammar
• Analogous terms exist for rightmost derivations
Parse Trees
• A parse tree can be viewed as a graphical
representation of a derivation
• Every parse tree has a unique leftmost
derivation (not true of every sentence)
• An ambiguous grammar has:
– more than one parse tree for at least one
sentence
– more than one leftmost derivation for at least
one sentence
Capability of Grammars
• Can describe most programming language
constructs
• An exception: requiring that variables are
declared before they are used
– Therefore, grammar accepts superset of
actual language
– Later phase (semantic analysis) does type
checking
Regular Expressions vs. CFGs
• Every construct that can be described by
an RE can also be described by a CFG
• Why use REs at all?
– Lexical rules are simpler to describe this way
– REs are often easier to read
– More efficient lexical analyzers can be
constructed
Verifying Grammars
• A proof that a grammar generates a
language has two parts:
– Must show that every string generated by the
grammar is part of the language
– Must show that every string that is part of the
language can be generated by the grammar
• Rarely done for complete programming
languages!
Eliminating Ambiguity (1)

stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

This grammar is ambiguous: in the statement below, the else could
match either then

if E1 then if E2 then S1 else S2


Eliminating Ambiguity (2)
[Figure: the two parse trees for the ambiguous if-then-else statement]
Eliminating Ambiguity (3)

stmt → matched
     | unmatched
matched → if expr then matched else matched
        | other
unmatched → if expr then stmt
          | if expr then matched else unmatched
Left Recursion
• A grammar is left recursive if there exists
a nonterminal A with a derivation
A +=> Aα for some string α
• Most top-down parsing methods cannot
handle left-recursive grammars
Eliminating Left Recursion (1)

A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

becomes

A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε

Harder case:
S → Aa | b
A → Ac | Sd | ε
Eliminating Left Recursion (2)
• First arrange the nonterminals in some
order A1, A2, … An
• Apply the following algorithm:

for i = 1 to n {
    for j = 1 to i-1 {
        replace each production of the form Ai → Ajγ
        by the productions Ai → δ1γ | δ2γ | … | δkγ,
        where Aj → δ1 | δ2 | … | δk are the current Aj productions
    }
    eliminate the immediate left recursion among the Ai productions
}
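As an illustration, the algorithm can be implemented directly in Python. This is my own sketch, not from the slides; each production is a list of symbols, ε is the empty list, and the grammar is assumed to be suitable for the algorithm (no cycles).

```python
def eliminate_immediate(A, grammar):
    """Rewrite A -> A a | b  as  A -> b A' ; A' -> a A' | eps.
    Productions are lists of symbols; eps is the empty list."""
    alphas = [p[1:] for p in grammar[A] if p and p[0] == A]
    betas = [p for p in grammar[A] if not p or p[0] != A]
    if not alphas:
        return  # no immediate left recursion
    A2 = A + "'"
    grammar[A] = [b + [A2] for b in betas]
    grammar[A2] = [a + [A2] for a in alphas] + [[]]

def eliminate_left_recursion(grammar, order):
    """The two-loop algorithm from the slide, visiting nonterminals
    in the given order."""
    for i, Ai in enumerate(order):
        for Aj in order[:i]:
            # Replace Ai -> Aj gamma using the current Aj-productions
            new = []
            for p in grammar[Ai]:
                if p and p[0] == Aj:
                    new.extend(d + p[1:] for d in grammar[Aj])
                else:
                    new.append(p)
            grammar[Ai] = new
        eliminate_immediate(Ai, grammar)
    return grammar

# The "harder case":  S -> Aa | b ;  A -> Ac | Sd | eps
g = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
eliminate_left_recursion(g, ["S", "A"])
# S is unchanged; A becomes A -> bdA' | A'  with  A' -> cA' | adA' | eps
assert g["A"] == [["b", "d", "A'"], ["A'"]]
```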
Left Factoring
• Rewriting productions to delay decisions
• Helpful for predictive parsing
• Not guaranteed to remove ambiguity

A → αβ1 | αβ2

becomes

A → αA'
A' → β1 | β2
Limitations of CFGs
• Cannot verify repeated strings
– Example: L1 = {wcw | w is in (a|b)*}
– Abstracts checking that variables are declared
• Cannot verify repeated counts
– Example: L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
– Abstracts checking that the numbers of formal
and actual parameters are equal
• Therefore, some checks put off until
semantic analysis
Top Down Parsing
• Can be viewed two ways:
– Attempt to find leftmost derivation for input
string
– Attempt to create a parse tree, starting at
the root, creating nodes in preorder
• General form is recursive-descent parsing
– May require backtracking
– Backtracking parsers are used infrequently
because they are rarely needed
Predictive Parsing
• A special case of recursive-descent
parsing that does not require backtracking
• Must always know which production to use
based on current input symbol
• Can often create appropriate grammar:
– removing left-recursion
– left factoring the resulting grammar
Transition Diagrams
• For parser:
– One diagram for each nonterminal
– Edge labels can be tokens or nonterminals
• A transition on a token means we should take that
transition if token is next input symbol
• A transition on a nonterminal can be thought of as
a call to a procedure for that nonterminal
• As opposed to lexical analyzers:
– One (or more) diagrams for each token
– Labels are symbols of input alphabet
Creating Transition Diagrams
• First eliminate left recursion from grammar
• Then left factor grammar
• For each nonterminal A:
– Create an initial and final state
– For every production A → X1X2…Xn, create a
path from initial to final state with edges
labeled X1, X2, …, Xn
Using Transition Diagrams
• Predictive parsers:
– Start at start symbol of grammar
– From state s with edge to state t labeled with token
a, if next input token is a:
• State changes to t
• Input cursor moves one position right
– If edge labeled by nonterminal A:
• State changes to start state for A
• Input cursor is not moved
• If final state of A reached, then state changes to t
– If edge labeled by ε, state changes to t
• Can be recursive or non-recursive using stack
Transition Diagram Example

Original grammar:         After transformation:
E → E + T | T             E  → TE'
T → T * F | F             E' → +TE' | ε
F → (E) | id              T  → FT'
                          T' → *FT' | ε
                          F  → (E) | id

[Figure: transition diagrams for E, E', T, T', and F]
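The transformed grammar maps directly onto a recursive-descent predictive parser: one procedure per nonterminal, each following that nonterminal's transition diagram. An illustrative Python sketch (class and method names are mine, not from the slides):

```python
# Recursive-descent predictive parser for:
#   E -> T E'    E' -> + T E' | eps
#   T -> F T'    T' -> * F T' | eps
#   F -> ( E ) | id
class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]   # end marker
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def match(self, t):
        if self.peek() != t:
            raise SyntaxError(f"expected {t}, got {self.peek()}")
        self.pos += 1                # advance the input cursor

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):                # E' -> + T E' | eps
        if self.peek() == "+":
            self.match("+"); self.T(); self.Eprime()
        # otherwise take the eps edge: consume nothing

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):                # T' -> * F T' | eps
        if self.peek() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):                     # F -> ( E ) | id
        if self.peek() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

    def parse(self):
        self.E(); self.match("$")
        return True

assert Parser(["id", "+", "id", "*", "id"]).parse()
```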
Simplifying Transition Diagrams
[Figure: the transition diagrams for E' and E, combined and simplified]
Nonrecursive Predictive Parsing (1)

[Figure: model of a table-driven predictive parser (input buffer, stack, parsing table, output)]
Nonrecursive Predictive Parsing (2)
• Program considers X, the symbol on top of
the stack, and a, the next input symbol
• If X = a = $, parser halts successfully
• If X = a ≠ $, parser pops X off stack and
advances to next input symbol
• If X is a nonterminal, the program consults
M[X, a] (production or error entry)
Nonrecursive Predictive Parsing (3)
• Initialize stack with start symbol of
grammar
• Initialize input pointer to first symbol of
input
• After consulting parsing table:
– If entry is production, parser replaces top
entry of stack with right side of production
(leftmost symbol on top)
– Otherwise, an error recovery routine is called
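This parsing loop can be sketched directly in Python. An illustrative sketch, not from the slides; the table entries are those of the expression grammar used throughout, and ε-productions are stored as empty lists.

```python
# Table-driven predictive parsing for the running expression grammar.
TABLE = {
    ("E", "id"): ["T", "E'"],      ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"],      ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"],           ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def predictive_parse(tokens):
    """Return the list of productions used (a leftmost derivation)."""
    stack = ["$", "E"]                 # start symbol on top
    inp = tokens + ["$"]
    i, output = 0, []
    while stack:
        X, a = stack.pop(), inp[i]
        if X == a == "$":
            return output              # success
        if X not in NONTERMINALS:      # terminal: must match the input
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            i += 1
        else:                          # consult M[X, a]
            rhs = TABLE.get((X, a))
            if rhs is None:
                raise SyntaxError(f"error entry M[{X}, {a}]")
            output.append((X, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
    raise SyntaxError("stack exhausted before input")

steps = predictive_parse(["id", "+", "id", "*", "id"])
assert steps[0] == ("E", ["T", "E'"])   # first step: E -> TE'
```

The trace this produces matches the stack/input/output table on the next slide.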
Predictive Parsing Table

Nonterminal  id        +           *           (         )        $
E            E → TE'                           E → TE'
E'                     E' → +TE'                         E' → ε   E' → ε
T            T → FT'                           T → FT'
T'                     T' → ε      T' → *FT'             T' → ε   T' → ε
F            F → id                            F → (E)
Using a Predictive Parsing Table

Stack     Input       Output
$E        id+id*id$
$E'T      id+id*id$   E → TE'
$E'T'F    id+id*id$   T → FT'
$E'T'id   id+id*id$   F → id
$E'T'     +id*id$
$E'       +id*id$     T' → ε
$E'T+     +id*id$     E' → +TE'
$E'T      id*id$
$E'T'F    id*id$      T → FT'
$E'T'id   id*id$      F → id
$E'T'     *id$
$E'T'F*   *id$        T' → *FT'
$E'T'F    id$
$E'T'id   id$         F → id
$E'T'     $
$E'       $           T' → ε
$         $           E' → ε

FIRST
• FIRST(α) is the set of all terminals that begin
any string derived from α
• Computing FIRST:
– If X is a terminal, FIRST(X) = {X}
– If X → ε is a production, add ε to FIRST(X)
– If X is a nonterminal and X → Y1Y2…Yn is a
production:
• For all terminals a, add a to FIRST(X) if a is a member of
some FIRST(Yi) and ε is a member of FIRST(Y1),
FIRST(Y2), …, FIRST(Yi-1)
• If ε is a member of FIRST(Y1), FIRST(Y2), …,
FIRST(Yn), add ε to FIRST(X)
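The FIRST rules can be computed as a fixed point, repeating until no set grows. An illustrative Python sketch (names and representation are mine: productions are lists of symbols, ε is the empty list in productions and the empty string in sets):

```python
# Fixed-point computation of FIRST for the running expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)

def compute_first(grammar, nonterminals):
    first = {A: set() for A in nonterminals}
    changed = True
    while changed:
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                before = len(first[A])
                all_eps = True              # can Y1...Yn derive eps?
                for Y in rhs:
                    f = first[Y] if Y in nonterminals else {Y}
                    first[A] |= f - {""}
                    if "" not in f:
                        all_eps = False
                        break
                if all_eps:                 # covers X -> eps as well
                    first[A].add("")
                if len(first[A]) != before:
                    changed = True
    return first

FIRST = compute_first(GRAMMAR, NTS)
assert FIRST["E"] == {"(", "id"}
```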
FOLLOW
• FOLLOW(A), for any nonterminal A, is the
set of terminals a that can appear
immediately to the right of A in some
sentential form
• More formally, a is in FOLLOW(A) if and
only if there exists a derivation of the form
S *=> αAaβ
• $ is in FOLLOW(A) if and only if there
exists a derivation of the form S *=> αA
Computing FOLLOW
• Place $ in FOLLOW(S)
• If there is a production A → αBβ, then
everything in FIRST(β) (except for ε) is
in FOLLOW(B)
• If there is a production A → αB, or a
production A → αBβ where FIRST(β)
contains ε, then everything in FOLLOW(A)
is also in FOLLOW(B)
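The three rules can likewise be applied repeatedly until nothing changes. An illustrative Python sketch (names and representation are mine; FIRST sets are taken as precomputed input, with ε written as the empty string):

```python
# Fixed-point computation of FOLLOW for the running expression grammar.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", ""}, "T": {"(", "id"},
         "T'": {"*", ""}, "F": {"(", "id"}}

def first_of(symbols):
    """FIRST of a string of grammar symbols."""
    out = set()
    for Y in symbols:
        f = FIRST[Y] if Y in NTS else {Y}
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")     # every symbol can vanish (or the string is empty)
    return out

def compute_follow(start):
    follow = {A: set() for A in NTS}
    follow[start].add("$")                     # rule 1
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in NTS:
                        continue
                    beta = first_of(rhs[i + 1:])
                    before = len(follow[B])
                    follow[B] |= beta - {""}   # rule 2
                    if "" in beta:             # rule 3
                        follow[B] |= follow[A]
                    if len(follow[B]) != before:
                        changed = True
    return follow

FOLLOW = compute_follow("E")
assert FOLLOW["F"] == {"+", "*", ")", "$"}
```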
FIRST and FOLLOW Example
E  → TE'
E' → +TE' | ε
T  → FT'
T' → *FT' | ε
F  → (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}
Creating a Predictive Parsing Table
• For each production A → α:
– For each terminal a in FIRST(α), add A → α
to M[A, a]
– If ε is in FIRST(α), add A → α to M[A, b]
for every terminal b in FOLLOW(A)
– If ε is in FIRST(α) and $ is in FOLLOW(A),
add A → α to M[A, $]
• Mark each undefined entry of M as an
error entry (use some recovery strategy)
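Table construction is a direct transcription of these rules. An illustrative Python sketch (names are mine; FIRST and FOLLOW are hard-coded from the example above, ε is the empty string, and a multiply-defined entry is reported as non-LL(1)):

```python
# Building the predictive parsing table M from FIRST and FOLLOW.
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}
NTS = set(GRAMMAR)
FIRST = {"E": {"(", "id"}, "E'": {"+", ""}, "T": {"(", "id"},
         "T'": {"*", ""}, "F": {"(", "id"}}
FOLLOW = {"E": {")", "$"}, "E'": {")", "$"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of(symbols):
    out = set()
    for Y in symbols:
        f = FIRST[Y] if Y in NTS else {Y}
        out |= f - {""}
        if "" not in f:
            return out
    out.add("")
    return out

def build_table():
    table = {}
    for A, prods in GRAMMAR.items():
        for rhs in prods:
            f = first_of(rhs)
            # Rule 1: terminals in FIRST(rhs).  Rules 2-3: if eps is in
            # FIRST(rhs), use FOLLOW(A) (which already contains $).
            targets = (f - {""}) | (FOLLOW[A] if "" in f else set())
            for a in targets:
                if (A, a) in table:
                    raise ValueError(f"not LL(1): M[{A}, {a}]")
                table[(A, a)] = rhs
    return table

M = build_table()
assert M[("E'", "$")] == []   # E' -> eps under $
```

Undefined (A, a) pairs are simply absent from the dict, which plays the role of the error entries.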
Multiply-Defined Entries Example
S  → iEtSS' | a
S' → eS | ε
E  → b

Nonterminal  a       b       i             t    e                  $
S            S → a           S → iEtSS'
S'                                              S' → ε, S' → eS    S' → ε
E                    E → b
LL(1) Grammars (1)
• Algorithm covered in class can be applied
to any grammar to produce a parsing table
• If parsing table has no multiply-defined
entries, grammar is said to be “LL(1)”
– First “L”, left-to-right scanning of input
– Second “L”, produces leftmost derivation
– “1” refers to the number of lookahead symbols
needed to make decisions
LL(1) Grammars (2)
• No ambiguous or left-recursive grammar
can be LL(1)
• Eliminating left recursion and left factoring
does not always lead to an LL(1) grammar
• Some grammars cannot be transformed
into an LL(1) grammar at all
• Although the example of a non-LL(1)
grammar we covered has a fix, there are
no universal rules to handle cases like this
