
Syntax Analysis

Acknowledgement
• Alfred V. Aho, Monica S. Lam, Ravi Sethi, and
Jeffrey D. Ullman, “Compilers: Principles,
Techniques, and Tools”



Syntax Analysis
• The syntax of programming language
constructs can be specified by context-free
grammars or BNF (Backus-Naur Form)
notation



Syntax Analysis
• Grammars offer significant benefits
– A grammar gives a precise, yet easy-to-understand, syntactic
specification of a programming language
– From certain classes of grammars, we can automatically
construct an efficient parser that determines the
syntactic structure of a source program
– The parser-construction process can reveal syntactic
ambiguities and trouble spots
– A grammar is useful for translating source programs into
correct object code and for detecting errors
– A grammar allows a language to be evolved or developed iteratively



The Role of the Parser
[Figure: the source program is read by the lexical analyzer, which returns a token (token, tokenval) to the parser on each “get next token” request; the parser produces an intermediate representation for the rest of the front end. Both phases consult the symbol table, and lexical, syntax, and semantic errors are reported by the corresponding phases.]



The Role of the Parser
• Three general types of parsers for grammars: universal,
top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-
Kasami algorithm and Earley's algorithm can parse any
grammar
• top-down methods build parse trees from the top (root) to
the bottom (leaves)
• bottom-up methods start from the leaves and work their
way up to the root
• In either case (top-down or bottom-up), the input to the parser is scanned
from left to right, one symbol at a time



The Role of the Parser
• Universal (any C-F grammar)
– Cocke-Younger-Kasami
– Earley
• Top-down (C-F grammar with restrictions)
– General Case: Recursive descent (Special Case:
Predictive parsing)
– LL (Left-to-right, Leftmost derivation) methods
• Bottom-up (C-F grammar with restrictions)
– Operator precedence parsing
– LR (Left-to-right, Rightmost derivation) methods
• SLR, canonical LR, LALR



Representative Grammars
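The grammars most likely intended here are the Dragon book’s representative expression grammars; the second one reappears in the LL(1) examples later in these slides:

(1) Left-recursive expression grammar (suited to bottom-up parsing):
E → E + T | T
T → T * F | F
F → (E) | id

(2) Non-left-recursive variant (used for the top-down/LL(1) examples below):
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

(3) Ambiguous variant:
E → E + E | E * E | (E) | id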



The Parser
• A parser implements a C-F grammar
• The role of the parser is twofold:
1. To check syntax (= string recognizer)
– And to report syntax errors accurately
2. To invoke semantic actions
– For static semantics checking, e.g. type checking of
expressions, functions, etc.
– For syntax-directed translation of the source code to
an intermediate representation
Syntax Analysis
The main tasks of syntax analysis, which is performed
by the parser, are as follows:
• Checking the input against the grammar
• Producing a parse tree
• Reporting syntax errors



Syntax Error Handling
Common programming errors:
• Lexical error: such as misspelling an identifier,
keyword, or operator.
• Syntactic error: such as an arithmetic expression with
unbalanced parentheses.
• Semantic error: such as an operator applied to an
incompatible operand.
• Logical error: such as an infinitely recursive call.



Viable-Prefix Property
• The viable-prefix property of LL/LR parsers
allows early detection of syntax errors
– Goal: detection of an error as soon as possible without
further consuming unnecessary input
– How: detect an error as soon as the prefix of the input
does not match a prefix of any string in the language

[Examples: for the inputs “for (;)” and “DO 10 I = 1;0”, the error is detected right at the point where the prefix read so far can no longer be extended to a prefix of any string in the language.]
Error Recovery Strategies
• Panic mode
– Discard input until a token in a set of designated
synchronizing tokens is found
• Phrase-level recovery
– Perform local correction on the input to repair the error
• Error productions
– Augment grammar with productions for erroneous
constructs
• Global correction
– Choose a minimal sequence of changes to obtain a
global least-cost correction
1. Panic Mode
In case of an error like:
a=b + c // no semi-colon
d=e + f ;

the compiler will discard all subsequent tokens
till a semi-colon is encountered.
• This is a crude method, but it often turns out to be
the best method.



• Although this method often skips a considerable
amount of input without checking it for
additional errors, it has the advantage of
simplicity.
• In situations where multiple errors in the
same statement are rare, this method may
be quite adequate.
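To make the mechanism concrete, the following is a minimal sketch in C of panic-mode skipping for the example above; the token codes, the toy token stream, and next_token() are hypothetical stand-ins of this sketch, not part of the slides:

```c
#include <stdio.h>

/* Hypothetical token codes and a toy "lexer" over a fixed token stream. */
enum token { ID, ASSIGN, PLUS, SEMI, END };
static enum token stream[] = { ID, ASSIGN, ID, PLUS, ID,           /* a = b + c   (missing ';') */
                               ID, ASSIGN, ID, PLUS, ID, SEMI, END };
static int pos = 0;
static enum token next_token(void) { return stream[pos++]; }

int main(void) {
    /* Pretend the parser has already consumed "a = b + c" and now expects ';',
       but the lookahead is the ID "d": a syntax error is reported here. */
    pos = 5;                                  /* index of the offending token */
    enum token look = next_token();

    /* Panic mode: discard tokens until the synchronizing token ';' (or end of input). */
    int discarded = 0;
    while (look != SEMI && look != END) {
        look = next_token();
        discarded++;
    }
    if (look == SEMI)
        look = next_token();                  /* resume parsing just after the ';' */
    printf("discarded %d tokens, resuming at token code %d\n", discarded, (int)look);
    return 0;
}
```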
2. Phrase Level Recovery
• On discovering an error, a parser may
perform local correction on the remaining
input; that is, it may replace a prefix of the
remaining input by some string that allows
the parser to continue.
• For example, in case of an error like the one
above, it will report the error, generate the
“;” and continue.
3. Error Production
If we have an idea of common errors that
might occur, we can include the errors in
the grammar at hand. For example if we
have a production rule like:
E → +E | -E | *E | /E
Then, a=+b;
a=-b;
a=*b;
a=/b;
Here, the last two are error situations.
Now, we change the grammar as:
E → +E | -E | *A | /A
A → E
Hence, once the parser encounters *A, it sends an error
message asking the user whether a unary “*”
was really intended.



4. Global Correction
• We would like the compiler to make as few
changes as possible in processing an
incorrect input string.
• There are algorithms for choosing a
minimal sequence of changes to obtain a
globally least-cost correction. Suppose we
have a line like this:
THIS IS A OCMPERIL SCALS.
• To correct this, there is an attractor, which
checks how different tokens are from the
initial inputs, and finds the closest
attractor to the incorrect token.
• This is more of a probabilistic type of
error correction. Unfortunately, these
methods are in general too costly to
implement in terms of space and time.
Top-Down Parsing
Top-Down Parsing
• Constructing a parse tree for the input string,
starting from the root and creating the nodes of the
parse tree in preorder (depth-first)
• top-down parsing can be viewed as finding a
leftmost derivation for an input string



Top-Down Parsing
• Recursive-Descent Parsing
– Backtracking is needed (If a choice of a production rule does not work, we
backtrack to try other alternatives.)
– It is a general parsing technique, but not widely used.
– Not efficient
• Predictive Parsing
– no backtracking
– efficient
– The class of grammars for which we can construct predictive parsers looking “k”
symbols ahead in the input is sometimes called the LL(k) class.
– needs a special form of grammars (LL(1) grammars).
– Recursive Predictive Parsing is a special form of Recursive Descent parsing without
backtracking.
– Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.



Recursive-Descent Parsing
• A recursive-descent parsing program consists of a set of
procedures, one for each nonterminal
• Execution begins with the procedure for the start symbol,
which halts and announces success if its procedure body
scans the entire input string



Recursive-Descent Parsing
• General recursive-descent may require
backtracking
• However, backtracking is rarely needed to parse
programming language constructs
• Backtracking is not very efficient
• A left-recursive grammar can cause a recursive-
descent parser, even one with backtracking, to go
into an infinite loop.



Recursive-Descent Parsing



Recursive-Descent Parsing
• Each non-terminal corresponds to a procedure.

Ex: A → aBb (This is the only production rule for A)

proc A {
- match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
}



Recursive-Descent Parsing
A → aBb | bAB

proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}



Recursive-Descent Parsing
A → aBe | cBd | C
B → bB | ε
C → f

proc A {
  case of the current token {
    a: - match the current token with a, and move to the next token;
       - call B;
       - match the current token with e, and move to the next token;
    c: - match the current token with c, and move to the next token;
       - call B;
       - match the current token with d, and move to the next token;
    f: - call C                          (f is in the first set of C)
  }
}

proc B {
  case of the current token {
    b:    - match the current token with b, and move to the next token;
          - call B;
    e, d: do nothing                     (e and d are in the follow set of B)
  }
}

proc C {
  match the current token with f, and move to the next token;
}
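The pseudocode above translates almost line for line into real code. The following is a minimal sketch in C for this grammar; the single-character token representation, the error handling, and the test input are assumptions of this sketch, not taken from the slides:

```c
#include <stdio.h>
#include <stdlib.h>

/* Grammar:  A -> aBe | cBd | C     B -> bB | epsilon     C -> f
 * Tokens are single characters; '\0' marks the end of the input. */

static const char *input;     /* rest of the input */
static char lookahead;        /* current token */

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s (at '%c')\n", msg, lookahead ? lookahead : '$');
    exit(1);
}

static void advance(void) { lookahead = *input; if (*input) input++; }

static void match(char t) {
    if (lookahead == t) advance();
    else { fprintf(stderr, "syntax error: expected '%c', found '%c'\n", t, lookahead); exit(1); }
}

static void B(void);
static void C(void);

static void A(void) {
    switch (lookahead) {
    case 'a': match('a'); B(); match('e'); break;
    case 'c': match('c'); B(); match('d'); break;
    case 'f': C(); break;                             /* f is in FIRST(C) */
    default:  error("expected a, c, or f at the start of A");
    }
}

static void B(void) {
    if (lookahead == 'b') { match('b'); B(); }
    else if (lookahead != 'e' && lookahead != 'd')    /* B -> epsilon only before e or d */
        error("unexpected token in B");
}

static void C(void) { match('f'); }

int main(int argc, char **argv) {
    input = (argc > 1) ? argv[1] : "abbe";   /* "abbe": A -> aBe, B -> bB, B -> bB, B -> eps */
    advance();
    A();
    if (lookahead != '\0') error("extra input after A");
    printf("input accepted\n");
    return 0;
}
```

Run on “abbe” or “cbbd” it accepts; on “abd” it reports that ‘e’ was expected where ‘d’ was found.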
FIRST and FOLLOW
• The construction of both top-down and bottom-up parsers is
aided by two functions, FIRST and FOLLOW
• Define FIRST(α) to be the set of terminals that begin strings derived
from α, where α is any string of grammar symbols
• Define FOLLOW(A), for a nonterminal A, to be the set of terminals a that can appear
immediately to the right of A in some sentential form
– that is, the set of terminals a such that there exists a derivation of the form
S ⇒* αAaβ for some α and β



FIRST
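The standard rules for computing FIRST (as given in the Dragon book) are:
• If X is a terminal, then FIRST(X) = {X}.
• If X is a nonterminal and X → Y1 Y2 … Yk is a production, add every non-ε symbol of FIRST(Y1) to FIRST(X); if Y1 can derive ε, also add the non-ε symbols of FIRST(Y2), and so on; if all of Y1 … Yk can derive ε, add ε to FIRST(X).
• If X → ε is a production, add ε to FIRST(X).
• FIRST of a string X1 X2 … Xn is computed the same way, symbol by symbol from the left.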



FIRST
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(E)  = {(, id}       FIRST(TE’)  = {(, id}
FIRST(E’) = {+, ε}        FIRST(+TE’) = {+}
FIRST(T)  = {(, id}       FIRST(FT’)  = {(, id}
FIRST(T’) = {*, ε}        FIRST(*FT’) = {*}
FIRST(F)  = {(, id}       FIRST((E))  = {(}
FIRST(ε)  = {ε}           FIRST(id)   = {id}
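As a cross-check, FIRST sets can be computed mechanically by iterating the rules until nothing changes. The program below is a sketch in C of that fixed-point computation for this grammar; the single-letter encoding (R for E’, S for T’, i for id, # for ε) is a convention of the sketch only:

```c
#include <stdio.h>
#include <string.h>

/* Grammar, encoded with single letters:
 *   E -> TR   R -> +TR | #   T -> FS   S -> *FS | #   F -> (E) | i
 * where R stands for E', S for T', i for id, and # for epsilon. */
#define NPROD 8
static const char *prod[NPROD] = {
    "E=TR", "R=+TR", "R=#", "T=FS", "S=*FS", "S=#", "F=(E)", "F=i"
};
static const char *nts = "ERTSF";   /* the nonterminals */
static char first[5][16];           /* FIRST set of each nonterminal, kept as a string */

static int nt(char c) { const char *p = strchr(nts, c); return p ? (int)(p - nts) : -1; }

static int add(char *set, char c) {          /* add c to a set; return 1 if it grew */
    if (strchr(set, c)) return 0;
    size_t n = strlen(set); set[n] = c; set[n + 1] = '\0'; return 1;
}

int main(void) {
    int changed = 1;
    while (changed) {                        /* iterate until no FIRST set grows */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int A = nt(prod[p][0]);
            const char *rhs = prod[p] + 2;
            int i = 0, nullable = 1;         /* can the prefix scanned so far vanish? */
            while (rhs[i] && nullable) {
                char X = rhs[i];
                nullable = 0;
                if (X == '#') { changed |= add(first[A], '#'); break; }
                if (nt(X) < 0) { changed |= add(first[A], X); break; }    /* terminal */
                for (const char *t = first[nt(X)]; *t; t++)               /* nonterminal */
                    if (*t != '#') changed |= add(first[A], *t);
                if (strchr(first[nt(X)], '#')) nullable = 1;
                i++;
            }
            if (nullable && rhs[i] == '\0')  /* the whole right-hand side can vanish */
                changed |= add(first[A], '#');
        }
    }
    for (int k = 0; k < 5; k++) {
        printf("FIRST(%c) = {", nts[k]);
        for (const char *t = first[k]; *t; t++) printf(" %c", *t);
        printf(" }\n");
    }
    return 0;
}
```

Its output matches the sets listed above, with i standing for id and # for ε (FIRST(R) and FIRST(S) correspond to FIRST(E’) and FIRST(T’)).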
FOLLOW
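The standard rules for computing FOLLOW (again as in the Dragon book) are:
• Place $ in FOLLOW(S), where S is the start symbol and $ is the input end marker.
• If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).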



FOLLOW
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
LL(1) Grammars
• Predictive parsers, that is, recursive-descent parsers
needing no backtracking, can be constructed for a class of
grammars called LL(1)
• The first "L" in LL(1) stands for scanning the input from
left to right, the second "L" for producing a leftmost
derivation, and the “1" for using one input symbol of
lookahead at each step to make parsing action decisions
• No left-recursive or ambiguous grammar can be LL(1)



LL(1) Grammars
• A grammar G is LL(1) if and only if whenever A → α | β are two
distinct productions of G, the following conditions hold:
1. For no terminal “a” do both α and β derive strings beginning
with “a”.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a
terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not
derive any string beginning with a terminal in FOLLOW(A).
• The first two conditions are equivalent to the statement that FIRST(α)
and FIRST(β) are disjoint sets.
• The third condition is equivalent to stating that if ε is in FIRST(β) ,
then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is
in FIRST(α) .
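For example, for the pair E’ → +TE’ | ε of the expression grammar used earlier: FIRST(+TE’) = {+} and FIRST(ε) = {ε} are disjoint, only the second alternative derives ε, and FIRST(+TE’) ∩ FOLLOW(E’) = {+} ∩ { ), $ } = ∅. The pairs for T’ and F pass the same checks, so that grammar is LL(1).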



Predictive Parsing Table
• A predictive parsing table M [A, a] is a two-dimensional
array, where A is a nonterminal, and a is a terminal or the
symbol $ (input endmarker)



Predictive Parsing Table

FIRST(E)  = {(, id}       FOLLOW(E)  = { $, ) }
FIRST(E’) = {+, ε}        FOLLOW(E’) = { $, ) }
FIRST(T)  = {(, id}       FOLLOW(T)  = { +, ), $ }
FIRST(T’) = {*, ε}        FOLLOW(T’) = { +, ), $ }
FIRST(F)  = {(, id}       FOLLOW(F)  = { +, *, ), $ }
Predictive Parsing Table
E → TE’      FIRST(TE’) = {(, id}    so E → TE’ goes into M[E,(] and M[E,id]
E’ → +TE’    FIRST(+TE’) = {+}       so E’ → +TE’ goes into M[E’,+]
E’ → ε       FIRST(ε) = {ε}          nothing from FIRST, but since ε is in FIRST(ε)
                                     and FOLLOW(E’) = {$, )}, E’ → ε goes into M[E’,$] and M[E’,)]
T → FT’      FIRST(FT’) = {(, id}    so T → FT’ goes into M[T,(] and M[T,id]
T’ → *FT’    FIRST(*FT’) = {*}       so T’ → *FT’ goes into M[T’,*]
T’ → ε       FIRST(ε) = {ε}          nothing from FIRST, but since ε is in FIRST(ε)
                                     and FOLLOW(T’) = {$, ), +}, T’ → ε goes into M[T’,$], M[T’,)], and M[T’,+]
F → (E)      FIRST((E)) = {(}        so F → (E) goes into M[F,(]
F → id       FIRST(id) = {id}        so F → id goes into M[F,id]
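Collecting these entries gives the complete LL(1) parsing table for the expression grammar (blank entries signal errors):

         id          +            *            (           )           $
E        E → TE’                               E → TE’
E’                   E’ → +TE’                             E’ → ε      E’ → ε
T        T → FT’                               T → FT’
T’                   T’ → ε       T’ → *FT’                T’ → ε      T’ → ε
F        F → id                                F → (E)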



Predictive Parsing Table
• For every LL(1) grammar , each parsing-table entry
uniquely identifies a production or signals an error.
• For some grammars, however, M may have some entries
that are multiply defined; such a grammar is not LL(1)



Predictive Parsing Table
S → iCtSE | a
E → eS | ε
C → b

FOLLOW(S) = { $, e }    FOLLOW(E) = { $, e }    FOLLOW(C) = { t }
FIRST(iCtSE) = {i}   FIRST(a) = {a}   FIRST(eS) = {e}   FIRST(ε) = {ε}   FIRST(b) = {b}

         a          b         e                 i             t         $
S        S → a                                  S → iCtSE
E                             E → eS, E → ε                             E → ε
C                   C → b

Two production rules for M[E,e]

Problem: ambiguity
A Grammar which is not LL(1)
• What do we have to do if the resulting parsing table
contains multiply defined entries?
• If we did not eliminate left recursion, eliminate the left recursion in the grammar.
• If the grammar is not left factored, we have to left factor the grammar.
• If its (the new grammar’s) parsing table still contains multiply defined entries, that
grammar is ambiguous or it is inherently not an LL(1) grammar.
• A left-recursive grammar cannot be an LL(1) grammar.
– A → Aα | β
– Any terminal that appears in FIRST(β) also appears in FIRST(Aα), because Aα ⇒ βα.
– If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
• A grammar that is not left factored cannot be an LL(1) grammar.
– A → αβ1 | αβ2
– Any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.



Nonrecursive Predictive Parsing
• A nonrecursive predictive parser can be
built by maintaining a stack explicitly,
rather than implicitly via recursive calls.
• Nonrecursive predictive parsing is table
driven (a table-driven predictive parser).
• It is a top-down parser.
• It is also known as an LL(1) parser.



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing
input buffer
– our string to be parsed. We will assume that its end is marked with a special
symbol $.
output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S     (initial stack)
– when the stack is emptied (i.e., only $ is left on the stack), parsing is
completed
parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule
Nonrecursive Predictive Parsing
• METHOD: Initially, the parser is in a configuration with w$
in the input buffer and the start symbol S of G on top of the
stack, above $.
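In outline, the standard table-driven parsing loop (as in the Dragon book) is the following, where a is the current input symbol and X is the symbol on top of the stack:

  set ip to point to the first symbol of w;
  set X to the top stack symbol;
  while ( X != $ ) {                              /* the stack is not empty */
      if ( X == a )  pop the stack and advance ip;
      else if ( X is a terminal )  error();
      else if ( M[X, a] is an error entry )  error();
      else if ( M[X, a] = X → Y1 Y2 ... Yk ) {
          output the production X → Y1 Y2 ... Yk;
          pop the stack;
          push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
      }
      set X to the top stack symbol;
  }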



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing
S → aBa
B → bB | ε

LL(1) parsing table:
         a           b          $
S        S → aBa
B        B → ε       B → bB

stack input output


$S abba$ S → aBa
$aBa abba$
$aB bba$ B → bB
$aBb bba$
$aB ba$ B → bB
$aBb ba$
$aB a$ B→ε
$a a$
$ $ accept, successful completion
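The trace above is exactly what a table-driven driver produces. Below is a small sketch in C for this particular grammar; the string-based stack and the hard-coded table are simplifications of this sketch, not taken from the slides:

```c
#include <stdio.h>
#include <string.h>

/* Grammar: S -> aBa    B -> bB | epsilon
 * table() returns the right-hand side to push for M[X,a]:
 * "" encodes epsilon, NULL encodes a blank (error) entry. */
static const char *table(char X, char a) {
    if (X == 'S' && a == 'a') return "aBa";   /* M[S,a] = S -> aBa */
    if (X == 'B' && a == 'b') return "bB";    /* M[B,b] = B -> bB  */
    if (X == 'B' && a == 'a') return "";      /* M[B,a] = B -> eps */
    return NULL;
}

int main(void) {
    const char *input = "abba$";
    char stack[64] = "$S";                 /* stack grows to the right; top is the last char */
    int top = 1;                           /* index of the top-of-stack symbol */
    int ip = 0;                            /* index of the current input symbol */

    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (X == 'S' || X == 'B') {                        /* nonterminal on top */
            const char *rhs = table(X, a);
            if (!rhs) { printf("error at input symbol %c\n", a); return 1; }
            printf("%c -> %s\n", X, *rhs ? rhs : "eps");   /* output the production used */
            top--;                                         /* pop X ...                   */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];                     /* ... and push its RHS reversed */
        } else if (X == a) {                               /* terminal on top: match it */
            top--; ip++;
        } else {
            printf("error: expected %c, found %c\n", X, a); return 1;
        }
    }
    puts(input[ip] == '$' ? "accept" : "error: input left over");
    return 0;
}
```

On “abba$” it prints the same production sequence as the trace above (S → aBa, B → bB, B → bB, B → ε) and then accepts.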
Error Recovery in Predictive Parsing
• An error is detected during predictive parsing
– when the terminal on top of the stack does not match
the next input symbol
– when nonterminal A is on top of the stack, a is the next
input symbol, and M[A, a] is error (i.e., the parsing-
table entry is empty)
• What should the parser do in an error case?
– The parser should be able to give an error message (as
much as possible meaningful error message).
– It should recover from that error case, and it should be
able to continue the parsing with the rest of the input.



Error Recovery in Predictive Parsing
• Panic-Mode Error Recovery
– Skipping the input symbols until a synchronizing token is found.

• Phrase-Level Error Recovery


– Each empty entry in the parsing table is filled with a pointer to a specific error routine to take
care of that error case.

• Error-Productions
– If we have a good idea of the common errors that might be encountered, we can augment the
grammar with productions that generate erroneous constructs.
– When an error production is used by the parser, we can generate appropriate error diagnostics.
– Since it is almost impossible to know all the errors that can be made by the programmers, this
method is not practical.

• Global-Correction
– Ideally, we would like the compiler to make as few changes as possible in processing incorrect
inputs.
– We have to globally analyze the input to find the error.
– This is an expensive method, and it is not used in practice.
Panic-Mode Error Recovery in LL(1) Parsing

• Based on skipping symbols on the input until a
token in a selected set of synchronizing tokens
appears.
• The synchronizing tokens should be chosen so
that the parser recovers quickly from errors
that are likely to occur in practice.
• What is the synchronizing token?
– All the terminal-symbols in the follow set of a non-
terminal can be used as a synchronizing token set
for that non-terminal.



Panic-Mode Error Recovery in LL(1) Parsing
• Some heuristics are:
– Place all symbols in FOLLOW(A) into the synchronizing set
for nonterminal A. If we skip tokens until an element of
FOLLOW(A) is seen and pop A from the stack, it is likely
that parsing can continue
– Add the symbols that begin higher-level constructs to the
synchronizing set of a lower-level construct. For example,
add keywords that begin statements to the synchronizing sets
for the nonterminals generating expressions.
– If we add the symbols in FIRST(A) to the synchronizing set for
nonterminal A, then it may be possible to resume parsing
according to A if a symbol in FIRST(A) appears in the input



Panic-Mode Error Recovery in LL(1) Parsing

• Some heuristics are (Contd…):


– The production deriving ε can be used as a
default. Doing so may postpone some error
detection, but cannot cause an error to be
missed.
– If a terminal on top of the stack cannot be
matched, a simple idea is to pop the terminal,
issue a message saying that the terminal was
inserted, and continue parsing.



Panic-Mode Error Recovery in LL(1) Parsing
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
Panic-Mode Error Recovery in LL(1) Parsing
• “synch” entries indicate synchronizing tokens, obtained from the FOLLOW set of the
nonterminal
• If the parser looks up entry M[A, a] and finds that it is blank, then the input symbol “a” is
skipped.
• If the entry is "synch," then the nonterminal on top of the stack is popped in an attempt to
resume parsing.
• If a token on top of the stack does not match the input symbol, then we pop the token
from the stack



Panic-Mode Error Recovery in LL(1) Parsing



Panic-Mode Error Recovery in LL(1) Parsing
S → AbS | e | ε
A → a | cAd

FOLLOW(S) = {$}      FOLLOW(A) = {b, d}

         a            b        c            d        e          $
S        S → AbS      sync     S → AbS      sync     S → e      S → ε
A        A → a        sync     A → cAd      sync     sync       sync

Input: aab$

stack     input     output
$S        aab$      S → AbS
$SbA      aab$      A → a
$Sba      aab$
$Sb       ab$       Error: missing b, inserted
$S        ab$       S → AbS
$SbA      ab$       A → a
$Sba      ab$
$Sb       b$
$S        $         S → ε
$         $         accept

Input: ceadb$

stack     input     output
$S        ceadb$    S → AbS
$SbA      ceadb$    A → cAd
$SbdAc    ceadb$
$SbdA     eadb$     Error: unexpected e (illegal A)
                    (remove all input tokens until the first b or d, pop A)
$Sbd      db$
$Sb       b$
$S        $         S → ε
$         $         accept
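A sketch in C of how the recovery behaviour shown in these traces can be coded is given below. It mirrors the earlier driver, adding two error actions: a mismatched terminal on the stack is popped (“inserted”), and a nonterminal with no table entry causes input to be skipped up to a token in its synchronizing (FOLLOW) set before the nonterminal is popped. The representation is again an assumption of this sketch:

```c
#include <stdio.h>
#include <string.h>

/* Grammar: S -> AbS | e | epsilon    A -> a | cAd
 * Nonterminals: S, A.  Terminals: a b c d e, with $ as the end marker. */

static int is_nonterminal(char X) { return X == 'S' || X == 'A'; }

static const char *table(char X, char a) {            /* NULL = no production */
    if (X == 'S') {
        if (a == 'a' || a == 'c') return "AbS";
        if (a == 'e') return "e";
        if (a == '$') return "";                       /* S -> epsilon */
    } else if (X == 'A') {
        if (a == 'a') return "a";
        if (a == 'c') return "cAd";
    }
    return NULL;
}

static int in_synch(char X, char a) {                  /* synch set = FOLLOW(X) plus $ */
    if (a == '$') return 1;
    return (X == 'S') ? 0 : (a == 'b' || a == 'd');    /* FOLLOW(S)={$}, FOLLOW(A)={b,d} */
}

int main(void) {
    const char *input = "ceadb$";                      /* try "aab$" as well */
    char stack[64] = "$S";
    int top = 1, ip = 0;

    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (is_nonterminal(X)) {
            const char *rhs = table(X, a);
            if (rhs) {
                printf("%c -> %s\n", X, *rhs ? rhs : "eps");
                top--;
                for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                    stack[++top] = rhs[k];
            } else {                                   /* error: skip to a synch token, pop X */
                printf("error: unexpected %c while expanding %c; ", a, X);
                while (!in_synch(X, input[ip])) ip++;  /* discard input tokens */
                printf("skipped to %c, popped %c\n", input[ip], X);
                top--;
            }
        } else if (X == a) {                           /* terminal matches */
            top--; ip++;
        } else {                                       /* terminal mismatch: "insert" it */
            printf("error: missing %c, inserted\n", X);
            top--;
        }
    }
    puts(input[ip] == '$' ? "accept" : "error: input left over");
    return 0;
}
```

On “ceadb$” it reports the illegal A, skips e and a, and accepts, matching the second trace; on “aab$” it reports the missing b and accepts, matching the first.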
Phrase-Level Error Recovery
• Each empty entry in the parsing table is filled
with a pointer to a special error routine which will
take care of that error case.
• These error routines may:
– change, insert, or delete input symbols
– issue appropriate error messages
– pop items from the stack
• We should be careful when we design these error
routines, because we may put the parser into an
infinite loop.
