
Syntax Analysis

Acknowledgement
• Alfred V. Aho, Monica S. Lam, Ravi Sethi, and
Jeffrey D. Ullman, “Compilers: Principles,
Techniques, and Tools”



Syntax Analysis
• The syntax of programming language
constructs can be specified by context-free
grammars or BNF (Backus-Naur Form)
notation



Syntax Analysis
• Grammars offer significant benefits
– A grammar gives a precise, yet easy-to-understand, syntactic
specification of a programming language
– From certain classes of grammars, we can automatically
construct an efficient parser that determines the
syntactic structure of a source program
– The parser-construction process can reveal syntactic
ambiguities and trouble spots
– A grammar is useful for translating source programs into
correct object code and for detecting errors
– A grammar allows a language to be evolved or developed iteratively



The Role of the Parser
[Figure: the source program is read by the lexical analyzer, which returns a token (token, tokenval) to the parser on each “get next token” request; the parser produces an intermediate representation for the rest of the front end. Both phases consult the symbol table, and lexical, syntax, and semantic errors are reported by the corresponding phases.]



The Role of the Parser
• Three general types of parsers for grammars: universal,
top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-
Kasami algorithm and Earley's algorithm can parse any
grammar
• top-down methods build parse trees from the top (root) to
the bottom (leaves)
• bottom-up methods start from the leaves and work their
way up to the root
• In either case (top-down or bottom-up), the input to the parser is scanned
from left to right, one symbol at a time



The Role of the Parser
• Universal (any C-F grammar)
– Cocke-Younger-Kasami
– Earley
• Top-down (C-F grammar with restrictions)
– General Case: Recursive descent (Special Case:
Predictive parsing)
– LL (Left-to-right, Leftmost derivation) methods
• Bottom-up (C-F grammar with restrictions)
– Operator precedence parsing
– LR (Left-to-right, Rightmost derivation) methods
• SLR, canonical LR, LALR



Representative Grammars
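The grammars most likely intended here are the Dragon book’s representative expression grammars; the second one reappears in the LL(1) examples later in these slides:

(1) Left-recursive expression grammar (suited to bottom-up parsing):
E → E + T | T
T → T * F | F
F → (E) | id

(2) Non-left-recursive variant (used for the top-down/LL(1) examples below):
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

(3) Ambiguous variant:
E → E + E | E * E | (E) | id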



The Parser
• A parser implements a C-F grammar
• The role of the parser is twofold:
1. To check syntax (= string recognizer)
– And to report syntax errors accurately
2. To invoke semantic actions
– For static semantics checking, e.g. type checking of
expressions, functions, etc.
– For syntax-directed translation of the source code to
an intermediate representation
Syntax Analysis
The main tasks of syntax analysis, which is performed
by the parser, are as follows:
• Checking the input against the grammar
• Producing a parse tree
• Reporting syntax errors



Syntax Error Handling
Common programming errors:
• Lexical error: such as misspelling an identifier,
keyword, or operator.
• Syntactic error: such as an arithmetic expression with
unbalanced parentheses.
• Semantic error: such as an operator applied to an
incompatible operand.
• Logical error: such as an infinitely recursive call.



Viable-Prefix Property
• The viable-prefix property of LL/LR parsers
allows early detection of syntax errors
– Goal: detection of an error as soon as possible without
further consuming unnecessary input
– How: detect an error as soon as the prefix of the input
does not match a prefix of any string in the language

[Examples: for the inputs “for (;)” and “DO 10 I = 1;0”, the error is detected right at the point where the prefix read so far can no longer be extended to a prefix of any string in the language.]
Error Recovery Strategies
• Panic mode
– Discard input until a token in a set of designated
synchronizing tokens is found
• Phrase-level recovery
– Perform local correction on the input to repair the error
• Error productions
– Augment grammar with productions for erroneous
constructs
• Global correction
– Choose a minimal sequence of changes to obtain a
global least-cost correction
1. Panic Mode
In case of an error like:
a=b + c // no semi-colon
d=e + f ;

the compiler will discard all subsequent tokens
till a semi-colon is encountered.
• This is a crude method, but it often turns out to be
the best method.



• Although this method often skips a considerable
amount of input without checking it for
additional errors, it has the advantage of
simplicity.
• In situations where multiple errors in the
same statement are rare, this method may
be quite adequate.
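To make the mechanism concrete, the following is a minimal sketch in C of panic-mode skipping for the example above; the token codes, the toy token stream, and next_token() are hypothetical stand-ins of this sketch, not part of the slides:

```c
#include <stdio.h>

/* Hypothetical token codes and a toy "lexer" over a fixed token stream. */
enum token { ID, ASSIGN, PLUS, SEMI, END };
static enum token stream[] = { ID, ASSIGN, ID, PLUS, ID,           /* a = b + c   (missing ';') */
                               ID, ASSIGN, ID, PLUS, ID, SEMI, END };
static int pos = 0;
static enum token next_token(void) { return stream[pos++]; }

int main(void) {
    /* Pretend the parser has already consumed "a = b + c" and now expects ';',
       but the lookahead is the ID "d": a syntax error is reported here. */
    pos = 5;                                  /* index of the offending token */
    enum token look = next_token();

    /* Panic mode: discard tokens until the synchronizing token ';' (or end of input). */
    int discarded = 0;
    while (look != SEMI && look != END) {
        look = next_token();
        discarded++;
    }
    if (look == SEMI)
        look = next_token();                  /* resume parsing just after the ';' */
    printf("discarded %d tokens, resuming at token code %d\n", discarded, (int)look);
    return 0;
}
```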
2. Phrase Level Recovery
• On discovering an error, a parser may
perform local correction on the remaining
input; that is, it may replace a prefix of the
remaining input by some string that allows
the parser to continue.
• For example, in case of an error like the one
above, it will report the error, generate the
“;” and continue.
3. Error Production
If we have an idea of common errors that
might occur, we can include the errors in
the grammar at hand. For example if we
have a production rule like:
E → +E | -E | *E | /E
Then, a=+b;
a=-b;
a=*b;
a=/b;
Here, the last two are error situations.
Now, we change the grammar as:
E → +E | -E | *A | /A
A → E
Hence, once the parser encounters *A, it sends an error
message asking the user whether a unary “*”
was really intended.



4. Global Correction
• We would like the compiler to make as few
changes as possible in processing an
incorrect input string.
• There are algorithms for choosing a
minimal sequence of changes to obtain a
globally least-cost correction. Suppose we
have a line like this:
THIS IS A OCMPERIL SCALS.
• To correct this, there is an attractor, which
checks how different tokens are from the
initial inputs, and finds the closest
attractor to the incorrect token.
• This is more of a probabilistic type of
error correction. Unfortunately, these
methods are in general too costly to
implement in terms of space and time.
Top-Down Parsing
Top-Down Parsing
• Constructing a parse tree for the input string,
starting from the root and creating the nodes of the
parse tree in preorder (depth-first)
• top-down parsing can be viewed as finding a
leftmost derivation for an input string



Top-Down Parsing
• Recursive-Descent Parsing
– Backtracking is needed (If a choice of a production rule does not work, we
backtrack to try other alternatives.)
– It is a general parsing technique, but not widely used.
– Not efficient
• Predictive Parsing
– no backtracking
– efficient
– The class of grammars for which we can construct predictive parsers looking “k”
symbols ahead in the input is sometimes called the LL(k) class.
– needs a special form of grammars (LL(1) grammars).
– Recursive Predictive Parsing is a special form of Recursive Descent parsing without
backtracking.
– Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.



Recursive-Descent Parsing
• A recursive-descent parsing program consists of a set of
procedures, one for each nonterminal
• Execution begins with the procedure for the start symbol,
which halts and announces success if its procedure body
scans the entire input string



Recursive-Descent Parsing
• General recursive-descent may require
backtracking
• However, backtracking is rarely needed to parse
programming language constructs
• Backtracking is not very efficient
• A left-recursive grammar can cause a recursive-
descent parser, even one with backtracking, to go
into an infinite loop.



Recursive-Descent Parsing



Recursive-Descent Parsing
• Each non-terminal corresponds to a procedure.

Ex: A → aBb (This is the only production rule for A)

proc A {
- match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
}



Recursive-Descent Parsing
A → aBb | bAB

proc A {
case of the current token {
‘a’: - match the current token with a, and move to the next token;
- call ‘B’;
- match the current token with b, and move to the next token;
‘b’: - match the current token with b, and move to the next token;
- call ‘A’;
- call ‘B’;
}
}



Recursive-Descent Parsing
A → aBe | cBd | C
B → bB | ε
C → f

proc A {
  case of the current token {
    a: - match the current token with a, and move to the next token;
       - call B;
       - match the current token with e, and move to the next token;
    c: - match the current token with c, and move to the next token;
       - call B;
       - match the current token with d, and move to the next token;
    f: - call C                          (f is in the first set of C)
  }
}

proc B {
  case of the current token {
    b:    - match the current token with b, and move to the next token;
          - call B;
    e, d: do nothing                     (e and d are in the follow set of B)
  }
}

proc C {
  match the current token with f, and move to the next token;
}
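The pseudocode above translates almost line for line into real code. The following is a minimal sketch in C for this grammar; the single-character token representation, the error handling, and the test input are assumptions of this sketch, not taken from the slides:

```c
#include <stdio.h>
#include <stdlib.h>

/* Grammar:  A -> aBe | cBd | C     B -> bB | epsilon     C -> f
 * Tokens are single characters; '\0' marks the end of the input. */

static const char *input;     /* rest of the input */
static char lookahead;        /* current token */

static void error(const char *msg) {
    fprintf(stderr, "syntax error: %s (at '%c')\n", msg, lookahead ? lookahead : '$');
    exit(1);
}

static void advance(void) { lookahead = *input; if (*input) input++; }

static void match(char t) {
    if (lookahead == t) advance();
    else { fprintf(stderr, "syntax error: expected '%c', found '%c'\n", t, lookahead); exit(1); }
}

static void B(void);
static void C(void);

static void A(void) {
    switch (lookahead) {
    case 'a': match('a'); B(); match('e'); break;
    case 'c': match('c'); B(); match('d'); break;
    case 'f': C(); break;                             /* f is in FIRST(C) */
    default:  error("expected a, c, or f at the start of A");
    }
}

static void B(void) {
    if (lookahead == 'b') { match('b'); B(); }
    else if (lookahead != 'e' && lookahead != 'd')    /* B -> epsilon only before e or d */
        error("unexpected token in B");
}

static void C(void) { match('f'); }

int main(int argc, char **argv) {
    input = (argc > 1) ? argv[1] : "abbe";   /* "abbe": A -> aBe, B -> bB, B -> bB, B -> eps */
    advance();
    A();
    if (lookahead != '\0') error("extra input after A");
    printf("input accepted\n");
    return 0;
}
```

Run on “abbe” or “cbbd” it accepts; on “abd” it reports that ‘e’ was expected where ‘d’ was found.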
FIRST and FOLLOW
• The construction of both top-down and bottom-up parsers is
aided by two functions, FIRST and FOLLOW
• Define FIRST(α) to be the set of terminals that begin strings derived
from α, where α is any string of grammar symbols
• Define FOLLOW(A), for a nonterminal A, to be the set of terminals a that can appear
immediately to the right of A in some sentential form
– that is, the set of terminals a such that there exists a derivation of the form
S ⇒* αAaβ for some α and β



FIRST
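The standard rules for computing FIRST (as given in the Dragon book) are:
• If X is a terminal, then FIRST(X) = {X}.
• If X is a nonterminal and X → Y1 Y2 … Yk is a production, add every non-ε symbol of FIRST(Y1) to FIRST(X); if Y1 can derive ε, also add the non-ε symbols of FIRST(Y2), and so on; if all of Y1 … Yk can derive ε, add ε to FIRST(X).
• If X → ε is a production, add ε to FIRST(X).
• FIRST of a string X1 X2 … Xn is computed the same way, symbol by symbol from the left.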



FIRST
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(E)  = {(, id}       FIRST(TE’)  = {(, id}
FIRST(E’) = {+, ε}        FIRST(+TE’) = {+}
FIRST(T)  = {(, id}       FIRST(FT’)  = {(, id}
FIRST(T’) = {*, ε}        FIRST(*FT’) = {*}
FIRST(F)  = {(, id}       FIRST((E))  = {(}
FIRST(ε)  = {ε}           FIRST(id)   = {id}
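As a cross-check, FIRST sets can be computed mechanically by iterating the rules until nothing changes. The program below is a sketch in C of that fixed-point computation for this grammar; the single-letter encoding (R for E’, S for T’, i for id, # for ε) is a convention of the sketch only:

```c
#include <stdio.h>
#include <string.h>

/* Grammar, encoded with single letters:
 *   E -> TR   R -> +TR | #   T -> FS   S -> *FS | #   F -> (E) | i
 * where R stands for E', S for T', i for id, and # for epsilon. */
#define NPROD 8
static const char *prod[NPROD] = {
    "E=TR", "R=+TR", "R=#", "T=FS", "S=*FS", "S=#", "F=(E)", "F=i"
};
static const char *nts = "ERTSF";   /* the nonterminals */
static char first[5][16];           /* FIRST set of each nonterminal, kept as a string */

static int nt(char c) { const char *p = strchr(nts, c); return p ? (int)(p - nts) : -1; }

static int add(char *set, char c) {          /* add c to a set; return 1 if it grew */
    if (strchr(set, c)) return 0;
    size_t n = strlen(set); set[n] = c; set[n + 1] = '\0'; return 1;
}

int main(void) {
    int changed = 1;
    while (changed) {                        /* iterate until no FIRST set grows */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            int A = nt(prod[p][0]);
            const char *rhs = prod[p] + 2;
            int i = 0, nullable = 1;         /* can the prefix scanned so far vanish? */
            while (rhs[i] && nullable) {
                char X = rhs[i];
                nullable = 0;
                if (X == '#') { changed |= add(first[A], '#'); break; }
                if (nt(X) < 0) { changed |= add(first[A], X); break; }    /* terminal */
                for (const char *t = first[nt(X)]; *t; t++)               /* nonterminal */
                    if (*t != '#') changed |= add(first[A], *t);
                if (strchr(first[nt(X)], '#')) nullable = 1;
                i++;
            }
            if (nullable && rhs[i] == '\0')  /* the whole right-hand side can vanish */
                changed |= add(first[A], '#');
        }
    }
    for (int k = 0; k < 5; k++) {
        printf("FIRST(%c) = {", nts[k]);
        for (const char *t = first[k]; *t; t++) printf(" %c", *t);
        printf(" }\n");
    }
    return 0;
}
```

Its output matches the sets listed above, with i standing for id and # for ε (FIRST(R) and FIRST(S) correspond to FIRST(E’) and FIRST(T’)).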
FOLLOW
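The standard rules for computing FOLLOW (again as in the Dragon book) are:
• Place $ in FOLLOW(S), where S is the start symbol and $ is the input end marker.
• If there is a production A → αBβ, then everything in FIRST(β) except ε is in FOLLOW(B).
• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).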



FOLLOW
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
LL(1) Grammars
• Predictive parsers, that is, recursive-descent parsers
needing no backtracking, can be constructed for a class of
grammars called LL(1)
• The first "L" in LL(1) stands for scanning the input from
left to right, the second "L" for producing a leftmost
derivation, and the “1" for using one input symbol of
lookahead at each step to make parsing action decisions
• No left-recursive or ambiguous grammar can be LL(1)



LL(1) Grammars
• A grammar G is LL(1) if and only if whenever A → α | β are two
distinct productions of G, the following conditions hold:
1. For no terminal “a” do both α and β derive strings beginning
with “a”.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a
terminal in FOLLOW(A). Likewise, if α ⇒* ε, then β does not
derive any string beginning with a terminal in FOLLOW(A).
• The first two conditions are equivalent to the statement that FIRST(α)
and FIRST(β) are disjoint sets.
• The third condition is equivalent to stating that if ε is in FIRST(β) ,
then FIRST(α) and FOLLOW(A) are disjoint sets, and likewise if ε is
in FIRST(α) .
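For example, for the pair E’ → +TE’ | ε of the expression grammar used earlier: FIRST(+TE’) = {+} and FIRST(ε) = {ε} are disjoint, only the second alternative derives ε, and FIRST(+TE’) ∩ FOLLOW(E’) = {+} ∩ { ), $ } = ∅. The pairs for T’ and F pass the same checks, so that grammar is LL(1).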



Predictive Parsing Table
• A predictive parsing table M [A, a] is a two-dimensional
array, where A is a nonterminal, and a is a terminal or the
symbol $ (input endmarker)



Predictive Parsing Table

FIRST(E)  = {(, id}       FOLLOW(E)  = { $, ) }
FIRST(E’) = {+, ε}        FOLLOW(E’) = { $, ) }
FIRST(T)  = {(, id}       FOLLOW(T)  = { +, ), $ }
FIRST(T’) = {*, ε}        FOLLOW(T’) = { +, ), $ }
FIRST(F)  = {(, id}       FOLLOW(F)  = { +, *, ), $ }
Predictive Parsing Table
E → TE’      FIRST(TE’) = {(, id}    so E → TE’ goes into M[E,(] and M[E,id]
E’ → +TE’    FIRST(+TE’) = {+}       so E’ → +TE’ goes into M[E’,+]
E’ → ε       FIRST(ε) = {ε}          nothing from FIRST, but since ε is in FIRST(ε)
                                     and FOLLOW(E’) = {$, )}, E’ → ε goes into M[E’,$] and M[E’,)]
T → FT’      FIRST(FT’) = {(, id}    so T → FT’ goes into M[T,(] and M[T,id]
T’ → *FT’    FIRST(*FT’) = {*}       so T’ → *FT’ goes into M[T’,*]
T’ → ε       FIRST(ε) = {ε}          nothing from FIRST, but since ε is in FIRST(ε)
                                     and FOLLOW(T’) = {$, ), +}, T’ → ε goes into M[T’,$], M[T’,)], and M[T’,+]
F → (E)      FIRST((E)) = {(}        so F → (E) goes into M[F,(]
F → id       FIRST(id) = {id}        so F → id goes into M[F,id]
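Collecting these entries gives the complete LL(1) parsing table for the expression grammar (blank entries signal errors):

         id          +            *            (           )           $
E        E → TE’                               E → TE’
E’                   E’ → +TE’                             E’ → ε      E’ → ε
T        T → FT’                               T → FT’
T’                   T’ → ε       T’ → *FT’                T’ → ε      T’ → ε
F        F → id                                F → (E)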



Predictive Parsing Table
• For every LL(1) grammar , each parsing-table entry
uniquely identifies a production or signals an error.
• For some grammars, however, M may have some entries
that are multiply defined; such a grammar is not LL(1)



Predictive Parsing Table
S → iCtSE | a
E → eS | ε
C → b

FOLLOW(S) = { $, e }    FOLLOW(E) = { $, e }    FOLLOW(C) = { t }
FIRST(iCtSE) = {i}   FIRST(a) = {a}   FIRST(eS) = {e}   FIRST(ε) = {ε}   FIRST(b) = {b}

         a          b         e                 i             t         $
S        S → a                                  S → iCtSE
E                             E → eS, E → ε                             E → ε
C                   C → b

Two production rules for M[E,e]

Problem: ambiguity
A Grammar which is not LL(1)
• What do we have to do if the resulting parsing table
contains multiply defined entries?
• If we did not eliminate left recursion, eliminate the left recursion in the grammar.
• If the grammar is not left factored, we have to left factor the grammar.
• If its (the new grammar’s) parsing table still contains multiply defined entries, that
grammar is ambiguous or it is inherently not an LL(1) grammar.
• A left-recursive grammar cannot be an LL(1) grammar.
– A → Aα | β
– Any terminal that appears in FIRST(β) also appears in FIRST(Aα), because Aα ⇒ βα.
– If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
• A grammar that is not left factored cannot be an LL(1) grammar.
– A → αβ1 | αβ2
– Any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.



Nonrecursive Predictive Parsing
• A nonrecursive predictive parser can be
built by maintaining a stack explicitly,
rather than implicitly via recursive calls.
• Nonrecursive predictive parsing is table
driven (a table-driven predictive parser).
• It is a top-down parser.
• It is also known as an LL(1) parser.



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing
input buffer
– our string to be parsed. We will assume that its end is marked with a special
symbol $.
output
– a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S     (initial stack)
– when the stack is emptied (i.e., only $ is left on the stack), parsing is
completed
parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule
Nonrecursive Predictive Parsing
• METHOD: Initially, the parser is in a configuration with w$
in the input buffer and the start symbol S of G on top of the
stack, above $.
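In outline, the standard table-driven parsing loop (as in the Dragon book) is the following, where a is the current input symbol and X is the symbol on top of the stack:

  set ip to point to the first symbol of w;
  set X to the top stack symbol;
  while ( X != $ ) {                              /* the stack is not empty */
      if ( X == a )  pop the stack and advance ip;
      else if ( X is a terminal )  error();
      else if ( M[X, a] is an error entry )  error();
      else if ( M[X, a] = X → Y1 Y2 ... Yk ) {
          output the production X → Y1 Y2 ... Yk;
          pop the stack;
          push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
      }
      set X to the top stack symbol;
  }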



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing



Nonrecursive Predictive Parsing
S → aBa
B → bB | ε

LL(1) parsing table:
         a           b          $
S        S → aBa
B        B → ε       B → bB

stack input output


$S abba$ S → aBa
$aBa abba$
$aB bba$ B → bB
$aBb bba$
$aB ba$ B → bB
$aBb ba$
$aB a$ B→ε
$a a$
$ $ accept, successful completion
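The trace above is exactly what a table-driven driver produces. Below is a small sketch in C for this particular grammar; the string-based stack and the hard-coded table are simplifications of this sketch, not taken from the slides:

```c
#include <stdio.h>
#include <string.h>

/* Grammar: S -> aBa    B -> bB | epsilon
 * table() returns the right-hand side to push for M[X,a]:
 * "" encodes epsilon, NULL encodes a blank (error) entry. */
static const char *table(char X, char a) {
    if (X == 'S' && a == 'a') return "aBa";   /* M[S,a] = S -> aBa */
    if (X == 'B' && a == 'b') return "bB";    /* M[B,b] = B -> bB  */
    if (X == 'B' && a == 'a') return "";      /* M[B,a] = B -> eps */
    return NULL;
}

int main(void) {
    const char *input = "abba$";
    char stack[64] = "$S";                 /* stack grows to the right; top is the last char */
    int top = 1;                           /* index of the top-of-stack symbol */
    int ip = 0;                            /* index of the current input symbol */

    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (X == 'S' || X == 'B') {                        /* nonterminal on top */
            const char *rhs = table(X, a);
            if (!rhs) { printf("error at input symbol %c\n", a); return 1; }
            printf("%c -> %s\n", X, *rhs ? rhs : "eps");   /* output the production used */
            top--;                                         /* pop X ...                   */
            for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                stack[++top] = rhs[k];                     /* ... and push its RHS reversed */
        } else if (X == a) {                               /* terminal on top: match it */
            top--; ip++;
        } else {
            printf("error: expected %c, found %c\n", X, a); return 1;
        }
    }
    puts(input[ip] == '$' ? "accept" : "error: input left over");
    return 0;
}
```

On “abba$” it prints the same production sequence as the trace above (S → aBa, B → bB, B → bB, B → ε) and then accepts.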
Error Recovery in Predictive Parsing
• An error is detected during predictive parsing
– when the terminal on top of the stack does not match
the next input symbol
– when nonterminal A is on top of the stack, a is the next
input symbol, and M[A, a] is error (i.e., the parsing-
table entry is empty)
• What should the parser do in an error case?
– The parser should be able to give an error message (as
much as possible meaningful error message).
– It should recover from that error case, and it should be
able to continue the parsing with the rest of the input.



Error Recovery in Predictive Parsing
• Panic-Mode Error Recovery
– Skipping the input symbols until a synchronizing token is found.

• Phrase-Level Error Recovery


– Each empty entry in the parsing table is filled with a pointer to a specific error routine to take
care of that error case.

• Error-Productions
– If we have a good idea of the common errors that might be encountered, we can augment the
grammar with productions that generate erroneous constructs.
– When an error production is used by the parser, we can generate appropriate error diagnostics.
– Since it is almost impossible to know all the errors that can be made by the programmers, this
method is not practical.

• Global-Correction
– Ideally, we would like the compiler to make as few changes as possible in processing incorrect
inputs.
– We have to globally analyze the input to find the error.
– This is an expensive method, and it is not used in practice.
Panic-Mode Error Recovery in LL(1) Parsing

• Based on skipping symbols on the input until a
token in a selected set of synchronizing tokens
appears.
• The synchronizing tokens should be chosen so
that the parser recovers quickly from errors
that are likely to occur in practice.
• What is the synchronizing token?
– All the terminal-symbols in the follow set of a non-
terminal can be used as a synchronizing token set
for that non-terminal.



Panic-Mode Error Recovery in LL(1) Parsing
• Some heuristics are:
– Place all symbols in FOLLOW(A) into the synchronizing set
for nonterminal A. If we skip tokens until an element of
FOLLOW(A) is seen and pop A from the stack, it is likely
that parsing can continue
– Add the symbols that begin higher-level constructs to the
synchronizing set of a lower-level construct. For example,
add keywords that begin statements to the synchronizing sets
for the nonterminals generating expressions.
– If we add the symbols in FIRST(A) to the synchronizing set for
nonterminal A, then it may be possible to resume parsing
according to A if a symbol in FIRST(A) appears in the input



Panic-Mode Error Recovery in LL(1) Parsing

• Some heuristics are (Contd…):


– The production deriving ε can be used as a
default. Doing so may postpone some error
detection, but cannot cause an error to be
missed.
– If a terminal on top of the stack cannot be
matched, a simple idea is to pop the terminal,
issue a message saying that the terminal was
inserted, and continue parsing.



Panic-Mode Error Recovery in LL(1) Parsing
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
Panic-Mode Error Recovery in LL(1) Parsing
• “synch” entries indicate synchronizing tokens, obtained from the FOLLOW set of the
nonterminal
• If the parser looks up entry M[A, a] and finds that it is blank, then the input symbol “a” is
skipped.
• If the entry is "synch," then the nonterminal on top of the stack is popped in an attempt to
resume parsing.
• If a token on top of the stack does not match the input symbol, then we pop the token
from the stack



Panic-Mode Error Recovery in LL(1) Parsing



Panic-Mode Error Recovery in LL(1) Parsing
S → AbS | e | ε
A → a | cAd

FOLLOW(S) = {$}      FOLLOW(A) = {b, d}

         a            b        c            d        e          $
S        S → AbS      sync     S → AbS      sync     S → e      S → ε
A        A → a        sync     A → cAd      sync     sync       sync

Input: aab$

stack     input     output
$S        aab$      S → AbS
$SbA      aab$      A → a
$Sba      aab$
$Sb       ab$       Error: missing b, inserted
$S        ab$       S → AbS
$SbA      ab$       A → a
$Sba      ab$
$Sb       b$
$S        $         S → ε
$         $         accept

Input: ceadb$

stack     input     output
$S        ceadb$    S → AbS
$SbA      ceadb$    A → cAd
$SbdAc    ceadb$
$SbdA     eadb$     Error: unexpected e (illegal A)
                    (remove all input tokens until the first b or d, pop A)
$Sbd      db$
$Sb       b$
$S        $         S → ε
$         $         accept
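A sketch in C of how the recovery behaviour shown in these traces can be coded is given below. It mirrors the earlier driver, adding two error actions: a mismatched terminal on the stack is popped (“inserted”), and a nonterminal with no table entry causes input to be skipped up to a token in its synchronizing (FOLLOW) set before the nonterminal is popped. The representation is again an assumption of this sketch:

```c
#include <stdio.h>
#include <string.h>

/* Grammar: S -> AbS | e | epsilon    A -> a | cAd
 * Nonterminals: S, A.  Terminals: a b c d e, with $ as the end marker. */

static int is_nonterminal(char X) { return X == 'S' || X == 'A'; }

static const char *table(char X, char a) {            /* NULL = no production */
    if (X == 'S') {
        if (a == 'a' || a == 'c') return "AbS";
        if (a == 'e') return "e";
        if (a == '$') return "";                       /* S -> epsilon */
    } else if (X == 'A') {
        if (a == 'a') return "a";
        if (a == 'c') return "cAd";
    }
    return NULL;
}

static int in_synch(char X, char a) {                  /* synch set = FOLLOW(X) plus $ */
    if (a == '$') return 1;
    return (X == 'S') ? 0 : (a == 'b' || a == 'd');    /* FOLLOW(S)={$}, FOLLOW(A)={b,d} */
}

int main(void) {
    const char *input = "ceadb$";                      /* try "aab$" as well */
    char stack[64] = "$S";
    int top = 1, ip = 0;

    while (stack[top] != '$') {
        char X = stack[top], a = input[ip];
        if (is_nonterminal(X)) {
            const char *rhs = table(X, a);
            if (rhs) {
                printf("%c -> %s\n", X, *rhs ? rhs : "eps");
                top--;
                for (int k = (int)strlen(rhs) - 1; k >= 0; k--)
                    stack[++top] = rhs[k];
            } else {                                   /* error: skip to a synch token, pop X */
                printf("error: unexpected %c while expanding %c; ", a, X);
                while (!in_synch(X, input[ip])) ip++;  /* discard input tokens */
                printf("skipped to %c, popped %c\n", input[ip], X);
                top--;
            }
        } else if (X == a) {                           /* terminal matches */
            top--; ip++;
        } else {                                       /* terminal mismatch: "insert" it */
            printf("error: missing %c, inserted\n", X);
            top--;
        }
    }
    puts(input[ip] == '$' ? "accept" : "error: input left over");
    return 0;
}
```

On “ceadb$” it reports the illegal A, skips e and a, and accepts, matching the second trace; on “aab$” it reports the missing b and accepts, matching the first.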
Phrase-Level Error Recovery
• Each empty entry in the parsing table is filled
with a pointer to a special error routine which will
take care of that error case.
• These error routines may:
– change, insert, or delete input symbols
– issue appropriate error messages
– pop items from the stack
• We should be careful when we design these error
routines, because we may put the parser into an
infinite loop.
