
Syntax Analysis & Context-free Grammar

Syntax analysis
Syntax Analysis as a Phase of the Compiler:
The compilation process operates as a sequence of phases, each of which transforms one representation of
the source programme into another. Syntax analysis is the second phase of this process, and the syntax
analyzer, the module that carries it out, is the most important constituent of the analysis part, or front end,
of the compiler. The syntax analyzer, more commonly known as the parser, checks whether the lexical
units of the programme being compiled, arriving in sequence from the preceding lexical analysis phase,
follow the patterns permitted by the syntactic specification of the source language. If they do, the syntax
analyzer generates a tree-like structure called a parse tree that makes the legitimate syntactic structure of
the source programme explicit and passes it to the subsequent semantic analyzer. If they do not, the syntax
analyzer reports the detection of a syntax error in the input programme.

The Role of a Parser:


The syntax analyzer or parser works hand-in-hand with the lexical analyzer. Whenever the parser needs
further tokens, it requests the lexical analyzer which, in turn, scans the source programme from the
current position to identify the next token and returns it to the parser. The parser looks into the sequence
of tokens returned by the lexical analyzer and attempts to extract the constructs of the source language
appearing within the sequence. Thus, the role of the parser is two-fold (a minimal sketch of this
parser/lexer interaction appears after the list below):

1. If the programme being compiled is grammatically correct, the parser succeeds in identifying a
sequence of grammar rules, stipulated by the source language specification, whose application
produces the received stream of tokens in exactly the order they occur in its input. It then
outputs a representation of the input programme in the form of a parse tree, which is passed on to
the semantic analyzer, usually in a more compact form called a syntax tree, for further processing.
The parse tree serves two purposes:
i) It exhibits how the tokens fit well into the permitted syntactic constructs of
the source language.
ii) It delineates how the tokens should be grouped so as to guide semantic
actions.
2. If the input programme is syntactically ill-formed, the parser fails to construct any such parse
tree and issues an appropriate error message so that the user can take corrective action.
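The parser/lexer hand-shake described above can be illustrated with a small sketch. The following Python
fragment is only a minimal illustration under assumed names (Lexer, Parser, next_token and the
whitespace-separated token stream are hypothetical simplifications, not part of any real compiler):

# A minimal sketch of the parser/lexer interaction; all names are illustrative.

class Lexer:
    """Hands out one token at a time, on demand."""
    def __init__(self, source):
        self.tokens = source.split()   # a real lexer scans characters, not whitespace-separated words
        self.pos = 0

    def next_token(self):
        if self.pos < len(self.tokens):
            tok = self.tokens[self.pos]
            self.pos += 1
            return tok
        return None                    # end of input

class Parser:
    """Requests tokens from the lexer and collects them into a (stand-in) tree."""
    def __init__(self, lexer):
        self.lexer = lexer

    def parse(self):
        tree = []
        tok = self.lexer.next_token()  # ask the lexer for the next token
        while tok is not None:
            # a real parser would match tok against the grammar here and
            # report a syntax error on a mismatch
            tree.append(tok)
            tok = self.lexer.next_token()
        return tree                    # stands in for the parse tree

print(Parser(Lexer("if ( a ) b = c ;")).parse())

A real parser would, of course, match each token against the grammar of the source language and raise a
syntax error on a mismatch, exactly as item 2 above describes.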

Context-free Grammars:
By design, every programming language has precise rules that prescribe the syntactic structure of well-
formed programs. In C, for example, a program is made up of functions, a function of declarations and
statements, a statement of expressions, and so on. The syntax of programming language constructs is
normally specified by using a notation popularly known as context-free grammar, which is also
sometimes called BNF (Backus-Naur Form) description.
Formal Definition:
A context-free grammar (or grammar for short) G of a language L is a 4-tuple (VN, VT, P, S), where
– VN is a set of non-terminals
– VT is the set of terminals (i.e. the set of valid words in the language L)
– P is a set of production rules (or productions for short) of the form A → α,
where A is a non-terminal and α is any string over (VN ∪ VT)
– S is a special symbol in VN, called the start symbol of the grammar
Note:
i) Non-terminals, also called syntactic variables, are special symbols that denote sets of strings; for
example, in the grammar for the if-then-else C-statement, condition may be a non-terminal that will
represent all valid conditional expressions such as “i<=n” in C.
ii) Productions are rewriting rules for generating valid sentences of the language consisting of all
terminals.
iii) For parsers, terminals will be the tokens like keywords such as if and while, or punctuation
symbols such as ; and (, or operators such as + and *, or identifiers denoted by id.
Example:
Grammar for arithmetic expression:
E → E + E | E * E | -E | (E) | id (7.1)
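For concreteness, grammar (7.1) can be written out as the 4-tuple (VN, VT, P, S) of the formal definition.
The Python encoding below is just one illustrative representation (the variable names mirror the
definition; they are not part of any standard library):

# Grammar (7.1) as the 4-tuple (VN, VT, P, S); this encoding is illustrative only.
VN = {"E"}                                   # non-terminals
VT = {"+", "*", "-", "(", ")", "id"}         # terminals (tokens)
P  = {"E": [["E", "+", "E"],                 # E -> E + E
            ["E", "*", "E"],                 # E -> E * E
            ["-", "E"],                      # E -> -E
            ["(", "E", ")"],                 # E -> (E)
            ["id"]]}                         # E -> id
S  = "E"                                     # start symbol
G  = (VN, VT, P, S)
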
Derivation:
If a string ω contains a non-terminal A and A → 𝜓 is a production in grammar G, then we can replace
that A by 𝜓 in ω without altering its remaining parts. If ω = γAδ, then this is formally described by the
following notation:
γAδ ⇒ γ𝜓δ
This process is called direct derivation.
In general, we say that α derives β if there exist strings 𝜙1, 𝜙2, 𝜙3, … , 𝜙k-1, 𝜙k, with k ≥ 0 such that:
α ⇒ 𝜙1 ⇒ 𝜙2 ⇒ 𝜙3 ⇒ … ⇒ 𝜙k-1 ⇒ 𝜙k ⇒ β
This is formally denoted by
α ⇒* β
At each step in a derivation, there are two choices to be made. We need to choose which non-terminal to
replace, and having made this choice, we must pick a production with that non-terminal as head.
Derivations are particularly useful in generating terminal strings from the start symbol.
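A single direct derivation step γAδ ⇒ γ𝜓δ amounts to splicing the body of a production into a sentential
form in place of one occurrence of its head. The sketch below (derive_step is a hypothetical helper;
sentential forms are represented as lists of symbols) replays the first two steps of a derivation under
grammar (7.1):

# One direct derivation step: replace a chosen non-terminal occurrence in a
# sentential form (a list of symbols) by the body of one of its productions.
def derive_step(form, index, body):
    """Replace the non-terminal at position `index` with `body`."""
    return form[:index] + body + form[index + 1:]

# E => -E => -(E) under grammar (7.1)
form = ["E"]
form = derive_step(form, 0, ["-", "E"])        # apply E -> -E
form = derive_step(form, 1, ["(", "E", ")"])   # apply E -> (E)
print("".join(form))                           # -(E)
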
Sentential Form:
If S ⇒* α, then we say that the string α is a sentential form.

Language:
The language L(G) generated by a grammar G is the set of terminal strings derivable from the start
symbol by using one or more production rules of G. That is to say:
L(G) = { w | w ∈ VT*, S ⇒+ w }
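As a rough illustration of L(G), the sketch below enumerates, breadth-first and up to a bounded length,
the terminal strings derivable from the start symbol of a tiny sub-grammar of (7.1). For realistic grammars
L(G) is usually infinite, so the function language and the length bound are illustrative assumptions only:

# Bounded, breadth-first enumeration of L(G) for a tiny sub-grammar of (7.1).
from collections import deque

P = {"E": [["(", "E", ")"], ["id"]]}           # only the (E) and id productions
VN = set(P)

def language(start, max_len=5):
    words, queue, seen = set(), deque([(start,)]), set()
    while queue:
        form = queue.popleft()
        if all(sym not in VN for sym in form):
            words.add("".join(form))           # all terminals: a member of L(G)
            continue
        i = next(i for i, sym in enumerate(form) if sym in VN)
        for body in P[form[i]]:
            new = form[:i] + tuple(body) + form[i + 1:]
            if len(new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

print(language("E"))                           # {'id', '(id)', '((id))'}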

Leftmost Derivation:
A leftmost derivation of a terminal string is obtained by applying a production to the leftmost non-
terminal in each sentential form.
If α ⇒ β is a step in which the leftmost non-terminal in α is replaced, we write α ⇒lm β (lm being a
subscript on ⇒).
In general, if α derives β by a leftmost derivation, we write α ⇒*lm β.
Example:
The leftmost derivation of the string -(id+id) for grammar (7.1):
E ⇒lm -E ⇒lm -(E) ⇒lm -(E+E) ⇒lm -(id+E) ⇒lm -(id+id)        (1)

The strings of each step of a leftmost derivation are called left-sentential forms.
In derivation (1), for example, -E, -(E+E), -(id+id) are all left-sentential forms.
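Derivation (1) can be replayed mechanically by always rewriting the leftmost non-terminal, as in the
sketch below (leftmost_replace and the explicit list of production bodies are illustrative assumptions):

# Replaying leftmost derivation (1): rewrite the leftmost non-terminal each time.
NONTERMINALS = {"E"}

def leftmost_replace(form, body):
    """Rewrite the leftmost non-terminal in `form` with `body`."""
    i = next(i for i, sym in enumerate(form) if sym in NONTERMINALS)
    return form[:i] + body + form[i + 1:]

form = ["E"]
for body in [["-", "E"],        # E -> -E
             ["(", "E", ")"],   # E -> (E)
             ["E", "+", "E"],   # E -> E+E
             ["id"],            # E -> id  (leftmost E)
             ["id"]]:           # E -> id  (remaining E)
    form = leftmost_replace(form, body)
    print("".join(form))        # -E, -(E), -(E+E), -(id+E), -(id+id)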

Rightmost Derivation:
A rightmost derivation of a terminal string is obtained by applying a production to the rightmost non-
terminal in each sentential form.
If α ⇒ β is a step in which the rightmost non-terminal in α is replaced, we write α ⇒rm β.
In general, if α derives β by a rightmost derivation, we write α ⇒*rm β.
Example:
The rightmost derivation of the string -(id+id) for grammar (7.1):
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)        (2)

The strings of each step of a rightmost derivation are called right-sentential forms.
In derivation (2), for example, -E, -(E+E), -(E+id) are all right-sentential forms.
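Derivation (2) is obtained the same way, except that the rightmost non-terminal is rewritten at each step
(again, rightmost_replace and the listed production bodies are only illustrative):

# Replaying rightmost derivation (2): rewrite the rightmost non-terminal each time.
NONTERMINALS = {"E"}

def rightmost_replace(form, body):
    """Rewrite the rightmost non-terminal in `form` with `body`."""
    i = max(i for i, sym in enumerate(form) if sym in NONTERMINALS)
    return form[:i] + body + form[i + 1:]

form = ["E"]
for body in [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]:
    form = rightmost_replace(form, body)
    print("".join(form))        # -E, -(E), -(E+E), -(E+id), -(id+id)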

Parse Tree:
A parse tree is a graphical representation of a derivation that filters out the order in which productions are
applied to replace non-terminals.
 The root node of the parse tree is labelled with the start symbol.
 Each interior node of the parse tree represents the application of a production. The interior node
is labelled with the non-terminal A in the head of the production; the children of the node are
labelled, from left to right, by the symbols in the body of the production by which this A was
replaced during the derivation. That is to say, if A → X1X2…Xk be the production being applied,
then the nodes with labels X1, X2,…, Xk are the children of the node labelled A from left to right
respectively in the parse tree.

 If v1 and v2 are any two nodes at the same level l and v1 is to the left of v2, then v1 is to the left of
every child of v2; likewise, every child of v1 is to the left of v2 and to the left of every child of v2.
 The labels of the leaf nodes of the parse tree are terminals.
 The concatenation of the labels of the leaves of the parse tree in the left-to-right ordering, called
the yield of the parse tree, gives the terminal string being derived.
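A parse tree can be modelled by a simple node type: interior nodes are labelled with non-terminals and
hold their children in left-to-right order, leaves are labelled with terminals, and the yield is the
concatenation of the leaf labels. The sketch below (Node and yield_of are hypothetical names) builds such
a tree for -(id+id) under grammar (7.1):

# A minimal parse-tree node: interior nodes carry a non-terminal and children,
# leaves carry terminals; the yield concatenates the leaves left to right.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []      # no children means a leaf

def yield_of(node):
    if not node.children:
        return node.label
    return "".join(yield_of(child) for child in node.children)

# parse tree for -(id+id) under grammar (7.1)
tree = Node("E", [Node("-"),
                  Node("E", [Node("("),
                             Node("E", [Node("E", [Node("id")]),
                                        Node("+"),
                                        Node("E", [Node("id")])]),
                             Node(")")])])
print(yield_of(tree))                        # -(id+id)
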
Example:
The parse tree for the terminal string -(id+id) in Fig. 7.2 represents derivation (1) as well as derivation
(2), which shows that the parse tree ignores the variations in the order in which productions are applied to
replace the non-terminals.

[Fig. 7.2: parse tree for -(id+id), built from the productions of grammar (7.1): E → E+E | E*E | -E | (E) | id]

Ambiguity:
A grammar is said to be ambiguous if there exists more than one parse tree for some sentence. Put another
way, an ambiguous grammar is one that produces more than one leftmost derivation or more than one
rightmost derivation for the same sentence.
Example:
Consider grammar (7.1) which is formulated for generating arithmetic expressions.
Now consider the following arithmetic expression in C:
a+b*c
When passed through the lexical analyzer, the above expression will be converted to:
id+id*id
The above string of tokens has two distinct parse trees as shown in Fig. 7.3 (a) and 7.3 (b):

[Fig. 7.3 (a) and (b): the two distinct parse trees for id+id*id under grammar (7.1)]

Therefore, we can conclude that grammar (7.1) is ambiguous.


Note:
 The operator ‘*’ has higher precedence than ‘+’, so we normally evaluate an expression like
“a+b*c” as “a+(b*c)” rather than as “(a+b)*c” (see the evaluation sketch after this note).
 The parse tree of Fig. 7.3 (a) enforces a grouping of tokens that will lead to semantic action
which is arithmetically “correct”, while the one of Fig. 7.3 (b) combines the tokens in a manner
that will dictate “incorrect” semantic action.
 For most parsers, it is customary that the grammar be made unambiguous, for if it is not, we
cannot uniquely determine which parse tree to select for a given sentence.
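The practical difference between the two groupings can be seen by evaluating them on sample values;
this is only an illustration of why the tree of Fig. 7.3 (a) is the desirable one, not part of any parser:

# The two groupings of id+id*id imposed by the trees of Fig. 7.3, evaluated
# on sample values; illustrative only.
a, b, c = 2, 3, 4
print(a + (b * c))     # 14, the grouping of Fig. 7.3 (a): the arithmetically expected result
print((a + b) * c)     # 20, the grouping of Fig. 7.3 (b): the "incorrect" one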

Elimination of Ambiguity:
A classic example of ambiguous grammar is the following "dangling-else" grammar:
stmt → if condition then stmt
| if condition then stmt else stmt (7.2)
| other_stmt
Here ‘other_stmt’ stands for any other unconditional statement.
Grammar (7.2) is ambiguous since the string
if a>b then if c>d then x=y else x=z (7.3)
has two distinct parse trees as shown in Fig. 7.4 (a) and Fig. 7.4 (b):

The parse tree of Fig. 7.4 (a) shows the situation in which the else is taken with the outer if, meaning that
the outer construct is an if-then-else statement while the inner one is an if-then statement. On the other
hand, in the parse tree of Fig. 7.4 (b), the else is taken with the inner if, meaning that the outer construct
is an if-then statement while the inner one is an if-then-else statement. Most programming languages
accept the second parse tree as the correct one, since the preferred rule is, "Each else is to be matched
with the closest previous unmatched then"; that is, the else is associated with the innermost previous if.

We can eliminate the ambiguity of the dangling-else grammar (7.2) by rewriting it as the following
unambiguous grammar (7.4):
stmt → matched_stmt
| unmatched_stmt
matched_stmt → if condition then matched_stmt else matched_stmt (7.4)
| other_stmt
unmatched_stmt → if condition then stmt
| if condition then matched_stmt else unmatched_stmt
The idea is that a statement appearing between a then and an else must be ‘matched’; that is, the interior
statement must not end with an unmatched or open then. A matched statement is either an if-then-else
statement containing no open statements or it is any other kind of unconditional statement. This grammar
allows only one parsing for string (7.3); namely, the one that associates each else with the closest
previous unmatched then, which is demonstrated by the following unique parse tree of Fig. 7.5.
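The two competing readings of string (7.3) can also be written down as nested structures. The sketch
below (if_node is a hypothetical helper) spells out both; grammar (7.4) admits only the second, in which
the else is attached to the inner if:

# The two candidate readings of string (7.3) as nested structures; if_node is
# a hypothetical helper.  Grammar (7.4) admits only reading_2.
def if_node(cond, then_part, else_part=None):
    return {"cond": cond, "then": then_part, "else": else_part}

# Fig. 7.4 (a): else attached to the OUTER if (rejected by grammar (7.4))
reading_1 = if_node("a>b", if_node("c>d", "x=y"), "x=z")

# Fig. 7.4 (b): else attached to the INNER if (the parse grammar (7.4) allows)
reading_2 = if_node("a>b", if_node("c>d", "x=y", "x=z"))

print(reading_1)
print(reading_2)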

©Prosenjit Roy (Lecturer in Computer Sc. & Technology, APC Ray Polytechnic)
