
WUCS405

COMPILERS
Lecture 5
Recap
■ The second phase of the compiler is syntax analysis or
parsing.
■ The parser uses the first components of the tokens
produced by the lexical analyzer to create a tree-like
intermediate representation that depicts the grammatical
structure of the token stream.
– A typical representation is a syntax tree in which each
interior node represents an operation and the children of
the node represent the arguments of the operation.
Recap
■ The parser analyzes the source code (token stream)
against the production rules to detect any errors in the
code.
– The output of this phase is a parse tree.
Recap
■ Parser
– Checks the stream of words and their parts of speech
(produced by the scanner) for grammatical correctness
– Determines if the input is syntactically well formed
– Guides checking at deeper levels than syntax (static
semantics checking)
– Builds an IR representation of the code
Benefits Offered by Grammar
■ Grammars offer significant benefits for both language designers and
compiler writers:
■ A grammar gives a precise, yet easy-to-understand, syntactic
specification of a programming language.
■ Parsers can automatically be constructed for certain classes of
grammars.
– The parser-construction process can reveal syntactic ambiguities and
trouble spots.
■ A grammar imparts structure to a language.
– The structure is useful for translating source programs into correct object
code and for detecting errors.
■ A grammar allows a language to be evolved.
– New constructs can be integrated more easily into an implementation
that follows the grammatical structure of the language.
Why not use RE/DFA
■ Advantages of RE/DFA
– Simple & powerful notation for specifying patterns
– Automatic construction of fast recognizers
– Many kinds of syntax can be specified with REs
■ Limits of RE/DFA
– Finite automata cannot count: a finite automaton cannot
accept a language like {aⁿbⁿ | n ≥ 1}, because that would
require it to keep count of the number of a's before it sees the b's.
– Therefore, REs cannot check the balance of parentheses,
brackets, or begin-end pairs.
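■ To make the counting point concrete, here is a minimal Python sketch (my own illustration, not from the slides): recognizing {aⁿbⁿ | n ≥ 1} needs an unbounded counter, which a finite automaton's fixed set of states cannot provide.

def is_anbn(s: str) -> bool:
    # Count the leading a's; a DFA has only finitely many
    # states, so it cannot remember an unbounded count n,
    # which is why no RE can describe this language.
    n = 0
    while n < len(s) and s[n] == 'a':
        n += 1
    if n == 0:
        return False          # need at least one 'a'
    return s[n:] == 'b' * n   # the rest must be exactly n b's

print(is_anbn("aabb"))   # True
print(is_anbn("aab"))    # False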
CFG vs RE/DFA
■ Grammars are a more powerful notation than REs
■ Every construct that can be described by an RE can be
described by a grammar, but not vice versa
■ Every regular language is a context-free language, but
not vice versa
Context-free Grammar (CFG)
■ A context-free grammar (or CFG) has four components:
– A set of terminal symbols,
– A set of nonterminal symbols (or variables)
– One nonterminal is distinguished as the start symbol
– A set of productions of the form LHS → RHS, where
■ LHS (called the head or left side) is a single nonterminal symbol
■ RHS (called the body or right side) consists of zero or more terminals and
nonterminals
■ Best Explained with an example…
– Suppose we want to describe all legal arithmetic expressions
using addition, subtraction, multiplication, and division.
Arithmetic Expressions
■ Here is one possible CFG:
– E → int
– E → E Op E
– E → (E)
– Op → +
– Op → -
– Op → *
– Op → /
■ A sample derivation with this grammar:
E
⇒ E Op E
⇒ E Op (E)
⇒ E Op (E Op E)
⇒ E * (E Op E)
⇒ int * (E Op E)
⇒ int * (int Op E)
⇒ int * (int Op int)
⇒ int * (int + int)
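■ The four components of this CFG can be written down as plain data; below is a minimal Python sketch (the encoding and names are illustrative, not a standard representation):

# Terminals, nonterminals, start symbol, and productions
# for the arithmetic-expression grammar above.
grammar = {
    "terminals":    {"int", "+", "-", "*", "/", "(", ")"},
    "nonterminals": {"E", "Op"},
    "start":        "E",
    "productions": [
        ("E",  ["int"]),           # E  -> int
        ("E",  ["E", "Op", "E"]),  # E  -> E Op E
        ("E",  ["(", "E", ")"]),   # E  -> (E)
        ("Op", ["+"]), ("Op", ["-"]),
        ("Op", ["*"]), ("Op", ["/"]),
    ],
}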
A Notational Shorthand
■ Productions with the same head can be grouped.
■ Instead of
E → int
E → E Op E
E → (E)
Op → +
Op → -
Op → *
Op → /
■ we can write
E → int | E Op E | (E)
Op → + | - | * | /
Not Shorthand Notation
■ The syntax for regular expressions does not carry over to
CFGs.
■ Cannot use *, |, or parentheses as regular-expression
operators.
– Not allowed: S → a*b
– Instead, write:
S → Ab
A → Aa | ε
CFGs in Programming Languages
BLOCK → STMT
| { STMTS }
STMTS → ε
| STMT STMTS
STMT → EXPR;
| if (EXPR) BLOCK
| while (EXPR) BLOCK
| do BLOCK while (EXPR);
| BLOCK
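■ One way to see how such a grammar imparts structure is to expand it mechanically; below is a small Python sketch (my own illustration, with EXPR reduced to a single stand-in terminal x, and a depth cap so the mutually recursive BLOCK/STMT rules terminate):

import random

PRODUCTIONS = {
    "BLOCK": [["STMT"], ["{", "STMTS", "}"]],
    "STMTS": [[], ["STMT", "STMTS"]],
    "STMT":  [["EXPR", ";"],
              ["if", "(", "EXPR", ")", "BLOCK"],
              ["while", "(", "EXPR", ")", "BLOCK"],
              ["do", "BLOCK", "while", "(", "EXPR", ")", ";"],
              ["BLOCK"]],
    "EXPR":  [["x"]],   # stand-in for real expressions
}

def generate(symbol, depth=0):
    # Expand one grammar symbol into a list of terminals.
    if symbol not in PRODUCTIONS:
        return [symbol]                  # terminal: emit as-is
    bodies = PRODUCTIONS[symbol]
    if depth > 4:                        # cap the recursion so the
        bodies = bodies[:1]              # BLOCK/STMT cycle terminates
    return [t for sym in random.choice(bodies)
              for t in generate(sym, depth + 1)]

print(" ".join(generate("BLOCK")))   # e.g. "while ( x ) { x ; }"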
Some CFG Notation
■ Capital letters at the beginning of the alphabet will
represent nonterminals.
– e.g. A, B, C, D
■ Lowercase letters at the end of the alphabet will
represent terminals.
– e.g. t, u, v, w
■ Lowercase Greek letters will represent arbitrary strings of
terminals and nonterminals.
– e.g. α, γ, ω
Derivation
■ A derivation is a sequence of production-rule applications
that produces the input string from the start symbol.
■ During parsing, we make two decisions for some sentential
form of the input:
– Deciding which non-terminal is to be replaced.
– Deciding the production rule by which that non-terminal
will be replaced.
■ For example:
E
⇒ E Op E
⇒ E Op (E)
⇒ E Op (E Op E)
⇒ E * (E Op E)
⇒ int * (E Op E)
⇒ int * (int Op E)
⇒ int * (int Op int)
⇒ int * (int + int)
Derivation
■ A string αAω yields string αγω iff A → γ is a production.
– If α yields β, we write α ⇒ β.
■ We say that α derives β iff there is a sequence of strings
where α ⇒ α1 ⇒ α2 ⇒ ... ⇒ β
■ α ⇒* β means α derives β in zero or more steps
■ α ⇒+ β means α derives β in one or more steps
■ If two grammars generate the same language, the
grammars are said to be equivalent.
■ The process of discovering a derivation is called parsing.
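■ The "yields" relation itself fits in a few lines of Python; here is a minimal sketch (illustrative helper names, not the slides' notation), where sentential forms are lists of symbols and the first occurrence of A is rewritten:

def yields(sentential, A, gamma):
    # One derivation step: rewrite an occurrence of the
    # nonterminal A using the production A -> gamma.
    i = sentential.index(A)
    return sentential[:i] + gamma + sentential[i + 1:]

# E => E Op E => int Op E
form = ["E"]
form = yields(form, "E", ["E", "Op", "E"])
form = yields(form, "E", ["int"])
print(form)   # ['int', 'Op', 'E']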
Leftmost and Rightmost Derivation
■ The point of parsing is to construct a derivation.
– At each step, we choose a nonterminal to replace.
– Different choices can lead to different derivations
■ Two derivations are of particular interest
■ Leftmost derivation - replace the leftmost nonterminal at
each step, denoted: ⇒lm
■ Rightmost derivation - replace the rightmost nonterminal at
each step, denoted: ⇒rm
Leftmost Derivation
■ If the sentential form of an input is scanned and replaced
from left to right, it is called a leftmost derivation.
– It is a derivation in which each step expands the leftmost
nonterminal.
– The sentential form derived by a leftmost derivation is
called a left-sentential form.
Rightmost Derivation
■ If we scan and replace the input with production rules
from right to left, it is known as a rightmost derivation.
– It is a derivation in which each step expands the
rightmost nonterminal.
■ The sentential form derived by a rightmost derivation
is called a right-sentential form.
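■ The two strategies differ only in which nonterminal is picked at each step; below is a minimal Python sketch (illustrative helper names, assuming the E/Op grammar from earlier):

NONTERMINALS = {"E", "Op"}

def leftmost(form):
    # Index of the leftmost nonterminal in a sentential form.
    return next(i for i, s in enumerate(form) if s in NONTERMINALS)

def rightmost(form):
    # Index of the rightmost nonterminal in a sentential form.
    return max(i for i, s in enumerate(form) if s in NONTERMINALS)

def step(form, i, gamma):
    # Replace the nonterminal at position i by the body gamma.
    return form[:i] + gamma + form[i + 1:]

form = ["E", "Op", "E"]
print(step(form, leftmost(form),  ["int"]))   # ['int', 'Op', 'E']
print(step(form, rightmost(form), ["int"]))   # ['E', 'Op', 'int']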
Leftmost and Rightmost Derivations
Derivations
■ A derivation encodes two pieces of information:
– What productions were applied to produce the resulting
string from the start symbol?
– In what order were they applied?
■ Multiple derivations might use the same productions, but
apply them in a different order.
Parse Trees
■ A parse tree is a labeled tree representation of a
derivation that filters out the order in which productions
are applied to replace nonterminals.
– The interior nodes are labeled by nonterminals
– The leaf nodes are labeled by terminals
– The children of each internal node A are labeled, from left to right,
by the symbols in the body of the production by which this A was
replaced during the derivation
– The start symbol of the derivation becomes the root of the parse
tree.
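■ These bullet points map directly onto a tree data structure; here is a minimal Python sketch (my own node layout, not a prescribed one):

from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                # nonterminal or terminal label
    children: list = field(default_factory=list)   # empty for leaves

def show(node, depth=0):
    print("  " * depth + node.symbol)
    for child in node.children:
        show(child, depth + 1)

# Parse tree for int * int, using E -> E Op E, E -> int, Op -> *
tree = Node("E", [
    Node("E",  [Node("int")]),
    Node("Op", [Node("*")]),
    Node("E",  [Node("int")]),
])
show(tree)   # root E; interior nodes E, Op; leaves int, *, int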
Example: Parse Tree
■ Growing the parse tree alongside a leftmost derivation:
E
⇒ E Op E
⇒ int Op E
⇒ int * E
⇒ int * (E)
⇒ int * (E Op E)
⇒ int * (int Op E)
⇒ int * (int + E)
⇒ int * (int + int)
For Comparison
Parse Trees
■ Goal of syntax analysis: Recover the structure described
by a series of tokens.
■ If the language is described as a CFG, the goal is to recover a
parse tree for the input string.
■ Usually we do some simplifications on the tree; more on
that later.
