
3 Syntax

Some Preliminaries
• For the next several weeks we’ll look at how one can define a programming language
• What is a language, anyway?
“Language is a system of gestures, grammar, signs, sounds, symbols, or words, which is used to represent and communicate concepts, ideas, meanings, and thoughts”
• Human language is a way to communicate representations from one (human) mind to another
• What about a programming language?
A way to communicate representations (e.g., of data or a procedure) between human minds and/or machines

Introduction
We usually break down the problem of defining a programming language into two parts
• defining the PL’s syntax
• defining the PL’s semantics
Syntax - the form or structure of the expressions, statements, and program units
Semantics - the meaning of the expressions, statements, and program units
Note: There is not always a clear boundary between the two
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.

Syntax part

Why and How
Why? We want specifications for several communities:
• Other language designers
• Implementers
• Machines?
• Programmers (the users of the language)
How? One way is via natural language descriptions (e.g., user’s manuals, textbooks), but there are a number of more formal techniques for specifying the syntax and semantics

Syntax Overview
• Language preliminaries
• Context-free grammars and BNF
• Syntax diagrams

Figure caption: This is an overview of the standard process of turning a text file into an executable program.
Introduction
A sentence is a string of characters over some alphabet (e.g., def add1(n): return n + 1)
A language is a set of sentences
A lexeme is the lowest level syntactic unit of a language (e.g., *, add1, begin)
A token is a category of lexemes (e.g., identifier)
Formal approaches to describing syntax:
• Recognizers - used in compilers
• Generators - what we'll study

Lexical Structure of Programming Languages
• The structure of its lexemes (words or tokens)
– token is a category of lexeme
• The scanning phase (lexical analyser) collects characters into tokens
• Parsing phase (syntactic analyser) determines syntactic structure
Figure: stream of characters -> lexical analyser -> tokens and values -> syntactic analyser -> result of parsing

Formal Grammar
• A (formal) grammar is a set of rules for strings in a formal language
• The rules describe how to form strings from the language’s alphabet that are valid according to the language's syntax
• A grammar does not describe the meaning of the strings or what can be done with them in whatever context, only their form
Adapted from Wikipedia
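
The scanning phase described above (collecting characters into tokens) can be sketched in a few lines of Python. This is only an illustration: the token categories and the `tokenize` helper are assumptions made for this example, not part of the course material, and a real lexical analyser would cover far more of the language.

```python
import re

# Token categories below are illustrative assumptions, not taken from the slides.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:def|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("NUMBER",     r"\d+"),
    ("OPERATOR",   r"[+\-*/]"),
    ("PUNCT",      r"[():,]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs for the input string."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        if match.lastgroup != "SKIP":          # whitespace separates lexemes
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

for token, lexeme in tokenize("def add1(n): return n + 1"):
    print(f"{token:<10} {lexeme}")
```

Each pair shows the distinction made on the slide: `add1` and `n` are different lexemes that share the token category IDENTIFIER.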

Grammars
Context-Free Grammars
• Developed by Noam Chomsky in the mid-1950s.
• Language generators, meant to describe the syntax of natural languages.
• Define a class of languages called context-free languages.
Backus Normal/Naur Form (1959)
• Invented by John Backus to describe Algol 58 and refined by Peter Naur for Algol 60.
• BNF is equivalent to context-free grammars

• Chomsky & Backus independently came up with equivalent formalisms for specifying the syntax of a language
• Backus focused on a practical way of specifying an artificial language, like Algol
• Chomsky made fundamental contributions to mathematical linguistics and was motivated by the study of human languages.
Photo caption: NOAM CHOMSKY, MIT Institute Professor; Professor of Linguistics, Linguistic Theory, Syntax, Semantics, Philosophy of Language
Photo caption: Six participants in the 1960 Algol conference in Paris. This was taken at the 1974 ACM conference on the history of programming languages. Top: John McCarthy, Fritz Bauer, Joe Wegstein. Bottom: John Backus, Peter Naur, Alan Perlis.

BNF (continued)
A metalanguage is a language used to describe another language.
In BNF, abstractions are used to represent classes of syntactic structures -- they act like syntactic variables (also called nonterminal symbols), e.g.
<while_stmt> ::= while <logic_expr> do <stmt>
This is a rule; it describes the structure of a while statement
BNF
• A rule has a left-hand side (LHS), which is a single non-terminal symbol, and a right-hand side (RHS), which is one or more terminal or non-terminal symbols
• A grammar is a finite, nonempty set of rules
• A non-terminal symbol is “defined” by its rules.
• A non-terminal can have more than one rule
• Multiple rules can be combined with the vertical-bar ( | ) symbol (read as “or”)
These two rules:
<value> ::= <const>
<value> ::= <ident>
are equivalent to this one:
<value> ::= <const> | <ident>

Non-terminals, pre-terminals & terminals
• A non-terminal symbol is any symbol that is in the LHS of a rule. These represent abstractions in the language, e.g., if-then-else-statement in
<if-then-else-statement> ::= if <test> then <statement> else <statement>
• A terminal symbol is any symbol that is not on the LHS of a rule. AKA lexemes. These are the literal symbols that will appear in a program (e.g., if, then, else in the rule above).
• A pre-terminal symbol is one that appears as the LHS of one or more rules, but in every case the RHS consists of a single terminal symbol, e.g., <digit> in
<digit> ::= 0 | 1 | 2 | 3 … 7 | 8 | 9

BNF
• Repetition is done with recursion
• E.g., syntactic lists are described in BNF using recursion
• An <ident_list> is a sequence of one or more <ident>s separated by commas.
<ident_list> ::= <ident> | <ident> , <ident_list>

BNF Example
Here is an example of a simple grammar for a subset of English
A sentence is a noun phrase and a verb phrase followed by a period.
<sentence> ::= <nounPhrase> <verbPhrase> .
<nounPhrase> ::= <article> <noun>
<article> ::= a | the
<noun> ::= man | apple | worm | penguin
<verbPhrase> ::= <verb> | <verb> <nounPhrase>
<verb> ::= eats | throws | sees | is

Derivations
• A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence consisting of only terminal symbols
• It demonstrates, or proves, that the derived sentence is “generated” by the grammar and is thus in the language that the grammar defines
• As an example, consider our baby English grammar
<sentence> ::= <nounPhrase> <verbPhrase> .
<nounPhrase> ::= <article> <noun>
<article> ::= a | the
<noun> ::= man | apple | worm | penguin
<verbPhrase> ::= <verb> | <verb> <nounPhrase>
<verb> ::= eats | throws | sees | is

Derivation using BNF
Here is a derivation for “the man eats the apple.”
<sentence> -> <nounPhrase> <verbPhrase> .
-> <article> <noun> <verbPhrase> .
-> the <noun> <verbPhrase> .
-> the man <verbPhrase> .
-> the man <verb> <nounPhrase> .
-> the man eats <nounPhrase> .
-> the man eats <article> <noun> .
-> the man eats the <noun> .
-> the man eats the apple .
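
The derivation for “the man eats the apple.” can be reproduced mechanically. The sketch below encodes the baby English grammar as a Python dictionary and searches for a leftmost derivation of a target sentence; the dictionary layout and the `leftmost_derivation` function are assumptions made for illustration, and the simple backtracking search only terminates because this grammar has no recursion.

```python
GRAMMAR = {
    "<sentence>":   [["<nounPhrase>", "<verbPhrase>", "."]],
    "<nounPhrase>": [["<article>", "<noun>"]],
    "<article>":    [["a"], ["the"]],
    "<noun>":       [["man"], ["apple"], ["worm"], ["penguin"]],
    "<verbPhrase>": [["<verb>"], ["<verb>", "<nounPhrase>"]],
    "<verb>":       [["eats"], ["throws"], ["sees"], ["is"]],
}

def leftmost_derivation(target, form=("<sentence>",)):
    """Return the list of sentential forms that derive target, or None."""
    # Locate the leftmost nonterminal in the current sentential form.
    index = next((i for i, sym in enumerate(form) if sym in GRAMMAR), None)
    if index is None:
        return [form] if list(form) == target else None
    # Prune: terminals to the left of that nonterminal must already match.
    if list(form[:index]) != target[:index]:
        return None
    for rhs in GRAMMAR[form[index]]:
        new_form = form[:index] + tuple(rhs) + form[index + 1:]
        rest = leftmost_derivation(target, new_form)
        if rest is not None:
            return [form] + rest
    return None

steps = leftmost_derivation("the man eats the apple .".split())
for step in steps:
    print(" ".join(step))
```

Every printed line is a sentential form, and only the last one is a sentence. A string the grammar does not generate, such as "the apple man .", makes the function return `None`.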

Derivation
Every string of symbols in the derivation is a sentential form
A sentence is a sentential form that has only terminal symbols
A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded in the next step
A derivation may be either leftmost or rightmost or something else

Another BNF Example
<program> -> <stmts>
<stmts> -> <stmt> | <stmt> ; <stmts>
<stmt> -> <var> = <expr>
<var> -> a | b | c | d
<expr> -> <term> + <term> | <term> - <term>
<term> -> <var> | const
Note: There is some variation in notation for BNF grammars. Here we are using -> in the rules instead of ::= .
Here is a derivation:
<program> => <stmts>
=> <stmt>
=> <var> = <expr>
=> a = <expr>
=> a = <term> + <term>
=> a = <var> + <term>
=> a = b + <term>
=> a = b + const

Finite and Infinite languages
• A simple language may have a finite number of sentences
– The set of strings representing integers between -10**6 and +10**6 is a finite language
– A finite language can be defined by enumerating the sentences, but using a grammar might be much easier
• Most interesting languages have an infinite number of sentences
• Even very simple grammars may yield infinite languages
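
As a concrete case of a finite language, the baby English grammar from the earlier slides contains no recursion, so all of its sentences can be enumerated. The sketch below does that with a hypothetical `expand` helper (an assumption for this example); the same function would never terminate on a grammar with a reachable recursive rule, since such a grammar generally yields an infinite language.

```python
from itertools import product

GRAMMAR = {
    "<sentence>":   [["<nounPhrase>", "<verbPhrase>", "."]],
    "<nounPhrase>": [["<article>", "<noun>"]],
    "<article>":    [["a"], ["the"]],
    "<noun>":       [["man"], ["apple"], ["worm"], ["penguin"]],
    "<verbPhrase>": [["<verb>"], ["<verb>", "<nounPhrase>"]],
    "<verb>":       [["eats"], ["throws"], ["sees"], ["is"]],
}

def expand(symbol):
    """Return every terminal string (a tuple of words) derivable from symbol."""
    if symbol not in GRAMMAR:              # terminal symbol
        return [(symbol,)]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each symbol on the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results

sentences = expand("<sentence>")
print(len(sentences))                      # 8 noun phrases * (4 + 4 * 8) verb phrases = 288
```

Six short rules stand in for a list of 288 sentences, which is the sense in which a grammar "might be much easier" than enumeration.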

Is English a finite or infinite language?
• Assume we have a finite set of words
• Consider adding rules like the following to our previous “baby English” grammar
<sentence> ::= <sentence><conj><sentence>.
<conj> ::= and | or | because
• Hint: Whenever you see recursion in a BNF it’s likely that the language is infinite.
– When might it not be?
– The recursive rule might not be reachable.

Parse Tree
A parse tree is a hierarchical representation of a derivation
<sentence>
    <nounPhrase>
        <article>  the
        <noun>     man
    <verbPhrase>
        <verb>     eats
        <nounPhrase>
            <article>  the
            <noun>     apple

Another Parse Tree
<program>
    <stmts>
        <stmt>
            <var>  a
            =
            <expr>
                <term>
                    <var>  b
                +
                <term>  const

Derivations and Parse Trees
• Note that an unannotated derivation (i.e., just the sequence of sentential forms) contains some, but not all, information about the parse structure
• Sometimes, it’s ambiguous which symbol was replaced
• Also note that a parse tree contains partial information about the derivation: specifically, what rules were applied, but not the order of application

Grammar
• A grammar is ambiguous if and only if (iff) it generates a sentential form that has two or more distinct parse trees
• Ambiguous grammars are, in general, very undesirable in formal languages
• Can you guess why?
• We can eliminate ambiguity by revising the grammar

Ambiguous English Sentences
• I saw the man on the hill with a telescope
• Time flies like an arrow; fruit flies like a banana
• Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo
See: Syntactic Ambiguity

An ambiguous grammar
Here is a simple grammar for expressions that is ambiguous
<e> -> <e> <op> <e>
<e> -> 1|2|3
<op> -> +|-|*|/
Fyi… In a programming language, an expression is some code that is evaluated and produces a value. A statement is code that is executed and does something but does not produce a value.
The sentence 1+2*3 can lead to two different parse trees corresponding to 1+(2*3) and (1+2)*3

Two derivations for 1+2*3
<e> -> <e> <op> <e>
<e> -> 1|2|3
<op> -> +|-|*|/

First derivation:
<e> -> <e> <op> <e>
-> 1 <op> <e>
-> 1 + <e>
-> 1 + <e> <op> <e>
-> 1 + 2 <op> <e>
-> 1 + 2 * <e>
-> 1 + 2 * 3

Second derivation:
<e> -> <e> <op> <e>
-> <e> <op> <e> <op> <e>
-> 1 <op> <e> <op> <e>
-> 1 + <e> <op> <e>
-> 1 + 2 <op> <e>
-> 1 + 2 * <e>
-> 1 + 2 * 3

Parse trees for the first and second derivation:
        e                          e
     /  |  \                    /  |  \
    e   op   e                 e   op   e
    |   |  / | \             / | \  |   |
    1   + e  op e           e  op e *   3
          |  |  |           |  |  |
          2  *  3           1  +  2

The leaves of the trees are terminals and correspond to the sentence
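
To see the ambiguity concretely, the sketch below enumerates every parse tree that the rule <e> -> <e> <op> <e> allows for a token list and evaluates each one. The `parses` generator and the `OPS` table are assumptions made for this illustration; any operator in the sentence may serve as the root of the tree, which is exactly what the ambiguous rule permits.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def parses(tokens):
    """Yield (parenthesized text, value) for every parse tree of tokens."""
    if len(tokens) == 1:                   # <e> -> 1|2|3
        yield tokens[0], int(tokens[0])
        return
    # <e> -> <e> <op> <e>: each operator position can be the root.
    for i in range(1, len(tokens), 2):
        for left_text, left_value in parses(tokens[:i]):
            for right_text, right_value in parses(tokens[i + 1:]):
                text = f"({left_text}{tokens[i]}{right_text})"
                yield text, OPS[tokens[i]](left_value, right_value)

results = list(parses(["1", "+", "2", "*", "3"]))
for text, value in results:
    print(text, "=", value)
```

The two trees give the values 7 and 9, which is one answer to "Can you guess why?": an ambiguous grammar leaves the meaning of the program undetermined.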
Two derivations for 1+2*3
<e> -> <e> <op> <e>
<e> -> 1|2|3
<op> -> +|-|*|/

  +            *
 / \          / \
1   *        +   3
   / \      / \
  2   3    1   2

Lower trees represent the way we think about the sentence ‘meaning’

Operators
• The traditional operator notation introduces many problems.
• Operators are used in
– Prefix notation: Expression (* (+ 1 3) 2) in Lisp
– Infix notation: Expression (1 + 3) * 2 in Java
– Postfix notation: Increment foo++ in C
• Operators can have one or more operands
– Increment in C is a one-operand operator: foo++
– Subtraction in C is a two-operand operator: foo - bar
– Conditional expression in C is a three-operand operator: (foo == 3 ? 0 : 1)

Operator notation
• So, how do we interpret expressions like
(a) 2 + 3 + 4
(b) 2 + 3 * 4
• While you might argue that it doesn’t matter for (a), it can for different operators (2 ** 3 ** 4) or when the limits of representation are hit (e.g., round-off in numbers, as in 1+1+1+1+1+1+1+1+1+1+1+10**6)
• Concepts:
– Explaining rules in terms of operator precedence and associativity
– Realizing the rules in grammars

Operators: Precedence and Associativity
• Precedence and associativity deal with the evaluation order within expressions
• Precedence rules specify the order in which operators of different precedence levels are evaluated, e.g.:
“*” has a higher precedence than “+”, so “*” groups more tightly than “+”
• What is the result of 4 * 5 ** 6 ?
• A language’s precedence hierarchy should match our intuitions, but the result’s not always perfect, as in this Pascal example:
if A<B and C<D then A := 0 ;
• Pascal relational operators have lowest precedence! The condition above is therefore grouped as:
if A < (B and C) < D then A := 0 ;

Operator Precedence: Precedence Table (two table slides; the tables are figures that are not reproduced here)

Operators: Associativity
• Associativity rules specify the order in which operators of the same precedence level are evaluated
• Operators are typically either left associative or right associative.
• Left associativity is typical for +, - , * and /
• So A + B + C
– Means: (A + B) + C
– And not: A + (B + C)
• Does it matter?

Operators: Associativity
• For + and * it doesn’t matter in theory (though it can in practice) but for - and / it matters in theory, too.
• What should A-B-C mean?
(A - B) - C    or    A - (B - C)
• What is the result of 2 ** 3 ** 4 ?
– 2 ** (3 ** 4) = 2 ** 81 = 2417851639229258349412352
– (2 ** 3) ** 4 = 8 ** 4 = 4096
• Languages diverge on this case:
– In Fortran, ** associates from right to left, as is normally the case for mathematics
– In Ada, ** doesn’t associate; you must write the previous expression as 2 ** (3 ** 4) to obtain the expected answer

Associativity in C
• In C, as in most languages, most of the operators associate left to right
a + b + c => (a + b) + c
• The various assignment operators however associate right to left
= += -= *= /= %= >>= <<= &= ^= |=
• Consider a += b += c, which is interpreted as
a += (b += c)
• and not as
(a += b) += c
• Why?
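
The associativity questions on these slides can be checked directly in Python, used here only as a convenient calculator. Python happens to follow the Fortran convention for `**` and treats `-` as left associative; other languages may differ, as the Ada example shows. The floating-point values are an example chosen for this sketch to show the "in practice" caveat for `+`.

```python
# ** associates right to left in Python, as in Fortran.
print(2 ** 3 ** 4 == 2 ** (3 ** 4))          # True
print((2 ** 3) ** 4)                         # 4096

# Subtraction is left associative, and the grouping changes the result.
print(10 - 4 - 3, (10 - 4) - 3, 10 - (4 - 3))    # 3 3 9

# Addition is associative in theory, but floating-point round-off breaks it.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                         # False
```

So "does it matter?" has two answers: for - and / the grouping changes the mathematical result, and for + it can still change the computed result once round-off is involved.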

Precedence and associativity in Grammar
If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity
An unambiguous expression grammar:
<expr> -> <expr> - <term> | <term>
<term> -> <term> / const | const

Precedence and associativity in Grammar
Sentence: const - const / const
Derivation:
<expr> => <expr> - <term>
=> <term> - <term>
=> const - <term>
=> const - <term> / const
=> const - const / const
Parse tree:
        <expr>
       /   |   \
  <expr>   -   <term>
    |         /  |  \
  <term>  <term> /  const
    |       |
  const   const

Grammar (continued)
Operator associativity can also be indicated by a grammar
<expr> -> <expr> + <expr> | const (ambiguous)
<expr> -> <expr> + const | const (unambiguous)
          <expr>
         /  |   \
    <expr>  +   const
    /  |  \
<expr> +  const
  |
const
Does this grammar rule make the + operator right or left associative?

An Expression Grammar
Here’s a grammar to define simple arithmetic expressions over variables and numbers.
Exp ::= num
Exp ::= id
Exp ::= UnOp Exp
Exp ::= Exp BinOp Exp
Exp ::= '(' Exp ')'
UnOp ::= '+'
UnOp ::= '-'
BinOp ::= '+' | '-' | '*' | '/'
Here’s another common notation variant where single quotes are used to indicate terminal symbols and unquoted symbols are taken as non-terminals.

A derivation
A derivation of a+b*2 using the expression grammar:
Exp =>                    // Exp ::= Exp BinOp Exp
Exp BinOp Exp =>          // Exp ::= id
id BinOp Exp =>           // BinOp ::= '+'
id + Exp =>               // Exp ::= Exp BinOp Exp
id + Exp BinOp Exp =>     // Exp ::= num
id + Exp BinOp num =>     // Exp ::= id
id + id BinOp num =>      // BinOp ::= '*'
id + id * num
a + b * 2

A parse tree
A parse tree for a+b*2:
        __Exp__
       /   |   \
     Exp BinOp  Exp
      |    |   / | \
     id    + Exp BinOp Exp
      |       |    |    |
      a      id    *   num
              |         |
              b         2


A parse tree
Another possible parse tree for a+b*2:
            __Exp__
           /   |   \
        Exp  BinOp  Exp
       / | \    |     |
   Exp BinOp Exp *   num
    |    |    |       |
   id    +   id       2
    |         |
    a         b

Precedence
• Precedence refers to the order in which operations are evaluated
• Usual convention: exponents > mult, div > add, sub
• Deal with operations in categories: exponents, mulops, addops.
• A revised grammar that follows these conventions:
Exp ::= Exp AddOp Exp
Exp ::= Term
Term ::= Term MulOp Term
Term ::= Factor
Factor ::= '(' Exp ')'
Factor ::= num | id
AddOp ::= '+' | '-'
MulOp ::= '*' | '/'

Associativity
• Associativity refers to the order in which two of the same operation should be computed
• 3+4+5 = (3+4)+5, left associative (all BinOps)
• 3^4^5 = 3^(4^5), right associative
• Conditionals right associate but have a wrinkle: an else clause associates with the closest unmatched if
if a then if b then c else d
= if a then (if b then c else d)

Adding associativity to the grammar
Adding associativity to the BinOp expression grammar
Exp ::= Exp AddOp Term
Exp ::= Term
Term ::= Term MulOp Factor
Term ::= Factor
Factor ::= '(' Exp ')'
Factor ::= num | id
AddOp ::= '+' | '-'
MulOp ::= '*' | '/'

Grammar
Exp ::= Exp AddOp Term
Exp ::= Term
Term ::= Term MulOp Factor
Term ::= Factor
Factor ::= '(' Exp ')'
Factor ::= num | id
AddOp ::= '+' | '-'
MulOp ::= '*' | '/'
Derivation
Exp =>
Exp AddOp Term =>
Exp AddOp Term AddOp Term =>
Term AddOp Term AddOp Term =>
Factor AddOp Term AddOp Term =>
num AddOp Term AddOp Term =>
num + Term AddOp Term =>
num + Factor AddOp Term =>
num + num AddOp Term =>
num + num - Term =>
num + num - Factor =>
num + num - num
Parse tree (E = Exp, A = AddOp, T = Term, F = Factor)
            E
          / | \
         E  A  T
       / | \ |  |
      E  A  T -  F
      |  |  |    |
      T  +  F   num
      |     |
      F    num
      |
     num

Example: conditionals
• Most languages allow two conditional forms, with and without an else clause:
– if x < 0 then x = -x
– if x < 0 then x = -x else x = x+1
• But we’ll need to decide how to interpret:
– if x < 0 then if y < 0 then x = -1 else x = -2
• To which if does the else clause attach?
• This is like the syntactic ambiguity in attachment of prepositional phrases in English
– the man near a cat with a hat


Example: conditionals
• Most languages use a standard rule to determine which if expression an else clause attaches to
• The rule:
– An else clause attaches to the nearest if to its left that does not yet have an else clause
• Example:
– if x < 0 then if y < 0 then x = -1 else x = -2
– is read as: if x < 0 then (if y < 0 then x = -1 else x = -2)

Example: conditionals
• Goal: to create a correct grammar for conditionals.
• It needs to be non-ambiguous, and the precedence is: an else goes with the nearest unmatched if
• The grammar is ambiguous. The second Conditional allows unmatched ifs to be Conditionals
– Good: if test then (if test then whatever else whatever)
– Bad: if test then (if test then whatever) else whatever
• Goal: write a grammar that forces an else clause to attach to the nearest if w/o an else clause

Example: conditionals
The final unambiguous grammar
Statement ::= Matched | Unmatched
Matched ::= 'if' test 'then' Matched 'else' Matched
| 'whatever'
Unmatched ::= 'if' test 'then' Statement
| 'if' test 'then' Matched 'else' Unmatched
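
The nearest-unmatched-if rule is also what a simple top-down parser produces if it takes an else whenever one is available. The sketch below is an assumption-laden illustration rather than a transcription of the Matched/Unmatched grammar: `parse_statement` is a hypothetical helper, and the tokens `test` and `whatever` stand in for real conditions and statements, as on the slides.

```python
def parse_statement(tokens, pos=0):
    """Parse one statement starting at tokens[pos]; return (tree, next_pos)."""
    if tokens[pos:pos + 1] == ["whatever"]:
        return "whatever", pos + 1
    if tokens[pos:pos + 3] != ["if", "test", "then"]:
        raise SyntaxError(f"unexpected input at token {pos}")
    then_branch, pos = parse_statement(tokens, pos + 3)
    else_branch = None
    # Take an else if one follows: the innermost open if sees it first,
    # so the else binds to the nearest if that does not yet have one.
    if tokens[pos:pos + 1] == ["else"]:
        else_branch, pos = parse_statement(tokens, pos + 1)
    return ("if", then_branch, else_branch), pos

source = "if test then if test then whatever else whatever"
tree, end = parse_statement(source.split())
print(tree)
```

The outer if ends up with no else branch and the inner if owns the else, which is the "Good" reading from the slide.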

Syntactic Sugar
• Syntactic sugar: syntactic features designed to make code easier to read or write while alternatives exist
• Makes the language sweeter for humans to use: things can be expressed more clearly, concisely, or in an alternative style that some prefer
• Syntactic sugar can be removed from a language without affecting what can be done
• All applications of the construct can be systematically replaced with equivalents that don’t use it
adapted from Wikipedia

Extended BNF
Syntactic sugar: doesn’t extend the expressive power of the formalism, but does make it easier to use, i.e., more readable and more writable
• Optional parts are placed in brackets ([])
<proc_call> -> ident [ ( <expr_list>)]
• Put alternative parts of RHSs in parentheses and separate them with vertical bars
<term> -> <term> (+ | -) const
• Put repetitions (0 or more) in braces ({})
<ident> -> letter {letter | digit}

BNF vs EBNF
BNF:
<expr> -> <expr> + <term>
| <expr> - <term>
| <term>
<term> -> <term> * <factor>
| <term> / <factor>
| <factor>
EBNF:
<expr> -> <term> {(+ | -) <term>}
<term> -> <factor> {(* | /) <factor>}
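
The EBNF repetition braces correspond closely to the `*` operator of regular expressions. As a small illustration, the rule <ident> -> letter {letter | digit} can be written as a Python regular expression; treating "letter" as ASCII A-Z and a-z and "digit" as 0-9 is an assumption made here, since the slides do not define those classes.

```python
import re

# <ident> -> letter {letter | digit}
# The braces (zero or more repetitions) become the regex star.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def is_ident(text):
    """Return True if the whole string matches the <ident> rule."""
    return IDENT.fullmatch(text) is not None

print(is_ident("add1"))   # True
print(is_ident("1add"))   # False: must start with a letter
```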

Syntax Graphs
Syntax Graphs - Put the terminals in circles or ellipses and put the nonterminals in rectangles; connect with lines with arrowheads
e.g., Pascal type declarations
Provides an intuitive, graphical notation.

Parsing
• A grammar describes the strings of tokens that are syntactically legal in a PL
• A recogniser simply accepts or rejects strings.
• A generator produces sentences in the language described by the grammar
• A parser constructs a derivation or parse tree for a sentence (if possible)
• Two common types of parsers are:
– bottom-up or data driven
– top-down or hypothesis driven
• A recursive descent parser is a way to implement a top-down parser that is particularly simple.

Parsing complexity
• How hard is the parsing task?
• Parsing an arbitrary context free grammar is O(n^3), i.e., it can take time proportional to the cube of the number of symbols in the input. This is bad!
• If we constrain the grammar somewhat, we can always parse in linear time. This is good!
• Linear-time parsing
– LL parsers
» Recognize LL grammar
» Use a top-down strategy
– LR parsers
» Recognize LR grammar
» Use a bottom-up strategy
• LL(n) : Left to right, Leftmost derivation, look ahead at most n symbols.
• LR(n) : Left to right, Rightmost derivation, look ahead at most n symbols.
Parsing complexity
• How hard is the parsing task?
• Parsing an arbitrary context free grammar is O(n^3) in the worst case.
• E.g., it can take time proportional to the cube of the number of symbols in the input
• So what?
• This is bad!

Parsing complexity
• If it takes t1 seconds to parse your C program with n lines of code, how long will it take if you make it twice as long?
- time(n) = t1, time(2n) = 2^3 * time(n)
- 8 times longer
• Suppose v3 of your code has 10n lines?
• 10^3 or 1000 times as long
• Windows Vista was said to have ~50M lines of code

Linear complexity parsing
• Practical parsers have time complexity that is linear in the number of tokens, i.e., O(n)
• If v2.0 of your program is twice as long, it will take twice as long to parse
• This is achieved by modifying the grammar so it can be parsed more easily
• Linear-time parsing
– LL parsers
» Recognize LL grammar
» Use a top-down strategy
– LR parsers
» Recognize LR grammar
» Use a bottom-up strategy
• LL(n) : Left to right, Leftmost derivation, look ahead at most n symbols.
• LR(n) : Left to right, Rightmost derivation, look ahead at most n symbols.
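
The scaling arithmetic on the complexity slides is easy to check. The helper below is a hypothetical function written for this note; it assumes running time is exactly proportional to n raised to the given exponent, which ignores constant factors and lower-order terms.

```python
def slowdown(scale, exponent):
    """Factor by which running time grows when the input grows by `scale`,
    assuming time is proportional to n ** exponent."""
    return scale ** exponent

print(slowdown(2, 3))    # cubic parser, input twice as long: 8
print(slowdown(10, 3))   # cubic parser, input ten times as long: 1000
print(slowdown(2, 1))    # linear parser, input twice as long: 2
```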

Recursive Descent Parsing
• Each nonterminal in the grammar has a subprogram associated with it; the subprogram parses all sentential forms that the nonterminal can generate
• The recursive descent parsing subprograms are built directly from the grammar rules
• Recursive descent parsers, like other top-down parsers, cannot be built from left-recursive grammars (why not?)

Hierarchy of Linear Parsers
• Basic containment relationship
– Only a subset of all the CFGs can be recognized by LR parsers
– Only a subset of those can be recognized by LL parsers
Figure: nested regions, with CFGs containing LR parsing, which contains LL parsing

Recursive Descent Parsing Example
Example: For the grammar:
<term> -> <factor> {(*|/)<factor>}
We could use the following recursive descent parsing subprogram (e.g., one in C)
void term() {
    factor();                  /* parse first factor */
    while (next_token == ast_code ||
           next_token == slash_code) {
        lexical();             /* get next token */
        factor();              /* parse next factor */
    }
}
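
The same idea can be sketched as a complete, runnable program. The Python version below follows the EBNF rules for <expr> and <term> shown earlier, with one method per nonterminal, and evaluates the expression while parsing. The <factor> rule (a number or a parenthesized expression), the `Parser` class, and the `tokenize` and `evaluate` helpers are assumptions made for this example, not part of the course material.

```python
import re

def tokenize(text):
    # Numbers, operators and parentheses; any other character is silently skipped.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # <expr> -> <term> {(+ | -) <term>}
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.advance()
            right = self.term()
            value = value + right if op == "+" else value - right
        return value

    def term(self):
        # <term> -> <factor> {(* | /) <factor>}
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.advance()
            right = self.factor()
            value = value * right if op == "*" else value / right
        return value

    def factor(self):
        # <factor> -> num | ( <expr> )    (assumed rule)
        token = self.advance()
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is None or not token.isdigit():
            raise SyntaxError(f"unexpected token {token!r}")
        return int(token)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value

print(evaluate("1 + 2 * 3"))     # 7
print(evaluate("10 - 4 - 3"))    # 3
print(evaluate("(1 + 2) * 3"))   # 9
```

The while loops replace the left recursion of the BNF version, which is why the EBNF form suits recursive descent; accumulating into `value` from left to right keeps - and / left associative, and calling `term` from `expr` gives * and / the higher precedence.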

The Chomsky hierarchy
• The Chomsky hierarchy has four types of languages and their associated grammars and machines.
• They form a strict hierarchy; that is, regular languages < context-free languages < context-sensitive languages < recursively enumerable languages.
• The syntax of computer languages is usually describable by regular or context free languages.

Summary
• The syntax of a programming language is usually defined using BNF or a context free grammar
• In addition to defining what programs are syntactically legal, a grammar also encodes meaningful or useful abstractions (e.g., block of statements)
• Typical syntactic notions like operator precedence, associativity, sequences, optional statements, etc. can be encoded in grammars
• A parser is based on a grammar and takes an input string, does a derivation and produces a parse tree.
