Professional Documents
Culture Documents
3 • For the next several weeks we’ll look at how one We usually break down the problem of defining a
can define a programming language programming language into two parts
• What is a language, anyway? • defining the PL’s syntax
“Language is a system of gestures, grammar, signs, • defining the PL’s semantics
sounds, symbols, or words, which is used to represent
and communicate concepts, ideas, meanings, and Syntax - the form or structure of the expressions,
thoughts”
Syntax
statements, and program units
• Human language is a way to communicate Semantics - the meaning of the expressions,
representations from one (human) mind to another statements, and program units
• What about a programming language?
Note: There is not always a clear boundary
A way to communicate representations (e.g., of data or a
procedure) between human minds and/or machines between the two
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Syntax part
• Invented by John Backus to describe Algol 58 This is a rule; it describes the structure of a while
and refined by Peter Naur for Algol 60. statement
Six participants in the 1960 Algol conference in Paris. This
• BNF is equivalent to context-free grammars was taken at the 1974 ACM conference on the history of
programming languages. Top: John McCarthy, Fritz Bauer,
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Joe Wegstein. Bottom: John Backus, Peter Naur, Alan Perlis.
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
BNF Non-terminals, pre-terminals & terminals BNF
• A rule has a left-hand side (LHS) which is a single • A non-terminal symbol is any symbol that is in the LHS of • Repetition is done with recursion
non-terminal symbol and a right-hand side (RHS), a rule. These represent abstractions in the language (e.g.,
one or more terminal or non-terminal symbols if-then-else-statement in • E.g., Syntactic lists are described in BNF
• A grammar is a finite, nonempty set of rules <if-then-else-statement> ::= if <test> using recursion
then <statement> else <statement>
• A non-terminal symbol is “defined” by its rules.
• A non-terminal can have more than one rule • A terminal symbol is any symbol that is not on the LHS of • An <ident_list> is a sequence of one or more
a rule. AKA lexemes. These are the literal symbols that <ident>s separated by commas.
• Multiple rules can be combined with the vertical- will appear in a program (e.g., if, then, else in rules above).
bar ( | ) symbol (read as “or”)
• A pre-terminal symbol is one that appears as a LHS of
These two rules: rule(s), but in every case, the RHSs consist of single
<value> ::= <const> <ident_list> ::= <ident> |
terminal symbol, e.g., <digit> in
<value> ::= <ident> <ident> , <ident_list>
<digit> ::= 0 | 1 | 2 | 3 … 7 | 8 | 9
are equivalent to this one:
<value> ::= <const> | <ident>
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Derivation Another BNF Example Finite and Infinite languages
<program> -> <stmts> Note: There is some
Every string of symbols in the derivation is <stmts> -> <stmt> variation in notation • A simple language may have a finite number
for BNF grammars.
a sentential form | <stmt> ; <stmts> Here we are using -> of sentences
in the rules instead
<stmt> -> <var> = <expr> of ::= . – The set of strings representing integers between -
A sentence is a sentential form that has only <var> -> a | b | c | d 10**6 and +10**6 is a finite language
terminal symbols <expr> -> <term> + <term> | <term> - <term>
– A finite language can be defined by enumerating the
<term> -> <var> | const sentences, but using a grammar might be much easier
A leftmost derivation is one in which the Here is a derivation: • Most interesting languages have an infinite
leftmost nonterminal in each sentential form <program> => <stmts>
=> <stmt> number of sentences
is the one that is expanded in the next step => <var> = <expr> • Even very simple grammars may yield
=> a = <expr>
A derivation may be either leftmost or => a = <term> + <term> infinite languages
=> a = <var> + <term>
rightmost or something else => a = b + <term>
=> a = b + const
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Derivations and Parse Trees Grammar Ambiguous English Sentences
• Note that a unannotated derivation (i.e., • A grammar is ambiguous if and only if
just the sequence of sentential forms) (iff) it generates a sentential form that
• I saw the man on the hill with a
contains some, but not all, information has two or more distinct parse trees telescope
about the parse structure • Time flies like an arrow;
• Ambiguous grammars are, in general,
• Sometimes, it’s ambiguous which symbol very undesirable in formal languages fruit flies like a banana
was replaced
• Can you guess why? • Buffalo buffalo Buffalo buffalo
• Also note that a parse tree contains
partial information about the derivation: • We can eliminate ambiguity by revising buffalo buffalo Buffalo buffalo
specifically, what rules were applied, but the grammar
not the order of application See: Syntactic Ambiguity
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
<e> -> <e> <op> <e> <e> -> <e> <op> <e>
<e> -> <e> <op> <e> <e> -> <e> <op> <e> <e> -> <e> <op> <e> <e> -> <e> <op> <e>
Here is a simple grammar for expressions that -> 1 <op> <e> -> <e> <op> <e> <op> <e> -> 1 <op> <e> -> <e> <op> <e> <op> <e>
is ambiguous -> 1 + <e> -> 1 <op> <e> <op> <e>
-> 1 + <e> -> 1 <op> <e> <op> <e>
-> 1 + <e> <op> <e> -> 1 + <e> <op> <e>
-> 1 + <e> <op> <e> -> 1 + <e> <op> <e> -> 1 + 2 <op> <e> -> 1 + 2 <op> <e>
<e> -> <e> <op> <e> -> 1 + 2 <op> <e> -> 1 + 2 <op> <e> -> 1 + 2 * <e> -> 1 + 2 * <e>
-> 1 + 2 * <e> -> 1 + 2 * <e> -> 1 + 2 * 3 -> 1 + 2 * 3
<e> -> 1|2|3 Fyi… In a programming language, an expression
-> 1 + 2 * 3 -> 1 + 2 * 3 e
<op> -> +|-|*|/ is some code that is evaluated and produces a
value. A statement is code that is executed and
e
does something but does not produce a value. e e
e op e e op e
The sentence 1+2*3 can lead to two different e op e e op e
parse trees corresponding to 1+(2*3) and 1 + e op e e op e
* 3
* 3
(1+2)*3 1 + e op e e op e
2 * 3 1 + 2
2 * 3 1 + 2
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
The leaves of the trees are terminals and correspond to the sentence
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
<e> -> <e> <op> <e>
Two derivations for 1+2*3 <e> -> 1|2|3
<op> -> +|-|*|/ Operators Operator notation
e
• The traditional operator notation introduces • So, how do we interpret expressions like
e many problems. (a) 2 + 3 + 4
e op e e op e • Operators are used in (b) 2 + 3 * 4
– Prefix notation: Expression (* (+ 1 3) 2) in Lisp • While you might argue that it doesn’t matter for (a), it
1 + e op e e op e
* 3 – Infix notation: Expression (1 + 3) * 2 in Java can for different operators (2 ** 3 ** 4) or when the
– Postfix notation: Increment foo++ in C limits of representation are hit (e.g., round off in
1 + 2
2 * 3
• Operators can have one or more operands numbers, e.g., 1+1+1+1+1+1+1+1+1+1+1+10**6)
+ * – Increment in C is a one-operand operator: foo++ • Concepts:
– Subtraction in C is a two-operand operator: foo - bar – Explaining rules in terms of operator precedence
1 * + 3
– Conditional expression in C is a three-operand and associativity
operators: (foo == 3 ? 0 : 1)
2 3 1 2 – Realizing the rules in grammars
Lower trees represent the way we think about the sentence ‘meaning’
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Operators: Precedence and Associativity Operator Precedence: Precedence Table Operator Precedence: Precedence Table
• Precedence and associativity deal with the
evaluation order within expressions
• Precedence rules specify order in which operators
of different precedence level are evaluated, e.g.:
“*” Has a higher precedence that “+”, so “*” groups more
tightly than “+”
• What is the results of 4 * 5 ** 6 ?
• A language’s precedence hierarchy should match
our intuitions, but the result’s not always perfect, as
in this Pascal example:
if A<B and C<D then A := 0 ;
• Pascal relational operators have lowest precedence!
if A < B and C < D then A := 0 ;
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Operators: Associativity Operators: Associativity Associativity in C
• For + and * it doesn’t matter in theory (though it can in
• In C, as in most languages, most of the operators
• Associativity rules specify order in which operators associate left to right
practice) but for – and / it matters in theory, too.
of the same precedence level are evaluated
• What should A-B-C mean? a + b + c => (a + b) + c
• Operators are typically either left associative or
(A – B) – C A – (B – C) • The various assignment operators however associate
right associative.
• What is the results of 2 ** 3 ** 4 ? right to left
• Left associativity is typical for +, - , * and / – 2 ** (3 ** 4) = 2 ** 81 = 2417851639229258349412352 = += -= *= /= %= >>= <<= &= ^= |=
• So A + B + C – (2 ** 3) ** 4 = 8 ** 4 = 4096
• Consider a += b += c, which is interpreted as
– Means: (A + B) + C • Languages diverge on this case:
– In Fortran, ** associates from right-to-left, as in normally
a += (b += c)
– And not: A + (B + C) the case for mathematics • and not as
• Does it matter? – In Ada, ** doesn’t associate; you must write the previous (a += b) += c
expression as 2 ** (3 ** 4) to obtain the expected answer
• Why?
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Precedence and associativity in Grammar Precedence and associativity in Grammar Grammar (continued)
Sentence: const – const / const
If we use the parse tree to indicate precedence Operator associativity can also be
Derivation:
levels of the operators, we cannot have <expr> => <expr> - <term> indicated by a grammar
ambiguity => <term> - <term>
=> const - <term> <expr> -> <expr> + <expr> | const (ambiguous)
An unambiguous expression grammar: Parse tree: => const - <term> / const <expr> -> <expr> + const | const (unambiguous)
<expr> => const - const / const
<expr> -> <expr> - <term> | <term> <expr>
<expr> + const
<term> <term> / const Does this grammar rule
const make the + operator right
const const or left associative?
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
An Expression Grammar A derivation A parse tree
Here’s a grammar to define simple arithmetic A derivation of a+b*2 using the expression grammar:
expressions over variables and numbers. A parse tree for a+b*2:
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Precedence Associativity
A parse tree
• Precedence refers to the order in which operations
Another possible parse tree for a+b*2: are evaluated • Associativity refers to the order in which two
• Usual convention: exponents > mult, div > add, sub of the same operation should be computed
__Exp__ • Deal with operations in categories: exponents, • 3+4+5 = (3+4)+5, left associative (all
| \ / mulops, addops. BinOps)
Exp BinOp Exp • A revised grammar that follows these conventions: • 3^4^5 = 3^(4^5), right associative
/ | \ | | Exp ::= Exp AddOp Exp
Exp ::= Term
• Conditionals right associate but have a
Exp BinOp Exp * num wrinkle: an else clause associates with closest
Term ::= Term MulOp Term
| | | | | Term ::= Factor unmatched if
id + id 2 Factor ::= '(' + Exp + ')‘
Factor ::= num | id
if a then if b then c else d
| |
a b
AddOp ::= '+' | '-’ = if a then (if b then c else d)
MulOp ::= '*' | '/'
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Adding associativity to the grammar Grammar
Exp
Exp
::=
::=
Exp AddOp Term
Term
Derivation
Exp =>
Example: conditionals
Term ::= Term MulOp Factor
Term ::= Factor Exp AddOp Term => • Most languages allow two conditional forms,
Adding associativity to the BinOp expression Factor
Factor
::=
::=
'(' Exp ')’
num | id
Exp AddOp Exp AddOp Term =>
with and without an else clause:
Term AddOp Exp AddOp Term =>
grammar AddOp
MulOp
::=
::=
'+' | '-‘
'*' | '/' Factor AddOp Exp AddOp Term => – if x < 0 then x = -x
Exp ::= Exp AddOp Term
Parse tree
Num AddOp Exp AddOp Term =>
– if x < 0 then x = -x else x = x+1
Exp ::= Term Num + Exp AddOp Term =>
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Syntactic Sugar Extended BNF BNF vs EBNF
BNF:
• Syntactic sugar: syntactic features designed to Syntactic sugar: doesn’t extend the expressive power <expr> -> <expr> + <term>
make code easier to read or write while of the formalism, but does make it easier to use, i.e.,
more readable and more writable | <expr> - <term>
alternatives exist
| <term>
• Makes the language sweeter for humans to use: •Optional parts are placed in brackets ([])
things can be expressed more clearly, concisely, <term> -> <term> * <factor>
<proc_call> -> ident [ ( <expr_list>)]
or in an alternative style that some prefer | <term> / <factor>
•Put alternative parts of RHSs in parentheses and
• Syntactic sugar can be removed from language separate them with vertical bars | <factor>
without effecting what can be done
<term> -> <term> (+ | -) const EBNF:
• All applications of the construct can be
systematically replaced with equivalents that don’t •Put repetitions (0 or more) in braces ({}) <expr> -> <term> {(+ | -) <term>}
use it <ident> -> letter {letter | digit} <term> -> <factor> {(* | /) <factor>}
adapted from Wikipedia
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
Recursive Decent Parsing Hierarchy of Linear Parsers Recursive Decent Parsing Example
• Basic containment relationship Example: For the grammar:
• Each nonterminal in the grammar has a – All CFGs can be recognized by LR parser
subprogram associated with it; the – Only a subset of all the CFGs can be recognized by LL parsers <term> -> <factor> {(*|/)<factor>}
subprogram parses all sentential forms that
the nonterminal can generate We could use the following recursive
descent parsing subprogram (e.g., one in
• The recursive descent parsing subprograms CFGs LR parsing C)
are built directly from the grammar rules void term() {
factor(); /* parse first factor*/
• Recursive descent parsers, like other top- while (next_token == ast_code ||
LL parsing next_token == slash_code) {
down parsers, cannot be built from left- lexical(); /* get next token */
recursive grammars (why not?) factor(); /* parse next factor */
}
}
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.
The Summary
Chomsky • The syntax of a programming language is usually
hierarchy defined using BNF or a context free grammar
• In addition to defining what programs are
syntactically legal, a grammar also encodes
meaningful or useful abstractions (e.g., block of
statements)
• Typical syntactic notions like operator
•The Chomsky hierarchy precedence, associativity, sequences, optional
has four types of languages and their associated grammars and machines. statements, etc. can be encoded in grammars
•They form a strict hierarchy; that is, regular languages < context-free
languages < context-sensitive languages < recursively enumerable languages. • A parser is based on a grammar and takes an input
•The syntax of computer languages are usually describable by regular or string, does a derivation and produces a parse tree.
context free languages.
CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc. CMSC 331, Some material © 1998 by Addison Wesley Longman, Inc.