
Principles of Programming

Languages
UNIT – 1

SYNTAX & SEMANTICS


LANGUAGE

SYNTAX SEMANTICS PRAGMATICS


Syntax
• Syntax is defined as a set of rules that specifies which
arrangements of tokens are valid in a given language.
• Syntax is described using a combination of regular expressions,
characters, and terminal symbols.
• Syntactic categories are defined by rules called
productions.
• Syntax refers to the form of code, which is processed
before semantics during compilation.
• Errors in syntax are called syntactic
errors.
Semantic
• Semantics, in simple terms, is the meaning of sentences that are
formed using correct, well-structured syntax.
• For example:
• In ordinary English, the sentence "she is a boy" is
syntactically correct but conveys no sensible meaning.
• Similarly, in a programming language, statements such as
string name = 5; x = "a" + 1; int num = "subscribe";
are semantically wrong, even though their syntax is valid, because they make no sense.
• Semantics is the study of the meaning of a programming language.
• It evaluates the meaning of syntactically valid strings;
syntactically invalid strings never proceed to semantic analysis.
• Semantics describes the processes a computer follows
when executing a program in the language.
Input character stream -> SYNTAX ANALYZER -> SEMANTIC ANALYZER


Pragmatics
• Pragmatics is the third part of a programming
language. It refers to the practical aspects of how a
language's properties and features may be used to
accomplish various objectives.
• For example, consider the assignment operator (=):
• Syntactically: variable = operand
• Semantically: the variable denotes a memory
address, while the operand denotes the value
assigned to it.
• Pragmatically: in what other ways can this assignment
operator be used? E.g., a += 2; a /= 3;
General Problem of Describing Syntax

• A language is a set of strings of characters from some alphabet.

• The strings of a language are called sentences or statements.

• The syntax rules specify which strings belong to the language.

• The lowest-level syntactic units are known as lexemes.

• Lexemes of a programming language include numeric literals,
operators, and special words, among others.

General Problem of Describing Syntax (cont...)

• Lexemes are partitioned into groups such as
identifiers, keywords, literals, etc.

• A token of a language is a category of its
lexemes.
General Problem of Describing Syntax
(cont...)
• Consider the following statement:

index = 2 * count + 17;

Lexemes Tokens

index identifier
= equal_sign
2 int_literal
* mult_op
count identifier
+ plus_op
17 int_literal
; semicolon
Language Recognizers
• A language can be defined in two ways: by recognition and
by generation.

• For a language L that uses an alphabet Σ of characters, we
need to construct a mechanism R, called a recognition
device.

• The recognition device indicates whether a given string
formed from characters of the alphabet Σ is in the language L
or not.

• The syntax analysis part of a compiler is a recognizer for the
language the compiler translates.
Language Generators
• A generator is a device used to generate the
sentences of a language.

• A generator is of limited usefulness as a
language descriptor, because the particular sentence
it generates at any moment is unpredictable.

• An example of a language recognizer is a finite state
automaton (FSA), and an example of a language
generator is a context-free grammar (CFG).
Formal Methods of Describing Syntax –
Context-Free Grammars
• Two of Chomsky's four classes of
grammars, namely regular grammars and
context-free grammars, are used to describe
the syntax of programming languages.

• Regular grammars are used for describing tokens.

• Context-free grammars are used for describing the
syntax of whole programming languages.
Formal Methods of Describing Syntax –
Backus-Naur Form (BNF)
• John Backus presented a paper describing
ALGOL 58 which introduced a new formal
notation for specifying programming language
syntax.

• Later, Peter Naur slightly modified the notation
proposed by Backus for ALGOL 60. This revised
notation is called Backus-Naur Form (BNF).
BNF - Fundamentals
• A meta-language is a language that is used to describe another
language. BNF is a meta-language for programming languages.

• BNF uses abstractions for syntactic structures. Abstraction names
are enclosed in angle brackets (< >). For example, the
abstraction for an assignment statement can be <assign> and its
definition is as follows:

<assign> -> <var> = <expression>

The text on the left side of the arrow, called the left-hand side (LHS),
is the abstraction being defined. The text to the right of the arrow
is called the right-hand side (RHS); it is the definition of the LHS
and can contain a mixture of tokens, lexemes, and other
abstractions.
BNF – Fundamentals (cont...)
• The LHS and RHS combined are called a rule or
production.

• Example sentence matching the <assign> definition:

total = s1 + s2

• The abstractions in a BNF description, or grammar,
are often called non-terminals, and the lexemes and
tokens of the rules are called terminals.

• A BNF description, or grammar, is a collection of rules.

BNF – Fundamentals (cont...)
• A Java if statement can be described with the
following rules:

<if_stmt> -> if (<logic_expr>) <stmt>
<if_stmt> -> if (<logic_expr>) <stmt> else <stmt>

The above two rules can be combined as follows:

<if_stmt> -> if (<logic_expr>) <stmt>
           | if (<logic_expr>) <stmt> else <stmt>
BNF – Fundamentals (cont...)
• BNF does not contain an ellipsis (...) to represent
variable-length lists. Instead, it uses recursion in
the rules.

• A rule is said to be recursive if its LHS appears in
its RHS, as shown below:

<iden_list> -> identifier
             | identifier, <iden_list>
Grammars and Derivations
• A grammar is a generative device for defining languages.

• Sentences are generated through a sequence of applications
of the rules, beginning with a special non-terminal symbol
known as the start symbol.

• The sequence of rule applications is called a derivation.

• For a programming language, the start symbol often represents
the entire program and is denoted <program>.
Grammars and Derivations (cont...)

Adopted from Concepts of Programming Languages - Sebesta


Grammars and Derivations (cont...)

• A derivation of a program is as follows:


Grammars and Derivations (cont...)
• The symbol => is read as "derives".

• Each of the strings in the derivation, including
<program>, is called a sentential form.

• Derivations in which the leftmost non-terminal is always
the one replaced are known as leftmost
derivations.

• A sentential form consisting only of terminals, or
lexemes, is a generated sentence.
Parse Trees
• Grammars naturally describe the hierarchical structure
of sentences. These hierarchical structures are known
as parse trees.

• Every internal node in a parse tree is a non-terminal
symbol.

• Every leaf node is a terminal symbol.

• Every sub-tree describes one instance of an
abstraction in the sentence.
Parse Trees (cont...)
Parse Trees (cont...)
Ambiguity
• A grammar is said to be ambiguous if a string derived
by using the grammar has more than one parse tree.
Ambiguity (cont...)
Parse trees for the string A = B + C * A
Operator Precedence
• The mechanism that allows the implementation
to choose one operator among several
for evaluation is known as operator precedence.

• Ambiguous grammars make it difficult to choose
one operator over another.

• The general rule is to evaluate first the operator that is
lower in the parse tree.
Operator Precedence (cont...)
Parse trees for the string A = B + C * A

In one parse tree * is lower and in another + is lower. Which one to choose?
Operator Precedence (cont...)
• Correct ordering is specified by using separate non-terminals to
represent the operands of operators that require different
precedence.

• The previous grammar can be re-written (unambiguously) as follows:


Operator Precedence (cont...)
Associativity
• The semantic rule that specifies the order of evaluation among
operators at the same precedence level is known as associativity.

• If the LHS of a rule appears first in its RHS, the grammar is said
to be left recursive.
Associativity (cont...)
• If the LHS of a rule appears last in its RHS, the
grammar is said to be right recursive.

• Left recursion supports left associativity, and
right recursion supports right associativity.
Extended BNF (EBNF)
• Due to shortcomings in BNF, it was extended. The extended
version is known as Extended BNF, or simply EBNF.

• Three extensions are commonly included in the various
versions of EBNF.

• The first extension denotes an optional part in the RHS using
square brackets.

Ex:
<if_stmt> -> if (<expr>) <stmt> [ else <stmt> ]
Extended BNF (EBNF) (cont...)
• The second extension is the use of braces in the
RHS to indicate that the enclosed part can be
repeated indefinitely (zero or more times).

Ex:
<iden_list> -> <identifier> {, <identifier> }
Extended BNF (EBNF) (cont...)
• The third extension deals with multiple-choice options,
using parentheses and the OR operator, |.

Ex:
<term> -> <term> (* | / | %) <factor>

• The brackets, braces, and parentheses are known
as metasymbols.
Extended BNF (EBNF) (cont...)
Attribute Grammars
• An attribute grammar is used to describe
more about the structure of a programming
language.

• An attribute grammar is an extension to a CFG.

• An attribute grammar allows certain language
rules, like type compatibility, to be
conveniently described.
Attribute Grammars – Static Semantics

• Some characteristics of programming languages,
like type compatibility, cannot be specified using BNF.

• Another syntax rule that cannot be specified using BNF is that all
variables must be declared before their use.

• These are examples of static semantic rules. Static
semantics can be checked at compile time.

• An attribute grammar is one of the alternatives for
describing static semantics. It was designed by Knuth.
Attribute Grammars – Basic Concepts
• Attribute grammars are CFGs along with attributes,
attribute computation functions, and predicate functions.

• Attributes are associated with grammar symbols (terminals
and non-terminals) and are similar to variables.

• Attribute computation functions are associated with
grammar rules. They are used to specify how attribute
values are computed.

• Predicate functions, which state the static semantic rules,
are associated with grammar rules.
Attribute Grammars – Definition
• Associated with each grammar symbol X is a set of
attributes A(X).

• The set A(X) consists of two disjoint sets S(X) and I(X),
called synthesized attributes and inherited attributes.

• Synthesized attributes are used to pass semantic
information up the parse tree.

• Inherited attributes pass semantic information down
and across the tree.
Attribute Grammars – Definition (cont...)

• Associated with each grammar rule is a set of semantic functions.

• For a rule X0 -> X1 ... Xn, the synthesized attributes of X0 are
computed with semantic functions of the form S(X0) =
f(A(X1), ..., A(Xn)). So the value of a synthesized attribute on a node
depends only upon the values of the attributes of that node's child
nodes.

• Inherited attributes of the symbols Xj, 1 <= j <= n, are computed with a
semantic function of the form I(Xj) = f(A(X0), ..., A(Xn)). So the value
of an inherited attribute on a node depends on attribute values of that
node's parent node and those of its sibling nodes.
Attribute Grammars – Definition (cont...)

• A predicate function has the form of a
Boolean expression on the union of the
attribute set {A(X0), ..., A(Xn)} and a set of
literal attribute values.

• The only derivations allowed with an attribute
grammar are those in which every predicate
associated with every non-terminal is true.
Intrinsic Attributes
• Intrinsic attributes are synthesized attributes
of leaf nodes whose values are determined
outside the parse tree (e.g., the type of a variable
from the symbol table).

• Given the intrinsic attribute values on a parse
tree, the semantic functions can be used to
compute the remaining attribute values.
Attribute Grammar – Example 1

Adopted from Concepts of Programming Languages - Sebesta

An attribute grammar that describes the rule that the
name at the end of an Ada procedure must match
the procedure's name. (This rule cannot be stated
using BNF.)

Note: Numbers represented as subscripts are used to denote different instances of the
same abstraction.
Attribute Grammar – Example 2

actual_type: Synthesized Attribute

expected_type: Inherited Attribute
Attribute Grammar – Example 2 (cont...)
Attribute Grammar – Example 2 (cont...)
Attribute Grammar – Example 2 (cont...)
Dynamic Semantics
• Dynamic semantics deals with the meaning of
expressions, statements, and program units.

• No universally accepted notation or approach
has been devised for dynamic semantics.
Operational Semantics
• Operational semantics specifies the meaning of a program
in terms of its implementation on a real or virtual machine.

• A change in the state of the machine defines the meaning of
a statement.

• To use operational semantics for a high-level language, a
virtual machine is needed.

• The highest level of operational semantics is known as natural
operational semantics and the lowest level is known as
structural operational semantics.
Operational Semantics - Ex
Operational Semantics - Evaluation
• Advantages:
– May be simple for small examples
– Good if used informally
– Useful for implementation

• Disadvantages:
– Very complex for large programs
– Lacks mathematical rigor

• Uses:
– Vienna Definition Language (VDL) used to define PL/I
– Compiler work
Denotational Semantics
• A formal method for specifying the meaning of programs.
Denotational semantics is based on recursive function theory.

• The key idea is to define a function that maps a program (a
syntactic object) to its meaning (a semantic object).

• The domain of the mapping function is called the syntactic
domain and the range is called the semantic domain.

• The method is named denotational because the mathematical
objects denote the meaning of their corresponding entities.
Denotational vs. Operational
• Denotational semantics is similar to operational
semantics except:
– There is no virtual machine
– Language is mathematics (lambda calculus)

• Difference between denotational and operational
semantics:
– In operational semantics, the state changes are
defined by coded algorithms for a virtual machine
– In denotational semantics, they are defined by
rigorous mathematical functions
Denotational Semantics - Process
• Define a mathematical object for each
language entity.

• Define a function that maps instances of the
language entities onto instances of the
corresponding mathematical objects.
Denotational Semantics – Ex 1
Example: Representing binary strings as decimal numbers

a) Syntax b) Mapping Function Mbin


Denotational Semantics – The State of
a Program
• Denotational semantics is defined in terms of the values of all the variables in
the program.

• If s is a state of a program, then:

s = {<i1, v1>, ..., <in, vn>}

where each i is a variable and v is the corresponding value of that variable.

• VARMAP(ij, s) gives vj, where ij is a variable in state s and vj is the value of
that variable.

• These state changes are used to define the meanings of programs and
program constructs.
Denotational Semantics – Expressions

• Assumptions:
– Only operators are + and *
– An expression can have at most one operator
– Only operands are integer variables and integer
constants
– No parentheses
Denotational Semantics – Expressions (cont...)

• If Z is the set of integers and error is an error value,
then Z ∪ {error} is the semantic domain.
The mapping function Me is:
Denotational Semantics – Evaluation
• Advantages:
– Compact & precise, with solid mathematical foundation
– Can be used to prove the correctness of a program
– Can be an aid to language design

• Disadvantages:
– Requires mathematical knowledge
– Hard for programmer to use

• Uses:
– Semantics for ALGOL 60 and Pascal
– Used in compiler generation and optimization
Axiomatic Semantics
• Axiomatic semantics is based on formal logic (first-order predicate
calculus).

• Axiomatic semantics was originally used for program verification.

• The process is to define axioms or inference rules for each statement
type in the language.

• The logical expressions used are called assertions or predicates,
and state relationships between variables.

• An assertion before a statement is called a pre-condition.

• An assertion following a statement is called a post-condition.


Axiomatic Semantics – Weakest
Preconditions
• The weakest pre-condition is the least restrictive
pre-condition that will guarantee the validity of the
associated post-condition.

Ex:
sum = 2 * x + 1 {sum > 1}

• In the above example, {sum > 1} is the post-condition for
the assignment statement.

• The weakest pre-condition for the above assignment
statement is {x > 0}.
Axiomatic Semantics (cont...)
• An inference rule is a method of inferring the truth of one
assertion on the basis of the values of other assertions.

• The general form of an inference rule is:

S1, S2, ..., Sn
-------------------
S

• The above rule states that if S1, ..., Sn are true, then the truth of S
can be inferred. The top part of a rule is called its antecedent and
the bottom part is called its consequent.

• An axiom is a logical statement that is assumed to be true.
Therefore, an axiom is an inference rule without an antecedent.
Axiomatic Semantics – Assignment
Statements
• The pre-condition and post-condition of an assignment statement
together define precisely its meaning.

• If x = E is a general assignment statement and Q is its
post-condition, then its pre-condition can be given as: P = Qx->E

• Example:
a = b / 2 – 1 { a < 10 }

The weakest pre-condition is computed by substituting b / 2 – 1 for a
in the post-condition:

b / 2 – 1 < 10
b < 22
Axiomatic Semantics – Assignment
Statements (cont...)
• The notation for specifying the axiomatic
semantics of a given statement is {P} S {Q},
where P is the pre-condition, Q is the
post-condition, and S is the statement.

• For an assignment statement, the notation is:

{Qx->E} x = E {Q}
Axiomatic Semantics - Evaluation
• Advantages:
– Can be very abstract
– May be useful in proofs of correctness
– Solid theoretical foundations

• Disadvantages:
– Predicate transforms are hard to define
– Hard to give complete meaning
– Does not suggest implementation

• Uses:
– Semantics of Pascal
– Reasoning about correctness
Describing Semantics - Summary
• Operational Semantics
– Informal descriptions
– Compiler work

• Denotational Semantics
– Formal definitions
– Provably correct implementations

• Axiomatic Semantics
– Reasoning about particular properties
– Proofs of correctness
UNIT – 1

LEXICAL ANALYSIS & PARSING


Reasons for Separating Lexical Analysis
and Syntax Analysis
• Simplicity – Removing the low-level details of lexical
analysis from the syntax analyzer makes the syntax analyzer
both smaller and less complex.

• Efficiency – The lexical analyzer can be optimized on its
own, while the syntax analyzer needs no such optimization,
so separating them improves efficiency.

• Portability – The lexical analyzer may be platform dependent,
while the syntax analyzer can be made platform
independent, which is another reason to separate them.
Lexical Analysis
• A lexical analyzer is essentially a pattern matcher.

• Lexical analyzers extract lexemes from a given string and produce the
corresponding tokens.

• Nowadays lexical analyzers are subprograms of syntax analyzers:
the lexical analyzer returns a single token on each call.

• The lexical analysis process includes skipping comments and white space, as
they are not useful to the parser.

• The lexical analyzer inserts lexemes for user-defined names into the symbol table.

• The lexical analyzer detects syntactic errors in tokens.

Lexical Analysis (cont...)
• There are three approaches for building a lexical
analyzer:
– Using a tool such as LEX (in UNIX) which takes token
patterns as input and generates a lexical analyzer.
– Designing a state transition diagram for token
patterns and building a program that implements the
diagram.
– Designing a state transition diagram for token
patterns and build a table-driven implementation of
the diagram.
Lexical Analysis (cont...)
• A state transition diagram is a directed graph. The nodes
are labelled with state names and the arcs are labelled with
the input characters that cause transitions among states.

• State diagrams used for lexical analysis are representations
of a class of mathematical machines called finite automata.

• Finite automata are used to recognize members of a class of
languages called regular languages.

• The tokens of a programming language form a regular
language, and a lexical analyzer is a finite automaton.
Lexical Analysis (cont...)
State diagram for arithmetic statements:
Parsing
• The process of analyzing syntax, referred to
as syntax analysis, is often called parsing.

• Parsers for programming languages construct
parse trees for given programs.

• Goals of syntax analysis:

– To check whether the input program is syntactically
correct.
– To produce a complete parse tree.
Parsing (cont...)
• Parsers are categorized based upon how they
build the parse trees:
– In top-down parsing, the tree is built from the root
downward to the leaves.
– In bottom-up parsing, the tree is built from the leaves
upward to the root.
Top-down Parsers
• A top-down parser builds a parse tree in
pre-order. A pre-order traversal begins with the
root.

• Each node is visited before its branches are
followed. Branches from a particular node are
followed in left-to-right order. This corresponds to
a leftmost derivation.

• The parser's task is to find the next sentential form in the
leftmost derivation.
Top-down Parsers (cont...)
• The general form of a left sentential form is xAα,
where x is a string of terminal symbols, A is the
leftmost non-terminal, and α is a mixed string of
terminals and non-terminals.

• If the rules for the non-terminal A are A -> bB, A -> cBb,
and A -> a, the parsing decision problem is to
choose the correct production for A.

• In general, a top-down parser scans the next token
and, based on that, chooses the correct
production to apply for A.
Top-down Parsers (cont...)
• A recursive descent parser is a coded version of a
syntax analyzer based directly on the BNF
description of the syntax of the language.

• An alternative to a recursive descent parser is to use a
parsing table to implement the BNF rules. Both of
these are called LL algorithms.

• The first L in LL specifies a left-to-right scan of the
input and the second L specifies that a leftmost
derivation is generated.
Bottom-Up Parsers
• A bottom-up parser constructs a parse tree by beginning at the
leaves and moving towards the root.

• This parse order corresponds to the reverse of a rightmost
derivation.

• Given a sentential form α, the parser must determine what
substring of α is the RHS of the rule in the grammar that must be
reduced to its LHS to obtain the previous sentential form in the
rightmost derivation.

• There may be more than one RHS in the rules that matches a
substring of the sentential form. The correct RHS is called the
handle.
Bottom-Up Parsers (cont...)
• Consider the following grammar:
S -> aAc
A -> aA | b

• The rightmost derivation for the string aabc is:

S => aAc => aaAc => aabc

• In this example, when we start with aabc, the handle is easy to
find, as only one RHS matches the sentential form: A -> b.
So b is reduced to A and we arrive at the sentential form aaAc.

• The most common bottom-up parsing algorithms are in the LR family,
where L specifies a left-to-right scan and R specifies that a
rightmost derivation is generated.
Complexity of Parsing
• Parsing algorithms that work for any unambiguous grammar are
complicated and inefficient.

• The complexity of such algorithms is O(n³); that is, the time
they take is proportional to the cube of the length of the string to be parsed.

• This large amount of time is required because the
algorithms must frequently backtrack and rebuild parts of the tree
when a wrong rule choice is discovered.

• All algorithms used in commercial compilers have
complexity O(n).
Recursive Descent Parser
• A recursive descent parser is so named because it
consists of subprograms that are recursive in nature,
and it produces a parse tree in top-down order.

• A recursive descent parser has a subprogram for each
non-terminal.

• A recursive descent parser starts with the start
symbol and applies rules, replacing the leftmost
non-terminal, until it encounters the end-of-input
symbol or an error.
Recursive Descent Parser - Example

Example grammar:

E -> iE’
E’ -> +iE’ | ε
Recursive Descent Parser – Example (cont...)

E( )
{
    if (l == 'i')
    {
        match('i');
        E'( );
    }
}

E'( )
{
    if (l == '+')
    {
        match('+');
        match('i');
        E'( );
    }
    else
        return;
}

match(char t)
{
    if (l == t)
        l = getchar();
    else
        printf("error");
}

main( )
{
    E( );
    if (l == '$')
        printf("Parsing done!");
}
Recursive Descent Parser – Example (cont...)

Example string: i + i $

The parser builds the parse tree top-down: E expands to i E',
that E' expands to + i E', and the final E' expands to ε.