University of Wales Swansea
Department of Computer Science
Compilers
Course notes for module CS 218
Dr. Matt Poole 2002, edited by Mr. Christopher Whyley, 2nd Semester
2006/2007
wwwcompsci.swan.ac.uk/~cschris/compilers
1 Introduction
1.1 Compilation
Definition. Compilation is a process that translates a program in one language (the source
language) into an equivalent program in another language (the object or target language).

[Diagram: source program → Compiler → target program, with error messages as a side output.]
An important part of any compiler is the detection and reporting of errors; this will be
discussed in more detail later in the introduction. Commonly, the source language is a
high-level programming language (i.e. a problem-oriented language), and the target language
is a machine language or assembly language (i.e. a machine-oriented language). Thus
compilation is a fundamental concept in the production of software: it is the link between
the (abstract) world of application development and the low-level world of application
execution on machines.
Types of Translators. An assembler is also a type of translator:

[Diagram: assembly program → Assembler → machine program.]
An interpreter is closely related to a compiler, but takes both source program and input data.
The translation and execution phases of the source program are one and the same.

[Diagram: source program and input data → Interpreter → output data.]
Although the above types of translator are the most well-known, we also need knowledge of
compilation techniques to deal with the recognition and translation of many other types of
languages including:
• Commandline interface languages;
• Typesetting / word processing languages (e.g. TeX);
• Natural languages;
• Hardware description languages;
• Page description languages (e.g. PostScript);
• Setup or parameter ﬁles.
Early Development of Compilers.

1940’s. Early stored-program computers were programmed in machine language. Later,
assembly languages were developed where machine instructions and memory locations were given
symbolic forms.

1950’s. Early high-level languages were developed, for example FORTRAN. Although more
problem-oriented than assembly languages, the first versions of FORTRAN still had many
machine-dependent features. Techniques and processes involved in compilation were not well
understood at this time, and compiler-writing was a huge task: e.g. the first FORTRAN
compiler took 18 man-years of effort to write.
Chomsky’s study of the structure of natural languages led to a classification of languages
according to the complexity of their grammars. The context-free languages proved to be useful
in describing the syntax of programming languages.
1960’s onwards. The study of the parsing problem for context-free languages during the 1960’s
and 1970’s led to efficient algorithms for the recognition of context-free languages. These
algorithms, and associated software tools, are central to compiler construction today.
Similarly, the theory of finite state machines and regular expressions (which correspond to
Chomsky’s regular languages) has proven useful for describing the lexical structure of
programming languages.

From Algol 60, high-level languages have become more problem-oriented and machine-independent,
with features far removed from the machine languages into which they are compiled.
The theory and tools available today make compiler construction a manageable task, even for
complex languages. For example, your compiler assignment will take only a few weeks
(hopefully) and will only be about 1000 lines of code (although, admittedly, the source
language is small).
1.2 The Context of a Compiler
The complete process of compilation is illustrated as:

    skeletal source program
        | preprocessor
    source program
        | compiler
    assembly program
        | assembler
    relocatable m/c code
        | link/load editor
    absolute m/c code
1.2.1 Preprocessors
Preprocessing performs (usually simple) operations on the source file(s) prior to compilation.
Typical preprocessing operations include:

(a) Expanding macros (shorthand notations for longer constructs). For example, in C,

#define foo(x,y) (3*x+y*(2+x))

defines a macro foo that, when used later in the program, is expanded by the
preprocessor. For example, a = foo(a,b) becomes

a = (3*a+b*(2+a))

(b) Inserting named files. For example, in C,

#include "header.h"

is replaced by the contents of the file header.h
1.2.2 Linkers
A linker combines object code (machine code that has not yet been linked) produced from
compiling and assembling many source programs, as well as standard library functions and
resources supplied by the operating system. This involves resolving references in each object
ﬁle to external variables and procedures declared in other ﬁles.
1.2.3 Loaders
Compilers, assemblers and linkers usually produce code whose memory references are made
relative to an undetermined starting location that can be anywhere in memory (relocatable
machine code). A loader calculates appropriate absolute addresses for these memory locations
and amends the code to use these addresses.
1.3 The Phases of a Compiler
The process of compilation is split up into six phases, each of which interacts with a symbol
table manager and an error handler. This is called the analysis/synthesis model of compilation.
There are many variants on this model, but the essential elements are the same.
[Diagram: the source program passes in turn through the lexical analyser, syntax analyser,
semantic analyser, intermediate code generator, code optimizer and code generator, producing
the target program; every phase interacts with the symbol-table manager and the error
handler.]
1.3.1 Lexical Analysis
A lexical analyser or scanner is a program that groups sequences of characters into lexemes,
and outputs (to the syntax analyser) a sequence of tokens. Here:

(a) Tokens are symbolic names for the entities that make up the text of the program;
e.g. if for the keyword if, and id for any identifier. These make up the output of
the lexical analyser.

(b) A pattern is a rule that specifies when a sequence of characters from the input
constitutes a token; e.g. the sequence i, f for the token if, and any sequence of
alphanumerics starting with a letter for the token id.

(c) A lexeme is a sequence of characters from the input that matches a pattern (and hence
constitutes an instance of a token); for example, if matches the pattern for if, and
foo123bar matches the pattern for id.
For example, the following code might result in the table given below.
program foo(input,output);var x:integer;begin
readln(x);writeln(’value read =’,x) end.
Lexeme            Token                     Pattern
program           program                   p, r, o, g, r, a, m
(whitespace)      —                         newlines, spaces, tabs
foo               id (foo)                  letter followed by seq. of alphanumerics
(                 leftpar                   a left parenthesis
input             input                     i, n, p, u, t
,                 comma                     a comma
output            output                    o, u, t, p, u, t
)                 rightpar                  a right parenthesis
;                 semicolon                 a semicolon
var               var                       v, a, r
x                 id (x)                    letter followed by seq. of alphanumerics
:                 colon                     a colon
integer           integer                   i, n, t, e, g, e, r
;                 semicolon                 a semicolon
begin             begin                     b, e, g, i, n
(whitespace)      —                         newlines, spaces, tabs
readln            readln                    r, e, a, d, l, n
(                 leftpar                   a left parenthesis
x                 id (x)                    letter followed by seq. of alphanumerics
)                 rightpar                  a right parenthesis
;                 semicolon                 a semicolon
writeln           writeln                   w, r, i, t, e, l, n
(                 leftpar                   a left parenthesis
'value read ='    literal ('value read =')  seq. of chars enclosed in quotes
,                 comma                     a comma
x                 id (x)                    letter followed by seq. of alphanumerics
)                 rightpar                  a right parenthesis
(whitespace)      —                         newlines, spaces, tabs
end               end                       e, n, d
.                 fullstop                  a fullstop
It is the sequence of tokens in the middle column that is passed as output to the syntax
analyser.

This token sequence represents almost all the important information from the input program
required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often
important in separating lexemes, is usually not returned as a token. Also, when outputting an
id or literal token, the lexical analyser must also return the value of the matched lexeme
(shown in parentheses), or else this information would be lost.
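The grouping of characters into lexemes described above can be sketched as a small scanner. The sketch below is in Python rather than the course's implementation language, and the pattern list is a hypothetical, partial selection of the tokens from the table; it is an illustration of the idea, not a full lexical analyser.

```python
import re

# Illustrative token patterns, tried in order (keywords before id).
TOKEN_PATTERNS = [
    ("program",   r"program\b"),
    ("var",       r"var\b"),
    ("begin",     r"begin\b"),
    ("end",       r"end\b"),
    ("integer",   r"integer\b"),
    ("literal",   r"'[^']*'"),               # seq. of chars enclosed in quotes
    ("id",        r"[A-Za-z][A-Za-z0-9]*"),  # letter then alphanumerics
    ("leftpar",   r"\("),
    ("rightpar",  r"\)"),
    ("comma",     r","),
    ("semicolon", r";"),
    ("colon",     r":"),
    ("fullstop",  r"\."),
    (None,        r"[ \t\n]+"),              # whitespace: matched, not returned
]

def scan(text):
    """Group characters into lexemes and output (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TOKEN_PATTERNS:
            m = re.match(pattern, text[pos:])
            if m:
                if name is not None:         # skip whitespace lexemes
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"lexical error at position {pos}")
    return tokens

print(scan("var x:integer;"))
# [('var', 'var'), ('id', 'x'), ('colon', ':'), ('integer', 'integer'), ('semicolon', ';')]
```

Note how keywords are tried before the id pattern, so that var is returned as the token var rather than as an identifier, and how whitespace is consumed without producing a token, exactly as in the table above.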
1.3.2 Symbol Table Management
A symbol table is a data structure containing all the identifiers (i.e. names of variables,
procedures etc.) of a source program together with all the attributes of each identifier.
For variables, typical attributes include:
• its type,
• how much memory it occupies,
• its scope.
For procedures and functions, typical attributes include:
• the number and type of each argument (if any),
• the method of passing each argument, and
• the type of value returned (if any).
The purpose of the symbol table is to provide quick and uniform access to identifier
attributes throughout the compilation process. Information is usually put into the symbol
table during the lexical analysis and/or syntax analysis phases.
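A minimal symbol table can be sketched as a map from identifiers to attribute records. The Python sketch below is illustrative only; the class and field names are assumptions, not part of the course material.

```python
# A minimal symbol table sketch: one entry per identifier, each holding
# the attributes discussed above. Names and fields here are illustrative.

class SymbolTable:
    def __init__(self):
        self._entries = {}

    def declare(self, name, **attributes):
        """Insert an identifier with its attributes (type, size, scope, ...)."""
        if name in self._entries:
            raise KeyError(f"'{name}' already declared")
        self._entries[name] = dict(attributes)

    def lookup(self, name):
        """Uniform, quick access to an identifier's attributes."""
        return self._entries[name]

table = SymbolTable()
# Filled in during lexical and/or syntax analysis:
table.declare("x", type="integer", size_bytes=4, scope="global")
table.declare("foo", kind="program", args=[])
print(table.lookup("x")["type"])   # integer
```

A hash map gives the "quick and uniform access" required; real compilers extend this with nested scopes, one table (or table level) per block.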
1.3.3 Syntax Analysis
A syntax analyser or parser is a program that groups sequences of tokens from the lexical
analysis phase into phrases each with an associated phrase type.
A phrase is a logical unit with respect to the rules of the source language. For example, consider:
a := x * y + z
After lexical analysis, this statement has the structure

id1 assign id2 binop1 id3 binop2 id4
Now, a syntactic rule of Pascal is that there are objects called ‘expressions’ for which the
rules are (essentially):

(1) Any constant or identifier is an expression.

(2) If exp1 and exp2 are expressions then so is exp1 binop exp2.
Taking all the identifiers to be variable names for simplicity, we have:

• By rule (1), exp1 = id2 and exp2 = id3 are both phrases with phrase type ‘expression’;

• by rule (2), exp3 = exp1 binop1 exp2 is also a phrase with phrase type ‘expression’;

• by rule (1), exp4 = id4 is a phrase with phrase type ‘expression’;

• by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type ‘expression’.
Of course, Pascal also has a rule that says:
id assign exp
is a phrase with phrase type ‘assignment’, and so the Pascal statement above is a phrase of
type ‘assignment’.
Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse
tree or a syntax tree. A parse tree is a tree that illustrates the grouping of tokens into
phrases. A syntax tree is a compacted form of parse tree in which the operators appear as the
interior nodes. The construction of a parse tree is a basic activity in compiler-writing.
A parse tree for the example Pascal statement is (children shown indented below their
parents):

    assignment
      id1
      assign
      exp5
        exp3
          exp1
            id2
          binop1
          exp2
            id3
        binop2
        exp4
          id4
and a syntax tree is:

    assign
      id1
      binop2
        binop1
          id2
          id3
        id4
Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary.
The main criterion is whether the analyser needs recursion or not:

• lexical analysers hardly ever use recursion; they are sometimes called linear analysers
since they scan the input in a ‘straight line’ (from left to right);

• syntax analysers almost always use recursion; this is because phrase types are often defined
in terms of themselves (cf. the phrase type ‘expression’ above).
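The recursive character of syntax analysis can be illustrated with a small recursive-descent sketch of the ‘expression’ rules above. This is illustrative Python, not the course's implementation language; it takes (token, lexeme) pairs as produced by a scanner, builds the syntax tree as nested tuples, and deliberately ignores operator precedence.

```python
# A sketch of a syntax analyser for the 'expression' rules above:
#   (1) any identifier is an expression;
#   (2) exp1 binop exp2 is an expression.
# Tokens are (token, lexeme) pairs; the tree is built as nested tuples.

def parse_expression(tokens, pos=0):
    """Parse  id (binop id)*  left to right, returning (tree, next_pos)."""
    tree, pos = parse_operand(tokens, pos)
    while pos < len(tokens) and tokens[pos][0] == "binop":
        op = tokens[pos][1]
        right, pos = parse_operand(tokens, pos + 1)
        tree = ("binop", op, tree, right)   # rule (2): exp binop exp
    return tree, pos

def parse_operand(tokens, pos):
    token, lexeme = tokens[pos]
    if token != "id":
        raise SyntaxError(f"expected id, found {token}")
    return ("id", lexeme), pos + 1          # rule (1): id is an expression

# x * y + z  (this sketch associates left to right, ignoring precedence)
tokens = [("id", "x"), ("binop", "*"), ("id", "y"), ("binop", "+"), ("id", "z")]
tree, _ = parse_expression(tokens)
print(tree)
# ('binop', '+', ('binop', '*', ('id', 'x'), ('id', 'y')), ('id', 'z'))
```

For this particular input, left-to-right grouping happens to give the same tree as the precedence-respecting syntax tree shown above; a real parser would encode precedence in the grammar.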
1.3.4 Semantic Analysis
A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree
and a symbol table. Its purpose is to determine if the input has a well-defined meaning; in
practice, semantic analysers are mainly concerned with type checking and type coercion based
on type rules. Typical type rules for expressions and assignments are:
Expression Type Rules. Let exp be an expression.

(a) If exp is a constant then exp is well-typed and its type is the type of the constant.

(b) If exp is a variable then exp is well-typed and its type is the type of the variable.

(c) If exp is an operator applied to further subexpressions such that:

    (i) the operator is applied to the correct number of subexpressions,
    (ii) each subexpression is well-typed, and
    (iii) each subexpression is of an appropriate type,

then exp is well-typed and its type is the result type of the operator.
Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed
expression of type T2. If

(a) T1 = T2, and

(b) T1 is an assignable type,

then var assign exp is a well-typed assignment.
For example, consider the following code fragment:

intvar := intvar + realarray

where intvar is stored in the symbol table as being an integer variable, and realarray as
an array of reals. In Pascal this assignment is syntactically correct, but semantically
incorrect since + is only defined on numbers, whereas its second argument is an array. The
semantic analyser checks for such type errors using the parse tree, the symbol table and the
type rules.
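The expression and assignment type rules above can be sketched as a recursive check over the tuple-form syntax tree, consulting a symbol table for variable types. The Python below is an illustration; the operator-signature table and all names are assumptions made for the sketch.

```python
# A sketch of the type rules above applied to a tuple-form syntax tree.
# Variable types come from a (simplified) symbol table; the operator
# signatures are illustrative assumptions.

SYMBOLS = {"intvar": "integer", "realarray": "array of real"}

# Argument and result types for each operator (rule (c)):
OPERATORS = {"+": (("integer", "integer"), "integer")}

def type_of(exp):
    """Return the type of a well-typed expression, or raise a type error."""
    if exp[0] == "id":                     # rule (b): variables
        return SYMBOLS[exp[1]]
    _, op, left, right = exp               # rule (c): operator application
    arg_types, result = OPERATORS[op]
    actual = (type_of(left), type_of(right))
    if actual != arg_types:
        raise TypeError(f"{op} applied to {actual}")
    return result

def check_assignment(var, exp):
    """Assignment rule: the two types must agree."""
    if SYMBOLS[var] != type_of(exp):
        raise TypeError("type mismatch in assignment")

# intvar := intvar + realarray  -- semantically incorrect:
tree = ("binop", "+", ("id", "intvar"), ("id", "realarray"))
try:
    check_assignment("intvar", tree)
except TypeError as err:
    print("semantic error:", err)
# semantic error: + applied to ('integer', 'array of real')
```

The sketch omits rule (a) (constants), coercions, and the assignability check in rule (b) of the assignment rules, but shows the essential recursive structure.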
1.3.5 Error Handling
Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On
detecting an error the compiler must:
• report the error in a helpful way,
• correct the error if possible, and
• continue processing (if possible) after the error to look for further errors.
Types of Error. Errors are either syntactic or semantic.

Syntax errors are errors in the program text; they may be either lexical or grammatical:

(a) A lexical error is a mistake in a lexeme, for example, typing tehn instead of then,
or missing off one of the quotes in a literal.

(b) A grammatical error is one that violates the (grammatical) rules of the language,
for example if x = 7 y := 4 (missing then).

Semantic errors are mistakes concerning the meaning of a program construct; they may be
either type errors, logical errors or run-time errors:

(a) Type errors occur when an operator is applied to an argument of the wrong type,
or to the wrong number of arguments.
(b) Logical errors occur when a badly conceived program is executed, for example:
while x = y do ... when x and y initially have the same value and the body of the
loop need not change the value of either x or y.

(c) Run-time errors are errors that can be detected only when the program is executed,
for example:

var x : real; readln(x); writeln(1/x)

which would produce a run-time error if the user input 0.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful
way). If possible, the compiler should make the appropriate correction(s). Semantic errors are
much harder and sometimes impossible for a computer to detect.
1.3.6 Intermediate Code Generation
After the analysis phases of the compiler have been completed, a source program has been
decomposed into a symbol table and a parse tree, both of which may have been modified by
the semantic analyser. From this information we begin the process of generating object code
according to either of two approaches:

(1) generate code for a specific machine, or

(2) generate code for a ‘general’ or abstract machine, then use further translators to
turn the abstract code into code for specific machines.

Approach (2) is more modular and efficient provided the abstract machine language is simple
enough to:

(a) produce and analyse (in the optimisation phase), and

(b) be easily translated into the required language(s).
One of the most widely used intermediate languages is Three-Address Code (TAC).

TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some
common TAC instructions include:
(i) var1 := var2 binop var3

(ii) var1 := unop var2

(iii) var1 := num

(iv) goto label

(v) if var1 relop var2 goto label
There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we
will use only the above for the following discussion.
Syntax-Directed Code Generation. In essence, code is generated by recursively walking
through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code
generation. For example, consider the code fragment:

z := x * y + x

and its syntax tree (with lexemes replacing tokens, children indented below their parents):

    :=
      z
      +
        *
          x
          y
        x
We use this tree to direct the compilation into TAC as follows.

At the root of the tree we see an assignment whose right-hand side is an expression, and
this expression is the sum of two quantities. Assume that we can produce TAC code that
computes the value of the first and second summands and stores these values in temp1 and
temp2 respectively. Then the appropriate TAC for the assignment statement is just

z := temp1 + temp2

Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive
way.
For temp1 we see that it is the product of two quantities. Assume that we can produce TAC
code that computes the value of the first and second multiplicands and stores these values in
temp3 and temp4 respectively. Then the appropriate TAC for computing temp1 is

temp1 := temp3 * temp4
Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus
the TAC code
temp3 := x
is sufficient. Next we come to temp4 and, as with temp3, the appropriate code is
temp4 := y
Finally, considering temp2, of course
temp2 := x
suﬃces.
Each code fragment is output when we leave the corresponding node; this results in the final
program:
temp3 := x
temp4 := y
temp1 := temp3 * temp4
temp2 := x
z := temp1 + temp2
Comment. Notice how a compound expression has been broken down and translated into a
sequence of very simple instructions, and furthermore, the process of producing the TAC code
was uniform and simple. Some redundancy has been introduced into the TAC code, but this can
be removed (along with redundancy that is not due to the TAC generation) in the optimisation
phase.
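The recursive walk just described can be sketched directly. The Python below is illustrative: the tuple tree form and function names are assumptions, temporaries are simply numbered in the order they are created (so the numbering differs from the walkthrough above), and the final sum passes through one extra temporary before being copied to z.

```python
# A sketch of syntax-directed TAC generation: walk the syntax tree
# recursively, emit code for the subtrees, then emit the instruction
# combining their results.

def generate(tree, code, counter):
    """Return the name holding tree's value, appending TAC to `code`."""
    if tree[0] == "id":
        counter[0] += 1
        temp = f"temp{counter[0]}"
        code.append(f"{temp} := {tree[1]}")
        return temp
    _, op, left, right = tree               # ("binop", op, left, right)
    l = generate(left, code, counter)
    r = generate(right, code, counter)
    counter[0] += 1
    temp = f"temp{counter[0]}"
    code.append(f"{temp} := {l} {op} {r}")
    return temp

def assign(var, tree):
    code = []
    result = generate(tree, code, [0])
    code.append(f"{var} := {result}")
    return code

# z := x * y + x
tree = ("binop", "+", ("binop", "*", ("id", "x"), ("id", "y")), ("id", "x"))
for line in assign("z", tree):
    print(line)
# temp1 := x
# temp2 := y
# temp3 := temp1 * temp2
# temp4 := x
# temp5 := temp3 + temp4
# z := temp5
```

The redundant copy through temp5 (and the duplicated load of x) is exactly the sort of redundancy the optimisation phase removes.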
1.3.7 Code Optimisation
An optimiser attempts to improve the time and space requirements of a program. There are
many ways in which code can be optimised, but most are expensive in terms of time and space
to implement.
Common optimisations include:

• removing redundant identifiers,

• removing unreachable sections of code,

• identifying common subexpressions,

• unfolding loops, and

• eliminating procedures.
Note that here we are concerned with the general optimisation of abstract code.
Example. Consider the TAC code:
temp1 := x
temp2 := temp1
if temp1 = temp2 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp2 + temp3
Removing the redundant identifier (just temp2) gives
temp1 := x
if temp1 = temp1 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp1 + temp3
Removing redundant code gives
temp1 := x
200 temp3 := z
300 temp4 := temp1 + temp3
Notes. Attempting to find a ‘best’ optimisation is expensive for the following reasons:

• A given optimisation technique may have to be applied repeatedly until no further
optimisation can be obtained. (For example, removing one redundant identifier may introduce
another.)

• A given optimisation technique may give rise to other forms of redundancy, and thus
sequences of optimisation techniques may have to be repeated. (For example, above
we removed a redundant identifier and this gave rise to redundant code, but removing
redundant code may lead to further redundant identifiers.)

• The order in which optimisations are applied may be significant. (How many ways are
there of applying n optimisation techniques to a given piece of code?)
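The first kind of optimisation in the example above, removing a redundant identifier that is a mere copy of another name, can be sketched on the string form of TAC. This Python sketch is an assumption-level illustration: it handles only straight-line code (no labels or jumps), and repeats until no further optimisation can be obtained, as discussed in the notes above.

```python
# A sketch of removing redundant identifiers: a temporary defined by a
# plain copy  t := v  is replaced by v at its uses and the copy dropped.
# Straight-line code only; labels and jumps are not handled.

import re

def remove_redundant_copies(code):
    """Repeatedly fold away  temp := name  copies until none remain."""
    changed = True
    while changed:                    # repeat until no further optimisation
        changed = False
        for line in list(code):
            m = re.fullmatch(r"(temp\d+) := (\w+)", line)
            if not m:
                continue
            temp, source = m.groups()
            rest = [l for l in code if l is not line]
            new_rest = [re.sub(rf"\b{temp}\b", source, l) for l in rest]
            # Drop the copy only if the temporary no longer appears anywhere.
            if all(temp not in l for l in new_rest):
                code[:] = new_rest
                changed = True
                break
    return code

code = ["temp1 := x",
        "temp2 := temp1",
        "temp3 := temp1 * y",
        "temp4 := temp2 + temp3"]
print(remove_redundant_copies(code))
# ['temp3 := x * y', 'temp4 := x + temp3']
```

Note how removing temp2 exposes temp1 as a further redundant copy, illustrating why such passes must be iterated.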
1.3.8 Code Generation
The final phase of the compiler is to generate code for a specific machine. In this phase we
consider:

• memory management,

• register assignment, and

• machine-specific optimisation.
The output from this phase is usually assembly language or relocatable machine code.
Example. The TAC code above could typically result in the ARM assembly program shown
below. Note that the example illustrates a mechanical translation of TAC into ARM; it is not
intended to illustrate compact ARM programming!
.x EQUD 0 four bytes for x
.z EQUD 0 four bytes for z
.temp EQUD 0 four bytes each for temp1,
EQUD 0 temp3, and
EQUD 0 temp4.
.prog MOV R12,#temp R12 = base address
MOV R0,#x R0 = address of x
LDR R1,[R0] R1 = value of x
STR R1,[R12] store R1 at R12
MOV R0,#z R0 = address of z
LDR R1,[R0] R1 = value of z
STR R1,[R12,#4] store R1 at R12+4
LDR R1,[R12] R1 = value of temp1
LDR R2,[R12,#4] R2 = value of temp3
ADD R3,R1,R2 add temp1 to temp3
STR R3,[R12,#8] store R3 at R12+8
2 Languages
In this section we introduce the formal notion of a language, and the basic problem of recognising
strings from a language. These are central concepts that we will use throughout the remainder
of the course.
Note. This section contains mainly theoretical definitions; the lectures will cover examples
and diagrams illustrating the theory.
2.1 Basic Definitions

• An alphabet Σ is a finite non-empty set (of symbols).

• A string or word over an alphabet Σ is a finite concatenation (or juxtaposition) of symbols
from Σ.

• The length of a string w (that is, the number of characters comprising it) is denoted |w|.

• The empty or null string is denoted ε. (That is, ε is the unique string satisfying |ε| = 0.)

• The set of all strings over Σ is denoted Σ*.

• For each n ≥ 0 we define Σ^n = { w ∈ Σ* | |w| = n }.

• We define Σ+ = ⋃_{n≥1} Σ^n. (Thus Σ* = Σ+ ∪ {ε}.)

• For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention
that x^0 denotes ε.

• A language over Σ is a set L ⊆ Σ*.

• Two languages L1 and L2 over a common alphabet Σ are equal if they are equal as sets.
Thus L1 = L2 if, and only if, L1 ⊆ L2 and L2 ⊆ L1.
2.2 Decidability
Given a language L over some alphabet Σ, a basic question is: for each possible word w ∈ Σ*,
can we effectively decide if w is a member of L or not? We call this the decision problem for
L.

Note the use of the word ‘effectively’: this implies the mechanism by which we decide on
membership (or non-membership) must be a finitistic, deterministic and mechanical procedure
that can be carried out by some form of computing agent. Also note that the decision problem
asks if a given word is a member of L or not; that is, it is not sufficient to be able to
decide only when words are members of L.
More precisely then, a language L ⊆ Σ* is said to be decidable if there exists an algorithm
such that for every w ∈ Σ*

(1) the algorithm terminates with output ‘Yes’ when w ∈ L, and

(2) the algorithm terminates with output ‘No’ when w ∉ L.
If no such algorithm exists then L is said to be undecidable.
Note. Decidability is based on the notion of an ‘algorithm’. In standard theoretical computer
science this is taken to mean a Turing Machine; this is an abstract, but extremely low-level,
model of computation that is equivalent to a digital computer with an infinite memory. Thus it
is sufficient in practice to use a more convenient model of computation, such as Pascal
programs, provided that any decidability arguments we make assume an infinite memory.
Example. Let Σ = {0, 1} be an alphabet. Let L be the (infinite) language

L = { w ∈ Σ* | w = 0^n 1 for some n }.
Does the program below solve the decision problem for L?
read( char );
if char = END_OF_STRING then
print( "No" )
else /* char must be ’0’ or ’1’ */
while char = ’0’ do
read( char )
od; /* char must be ’1’ or END_OF_STRING */
if char = ’1’ then
print( "Yes" )
else
print( "No" )
fi
fi
Answer: ....
2.3 Basic Facts
(1) Every finite language is decidable. (Hence every undecidable language is infinite.)

(2) Not every infinite language is undecidable.

(3) Programming languages are (usually) infinite but (always) decidable. (Why?)
2.4 Applications to Compilation
Languages may be classified by the means by which they are defined. Of interest to us are
regular languages and context-free languages.

Regular Languages. The significant aspects of regular languages are:

• they are defined by ‘patterns’ called regular expressions;

• every regular language is decidable;

• the decision problem for any regular language is solved by a deterministic finite state
automaton (DFA); and

• programming languages’ lexical patterns are specified using regular expressions, and
lexical analysers are (essentially) DFAs.
Regular languages and their relationship to lexical analysis are the subjects of the next section.
Context-Free Languages. The significant aspects of context-free languages are:

• they are defined by ‘rules’ called context-free grammars;

• every context-free language is decidable;

• the decision problem for any context-free language of interest to us is solved by a
deterministic pushdown automaton (DPDA); and

• programming language syntax is specified using context-free grammars, and (most) parsers
are (essentially) DPDAs.

Context-free languages and their relationship to syntax analysis are the subjects of Sections
4 and 5.
3 Lexical Analysis
In this section we study some theoretical concepts with respect to the class of regular
languages and apply these concepts to the practical problem of lexical analysis. Firstly, in
Section 3.1, we define the notion of a regular expression and show how regular expressions
determine regular languages. We then, in Section 3.2, introduce deterministic finite automata
(DFAs), the class of algorithms that solve the decision problems for regular languages. We
show how regular expressions and DFAs can be used to specify and implement lexical analysers
in Section 3.3, and in Section 3.4 we take a brief look at Lex, a popular lexical-analyser
generator built upon the theory of regular expressions and DFAs.

Note. This section contains mainly theoretical definitions; the lectures will cover examples
and diagrams illustrating the theory.
3.1 Regular Expressions
Recall from the Introduction that a lexical analyser uses pattern matching with respect to
rules associated with the source language’s tokens. For example, the token then is associated
with the pattern t, h, e, n, and the token id might be associated with the pattern “an
alphabetic character followed by any number of alphanumeric characters”. The notation of
regular expressions is a mathematical formalism ideal for expressing patterns such as these,
and thus ideal for expressing the lexical structure of programming languages.
3.1.1 Definition

Regular expressions represent patterns of strings of symbols. A regular expression r matches
a set of strings over an alphabet. This set is denoted L(r) and is called the language
determined or generated by r.

Let Σ be an alphabet. We define the set RE(Σ) of regular expressions over Σ, the strings they
match and thus the languages they determine, as follows:
• ∅ ∈ RE(Σ) matches no strings. The language determined is L(∅) = ∅.

• ε ∈ RE(Σ) matches only the empty string. Therefore L(ε) = {ε}.

• If a ∈ Σ then a ∈ RE(Σ) matches the string a. Therefore L(a) = {a}.

• If r and s are in RE(Σ) and determine the languages L(r) and L(s) respectively, then

  – r|s ∈ RE(Σ) matches all strings matched either by r or by s. Therefore, L(r|s) =
    L(r) ∪ L(s).

  – rs ∈ RE(Σ) matches any string that is the concatenation of two strings, the first
    matching r and the second matching s. Therefore, the language determined is

    L(rs) = L(r)L(s) = { uv ∈ Σ* | u ∈ L(r) and v ∈ L(s) }.

    (Given two sets S1 and S2 of strings, the notation S1S2 denotes the set of all strings
    formed by appending members of S1 with members of S2.)

  – r* ∈ RE(Σ) matches all finite concatenations of strings which all match r. The
    language determined is thus

    L(r*) = (L(r))* = ⋃_{i∈N} (L(r))^i = {ε} ∪ L(r) ∪ L(r)L(r) ∪ ...
3.1.2 Regular Languages
Let L be a language over Σ. L is said to be a regular language if L = L(r) for some r ∈ RE(Σ).
3.1.3 Notation
• We need to use parentheses to override the convention concerning the precedence of the
operators. The normal convention is: * is higher than concatenation, which is higher
than |. Thus, for example, a|bc* is a|(b(c*)).

• We write r+ for rr*.

• We write r? for ε|r.

• We write r^n as an abbreviation for r...r (n times r), with r^0 denoting ε.
3.1.4 Lemma
Writing ‘r = s’ to mean ‘L(r) = L(s)’ for two regular expressions r, s ∈ RE(Σ), the following
identities hold for all r, s, t ∈ RE(Σ):

• r|s = s|r (| is commutative)

• (r|s)|t = r|(s|t) (| is associative)

• (rs)t = r(st) (concatenation is associative)

• r(s|t) = rs|rt and (r|s)t = rt|st (concatenation distributes over |)

• ∅r = r∅ = ∅

• ∅|r = r

• ∅* = ε

• r?* = r*

• r** = r*

• (r*s*)* = (r|s)*

• εr = rε = r.
3.1.5 Regular definitions

It is often useful to give names to complex regular expressions, and to use these names in
place of the expressions they represent. Given an alphabet comprising all ASCII characters,

letter = A|B|...|Z|a|b|...|z
digit = 0|1|...|9
ident = letter(letter|digit)*

are examples of regular definitions for letters, digits and identifiers.
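The regular definitions above transcribe directly into the notation of Python's re module (whose pattern language is a superset of the regular expressions defined in 3.1.1); character classes abbreviate the long alternations. A brief illustrative sketch:

```python
# The regular definitions above, written with Python's `re` module.
# Character classes [A-Za-z] and [0-9] abbreviate the alternations.

import re

letter = "[A-Za-z]"                       # A|B|...|Z|a|b|...|z
digit  = "[0-9]"                          # 0|1|...|9
ident  = f"{letter}({letter}|{digit})*"   # letter(letter|digit)*

assert re.fullmatch(ident, "foo123bar")
assert not re.fullmatch(ident, "123foo")  # must start with a letter
print("both checks pass")
```

Note that re.fullmatch is used rather than re.match, since a regular definition describes the whole lexeme, not a prefix of it.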
3.1.6 The Decision Problem for Regular Languages
For every regular expression r ∈ RE(Σ) there exists a string-processing machine M = M(r)
such that for every w ∈ Σ*, when input to M:

(1) if w ∈ L(r) then M terminates with output ‘Yes’, and

(2) if w ∉ L(r) then M terminates with output ‘No’.
Thus, every regular language is decidable.
The machines in question are Deterministic Finite State Automata.
3.2 Deterministic Finite State Automata
In this section we define the notion of a DFA without reference to its application in lexical
analysis. Here we are interested purely in solving the decision problem for regular languages;
that is, defining machines that say “yes” or “no” given an input string, depending on its
membership of a particular language. In Section 3.3 we use DFAs as the basis for lexical
analysers: pattern-matching algorithms that output sequences of tokens.
3.2.1 Definition

A deterministic finite state automaton (or DFA) M is a 5-tuple

M = (Q, Σ, δ, q0, F)

where

• Q is a finite non-empty set of states,

• Σ is an alphabet,

• δ : Q × Σ → Q is the transition or next-state function,

• q0 ∈ Q is the initial state, and

• F ⊆ Q is the set of accepting or final states.
The idea behind a DFA M is that it is an abstract machine that defines a language L(M) ⊆ Σ*
in the following way:

• The machine begins in its start state q0;

• Given a string w ∈ Σ*, the machine reads the symbols of w one at a time from left to
right;

• Each symbol causes the machine to make a transition from its current state to a new
state; if the current state is q and the input symbol is a, then the new state is δ(q, a);

• The machine terminates when all the symbols of the string have been read;

• If, when the machine terminates, its state is a member of F, then the machine accepts
w, else it rejects w.

Note that the name ‘final state’ is not a good one, since a DFA does not terminate as soon as
a ‘final state’ has been entered; the DFA only terminates when all the input has been read.
We formalise this idea as follows:

Definition. Let M = (Q, Σ, δ, q0, F) be a DFA. We define δ̂ : Q × Σ* → Q by

δ̂(q, ε) = q

for each q ∈ Q, and

δ̂(q, aw) = δ̂(δ(q, a), w)

for each q ∈ Q, a ∈ Σ and w ∈ Σ*.

We define the language of M by

L(M) = { w ∈ Σ* | δ̂(q0, w) ∈ F }.
3.2.2 Transition Diagrams
DFAs are best understood by depicting them as transition diagrams; these are directed graphs
with nodes representing states and labelled arcs between states representing transitions.
A transition diagram for a DFA is drawn as follows:
(1) Draw a node labelled ‘q’ for each state q ∈ Q.

(2) For every q ∈ Q and every a ∈ Σ, draw an arc labelled a from node q to node δ(q, a).

(3) Draw an unlabelled arc from outside the DFA to the node representing the initial
state q0.

(4) Indicate each final state by drawing a concentric circle around its node to form a
double circle.
3.2.3 Examples
Let M1 = (Q, Σ, δ, q0, F) where Q = {1, 2, 3, 4}, Σ = {a, b}, q0 = 1, F = {4}, and where δ is
given by:

δ(1, a) = 2    δ(1, b) = 3
δ(2, a) = 3    δ(2, b) = 4
δ(3, a) = 3    δ(3, b) = 3
δ(4, a) = 3    δ(4, b) = 4.

From the transition diagram for M1 it is clear that:

L(M1) = { w ∈ {a, b}* | δ̂(1, w) ∈ F }
      = { w ∈ {a, b}* | δ̂(1, w) = 4 }
      = {ab, abb, abbb, ..., ab^n, ...}
      = L(ab+).

Let M2 be obtained from M1 by adding states 1 and 2 to F. Then

L(M2) = L(ε|ab*).

Let M3 be obtained from M1 by changing F to {3}. Then

L(M3) = L((b|aa|abb*a)(a|b)*).
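The definition of δ̂ translates directly into a loop: start in q0 and apply δ once per input symbol. Below is an illustrative Python sketch, with M1's transition table from the example encoded as a dictionary; the function and variable names are choices made for the sketch.

```python
# Running a DFA: fold the transition function δ over the symbols of w,
# starting from q0, and check whether the resulting state is final.

def accepts(delta, q0, finals, word):
    """Return True iff δ̂(q0, word) ∈ F."""
    state = q0
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals

# M1's transition table from the example above:
delta = {(1, "a"): 2, (1, "b"): 3,
         (2, "a"): 3, (2, "b"): 4,
         (3, "a"): 3, (3, "b"): 3,
         (4, "a"): 3, (4, "b"): 4}

# L(M1) = L(ab+): an a followed by one or more b's.
for w in ["ab", "abbb", "", "a", "abab"]:
    print(repr(w), accepts(delta, 1, {4}, w))
# 'ab' True
# 'abbb' True
# '' False
# 'a' False
# 'abab' False
```

Note that state 3 acts as the error state of M1: once entered it can never be left, which is why 'abab' is rejected.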
Simplifications to transition diagrams.
• It is often the case that a DFA has an 'error state', that is, a non-accepting state from
which there are no transitions other than back to the error state. In such a case it is
convenient to apply the convention that any apparently missing transitions are transitions
to the error state.
• It is also common for there to be a large number of transitions between two given states
in a DFA, which results in a cluttered transition diagram. For example, in an identifier-recognition
DFA, there may be 52 arcs, labelled with each of the lower- and upper-case
letters, from the start state to a state representing that a single letter has been recognised.
It is convenient in such cases to deﬁne a set comprising the labels of each of these arcs,
for example,
letter = {a, b, c, . . . , z, A, B, C, . . . , Z} ⊆ Σ
and to replace the arcs by a single arc labelled by the name of this set, e.g. letter.
It is acceptable practice to use these conventions provided it is made clear that they are being
used.
3.2.4 Equivalence Theorem
(1) For every r ∈ RE(Σ) there exists a DFA M with alphabet Σ such that L(M) = L(r).
(2) For every DFA M with alphabet Σ there exists an r ∈ RE(Σ) such that L(r) =
L(M).
Proof. See J.E. Hopcroft and J. D. Ullman Introduction to Automata Theory, Languages, and
Computation (Addison Wesley, 1979).
Applications. The signiﬁcance of the Equivalence Theorem is that its proof is constructive;
there is an algorithm that, given a regular expression r, builds a DFA M such that L(M) =
L(r). Thus, if we can write a fragment of a programming language syntax in terms of regular
expressions, then by part (1) of the Theorem we can automatically construct a lexical analyser
for that fragment.
Part (2) of the Equivalence Theorem is a useful tool for showing that a language is regular,
since if we cannot ﬁnd a regular expression directly, part (2) states that it is suﬃcient to ﬁnd
a DFA that recognises the language.
The standard algorithm for constructing a DFA from a given regular expression is not diﬃcult,
but would require that we also take a look at nondeterministic ﬁnite state automata (NFAs).
NFAs are equivalent in power to DFAs but are slightly harder to understand (see the course
text for details). Given a regular expression, the RE-to-DFA algorithm ﬁrst constructs an NFA
equivalent to the RE (by a method known as Thompson's Construction), and then transforms
the NFA into an equivalent DFA (by a method known as the Subset Construction).
3.3 DFAs for Lexical Analysis
Let’s suppose we wish to construct a lexical analyser based on a DFA. We have seen that it
is easy to construct a DFA that recognises lexemes for a given programming language token
(e.g. for individual keywords, for identiﬁers, and for numbers). However, a lexical analyser
has to deal with all of a programming language’s lexical patterns, and has to repeatedly match
sequences of characters against these patterns and output corresponding tokens. We illustrate
how lexical analysers may be constructed using DFAs by means of an example.
3.3.1 An example DFA lexical analyser
Consider ﬁrst writing a DFA for recognising tokens for a (minimal!) language with identiﬁers
and the symbols +, ; and := (we’ll add keywords later). A transition diagram for a DFA that
recognises these symbols is given by:
[Transition diagram: from state start, whitespace loops back to start; a letter leads to state
in_ident, where letters and digits loop and any other character (reread) yields the token
identifier (or keyword); ':' leads to state in_assign, where '=' yields the token assign
and any other character yields error; '+' yields plus; ';' yields semi_colon; end of input
yields end_of_input; any other character yields error. Each token is emitted on the final
transition into state done, and is shown on the right-hand side of the diagram.]
The lexical analyser code (see next section) consists of a procedure get_next_token which
outputs the next token, which can be either identifier (for identiﬁers), plus (for +), semi_colon
(for ;), assign (for :=), error (if an error occurs, for example if an invalid character such as
'(' is read or if a : is not followed by a =) or end_of_input (for when the complete input ﬁle
has been read).
The lexical analyser based on the DFA begins in state start, and returns a token when it
enters state done; the token returned depends on the ﬁnal transition taken to enter the done
state, and is shown on the right-hand side of the diagram.
For a state with an output arc labelled other, the intuition is that this transition is made
on reading any character except those labelling the state's other arcs; reread denotes
that the read character should not be consumed: it is reread as the ﬁrst character when
get_next_token is next called.
Notice that adding keyword recognition to the above DFA would be tricky to do by hand and
would lead to a complex DFA (why?). However, we can recognise keywords as identiﬁers using
the above DFA, and when the accepting state for identiﬁers is entered the lexeme stored in the
buﬀer can be checked against a table of keywords. If there is a match, then the appropriate
keyword token is output; otherwise the identifier token is output.
Further, when we have recognised an identiﬁer we will also wish to output its string value as
well as the identifier token, so that this can be used by the next phase of compilation.
3.3.2 Code for the example DFA lexical analyser
Let’s consider how code may be written based on the above DFA. Let’s add the keywords if,
then, else and fi to the language to make it slightly more realistic.
Firstly, we deﬁne enumerated types for the sets of states and tokens:
state = (start, in_identifier, in_assign, done);
token = (k_if, k_then, k_else, k_fi, plus, identifier, assign,
semi_colon, error, end_of_input);
Next, we deﬁne some variables that are shared by the lexical analyser and the syntax analyser.
The job of the procedure get_next_token is to set the value of current_token to the next
token, and if this token is identifier it also sets the value of current_identifier to the
current lexeme. The value of the Boolean variable reread_character determines whether the
last character read during the previous execution of get_next_token should be reread at the
beginning of its next execution. The current_character variable holds the value of the last
character read by get_next_token.
current_token : token;
current_identifier : string[100];
reread_character : boolean;
current_character : char;
We also need the following auxiliary functions (their implementations are omitted here) with
the obvious interpretations:
function is_alpha(c : char) : boolean;
function is_digit(c : char) : boolean;
function is_white_space(c : char) : boolean;
Finally, we deﬁne two constant arrays:
{ Constants used to recognise keyword matches }
NUM_KEYWORDS = 4;
token_tab : array[1..NUM_KEYWORDS] of token = (k_if, k_then, k_else, k_fi);
keyword_tab : array[1..NUM_KEYWORDS] of string = (’if’, ’then’,’else’, ’fi’);
that store keyword tokens and keywords (with each keyword and its associated token stored at
the same index in the two arrays), and a function that searches the keyword array for a string
and returns the token associated with a matched keyword, or the token identifier if there is
no match. Notice that the arrays and function are easily modiﬁed for any number of keywords
appearing in our source language.
function keyword_lookup(s : string) : token;
{If s is a keyword, return this keyword’s token; else
return the identifier token}
var
index : integer;
found : boolean;
begin
keyword_lookup := identifier;
found := FALSE;
index := 1;
while (index <= NUM_KEYWORDS) and (not found) do begin
if keyword_tab[index] = s then begin
keyword_lookup := token_tab[index];
found := TRUE
end;
index := index + 1
end
end;
The get_next_token procedure is implemented as follows. Notice that within the main loop
a case statement is used to deal with transitions from the current state. After the loop exits
(when the done state is entered), if an identifier has been recognised, keyword_lookup is
used to check whether or not a keyword has been matched.
procedure get_next_token;
{Sets the value of current_token by matching input characters. Also,
sets the values current_identifier and reread_character if
appropriate}
var
current_state : state;
no_more_input : boolean;
begin
current_state := start;
current_identifier := ’’;
while not (current_state = done) do begin
no_more_input := eof; {Check whether at end of file}
if not (reread_character or no_more_input) then
read(current_character);
reread_character := FALSE;
case current_state of
start:
if no_more_input then begin
current_token := end_of_input;
current_state := done
end else if is_white_space(current_character) then
current_state := start
else if is_alpha(current_character) then begin
current_identifier := current_identifier + current_character;
current_state := in_identifier
end else case current_character of
’;’ : begin
current_token := semi_colon;
current_state := done
end;
’+’ : begin
current_token := plus;
current_state := done
end;
’:’ :
current_state := in_assign
else begin
current_token := error;
current_state := done;
end
end; {case}
in_identifier:
if (no_more_input or not(is_alpha(current_character)
or is_digit(current_character))) then begin
current_token := identifier;
current_state := done;
reread_character := true
end else
current_identifier := current_identifier + current_character;
in_assign:
if no_more_input or (current_character <> ’=’) then begin
current_token := error;
current_state := done
end else begin
current_token := assign;
current_state := done
end
end; {case}
end; {while}
if (current_token = identifier) then
current_token := keyword_lookup(current_identifier);
end;
Test code (in the absence of a syntax analyser) might be the following. This just repeatedly
calls get_next_token until the end of the input ﬁle has been reached, and prints out the value
of each token read.
{Request tokens from lexical analyser, outputting their
values, until end_of_input}
begin
reread_character := false;
repeat
get_next_token;
writeln(’Current Token is ’, token_to_text(current_token));
if (current_token = identifier) then
writeln(’Identifier is ’, current_identifier);
until (current_token = end_of_input)
end.
where
function token_to_text(t : token) : string;
converts token values to text.
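A compact C rendering of the same automaton may help to relate the diagram's states to code. This is a sketch only: it reads from a string rather than a file, the names mirror the Pascal version, and the reread convention is realised simply by not advancing the input position past the terminating character.

```c
#include <ctype.h>
#include <string.h>

/* Token codes mirroring the Pascal enumeration above. */
enum token { K_IF, K_THEN, K_ELSE, K_FI, PLUS, IDENTIFIER, ASSIGN,
             SEMI_COLON, T_ERROR, END_OF_INPUT };

static const char *keyword_tab[] = { "if", "then", "else", "fi" };
static const enum token token_tab[] = { K_IF, K_THEN, K_ELSE, K_FI };

static const char *input;        /* the whole input, as a string */
static int pos;                  /* index of the next character to read */
char current_identifier[101];

static enum token keyword_lookup(const char *s)
{
    int i;
    for (i = 0; i < 4; i++)
        if (strcmp(keyword_tab[i], s) == 0)
            return token_tab[i];
    return IDENTIFIER;
}

/* The start / in_identifier / in_assign automaton as a function. */
enum token get_next_token(void)
{
    int len = 0;
    char c;
    while ((c = input[pos]) == ' ' || c == '\t' || c == '\n')
        pos++;                               /* state: start (skip whitespace) */
    if (input[pos] == '\0')
        return END_OF_INPUT;
    c = input[pos++];
    if (isalpha((unsigned char)c)) {         /* state: in_identifier */
        current_identifier[len++] = c;
        while (isalnum((unsigned char)input[pos]) && len < 100)
            current_identifier[len++] = input[pos++];
        current_identifier[len] = '\0';      /* 'other' is not consumed: reread */
        return keyword_lookup(current_identifier);
    }
    switch (c) {
    case '+': return PLUS;
    case ';': return SEMI_COLON;
    case ':':                                /* state: in_assign */
        if (input[pos] == '=') { pos++; return ASSIGN; }
        return T_ERROR;
    default:  return T_ERROR;
    }
}

void lexer_init(const char *s) { input = s; pos = 0; }
```

For example, after `lexer_init("if x1 := y + z; fi")`, successive calls return K_IF, IDENTIFIER, ASSIGN, IDENTIFIER, PLUS, IDENTIFIER, SEMI_COLON, K_FI, END_OF_INPUT.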
3.4 Lex
Lex is a widely available lexical analyser generator.
3.4.1 Overview
Given a Lex source ﬁle comprising regular expressions for various tokens, Lex generates a lexical
analyser (based on a DFA), written in C, that groups characters matching the expressions into
lexemes, and can return their corresponding tokens.
In essence, a Lex ﬁle comprises a number of lines typically of the form:
pattern action
where pattern is a regular expression and action is a piece of C code.
When run on a Lex ﬁle, Lex produces a C ﬁle called lex.yy.c (a lexical analyser).
When compiled, lex.yy.c takes a stream of characters as input and whenever a sequence of
characters matches a given regular expression the corresponding action is executed. Characters
not matching any regular expressions are simply copied to the output stream.
Example. Consider the Lex fragment:
a { printf( "read ’a’\n" ); }
b { printf( "read ’b’\n" ); }
After compiling (see below on how to do this) we obtain a binary executable which when
executed on the input:
sdfghjklaghjbfghjkbbdfghjk
dfghjkaghjklaghjk
produces
sdfghjklread ’a’
ghjread ’b’
fghjkread ’b’
read ’b’
dfghjk
dfghjkread ’a’
ghjklread ’a’
ghjk
Example. Consider the Lex program:
%{
int abc_count, xyz_count;
%}
%%
ab[cC] {abc_count++; }
xyz {xyz_count++; }
\n { ; }
. { ; }
%%
main()
{
abc_count = xyz_count = 0;
yylex();
printf( "%d occurrences of abc or abC\n", abc_count );
printf( "%d occurrences of xyz\n", xyz_count );
}
• This ﬁle ﬁrst declares two global variables for counting the number of occurrences of abc
or abC and xyz.
• Next come the regular expressions for these lexemes and actions to increment the relevant
counters.
• Finally, there is a main routine to initialise the counters and call yylex().
When executed on input:
akhabfabcdbcaxyzXyzabChsdk
dfhslkdxyzabcabCdkkjxyzkdf
the lexical analyser produces:
4 occurrences of ’abc’ or ’abC’
3 occurrences of ’xyz’
Some features of Lex illustrated by this example are:
(1) The character-class notation [ ]; for example, [cC] matches either c or C.
(2) The regular expression \n which matches a newline.
(3) The regular expression . which matches any character except a newline.
(4) The action { ; } which does nothing except to suppress printing.
3.4.2 Format of Lex Files
The format of a Lex ﬁle is:
deﬁnitions
analyser speciﬁcation
auxiliary functions
Lex Deﬁnitions. The (optional) deﬁnitions section comprises macros (see below) and global
declarations of types, variables and functions to be used in the actions of the lexical analyser
and the auxiliary functions (if present). All such global declaration code is written in C and
surrounded by %{ and %}.
Macros are abbreviations for regular expressions to be used in the analyser speciﬁcation. For
example, the token ‘identiﬁer’ could be deﬁned by:
IDENTIFIER [a-zA-Z][a-zA-Z0-9]*
The shorthand character-range construction '[x-y]' matches any of the characters between
(and including) x and y. For example, [a-c] means the same as a|b|c, and [a-cA-C] means
the same as a|b|c|A|B|C.
Deﬁnitions may use other deﬁnitions (enclosed in braces) as illustrated in:
ALPHA [a-zA-Z]
ALPHANUM [a-zA-Z0-9]
IDENTIFIER {ALPHA}{ALPHANUM}*
and:
ALPHA [a-zA-Z]
NUM [0-9]
ALPHANUM ({ALPHA}|{NUM})
IDENTIFIER {ALPHA}{ALPHANUM}*
Notice the use of parentheses in the deﬁnition of ALPHANUM. What would happen without them?
Lex Analyser Speciﬁcations. These have the form:
r1 { action1 }
r2 { action2 }
. . .
rn { actionn }
where r1, r2, . . . , rn are regular expressions (possibly involving macros enclosed in braces) and
action1, action2, . . . , actionn are sequences of C statements.
Lex translates the speciﬁcation into a function yylex() which, when called, causes the
following to happen:
• The current input character(s) are scanned to look for a match with the regular expres
sions.
• If there is no match, the current character is printed out, and the scanning process resumes
with the next character.
• If the next m characters match ri then
(a) the matching characters are assigned to the string variable yytext,
(b) the integer variable yyleng is assigned the value m,
(c) the next m characters are skipped, and
(d) actioni is executed. If the last instruction of actioni is return n; (where n
is an integer expression) then the call to yylex() terminates and the value of
n is returned as the function's value; otherwise yylex() resumes the scanning
process.
• If end-of-ﬁle is read at any stage, then the call to yylex() terminates returning the value
0.
• If there is a match against two or more regular expressions, then the expression giving
the longest lexeme is chosen; if all lexemes are of the same length then the ﬁrst matching
expression is chosen.
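The longest-lexeme and first-rule disambiguation policy can be sketched in isolation (this illustrates only the policy, not Lex's DFA machinery; the two 'rules' and all names here are our own): rule 1 matches the literal keyword if, rule 2 matches identifiers [a-z]+.

```c
#include <ctype.h>

/* Rule 1: the keyword "if".  Returns the match length (2) or 0. */
static int match_kw_if(const char *s)
{
    return (s[0] == 'i' && s[1] == 'f') ? 2 : 0;
}

/* Rule 2: an identifier [a-z]+.  Returns the length matched (may be 0). */
static int match_ident(const char *s)
{
    int n = 0;
    while (islower((unsigned char)s[n])) n++;
    return n;
}

/* Lex's policy: choose the rule giving the longest lexeme; on a tie,
   choose the rule listed first.  Returns 1 for the keyword rule, 2 for
   the identifier rule, 0 for no match; *len receives the lexeme length. */
int choose_rule(const char *s, int *len)
{
    int n1 = match_kw_if(s), n2 = match_ident(s);
    if (n1 == 0 && n2 == 0) { *len = 0; return 0; }
    if (n1 >= n2) { *len = n1; return 1; }   /* ties go to the earlier rule */
    *len = n2; return 2;
}

/* Convenience wrappers for checking the policy on examples. */
int rule_for(const char *s)    { int len; return choose_rule(s, &len); }
int lexeme_len(const char *s)  { int len; choose_rule(s, &len); return len; }
```

On input "if " both rules match two characters, so the keyword rule (listed first) wins; on input "ifx" the identifier rule matches three characters and so wins by length. This is exactly why, in a real Lex file, keyword rules are placed before the identifier rule.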
Lex Auxiliary Functions. This optional section has the form:
fun1
fun2
. . .
funn
where each funi is a complete C function.
We can also compile lex.yy.c with the lex library using the command:
gcc lex.yy.c -ll
This has the eﬀect of automatically including a standard main() function, equivalent to:
main()
{
yylex();
return;
}
Thus, in the absence of any return statements in the analyser's actions, this one call to yylex()
consumes all the input up to and including end-of-ﬁle.
3.4.3 Lexical Analyser Example
The Lex program below illustrates how a lexical analyser for a Pascal-type language is deﬁned.
Notice that the regular expression for identiﬁers is placed at the end of the list (why?).
We assume that the syntax analyser requests tokens by repeatedly calling the function yylex().
The global variable yylval (of type integer in this example) is generally used to pass tokens'
attributes from the lexical analyser to the syntax analyser and is shared by both phases of the
compiler. Here it is being used to pass integers' values and identiﬁers' symbol-table positions
to the syntax analyser.
%{
definitions (as integers) of IF, THEN, ELSE, ID, INTEGER, ...
%}
delim [ \t\n]
ws {delim}+
letter [A-Za-z]
digit [0-9]
id {letter}({letter}|{digit})*
integer [+\-]?{digit}+
%%
{ws} { ; }
if { return(IF); }
then { return(THEN); }
else { return(ELSE); }
... ...
{integer} { yylval = atoi(yytext); return(INTEGER); }
{id} { yylval = InstallInTable(); return(ID); }
%%
int InstallInTable()
{
put yytext in symbol table and return
the position it has been inserted.
}
4 Syntax Analysis
In this section we will look at the second phase of compilation: syntax analysis, or parsing.
Since parsing is a central concept in compilation and because (unlike lexical analysis) there are
many approaches to parsing, this section makes up most of the remainder of the course.
In Section 4.1 we discuss the class of context-free languages and its relationship to the syntactic
structure of programming languages and the compilation process. Parsing algorithms for
context-free languages fall into two main categories: top-down and bottom-up parsers (the names
refer to the process of parse-tree construction). Diﬀerent types of top-down and bottom-up
parsing algorithms will be discussed in Sections 4.2 and 4.3 respectively.
4.1 Context-Free Languages
Regular languages are inadequate for specifying all but the simplest aspects of programming
language syntax. To specify more complex languages such as
• L = {w ∈ {a, b}* | w = aⁿbⁿ for some n},
• L = {w ∈ {(, )}* | w is a well-balanced string of parentheses} and
• the syntax of most programming languages,
we use context-free languages. In this section we deﬁne context-free grammars and languages,
and their use in describing the syntax of programming languages. This section is intended to
provide a foundation for the following sections on parsing and parser construction.
Note. Like Section 3, this section contains mainly theoretical deﬁnitions; the lectures will cover
examples and diagrams illustrating the theory.
4.1.1 Context-Free Grammars
Deﬁnition. A context-free grammar is a tuple G = (T, N, S, P) where:
• T is a ﬁnite nonempty set of (terminal) symbols (tokens),
• N is a ﬁnite nonempty set of (nonterminal) symbols (denoting phrase types), disjoint from
T,
• S ∈ N (the start symbol), and
• P is a set of (context-free) productions (denoting 'rules' for phrase types) of the form
A → α where A ∈ N and α ∈ (T ∪ N)*.
Notation. In what follows we use:
• a, b, c, . . . for members of T,
• A, B, C, . . . for members of N,
• . . . , X, Y, Z for members of T ∪ N,
• u, v, w, . . . for members of T*, and
• α, β, γ, . . . for members of (T ∪ N)*.
Examples.
(1) G1 = (T, N, S, P) where
• T = {a, b},
• N = {S} and
• P = {S → ab, S → aSb}.
(2) G2 = (T, N, S, P) where
• T = {a, b},
• N = {S, X} and
• P = {S → X, S → aa, S → bb, S → aSa, S → bSb, X → a, X → b}.
Notation. It is customary to 'deﬁne' a context-free grammar by simply listing its productions
and assuming:
• The terminals and nonterminals of the grammar are exactly those symbols appearing in
the productions. (It is usually clear from the context whether a symbol is a terminal or
a nonterminal.)
• The start symbol is the nonterminal on the left-hand side of the ﬁrst production.
• Right-hand sides separated by '|' indicate alternatives.
For example, G2 above can be written as
S → X | aa | bb | aSa | bSb
X → a | b
BNF Notation. The deﬁnition of a grammar as given so far is ﬁne for simple theoretical
grammars. For grammars of real programming languages, a popular notation for context-free
grammars is Backus-Naur Form. Here, nonterminal symbols are given a descriptive name and
placed in angle brackets, and '|' is used, as above, to indicate alternatives. The alphabet
will also commonly include keywords and language symbols such as <=. For example, BNF for
simple arithmetic expressions might be
<exp> → <exp> + <exp> | <exp> − <exp> |
        <exp> * <exp> | <exp> / <exp> |
        ( <exp> ) | <number>
<number> → <digit> | <digit> <number>
<digit> → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9.
and a typical rule for while loops would be:
<while loop> → while <exp> do <command list> od
<command list> → <assignment> | <conditional> . . .
Note that the symbol ::= is often used in place of →.
Derivation and Languages. A (context-free) grammar G deﬁnes a language L(G) in the
following way:
• We say αAβ immediately derives αγβ if A → γ is a production of G. We write αAβ =⇒
αγβ.
• We say α derives β in zero or more steps, in symbols α =⇒* β, if
(i) α = β, or
(ii) for some γ we have α =⇒* γ and γ =⇒ β.
• If S =⇒* α then we say α is a sentential form.
• We say α derives β in one or more steps, in symbols α =⇒+ β, if α =⇒* β and α ≠ β.
• We deﬁne L(G) by
L(G) = {w ∈ T* | S =⇒+ w}.
• Let L ⊆ T*. L is said to be a context-free language if L = L(G) for some context-free
grammar G.
4.1.2 Regular Grammars
A grammar is said to be regular if all its productions are of the form:
(a) A → uB or A → u (a right-linear grammar), or
(b) A → Bu or A → u (a left-linear grammar).
A language L ⊆ T* is regular (see Section 3) if L = L(G) for some regular grammar G.
4.1.3 ε-Productions
A grammar may involve a production of the form
A →
(that is, with an empty right-hand side). This is called an ε-production and is often written
A → ε. ε-productions are often useful in specifying languages, although they can cause diﬃculties
(ambiguities), especially in parser design, and may need to be removed (see later).
As an example of the use of ε-productions,
<stmt> → begin <stmt> end <stmt> | ε
generates well-balanced 'begin/end' strings (and the empty string).
4.1.4 Leftmost and Rightmost Derivations
A derivation in which the leftmost (respectively, rightmost) nonterminal is replaced at each
step is called a leftmost (respectively, rightmost) derivation.
Examples. Let G be a grammar with productions
E → E + E | E * E | x | y | z.
(a) A leftmost derivation of x + y ∗ z is
E =⇒ E + E =⇒ x + E =⇒ x + E ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z.
(b) A rightmost derivation of x + y ∗ z is
E =⇒ E + E =⇒ E + E ∗ E =⇒ E + E ∗ z =⇒ E + y ∗ z =⇒ x + y ∗ z.
(c) Another leftmost derivation of x + y ∗ z is
E =⇒ E ∗ E =⇒ E + E ∗ E =⇒ x + E ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z
(thus left and rightmost derivations need not be unique).
(d) The following is neither left nor rightmost:
E =⇒ E + E =⇒ E + E ∗ E =⇒ E + y ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z.
4.1.5 Parse Trees
Let G = (T, N, S, P) be a contextfree grammar.
(1) A parse (or derivation) tree in grammar G is a pictorial representation of a derivation
in which
(a) every node has a label that is a symbol of T ∪ N ∪ {ε},
(b) the root of the tree has label S,
(c) interior nodes have nonterminal labels,
(d) if node n has label A and descendants n1, . . . , nk (in order, left to right)
with labels X1, . . . , Xk respectively, then A → X1 · · · Xk must be a production
of P, and
(e) if node n has label ε then n is a leaf and is the only descendant of its
parent.
(2) The yield of a parse tree with root node r is y(r) where:
(a) if n is a leaf node with label a then y(n) = a, and
(b) if n is an interior node with descendants n1, . . . , nk then
y(n) = y(n1) · · · y(nk).
Theorem. Let G = (T, N, S, P) be a context-free grammar. For each α ∈ (T ∪ N)* there exists
a parse tree in grammar G with yield α if, and only if, S =⇒* α.
Examples.
1. Consider the parse trees for the derivations (a) and (c) of the above examples. Notice that
the derivations are diﬀerent and so are the parse trees, although both derivations are leftmost.
2. Let G be a grammar with productions
<stmt> → if <boolexp> then <stmt> |
         if <boolexp> then <stmt> else <stmt> |
         <other>
<boolexp> → . . .
<other> → . . .
A classical problem in parsing is that of the dangling else: should
if <boolexp> then if <boolexp> then <other> else <other>
be parsed as
if <boolexp> then ( if <boolexp> then <other> else <other> )
or
if <boolexp> then ( if <boolexp> then <other> ) else <other> ?
Theorem. Let G = (T, N, S, P) be a contextfree grammar and let w ∈ L(G). For every parse
tree of w there is precisely one leftmost derivation of w and precisely one rightmost derivation
of w.
4.1.6 Ambiguity
A grammar G for which there is some w ∈ L(G) with two (or more) parse trees
is said to be ambiguous. In such a case, w is known as an ambiguous sentence. Ambiguity is
usually an undesirable property since it may lead to diﬀerent interpretations of the same string.
Equivalently (by the above theorem), a grammar G is ambiguous if there is some w ∈ L(G) for
which there exists more than one leftmost derivation (or rightmost derivation).
Unfortunately:
Theorem. The decision problem 'Is G an unambiguous grammar?' is undecidable.
However, for a particular grammar we can sometimes resolve ambiguity via disambiguating
rules that force troublesome strings to be parsed in a particular way. For example, let G be the
grammar with productions
<stmt> → <matched> | <unmatched>
<matched> → if <boolexp> then <matched> else <matched> |
            <other>
<unmatched> → if <boolexp> then <stmt> |
              if <boolexp> then <matched> else <unmatched>
<boolexp> → . . .
<other> → . . .
(<matched> is intended to denote an if-statement with a matching 'else', or some other (non-conditional)
statement, whereas <unmatched> denotes an if-statement with an unmatched
'if'.)
Now, according to this G,
if <boolexp> then if <boolexp> then <other> else <other>
has a unique parse tree. (Exercise: convince yourself that this parse tree really is unique.)
4.1.7 Inherent Ambiguity
We have seen that it is possible to remove ambiguity from a grammar by changing the (nonterminals
and) productions to give an equivalent (in terms of the language deﬁned) but unambiguous
grammar. However, there are some context-free languages for which every grammar is
ambiguous; such languages are called inherently ambiguous.
Formally,
Theorem. There exists a context-free language L such that for every context-free grammar G,
if L = L(G) then G is ambiguous.
Furthermore,
Theorem. The decision problem 'Is L an inherently ambiguous language?' is undecidable.
4.1.8 Limits of Context-Free Grammars
It is known that not every language is context-free. For example,
L = {w ∈ {a, b, c}* | w = aⁿbⁿcⁿ for some n ≥ 1}
is not context-free, nor is
L = {w ∈ {a, b, c}* | w = aᵐbⁿcᵖ for some m ≤ n ≤ p}
although proof of these facts is beyond the scope of this course.
More signiﬁcantly, there are some aspects of programming language syntax that cannot be
captured with a context-free grammar. Perhaps the most well-known example of this is the
rule that variables must be declared before they are used. Such properties are often called
context-sensitive since they can be speciﬁed by grammars that have productions of the form
αAβ → αγβ
which, informally, says that A can be replaced by γ, but only when it occurs in the context
α . . . β.
In practice we still use context-free grammars; problems are resolved by allowing strings (programs)
that parse but are technically invalid, and then by adding further code that checks for
invalid (context-sensitive) properties.
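As a sketch of what such additional checking code might look like (the symbol-table layout and function names here are ours, not from any particular compiler), a parser can record declarations as it sees them and reject uses of undeclared variables:

```c
#include <string.h>

/* A toy symbol table: the context-sensitive "declare before use" rule
   is enforced by code, not by the grammar. */
#define MAX_DECLS 100
static char declared[MAX_DECLS][32];
static int n_declared;

/* Called by the parser when a declaration is recognised. */
void declare(const char *name)
{
    if (n_declared < MAX_DECLS) {
        strncpy(declared[n_declared], name, 31);
        declared[n_declared][31] = '\0';
        n_declared++;
    }
}

/* Called by the parser when a variable use is recognised:
   returns 1 if the use is legal, 0 if the variable is undeclared. */
int check_use(const char *name)
{
    int i;
    for (i = 0; i < n_declared; i++)
        if (strcmp(declared[i], name) == 0)
            return 1;
    return 0;
}
```

A real compiler would use a scoped, hashed symbol table, but the principle is the same: the check runs alongside parsing rather than being built into the productions.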
In this section we describe a number of techniques for constructing a parser for a given context-free
language from a context-free grammar for that language. Parsers fall into two broad
categories: top-down parsers and bottom-up parsers, as described in the next two sections.
4.2 Top-Down Parsers
Top-down parsing can be thought of as the process of creating a parse tree for a given input
string by starting at the top with the start symbol and then working depth-ﬁrst downwards to
the leaves. Equivalently, it can be viewed as creating a leftmost derivation of the input string.
We begin, in Section 4.2.1, by showing how a simple parser may be coded in an ad-hoc way using
a standard programming language such as Pascal. In Section 4.2.2 we deﬁne those grammars
for which such a simple parser may be written. In Section 4.2.3 we show how to add syntax-tree
construction and (very limited) error-recovery code to these parsing algorithms. Finally,
in Section 4.2.4 we show how this simple approach to parsing can be made more structured by
using the same algorithm for all grammars, where the algorithm uses a diﬀerent lookup table
for each particular grammar.
4.2.1 Recursive Descent Parsing
The simplest idea for creating a parser for a grammar is to construct a recursive descent parser
(r.d.p.). This is a top-down parser, and in general it needs backtracking; i.e. the parser needs to
try out diﬀerent production rules, and if one production rule fails it needs to backtrack in the
input string (and the parse tree) in order to try further rules. Since backtracking can be costly
in terms of parsing time, and because other parsing methods (e.g. bottom-up parsers) do not
require it, it is rare to write backtracking parsers in practice.
We will show here how to write a non-backtracking r.d.p. for a given grammar. Such parsers
are commonly called predictive parsers since they attempt to predict which production rule
to use at each step in order to avoid the need to backtrack; if, when attempting to apply a
production rule, unexpected input is read, then this input must be an error and the parser
reports this fact. As we will see, it is not possible to write such a simple parser for all grammars:
for many programming languages' grammars such a parser cannot be written. For these we'll
need more powerful parsing methods, which we will look at later.
We will not concern ourselves here with the construction of a parse tree or syntax tree; we'll
just attempt to write code that produces (implicitly, via procedure calls) a derivation of an
input string (if the string is in the language given by the grammar) or reports an error (for
an incorrect string). This code will form the basis of a complete syntax analyser: syntax-tree
construction and error-recovery code will simply be inserted into it at the correct places, as we'll
see in Section 4.2.3.
We will write an r.d.p. for the grammar G with productions:
A → a | BCA
B → be | cA
C → d
We will assume the existence of a procedure get_next_token which reads terminal symbols (or
tokens) from the input, and places them into the variable current_token. The tokens for the
above grammar are a, b, c, d and e. We'll also assume the tokens end_of_input (for end of
string) and error (for non-valid input).
We'll also assume the following auxiliary procedure, which tests whether the current token is
what is expected and, if so, reads the next token; otherwise it calls an error routine. Later
on we'll see more about error recovery, but for now assume that the procedure error simply
outputs a message and causes the program to terminate.
procedure match(expected : token);
begin
if current_token = expected then
get_next_token
else
error
end;
The parser is constructed from three procedures, one for each nonterminal symbol. The aim
of each procedure is to try to match one of the right-hand sides of the production rules for that
nonterminal symbol against the next few symbols of the input string; if it fails to do this then
there must be an error in the input string, and the procedure reports this fact.
Consider the nonterminal symbol A and the A-productions A → a and A → BCA. The
procedure parse_A attempts to match either an a or BCA. It ﬁrst needs to decide which
production rule to choose: it makes its choice based on the current input token. If this is a, it
obviously uses the production rule A → a, and is bound to succeed since it already knows it is
going to ﬁnd the a(!).
Notice that any string derivable from BCA begins with either a b or a c, because of the right-hand
sides of the two B-productions. Therefore, if either b or c is the current token, then
parse_A will attempt to parse the next few input symbols using the rule A → BCA. In order
to attempt to match BCA it calls the procedures parse_B, parse_C and parse_A in turn, each of
which may result in a call to error if the input string is not in the language. If the current
token is none of a, b or c, the input cannot be derived from A and therefore error is called.
procedure parse_A;
begin
case current_token of
a : match(a);
b, c : begin
parse_B; parse_C; parse_A
end
else
error
end
end;
The procedures parse B and parse C are constructed in a similar fashion.
procedure parse_B;
begin
case current_token of
b : begin
match(b);
match(e)
end;
c : begin
match(c);
parse_A
end
else
error
end
end;
procedure parse_C;
begin
match(d)
end;
To execute the parser, we need to (i) make sure that the ﬁrst token of the input string has
been read into current_token, (ii) call parse_A, since A is the start symbol, and, if an error
is not generated by this call, (iii) check that the end of the input string has been reached. This
is accomplished by
get_next_token;
parse_A;
if current_token = end_of_input then
writeln('Success!')
else
error
By executing the parser on a range of inputs that drive the algorithm through all possible
branches (try the algorithm on beda, bed and cbedada for examples), it is easy to convince
oneself that the algorithm is a neat and correct solution to the parsing problem for this particular
grammar.
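For readers who wish to experiment, the parser can be transcribed into Python (a sketch, not part of the original Pascal development; tokens are simply the characters of the input string, and errors are reported by raising an exception):

```python
# A Python transcription of the recursive-descent parser above.
# Tokens are single characters; "$" marks the end of the input.

class ParseError(Exception):
    pass

def parse(s):
    tokens = list(s) + ["$"]
    pos = 0

    def current():
        return tokens[pos]

    def match(expected):
        nonlocal pos
        if current() == expected:
            pos += 1                      # get_next_token
        else:
            raise ParseError("expected " + expected)

    def parse_A():
        if current() == "a":              # A -> a
            match("a")
        elif current() in ("b", "c"):     # A -> BCA, since First(BCA) = {b, c}
            parse_B(); parse_C(); parse_A()
        else:
            raise ParseError("cannot parse A")

    def parse_B():
        if current() == "b":              # B -> be
            match("b"); match("e")
        elif current() == "c":            # B -> cA
            match("c"); parse_A()
        else:
            raise ParseError("cannot parse B")

    def parse_C():                        # C -> d
        match("d")

    parse_A()
    match("$")                            # check all input was consumed
    return True
```

Running parse("beda") or parse("cbedada") succeeds, while parse("bed") raises a ParseError when the final A cannot be matched against the end of the input.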
Notice how the algorithm
i. determines a sequence of productions that are equivalent to the construction of a parse
tree that starts from the root and builds down to the leaves, and
ii. determines a leftmost derivation.
Although it is possible to construct compact, eﬃcient recursive descent parsers by hand for
some very large grammars, the method is somewhat ad hoc and more useful generalpurpose
methods are preferred.
We note that the method of constructing a r.d.p. will fail to produce a correct parser if the
grammar is left recursive; that is, if A =⇒+ Aα for some nonterminal A (why?).
Also notice that for each nonterminal of the above grammar it is possible to predict which
production (if any) is to be matched against, purely on the basis of the current input character.
This is clearly the case for the grammar G above because, as we have seen:
• To recognise an A: if the current symbol is a, we try the production rule A → a; if the
symbol is b or c, we try A → BCA; otherwise we report an error;
• To recognise a B: if the current symbol is b, we try B → be; if the symbol is c, we try
B → cA; otherwise we report an error;
• To recognise a C: if the current symbol is d, we try C → d; otherwise we report an error.
The choice is determined by the so-called First sets of the right-hand sides of the production
rules. In general, First(α) for any string α ∈ (T ∪ N)* denotes the set of all terminal symbols
that can begin strings derivable from α. For example, the First sets of the right-hand sides of the
rules A → a and A → BCA are First(a) = {a} and First(BCA) = {b, c} respectively. Notice
that, to be able to write a predictive parser for a grammar, for any nonterminal symbol, the
First sets of all pairs of distinct productions' right-hand sides must be disjoint. This is the
case for G, since
First(a) ∩ First(BCA) = {a} ∩ {b, c} = ∅
First(be) ∩ First(cA) = {b} ∩ {c} = ∅.
(and C has no pairs of productions since there is only one C-production rule).
For the grammar G₂, with productions
A → b | BCA
B → be | cA
C → d
this is not the case, since
First(b) ∩ First(BCA) = {b} ∩ {b, c} = {b} ≠ ∅
and therefore we cannot write a predictive parser for G₂.
Things become slightly more complicated when grammars have ε-productions. For example,
let the grammar G₃ have productions:
A → aBb
B → c | ε.
Consider parsing the string adb. Obviously, parse A would be written:
procedure parse_A;
begin
if current_token = a then begin
match(a); parse_B; match(b)
end else
error
end;
How do we write a procedure parse_B that would recognise strings derivable from B? Well,
consider what the parser should do for the following three strings:
• acb. The procedure parse_A would consume the first a and then call parse_B. We would
want parse_B to try the rule B → c to consume the c, leaving parse_A to consume the
final b and thus resulting in a successful parse.
• ab. Again parse_A would consume the a. Now, we would like parse_B to apply B → ε
(which does not consume any input), leaving parse_A to consume the b (success again).
• aa. Again parse_A would consume the a. Now, obviously parse_B should not try B → c
(since the next symbol is a), nor should it try B → ε, since there is no way in which the
next symbol a is going to be matched; therefore it should (correctly) report an error.
The choice, then, of whether to use the ε-production depends on whether the next symbol can
legally immediately follow a B in a sentential form. The set of all such symbols is denoted
Follow(B). For G₃, Follow(B) = {b} and thus the code for parse_B should be:
procedure parse_B;
begin
case current_token of
c : match(c);
b : (* do nothing *)
else
error
end
end;
(Notice that the b is not consumed). Finally, let G₄ be a grammar with productions
A → aBb
B → b | ε
and consider parsing the strings ab and abb. Obviously, parse_A will be as above, and thus would
consume the a in both strings. However, it is impossible for a procedure parse_B to choose,
purely on the basis of the current character (b in both cases), whether to try the production
B → b (which would be correct for abb but not for ab) or B → ε (which would be correct for
ab but not for abb).
Formally, the problem is that ε is in the First set of one of the B-productions' right-hand sides
(we use the convention that First(ε) = {ε}) and that the First set of another right-hand side,
First(b), is not disjoint from the set Follow(B). Indeed, both sets are {b}. This
condition of disjointness must be fulfilled to be able to write a predictive parser.
Formally, grammars for which predictive parsers can be constructed are called LL(1) grammars.
(The first 'L' is for left-to-right reading of the input, the second 'L' is for leftmost derivation,
and the '1' is for one symbol of lookahead.) LL(1) grammars are the most widely-used examples
of LL(k) grammars: those that require k symbols of lookahead. In the following section we
define what it means for a grammar to be LL(1), based on the concepts of First and Follow
sets.
4.2.2 LL(1) Grammars
To deﬁne formally what it means for a grammar to be LL(1) we ﬁrst deﬁne the two functions
First and Follow. These functions help us to construct a predictive parser for a given LL(1)
grammar, and their deﬁnition should be the ﬁrst step in writing such a parser.
We let First(α) be the set of all terminals that can begin strings derived from α (First(α) also
includes ε if α =⇒* ε).
To compute First(X) for all symbols X, apply the following rules until nothing else can be
added to any First set:
• if X ∈ T, then First(X) = {X}.
• if X → ε ∈ P, then add ε to First(X).
• if X → Y₁ . . . Yₙ ∈ P, then add all non-ε members of First(Y₁) to First(X). If Y₁ =⇒* ε
then also add all non-ε members of First(Y₂) to First(X). If Y₁ =⇒* ε and Y₂ =⇒* ε then
also add all non-ε members of First(Y₃) to First(X), etc. If all of Y₁, . . . , Yₙ =⇒* ε, then
add ε to First(X).
To compute First(X₁ . . . Xₘ) for any string X₁ . . . Xₘ, first add to First(X₁ . . . Xₘ) all non-ε
members of First(X₁). If ε ∈ First(X₁) then add all non-ε members of First(X₂) to
First(X₁ . . . Xₘ), etc. Finally, if every set First(X₁), . . . , First(Xₘ) contains ε then we also
add ε to First(X₁ . . . Xₘ).
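These fixed-point rules translate directly into code. The sketch below (Python rather than the notes' Pascal; the grammar representation, a dictionary from nonterminals to lists of productions, is a convention of this sketch) computes First for every symbol and for arbitrary strings:

```python
# Compute First(X) for every grammar symbol, and First(X1...Xm) for strings,
# by applying the rules above until nothing more can be added.
# A grammar is a dict mapping each nonterminal to its list of productions;
# each production is a list of symbols, with [] standing for an ε-production.

EPS = "ε"

def first_of_string(symbols, first):
    # First(X1...Xm): add non-ε members of First(Xi) while X1...Xi-1 derive ε.
    result = set()
    for sym in symbols:
        f = first.get(sym, {sym})
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)                # all of X1...Xm derive ε (or m = 0)
    return result

def first_sets(grammar):
    first = {}
    for prods in grammar.values():           # rule 1: First(a) = {a} for terminals
        for p in prods:
            for sym in p:
                if sym not in grammar:
                    first[sym] = {sym}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:                           # rules 2 and 3, to a fixed point
        changed = False
        for nt, prods in grammar.items():
            for p in prods:
                new = first_of_string(p, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first
```

The outer loop repeats until no First set grows, exactly as the rules above prescribe.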
From the definition of First, we can give a first condition for a grammar G = (T, N, S, P) to
be LL(1), as follows. For each nonterminal A ∈ N, suppose that
A → β₁ | β₂ | · · · | βₙ
are all the A-productions in P. For G to be LL(1) it is necessary, but not sufficient, for
First(βᵢ) ∩ First(βⱼ) = ∅
for each 1 ≤ i, j ≤ n where i ≠ j.
We let Follow(A) be the set of terminal symbols that can appear immediately to the right of
the nonterminal symbol A in a sentential form. If A can appear as the rightmost symbol of
a sentential form, then Follow(A) also contains the symbol $.
To compute Follow(A) for all nonterminals A, apply the following rules until nothing can be
added to any Follow set.
• Place $ in Follow(S).
• If there is a production A → αBβ, then place all non-ε members of First(β) in Follow(B).
• If there is a production A → αB, or a production A → αBβ where β =⇒* ε, then place
everything in Follow(A) into Follow(B).
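These rules are again a fixed-point computation. A Python sketch (self-contained, so it recomputes First internally; the dictionary-based grammar representation is a convention of the sketch, not of the notes):

```python
# Compute Follow(A) for every nonterminal by the three rules above.
# "ε" stands for the empty string, "$" for the end-of-input marker.

EPS, END = "ε", "$"

def follow_sets(grammar, start):
    # First sets, computed by the fixed-point method of Section 4.2.2.
    first = {nt: set() for nt in grammar}

    def first_of(symbols):
        out = set()
        for s in symbols:
            f = first[s] if s in grammar else {s}
            out |= f - {EPS}
            if EPS not in f:
                return out
        out.add(EPS)              # every symbol in the string derives ε
        return out

    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for p in prods:
                new = first_of(p)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True

    # Follow sets proper.
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)        # rule 1: $ goes into Follow(S)
    changed = True
    while changed:
        changed = False
        for a, prods in grammar.items():
            for p in prods:
                for i, b in enumerate(p):
                    if b not in grammar:
                        continue              # Follow is defined on nonterminals
                    f_beta = first_of(p[i + 1:])
                    new = f_beta - {EPS}      # rule 2: First(β) minus ε
                    if EPS in f_beta:         # rule 3: β is empty or may vanish
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow
```

Run on the example grammar of this section, it reproduces the Follow column of the table given there.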
Now, suppose a grammar G meets the first condition above. G is an LL(1) grammar if it also
meets the following condition. For each A ∈ N, let
A → β₁ | β₂ | · · · | βₙ
be all the A-productions. If ε ∈ First(βₖ) for some 1 ≤ k ≤ n (and notice, if the first condition
holds, there will be at most one such value of k) then it must be the case that
Follow(A) ∩ First(βᵢ) = ∅
for each 1 ≤ i ≤ n, i ≠ k.
Example. Let G have the following productions:
S → BAaC
A → D | E
B → b
C → c
D → aF
E → ε
F → f
First and Follow sets for this grammar are:
First(BAaC) = {b}    Follow(S) = {$}
First(D) = {a}       Follow(A) = {a}
First(E) = {ε}       Follow(B) = {a}
First(b) = {b}       Follow(C) = {$}
First(c) = {c}       Follow(D) = {a}
First(aF) = {a}      Follow(E) = {a}
First(f) = {f}       Follow(F) = {a}.
It is clear that G meets the first condition above, as
First(D) ∩ First(E) = {a} ∩ {ε} = ∅.
However, G does not meet the second condition because
Follow(A) ∩ First(D) = {a} ∩ {a} ≠ ∅.
Many non-LL(1) grammars can be transformed into LL(1) grammars by removing left recursion
and left factoring (see Section 4.2.5), to enable a predictive parser to be written for them.
However, although left factoring and elimination of left recursion are easy to implement, they
often make the grammar hard to read. Moreover, not every grammar can be transformed into
an LL(1) grammar, and for such grammars we need other methods of parsing.
4.2.3 Non-recursive Predictive Parsing
A significant aspect of LL(1) grammars is that it is possible to construct non-recursive predictive
parsers from an LL(1) grammar. Such parsers maintain a stack explicitly, rather than implicitly
via recursive calls. They also use a parsing table, which is a two-dimensional array M of elements
M[A, a] where A is a nonterminal symbol and a is either a terminal symbol or the end-of-string
symbol $. Each element M[A, a] of M is either an A-production or an error entry. Essentially,
M[A, a] says which production to apply when we need to match an A (in this case, A will be
on the top of the stack) and the next input symbol is a.
Constructing an LL(1) parsing table. To construct the parsing table M, we complete the
following steps for each production A → α of the grammar G:
• For each terminal a in First(α), let M[A, a] be A → α.
• If ε ∈ First(α), let M[A, b] be A → α for all terminals b in Follow(A). If ε ∈ First(α)
and $ ∈ Follow(A), let M[A, $] be A → α.
Let all other elements of M denote an error.
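A sketch of this construction in Python (the grammar, First and Follow dictionaries follow the conventions of the earlier sketches and are assumed precomputed as in Section 4.2.2):

```python
# Build the LL(1) parsing table as a dictionary mapping pairs
# (nonterminal, terminal-or-$) to productions, following the two steps
# above.  first is a dict of First sets for the nonterminals and follow a
# dict of Follow sets; productions are lists of symbols, [] standing for ε.

EPS, END = "ε", "$"

def build_table(grammar, first, follow):
    def first_of(symbols):                    # First of a string of symbols
        out = set()
        for s in symbols:
            f = first[s] if s in grammar else {s}
            out |= f - {EPS}
            if EPS not in f:
                return out
        return out | {EPS}

    table = {}
    for a, prods in grammar.items():
        for alpha in prods:
            f = first_of(alpha)
            cells = f - {EPS}                 # step 1: terminals in First(α)
            if EPS in f:
                cells |= follow[a]            # step 2: Follow(A), including $
            for t in cells:
                if (a, t) in table:           # two productions claim M[A, t]
                    raise ValueError("grammar is not LL(1)")
                table[(a, t)] = alpha
    return table
```

Note the ValueError branch: for a non-LL(1) grammar two productions compete for the same cell, which answers the question posed below.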
What happens if we try to construct such a table for a nonLL(1) grammar?
The LL(1) parsing algorithm. The non-recursive predictive parsing algorithm (common to
all LL(1) grammars) is shown below. Note the use of $ to mark the bottom of the stack and
the end of the input string.
set stack to $S (with S on top);
set input pointer to point to the first symbol of the input w$;
repeat the following step
(letting X be top of stack and a be current input symbol):
if X = a = $ then we have parsed the input successfully;
else if X = a (a token) then pop X off the stack and advance input pointer;
else if X is a token or $ but not equal to a then report an error;
else (X is a nonterminal) consult M[X,a]:
if this entry is a production X -> Y1...Yn then pop X from the stack
and replace it with Yn...Y1 (with Y1 on top);
output the production X -> Y1...Yn
else report an error
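The same loop in Python (a sketch; the table is a dictionary from (nonterminal, symbol) pairs to productions, as in the previous sketch, and the function returns the list of productions applied, which spells out a leftmost derivation):

```python
# The predictive parsing loop above.  table maps (nonterminal, lookahead)
# pairs to productions (lists of symbols); END marks both the bottom of
# the stack and the end of the input.

END = "$"

def ll1_parse(table, start, nonterminals, tokens):
    stack = [END, start]                  # $ at the bottom, S on top
    inp = list(tokens) + [END]
    pos, output = 0, []
    while True:
        x, a = stack[-1], inp[pos]
        if x == a == END:
            return output                 # parsed the input successfully
        if x == a:                        # top of stack matches the input token
            stack.pop()
            pos += 1
        elif x not in nonterminals:       # a token or $ that does not match a
            raise SyntaxError("expected " + x + ", saw " + a)
        elif (x, a) in table:             # consult M[X, a]
            stack.pop()
            prod = table[(x, a)]
            stack.extend(reversed(prod))  # push Yn ... Y1, so Y1 is on top
            output.append((x, prod))
        else:
            raise SyntaxError("no entry M[" + x + ", " + a + "]")
```

Applied to the example grammar and table that follow, on input cgfd, it applies exactly the sequence of productions shown in the trace.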
Example. Consider the grammar G with productions:
C → cD | Ed
D → EFC | e
E → ε | f
F → g
First we define the useful First and Follow sets for this grammar:
First(cD) = {c}       First(f) = {f}         Follow(C) = {$}
First(Ed) = {d, f}    First(g) = {g}         Follow(D) = {$}
First(EFC) = {f, g}   First(F) = {g}         Follow(E) = {d, g}
First(ε) = {ε}        First(C) = {c, d, f}   Follow(F) = {c, d, f}
This gives the following LL(1) parsing table:
       c         d         e        f          g          $
C    C → cD    C → Ed              C → Ed
D                        D → e    D → EFC    D → EFC
E              E → ε              E → f      E → ε
F                                            F → g
which, when used with the LL(1) parsing algorithm on input string cgfd, gives:
Stack Input Action
$C cgfd$ C → cD
$Dc cgfd$ match
$D gfd$ D → EFC
$CFE gfd$ E → ǫ
$CF gfd$ F → g
$Cg gfd$ match
$C fd$ C → Ed
$dE fd$ E → f
$df fd$ match
$d d$ match
$ $ accept.
4.2.4 Syntax-tree Construction for Recursive-Descent Parsers
In this section we extend the recursive-descent parser writing technique of Section 4.2.1 in order
to build syntax trees. We consider the construction of a r.d.p. for the following simple grammar,
which only allows assignments and if-statements.
⟨command list⟩ ::= ⟨command⟩ ⟨rest commands⟩
⟨rest commands⟩ ::= ; ⟨command list⟩ | null
⟨command⟩ ::= ⟨if command⟩ | ⟨assignment⟩
⟨if command⟩ ::= if ⟨expression⟩ then ⟨command list⟩ else ⟨command list⟩ fi
⟨assignment⟩ ::= identifier := ⟨expression⟩
⟨expression⟩ ::= ⟨atom⟩ ⟨rest expression⟩
⟨rest expression⟩ ::= + ⟨atom⟩ ⟨rest expression⟩ | null
⟨atom⟩ ::= identifier
In order to construct a predictive parser, we first list the important First and Follow sets.
These are (grouped for each of the eight nonterminals):
First(⟨command⟩ ⟨rest commands⟩) = {if, identifier}
First(; ⟨command list⟩) = {;}
First(null) = {ε}
Follow(⟨rest commands⟩) = {else, fi, $}
First(⟨if command⟩) = {if}
First(⟨assignment⟩) = {identifier}
First(if ⟨expression⟩ then ⟨command list⟩ else ⟨command list⟩ fi) = {if}
First(identifier := ⟨expression⟩) = {identifier}
First(⟨atom⟩ ⟨rest expression⟩) = {identifier}
First(+ ⟨atom⟩ ⟨rest expression⟩) = {+}
First(null) = {ε}
Follow(⟨rest expression⟩) = {;, else, fi, then, $}
First(identifier) = {identifier}
We will construct syntax trees rather than parse trees, since the former are much smaller and
simpler yet contain all the information required to enable straightforward code generation.
Notice that:
• command sequences are better constructed as lists rather than as trees (as would be natural
given the grammar);
• expressions are better constructed as syntax trees that properly reflect the structure of the
expression (with proper rules of precedence and associativity), rather than directly as given
by the grammar;
• the grammar fails to represent the fact that '+' should be left associative; i.e. x+y+z
is parsed as x+(y+z) rather than the correct (x+y)+z. This is not really a problem for
'+', but for '-' it would be, since 1-1-1 would be evaluated as 1-(1-1) = 1 rather than the
conventional (1-1)-1 = -1. In fact, it is impossible to write an LL(1) grammar that achieves
this task. Therefore, we will get the parser to construct the correct (left-associative) syntax
tree using a clever technique (see the code).
We first write some type definitions to represent arbitrary nodes in a syntax tree. Assume that
MAXC is a constant with value 3, which represents the maximum number of children a node can
have (if-commands need three children, for the condition, the then clause and the else clause).
{ There are command nodes and expression nodes.
Command nodes can be if or assign nodes; expression
nodes can be identifiers or operations }
type
node_kind = (n_command, n_expr);
command_kind = (c_if, c_assign);
expr_kind = (e_ident, e_operation);
tree_ptr = ^tree_node;
tree_node = record
children : array[1..MAXC] of tree_ptr;
kind : node_kind; { is it a command or expression node? }
ckind : command_kind; { what kind of command is it? }
ident : string[20]; { for assignment lhs and simple expressions }
sibling : tree_ptr; { the next command in the list }
ekind : expr_kind; { what kind of expression is it? }
operator : token { expression operator }
end;
Note that some of the ﬁelds of this record type will be redundant for certain types of node. For
example, only command nodes will have siblings and only expression nodes will have operators.
We could use variant records (or unions in C) but this would complicate the discussion slightly,
so we choose not to. Also note that we do not need a ﬁeld for operators, since we only have ‘+’
in the grammar; however, using this ﬁeld we can easily add extra operations to the grammar.
Recall from Section 4.2.1 that we write a procedure parse_A for each nonterminal A in the
grammar, the aim of which is to recognise strings derivable from A. To write a parser that
constructs syntax trees, we replace the procedure parse_A by a function whose additional role
is to construct and return a syntax tree for the recognised string. The code for the above
grammar is as follows.
We assume that the lexical analyser takes the form of a procedure get_next_token that updates
the variables current_token and current_identifier, and that we have a match procedure
that checks tokens. The tokens for the grammar are k_if, k_then, k_else, k_fi, assign,
semi_colon, identifier, end_of_input and error.
We ﬁrst deﬁne two auxiliary functions that the other functions will call in order to create nodes
for expressions or commands, respectively.
function new_expr_node(t : expr_kind) : tree_ptr;
{Construct a new expression node of kind t}
var
i : integer; temp : tree_ptr;
begin
new(temp);
for i := 1 to MAXC do temp^.children[i] := nil;
temp^.kind := n_expr;
temp^.ekind := t;
new_expr_node := temp
end;
function new_command_node(t : command_kind) : tree_ptr;
{Construct a new command node of kind t}
var
i : integer; temp : tree_ptr;
begin
new(temp);
for i := 1 to MAXC do temp^.children[i] := nil;
temp^.kind := n_command;
temp^.ckind := t;
temp^.sibling := nil;
new_command_node := temp
end;
Next, we deﬁne match and syntax error procedures:
procedure match(expected : token);
{Check the current token is as expected; if so,
read the next token in}
begin
if current_token = expected then
get_next_token
else
syntax_error('Unexpected token')
end;
procedure syntax_error(s : string);
{Output error message}
begin
writeln(s);
writeln('Exiting');
halt
end;
For the first three rules in the BNF (those for lists of commands) we have the following functions.
A linked list of commands is built using the tree nodes' sibling fields.
Note that in parse_command_list, we do not bother looking at the current token to see if it is
valid, since we know parse_command will do this. Notice also how the parse_rest_commands
function uses the set Follow(⟨rest commands⟩) = {else, fi, $} to determine when to use the
ε-production.
function parse_command_list : tree_ptr;
var
temp : tree_ptr;
begin
temp := parse_command;
temp^.sibling := parse_rest_commands;
parse_command_list := temp
end;
function parse_rest_commands : tree_ptr;
var
temp : tree_ptr;
begin
temp := nil;
if current_token = semi_colon then begin
match(semi_colon);
temp := parse_command_list
end else if not (current_token in [k_else,
k_fi, end_of_input]) then
syntax_error('Incomplete command');
parse_rest_commands := temp
end;
function parse_command : tree_ptr;
begin
case current_token of
k_if : parse_command := parse_if_command;
identifier : parse_command := parse_assignment;
else begin
syntax_error('Expected if or variable');
parse_command := nil
end
end
end;
The two types of command (if commands and assignments) are parsed using the following
functions.
function parse_assignment : tree_ptr;
var
temp : tree_ptr;
begin
temp := new_command_node(c_assign);
temp^.ident := current_identifier;
match(identifier);
match(assign);
temp^.children[1] := parse_expression;
parse_assignment := temp
end;
function parse_if_command : tree_ptr;
var
temp : tree_ptr;
begin
temp := new_command_node(c_if);
match(k_if);
temp^.children[1] := parse_expression;
match(k_then);
temp^.children[2] := parse_command_list;
match(k_else);
temp^.children[3] := parse_command_list;
match(k_fi);
parse_if_command := temp
end;
The final three functions handle expressions. Notice that the function parse_rest_expression
takes an argument comprising the (already parsed) left-hand-side expression, and also that it
uses the set Follow(⟨rest expression⟩) = {;, then, else, fi, $} to determine whether to use the
ε-production.
Follow through the functions for the case x+y+z to convince yourself that they return the
correct interpretation (x+y)+z.
function parse_expression : tree_ptr;
{Parse an expression, and force it to be left associative}
var
left : tree_ptr;
begin
left := parse_atom;
parse_expression := parse_rest_expression(left);
end;
function parse_rest_expression(left : tree_ptr) : tree_ptr;
{If there is an operator on the input, construct an expression
with lhs left (already parsed) by parsing rhs. Otherwise, return
left as the expression}
var
new_e : tree_ptr;
begin
if (current_token = plus) then begin
new_e := new_expr_node(e_operation);
new_e^.operator := current_token;
new_e^.children[1] := left;
get_next_token;
new_e^.children[2] := parse_atom;
parse_rest_expression := parse_rest_expression(new_e)
end else if (current_token in [semi_colon, end_of_input,
k_else, k_fi, k_then]) then
parse_rest_expression := left
else begin
syntax_error('Badly formed expression');
parse_rest_expression := nil
end
end;
function parse_atom : tree_ptr;
var
temp : tree_ptr;
begin
temp := nil;
if current_token = identifier then begin
temp := new_expr_node(e_ident);
temp^.ident := current_identifier;
match(identifier)
end else
syntax_error('Badly formed expression');
parse_atom := temp
end;
4.2.5 Operations on context-free grammars for top-down parsing
As we have seen, many grammars are not suitable for use in top-down parsing. In particular,
for predictive parsing only LL(1) grammars may be used. Many grammars can be transformed
into more suitable forms in order to make construction of top-down parsers possible. Two
useful techniques are left-factoring and removal of left recursion.
Left factoring. Many simple parsing algorithms (e.g. those we have studied so far) cannot
cope with grammars with productions such as
⟨if stmt⟩ → if ⟨bool⟩ then ⟨stmt list⟩ fi |
            if ⟨bool⟩ then ⟨stmt list⟩ else ⟨stmt list⟩ fi
where two or more productions for a given nonterminal share a common prefix. This is because
they have to make a decision on which production rule to use based only on the first symbol
of the rule.
We will not look at the general algorithm for left-factoring a grammar. For the above example,
however, we simply factor out the common prefix to obtain the equivalent grammar rules:
⟨if stmt⟩ → if ⟨bool⟩ then ⟨stmt list⟩ ⟨rest if⟩
⟨rest if⟩ → else ⟨stmt list⟩ fi | fi
Removing Immediate Left Recursion. A grammar is said to be immediately left recursive
if there is a production of the form A → Aα. Grammars for programming languages are often
immediately left recursive, or have general left recursion (see the next section). However, a large
class of parsing algorithms (specifically, top-down parsers) cannot cope with such grammars.
Here we shall show how immediate left recursion can be removed from a context-free grammar,
and in the next section we will see how general left recursion can be removed.
Consider a grammar G with productions A → Aα | β where β does not begin with an A.
We can make a new grammar G′ that defines the same language as G but without any left
recursion, by replacing the above productions with
A → βA′
A′ → αA′ | ε
More generally, consider a grammar G with productions
A → Aα₁ | Aα₂ | · · · | Aαₙ | β₁ | β₂ | · · · | βₘ
where none of β₁, . . . , βₘ begin with an A. We can make a new grammar G′ without left
recursion by replacing the above productions with
A → β₁A′ | β₂A′ | · · · | βₘA′
A′ → α₁A′ | α₂A′ | · · · | αₙA′ | ε
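The transformation can be sketched in Python (productions are lists of symbols, [] standing for ε; naming the new nonterminal by appending an apostrophe is a convention of this sketch):

```python
# Remove immediate left recursion from one nonterminal's productions:
# A -> Aα1 | ... | Aαn | β1 | ... | βm   becomes
# A -> β1A' | ... | βmA'  and  A' -> α1A' | ... | αnA' | ε.

def remove_immediate_lr(nt, prods):
    alphas = [p[1:] for p in prods if p and p[0] == nt]   # the A -> Aαi
    betas = [p for p in prods if not p or p[0] != nt]     # the A -> βj
    if not alphas:
        return {nt: prods}          # no immediate left recursion to remove
    new = nt + "'"
    return {nt: [b + [new] for b in betas],               # A  -> βj A'
            new: [a + [new] for a in alphas] + [[]]}      # A' -> αi A' | ε
```

For example, A → Ax | y | z becomes A → yA′ | zA′ with A′ → xA′ | ε.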
Removing Left Recursion. A grammar is said to be left recursive if A =⇒+ Aα for some
nonterminal A. Let G be a grammar that has no ε-productions and no cycles. (A grammar G
has a cycle if A =⇒+ A for some nonterminal A.) The following is an algorithm that removes
left recursion from G.
Arrange the nonterminals in some order A₁, . . . , Aₙ
for i := 1 to n do
    for j := 1 to i-1 do
        replace each production of the form Aᵢ → Aⱼγ with
        Aᵢ → δ₁γ | · · · | δₖγ, where Aⱼ → δ₁ | · · · | δₖ are
        all the current Aⱼ-productions
    od
    remove any immediate left recursion from all current Aᵢ-productions
od
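The algorithm can be sketched in Python as follows (productions are lists of symbols, [] standing for ε; the immediate-left-recursion helper is repeated so the sketch is self-contained):

```python
# The left-recursion-removal algorithm above.  order fixes the arrangement
# of the nonterminals A1, ..., An.

def remove_immediate_lr(nt, prods):
    # A -> Aα | β  becomes  A -> βA', A' -> αA' | ε  (previous section).
    alphas = [p[1:] for p in prods if p and p[0] == nt]
    betas = [p for p in prods if not p or p[0] != nt]
    if not alphas:
        return {nt: prods}                       # nothing to remove
    new = nt + "'"
    return {nt: [b + [new] for b in betas],
            new: [a + [new] for a in alphas] + [[]]}

def remove_left_recursion(grammar, order):
    g = {nt: [list(p) for p in prods] for nt, prods in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # replace Ai -> Aj γ by Ai -> δ1 γ | ... | δk γ, where the δs
            # are the right-hand sides of the current Aj-productions
            new_prods = []
            for p in g[ai]:
                if p and p[0] == aj:
                    new_prods += [d + p[1:] for d in g[aj]]
                else:
                    new_prods.append(p)
            g[ai] = new_prods
        # then remove any immediate left recursion from the Ai-productions
        result = remove_immediate_lr(ai, g[ai])
        del g[ai]
        g.update(result)
    return g
```

Running it on the worked example at the end of this section reproduces the result derived there by hand.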
In order to understand how the algorithm works, we begin by considering how G might be left
recursive. First, we arrange the nonterminals into some order A₁, . . . , Aₙ. In order for G to be
left recursive, it must be that Aᵢ =⇒+ Aᵢα for some i and some α. Now consider an arbitrary
Aᵢ-production Aᵢ → β. Clearly, there are two possibilities for β:
(1) β is of the form β = aγ for some terminal a, or
(2) β is of the form β = Aⱼγ for some nonterminal Aⱼ.
Notice that if every Aᵢ-production is of the form (1) then any derivation from Aᵢ will yield a
string that starts with a terminal symbol. Since Aᵢ is a nonterminal, it is impossible to have
Aᵢ =⇒+ Aᵢα for any α, contradicting the assumption of G's left recursiveness. Thus, it must
be that at least one Aᵢ-production is of the form Aᵢ → Aⱼγ. Suppose that j > i for every such
production (for all nonterminals Aᵢ): then it must be that whenever Aᵢ derives a string α and
the leftmost symbol of α is a nonterminal Aₖ, then k > i. Since this means k ≠ i, G cannot be
left recursive.
The way the algorithm works is to transform the productions of G in such a way that the new
productions are either of the form Aᵢ → aα or of the form Aᵢ → Aⱼα where j > i, and thus
the new grammar cannot be left recursive.
In more detail, the algorithm considers the Aᵢ-productions for i = 1, 2, . . . , n. The case i = 1
is a special case, since if A₁ → Aⱼα then j ≥ i. If j > i then the production is in the right form,
and if j = i then we have an instance of immediate left recursion that can be removed as in
the previous section.
For i = 2, . . . , n, it is possible to have Aᵢ-productions of the form Aᵢ → Aⱼα with j < i. For
each of these productions the algorithm substitutes for Aⱼ each possible right-hand side of an
Aⱼ-production. That is, if Aⱼ → δ₁ | · · · | δₚ are all the Aⱼ-productions, then the algorithm
replaces Aᵢ → Aⱼγ with Aᵢ → δ₁γ | · · · | δₚγ. Now, since the algorithm has already processed
the Aⱼ-productions (think inductively here!), it must be the case that if the leftmost symbol of
δᵣ (r = 1, . . . , p) is a nonterminal Aₖ say, then k > j (because δᵣ is the right-hand side of an
Aⱼ-production). Thus the lowest index of any nonterminal occurring as the leftmost symbol of
the right-hand side of any Aᵢ-production has been increased from j to k.
We require this lowest index to be greater than (or equal to) i. If k is already greater than (or
equal to) i then the production Aᵢ → δᵣγ is of the right form (we can remove the immediate left
recursion if k = i). Otherwise, j < k < i. Now, k − j iterations of the inner loop later, j will
be equal to k, and so the algorithm will then consider Aᵢ-productions of the form Aᵢ → Aₖα.
As before, since k < i the algorithm has already considered the Aₖ-productions, and so all the
Aₖ-productions whose right-hand sides start with a nonterminal Aₗ will have l > k. Thus, on
substituting these right-hand sides for Aₖ in Aᵢ → Aₖα, we obtain productions whose right-hand
sides begin with either a terminal symbol or a nonterminal Aₗ with l > k. Thus the lowest index of
any nonterminal occurring as the leftmost symbol of the right-hand side of any Aᵢ-production
has now been increased from j to k to l.
Continuing in this way, we see that by the time we have dealt with the case j = i − 1 we
must arrive at a situation where the lowest index has been increased to i or greater.
Example. Let us remove the left recursion from the following productions:
A₁ → a | A₂b
A₂ → c | A₃d
A₃ → e | A₄f
A₄ → g | A₂h
Consider what happens when we execute the above algorithm on these productions. First, the
nonterminals are already ordered, so we proceed with the outer loop. For each value of i the
algorithm modifies the set of Aᵢ-productions; this set may grow as the algorithm proceeds.
Consider the Aᵢ-productions for i = 1, 2, 3. Notice each is of the form Aᵢ → α where the
leftmost symbol of α is either a terminal, or is Aⱼ for j > i. Thus the algorithm does nothing
for i = 1, 2, 3.
At i = 4, there are no A₄-productions of the form A₄ → A₁γ and so nothing happens for j = 1.
At j = 2, A₄ → A₂γ is a production for γ = h. The current A₂-productions are A₂ → c | A₃d,
and so we replace A₄ → A₂h with A₄ → ch | A₃dh. This ends the case j = 2, and the
A₄-productions are currently A₄ → g | ch | A₃dh.
At j = 3, we see that A₄ → A₃γ is a current A₄-production for γ = dh. The current A₃-productions
are A₃ → e | A₄f, and so A₄ → A₃dh is replaced with A₄ → edh | A₄fdh. This ends the case
j = 3 and the A₄-productions are now A₄ → g | ch | edh | A₄fdh.
This ends the inner loop, and so we remove the immediate left recursion from A₄'s productions,
resulting in:
A₄ → gA₄′ | chA₄′ | edhA₄′
A₄′ → fdhA₄′ | ε.
4.3 Bottom-up Parsers
Bottom-up parsing can be thought of as the process of creating a parse tree for a given input
string by working upwards from the leaves towards the root. The most well-known form of
bottom-up parsing is shift-reduce parsing, to which we will restrict our attention; this is
a non-recursive, non-backtracking method. Shift-reduce parsing works by matching substrings
of the input string against productions of a grammar G and then replacing such substrings
with the appropriate nonterminal, so that the input string (provided it is in L(G)) is eventually
'reduced' to the start symbol S. In contrast to top-down parsing, which traces out a leftmost
derivation, shift-reduce parsing aims to yield a rightmost derivation (in reverse).
Example. Consider a grammar G with the following productions:
A → ab | BAc
B → CA
C → d | Ba | abc.
Now consider the input string w = dababc. The d matches C → d so w can be reduced to
w₁ = Cababc. Next, the first ab matches A → ab so w₁ can be reduced to w₂ = CAabc. Next,
CA matches B → CA so w₂ can be reduced to w₃ = Babc. Now ab matches A → ab so w₃ can
be reduced to w₄ = BAc. Finally, w₄ can be reduced to A (the start symbol) by the production
A → BAc. We have reduced w to the start symbol, and reversing this process does indeed
yield a rightmost derivation of w from A:
A =⇒ BAc =⇒ Babc =⇒ CAabc =⇒ Cababc =⇒ dababc.
There are other sequences of reductions that reduce w to A but they need not yield rightmost
derivations. For example, we could have chosen to reduce w₂ to CAAc (and then to BAc), but
this yields:
A =⇒ BAc =⇒ CAAc =⇒ CAabc =⇒ Cababc =⇒ dababc.
Further, notice that not all sequences of reductions lead to the start symbol. For example, w₂
can be reduced to CAC, which can only be reduced to BC.
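The reduction choices above can be explored mechanically. The Python sketch below is not a parser: it blindly searches all possible reductions to see whether some sequence reaches the start symbol, illustrating that reductions form a search space in which only some paths succeed (a real shift-reduce parser must make the right choices deterministically):

```python
# Brute-force search over reduction sequences for the example grammar
# above.  A reduction replaces a substring matching a production's
# right-hand side by its left-hand side; we search for a sequence of
# reductions that turns the input string into the start symbol.

PRODS = [("A", "ab"), ("A", "BAc"), ("B", "CA"), ("C", "d"),
         ("C", "Ba"), ("C", "abc")]

def reduces_to_start(s, start="A", seen=None):
    if seen is None:
        seen = set()
    if s == start:
        return True
    if s in seen:                  # already explored (or being explored)
        return False
    seen.add(s)
    for lhs, rhs in PRODS:
        i = s.find(rhs)
        while i != -1:             # try every occurrence of this rhs
            if reduces_to_start(s[:i] + lhs + s[i + len(rhs):], start, seen):
                return True
            i = s.find(rhs, i + 1)
    return False
```

The string dababc reduces to A (by the sequence shown above, among others), as does the sentential form CAabc, while a string such as da gets stuck at Ca.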
General idea of shift-reduce parsing. A shift-reduce parser uses a stack in a similar way
to LL(1) parsers. The stack begins empty, and a parse is complete when S appears as the only
symbol on the stack. The parser has two possible actions (besides 'accept' and 'error') at each
step:
• To shift the current input symbol onto the top of the stack.
• To reduce a string α at the top of the stack to the nonterminal A, given
a production A → α.
Consider the above grammar and the input string dababc. The actions that a shift-reduce
parser might make are:
Stack Input Action
dababc$ shift
d ababc$ reduce C → d
C ababc$ shift
Ca babc$ shift
Cab abc$ reduce A → ab
CA abc$ reduce B → CA
B abc$ shift
Ba bc$ shift
Bab c$ reduce A → ab
BA c$ shift
BAc $ reduce A → BAc
A $ accept.
It is important to note that this is only an illustration of what a shift-reduce parser might do; it
does not show how the parser makes decisions on what actions to take. The main problem with
shift-reduce parsing is deciding when to shift and when to reduce (notice that in the example,
we do not reduce the top of the stack every time this is possible; e.g. we did not reduce the Ba to
C when the stack contained Bab, as this would have given an incorrect parse). A shift-reduce
parser looks at the contents of the stack and at (some of) the input to decide what action to
take. Different types of shift-reduce parsers use the available stack/input in different ways, as
we will see in the next section.
Handles. The difficulty in writing a shift-reduce parser begins with deciding when reducing a
substring of an input string will result in a (reverse) step in a rightmost derivation.
Informally, a handle of a string is a substring that matches the right-hand side of a production,
and whose reduction represents one (reverse) step in a rightmost derivation from the start
symbol.
Formally, a handle of a right-sentential form γ is a pair h = (A → β, p) comprising a production
A → β and a position p indicating where the substring β may be found and replaced by A
to produce the previous right-sentential form in a rightmost derivation of γ. That is, if
S =⇒* αAw =⇒ αβw = γ is a rightmost derivation, then (A → β, p) is a handle of αβw when
p marks the position immediately after α.
For example, with reference to the example above, (A → ab, 2) is a handle of Cababc since
Cababc is a right-sentential form, and replacing ab by A at position 2 yields CAabc, which is the
predecessor of Cababc in its rightmost derivation. In contrast, (A → ab, 4) is not a handle of
Cababc: although CabAc =⇒ Cababc is a rightmost derivation step, it is not possible to derive
CabAc from the start symbol with a rightmost derivation.
An important fact underlying the shift/reduce method of parsing is that handles will always
appear on the top of the stack. In other words, if there is a handle to reduce at all, then the last
symbol of the handle must be on top of the stack; it is then a matter of tracing down through
the stack to ﬁnd the start of the handle.
Viable Prefixes. The second technical notion that we require in order to understand how a
shift/reduce parser can be made to make choices that lead to a rightmost derivation is that of
a viable prefix.
A viable prefix is a prefix of a right-sentential form that does not continue past the right-hand
end of the rightmost handle of that sentential form. For the example, the viable prefixes of
Cababc are ε, C, Ca and Cab, since the rightmost handle of Cababc is the first occurrence of ab.
The significance of viable prefixes is that provided the input seen up to a given point can be
reduced to a viable prefix (as is always possible initially, since ε is always a viable prefix), it is
always possible to add more symbols to yield a handle at the top of the stack.
4.3.1 LR(k) Parsing
An LR(k) grammar is one for which we can construct a shift/reduce parser that reads input
left-to-right and builds a rightmost derivation in reverse, using at most k symbols of lookahead.
There are situations where a shift/reduce parser cannot resolve conflicts even knowing the entire
contents of the stack and the next k input symbols: such grammars, which include all ambiguous
grammars, are called non-LR.
However, shift/reduce parsing can be adapted to cope with ambiguous grammars, and moreover
LR parsing (i.e. LR(k) parsing for some k ≥ 0) has a number of advantages. Most importantly,
LR parsers can be constructed to recognise virtually all programming language constructs for
which context-free grammars can be devised (including all LL(k) grammars). In fact, it is rarely
the case that more than one lookahead character is required. Also, the LR parsing method can
be implemented efficiently, and LR parsers can detect syntax errors as soon as theoretically
possible.
The process of constructing an LR parser is, however, relatively involved and, in general, too
difficult to do by hand. For this reason parser generators have been devised to automate the
process of constructing an LR parser where this is possible. One well-known such tool is the
program yacc. The remainder of this section is devoted to explaining the LR(k) parsing method
in order to provide a thorough understanding of the ideas behind this tool.
4.3.2 LR, SLR and LALR parsing
The LR parsing algorithm is a stack-based shift/reduce parsing algorithm involving a parsing
table that dictates an appropriate action for a given state of the parse. We will look at four
different kinds of LR parser that use at most one symbol of lookahead. Each kind is an instance
of the general LR parsing algorithm but is distinguished by the kind of parsing table employed.
The four types are:
• LR(0) parsers. As implied by the 0, the actions taken by LR(0) parsers are not inﬂuenced
by lookahead characters.
• SLR(1) parsers (“S” meaning “simple”).
• LALR(1) parsers (“LA” meaning “lookahead”).
• LR(1) parsers.
The grammars for which each of the types of parsers may be constructed is described by the
hierarchy:
LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1)
such that every LR(0) grammar is also an SLR(1) grammar, etc. Roughly speaking, the more
general the parser type, the more difficult the (manual) construction of the parsing table (i.e.
LR(0) parsing tables are the easiest to construct, LR(1) tables the most difficult). Finally,
although the parsing tables for LR(0), SLR(1) and LALR(1) grammars are essentially the
same size, LR(1) parsing tables can be extremely large. Because many programming language
constructs cannot be described by SLR(1) grammars (and hence not by LR(0) grammars),
but almost all can be described (with effort) by LALR(1) grammars, it is the LALR(1) class
of parser that has proven the most popular for bottom-up parsing; yacc, for example, is an
LALR(1) parser generator.
We will first look at the LR parsing algorithm (or 'LRP' for short) and then look at how to
construct the parsing table for each of the four types.
4.3.3 The LR Parsing Algorithm
The LRP is essentially a shift/reduce parser that is augmented with states. The stack holds
strings of the form

s0 X1 s1 X2 s2 . . . Xm sm

(with sm on the top), where X1, . . . , Xm are grammar symbols and s0, . . . , sm are states. The
idea behind a state in the stack is that it summarises the information in the stack below it:
to be more precise, a state represents a viable preﬁx that has occurred before that state was
reached. One of these states represents ‘nothing has yet been read’ (note that the empty string
is always a viable preﬁx) and is designated the initial state. In fact, this initial state is the
initial state of a DFA that recognises viable preﬁxes as we shall see below.
The states are used by the LRP's goto function: this is a function that takes a state and a
grammar symbol as arguments and produces a new state; the idea is that if goto(s, X) = s′ then
s′ represents the viable prefix obtainable by appending an X to the viable prefix represented by
s. (In other words, the goto function is the transition function of the DFA mentioned above.)
The behaviour of the LRP is best described in terms of configurations; that is, pairs

(s0 X1 s1 X2 s2 . . . Xm sm, ai . . . an)

comprising the contents of the stack and the input yet to be read. The idea behind such
a configuration is that it represents the LRP having up to this point constructed the right-
sentential form X1 X2 . . . Xm ai . . . an, and being about to read ai.
The LRP starts with the initial state on its stack and the input pointer at the start of the
input. It then repeatedly looks at the current (topmost) state and input symbol and takes the
appropriate action before considering the next state and input, until processing is complete.
For a given configuration such as the above (with current state sm and input symbol ai), the
action action(sm, ai) taken by the LRP is one of the following:
• action(sm, ai) = shift s. Here the parser puts ai and s onto the stack (in that order)
and advances the input pointer. Thus from the configuration above, the LRP enters the
configuration:

(s0 X1 s1 X2 s2 . . . Xm sm ai s, ai+1 . . . an).
• action(sm, ai) = reduce by A → β. This action occurs when the top r (say) grammar
symbols constitute β; that is, when Xm−r+1 . . . Xm = β. When this does occur, the
topmost 2r symbols are replaced with A and s = goto(sm−r, A) (in that order); thus, in
this case the configuration becomes:

(s0 X1 s1 X2 s2 . . . Xm−r sm−r A s, ai . . . an).
• action(sm, ai) = accept. Here parsing is complete and the LRP terminates.
• action(sm, ai) = error. Here a syntax error has been detected and the LRP calls an error
recovery routine.
4.3.4 LR(0) State Sets and Parsing
We begin our discussion of the diﬀerent types of LR parser by looking at LR(0) parsers. As
implied by the ‘0’, an LR(0) parser makes decisions on whether to shift or reduce based purely
on the contents of the stack, not on lookahead. Although few useful grammars are LR(0), the
construction of LR(0) parsing tables illustrates most of the concepts that we will need to build
SLR(1), LR(1) and LALR(1) parsing tables.
As mentioned previously, the idea behind the ‘states’ on the stack of an LR parser is that a
state represents a viable preﬁx that has occurred on the stack up to a given point in the parse.
We begin to make this idea more precise by defining an LR(0) item (or 'item' for short) to
be a string of the form A → α • β, where A → αβ is a production. (If A → ε, that is, the
right-hand side is empty, then A → • is the only item for this production.) Thus, for example,
the production A → XY yields the three items A → •XY , A → X • Y and A → XY •.
In general, the dot (•) is a marker denoting the current position in parsing a string; that is, an
item of the form A → α • β represents a parser having parsed α and about to (try to) parse
β.
We begin the process of constructing LR(0) parsing tables by augmenting the grammar G with
a new production S′ → S (where S′ is a new start symbol) to form an augmented grammar
G′. The reason for this is that we can use the item S′ → •S to represent the start of the parse
(that is, the initial state).
Item Closures. Notice that if A → α • Bβ is an item, this represents a state in a shift/reduce
parser where α is the current viable prefix on the top of the stack, and we expect to see (a string
derivable from) Bβ in the input. However, if B → γ is a production then (a string derivable
from) γ is also acceptable input at this point.
Thus, in this situation A → α • Bβ and B → •γ are equivalent in the sense that α is a viable
prefix for sentential forms derivable from either item. In order to compute all the items of the
parse equivalent to a given item (or set of items), we define the closure of a set of items I for a
grammar using the following algorithm:
closure(I)
let J := I
repeat
for each item A → α • Xβ in J
for each production X → γ, add X → •γ to J
until J does not change
return J
The reason why the closure operation is deﬁned on sets of items will be clearer later.
State Transitions. If the parser is in a certain state and then reads a symbol X, then
intuitively the parser will change state in the sense that the set of strings it expects to see
having read the X will be diﬀerent (in general) to the set of strings the parser was expecting
to see before the X. More formally, if A → α • Xβ is an item then reading an X results in the
new item A → αX • β (and all the items equivalent to it).
We formalise these transitions between states of the parse with the function goto(I, X) that
takes a set of items and a grammar symbol X as input, and returns a set of (equivalent) items
as output, using the following algorithm:
goto(I, X)
let J be the empty set
for each item A → α • Xβ in I
add A → αX • β to J
return closure(J)
LR(0) State Set Construction. We now construct all possible states that an LR(0) parser
for an LR(0) grammar G = (T, N, S, P) can be in. Initially, the parser will be in the state
closure({S′ → •S}).
We construct all other states from this initial state by using the goto and closure operations
above, until no new states can be generated. The following algorithm accomplishes this task:
C := {closure({S′ → •S})}
repeat
for each state I ∈ C
for each item A → α • Xβ in I
let J be goto(I, X)
add J to C
until C does not change
return C
Example. Consider the LR(0) grammar

S → (S) | a

which we augment with the production S′ → S. We begin construction of the LR(0) state set
by computing the closure of the item set {S′ → •S},

{S′ → •S, S → •(S), S → •a}

which we will refer to as I0. So C = {I0} after the initial step of the algorithm. In the next
step of the algorithm we compute goto(I0, S), goto(I0, a) and goto(I0, ():

goto(I0, S) = closure({S′ → S•}) = {S′ → S•} = I1 (say)
goto(I0, a) = closure({S → a•}) = {S → a•} = I2 (say)
goto(I0, () = closure({S → (•S)}) = {S → (•S), S → •(S), S → •a} = I3 (say)

and thus C = {I0, I1, I2, I3} after the first iteration of the repeat loop.
On the next iteration (noticing that neither I1 nor I2 contains items of the form A → α • Xβ)
we consider I3 and compute goto(I3, S), goto(I3, a) and goto(I3, ():

goto(I3, S) = closure({S → (S•)}) = {S → (S•)} = I4 (say)
goto(I3, a) = closure({S → a•}) = I2
goto(I3, () = closure({S → (•S)}) = I3

and now C = {I0, I1, I2, I3, I4}. On the next iteration, we need only compute goto(I4, )):

goto(I4, )) = closure({S → (S)•}) = {S → (S)•} = I5 (say)

and now C = {I0, I1, I2, I3, I4, I5}. It is clear that a further iteration of the loop will not yield
any further states, so we have finished.
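The state-set construction above can be sketched directly in Java. This is an illustrative sketch under simplifying assumptions (items are encoded as plain strings with '.' as the position marker, grammar symbols are single characters, and the bracket grammar of the example is hard-coded), not a general implementation:

```java
import java.util.*;

public class Lr0States {
    // Productions of the example grammar S -> (S) | a, augmented with S' -> S.
    // An item is a string with '.' marking the parse position, e.g. "S'->.S".
    static String[][] prods = {{"S'", "S"}, {"S", "(S)"}, {"S", "a"}};

    // closure(I): repeatedly add X -> .gamma for every symbol X after a dot.
    static Set<String> closure(Set<String> items) {
        Set<String> j = new TreeSet<>(items);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String it : new ArrayList<>(j)) {
                int dot = it.indexOf('.');
                if (dot + 1 < it.length()) {
                    String x = String.valueOf(it.charAt(dot + 1));
                    for (String[] p : prods)
                        if (p[0].equals(x))
                            changed |= j.add(p[0] + "->." + p[1]);
                }
            }
        }
        return j;
    }

    // goto(I, X): move the dot over X in every item that allows it.
    static Set<String> gotoSet(Set<String> items, char x) {
        Set<String> j = new TreeSet<>();
        for (String it : items) {
            int dot = it.indexOf('.');
            if (dot + 1 < it.length() && it.charAt(dot + 1) == x)
                j.add(it.substring(0, dot) + x + "." + it.substring(dot + 2));
        }
        return closure(j);
    }

    public static void main(String[] args) {
        Set<Set<String>> states = new HashSet<>();
        Deque<Set<String>> work = new ArrayDeque<>();
        Set<String> i0 = closure(new TreeSet<>(Set.of("S'->.S")));
        states.add(i0);
        work.push(i0);
        for (Set<String> i; (i = work.poll()) != null; )
            for (char x : "Sa()".toCharArray()) {
                Set<String> j = gotoSet(i, x);
                if (!j.isEmpty() && states.add(j)) work.push(j);
            }
        System.out.println(states.size());   // the six states I0 .. I5
    }
}
```

For the bracket grammar this prints 6, matching the six states computed by hand above.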
The states and goto functions can be depicted as a DFA, with the sets Ii as the DFA states
and goto as its transition function.
Constructing LR(0) Parsing Tables. We construct an LR(0) parsing table as follows:
(1) Construct the set {I0, . . . , In} of states for the augmented grammar G′.
(2) The parsing actions for state Ii are determined as follows:
(a) If A → α • aβ ∈ Ii and goto(Ii, a) = Ij then action(i, a) = shift j.
(b) If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ T ∪ {$}.
(c) If S′ → S• ∈ Ii then action(i, $) = accept.
(3) The goto actions for state i are given directly by the goto function; that is,
if goto(Ii, A) = Ij then let goto(i, A) = j.
(4) All entries not defined by the above rules are made error.
(5) The initial state is the one constructed from {S′ → •S}.
Rule (a) above says that, if we have parsed α and can see the terminal symbol a, then we
should shift the symbol onto the stack and go into state j.
Rule (b) says that, if we have finished parsing the right-hand side α of a production rule, then
we should reduce by that production rule for all terminal symbols.
Rule (c) says that, if we have successfully parsed the start symbol then we have ﬁnished.
This method of creating a parser will fail if there are any conﬂicts in the above rules, and
grammars leading to such conﬂicts are not LR(0).
Example. For the example grammar we obtain the following parsing table, where sj means
shift j, rj means reduce using production rule j (where rule 1 is S → (S) and rule 2 is S → a),
and ok means ‘accept’.
action goto
State a ( ) $ S
0 s2 s3 1
1 ok
2 r2 r2 r2 r2
3 s2 s3 4
4 s5
5 r1 r1 r1 r1
A trace of the parse of the string ((a)) is as follows:
Stack Input Action
0 ((a))$ shift 3
0(3 (a))$ shift 3
0(3(3 a))$ shift 2
0(3(3a2 ))$ reduce S → a
0(3(3S4 ))$ shift 5
0(3(3S4)5 )$ reduce S → (S)
0(3S4 )$ shift 5
0(3S4)5 $ reduce S → (S)
0S1 $ accept
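The LR parsing algorithm driven by this table can be sketched in Java. The action and goto entries are transcribed from the parsing table above; everything else (the class name, the encoding of actions as strings such as "s2" and "r1") is illustrative:

```java
import java.util.*;

public class Lr0Driver {
    // action table for S -> (S) | a: (state, symbol) -> "sN", "rN" or "ok".
    static Map<String, String> action = new HashMap<>();
    static Map<String, Integer> gotoTab = new HashMap<>();
    static String[] lhs = {"", "S", "S"};   // rule number -> left-hand side
    static int[] rhsLen = {0, 3, 1};        // rule number -> length of rhs
    static {
        action.put("0a", "s2"); action.put("0(", "s3");
        action.put("1$", "ok");
        for (char c : "a()$".toCharArray()) action.put("2" + c, "r2");
        action.put("3a", "s2"); action.put("3(", "s3");
        action.put("4)", "s5");
        for (char c : "a()$".toCharArray()) action.put("5" + c, "r1");
        gotoTab.put("0S", 1); gotoTab.put("3S", 4);
    }

    static boolean parse(String input) {
        input += "$";
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int pos = 0;
        while (true) {
            String act = action.get(states.peek() + "" + input.charAt(pos));
            if (act == null) return false;               // error entry
            if (act.equals("ok")) return true;           // accept
            if (act.charAt(0) == 's') {                  // shift: push new state
                states.push(act.charAt(1) - '0');
                pos++;
            } else {                                     // reduce by rule r
                int r = act.charAt(1) - '0';
                for (int i = 0; i < rhsLen[r]; i++) states.pop();
                states.push(gotoTab.get(states.peek() + lhs[r]));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("((a))"));   // true: the trace above
        System.out.println(parse("(a)("));    // false: syntax error
    }
}
```

Note that only the states need to be stacked: the grammar symbols Xi shown in the trace are implied by the states and are kept there only to make the trace readable.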
Notice from the rules for constructing an LR(0) parsing table that, for a grammar to be LR(0),
if a state contains an item of the form A → α• (signifying that α should be reduced
to A) then it cannot also contain another item of the form:
(a) B → β•, or
(b) B → β • aγ.
In case (a), the parser cannot decide whether to reduce using rule A → α or rule B → β;
this is known as a reduce-reduce conflict. In case (b), the parser cannot decide whether
to reduce using rule A → α or to shift (by the item B → β • aγ); this is known as a shift-reduce
conflict. A grammar is therefore LR(0) if, and only if, every state either contains no item of
the form A → α•, or consists of exactly one such item.
4.3.5 SLR(1) Parsing
SLR(1) parsing is more powerful than LR(0) because it consults (through the way an SLR(1)
parsing table is constructed) the next input token to direct its actions. The construction of an
SLR(1) table begins with the LR(0) state set construction, as for LR(0) parsers.
Let us consider an example. We first construct the LR(0) state set. We then show that the
grammar is not LR(0) due to a shift-reduce conflict. We then show how to construct SLR(1)
parsing tables in general and apply this to our example.
Example. Consider the following productions for a grammar for arithmetic expressions.
(1) E → E + T
(2) E → T
(3) T → T ∗ F
(4) T → F
(5) F → (E)
(6) F → a
augmented with the production E′ → E.
The state set construction proceeds as follows. We begin by constructing the initial state:

I0 = closure({E′ → •E})
   = {E′ → •E, E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •(E), F → •a}.

Thus C = {I0} after the initial step of the algorithm.
Next, we see that we need to compute goto(I0, X) for X = a, (, E, T, F:

goto(I0, a) = closure({F → a•}) = {F → a•} = I1 (say)
goto(I0, () = closure({F → (•E)})
            = {F → (•E), E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •(E), F → •a}
            = I2 (say)
goto(I0, E) = closure({E′ → E•, E → E • +T}) = {E′ → E•, E → E • +T} = I3 (say)
goto(I0, T) = closure({E → T•, T → T • ∗F}) = {E → T•, T → T • ∗F} = I4 (say)
goto(I0, F) = closure({T → F•}) = {T → F•} = I5 (say).

Thus now C = {I0, I1, I2, I3, I4, I5}.
On the next iteration of the loop we need to consider I1, I2, I3, I4, I5. First, notice that there
are no items in I1 that have any symbols following the marker, and hence goto(I1, X) = ∅ for
each possible symbol X. We now compute goto(I2, X) for X = a, (, E, T, F as follows:

goto(I2, a) = closure({F → a•}) = I1
goto(I2, () = closure({F → (•E)}) = I2
goto(I2, E) = closure({F → (E•), E → E • +T}) = {F → (E•), E → E • +T} = I6 (say)
goto(I2, T) = closure({E → T•, T → T • ∗F}) = I4
goto(I2, F) = closure({T → F•}) = I5.

For I3 we only need to compute goto(I3, +):

goto(I3, +) = closure({E → E + •T})
            = {E → E + •T, T → •T ∗ F, T → •F, F → •(E), F → •a}
            = I7 (say).

For I4 we need only compute goto(I4, ∗):

goto(I4, ∗) = closure({T → T ∗ •F}) = {T → T ∗ •F, F → •(E), F → •a} = I8 (say).

We also see that goto(I5, X) = ∅ for all X, as in the case of I1. Thus, after the second iteration
of the loop we have C = {I0, I1, I2, I3, I4, I5, I6, I7, I8}.
In the next iteration we need to consider I6, I7, I8. For I6, we need to compute goto(I6, +)
and goto(I6, )):

goto(I6, +) = closure({E → E + •T}) = I7
goto(I6, )) = closure({F → (E)•}) = {F → (E)•} = I9 (say).

For I7, we need to compute goto(I7, X) for X = a, (, T, F:

goto(I7, a) = closure({F → a•}) = I1
goto(I7, () = closure({F → (•E)}) = I2
goto(I7, T) = closure({E → E + T•, T → T • ∗F}) = {E → E + T•, T → T • ∗F} = I10 (say)
goto(I7, F) = closure({T → F•}) = I5.

For I8, we need to compute goto(I8, X) for X = a, (, F:

goto(I8, a) = closure({F → a•}) = I1
goto(I8, () = closure({F → (•E)}) = I2
goto(I8, F) = closure({T → T ∗ F•}) = {T → T ∗ F•} = I11 (say).

Thus, after the third iteration of the loop we have C = {I0, I1, . . . , I11}.
Next, we need to consider I9, I10, I11. For I9 and I11, we see that goto(I9, X) = goto(I11, X) = ∅
for all X. For I10 we only need to compute goto(I10, ∗):

goto(I10, ∗) = closure({T → T ∗ •F}) = I8.

Thus, C did not change during the fourth iteration of the loop and we have finished.
Now, consider an attempt at constructing an LR(0) parsing table from this state set. Specifically,
consider the state I4 (we could also have chosen I10). This state contains the items
E → T• and T → T • ∗F, and goto(I4, ∗) = I8. This is a shift-reduce conflict (we would
attempt to place both s8 and r2 at location (4, ∗) in the table).
Constructing SLR(1) Parsing Tables.
In LR(0) parsing, a reduce action is performed whenever the parser enters a state i (i.e. Ii)
containing an item of the form A → α•, regardless of the next input symbol. In SLR(1) parsing,
if a state contains an item of this form then the reduction of α to A is performed only if the next
token is in Follow(A) (the set of all terminal symbols that can follow A in a sentential form);
otherwise a different action (e.g. a shift, another reduce, or error) is taken based on the other
items of state Ii.
The rules for constructing SLR(1) parsing tables are the same as those for LR(0) tables given
in the previous section, except that we replace rule 2(b):

If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ T ∪ {$}

by

If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ Follow(A).
Let's consider the example grammar. The Follow sets are

Follow(E) = {$, +, )}
Follow(T) = {$, ∗, +, )}
Follow(F) = {$, ∗, +, )}.

Notice first that the shift-reduce conflict mentioned above has been resolved. Recall the items
E → T• and T → T • ∗F of I4, where goto(I4, ∗) = I8. As ∗ ∉ Follow(E), we do not choose to
reduce in this state (i.e. we place only s8 in the parsing table at position (4, ∗)).
The complete SLR(1) parsing table for this grammar is as shown below.
action goto
State a + ∗ ( ) $ E T F
0 s1 s2 3 4 5
1 r6 r6 r6 r6
2 s1 s2 6 4 5
3 s7 ok
4 r2 s8 r2 r2
5 r4 r4 r4 r4
6 s7 s9
7 s1 s2 10 5
8 s1 s2 11
9 r5 r5 r5 r5
10 r1 s8 r1 r1
11 r3 r3 r3 r3
A parse of the string a + a ∗ a is:
Step Stack Input Action
1 0 a + a ∗ a shift 1
2 0a1 +a ∗ a reduce by F → a
3 0F5 +a ∗ a reduce by T → F
4 0T4 +a ∗ a reduce by E → T
5 0E3 +a ∗ a shift 7
6 0E3 + 7 a ∗ a shift 1
7 0E3 + 7a1 ∗a reduce by F → a
8 0E3 + 7F5 ∗a reduce by T → F
9 0E3 + 7T10 ∗a shift 8
10 0E3 + 7T10 ∗ 8 a shift 1
11 0E3 + 7T10 ∗ 8a1 $ reduce by F → a
12 0E3 + 7T10 ∗ 8F11 $ reduce by T → T ∗ F
13 0E3 + 7T10 $ reduce by E → E + T
14 0E3 $ accept
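The same table-driven loop sketched for LR(0) works unchanged here: the lookahead-dependent behaviour is entirely encoded in the table. Below is an illustrative sketch (the action and goto entries are transcribed from the SLR(1) table above; the class name and the "sN"/"rN" action encoding are assumptions of this sketch):

```java
import java.util.*;

public class Slr1Driver {
    static Map<String, String> action = new HashMap<>();
    static Map<String, Integer> gotoTab = new HashMap<>();
    static String[] lhs    = {"", "E", "E", "T", "T", "F", "F"};  // rule -> lhs
    static int[]    rhsLen = { 0,   3,   1,   3,   1,   3,   1};  // rule -> |rhs|
    static {
        for (int s : new int[]{0, 2, 7, 8}) {            // rows that shift a and (
            action.put(s + "a", "s1"); action.put(s + "(", "s2");
        }
        for (char c : "+*)$".toCharArray()) {            // pure reduce rows
            action.put("1" + c, "r6"); action.put("5" + c, "r4");
            action.put("9" + c, "r5"); action.put("11" + c, "r3");
        }
        for (char c : "+)$".toCharArray()) {             // reduce only on Follow(E)
            action.put("4" + c, "r2"); action.put("10" + c, "r1");
        }
        action.put("4*", "s8"); action.put("10*", "s8"); // * not in Follow(E): shift
        action.put("3+", "s7"); action.put("3$", "ok");
        action.put("6+", "s7"); action.put("6)", "s9");
        gotoTab.put("0E", 3); gotoTab.put("0T", 4);  gotoTab.put("0F", 5);
        gotoTab.put("2E", 6); gotoTab.put("2T", 4);  gotoTab.put("2F", 5);
        gotoTab.put("7T", 10); gotoTab.put("7F", 5); gotoTab.put("8F", 11);
    }

    static boolean parse(String input) {
        input += "$";
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int pos = 0;
        while (true) {
            String act = action.get(states.peek() + "" + input.charAt(pos));
            if (act == null) return false;                       // error entry
            if (act.equals("ok")) return true;                   // accept
            int n = Integer.parseInt(act.substring(1));
            if (act.charAt(0) == 's') { states.push(n); pos++; } // shift
            else {                                               // reduce by rule n
                for (int i = 0; i < rhsLen[n]; i++) states.pop();
                states.push(gotoTab.get(states.peek() + lhs[n]));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("a+a*a"));   // true: the trace above
        System.out.println(parse("a+*a"));    // false: syntax error
    }
}
```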
It can be seen from the SLR(1) table construction rules that a grammar is SLR(1) if, and only
if, the following conditions are satisfied for each state i:
(a) For any item A → α • aβ, there is no other item B → γ• with a ∈ Follow(B);
(b) For any two items A → α• and B → β•, Follow(A) ∩ Follow(B) = ∅.
Violation of (a) leads to a shift-reduce conflict: if the parser is in state i and a is the next
token, the parser cannot decide whether to shift the a (and goto(i, a)) onto the stack or to
reduce the top of the stack using the rule B → γ. Violation of (b) leads to a reduce-reduce
conflict: if the parser is in state i and the next token a is in both Follow(A) and Follow(B), the
parser cannot decide whether to reduce using rule A → α or rule B → β. For non-SLR(1)
grammars that are still LR(1), we need to look at LR(1) and LALR(1) parsers.
5 Javacc
Javacc is a compiler-compiler which takes a grammar specification and writes a parser for that
language in Java. It comes with a useful tool called Jjtree which in turn produces a Javacc
file with extra code added to construct a tree from a parse. By running the output from Jjtree
through Javacc and then compiling the resulting Java files we get a full top-down parser for
our grammar. The process is shown in figure 1.
[Figure 1: The JavaTree Compilation Process. A language specification myFile.jjt is run
through jjtree to produce the Javacc file myFile.jj; javacc turns this into Java source files,
which javac compiles into the finished parser.]
Because the parser is top-down we can only use Javacc on LL(k) grammars, meaning that we
have to remove any left-recursion from the grammar before writing the parser.
5.1 Grammar and Coursework
The grammar we will be compiling, both in these notes and in your coursework, is deﬁned here:
program → decList begin statementList | statementList
decList → dec decList | dec
dec → type id
type → float | int
statementList → statement ; statementList | statement
statement → ifStatement | repeatStatement | assignStatement | readStatement | writeStatement
ifStatement → ifPart end | ifPart elsePart end
ifPart → if exp then statementList
elsePart → else statementList
repeatStatement → repeat statementList until exp
assignStatement → id := exp
readStatement → read id
writeStatement → write exp
exp → simpleExp boolOp simpleExp | simpleExp
boolOp → < | =
simpleExp → term addOp simpleExp | term
addOp → + | -
term → factor mulOp factor | factor
mulOp → * | /
factor → id | float | int | ( exp )
So we have an optional list of variable declarations, followed by a series of statements. There
are five statements: a conditional, an iteration, an assignment, and a read and a write statement.
All variables must be declared before use, and each variable can be a floating-point number
or an integer. We also have various expressions. Note how the grammar forces all arithmetic
expressions to have the correct operator precedence. Other ways of achieving this will be
discussed in the lectures. There are no boolean variables, but two boolean operators are available
for conditional and repeat statements: less than, and equals. The terminals in this grammar
are precisely those written in bold in the above list.
Your coursework comes in seven parts with a deadline at midnight on the Friday of weeks 2-6,
8 and 11. The coursework is designed to take 30 hours to complete in total, so you should
spend about 3 hours each week. The early parts shouldn't take that long, so you are encouraged
to work ahead. I hope to put a model answer on Blackboard by Monday evening each week,
together with your grade, so if you achieved a disappointing grade for one assignment you can
use the model answer for the next. Each week the coursework will build upon the previous
work. Because I am giving model answers each week there will be no extensions. If you have
a reasonable excuse for not submitting on time you will be exempted from that coursework.
You are recommended to spend each Friday afternoon working on this assignment. Although
there will not be formal lab classes you are strongly recommended to use Friday afternoons
as such. Your plan each week should be to read the relevant section of this document, try out
the examples, ensure that you understand fully what is going on, and complete the coursework
for that week. If this takes less than three hours you should start work on the next one. The
first five courseworks are worth 10% each, the last two are worth 25% each. The coursework in
total is worth 30% of the marks for the module.
5.2 Recognising Tokens
Our first job is to build a recogniser for the terminal symbols in the grammar. Let's look at a
minimal Javacc file:
options{
STATIC = true;
}
PARSER_BEGIN(LexicalAnalyser)
class LexicalAnalyser{
public static void main(String[] args){
LexicalAnalyser lexan = new LexicalAnalyser(System.in);
try{
lexan.start();
}//end try
catch(ParseException e){
System.out.println(e.getMessage());
}//end catch
System.out.println("Finished Lexical Analysis");
}//end main
}//end class
PARSER_END(LexicalAnalyser)
//Ignore all whitespace
SKIP:{" " | "\t" | "\n" | "\r"}
//Declare the tokens
TOKEN:{<INT: (["0"-"9"])+>}
//Now the operators
TOKEN:{<PLUS: "+">}
TOKEN:{<MINUS: "-">}
TOKEN:{<MULT: "*">}
TOKEN:{<DIV: "/">}
void start():
{}
{
<PLUS> {System.out.println("Plus");} |
<MINUS> {System.out.println("Minus");} |
<DIV> {System.out.println("Divide");} |
<MULT> {System.out.println("Multiply");} |
<INT> {System.out.println("Integer");}
}
First we have a section for options. Most of the options available default to acceptable values
for this example but we want this program to be static because we are going to write a main
method here rather than in another ﬁle. Therefore we set this to true because it defaults to
false.
Next we have a block of code delimited by PARSER_BEGIN(LexicalAnalyser) and
PARSER_END(LexicalAnalyser). The argument is the name which Javacc will give the finished
parser. Within that there is a class declaration. All code within the PARSER_BEGIN and
PARSER_END delimiters will be output verbatim by Javacc. Now we write the main method,
which constructs a new parser and calls its start() method. This has to be placed inside a
try-catch block because Javacc automatically generates its own internally defined ParseExceptions
and throws them if something goes wrong with the parse. Finally we print a message to the
screen announcing that we have ﬁnished the parse successfully.
We then have some SKIPped items. In this case we are simply telling the ﬁnal parser to ignore
all whitespace, spaces, tabs, newlines and carriage returns. This is normal as we usually want
to encourage users of our language to use whitespace to indent their programs to make them
easier to read. There are languages in which whitespace has meaning, e.g. Haskell, but these
tend to be rare.
We then have to declare the tokens, or terminal symbols, of the grammar. The first declaration
is simply a regular expression for integers. We can put in square brackets a range of values,
in this case "0" to "9", separated by a hyphen. This means "anything between these two values
inclusive". Then, wrapping the range expression in parentheses and adding a "+" symbol, we
say that an INT (our name for integers) is one or more instances of a digit. Note that literals
are always included in quote marks.
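For comparison, the same "one or more digits" pattern can be written as an ordinary Java regular expression, [0-9]+ (a minimal illustration, separate from Javacc itself):

```java
public class IntTokenDemo {
    public static void main(String[] args) {
        // The Javacc pattern (["0"-"9"])+ corresponds to the Java regex [0-9]+ :
        // one or more characters in the inclusive range '0'..'9'.
        System.out.println("123".matches("[0-9]+"));   // true
        System.out.println("12a".matches("[0-9]+"));   // false
    }
}
```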
We then include deﬁnitions for the arithmetic operators. For such a simple example this is not
necessary since we could use literals instead, but we will be making this more complex later so
it’s as well to start oﬀ the right way. Besides, it sometimes makes the code diﬃcult to read if
we mix literals and references freely. A TOKEN declaration consists of:
• The keyword TOKEN.
• A colon.
• An open brace.
• An open angle bracket.
• The name by which we will be referring to this token.
• A colon.
• A regular expression deﬁning the pattern for recognising this token.
• The close angle bracket.
• The close brace.
We can have multiple token (or skip) declarations in a single line by including the | (or) symbol,
as in the skip declarations in the example above.
We ﬁnally deﬁne the start() method in which all the work is done. A method declaration in
Javacc consists of the return type (in this case void), the name of the method (in this case
start) and any arguments required, followed by a colon and two (!) sets of braces. The ﬁrst
one contains variables local to this method (in this case none). The second one contains the
grammar’s production rules. In the above example we are saying that a legal program consists
of a single instance of one of the grammar terminals.
Each statement inside the start() method follows a format whereby a previously defined token
is followed by a set of one or more (one in this example) statements inside braces. This states
that when the respective token is identified by the parser, the code inside the braces (which
must be legal Java code) is executed. Thus, if the parser recognises an int it will print "Integer"
to the terminal.
We first run the code through Javacc: javacc lexan.jj (the filetype .jj is compulsory). This
gives us a series of messages, most of which relate to the files being produced by Javacc. We
then compile the resulting Java files: javac *.java. This gives us the .class files, which can be
run from the command line with java LexicalAnalyser.
N.B. You are strongly recommended to type in the above code and compile it. You
should do this with all the examples given in this chapter. Test your parser with
both legal and illegal input to see what happens.
At the moment we can only read a single token. The ﬁrst improvement is to continue to read
tokens until either we reach the end of the input or we read illegal input. We can do this by
simply creating another regular expression inside the start() method.
void start():
{}
{
(<PLUS> |
<MINUS> |
<DIV> |
<MULT> |
<INT>)+
}
Also we can redirect the input from the console to an input ﬁle:
java LexicalAnalyser < input.txt
This file will now, when compiled, produce a parser which recognises any number of our
operators and integers, printing to the screen what it has found. It will continue to do this until it
reaches the end of input or finds an error.
It is time to do assignment 1. The deadline for this is Friday midnight of week
two. You must write a Javacc file which recognises all legal terminal symbols in
the grammar defined in section 5.1.
5.3 JjTree
So far all we have done is create a lexical analyser which merely decides whether each input
is a token (terminal symbol) or not. The ﬁrst time it ﬁnds an error it exits. The above
program demonstrates how each token is speciﬁed and given a name. Then when each token
is recognised we insert a piece of java code to execute. For a complete compiler we need to
construct a tree. Having done that we can traverse the tree generating machinecode for our
target machine. Luckily Javacc comes with a tree builder which inserts all the code necessary
for the construction of such a tree. We will now build a complete parser using this tool – Jjtree.
First of all it should be noted that Jjtree is a preprocessor for Javacc, i.e. it takes our input
ﬁle and outputs Javacc code with tree building code built in. The process is shown in ﬁgure 1.
The input ﬁle to Jjtree needs a ﬁletype of jjt so let’s just change the ﬁle we have been using
to Lexan.jjt (from Lexan.jj) and add a few lines of code:
options{
STATIC = true;
}
PARSER_BEGIN(LexicalAnalyser)
class LexicalAnalyser{
public static void main(String[] args)throws ParseException{
LexicalAnalyser lexan = new LexicalAnalyser(System.in);
SimpleNode root = lexan.start();
System.out.println("Finished Lexical Analysis");
root.dump("");
}
}
PARSER_END(LexicalAnalyser)
//Ignore all whitespace
SKIP:{" " | "\t" | "\n" | "\r"}
//Declare the tokens
TOKEN:{<INT: (["0"-"9"])+>}
//Now the operators
TOKEN:{<PLUS: "+">}
TOKEN:{<MINUS: "-">}
TOKEN:{<MULT: "*">}
TOKEN:{<DIV: "/">}
SimpleNode start():
{}
{
  (<PLUS>  {System.out.println("Plus");}     |
   <MINUS> {System.out.println("Minus");}    |
   <DIV>   {System.out.println("Divide");}   |
   <MULT>  {System.out.println("Multiply");} |
   <INT>   {System.out.println("Integer");})+
  {return jjtThis;}
}
We have stated that calling lexan.start() will return an object of type SimpleNode and
assigned this to a variable called root. This will be the root of our tree. SimpleNode is the
default type of node returned by Jjtree methods; we will see later how to change this
behaviour. The final line of the main method now calls root.dump(""), passing it an
empty string. This is a method defined within the files produced automatically by Jjtree
and its behaviour can be changed if we wish. At the moment it prints to the output a text
description of the tree built by the parse, if successful. The start() method now has a return
type of SimpleNode and the final line of the method returns jjtThis, the node currently
under construction at the point the code is executed. Running this code (Lexan.jjt)
through jjtree produces a file called Lexan.jj. Run this through Javacc and we get the usual
series of .java files. Finally, invoking javac *.java produces our parser, this time with
tree-building capabilities built in. Running the compiled code with the following input file...
123
+
- * /
...produces the following output:
Integer
Plus
Minus
Multiply
Divide
Finished Lexical Analysis
start
It is the same as before except that we now see the result of printing out the tree (dump("")).
It produces the single line start, which is in fact the name of the root of the tree: we have
constructed a tree with a single node called start. Let's add a few production rules to get a
clearer idea of what is happening.
SimpleNode start():
{}
{
(multExp())* {return jjtThis;}
}
void multExp():
{}
{
intExp() <MULT> intExp()
}
void intExp():
{}
{
<INT>
}
What we are doing here is stating that all internal nodes in our ﬁnished tree take the form of
a method, while all terminal nodes are tokens which we have deﬁned.
The code is the same until the definition of the start() method. Then we have a series of methods,
each of which contains a production rule for that nonterminal symbol. The start() rule expects
0 or more multExp's. A multExp is an intExp followed by a MULT terminal followed by another
intExp. An intExp is simply an INT terminal. The grammar in BNF form is as follows:
start → multExp | multExp start | ε
multExp → intExp MULT intExp
intExp → INT
where INT and MULT are terminal symbols.
The output from this after being run on the input:
1*12
3 *14
45 * 6
is more interesting:
start
multExp
intExp
intExp
multExp
intExp
intExp
multExp
intExp
intExp
Pictorially, the tree we have built from the parse of the input is shown in figure 2.
Figure 2: The tree built from the parse
We have a start node containing three multExp's, each of which contains two children, both
of which are intExp's. It is time for assignment 2. All you have to do is change the
file you produced for assignment 1 into a jjtree file which contains the complete
specification for the module grammar given on page 77.
Notice how the nodes take the same name as the method name in which they are deﬁned? Let’s
change this behaviour:
SimpleNode start() #Root:{}{...}
void multExp() #MultiplicationExpression:{}{...}
void intExp() #IntegerExpression:{}{...}
All we have done is precede each method body with a node name (indicated by the # sign).
This gives us the following, more easily understandable output:
Root
MultiplicationExpression
IntegerExpression
IntegerExpression
MultiplicationExpression
IntegerExpression
IntegerExpression
MultiplicationExpression
IntegerExpression
IntegerExpression
To see the full eﬀect of this change we can add two options to the ﬁrst section of the ﬁle:
MULTI = true;
NODE_PREFIX = "";
The first tells the compiler builder to create different types of node, rather than just SimpleNode.
If we don't add the second line, the generated node classes will all carry a default prefix (AST).
We will be using these different names later.
It is time for assignment 3. Change all the methods you have written so far so
that they produce node names and types which are other than the names of the
methods.
5.4 Information within nodes
With the addition of the above method of changing node names we have a great deal of
information about interior nodes. What about terminal nodes? We know that we have, for
example, an identifier or perhaps a floating-point number, but what is the name of that
identifier, or the actual value of the float? We will need that information in order to be able
to use it when, for example, generating code.
Looking at the list of ﬁles that Javacc generates for us we can see lots of nodes. Examining the
code we can see that all of them extend SimpleNode. Let’s look at the superclass:
/* Generated By:JJTree: Do not edit this line. SimpleNode.java */
public class SimpleNode implements Node {
protected Node parent;
protected Node[] children;
protected int id;
protected LexicalAnalyser parser;
public SimpleNode(int i) {
id = i;
}
public SimpleNode(LexicalAnalyser p, int i) {
this(i);
parser = p;
}
public void jjtOpen() {
}
public void jjtClose() {
}
public void jjtSetParent(Node n) { parent = n; }
public Node jjtGetParent() { return parent; }
public void jjtAddChild(Node n, int i) {
if (children == null) {
children = new Node[i + 1];
} else if (i >= children.length) {
Node c[] = new Node[i + 1];
System.arraycopy(children, 0, c, 0, children.length);
children = c;
}
children[i] = n;
}
public Node jjtGetChild(int i) {
return children[i];
}
public int jjtGetNumChildren() {
return (children == null) ? 0 : children.length;
}
/* You can override these two methods in subclasses of SimpleNode to
customize the way the node appears when the tree is dumped. If
your output uses more than one line you should override
toString(String), otherwise overriding toString() is probably all
you need to do. */
public String toString() { return LexicalAnalyserTreeConstants.jjtNodeName[id]; }
public String toString(String prefix) { return prefix + toString(); }
/* Override this method if you want to customize how the node dumps
out its children. */
public void dump(String prefix) {
System.out.println(toString(prefix));
if (children != null) {
for (int i = 0; i < children.length; ++i) {
SimpleNode n = (SimpleNode)children[i];
if (n != null) {
n.dump(prefix + " ");
}
}
}
}
}
The child nodes for this node are kept in an array. jjtGetChild(i) returns the ith child node,
jjtGetNumChildren() returns the number of children this node has, and dump() determines
how the nodes are printed to the screen when we call it on the root node. Notice that dump() in particular
recursively invokes dump() on all its children after printing itself, giving a preorder printout
of the tree. This can easily be changed to inorder or postorder.
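The switch from preorder to postorder is only a matter of when a node prints itself relative to the recursive calls on its children. Here is a standalone sketch using a minimal stand-in class (not the generated SimpleNode) so the two orders can be compared:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal stand-in for the generated SimpleNode, just enough to compare
// traversal orders in dump().
class DemoNode {
    final String name;
    final List<DemoNode> children = new ArrayList<>();

    DemoNode(String name) { this.name = name; }

    // Preorder, as in the generated dump(): print this node, then the children.
    void dumpPreorder(String prefix, StringBuilder out) {
        out.append(prefix).append(name).append('\n');
        for (DemoNode c : children) c.dumpPreorder(prefix + " ", out);
    }

    // Postorder: print the children first, then this node.
    void dumpPostorder(String prefix, StringBuilder out) {
        for (DemoNode c : children) c.dumpPostorder(prefix + " ", out);
        out.append(prefix).append(name).append('\n');
    }
}
```

In the generated SimpleNode the same change amounts to moving the println in dump() after the loop over the children.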
We can add functionality to the node classes by adding it to SimpleNode, since all other nodes
inherit from this. Add the following to the SimpleNode ﬁle:
protected String type = "Interior node";
protected int intValue;
public void setType(String type){this.type = type;}
public void setInt(int value){intValue = value;}
public String getType(){return type;}
public int getInt(){return intValue;}
By adding these members to the SimpleNode class all subclasses of SimpleNode will inherit
them. Now change the code in the intExp method as follows:
void factor() #Factor:
{Token t;}
{
  <ID>    |
  <FLOAT> |
  t = <INT> {jjtThis.setType("int"); jjtThis.setInt(Integer.parseInt(t.image));} |
  <OPENPAR> exp() <CLOSEPAR>
}
Each time javacc recognises a predefined token it produces an object of type Token. The class
Token contains some useful fields; the one used here is image, which holds the text of the
matched token (kind, which identifies which kind of token was matched, is another). This is
exactly the information we need to add to our nodes.
What we are saying here is that if the method factor() is called then we expect an ID, an
int, a float or a parenthesised expression. If it is an int we add to the factor node the type
and the value of this particular int.
Now, when we have a factor node we know, if it is an integer, what value it has. We can also,
by adding to this code, know exactly which type of factor we are dealing with. Look at the
dump() method:
public void dump(String prefix) {
System.out.println(toString(prefix));
if(toString().equals("Factor")){
if(getType().equals("int")){
System.out.println("I am an integer and my value is " + getInt());}
}
if (children != null) {
for (int i = 0; i < children.length; ++i) {
SimpleNode n = (SimpleNode)children[i];
if (n != null) {
n.dump(prefix + " ");
}
}
}
}
Now when we call the dump() method of our root node we will get all the information we had
previously, plus, for each terminal factor node, if it is an integer we will know its value as well.
Time for assignment 4. Alter your code (in the factor method and the SimpleNode
class) so that variables, integers and ﬂoats will all print out what they are, plus
their value or name.
5.5 Conditional statements in jjtree
Many of the methods we have defined so far for the internal nodes have an optional part. This
means that, for example, a statement list is a single statement followed by an optional
semicolon and statement list. This use of recursion allows us to have as many statements in
our program as we want. Unfortunately it means that when the tree is constructed we cannot
tell exactly what we have when examining a statement-list node without counting its children.
This unnecessarily complicates our code, so let's see how to alter it:
void multExp() #MultiplicationExpression(>1):
{}
{
intExp() (<MULT> intExp())?
}
The production rule now states that a multExp is a single intExp followed by 0 or 1 occurrences
of <MULT> intExp (indicated by the question mark). The (>1) expression says to build a
MultiplicationExpression node only if it would have more than one child. If the expression
consists of just a single intExp the method returns that, rather than a MultiplicationExpression.
This very much simpliﬁes the output of dump(), and also makes our job much easier when
walking the tree to, for example, type check assignment statements, or generate machine code.
Now when we visit an expression or statement we know exactly what sort of node we are
dealing with.
5.6 Symbol Tables
Because our terminal nodes are treated diﬀerently from interior nodes we need to maintain a
symbol table to keep track of them. We will use this during all phases of our ﬁnished compiler
so we should look at it now. Of course we will implement a symbol table as a hash table, so
that we can access our variables in constant time.
This would also be a convenient time to impose some requirements on our language, such as
the requirement to declare variables before using them, and forbidding, or at least detecting,
multiple declarations.
Our Variables have a name, a type and a value. This is a trivial class to implement:
public class Variable{
private String type;
private String name;
private int intValue = 0;
private double floatValue = 0.0;
public Variable(int value, String name){
this.name = name;
type = "integer";
intValue = value;
}
public Variable(double value, String name){
this.name = name;
type = "float";
floatValue = value;
}
public String getType(){return type;}
public String getName(){return name;}
public int getInt(){return intValue;}
public double getDouble(){return floatValue;}
}
Now that we have Variables we need to declare a HashMap in which to put them. Once
we have done that we add to our LexicalAnalyser class a method for adding variables to our
symbol table:
public static void addToSymbolTable(Variable v){
if(symbolTable.containsKey(v.getName())){
System.out.println("Variable " + v.getName() + " multiply declared");
}
else{
symbolTable.put(v.getName(), v);
}
}
This quite simply interrogates the symbol table to see if the Variable we have just read exists
or not. If it does we print out an error message, otherwise we add the new variable to the table.
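The notes leave the HashMap declaration itself to the reader. Here is a minimal sketch; the class and method names (SymbolTable, declare, lookup) are my own, and lookup is the natural hook for the undeclared-variable check in assignment 5:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: the symbol table is a map from variable names to Variable objects,
// giving (expected) constant-time insertion and lookup.
class SymbolTable {
    private final Map<String, SymVariable> table = new HashMap<>();

    // Returns false (and reports) on a duplicate declaration.
    boolean declare(SymVariable v) {
        if (table.containsKey(v.getName())) {
            System.out.println("Variable " + v.getName() + " multiply declared");
            return false;
        }
        table.put(v.getName(), v);
        return true;
    }

    // Returns null (and reports) when a variable is used before declaration.
    SymVariable lookup(String name) {
        SymVariable v = table.get(name);
        if (v == null) System.out.println("Variable " + name + " used before declaration");
        return v;
    }
}

// Stub with just the field the table needs; the notes' Variable class also
// carries a type and a value.
class SymVariable {
    private final String name;
    SymVariable(String name) { this.name = name; }
    String getName() { return name; }
}
```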
All that is left to do is invoke our function when we read a variable declaration:
void dec() #Declaration:
{String tokType;
Token tokName;
Variable v;}
{
tokType = type() tokName = <ID>{
if(tokType.equals("integer"))v = new Variable(0, tokName.image);
else v = new Variable(0.0, tokName.image);
addToSymbolTable(v);
}
}
String type() #Type:
{Token t;}
{
  t = <FPN> {return "float";} |
  t = <INTEGER> {return "integer";}
}
Read the above carefully, type it in and make sure you understand exactly what is going on.
Then start this week’s assignment. Assignment 5. Alter the code you have written so
far so that methods with alternative numbers of children return the appropriate
type of node. Then add the symbol table code given above. (Ideally you should
declare the main method in a class of its own to avoid having all the extra methods
and ﬁelds declared as static.) Then add code which detects if the writer is trying
to use a variable which has not been declared, printing an error message if (s)he
does so.
5.7 Visiting the Tree
In altering the dump() method in the previous example we merely altered the toString()
method. But let’s think about what we are trying to do, especially with the abilities you have
all demonstrated by progressing thus far in your studies.
Let us consider what we have here. We have a data structure (a tree) which we want to
traverse, operating upon each node (for the time being we will merely print out details of the
node). Those of you who took CS211 (Programming with Objects and Threads) will recognise
the Visitor pattern immediately. (See, you always knew that module would come in handy.)
Fortunately the writers of Javacc also took that module so they have already thought of this.
If we add a VISITOR switch to the options section:
options{
    MULTI = true;
    NODE_PREFIX = "";
    STATIC = false;
    VISITOR = true;
}
...jjtree writes a Visitor interface for us to implement. It deﬁnes an accept method for each
of the nodes allowing us to pass our implementation of the Visitor to each node. This is the
Visitor interface for our current example:
/* Generated By:JJTree: Do not edit this line. ./LexicalAnalyserVisitor.java */
public interface LexicalAnalyserVisitor
{
public Object visit(SimpleNode node, Object data);
public Object visit(Root node, Object data);
public Object visit(DeclarationList node, Object data);
public Object visit(Declaration node, Object data);
public Object visit(Type node, Object data);
public Object visit(StatementList node, Object data);
public Object visit(Statement node, Object data);
public Object visit(IfStatement node, Object data);
public Object visit(IfPart node, Object data);
public Object visit(ElsePart node, Object data);
public Object visit(RepeatStatement node, Object data);
public Object visit(AssignmentStatement node, Object data);
public Object visit(ReadStatement node, Object data);
public Object visit(WriteStatement node, Object data);
public Object visit(Expression node, Object data);
public Object visit(BooleanOperator node, Object data);
public Object visit(SimpleExpression node, Object data);
public Object visit(AdditionOperator node, Object data);
public Object visit(Term node, Object data);
public Object visit(MultiplicationOperator node, Object data);
public Object visit(Factor node, Object data);
}
Warning 1. Jjtree does not consistently rewrite all of its classes. This means that
if you have been writing these classes and experimenting with them as you have
read this (highly recommended) you should delete all ﬁles except the .jjt ﬁle, add
the new options and then recompile with Jjtree.
Warning 2. We will be writing an implementation of the Visitor interface. This is
our own file and is not therefore rewritten by Jjtree. Therefore from now on, if we
alter the .jjt file and then compile byte code with javac *.java, we may get errors if
we have not first checked our implementation of the Visitor.
Now instead of dump()ing our tree we visit it with a visitor. By implementing the visit()
method for each particular type of node in a different way we can do anything we want,
without altering the node classes themselves, which is exactly what the Visitor pattern was
designed to achieve.
public class MyVisitor implements LexicalAnalyserVisitor{
    public Object visit(Root node, Object data){
        System.out.println("Hello. I am a Root node");
        if(node.jjtGetNumChildren() == 0){
            System.out.println("I have no children");
        }
        else{
            for(int i = 0; i < node.jjtGetNumChildren(); i++){
                node.children[i].jjtAccept(this, data);
            }
        }
        return data;
    }
    //...and similarly for each of the other visit() methods in the interface
}
Simple isn't it! Assignment 6. Write a Visitor class, implementing
LexicalAnalyserVisitor, which visits each node in turn, printing out the type of
node it is and how many children it has.
5.8 Semantic Analysis
Now that we have our symbol table and a Visitor interface it is easy to implement a Visitor
object which can do anything we wish with the data held in the tree which our parser builds.
In the last assignment all we did was print out details of the type of node. In a real compiler we
would want eventually to generate machine code for our target machine. We will not be doing
that as an assignment, though if you are enjoying this module I hope you might want to do this
out of interest. For your ﬁnal assignment you are going to be doing some semantic analysis,
to ensure that you do not attempt to assign incorrect values to variables, e.g. a ﬂoating point
number to an integer variable.
You have three weeks to do this assignment so there will be plenty of help given during lectures
and you have plenty of time to come and see me, or ask Program Advisory for help. It should
be clear that we only need to write code for assignment statements. Let's look at this:
void assignStatement() #AssignmentStatement:
{}
{
<ID> <ASSIGN> exp()
}
When we read the ID token we can check its type by looking it up in the symbol table as
we have done before. All we need to do is decide what the type of the exp is. This is your
assignment. Start from the bottom nodes of your tree and work your way up to the expression.
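One way to organise that bottom-up walk is for every expression node to report its own type, synthesised from its children. The sketch below uses stand-in classes, not the generated node types, and the compatibility rule (an integer variable may not receive a float expression) is an assumption about the module grammar:

```java
// Sketch of bottom-up type synthesis: each node reports its type, computed
// from its children; an assignment compares the variable's declared type
// with the synthesised type of the right-hand side.
interface TypedExp {
    String type(); // "integer" or "float"
}

class IntLit implements TypedExp {
    public String type() { return "integer"; }
}

class FloatLit implements TypedExp {
    public String type() { return "float"; }
}

class BinOp implements TypedExp {
    final TypedExp left, right;
    BinOp(TypedExp l, TypedExp r) { left = l; right = r; }

    // If either operand is a float, the whole expression is a float.
    public String type() {
        return (left.type().equals("float") || right.type().equals("float"))
                ? "float" : "integer";
    }
}

class TypeCheck {
    // Assumed rule: assigning a float expression to an integer variable is the error.
    static boolean assignOk(String varType, TypedExp rhs) {
        return !(varType.equals("integer") && rhs.type().equals("float"));
    }
}
```

Walking the real tree, the same computation happens in the visit() methods: each returns its synthesised type upwards, and the assignment-statement visit compares it with the declared type found in the symbol table.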
6 Tiny
For your coursework you will be writing a compiler which compiles a while language, similar to
the one you met in Theory of Programming Languages, into an assembly language called tiny.
A tiny machine simulator can be downloaded from the course website to test your compiler.
This section describes the tiny language.
6.1 Basic Architecture
tiny consists of a read-only instruction memory, a data memory, and a set of eight
general-purpose registers. These all use nonnegative integer addresses, beginning at 0.
Register 7 is the program counter and is the only special register, as described below.
The following C declarations will be used in the descriptions which follow:
#define IADDR_SIZE ...   /* size of instruction memory */
#define DADDR_SIZE ...   /* size of data memory */
#define NO_REGS 8        /* number of registers */
#define PC_REG 7
Instruction iMem[IADDR_SIZE];
int dMem[DADDR_SIZE];
int reg[NO_REGS];
tiny performs a conventional fetch–execute cycle:
do
/* fetch */
currentInstruction = iMem[reg[pcRegNo]++];
/* execute current instruction */
...
...
while (!(halt || error));
At start-up the tiny machine sets all registers and data memory to 0, then loads the value
of the highest legal address (namely DADDR_SIZE - 1) into dMem[0]. This allows memory
to easily be added to the machine, since programs can find out during execution how much
RO Instructions
Format: opcode r,s,t

Opcode  Effect
HALT    stop execution (operands ignored)
IN      reg[r] ← integer value read from the standard input (s and t ignored)
OUT     reg[r] → the standard output (s and t ignored)
ADD     reg[r] = reg[s] + reg[t]
SUB     reg[r] = reg[s] - reg[t]
MUL     reg[r] = reg[s] * reg[t]
DIV     reg[r] = reg[s] / reg[t] (may generate ZERO_DIV)

RM Instructions
Format: opcode r,d(s) where a = d + reg[s]; any reference to dMem[a] generates DMEM_ERR
if a < 0 or a ≥ DADDR_SIZE

Opcode  Effect
LD      reg[r] = dMem[a] (load r with memory value at a)
LDA     reg[r] = a (load address a directly into r)
LDC     reg[r] = d (load constant d directly into r; s is ignored)
ST      dMem[a] = reg[r] (store value in r to memory location a)
JLT     if (reg[r] < 0) reg[PC_REG] = a
        (jump to instruction a if r is negative; similarly for the following)
JLE     if (reg[r] ≤ 0) reg[PC_REG] = a
JGE     if (reg[r] ≥ 0) reg[PC_REG] = a
JGT     if (reg[r] > 0) reg[PC_REG] = a
JEQ     if (reg[r] == 0) reg[PC_REG] = a
JNE     if (reg[r] != 0) reg[PC_REG] = a

Table 1: The tiny instruction set
memory is available. The machine then starts to execute instructions beginning at iMem[0].
The machine stops when a HALT instruction is executed. The possible error conditions include
IMEM_ERR, which occurs if reg[PC_REG] < 0 or reg[PC_REG] ≥ IADDR_SIZE in the fetch
step, and the two conditions DMEM_ERR and ZERO_DIV, which occur during instruction
execution as described below.
The instruction set of the machine is given in table 1, together with a brief description of the
eﬀect of each instruction.
There are two basic instruction formats: register–only, or RO instructions, and register–
memory, or RM instructions. A register–only instruction has the format
opcode r,s,t
where the operands r, s, t are legal registers (checked at load time). Thus, such instructions
are three–address, and all three addresses must be registers. All arithmetic instructions are
limited to this format, as are the two primitive input/output instructions.
A register–memory instruction has the format
opcode r,d(s)
Here r and s must be legal registers (checked at load time), and d is a positive or
negative integer representing an offset. This is a two-address instruction, where the
first address is always a register and the second is a memory address a given by a =
d + reg[s], where a must be a legal address (0 ≤ a < DADDR_SIZE). If a is out of the legal
range, then DMEM_ERR is generated during execution.
RM instructions include three diﬀerent load instructions corresponding to the three addressing
modes “load constant” (LDC), “load address” (LDA), and “load memory” (LD). In addition
there is one store instruction and six conditional jump instructions.
In both RO and RM instructions, all three operands must be present, even though some of
them may be ignored. This is due to the simple nature of the loader, which only distinguishes
between the two classes of instruction (RO and RM) and does not allow diﬀerent formats within
each class. (This actually makes code generation easier since only two diﬀerent routines will be
needed).
Table 1 and the discussion of the machine up to this point represent the complete tiny
architecture. In particular there is no hardware stack or other facilities of any kind. No
register except the PC is special in any way; there is no sp or fp. A compiler for the machine
must therefore maintain any runtime environment organisation entirely manually. Although
this may be unrealistic, it has the advantage that all operations must be generated explicitly
as they are needed.
Since the instruction set is minimal, we should give some idea of how it can be used to achieve
some more complicated programming-language operations. (Indeed, the machine is adequate,
if not comfortable, for even very sophisticated languages.)
i. The target register in the arithmetic, IN, and load operations comes first, and the source
register(s) come second, similar to the 80x86 and unlike the Sun SparcStation. There is
no restriction on the use of registers for sources and targets; in particular the source and
target registers may be the same.
ii. All arithmetic operations are restricted to registers. No operations (except load and store
operations) act directly on memory. In this the machine resembles RISC machines such
as the Sun SparcStation. On the other hand the machine has only 8 registers, while most
RISC processors have many more.
iii. There are no ﬂoating point operations or ﬂoating point registers. (But the language you
are compiling has no ﬂoating point variables either).
iv. There is no restriction on the use of the PC in any of the instructions. Indeed, since there
is no unconditional jump, it must be simulated by using the PC as the target register in
an LDA instruction:
LDA 7, d(s)
This instruction has the eﬀect of jumping to location d + reg[s].
v. There is also no indirect jump instruction, but it too can be imitated if necessary, by
using an LD instruction. For example,
LD 7,0(1)
jumps to the instruction whose address is stored in memory at the location pointed to by
register 1.
vi. The conditional jump instructions (JLT etc.), can be made relative to the current position
in the program by using the PC as the second register. For example
JEQ 0,4(7)
causes the machine to jump ﬁve instructions forward in the code if register 0 is 0. An
unconditional jump can also be made relative to the PC by using the PC twice in an
LDA instruction. Thus
LDA 7,-4(7)
performs an unconditional jump three instructions backwards.
vii. There are no procedure calls or JSUB instruction. Instead we must write
LD 7,d(s)
which has the eﬀect of jumping to the procedure whose entry address is dMem[d +
reg[s]]. Of course we need to remember to save the return address ﬁrst by executing
something like
LDA 0,1(7)
which places the current PC value plus one into reg[0].
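Putting the architecture together, the fetch-execute cycle can be sketched as a toy simulator in Java. This is an illustration only: the numeric opcode encoding here is invented (the real machine loads textual instructions, as described in the next section), and only LDC, ADD, OUT and HALT are implemented.

```java
// Toy sketch of the tiny fetch-execute cycle. Register 7 is the PC, as in the
// notes; the opcode numbering below is our own, not part of the machine.
class TinySketch {
    static final int PC = 7;
    static final int HALT = 0, LDC = 1, ADD = 2, OUT = 3;

    int[] reg = new int[8];
    int[][] iMem;                 // each instruction: {opcode, r, s_or_d, t}
    StringBuilder output = new StringBuilder();

    TinySketch(int[][] program) { iMem = program; }

    void run() {
        boolean halt = false;
        do {
            int[] ins = iMem[reg[PC]++];                        // fetch
            switch (ins[0]) {                                   // execute
                case LDC:  reg[ins[1]] = ins[2]; break;              // reg[r] = d
                case ADD:  reg[ins[1]] = reg[ins[2]] + reg[ins[3]]; break;
                case OUT:  output.append(reg[ins[1]]).append('\n'); break;
                case HALT: halt = true; break;
            }
        } while (!halt);
    }
}
```

For example, loading 2 and 3 into registers 0 and 1, adding them into register 2 and executing OUT prints 5 before the HALT stops the machine.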
6.2 The machine simulator
The machine accepts text ﬁles containing TM instructions as described above with the following
conventions:
• An entirely blank line is ignored
• A line beginning with an asterisk is considered to be a comment and is ignored.
• Any other line must contain an integer instruction location followed by a colon followed
by a legal instruction. Any text after the instruction is considered to be a comment and
is ignored.
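Parsing one line of this format is straightforward. The sketch below (my own helper class, not part of the simulator) splits a line into location, opcode and operand fields, and discards blank lines, '*' comment lines and trailing comments:

```java
// Sketch: parse one line of the TM file format described above.
// parse() returns null for blank lines and '*' comment lines.
class TmLine {
    final int loc;
    final String opcode;
    final String operands;

    private TmLine(int loc, String opcode, String operands) {
        this.loc = loc; this.opcode = opcode; this.operands = operands;
    }

    static TmLine parse(String line) {
        String s = line.trim();
        if (s.isEmpty() || s.startsWith("*")) return null;   // blank or comment line
        String[] locRest = s.split(":", 2);                  // "6" and " JNE 0,-3(7) until ..."
        int loc = Integer.parseInt(locRest[0].trim());
        String[] parts = locRest[1].trim().split("\\s+", 3); // opcode, operands, trailing comment
        String operands = parts.length > 1 ? parts[1] : "";
        return new TmLine(loc, parts[0], operands);          // trailing comment dropped
    }
}
```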
The machine contains no other features; in particular there are no symbolic labels and no
macro facilities.
We now have a look at a TM program which computes the factorial of a number input by the
user.
* This program inputs an integer, computes
* its factorial if it is positive,
* and prints the result
0: IN   0,0,0    r0 = read
1: JLE  0,6(7)   if 0 < r0 then
2: LDC  1,1,0    r1 = 1
3: LDC  2,1,0    r2 = 1
* repeat
4: MUL  1,1,0    r1 = r1 * r0
5: SUB  0,0,2    r0 = r0 - r2
6: JNE  0,-3(7)  until r0 = 0
7: OUT  1,0,0    write r1
8: HALT 0,0,0    halt
* end of program
Strictly speaking there is no need for the HALT instruction at the end of the code since the
machine sets all instruction locations to HALT before loading the program, however it is useful
to put it in as a reminder, and also as a target for jumps which wish to end the program.
If this program were saved in the file fact.tm then it can be loaded and executed as in the
following sample session. (The simulator assumes the file extension .tm if one is not given.)

tm fact
TM simulation (enter h for help)...
Enter command: g
Enter value for IN instruction: 7
OUT instruction prints: 5040
HALT: 0,0,0
Halted
Enter command: q
Simulation done.
The g command stands for “go”, meaning that the program is executed starting at the current
contents of the PC (which is 0 after loading), until a HALT instruction is read. The complete
list of simulator commands can be obtained by using the command h.
100
1
1.1
Introduction
Compilation
Deﬁnition. Compilation is a process that translates a program in one language (the source language) into an equivalent program in another language (the object or target language).
Error messages
Source program
Compiler
Target program
An important part of any compiler is the detection and reporting of errors; this will be discussed in more detail later in the introduction. Commonly, the source language is a highlevel programming language (i.e. a problemoriented language), and the target language is a machine language or assembly language (i.e. a machineoriented language). Thus compilation is a fundamental concept in the production of software: it is the link between the (abstract) world of application development and the lowlevel world of application execution on machines. Types of Translators. An assembler is also a type of translator:
Assembly program
Assembler
Machine program
An interpreter is closely related to a compiler, but takes both source program and input data. The translation and execution phases of the source program are one and the same.
Source program Interpreter Input data Output data
Although the above types of translator are the most wellknown, we also need knowledge of compilation techniques to deal with the recognition and translation of many other types of languages including: 2
• Commandline interface languages; • Typesetting / word processing languages (e.g.TEX); • Natural languages; • Hardware description languages; • Page description languages (e.g. PostScript); • Setup or parameter ﬁles. Early Development of Compilers. 1940’s. Early storedprogram computers were programmed in machine language. Later, assembly languages were developed where machine instructions and memory locations were given symbolic forms. 1950’s. Early highlevel languages were developed, for example FORTRAN. Although more problemoriented than assembly languages, the ﬁrst versions of FORTRAN still had many machinedependent features. Techniques and processes involved in compilation were not wellunderstood at this time, and compilerwriting was a huge task: e.g. the ﬁrst FORTRAN compiler took 18 man years of eﬀort to write. Chomsky’s study of the structure of natural languages led to a classiﬁcation of languages according to the complexity of their grammars. The contextfree languages proved to be useful in describing the syntax of programming languages. 1960’s onwards. The study of the parsing problem for contextfree languages during the 1960’s and 1970’s has led to eﬃcient algorithms for the recognition of contextfree languages. These algorithms, and associated software tools, are central to compiler construction today. Similarly, the theory of ﬁnite state machines and regular expressions (which correspond to Chomsky’s regular languages) have proven useful for describing the lexical structure of programming languages. From Algol 60, highlevel languages have become more problemoriented and machine independent, with features much removed from the machine languages into which they are compiled. The theory and tools available today make compiler construction a managable task, even for complex languages. For example, your compiler assignment will take only a few weeks (hopefully) and will only be about 1000 lines of code (although, admittedly, the source language is small).
1.2
The Context of a Compiler
The complete process of compilation is illustrated as:

    skeletal source program
            |
       preprocessor
            |
      source program
            |
         compiler
            |
     assembly program
            |
        assembler
            |
    relocatable m/c code
            |
     link/load editor
            |
     absolute m/c code

1.2.1 Preprocessors

Preprocessing performs (usually simple) operations on the source file(s) prior to compilation. Typical preprocessing operations include:

(a) Expanding macros (shorthand notations for longer constructs). For example, in C,

    #define foo(x,y) (3*x+y*(2+x))

defines a macro foo, that when used later in the program, is expanded by the preprocessor. For example,

    a = foo(a,b)

becomes

    a = (3*a+b*(2+a))

(b) Inserting named files. For example, in C,

    #include "header.h"

is replaced by the contents of the file header.h.

1.2.2 Linkers

A linker combines object code (machine code that has not yet been linked) produced from compiling and assembling many source programs, as well as standard library functions and resources supplied by the operating system. This involves resolving references in each object file to external variables and procedures declared in other files.

1.2.3 Loaders

Compilers, assemblers and linkers usually produce code whose memory references are made relative to an undetermined starting location that can be anywhere in memory (relocatable machine code). A loader calculates appropriate absolute addresses for these memory locations and amends the code to use these addresses.

1.3 The Phases of a Compiler

The process of compilation is split up into six phases, each of which interacts with a symbol table manager and an error handler. This is called the analysis/synthesis model of compilation. There are many variants on this model, but the essential elements are the same.

    source program
          |
    lexical analyser
          |
    syntax analyser
          |
    semantic analyser
          |
    intermediate code gen'tor
          |
    code optimizer
          |
    code generator
          |
    target program

(each phase interacts with the symbol-table manager and the error handler)

1.3.1 Lexical Analysis

A lexical analyser or scanner is a program that groups sequences of characters into lexemes, and outputs (to the syntax analyser) a sequence of tokens. Here:

(a) Tokens are symbolic names for the entities that make up the text of the program; e.g. if for the keyword if, and id for any identifier. These make up the output of the lexical analyser.
(b) A pattern is a rule that specifies when a sequence of characters from the input constitutes a token; e.g. the sequence i, f for the token if, and any sequence of alphanumerics starting with a letter for the token id.

(c) A lexeme is a sequence of characters from the input that match a pattern (and hence constitute an instance of a token); for example if matches the pattern for if, and foo123bar matches the pattern for id.

For example, the following code might result in the table given below.

    program foo(input,output);
    var x:integer;
    begin
      readln(x);
      writeln('value read =',x)
    end.

Lexeme          | Token                     | Pattern
program         | program                   | p,r,o,g,r,a,m
foo             | id (foo)                  | letter followed by seq. of alphanumerics
(               | leftpar                   | a left parenthesis
input           | input                     | i,n,p,u,t
,               | comma                     | a comma
output          | output                    | o,u,t,p,u,t
)               | rightpar                  | a right parenthesis
;               | semicolon                 | a semicolon
var             | var                       | v,a,r
x               | id (x)                    | letter followed by seq. of alphanumerics
:               | colon                     | a colon
integer         | integer                   | i,n,t,e,g,e,r
;               | semicolon                 | a semicolon
begin           | begin                     | b,e,g,i,n
readln          | readln                    | r,e,a,d,l,n
(               | leftpar                   | a left parenthesis
x               | id (x)                    | letter followed by seq. of alphanumerics
)               | rightpar                  | a right parenthesis
;               | semicolon                 | a semicolon
writeln         | writeln                   | w,r,i,t,e,l,n
(               | leftpar                   | a left parenthesis
'value read ='  | literal ('value read =')  | seq. of chars enclosed in quotes
,               | comma                     | a comma
x               | id (x)                    | letter followed by seq. of alphanumerics
)               | rightpar                  | a right parenthesis
end             | end                       | e,n,d
.               | fullstop                  | a fullstop

It is the sequence of tokens in the middle column that are passed as output to the syntax analyser. This token sequence represents almost all the important information from the input program required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often important in separating lexemes, is usually not returned as a token. Also, when outputting an id or literal token, the lexical analyser must also return the value of the matched lexeme (shown in parentheses) or else this information would be lost.

1.3.2 Symbol Table Management

A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures etc.) of a source program together with all the attributes of each identifier. For variables, typical attributes include:
• its type,
• how much memory it occupies,
• its scope.
For procedures and functions, typical attributes include:
• the number and type of each argument (if any),
• the method of passing each argument, and
• the type of value returned (if any).
The purpose of the symbol table is to provide quick and uniform access to identifier attributes throughout the compilation process. Information is usually put into the symbol table during the lexical analysis and/or syntax analysis phases.

1.3.3 Syntax Analysis

A syntax analyser or parser is a program that groups sequences of tokens from the lexical analysis phase into phrases each with an associated phrase type. A phrase is a logical unit with respect to the rules of the source language. For example, consider:

    a := x * y + z

After lexical analysis, this statement has the structure

    id1 assign id2 binop1 id3 binop2 id4

Now, a syntactic rule of Pascal is that there are objects called 'expressions' for which the rules are (essentially):
(1) Any constant or identifier is an expression.
(2) If exp1 and exp2 are expressions then so is exp1 binop exp2.
Taking all the identifiers to be variable names for simplicity, we have:
• by rule (1), exp1 = id2 and exp2 = id3 are both phrases with phrase type 'expression';
• by rule (2), exp3 = exp1 binop1 exp2 is also a phrase with phrase type 'expression';
• by rule (1), exp4 = id4 is a phrase with phrase type 'expression';
• by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type 'expression'.
Of course, Pascal also has a rule that says:

    id assign exp

is a phrase with phrase type 'assignment', and so the Pascal statement above is a phrase of type 'assignment'.

Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax tree is a compacted form of parse tree in which the operators appear as the interior nodes. The construction of a parse tree is a basic activity in compiler-writing. A parse tree for the example Pascal statement is:

                 assignment
               /     |      \
            id1   assign    exp5
                           /  |  \
                       exp3 binop2 exp4
                      /  |   \       |
                  exp1 binop1 exp2  id4
                    |           |
                   id2         id3
. a syntactic rule of Pascal is that there are objects called ‘expressions’ for which the rules are (essentially): (1) Any constant or identiﬁer is an expression. we have: • By rule (1) exp1 = id2 and exp2 = id3 are both phrases with phrase type ‘expression’. The construction of a parse tree is a basic activity in compilerwriting.After lexical analysis. Of course. Parse Trees and Syntax Trees. A parse tree is tree that illustrates the grouping of tokens into phrases. • by rule (2). (2) If exp1 and exp2 are expressions then so is exp1 binop exp2 . exp5 = exp3 binop2 exp4 is a phrase with phrase type ‘expression’.
and a syntax tree is:

         assign
        /      \
      id1     binop2
             /      \
         binop1     id4
         /    \
       id2    id3

Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary. The main criterion is whether the analyser needs recursion or not:
• lexical analysers hardly ever use recursion; they are sometimes called linear analysers since they scan the input in a 'straight line' (from left to right);
• syntax analysers almost always use recursion; this is because phrase types are often defined in terms of themselves (cf. the phrase type 'expression' above).

1.3.4 Semantic Analysis

A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree and a symbol table. Its purpose is to determine if the input has a well-defined meaning; in practice semantic analysers are mainly concerned with type checking and type coercion based on type rules. Typical type rules for expressions and assignments are:

Expression Type Rules. Let exp be an expression.
(a) If exp is a constant then exp is well-typed and its type is the type of the constant.
(b) If exp is a variable then exp is well-typed and its type is the type of the variable.
(c) If exp is an operator applied to further subexpressions such that:
    (i) the operator is applied to the correct number of subexpressions,
    (ii) each subexpression is well-typed and
    (iii) each subexpression is of an appropriate type,
then exp is well-typed and its type is the result type of the operator.

Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed expression of type T2. If
(a) T1 = T2 and
(b) T1 is an assignable type
then var assign exp is a well-typed assignment.

1.3.5 Error Handling

Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On detecting an error the compiler must:
• report the error in a helpful way,
• correct the error if possible, and
• continue processing (if possible) after the error to look for further errors.

Types of Error. Errors are either syntactic or semantic:

Syntax errors are errors in the program text; they may be either lexical or grammatical:
(a) A lexical error is a mistake in a lexeme, for example, typing tehn instead of then, or missing off one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language, for example if x = 7 y := 4 (missing then).

Semantic errors are mistakes concerning the meaning of a program construct; they may be either type errors, logical errors or run-time errors:
(a) Type errors occur when an operator is applied to an argument of the wrong type, or to the wrong number of arguments.
(b) Logical errors occur when a badly conceived program is executed, for example:

    while x = y do ...

when x and y initially have the same value and the body of the loop need not change the value of either x or y.
(c) Run-time errors are errors that can be detected only when the program is executed, for example:

    var x : real;
    readln(x);
    writeln(1/x);

which would produce a run-time error if the user input 0.

Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way). If possible, the compiler should make the appropriate correction(s). Semantic errors are much harder and sometimes impossible for a computer to detect. For example, consider:

    intvar := intvar + realarray

where intvar is stored in the symbol table as being an integer variable, and realarray as an array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect since + is only defined on numbers, whereas its second argument is an array. The semantic analyser checks for such type errors using the parse tree, the symbol table and type rules.

1.3.6 Intermediate Code Generation

After the analysis phases of the compiler have been completed, a source program has been decomposed into a symbol table and a parse tree, both of which may have been modified by the semantic analyser. From this information we begin the process of generating object code according to either of two approaches:
(1) generate code for a specific machine, or
(2) generate code for a 'general' or abstract machine, then use further translators to turn the abstract code into code for specific machines.
Approach (2) is more modular and efficient provided the abstract machine language is simple enough to be:
(a) produced and analysed (in the optimisation phase), and
(b) easily translated into the required language(s).
One of the most widely used intermediate languages is Three-Address Code (TAC).

TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some common TAC instructions include:
(i) var1 := var2 binop var3
(ii) var1 := unop var2
(iii) var1 := num
(iv) goto label
(v) if var1 relop var2 goto label

There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will use only the above for the following discussion.

Syntax-Directed Code Generation. In essence, code is generated by recursively walking through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code generation. For example, consider the code fragment:

    z := x * y + x

and its syntax tree (with lexemes replacing tokens):
          :=
         /  \
        z    +
            / \
           *   x
          / \
         x   y
We use this tree to direct the compilation into TAC as follows. At the root of the tree we see an assignment whose right-hand side is an expression, and this expression is the sum of two quantities. Assume that we can produce TAC code that computes the value of the first and second summands and stores these values in temp1 and temp2 respectively. Then the appropriate TAC for the assignment statement is just

    z := temp1 + temp2

Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive way. For temp1 we see that it is the product of two quantities. Assume that we can produce TAC code that computes the value of the first and second multiplicands and stores these values in temp3 and temp4 respectively. Then the appropriate TAC for computing temp1 is

    temp1 := temp3 * temp4

Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus the TAC code
    temp3 := x

is sufficient. Next we come to temp4 and, similar to temp3, the appropriate code is

    temp4 := y

Finally, considering temp2, of course

    temp2 := x

suffices. Each code fragment is output when we leave the corresponding node; this results in the final program:

    temp3 := x
    temp4 := y
    temp1 := temp3 * temp4
    temp2 := x
    z := temp1 + temp2

Comment. Notice how a compound expression has been broken down and translated into a sequence of very simple instructions, and furthermore, the process of producing the TAC code was uniform and simple. Some redundancy has been brought into the TAC code but this can be removed (along with redundancy that is not due to the TAC-generation) in the optimisation phase.
1.3.7 Code Optimisation

An optimiser attempts to improve the time and space requirements of a program. There are many ways in which code can be optimised, but most are expensive in terms of time and space to implement. Common optimisations include:
• removing redundant identifiers,
• removing unreachable sections of code,
• identifying common subexpressions,
• unfolding loops and
• eliminating procedures.
Note that here we are concerned with the general optimisation of abstract code.

Example. Consider the TAC code:

        temp1 := x
        temp2 := temp1
        if temp1 = temp2 goto 200
        temp3 := temp1 * y
        goto 300
    200 temp3 := z
    300 temp4 := temp2 + temp3

Removing redundant identifiers (just temp2) gives

        temp1 := x
        if temp1 = temp1 goto 200
        temp3 := temp1 * y
        goto 300
    200 temp3 := z
    300 temp4 := temp1 + temp3

Removing redundant code gives

        temp1 := x
    200 temp3 := z
    300 temp4 := temp1 + temp3

Notes. Attempting to find a 'best' optimisation is expensive for the following reasons:
• A given optimisation technique may have to be applied repeatedly until no further optimisation can be obtained. (For example, removing one redundant identifier may introduce another.)
• A given optimisation technique may give rise to other forms of redundancy and thus sequences of optimisation techniques may have to be repeated. (For example, above we removed a redundant identifier and this gave rise to redundant code, but removing redundant code may lead to further redundant identifiers.)
• The order in which optimisations are applied may be significant. (How many ways are there of applying n optimisation techniques to a given piece of code?)
1.3.8 Code Generation

The final phase of the compiler is to generate code for a specific machine. In this phase we consider:
• memory management,
• register assignment and
• machine-specific optimisation.
The output from this phase is usually assembly language or relocatable machine code.

Example. The TAC code above could typically result in the ARM assembly program shown below. Note that the example illustrates a mechanical translation of TAC into ARM; it is not intended to illustrate compact ARM programming!

    .prog   MOV  R12,#temp       R12 = base address
            MOV  R0,#x           R0 = address of x
            LDR  R1,[R0]         R1 = value of x
            STR  R1,[R12]        store R1 at R12
            MOV  R0,#z           R0 = address of z
            LDR  R1,[R0]         R1 = value of z
            STR  R1,[R12,#4]     store R1 at R12+4
            LDR  R1,[R12]        R1 = value of temp1
            LDR  R2,[R12,#4]     R2 = value of temp3
            ADD  R3,R1,R2        add temp1 to temp3
            STR  R3,[R12,#8]     store R3 at R12+8

    .x      EQUD 0               four bytes for x
    .z      EQUD 0               four bytes for z
    .temp   EQUD 0               four bytes each for
            EQUD 0               temp1, temp3
            EQUD 0               and temp4
2 Languages

This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

In this section we introduce the formal notion of a language, and the basic problem of recognising strings from a language. These are central concepts that we will use throughout the remainder of the course.

2.1 Basic Definitions

• An alphabet Σ is a finite nonempty set (of symbols).
• A string or word over an alphabet Σ is a finite concatenation (or juxtaposition) of symbols from Σ.
• The length of a string w (that is, the number of characters comprising it) is denoted |w|.
• The empty or null string is denoted ε. (That is, ε is the unique string satisfying |ε| = 0.)
• For a symbol or word x, x^n denotes x concatenated with itself n times, with the convention that x^0 denotes ε.
• The set of all strings over Σ is denoted Σ*.
• For each n ≥ 0 we define Σ^n = {w ∈ Σ* : |w| = n}.
• We define Σ+ = ∪_{n≥1} Σ^n. (Thus Σ* = Σ+ ∪ {ε}.)
• A language over Σ is a set L ⊆ Σ*.
• Two languages L1 and L2 over a common alphabet Σ are equal if they are equal as sets. Thus L1 = L2 if, and only if, L1 ⊆ L2 and L2 ⊆ L1.

2.2 Decidability

Given a language L over some alphabet Σ, a basic question is: For each possible word w ∈ Σ*, can we effectively decide if w is a member of L or not? We call this the decision problem for L. Note the use of the word 'effectively': this implies the mechanism by which we decide on membership (or non-membership) must be a finitistic, deterministic and mechanical procedure
that can be carried out by some form of computing agent.

Decidability is based on the notion of an 'algorithm'. In standard theoretical computer science this is taken to mean a Turing Machine. Note, this is an abstract, but extremely low-level, model of computation that is equivalent to a digital computer with an infinite memory. Thus it is sufficient in practice to use a more convenient model of computation, such as Pascal programs, provided that any decidability arguments we make assume an infinite memory.

More precisely then, a language L ⊆ Σ* is said to be decidable if there exists an algorithm such that for every w ∈ Σ*
(1) the algorithm terminates with output 'Yes' when w ∈ L and
(2) the algorithm terminates with output 'No' when w ∉ L.
If no such algorithm exists then L is said to be undecidable.

Note, it is not sufficient to be only able to decide when words are members of L; the decision problem asks if a given word is a member of L or not.

Example. Let Σ = {0, 1} be an alphabet. Let L be the (infinite) language L = {w ∈ Σ* | w = 0^n 1 for some n}. Does the program below solve the decision problem for L?

    read( char );
    if char = END_OF_STRING then
        print( "No" )
    else /* char must be '0' or '1' */
        while char = '0' do read( char ) od;
        /* char must be '1' or END_OF_STRING */
        if char = '1' then print( "Yes" )
        else print( "No" )
        fi
    fi

Answer: ...
Basic Facts
(1) Every ﬁnite language is decidable. this is an abstract. Let L be the (inﬁnite) language L = {w ∈ Σ∗  w = 0n 1for some n}. Example. it is not suﬃcient to be only able to decide when words are members of L. More precisely then. Note. Does the program below solve the decision problem for L? read( char ). Also note the decision problem asks if a given word is a member of L or not. In standard theoretical computer science this is taken to mean a Turing Machine.
2. Thus it is suﬃcient in practice to use a more convenient model of computation such as Pascal programs provided that any decidability arguments we make assume an inﬁnite memory. but extremely lowlevel model of computation that is equivalent to a digital computer with an inﬁnite memory.) 17
. a language L ⊆ Σ∗ is said to be decidable if there exists an algorithm such that for every w ∈ Σ∗ (1) the algorithm terminates with output ‘Yes’ when w ∈ L and (2) the algorithm terminates with output ‘No’ when w ∈ L. 1} be an alphabet. /* char must be ’1’ or END_OF_STRING */ if char = ’1’ then print( "Yes" ) else print( "No" ) fi fi Answer: . that is.
and • programming languages’ lexical patterns are speciﬁed using regular expressions.(2) Not every inﬁnite language is undecidable. • programming language syntax is speciﬁed using contextfree grammars.
18
. Regular languages and their relationship to lexical analysis are the subjects of the next section. • every regular language is decidable. (3) Programming languages are (usually) inﬁnite but (always) decidable. and lexical analysers are (essentially) DFAs. ContextFree Languages. The signiﬁcant aspects of regular languages are: • they are deﬁned by ‘patterns’ called regular expressions. • every contextfree language is decidable. Contextfree languages and their relationship to syntax analysis are the subjects of sections 4 and 5. • the decision problem for any contextfree language of interest to us is solved by a deterministic pushdown automaton (DPDA). Regular Languages. and (most) parsers are (essentially) DPDAs. (Why?)
2. • the decision problem for any regular language is solved by a deterministic ﬁnite state automaton (DFA).4
Applications to Compilation
Languages may be classiﬁed by the means in which they are deﬁned. The signiﬁcant aspects of contextfree languages are: • they are deﬁned by ‘rules’ called contextfree grammars. Of interest to us are regular languages and contextfree languages.
3 Lexical Analysis

In this section we study some theoretical concepts with respect to the class of regular languages and apply these concepts to the practical problem of lexical analysis. Firstly, in Section 3.1, we define the notion of a regular expression and show how regular expressions determine regular languages. We then, in Section 3.2, introduce deterministic finite automata (DFAs), the class of algorithms that solve the decision problems for regular languages. We show how regular expressions and DFAs can be used to specify and implement lexical analysers in Section 3.3, and in Section 3.4 we take a brief look at Lex, a popular lexical analyser generator built upon the theory of regular expressions and DFAs.

This section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

3.1 Regular Expressions

Recall from the Introduction that a lexical analyser uses pattern matching with respect to rules associated with the source language's tokens. For example, the token then is associated with the pattern t, h, e, n, and the token id might be associated with the pattern "an alphabetic character followed by any number of alphanumeric characters". The notation of regular expressions is a mathematical formalism ideal for expressing patterns such as these, and thus ideal for expressing the lexical structure of programming languages.

3.1.1 Definition

Regular expressions represent patterns of strings of symbols. A regular expression r matches a set of strings over an alphabet. This set is denoted L(r) and is called the language determined or generated by r.

Let Σ be an alphabet. We define the set RE(Σ) of regular expressions over Σ, the strings they match and thus the languages they determine, as follows:
• ∅ ∈ RE(Σ) matches no strings. The language determined is L(∅) = ∅.
• ε ∈ RE(Σ) matches only the empty string. Therefore L(ε) = {ε}.
• If a ∈ Σ then a ∈ RE(Σ) matches the string a. Therefore L(a) = {a}.
• If r and s are in RE(Σ) and determine the languages L(r) and L(s) respectively, then:
  - r|s ∈ RE(Σ) matches all strings matched either by r or by s. Therefore L(r|s) = L(r) ∪ L(s).
  - rs ∈ RE(Σ) matches any string that is the concatenation of two strings, the first matching r and the second matching s. Therefore the language determined is L(rs) = L(r)L(s) = {uv ∈ Σ* : u ∈ L(r) and v ∈ L(s)}. (Given two sets S1 and S2 of strings, the notation S1S2 denotes the set of all strings formed by appending members of S1 with members of S2.)
  - r* ∈ RE(Σ) matches all finite concatenations of strings which all match r. The language determined is thus

    L(r*) = (L(r))* = ∪_{i∈N} (L(r))^i = {ε} ∪ L(r) ∪ L(r)L(r) ∪ ···

3.1.2 Regular Languages

Let L be a language over Σ. L is said to be a regular language if L = L(r) for some r ∈ RE(Σ).

3.1.3 Notation

• We need to use parentheses to override the convention concerning the precedence of the operators. The normal convention is: * is higher than concatenation, which is higher than |. Thus, for example, a|bc* is a|(b(c*)).
• We write r+ for rr*.
• We write r? for ε|r.
• We write r^n as an abbreviation for r...r (n times r), with r^0 denoting ε.

3.1.4 Lemma

Writing 'r = s' to mean 'L(r) = L(s)', the following identities hold for all r, s, t ∈ RE(Σ):
• r|s = s|r (| is commutative)
• (r|s)|t = r|(s|t) (| is associative)
• (rs)t = r(st) (concatenation is associative)
• r(s|t) = rs|rt and (r|s)t = rt|st (concatenation distributes over |)
• ∅r = r∅ = ∅
• ∅|r = r
• ∅* = ε
• r?* = r*
• r** = r*
• (r*s*)* = (r|s)*
• εr = rε = r.

3.1.5 Regular definitions

It is often useful to give names to complex regular expressions, and to use these names in place of the expressions they represent. Given an alphabet comprising all ASCII characters,

    letter = A|B|···|Z|a|b|···|z
    digit  = 0|1|···|9
    ident  = letter(letter|digit)*

are examples of regular definitions for letters, digits and identifiers.

3.1.6 The Decision Problem for Regular Languages

For every regular expression r ∈ RE(Σ) there exists a string-processing machine M = M(r) such that for every w ∈ Σ*, when input to M:
(1) if w ∈ L(r) then M terminates with output 'Yes',
and
(2) if w ∉ L(r) then M terminates with output 'No'.
Thus, every regular language is decidable. The machines in question are Deterministic Finite State Automata.

3.2 Deterministic Finite State Automata

In this section we define the notion of a DFA without reference to its application in lexical analysis. Here we are interested purely in solving the decision problem for regular languages, that is, defining machines that say "yes" or "no" given an inputted string, depending on its membership of a particular language. In Section 3.3 we use DFAs as the basis for lexical analysers: pattern matching algorithms that output sequences of tokens.

3.2.1 Definition
A deterministic finite state automaton (or DFA) M is a 5-tuple M = (Q, Σ, δ, q0, F) where
• Q is a finite nonempty set of states,
• Σ is an alphabet,
• δ : Q × Σ → Q is the transition or next-state function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is the set of accepting or final states.

The idea behind a DFA M is that it is an abstract machine that defines a language L(M) ⊆ Σ* in the following way:
• The machine begins in its start state q0.
• Given a string w ∈ Σ* the machine reads the symbols of w one at a time from left to right.
• Each symbol causes the machine to make a transition from its current state to a new state: if the current state is q and the input symbol is a, then the new state is δ(q, a).
• The machine terminates when all the symbols of the string have been read.
• If, when the machine terminates, its state is a member of F, then the machine accepts w, else it rejects w.

We formalise this idea as follows:

Definition. Let M = (Q, Σ, δ, q0, F) be a DFA. We define δ̂ : Q × Σ* → Q by

    δ̂(q, ε) = q for each q ∈ Q, and
    δ̂(q, aw) = δ̂(δ(q, a), w) for each q ∈ Q, a ∈ Σ and w ∈ Σ*.

We define the language of M by

    L(M) = {w ∈ Σ* | δ̂(q0, w) ∈ F}.

Note the name 'final state' is not a good one, since a DFA does not terminate as soon as a 'final state' has been entered: the DFA only terminates when all the input has been read.
.3. then the machine accepts w. F ) where • Q is a ﬁnite nonempty set of states. Σ. We deﬁne δ : Q × Σ∗ → Q by ˆ δ(q. if the current state is q and the input symbol is a. • δ : Q × Σ → Q is the transition or nextstate function. a ∈ Σ and w ∈ Σ∗ . • The machine terminates when all the symbols of the string have been read. Note the name ‘ﬁnal state’ is not a good one since a DFA does not terminate as soon as a ‘ﬁnal state’ has been entered. q0 . aw) = δ(δ(q. F ) be a DFA. when the machine terminates. We formalise this idea as follows: ˆ Deﬁnition.
. 4}. . b}∗  δ(1. 2.2. w) ∈ F } ˆ {w ∈ {a. b) = 4 δ(3. q0 . . w) = 4} {ab. . b) = 3 δ(4.2
Transition Diagrams
DFAs are best understood by depicting them as transition diagrams. A transition diagram for a DFA is drawn as follows: (1) Draw a node labelled ‘q’ for each state q ∈ Q. b}∗  δ(1. . Simpliﬁcations to transition diagrams. a) = 3 δ(2. b) = 3 δ(2. abn . 23
. Let M3 be obtained from M1 by changing F to {3}. (4) Indicate each ﬁnal state by drawing a concentric circle around its node to form a double circle. . a).
Let M2 be obtained from M1 by adding states 1 and 2 to F . . these are directed graphs with nodes representing states and labelled arcs between states representing transitions. (2) For every q ∈ Q and every a ∈ Σ draw an arc labelled a from node q to node δ(q. F = {4} and where δ is given by: δ(1. abb. b) = 4.
3. a) = 2 δ(1.3
Examples
Let M1 = (Q, Σ, δ, q0, F) where Q = {1, 2, 3, 4}, Σ = {a, b}, q0 = 1, F = {4} and where δ is given by:

    δ(1,a) = 2    δ(1,b) = 3
    δ(2,a) = 3    δ(2,b) = 4
    δ(3,a) = 3    δ(3,b) = 3
    δ(4,a) = 3    δ(4,b) = 4

From the transition diagram for M1 it is clear that:

    L(M1) = {w ∈ {a,b}* | δ̂(1, w) ∈ F}
          = {w ∈ {a,b}* | δ̂(1, w) = 4}
          = {ab, abb, abbb, ..., ab^n, ...}
          = L(ab+).

Let M2 be obtained from M1 by adding states 1 and 2 to F. Then L(M2) = L(ε|ab*).

Let M3 be obtained from M1 by changing F to {3}. Then L(M3) = L((b|aa|abb*a)(a|b)*).
3. in an identiﬁer recognition DFA. which results in a cluttered transition diagram. The signiﬁcance of the Equivalence Theorem is that its proof is constructive. The standard algorithm for constructing a DFA from a given regular expression is not diﬃcult. In such a case it is convenient to apply the convention that any apparently missing transitions are transitions to the error state. Thus. since if we cannot ﬁnd a regular expression directly. . See J. there may be 52 arcs labelled with each of the lower. c. there is an algorithm that. (2) For every DFA M with alphabet Σ there exists an r ∈ RE(Σ) such that L(r) = L(M). • It is also common for there to be a large number of transitions between two given states in a DFA. 1979). Z} ⊆ Σ and to replace the arcs by a single arc labelled by the name of this set. For example.E. Part (2) of the Equivalence Theorem is a useful tool for showing that a language is regular. . Languages. z. part (2) states that it is suﬃcient to ﬁnd a DFA that recognises the language. if we can write a fragment of a programming language syntax in terms of regular expressions. Proof. that is.• It is often the case that a DFA has an ‘error state’. letter = {a. It is convenient in such cases to deﬁne a set comprising the labels of each of these arcs. C. NFAs are equivalent in power to DFAs but are slightly harder to understand (see the course text for details). Given a regular expression. A.and uppercase letters from the start state to a state representing that a single letter has been recognised. for example. 24
.4
Equivalence Theorem
(1) For every r ∈ RE(Σ) there exists a DFA M with alphabet Σ such that L(M) = L(r).
(2) For every DFA M with alphabet Σ there exists an r ∈ RE(Σ) such that L(r) = L(M).

Proof. The standard algorithm for constructing a DFA from a given regular expression is not difficult, but would require that we also take a look at nondeterministic finite state automata (NFAs). NFAs are equivalent in power to DFAs but are slightly harder to understand (see the course text for details). Given a regular expression, the RE-to-DFA algorithm first constructs an NFA equivalent to the RE (by a method known as Thompson's Construction), and then transforms the NFA into an equivalent DFA (by a method known as the Subset Construction). See J.E. Hopcroft and J.D. Ullman, Introduction to Automata Theory, Languages, and Computation (Addison Wesley, 1979).

Applications. The significance of the Equivalence Theorem is that its proof is constructive: there is an algorithm that, given a regular expression r, builds a DFA M such that L(M) = L(r). Thus, if we can write a fragment of a programming language syntax in terms of regular expressions, then by part (1) of the Theorem we can automatically construct a lexical analyser for that fragment. Part (2) of the Equivalence Theorem is a useful tool for showing that a language is regular, since if we cannot find a regular expression directly, part (2) states that it is sufficient to find a DFA that recognises the language.
3.3 DFAs for Lexical Analysis

Let's suppose we wish to construct a lexical analyser based on a DFA. We have seen that it is easy to construct a DFA that recognises lexemes for a given programming language token (e.g. for identifiers, for individual keywords, and for numbers). However, a lexical analyser has to deal with all of a programming language's lexical patterns, and has to repeatedly match sequences of characters against these patterns and output corresponding tokens. We illustrate how lexical analysers may be constructed using DFAs by means of an example.

3.3.1 An example DFA lexical analyser

Consider first writing a DFA for recognising tokens for a (minimal!) language with identifiers and the symbols +, ; and := (we'll add keywords later). A transition diagram for a DFA that recognises these symbols is given by:

[Transition diagram omitted. From state start: whitespace loops back to start; a letter leads to in_ident, which loops on letters and digits and leaves on other (reread) to done, returning identifier (or keyword); ':' leads to in_assign, from which '=' leads to done returning assign, and other leads to done returning error; '+' leads to done returning plus; ';' leads to done returning semi_colon; end_of_file leads to done returning end_of_input; any other character leads to done returning error.]

The (lexical analyser based on the) DFA begins in state start, and returns a token when it enters state done; the token returned depends on the final transition it takes to enter the done state, and is shown on the right hand side of the diagram. The token can be either identifier (for identifiers), plus (for +), semi colon (for ;), assign (for :=), error (if an error occurs, for example if an invalid character such as '(' is read or if a : is not followed by a =) and end of input (for when the complete input file has been read).
For a state with an output arc labelled other, the intuition is that this transition is made on reading any character except those labelled on the state's other arcs. Further, reread denotes that the read character should not be consumed; it is reread as the first character when get next token is next called.

Let's add the keywords if, then, else and fi to the language to make it slightly more realistic. Notice that adding keyword recognition to the above DFA would be tricky to do by hand and would lead to a complex DFA (why?). However, we can recognise keywords as identifiers using the above DFA, and when the accepting state for identifiers is entered the lexeme stored in the buffer can be checked against a table of keywords. If there is a match, then the appropriate keyword token is output, else the identifier token is output. Further, when we have recognised an identifier we will also wish to output its string value as well as the identifier token, so that this can be used by the next phase of compilation.

3.3.2 Code for the example DFA lexical analyser

Let's consider how code may be written based on the above DFA. Firstly, we define enumerated types for the sets of states and tokens:

    state = (start, in_identifier, in_assign, done);
    token = (k_if, k_then, k_else, k_fi, identifier, plus,
             semi_colon, assign, error, end_of_input);

Next, we define some variables that are shared by the lexical analyser and the syntax analyser:

    current_character  : char;
    current_token      : token;
    current_identifier : string[100];
    reread_character   : boolean;

The current character variable holds the value of the last character read by get next token. The value of the Boolean variable reread character determines whether the last character read during the previous execution of get next token should be reread at the beginning of its next execution. The job of the procedure get next token is to set the value of current token to the next token, and if this token is identifier it also sets the value of current identifier to the current lexeme.

We also need the following auxiliary functions (their implementations are omitted here) with the obvious interpretations:

    function is_alpha(c : char) : boolean;
begin keyword_lookup := identifier. Notice that within the main loop a case statement is used to deal with transitions from the current state. keyword lookup is used to check whether or not a keyword has been matched. return this keyword’s token.NUM_KEYWORDS] of token = (k_if. Finally. The get next token procedure is implemented as follows.
27
. found : boolean.’else’. token_tab : array[1. k_else. index := index + 1 end end. found := TRUE end. found := FALSE. else return the identifier token} var index : integer. while (index <= NUM_KEYWORDS) and (not found) do begin if keyword_tab[index] = s then begin keyword_lookup := token_tab[index]. we deﬁne two constant arrays: { Constants used to recognise keyword matches } NUM_KEYWORDS = 4. Notice that the arrays and function are easily modiﬁed for any number of keywords appearing in our source language.function is_digit(c : char) : boolean. ’fi’). After the loop exits (when the done state is entered). function keyword_lookup(s : string) : token. k_then. {If s is a keyword. keyword_tab : array[1. function is_white_space(c : char) : boolean. index := 1. if an identifier has been recognised. k_fi). that store keyword tokens and keywords (with associated keywords and tokens stored at the same location each array) and a function that searches the keyword array for a string and returns the token associated with a matched keyword or the token identifier if not.NUM_KEYWORDS] of string = (’if’.. ’then’..
procedure get_next_token;
{Sets the value of current_token by matching input characters. Also,
 sets the values of current_identifier and reread_character if appropriate}
var current_state : state;
    no_more_input : boolean;
begin
  current_state := start;
  current_identifier := '';
  while not (current_state = done) do
  begin
    no_more_input := eof; {Check whether at end of file}
    if not (reread_character or no_more_input) then
      read(current_character);
    reread_character := FALSE;
    case current_state of
      start:
        if no_more_input then
        begin
          current_token := end_of_input;
          current_state := done
        end
        else if is_white_space(current_character) then
          current_state := start
        else if is_alpha(current_character) then
        begin
          current_identifier := current_identifier + current_character;
          current_state := in_identifier
        end
        else
          case current_character of
            ';' : begin
                    current_token := semi_colon;
                    current_state := done
                  end;
            '+' : begin
                    current_token := plus;
                    current_state := done
                  end;
            ':' : current_state := in_assign
          else
            begin
              current_token := error;
              current_state := done
            end
          end; {case}
      in_identifier:
        if (no_more_input or
            not (is_alpha(current_character) or is_digit(current_character))) then
        begin
          current_token := identifier;
          current_state := done;
          reread_character := true
        end
        else
          current_identifier := current_identifier + current_character;
      in_assign:
        if no_more_input or (current_character <> '=') then
        begin
          current_token := error;
          current_state := done
        end
        else
        begin
          current_token := assign;
          current_state := done
        end
    end; {case}
  end; {while}
  if (current_token = identifier) then
    current_token := keyword_lookup(current_identifier)
end;
Test code (in the absence of a syntax analyser) might be the following. This just repeatedly calls get_next_token until the end of the input file has been reached, and prints out the value of the read token.

{Request tokens from lexical analyser, outputting their values,
 until end_of_input}
begin
  reread_character := false;
  repeat
    get_next_token;
    writeln('Current Token is ', token_to_text(current_token));
    if (current_token = identifier) then
      writeln('Identifier is ', current_identifier)
  until (current_token = end_of_input)
end.

where function token_to_text(t : token) : string; converts token values to text.
3.4 Lex
Lex is a widely available lexical analyser generator.
3.4.1 Overview
Given a Lex source file comprising regular expressions for various tokens, Lex generates a lexical analyser (based on a DFA), written in C, that groups characters matching the expressions into lexemes, and can return their corresponding tokens. In essence, a Lex file comprises a number of lines typically of the form:

pattern   action

where pattern is a regular expression and action is a piece of C code. When run on a Lex file, Lex produces a C file called lex.yy.c (a lexical analyser). When compiled, lex.yy.c takes a stream of characters as input, and whenever a sequence of characters matches a given regular expression the corresponding action is executed. Characters not matching any regular expressions are simply copied to the output stream.

Example. Consider the Lex fragment:

a   { printf( "read 'a'\n" ); }
b   { printf( "read 'b'\n" ); }

After compiling (see below on how to do this) we obtain a binary executable which, when executed on the input:

sdfghjklaghjbfghjkbbdfghjk
dfghjkaghjklaghjk

produces:

sdfghjklread 'a'
ghjread 'b'
fghjkread 'b'
read 'b'
dfghjk
dfghjkread 'a'
ghjklread 'a'
ghjk
Example. Consider the Lex program:

%{
int abc_count, xyz_count;
%}
%%
ab[cC]  {abc_count++; }
xyz     {xyz_count++; }
\n      { ; }
.       { ; }
%%
main()
{
  abc_count = xyz_count = 0;
  yylex();
  printf( "%d occurrences of abc or abC\n", abc_count );
  printf( "%d occurrences of xyz\n", xyz_count );
}

This file first declares two global variables for counting the number of occurrences of abc or abC and xyz. All such global declaration code is written in C and surrounded by %{ and %}. Next come the regular expressions for these lexemes and actions to increment the relevant counters. Finally, there is a main routine to initialise the counters and call yylex().

When executed on input:

akhabfabcdbcaxyzXyzabChsdk
dfhslkdxyzabcabCdkkjxyzkdf

the lexical analyser produces:

4 occurrences of 'abc' or 'abC'
3 occurrences of 'xyz'

Some features of Lex illustrated by this example are:

(1) the notation [cC], which matches either c or C;
(2) the regular expression \n, which matches a newline;
(3) the regular expression ., which matches any character except a newline; and
(4) the action { ; }, which does nothing except to suppress printing.

3.4.2 Format of Lex Files

The format of a Lex file is:

definitions
analyser specification
auxiliary functions

Lex Definitions. The (optional) definitions section comprises macros (see below) and global declarations of types, variables and functions to be used in the actions of the lexical analyser and the auxiliary functions (if present). Macros are abbreviations for regular expressions to be used in the analyser specification. For example, the token 'identifier' could be defined by:

IDENTIFIER [a-zA-Z][a-zA-Z0-9]*

The shorthand character range construction '[x-y]' matches any of the characters between (and including) x and y; for example, [a-c] means the same as [abc], and [a-cA-C] means the same as [abcABC]. Definitions may use other definitions (enclosed in braces) as illustrated in:

ALPHA      [a-zA-Z]
ALPHANUM   [a-zA-Z0-9]
IDENTIFIER {ALPHA}{ALPHANUM}*

and:

ALPHA      [a-zA-Z]
NUM        [0-9]
ALPHANUM   ({ALPHA}|{NUM})
IDENTIFIER {ALPHA}{ALPHANUM}*

Notice the use of parentheses in the definition of ALPHANUM. What would happen without them?
Lex Analyser Specifications. These have the form:

r1 { action1 }
r2 { action2 }
...
rn { actionn }

where r1, r2, ..., rn are regular expressions (possibly involving macros enclosed in braces) and action1, action2, ..., actionn are sequences of C statements. Lex translates the specification into a function yylex() which, when called, causes the following to happen:

• The current input character(s) are scanned to look for a match with the regular expressions.
• If there is no match, the current character is printed out, and the scanning process resumes with the next character.
• If the next m characters match ri then (a) the matching characters are assigned to the string variable yytext, (b) the integer variable yyleng is assigned the value m, (c) the next m characters are skipped, and (d) actioni is executed. If the last instruction of actioni is return n (where n is an integer expression) then the call to yylex() terminates and the value of n is returned as the function's value; otherwise yylex() resumes the scanning process.
• If there is a match against two or more regular expressions, then the expression giving the longest lexeme is chosen; if all lexemes are of the same length then the first matching expression is chosen.
• If end-of-file is read at any stage, then the call to yylex() terminates returning the value 0.

Lex Auxiliary Functions. This optional section has the form:

fun1
fun2
...
funn

where each funi is a complete C function. We can compile lex.yy.c with the lex library using the command:

gcc lex.yy.c -ll
This has the effect of automatically including a standard main() function, equivalent to:

main()
{
  yylex();
}

Thus in the absence of any return statements in the analyser's actions, this one call to yylex() consumes all the input up to and including end-of-file.

3.4.3 Lexical Analyser Example

The Lex program below illustrates how a lexical analyser for a Pascal-type language is defined. We assume that the syntax analyser requests tokens by repeatedly calling the function yylex(). The global variable yylval (of type integer in this example) is generally used to pass tokens' attributes from the lexical analyser to the syntax analyser and is shared by both phases of the compiler. Here it is being used to pass integers' values and identifiers' symbol table positions to the syntax analyser. Notice that the regular expression for identifiers is placed at the end of the list (why?).
35
. } { . } { return(IF).. return(ID).. %} delim ws letter digit id integer %% {ws} if then else . } { return(ELSE). THEN. INTEGER. } { yylval = InstallInTable(). } .... .
4 Syntax Analysis

In this section we will look at the second phase of compilation: syntax analysis, or parsing. Since parsing is a central concept in compilation and because (unlike lexical analysis) there are many approaches to parsing, this section makes up most of the remainder of the course. In Section 4.1 we discuss the class of context-free languages and its relationship to the syntactic structure of programming languages and the compilation process. Parsing algorithms for context-free languages fall into two main categories: top-down and bottom-up parsers (the names refer to the process of parse tree construction). Different types of top-down and bottom-up parsing algorithms will be discussed in Sections 4.2 and 4.3 respectively.

4.1 Context-Free Languages

Regular languages are inadequate for specifying all but the simplest aspects of programming language syntax. To specify more-complex languages such as

• L = {w ∈ {a, b}∗ | w = aⁿbⁿ for some n},
• L = {w ∈ {(, )}∗ | w is a well-balanced string of parentheses} and
• the syntax of most programming languages,

we use context-free languages. In this section we define context-free grammars and languages, and their use in describing the syntax of programming languages. This section is intended to provide a foundation for the following sections on parsing and parser construction. Note. Like Section 3, this section contains mainly theoretical definitions; the lectures will cover examples and diagrams illustrating the theory.

4.1.1 Context-Free Grammars

Definition. A context-free grammar is a tuple G = (T, N, S, P) where:

• T is a finite nonempty set of (terminal) symbols (tokens);
• N is a finite nonempty set of (nonterminal) symbols (denoting phrase types) disjoint from T;
• S ∈ N (the start symbol); and
• P is a set of (context-free) productions (denoting 'rules' for phrase types) of the form A → α where A ∈ N and α ∈ (T ∪ N)∗.
Notation. In what follows we use:

• a, b, c, ... for members of T;
• u, v, w, ... for members of T∗;
• A, B, C, ... for members of N;
• X, Y, Z for members of T ∪ N; and
• α, β, γ, ... for members of (T ∪ N)∗.

Examples.

(1) G1 = (T, N, S, P) where
• T = {a, b},
• N = {S} and
• P = {S → ab, S → aSb}.

(2) G2 = (T, N, S, P) where
• T = {a, b},
• N = {S, X} and
• P = {S → X, S → aa, S → bb, S → aSa, S → bSb, X → ab, X → a, X → b}.

Notation. It is customary to 'define' a context-free grammar by simply listing its productions and assuming:

• The terminals and nonterminals of the grammar are exactly those appearing in the productions. (It is usually clear from the context whether a symbol is a terminal or nonterminal.)
• The start symbol is the nonterminal on the left-hand side of the first production.
• Right-hand sides separated by '|' indicate alternatives.

For example, G2 above can be written as

S → X | aa | bb | aSa | bSb
X → ab | a | b
BNF Notation. The definition of a grammar as given so far is fine for simple theoretical grammars. For grammars of real programming languages, a popular notation for context-free grammars is Backus-Naur Form (BNF). Here, nonterminal symbols are given a descriptive name and placed in angled brackets, and the alphabet will also commonly include keywords and language symbols such as <=. For example, BNF for simple arithmetic expressions might be

<exp>    → <exp> + <exp> | <exp> − <exp>
         | <exp> ∗ <exp> | <exp> / <exp>
         | ( <exp> ) | <number>
<number> → <digit> | <digit> <number>
<digit>  → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

and a typical rule for while loops would be:

<while loop>   → while <exp> do <command list> od
<command list> → <assignment> | <conditional> | . . .

Note that the symbol ::= is often used in place of →, and '|' is used to indicate alternatives.

Derivation and Languages. A (context-free) grammar G defines a language L(G) in the following way:

• We say αAβ immediately derives αγβ, in symbols αAβ =⇒ αγβ, if A → γ is a production of G.
• We say α derives β in zero or more steps, in symbols α =⇒∗ β, if (i) α = β, or (ii) for some γ we have α =⇒ γ and γ =⇒∗ β.
• We say α derives β in one or more steps, in symbols α =⇒+ β, if α =⇒∗ β and α ≠ β.
• If S =⇒∗ α then we say α is a sentential form.
• We define L(G) by L(G) = {w ∈ T∗ | S =⇒∗ w}.
• Let L ⊆ T∗. L is said to be a context-free language if L = L(G) for some context-free grammar G.
4.1.2 Regular Grammars

A grammar is said to be regular if all its productions are of the form:

(a) A → uB or A → u (a right-linear grammar); or
(b) A → Bu or A → u (a left-linear grammar).

A language L ⊆ T∗ is regular (see Section 3) if L = L(G) for some regular grammar G.

4.1.3 ǫ-Productions

A grammar may involve a production of the form A → (that is, with an empty right-hand side). This is called an ǫ-production and is often written A → ǫ. As an example of the use of ǫ-productions,

<stmt> → begin <stmt> end <stmt> | ǫ

generates well-balanced 'begin/end' strings (and the empty string). ǫ-productions are often useful in specifying languages, although they can cause difficulties (ambiguities) and may need to be removed; see later.

4.1.4 Left- and Rightmost Derivations

A derivation in which the leftmost (rightmost) nonterminal is replaced at each step in the derivation is called a leftmost (rightmost) derivation.

Examples. Let G be a grammar with productions E → E + E | E ∗ E | x | y | z.

(a) A leftmost derivation of x + y ∗ z is E =⇒ E + E =⇒ x + E =⇒ x + E ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z.

(b) A rightmost derivation of x + y ∗ z is E =⇒ E + E =⇒ E + E ∗ E =⇒ E + E ∗ z =⇒ E + y ∗ z =⇒ x + y ∗ z.
(c) Another leftmost derivation of x + y ∗ z is E =⇒ E ∗ E =⇒ E + E ∗ E =⇒ x + E ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z (thus left- and rightmost derivations need not be unique).

(d) The following is neither left- nor rightmost: E =⇒ E + E =⇒ E + E ∗ E =⇒ E + y ∗ E =⇒ x + y ∗ E =⇒ x + y ∗ z.

4.1.5 Parse Trees

Let G = (T, N, S, P) be a context-free grammar.

(1) A parse (or derivation) tree in grammar G is a pictorial representation of a derivation in which:
(a) every node has a label that is a symbol of T ∪ N ∪ {ǫ};
(b) the root of the tree has label S;
(c) interior nodes have nonterminal labels;
(d) if node n has label A and descendants n1, . . . , nk (in order, left-to-right) with labels X1, . . . , Xk respectively, then A → X1 · · · Xk must be a production of P; and
(e) if node n has label ǫ then n is a leaf and is the only descendant of its parent.

(2) The yield of a parse tree with root node r is y(r), where:
(a) if n is a leaf node with label a then y(n) = a; and
(b) if n is an interior node with descendants n1, . . . , nk then y(n) = y(n1) · · · y(nk).

Theorem. Let G = (T, N, S, P) be a context-free grammar. For each α ∈ (T ∪ N)∗ there exists a parse tree in grammar G with yield α if, and only if, S =⇒∗ α.

Examples. Consider the parse trees for the derivations (a) and (c) of the above examples. Notice that the derivations are different and so are the parse trees, although both derivations are leftmost.
4.1.6 Ambiguity

A grammar G for which there is some w ∈ L(G) for which there are two (or more) parse trees is said to be ambiguous. In such a case, w is known as an ambiguous sentence. Ambiguity is usually an undesirable property since it may lead to different interpretations of the same string.

Theorem. Let G = (T, N, S, P) be a context-free grammar and let w ∈ L(G). For every parse tree of w there is precisely one leftmost derivation of w and precisely one rightmost derivation of w.

Equivalently (by the above theorem), a grammar G is ambiguous if there is some w ∈ L(G) for which there exists more than one leftmost derivation (or rightmost derivation). Unfortunately:

Theorem. The decision problem: Is G an unambiguous grammar? is undecidable.

However, for a particular grammar we can sometimes resolve ambiguity via disambiguating rules that force troublesome strings to be parsed in a particular way. A classical problem in parsing is that of the dangling else. Given the grammar with productions

<stmt>    → if <boolexp> then <stmt>
          | if <boolexp> then <stmt> else <stmt>
          | <other>
<boolexp> → . . .
<other>   → . . .

should

if <boolexp> then if <boolexp> then <other> else <other>

be parsed as

if <boolexp> then ( if <boolexp> then <other> else <other> )

or

if <boolexp> then ( if <boolexp> then <other> ) else <other> ?

For example, let G be the grammar with productions

<stmt>      → <matched> | <unmatched>
<matched>   → if <boolexp> then <matched> else <matched> | <other>
<unmatched> → if <boolexp> then <stmt>
            | if <boolexp> then <matched> else <unmatched>
<boolexp>   → . . .
<other>     → . . .

(<matched> is intended to denote an if-statement with a matching 'else', or some other (non-conditional) statement, whereas <unmatched> denotes an if-statement with an unmatched 'if'.)
Now, according to this G,

if <boolexp> then if <boolexp> then <other> else <other>

has a unique parse tree. (Exercise: convince yourself that this parse tree really is unique.)

4.1.7 Inherent Ambiguity

We have seen that it is possible to remove ambiguity from a grammar by changing the (nonterminals and) productions to give an equivalent (in terms of the language defined) but unambiguous grammar. However, there are some context-free languages for which every grammar is ambiguous; such languages are called inherently ambiguous.

Theorem. There exists a context-free language L such that for every context-free grammar G, if L = L(G) then G is ambiguous.

Furthermore:

Theorem. The decision problem: Is L an inherently ambiguous language? is undecidable.

4.1.8 Limits of Context-Free Grammars

It is known that not every language is context-free. For example, L = {w ∈ {a, b, c}∗ | w = aⁿbⁿcⁿ for some n ≥ 1} is not context-free, nor is L = {w ∈ {a, b, c}∗ | w = aᵐbⁿcᵖ for m ≤ n ≤ p}, although proof of these facts is beyond the scope of this course. More significantly, there are some aspects of programming language syntax that cannot be captured with a context-free grammar. Perhaps the most well-known example of this is the rule that variables must be declared before they are used. Such properties are often called context-sensitive since they can be specified by grammars that have productions of the form αAβ → αγβ which, informally, says that A can be replaced by γ but only when it occurs in the context α . . . β. In practice we still use context-free grammars; such problems are resolved by allowing strings (programs) that parse but are technically invalid, and then by adding further code that checks for invalid (context-sensitive) properties.
In this section we describe a number of techniques for constructing a parser for a given context-free language from a context-free grammar for that language. Parsers fall into two broad categories: top-down parsers and bottom-up parsers, as described in the next two sections. We begin, in Section 4.2.1, by showing how a simple parser may be coded in an ad-hoc way using a standard programming language such as Pascal. In Section 4.2.2 we define those grammars for which such a simple parser may be written. In Section 4.2.3 we show how to add syntax-tree construction and (very limited) error recovery code to these parsing algorithms. Finally, in Section 4.2.4 we show how this simple approach to parsing can be made more structured by using the same algorithm for all grammars, where the algorithm uses a different lookup table for each particular grammar.

4.2 Top-Down Parsers

Top-down parsing can be thought of as the process of creating a parse tree for a given input string by starting at the top with the start symbol and then working depth-first downwards to the leaves. Equivalently, it can be viewed as creating a leftmost derivation of the input string. Top-down parsing needs backtracking in general, i.e. the parser needs to try out different production rules; if one production rule fails it needs to backtrack in the input string (and the parse tree) in order to try further rules. Since backtracking can be costly in terms of parsing time, and because other parsing methods (e.g. bottom-up parsers) do not require it, it is rare to write backtracking parsers in practice.

4.2.1 Recursive Descent Parsing

The simplest idea for creating a parser for a grammar is to construct a recursive descent parser (r.d.p.). This is a top-down parser. We will not concern ourselves here with the construction of a parse tree or syntax tree; we'll just attempt to write code that produces (implicitly, via procedure calls) a derivation of an input string (if the string is in the language given by the grammar) or reports an error (for an incorrect string). We will show here how to write a non-backtracking r.d.p. Such parsers are commonly called predictive parsers since they attempt to predict which production rule to use at each step in order to avoid the need to backtrack. As we'll see in Section 4.2.2, it is not possible to write such a simple parser for all grammars: for many programming languages' grammars such a parser cannot be written, and for these we'll need more powerful parsing methods which we will look at later. This code will form the basis of a complete syntax analyser: syntax tree construction and error recovery code will just be inserted into it at the correct places.
We will write a r.d.p. for the grammar G with productions:

A → a | BCA
B → be | cA
C → d

We will assume the existence of a procedure get_next_token which reads terminal symbols (or tokens) from the input, and places them into the variable current_token. The tokens for the above grammar are a, b, c, d and e. We'll also assume the tokens end_of_input (for end of string) and error for a non-valid input. Later on we'll see more about error recovery, but for now assume that the procedure error simply outputs a message and causes the program to terminate.

The parser is constructed from three procedures, one for each nonterminal symbol. The aim of each procedure is to try to match any of the right-hand sides of the production rules for that nonterminal symbol against the next few symbols on the input string; if it fails to do this then there must be an error in the input string, and the procedure reports this fact. We'll also assume the following auxiliary procedure, which tests whether the current token is what is expected and, if so, reads the next token, and otherwise calls an error routine:

procedure match(expected : token);
begin
  if current_token = expected then
    get_next_token
  else
    error
end;

Consider the nonterminal symbol A and the A-productions A → a and A → BCA. The procedure parse_A attempts to match either an a or BCA. It first needs to decide which production rule to choose: it makes its choice based on the current input token. If this is a, it obviously uses the production rule A → a and is bound to succeed since it already knows it is going to find the a(!). Notice that any string derivable from BCA begins with either a b or a c; therefore, if either b or c is the current token, parse_A will attempt to parse the next few input symbols using the rule A → BCA. In order to attempt to match BCA it calls the procedures parse_B, parse_C and parse_A in turn, each of which may result in a call to error if the input string is not in the language. If the current token is neither a, b nor c, the input cannot be derived from A and therefore error is called.

procedure parse_A;
begin
  case current_token of
    a    : match(a);
    b, c : begin
             parse_B;
             parse_C;
             parse_A
           end
  else
    error
  end
end;

The procedures parse_B and parse_C are constructed in a similar fashion, because of the right-hand sides of the two B-productions:

procedure parse_B;
begin
  case current_token of
    b : begin
          match(b);
          match(e)
        end;
    c : begin
          match(c);
          parse_A
        end
  else
    error
  end
end;

procedure parse_C;
begin
  match(d)
end;

To execute the parser, we need to (i) make sure that the first token of the input string has been read into current_token (this is accomplished by get_next_token), (ii) call parse_A since this is the start symbol, and, if an error is not generated by this call, (iii) check that the end of the input string has been reached:

get_next_token;
parse_A;
if current_token = end_of_input then
  writeln('Success!')
else
  error

By executing the parser on a range of inputs that drive the algorithm through all possible branches (try the algorithm on beda, bed and cbedada for examples), it is easy to convince oneself that the algorithm is a neat and correct solution to the parsing problem for this particular grammar. Notice how the algorithm (i) determines a sequence of productions that are equivalent to the construction of a parse tree that starts from the root and builds down to the leaves, and (ii) determines a leftmost derivation. Although it is possible to construct compact, efficient recursive descent parsers by hand for some very large grammars, the method is somewhat ad hoc and more useful general-purpose methods are preferred. We note that the method of constructing a r.d.p. will fail to produce a correct parser if the grammar is left recursive, that is if A =⇒+ Aα for some nonterminal A (why?).

Also notice that for each nonterminal of the above grammar it is possible to predict which production (if any) is to be matched against, purely on the basis of the current input character, because as we have seen:

• To recognise an A: if the current symbol is a, we try the production rule A → a; if the symbol is b or c, we try A → BCA; otherwise we report an error.
• To recognise a B: if the current symbol is b, we try B → be; if the symbol is c, we try B → cA; otherwise we report an error.
• To recognise a C: if the current symbol is d, we try C → d; otherwise we report an error.

The choice is determined by the so-called First sets of the right-hand sides of the production rules. In general, First(α) for any string α ∈ (T ∪ N)∗ denotes the set of all terminal symbols that can begin strings derivable from α. For example, the First sets of the right-hand sides of the rules A → a and A → BCA are First(a) = {a} and First(BCA) = {b, c} respectively. To be able to write a predictive parser for a grammar, the First sets for all pairs of distinct productions' right-hand sides must be disjoint. This is clearly the case for the grammar G above, since

First(a) ∩ First(BCA) = {a} ∩ {b, c} = ∅
First(be) ∩ First(cA) = {b} ∩ {c} = ∅

(and C has no pairs of productions since there is only one C-production rule). For the grammar G2 with productions

A → b | BCA
B → be | cA
C → d

this is not the case, since First(b) ∩ First(BCA) = {b} ∩ {b, c} = {b} ≠ ∅, and therefore we cannot write a predictive parser for G2.

Things become slightly more complicated when grammars have ε-productions. For example, let the grammar G3 have productions:

A → aBb
B → c | ε

For G3, parse_A would be written:

procedure parse_A;
begin
  if current_token = a then
  begin
    match(a);
    parse_B;
    match(b)
  end
  else
    error
end;

How do we write a procedure parse_B that would recognise strings derivable from B? Well, consider what the parser should do for the following three strings:

• acb,
• ab,
• aa.

Consider first parsing the string acb. The procedure parse_A would consume the first a and then call parse_B. We would want parse_B to try the rule B → c to consume the c, to leave parse_A to consume the final b and thus result in a successful parse. Next consider ab. Again parse_A would consume the a; here we would like parse_B to apply B → ε (which does not consume any input), leaving parse_A to consume the b (success again). (Notice that the b is not consumed.) Finally consider aa. Again parse_A would consume the a. Now, obviously parse_B should not try B → c (since the next symbol is a), nor should it try B → ε, since there is no way in which the next symbol a is going to be matched; therefore it should (correctly) report an error.

The choice, then, of whether to use the ε-production depends on whether the next symbol can legally immediately follow a B in a sentential form. The set of all such symbols is denoted Follow(B). For G3, Follow(B) = {b} and thus the code for parse_B should be:

procedure parse_B;
begin
  case current_token of
    c : match(c);
    b : (* do nothing *)
  else
    error
  end
end;

Finally, let G4 be a grammar with productions

A → aBb
B → b | ε

and consider parsing the strings ab and abb. Obviously, parse_A will be as above, and thus would consume the a in both strings. However, it is impossible for a procedure parse_B to choose, purely on the basis of the current character (b in both cases), whether to try the production B → b (which would be correct for abb but not for ab) or B → ε (which would be correct for ab but not for abb). Indeed, the problem is that ε is in the First set of one of the B-productions' right-hand sides (we use the convention that First(ε) = {ε}) and one of the other First sets of a right-hand side, First(b), is not disjoint from the set Follow(B): both sets are {b}. This condition of disjointness must be fulfilled to be able to write a predictive parser. In the following section we define what it means for a grammar to be LL(1) based on the concepts of First and Follow sets.

Formally, grammars for which predictive parsers can be constructed are called LL(1) grammars. (The first 'L' is for left-to-right reading of the input, the second 'L' is for leftmost derivation, and the '1' is for one symbol of lookahead.) LL(1) grammars are the most widely-used examples of LL(k) grammars: those that require k symbols of lookahead.
4.2.2 LL(1) Grammars

To define formally what it means for a grammar to be LL(1) we first define the two functions First and Follow. These functions help us to construct a predictive parser for a given LL(1) grammar, and their definition should be the first step in writing such a parser.

We let First(α) be the set of all terminals that can begin strings derived from α (First(α) also includes ε if α ⇒* ε). To compute First(X) for all symbols X, apply the following rules until nothing else can be added to any First set:

• if X ∈ T, then First(X) = {X};
• if X → ε ∈ P, then add ε to First(X);
• if X → Y1 ... Yn ∈ P, then add all non-ε members of First(Y1) to First(X). If Y1 ⇒* ε then also add all non-ε members of First(Y2) to First(X). If Y1 ⇒* ε and Y2 ⇒* ε then also add all non-ε members of First(Y3) to First(X), etc. Finally, if all of Y1, ..., Yn ⇒* ε, then add ε to First(X).

To compute First(X1 ... Xm) for any string X1 ... Xm, first add to First(X1 ... Xm) all non-ε members of First(X1). If ε ∈ First(X1) then add all non-ε members of First(X2), etc. Finally, if every set First(X1), ..., First(Xm) contains ε then we also add ε to First(X1 ... Xm).

We let Follow(A) be the set of terminal symbols that can appear immediately to the right of the nonterminal symbol A in a sentential form. If A appears as the rightmost nonterminal in a sentential form, then Follow(A) also contains the symbol $. To compute Follow(A) for all nonterminals A, apply the following rules until nothing can be added to any Follow set:

• place $ in Follow(S);
• if there is a production A → αBβ, then place all non-ε members of First(β) in Follow(B);
• if there is a production A → αB, or a production A → αBβ where β ⇒* ε, then place everything in Follow(A) into Follow(B).

From the definition of First, we can give a first condition for a grammar G = (T, N, S, P) to be LL(1), as follows. For each nonterminal A ∈ N, suppose that A → β1 | β2 | ··· | βn are all the A-productions in P. For G to be LL(1) it is necessary, but not sufficient, that First(βi) ∩ First(βj) = ∅ for each 1 ≤ i, j ≤ n where i ≠ j.

Now, supposing a grammar G meets the first condition above, G is an LL(1) grammar if it also meets the following condition. For each A ∈ N, let A → β1 | β2 | ··· | βn be all the A-productions. If ε ∈ First(βk) for some 1 ≤ k ≤ n (notice that, if the first condition holds, there will be at most one such value of k) then it must be the case that Follow(A) ∩ First(βi) = ∅ for each 1 ≤ i ≤ n, i ≠ k.

Example. Let G have the following productions:

S → BAaC
A → D | E
B → b
C → c
D → aF
E → ε
F → f

First and Follow sets for this grammar are:

First(BAaC) = {b}    Follow(S) = {$}
First(D)    = {a}    Follow(A) = {a}
First(E)    = {ε}    Follow(B) = {a}
First(b)    = {b}    Follow(C) = {$}
First(c)    = {c}    Follow(D) = {a}
First(aF)   = {a}    Follow(E) = {a}
First(f)    = {f}    Follow(F) = {a}

It is clear that G meets the first condition above, as First(D) ∩ First(E) = {a} ∩ {ε} = ∅. However, G does not meet the second condition, because Follow(A) ∩ First(D) = {a} ∩ {a} ≠ ∅.

Many non-LL(1) grammars can be transformed into LL(1) grammars by removing left recursion and by left factoring (see Section 4.2.5). However, not every grammar can be transformed into an LL(1) grammar, and for such grammars we need other methods of parsing. Moreover, although left factoring and elimination of left recursion are easy to implement, they often make the grammar hard to read.
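The fixed-point computation of First sets can be sketched in a few lines of Python. This is an illustration of mine, not part of the notes; the dictionary encoding of the grammar and the marker "eps" for ε are assumptions of the sketch.

```python
# Fixed-point computation of First sets for the example grammar above.
# Nonterminals are the dictionary keys; "eps" stands for the empty string.
GRAMMAR = {
    "S": [["B", "A", "a", "C"]],
    "A": [["D"], ["E"]],          # A -> D | E
    "B": [["b"]],
    "C": [["c"]],
    "D": [["a", "F"]],
    "E": [["eps"]],               # E -> eps
    "F": [["f"]],
}
NONTERMINALS = set(GRAMMAR)

def first_of_string(symbols, first):
    """First set of a string X1...Xm, given current First sets of symbols."""
    result = set()
    for x in symbols:
        fx = first[x] if x in NONTERMINALS else {x}
        result |= fx - {"eps"}
        if "eps" not in fx:
            return result
    result.add("eps")             # every Xi can derive the empty string
    return result

def compute_first():
    first = {a: set() for a in NONTERMINALS}
    changed = True
    while changed:                # iterate until nothing can be added
        changed = False
        for a, alts in GRAMMAR.items():
            for alt in alts:
                body = [x for x in alt if x != "eps"]
                f = first_of_string(body, first)
                if not f <= first[a]:
                    first[a] |= f
                    changed = True
    return first
```

Running compute_first() reproduces the sets listed above, e.g. First(A) = {a, ε} and First(S) = {b}.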
4.2.3 Nonrecursive Predictive Parsing

A significant aspect of LL(1) grammars is that it is possible to construct nonrecursive predictive parsers from an LL(1) grammar. Such parsers maintain a stack explicitly, rather than implicitly via recursive calls. They also use a parsing table, which is a two-dimensional array M of elements M[A, a], where A is a nonterminal symbol and a is either a terminal symbol or the end-of-string symbol $. Each element M[A, a] of M is either an A-production or an error entry. Essentially, M[A, a] says which production to apply when we need to match an A (in this case, A will be on the top of the stack) and the next input symbol is a.

Constructing an LL(1) parsing table. To construct the parsing table M, we complete the following steps for each production A → α of the grammar G:

• for each terminal a in First(α), let M[A, a] be A → α;
• if ε ∈ First(α), let M[A, b] be A → α for all terminals b in Follow(A);
• if ε ∈ First(α) and $ ∈ Follow(A), let M[A, $] be A → α.

Let all other elements of M denote an error. (What happens if we try to construct such a table for a non-LL(1) grammar?)

The LL(1) parsing algorithm. The nonrecursive predictive parsing algorithm (common to all LL(1) grammars) is shown below. Note the use of $ to mark the bottom of the stack and the end of the input string.

set stack to $S (with S on top);
set input pointer to point to the first symbol of the input w$;
repeat the following step (letting X be the top of stack and a the current input symbol):
    if X = a = $ then we have parsed the input successfully
    else if X = a (a token) then pop X off the stack and advance the input pointer
    else if X is a token or $ but not equal to a then report an error
    else (X is a nonterminal) consult M[X, a]:
        if this entry is a production X -> Y1...Yn
        then pop X from the stack, replace it with Yn...Y1 (with Y1 on top),
             and output the production X -> Y1...Yn
        else report an error

Example. Consider the grammar G with productions:

C → cD | Ed
D → EFC | e
E → ε | f
F → g

First we define useful First and Follow sets for this grammar:

First(cD)  = {c}        First(f) = {f}         Follow(C) = {$}
First(Ed)  = {d, f}     First(g) = {g}         Follow(D) = {$}
First(EFC) = {f, g}     First(F) = {g}         Follow(E) = {d, g}
First(ε)   = {ε}        First(C) = {c, d, f}   Follow(F) = {c, d, f}

This gives the following LL(1) parsing table:

        c        d        e       f         g        $
  C   C → cD   C → Ed           C → Ed
  D                      D → e  D → EFC   D → EFC
  E            E → ε            E → f     E → ε
  F                                       F → g

which, when used with the LL(1) parsing algorithm on input string cgfd, gives:

  Stack   Input    Action
  $C      cgfd$    C → cD
  $Dc     cgfd$    match
  $D      gfd$     D → EFC
  $CFE    gfd$     E → ε
  $CF     gfd$     F → g
  $Cg     gfd$     match
  $C      fd$      C → Ed
  $dE     fd$      E → f
  $df     fd$      match
  $d      d$       match
  $       $        accept
4.2.4 Syntax-tree Construction for Recursive-Descent Parsers

In this section we extend the recursive descent parser writing technique of Section 4.2.1 in order to build syntax trees, for the following simple grammar, which only allows assignments and if-statements:

command list    ::= command rest commands
rest commands   ::= ; command list | null
command         ::= if command | assignment
if command      ::= if expression then command list else command list fi
assignment      ::= identifier := expression
expression      ::= atom rest expression
rest expression ::= + atom rest expression | null
atom            ::= identifier

In order to construct a predictive parser, we first list the important First and Follow sets. These are (grouped for each of the eight nonterminals):

First(command rest commands) = {if, identifier}
First(; command list) = {;}
First(null) = {null}
Follow(rest commands) = {else, fi, $}
First(if command) = {if}
First(assignment) = {identifier}
First(if expression then command list else command list fi) = {if}
First(identifier := expression) = {identifier}
First(atom rest expression) = {identifier}
First(+ atom rest expression) = {+}
First(null) = {null}
Follow(rest expression) = {;, else, then, fi, $}
First(identifier) = {identifier}
We will construct syntax trees rather than parse trees, since the former are much smaller and simpler yet contain all the information required to enable straightforward code generation. Notice that:

• command sequences are better constructed as lists, rather than as trees as would be natural given the grammar;
• expressions are better constructed as syntax trees that better illustrate the expression (with proper rules of precedence and associativity), rather than directly as given by the grammar;
• the grammar fails to represent the fact that '+' should be left associative: x+y+z is parsed as x+(y+z) rather than the correct (x+y)+z. This is not really a problem for '+', but for '-' it would be, since 1-1-1 would be evaluated as 1-(1-1) = 1 rather than the conventional (1-1)-1 = -1. Indeed, it is impossible to write an LL(1) grammar that achieves this task. Therefore, we will get the parser to construct the correct (left associative) syntax tree using a clever technique; see the code.

We first write some type definitions to represent arbitrary nodes in a syntax tree. There are command nodes and expression nodes; command nodes can be if or assign nodes, and expression nodes can be identifiers or operations. Assume that MAXC is a constant with value 3, which represents the maximum number of children a node can have (if-commands need three children: for the condition, the then clause and the else clause).

type
  node_kind = (n_command, n_expr);
  command_kind = (c_if, c_assign);
  expr_kind = (e_ident, e_operation);
  tree_ptr = ^tree_node;
  tree_node = record
    kind : node_kind;          { is it a command or expression node? }
    ckind : command_kind;      { what kind of command is it? }
    ekind : expr_kind;         { what kind of expression is it? }
    ident : string[20];        { for assignment lhs and simple expressions }
    sibling : tree_ptr;        { the next command in the list }
    operator : token;          { expression operator }
    children : array[1..MAXC] of tree_ptr
  end;

Note that some of the fields of this record type will be redundant for certain types of node. For example, only command nodes will have siblings, and only expression nodes will have operators. We could use variant records (or unions in C), but this would complicate the discussion slightly, so we choose not to. Also note that we do not really need a field for operators, since we only have '+' in the grammar; using this field, however, we can easily add extra operations to the grammar.
The code for the above grammar is as follows. We assume that the lexical analyser takes the form of a procedure get_next_token that updates the variables current_token and current_identifier, and that we have a match procedure that checks tokens. The tokens for the grammar are k_if, k_then, k_else, k_fi, assign, plus, semi_colon, identifier, end_of_input and error.

Recall from Section 4.2.1 that we write a procedure parse_A for each nonterminal A in the grammar, the aim of which is to recognise strings derivable from A. To write a parser that constructs syntax trees, we replace each procedure parse_A by a function whose additional role it is to construct and return a syntax tree for the recognised string.

We first define two auxiliary functions that the other functions will call in order to create nodes for commands or expressions, respectively.

function new_command_node(t : command_kind) : tree_ptr;
{Construct a new command node of kind t}
var i : integer;
    temp : tree_ptr;
begin
  new(temp);
  temp^.kind := n_command;
  temp^.ckind := t;
  for i := 1 to MAXC do temp^.children[i] := nil;
  temp^.sibling := nil;
  new_command_node := temp
end;

function new_expr_node(t : expr_kind) : tree_ptr;
{Construct a new expression node of kind t}
var i : integer;
    temp : tree_ptr;
begin
  new(temp);
  temp^.kind := n_expr;
  temp^.ekind := t;
  for i := 1 to MAXC do temp^.children[i] := nil;
  new_expr_node := temp
end;

Next, we define match and syntax error procedures:

procedure match(expected : token);
{Check the current token is as expected; if so, read the next token}
begin
  if current_token = expected
    then get_next_token
    else syntax_error('Unexpected token')
end;

procedure syntax_error(s : string);
{Output error message}
begin
  writeln(s);
  writeln('Exiting');
  halt
end;

For the first three rules in the BNF (those for lists of commands) we have the following functions. A linked list of commands is built using the tree nodes' sibling fields. Note that in parse_command_list we do not bother looking at the current token to see if it is valid, since we know parse_command will do this. Notice also how the parse_rest_commands function uses the set Follow(rest commands) = {else, fi, $} to determine when to use the ε-production.

function parse_command_list : tree_ptr;
var temp : tree_ptr;
begin
  temp := parse_command;
  temp^.sibling := parse_rest_commands;
  parse_command_list := temp
end;

function parse_rest_commands : tree_ptr;
var temp : tree_ptr;
begin
  temp := nil;
  if current_token = semi_colon then
    begin
      match(semi_colon);
      temp := parse_command_list
    end
  else if not (current_token in [k_else, k_fi, end_of_input]) then
    syntax_error('Incomplete command');
  parse_rest_commands := temp
end;

function parse_command : tree_ptr;
begin
  case current_token of
    k_if       : parse_command := parse_if_command;
    identifier : parse_command := parse_assignment;
    else begin
           syntax_error('Expected if or variable');
           parse_command := nil
         end
  end
end;

The two types of command (if commands and assignments) are parsed using the following functions.

function parse_assignment : tree_ptr;
var temp : tree_ptr;
begin
  temp := new_command_node(c_assign);
  temp^.ident := current_identifier;
  match(identifier);
  match(assign);
  temp^.children[1] := parse_expression;
  parse_assignment := temp
end;

function parse_if_command : tree_ptr;
var temp : tree_ptr;
begin
  temp := new_command_node(c_if);
  match(k_if);
  temp^.children[1] := parse_expression;
  match(k_then);
  temp^.children[2] := parse_command_list;
  match(k_else);
  temp^.children[3] := parse_command_list;
  match(k_fi);
  parse_if_command := temp
end;

The final three functions handle expressions. Notice that the function parse_rest_expression takes an argument comprising the (already parsed) left hand side expression, and also that it uses the set Follow(rest expression) = {;, else, then, fi, $} to determine whether to use the ε-production. Follow through the functions for the case x+y+z to convince yourself that they return the correct interpretation (x+y)+z.

function parse_expression : tree_ptr;
{Parse an expression, and force it to be left associative}
var left : tree_ptr;
begin
  left := parse_atom;
  parse_expression := parse_rest_expression(left)
end;

function parse_rest_expression(left : tree_ptr) : tree_ptr;
{If there is an operator on the input, construct an expression with lhs left
 (already parsed) by parsing the rhs; otherwise, return left as the expression}
var new_e : tree_ptr;
begin
  if current_token = plus then
    begin
      new_e := new_expr_node(e_operation);
      new_e^.operator := current_token;
      get_next_token;
      new_e^.children[1] := left;
      new_e^.children[2] := parse_atom;
      parse_rest_expression := parse_rest_expression(new_e)
    end
  else if current_token in [semi_colon, end_of_input, k_fi, k_else, k_then] then
    parse_rest_expression := left
  else
    begin
      syntax_error('Badly formed expression');
      parse_rest_expression := nil
    end
end;

function parse_atom : tree_ptr;
var temp : tree_ptr;
begin
  temp := nil;
  if current_token = identifier then
    begin
      temp := new_expr_node(e_ident);
      temp^.ident := current_identifier;
      match(identifier)
    end
  else
    syntax_error('Badly formed expression');
  parse_atom := temp
end;
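The trick used by parse_rest_expression, threading the already-built left subtree through the recursion, is worth seeing in isolation. The following Python sketch is my illustration, not part of the notes; tuples stand in for tree nodes.

```python
# The left-associativity trick of parse_rest_expression, in Python:
# the already-built left subtree is passed down through the recursion.
def parse_expression(tokens):
    left, rest = parse_atom(tokens)
    return parse_rest_expression(left, rest)

def parse_rest_expression(left, tokens):
    if tokens and tokens[0] == "+":
        right, rest = parse_atom(tokens[1:])
        # Combine with the subtree built so far, then continue rightwards.
        return parse_rest_expression(("+", left, right), rest)
    return left, tokens

def parse_atom(tokens):
    if tokens and tokens[0].isalpha():
        return tokens[0], tokens[1:]
    raise SyntaxError("badly formed expression")

tree, rest = parse_expression(["x", "+", "y", "+", "z"])
```

For x+y+z the sketch yields the left-associative tree ("+", ("+", "x", "y"), "z"), mirroring what the Pascal functions build.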
4.2.5 Operations on context-free grammars for top-down parsing
As we have seen, many grammars are not suitable for use in top-down parsing: in particular, for predictive parsing only LL(1) grammars may be used. Many grammars can be transformed into more suitable forms in order to make the construction of top-down parsers possible. Two useful techniques are left factoring and removal of left recursion.

Left factoring. Many simple parsing algorithms (e.g. those we have studied so far) cannot cope with grammars with productions such as

if stmt → if bool then stmt list fi
        | if bool then stmt list else stmt list fi

where two or more productions for a given nonterminal share a common prefix. This is because they have to make a decision on which production rule to use based only on the first symbol of the rule. We will not look at the general algorithm for left-factoring a grammar. For the above example, however, we simply factor out the common prefix to obtain the equivalent grammar rules:

if stmt → if bool then stmt list rest if
rest if → else stmt list fi | fi

Removing Immediate Left Recursion. A grammar is said to be left recursive if A ⇒+ Aα for some nonterminal A, and immediately left recursive if there is a production of the form A → Aα. Grammars for programming languages are often immediately left recursive, or have general left recursion (see the next section). However, a large class of parsing algorithms (specifically top-down parsers) cannot cope with such grammars. Here we shall show how immediate left recursion can be removed from a context-free grammar, and in the next section we will see how general left recursion can be removed.

Consider a grammar G with productions A → Aα | β, where β does not begin with an A. We can make a new grammar G′ that defines the same language as G but without any left recursion, by replacing the above productions with

A  → βA′
A′ → αA′ | ε

More generally, consider a grammar G with productions

A → Aα1 | Aα2 | ··· | Aαn | β1 | β2 | ··· | βm

where none of β1, ..., βm begin with an A. We can make a new grammar G′ without left recursion by replacing the above productions with

A  → β1A′ | β2A′ | ··· | βmA′
A′ → α1A′ | α2A′ | ··· | αnA′ | ε
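The transformation just described is mechanical, and can be sketched in Python as follows. This is an illustration of mine, not from the notes; the list-of-lists encoding of right hand sides is an assumption of the sketch.

```python
# Remove immediate left recursion from the productions of one nonterminal:
#   A -> A a1 | ... | A an | b1 | ... | bm
# becomes
#   A  -> b1 A' | ... | bm A'
#   A' -> a1 A' | ... | an A' | eps
def remove_immediate_left_recursion(a, productions):
    """productions: list of right hand sides (lists of symbols) for A."""
    recursive = [rhs[1:] for rhs in productions if rhs[:1] == [a]]
    rest = [rhs for rhs in productions if rhs[:1] != [a]]
    if not recursive:
        return {a: productions}       # nothing to do
    a2 = a + "'"                      # fresh nonterminal A'
    return {
        a: [beta + [a2] for beta in rest],
        a2: [alpha + [a2] for alpha in recursive] + [[]],   # [] is eps
    }
```

For A → Aa | b the sketch returns A → bA′ and A′ → aA′ | ε, exactly the schema above.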
Removing Left Recursion. Let G be a grammar that has no ε-productions and no cycles. (A grammar G has a cycle if A ⇒+ A for some nonterminal A.) The following is an algorithm that removes left recursion from G.

Arrange the nonterminals in some order A1, ..., An
for i := 1 to n do
    for j := 1 to i−1 do
        replace each production of the form Ai → Ajγ with
            Ai → δ1γ | ··· | δkγ
        where Aj → δ1 | ··· | δk are all the current Aj productions
    od;
    remove any immediate left recursion from all current Ai productions
od

In order to understand how the algorithm works, we begin by considering how G might be left recursive. First, we arrange the nonterminals into some order A1, ..., An. In order for G to be left recursive, it must be that Ai ⇒+ Aiα for some i and some α. Now consider an arbitrary Ai production Ai → β. There are two possibilities for β: (1) β is of the form β = aγ for some terminal a, or (2) β is of the form β = Ajγ for some nonterminal Aj. Notice that if every Ai production is of the form (1), then any derivation from Ai will yield a string that starts with a terminal symbol, and thus G cannot be left recursive. Thus it must be that at least one Ai production is of the form Ai → Ajγ. Now suppose that j > i for every such production (for all nonterminals Ai): then it must be that whenever Ai derives a string α whose leftmost symbol is a nonterminal Ak, we have k > i. Since this means k ≠ i, it is impossible to have Ai ⇒+ Aiα for any α, contradicting the assumption of G's left recursiveness. So if all such productions have j > i, G cannot be left recursive.

The way the algorithm works is to transform the productions of G in such a way that the new productions are either of the form Ai → aα or of the form Ai → Ajα where j > i; by the argument above, the new grammar cannot be left recursive. For each value of i the algorithm modifies the set of Ai productions, and this set may grow as the algorithm proceeds. The case i = 1 is special, since for any production A1 → Ajα we have j ≥ 1; if j = 1 then we have an instance of immediate left recursion, which is removed as in the previous section.

For i = 2, ..., n, it is possible to have Ai productions of the form Ai → Ajα with j < i. For each such production (taking j in turn) the algorithm substitutes for Aj each possible right hand side of an Aj production: that is, if Aj → δ1 | ··· | δp are all the Aj productions, then the algorithm replaces Ai → Ajγ with Ai → δ1γ | ··· | δpγ. Now, since the algorithm has already processed the Aj productions (think inductively here!), it must be the case that if the leftmost symbol of some δr (r = 1, ..., p) is a nonterminal Ak, then k > j (because δr is the right hand side of an Aj production). Thus the lowest index of any nonterminal occurring as the leftmost symbol of the right hand side of any Ai production has been increased from j to k. If k is already greater than (or equal to) i then the production Ai → δrγ is of the right form (we can remove the immediate left recursion if k = i). Otherwise j < k < i, and since the algorithm has already considered the Ak productions, all the Ak productions whose right hand sides start with a nonterminal Al will have l > k; so when the inner loop reaches j = k and substitutes these right hand sides for Ak, we obtain productions whose right hand sides begin with a terminal symbol or a nonterminal Al for l > k. Thus the lowest index of any nonterminal occurring as the leftmost symbol of the right hand side of any Ai production has now been increased from j to k to l. Continuing in this way, we see that by the time we have dealt with the case j = i − 1 we must arrive at a situation where the lowest such index has been increased to i or greater, and any remaining immediate left recursion is then removed.

Example. Let us remove the left recursion from the following productions:

A1 → a | A2b
A2 → c | A3d
A3 → e | A4f
A4 → g | A2h
Consider what happens when we execute the above algorithm on these productions. First, the nonterminals are already ordered. The algorithm does nothing for i = 1, 2, 3, so we proceed with the outer loop at i = 4. At j = 1, there are no A4 productions of the form A4 → A1γ and so nothing happens. At j = 2, A4 → A2γ is a production for γ = h. The current A2 productions are A2 → c | A3d, and so we replace A4 → A2h with A4 → ch | A3dh. This ends the case j = 2, and the A4 productions are currently A4 → g | ch | A3dh. At j = 3, we see that A4 → A3γ is a current A4 production for γ = dh. The current A3 productions are A3 → e | A4f, and so A4 → A3dh is replaced with A4 → edh | A4fdh. This ends the case j = 3, and the A4 productions are now A4 → g | ch | edh | A4fdh. This ends the inner loop, and so we remove the immediate left recursion from A4's productions, resulting in:

A4  → gA4′ | chA4′ | edhA4′
A4′ → fdhA4′ | ε

Notice that each production is now of the form Ai → α where the leftmost symbol of α is either a terminal or a nonterminal Aj with j > i.
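The whole algorithm, including the final immediate-left-recursion step, can be sketched and checked against the worked example. Again this is an illustration of mine, not from the notes, and the grammar encoding is an assumption.

```python
# The left-recursion-removal algorithm, run on the worked example.
# A grammar is a dict: nonterminal -> list of right hand sides
# (each a list of symbols); [] plays the role of eps.

def remove_immediate(a, prods):
    rec = [rhs[1:] for rhs in prods if rhs[:1] == [a]]
    rest = [rhs for rhs in prods if rhs[:1] != [a]]
    if not rec:
        return {a: prods}
    a2 = a + "'"
    return {a: [b + [a2] for b in rest],
            a2: [al + [a2] for al in rec] + [[]]}

def remove_left_recursion(grammar, order):
    g = {a: [list(r) for r in grammar[a]] for a in grammar}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            new = []
            for rhs in g[ai]:
                if rhs[:1] == [aj]:                 # Ai -> Aj gamma
                    new.extend(d + rhs[1:] for d in g[aj])
                else:
                    new.append(rhs)
            g[ai] = new
        g.update(remove_immediate(ai, g[ai]))
    return g

EXAMPLE = {"A1": [["a"], ["A2", "b"]],
           "A2": [["c"], ["A3", "d"]],
           "A3": [["e"], ["A4", "f"]],
           "A4": [["g"], ["A2", "h"]]}
```

Running it on EXAMPLE reproduces the result derived by hand above: A4 → gA4′ | chA4′ | edhA4′ and A4′ → fdhA4′ | ε, with the other productions unchanged.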
4.3 Bottom-up Parsers

Bottom-up parsing can be thought of as the process of creating a parse tree for a given input string by working upwards from the leaves towards the root. The most well-known form of bottom-up parsing is that of shift-reduce parsing, to which we will restrict our attention.

General idea of shift-reduce parsing. Shift-reduce parsing works by matching substrings of the input string against productions of a grammar G and then replacing such substrings with the appropriate nonterminal, so that the input string (provided it is in L(G)) is eventually 'reduced' to the start symbol S.

Example. Consider a grammar G with the following productions:

A → ab | BAc
B → CA
C → d | Ba | abc

Now consider the input string w = dababc. The d matches C → d, so w can be reduced to w1 = Cababc. Next, the first ab matches A → ab, so w1 can be reduced to w2 = CAabc. Next, CA matches B → CA, so w2 can be reduced to w3 = Babc. Now ab matches A → ab, so w3 can be reduced to w4 = BAc. Finally, w4 can be reduced to A (the start symbol) by the production A → BAc. We have reduced w to the start symbol, and reversing this process does indeed yield a rightmost derivation of w from A:

A ⇒ BAc ⇒ Babc ⇒ CAabc ⇒ Cababc ⇒ dababc.

In contrast to top-down parsing, which traces out a leftmost derivation, shift-reduce parsing aims to yield a rightmost derivation (in reverse). There are other sequences of reductions that reduce w to A, but they need not yield rightmost derivations. For example, we could have chosen to reduce w2 to CAAc (and then to BAc), but this yields:

A ⇒ BAc ⇒ CAAc ⇒ CAabc ⇒ Cababc ⇒ dababc.

Further, notice that not all sequences of reductions lead to the start symbol. For example, w2 can be reduced to CAC, which can only be reduced to BC.

A shift-reduce parser uses a stack in a similar way to LL(1) parsers. The stack begins empty, and a parse is complete when S appears as the only symbol on the stack. The parser has two possible actions (besides 'accept' and 'error') at each step:

• to shift the current input symbol onto the top of the stack;
• to reduce a string (for example, α) at the top of the stack to the nonterminal A, given the production A → α.

The actions that a shift-reduce parser might make are:
  Stack   Input     Action
          dababc$   shift
  d       ababc$    reduce C → d
  C       ababc$    shift
  Ca      babc$     shift
  Cab     abc$      reduce A → ab
  CA      abc$      reduce B → CA
  B       abc$      shift
  Ba      bc$       shift
  Bab     c$        reduce A → ab
  BA      c$        shift
  BAc     $         reduce A → BAc
  A       $         accept

It is important to note that this is only an illustration of what a shift-reduce parser might do; it does not show how the parser makes decisions on what actions to take. Different types of shift-reduce parsers use the available stack/input in different ways. The main problem with shift-reduce parsing is deciding when to shift and when to reduce. Notice that in the example we do not reduce the top of the stack every time possible: e.g. we did not reduce the Ba to C when the stack contained Bab, as this would have given an incorrect parse.

Handles. The difficulty in writing a shift-reduce parser begins with deciding when reducing a substring of an input string will result in a (reverse) step in a rightmost derivation. Informally, a handle of a string is a substring that matches the right hand side of a production, and whose reduction represents one (reverse) step in a rightmost derivation from the start symbol. Formally, a handle of a right-sentential form γ is a pair h = (A → β, p) comprising a production A → β and a position p indicating where the substring β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ. That is, if S ⇒* αAw ⇒ αβw = γ is a rightmost derivation, then (A → β, p) is a handle of αβw when p marks the position immediately after α.

For example, with reference to the example above, (A → ab, 2) is a handle of Cababc, since Cababc is a right-sentential form and replacing ab by A at position 2 yields CAabc, which is the predecessor of Cababc in its rightmost derivation. In contrast, (A → ab, 4) is not a handle of Cababc: although CabAc ⇒ Cababc is a rightmost derivation step, it is not possible to derive CabAc from the start symbol with a rightmost derivation.

A shift-reduce parser looks at the contents of the stack and at (some of) the input to decide what action to take. An important fact underlying the shift/reduce method of parsing is that handles will always appear on the top of the stack: if there is a handle to reduce at all, then the last symbol of the handle must be on top of the stack, and it is then a matter of tracing down through the stack to find the start of the handle.
Viable Prefixes. The second technical notion that we require in order to understand how a shift/reduce parser can be made to make choices that lead to a rightmost derivation is that of a viable prefix. A viable prefix is a prefix of a right-sentential form that does not continue past the right hand end of the rightmost handle of that sentential form. For the example, the viable prefixes of Cababc are ε, C, Ca and Cab, since the rightmost handle of Cababc is the first occurrence of ab. The significance of viable prefixes is that, provided the input seen up to a given point can be reduced to a viable prefix (as is always possible initially, since ε is always a viable prefix), it is always possible to add more symbols to yield a handle at the top of the stack.

4.3.1 LR(k) Parsing

An LR(k) grammar is one for which we can construct a shift/reduce parser that reads input left-to-right and builds a rightmost derivation in reverse, using at most k symbols of lookahead. There are situations where a shift/reduce parser cannot resolve conflicts even knowing the entire contents of the stack and the next k inputs: such grammars, which include all ambiguous grammars, are called non-LR.

LR parsing (i.e. LR(k) parsing for some k ≥ 0) has a number of advantages. Most importantly, LR parsers can be constructed to recognise virtually all programming language constructs for which context-free grammars can be devised (including all LL(k) grammars), and in fact it is rarely the case that more than one lookahead character is required. Moreover, the LR parsing method can be implemented efficiently, shift/reduce parsing can be adapted to cope with ambiguous grammars, and LR parsers can detect syntax errors as soon as theoretically possible.

The process of constructing an LR parser is, however, relatively involved and in general too difficult to do by hand. For this reason parser generators have been devised to automate the process of constructing an LR parser where this is possible. One well-known such tool is the program yacc. The remainder of this section is devoted to explaining the LR(k) parsing method in order to provide a thorough understanding of the ideas behind this tool. We will look at four different kinds of LR parser that use at most one symbol of lookahead; each kind is an instance of the general LR parsing algorithm, but is distinguished by the kind of parsing table employed.
The LR parsing algorithm is a stackbased shift/reduce parsing algorithm involving a parsing table that dictates an appropriate action for a given state of the parse. There are situations where a shift/reduce parser cannot resolve conﬂicts even knowing the entire contents of the stack and the next k inputs: such grammars.3. Most importantly. it is always possible to add more symbols to yield a handle at the top of the stack. are called nonLR. which include all ambiguous grammars. and LR parsers can detect syntax errors as soon as theoretically possible. The second technical notion that we require in order to understand how a shift/reduce parser can be made to make choices that lead to a rightmost derivation is that of a viable preﬁx. In fact. For the example. shift/reduce parsing can be adapted to cope with ambiguous grammars. The signiﬁcance of viable preﬁxes is that provided the input seen up to a given point can be reduced to a viable preﬁx (as is always possible initially since ǫ is always a viable preﬁx).
4. using at most k symbols of lookahead. LR parsers can be constructed to recognise virtually all programming language constructs for which contextfree grammars can be devised (including all LL(k) grammars). One wellknown such tool is the program yacc.2
LR.3.
The four types are:

• LR(0) parsers. As implied by the 0, the actions taken by LR(0) parsers are not influenced by lookahead characters. LR(0) parsing tables are the easiest to construct.
• SLR(1) parsers ("S" meaning "simple").
• LALR(1) parsers ("LA" meaning "lookahead").
• LR(1) parsers. LR(1) parsing tables can be extremely large.

The grammars for which each of the types of parsers may be constructed is described by the hierarchy:

LR(0) ⊂ SLR(1) ⊂ LALR(1) ⊂ LR(1)

such that every LR(0) grammar is also an SLR(1) grammar, etc. In general, the more general the parser type, the more difficult the (manual) construction of the parsing table (i.e. LR(1) tables are the most difficult), although the parsing tables for LR(0), SLR(1) and LALR(1) grammars are essentially the same size. Because many programming language constructs cannot be described by SLR(1) grammars (and hence neither by LR(0) grammars), but almost all can be described (with effort) by LALR(1) grammars, it is the LALR(1) class of parser that has proven the most popular for bottom-up parsing. In fact, yacc, for example, is an LALR(1) parser generator.

We will first look at the LR parsing algorithm (or 'LRP' for 'LR parser' for short), and then look at how to construct the parsing table in the case of each of the others.
4.3.3 The LR Parsing Algorithm
The LRP is essentially a shift/reduce parser that is augmented with states. The stack holds strings of the form

s0 X1 s1 X2 s2 ··· Xm sm

(with sm on the top) where X1, ..., Xm are grammar symbols and s0, ..., sm are states. The idea behind a state in the stack is that it summarises the information in the stack below it: to be more precise, a state represents a viable prefix that has occurred before that state was reached. One of these states represents 'nothing has yet been read' (note that the empty string is always a viable prefix) and is designated the initial state; this initial state is the initial state of a DFA that recognises viable prefixes, as we shall see below.

The states are used by the LRP's goto function: this is a function that takes a state and a grammar symbol as arguments and produces a new state. Roughly speaking, the idea is that if goto(s, X) = s′ then s′ represents the viable prefix obtainable by appending an X to the viable prefix represented by s. (In other words, the goto function is the transition function of the DFA mentioned above.)

The behaviour of the LRP is best described in terms of configurations, that is, pairs

(s0 X1 s1 X2 s2 ··· Xm sm,  ai ai+1 ... an)

comprising the contents of the stack and the input yet to be read. The idea behind such a configuration is that it represents the LRP having up to this point constructed the right-sentential form X1 X2 ... Xm ai ... an and being about to read ai. The LRP starts with the initial state on its stack and the input pointer at the start of the input. It then repeatedly looks at the current (topmost) state and input symbol and takes the appropriate action before considering the next state and input, until processing is complete. For a given configuration such as the above (with current state sm and input symbol ai), the action action(sm, ai) taken by the LRP is one of the following:

• action(sm, ai) = shift s. Here the parser puts ai and s onto the stack (in that order) and advances the input pointer; thus from the configuration above, the LRP enters the configuration (s0 X1 s1 X2 s2 ··· Xm sm ai s, ai+1 ... an).
• action(sm, ai) = reduce by A → β. This action occurs when the top r (say) grammar symbols constitute β, that is, when Xm−r+1 ... Xm = β. When this does occur, the topmost 2r symbols are replaced with A and s = goto(sm−r, A) (in that order); in this case the configuration becomes (s0 X1 s1 X2 s2 ··· Xm−r sm−r A s, ai ai+1 ... an).
• action(sm, ai) = accept. Here parsing is completed and the LRP terminates.
• action(sm, ai) = error. Here a syntax error has been detected and the LRP calls an error recovery routine.
4.3.4 LR(0) State Sets and Parsing

We begin our discussion of the different types of LR parser by looking at LR(0) parsers. As implied by the '0', an LR(0) parser makes decisions on whether to shift or reduce based purely on the contents of the stack, not on lookahead. Although few useful grammars are LR(0), the construction of LR(0) parsing tables illustrates most of the concepts that we will need to build SLR(1), LR(1) and LALR(1) parsing tables.

As mentioned previously, the idea behind the 'states' on the stack of an LR parser is that a state represents a viable prefix that has occurred on the stack up to a given point in the parse. We begin to make this idea more precise by defining an LR(0) item (or 'item' for short) to be a string of the form A → α • β where A → αβ is a production; the dot (•) is a marker denoting the current position in parsing a string. Thus, for example, the production A → XY yields the three items A → •XY, A → X • Y and A → XY •. (If A → ǫ, then A → • is the only item for this production.) An item of the form A → α • β represents a parser having parsed α and being about to (try and) parse β; that is, it represents a state in a shift/reduce parser where α is the current viable prefix on the top of the stack.

Item Closures. Notice that if A → α • Bβ is an item, then we expect to see (a string derivable from) Bβ in the input. However, if B → γ is a production then (a string derivable from) γ is also acceptable input at this point; in this situation A → α • Bβ and B → •γ are equivalent in the sense that α is a viable prefix for sentential forms derivable from either item. In order to compute all the items of the parse equivalent to a given item (or set of items) we define the closure of a set of items I for a grammar using the following algorithm:

closure(I)
  let J := I
  repeat
    for each item A → α • Xβ in J
      for each production X → γ
        add X → •γ to J
  until J does not change
  return J

The reason why the closure operation is defined on sets of items will be clearer later.

State Transitions. If the parser is in a certain state and then reads a symbol X, then intuitively the parser will change state, in the sense that the set of strings it expects to see having read the X will be different (in general) to the set of strings the parser was expecting to see before the X. That is, if A → α • Xβ is an item then reading an X results in the new item A → αX • β (and all the items equivalent to it). We formalise these transitions between states of the parse with the function goto(I, X) that takes a set of items and a grammar symbol X as input, and returns a set of (equivalent) items as output, using the following algorithm:

goto(I, X)
  let J be the empty set
  for any item A → α • Xβ in I
    add A → αX • β to J
  return closure(J)

LR(0) State Set Construction. We now construct all possible states that an LR(0) parser for an LR(0) grammar G = (T, N, S, P) can be in. We begin the process of constructing LR(0) parsing tables by augmenting the grammar G with a new production S′ → S (where S′ is a new start symbol) to form an augmented grammar G′. The reason for this is that we can use the item S′ → •S to represent the start of the parse (that is, the initial state): initially, the parser will be in the state closure({S′ → •S}). We construct all other states from this initial state by using the goto and closure operations above, until no new states can be generated. The following algorithm accomplishes this task:

C := {closure({S′ → •S})}
repeat
  for each state I ∈ C
    for each item A → α • Xβ in I
      let J be goto(I, X)
      add J to C
until C does not change
return C

Example. Consider the LR(0) grammar S → (S) | a, which we augment with the production S′ → S. We begin construction of the LR(0) state set by computing the closure of the item {S′ → •S}:

{S′ → •S, S → •(S), S → •a}

which we will refer to as I0. So C = {I0} after the initial step of the algorithm. In the next step of the algorithm we compute goto(I0, S), goto(I0, a) and goto(I0, ():

goto(I0, S) = closure({S′ → S•}) = {S′ → S•} = I1 (say)
goto(I0, a) = closure({S → a•}) = {S → a•} = I2 (say)
goto(I0, () = closure({S → (•S)}) = {S → (•S), S → •(S), S → •a} = I3 (say)

and thus C = {I0, I1, I2, I3} after the first iteration of the repeat loop. On the next iteration (noticing that neither I1 nor I2 contain items of the form A → α • Xβ) we consider I3 and compute goto(I3, S), goto(I3, a) and goto(I3, ():

goto(I3, S) = closure({S → (S•)}) = {S → (S•)} = I4 (say)
goto(I3, a) = closure({S → a•}) = I2
goto(I3, () = closure({S → (•S)}) = I3

and now C = {I0, I1, I2, I3, I4}. On the next iteration, we need only compute goto(I4, )):

goto(I4, )) = closure({S → (S)•}) = {S → (S)•} = I5 (say)

and now C = {I0, I1, I2, I3, I4, I5}. It is clear that a further iteration of the loop will not yield any further states, so we have finished. The states and goto functions can be depicted as a DFA.

Constructing LR(0) Parsing Tables. We construct an LR(0) parsing table as follows:

(1) Construct the set {I0, I1, . . . , In} of states for the augmented grammar G′.
(2) The parsing actions for state Ii are determined as follows:
  (a) If A → α • aβ ∈ Ii and goto(Ii, a) = Ij then action(i, a) = shift j.
  (b) If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ T ∪ {$}.
  (c) If S′ → S• ∈ Ii then action(i, $) = accept.
(3) The goto actions for state i are given directly by the goto function: if goto(Ii, A) = Ij then let goto(i, A) = j.
(4) All entries not defined by the above rules are made error.
(5) The initial state is the one constructed from {S′ → •S}.

Rule (a) above says that, if we have parsed α and we can see a terminal symbol a, then we should shift the symbol onto the stack and go into state j. Rule (b) says that, if we have finished parsing a particular production rule α, then we should reduce by that production rule for all terminal symbols. Rule (c) says that, if we have successfully parsed the start symbol, then we have finished.
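The closure and goto operations used in the example above can be sketched in Java as follows. This is our own illustration, not code from the notes: items are encoded as plain strings with '.' as the marker (e.g. "S'->.S"), which only works here because every grammar symbol of the example grammar is a single character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Sketch of closure and goto for the augmented grammar S' -> S, S -> (S) | a.
public class LR0Items {
    // left-hand side -> list of right-hand sides (S' only appears as a seed item)
    static final Map<String, List<String>> PRODS = Map.of(
            "S", List.of("(S)", "a"));

    // closure(I): while some item has the dot before a nonterminal X,
    // add X -> .gamma for every production X -> gamma
    static Set<String> closure(Set<String> items) {
        Set<String> j = new TreeSet<>(items);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String item : new ArrayList<>(j)) {
                String afterDot = item.substring(item.indexOf('.') + 1);
                if (afterDot.isEmpty()) continue;
                String x = afterDot.substring(0, 1);
                if (PRODS.containsKey(x))
                    for (String rhs : PRODS.get(x))
                        changed |= j.add(x + "->." + rhs);
            }
        }
        return j;
    }

    // goto(I, X): move the dot over X in every item where X follows the dot
    static Set<String> gotoSet(Set<String> items, char x) {
        Set<String> moved = new TreeSet<>();
        for (String item : items) {
            int d = item.indexOf('.');
            if (d + 1 < item.length() && item.charAt(d + 1) == x)
                moved.add(item.substring(0, d) + x + "." + item.substring(d + 2));
        }
        return closure(moved);
    }

    public static void main(String[] args) {
        Set<String> i0 = closure(Set.of("S'->.S"));
        System.out.println("I0 = " + i0);
        System.out.println("goto(I0,S) = " + gotoSet(i0, 'S'));
        System.out.println("goto(I0,a) = " + gotoSet(i0, 'a'));
        System.out.println("goto(I0,() = " + gotoSet(i0, '('));
    }
}
```

Running this prints the item sets I0–I3 computed by hand above; repeating gotoSet over the results until nothing new appears reproduces the full state set C.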
For the example grammar we obtain the following parsing table, where sj means shift j, rj means reduce using production rule j (rule 1 is S → (S) and rule 2 is S → a), and ok means 'accept':

        action                    goto
State   a     (     )     $       S
0       s2    s3                  1
1                         ok
2       r2    r2    r2    r2
3       s2    s3                  4
4                   s5
5       r1    r1    r1    r1

This method of creating a parser will fail if there are any conflicts in the above rules, and grammars leading to such conflicts are not LR(0). A grammar is LR(0) if, and only if, all states contain either no items of the form A → α•, or contain exactly one item.
A trace of the parse of the string ((a)) is as follows:

Stack        Input     Action
0            ((a))$    shift 3
0(3          (a))$     shift 3
0(3(3        a))$      shift 2
0(3(3a2      ))$       reduce S → a
0(3(3S4      ))$       shift 5
0(3(3S4)5    )$        reduce S → (S)
0(3S4        )$        shift 5
0(3S4)5      $         reduce S → (S)
0S1          $         accept
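The driver loop of section 4.3.3, wired to the table above, can be sketched as follows. This is our own illustration, not code from the notes: only states are pushed onto the stack, since the grammar symbols interleaved with them on a real LRP stack are not needed to drive the automaton.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Sketch of the LRP driver for rule 1: S -> (S) and rule 2: S -> a.
public class LR0Parser {
    // action table: one map per state, terminal -> "sN" (shift to state N),
    // "rN" (reduce by rule N) or "ok" (accept); missing entries mean error
    static final List<Map<Character, String>> ACTION = List.of(
            Map.of('a', "s2", '(', "s3"),                        // state 0
            Map.of('$', "ok"),                                   // state 1
            Map.of('a', "r2", '(', "r2", ')', "r2", '$', "r2"),  // state 2
            Map.of('a', "s2", '(', "s3"),                        // state 3
            Map.of(')', "s5"),                                   // state 4
            Map.of('a', "r1", '(', "r1", ')', "r1", '$', "r1")); // state 5
    static final int[] GOTO_S = {1, -1, -1, 4, -1, -1};  // goto(state, S)
    static final int[] RHS_LEN = {0, 3, 1};              // |(S)| = 3, |a| = 1

    static boolean parse(String input) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);                       // the initial state
        String w = input + "$";
        int i = 0;
        while (true) {
            String act = ACTION.get(stack.peek()).get(w.charAt(i));
            if (act == null) return false;   // error entry
            if (act.equals("ok")) return true;
            if (act.charAt(0) == 's') {      // shift: push the new state
                stack.push(act.charAt(1) - '0');
                i++;
            } else {                         // reduce: pop |rhs| states,
                int rule = act.charAt(1) - '0';
                for (int k = 0; k < RHS_LEN[rule]; k++) stack.pop();
                stack.push(GOTO_S[stack.peek()]);  // then take the S-goto
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("((a))"));  // true: follows the trace above
        System.out.println(parse("((a)"));   // false: syntax error
    }
}
```

Stepping through parse("((a))") visits exactly the stack configurations shown in the trace above, with the grammar symbols omitted.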
Notice that, from the rules for constructing an LR(0) parsing table, if a state contains an item of the form A → α• (signifying the fact that α should be reduced to A) then it cannot also contain another item of the form:

(a) B → β•, or
(b) B → β • aγ.

In the case of (a), the parser cannot decide whether to reduce using rule A → α or rule B → β; this is known as a reduce-reduce conflict. In the case of (b), the parser cannot decide whether to reduce using rule A → α or to shift (by the item B → β • aγ); this is known as a shift-reduce conflict.
4.3.5 SLR(1) Parsing

SLR(1) parsing is more powerful than LR(0), as it consults (by the way an SLR(1) parsing table is constructed) the next input token to direct its actions. Let's consider an example. Consider the following productions for a grammar for arithmetic expressions,

(1) E → E + T
(2) E → T
(3) T → T ∗ F
(4) T → F
(5) F → (E)
(6) F → a

augmented with the production E′ → E. We first construct the LR(0) state set; we then show that the grammar is not LR(0) due to a shift-reduce conflict; we then show how to construct general SLR(1) parsing tables and apply this to our example. The construction of an SLR(1) table begins with the LR(0) state set construction, as with LR(0) parsers.

The state set construction proceeds as follows. We begin by constructing the initial state:

I0 = closure({E′ → •E})
   = {E′ → •E, E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •(E), F → •a}.

Thus C = {I0} after the initial step of the algorithm. Next, we see that we need to compute goto(I0, X) for X = a, (, E, T, F:

goto(I0, a) = closure({F → a•}) = {F → a•} = I1 (say)
goto(I0, () = closure({F → (•E)}) = {F → (•E), E → •E + T, E → •T, T → •T ∗ F, T → •F, F → •(E), F → •a} = I2 (say)
goto(I0, E) = closure({E′ → E•, E → E • +T}) = {E′ → E•, E → E • +T} = I3 (say)
goto(I0, T) = closure({E → T•, T → T • ∗F}) = {E → T•, T → T • ∗F} = I4 (say)
goto(I0, F) = closure({T → F•}) = {T → F•} = I5 (say).

On the next iteration of the loop we need to consider I1, I2, I3, I4 and I5. First, notice that there are no items in I1 that have any symbols following the marker, and hence goto(I1, X) = ∅ for each possible symbol X. We now compute goto(I2, X) for X = a, (, E, T, F as follows:

goto(I2, a) = closure({F → a•}) = I1
goto(I2, () = closure({F → (•E)}) = I2
goto(I2, E) = closure({F → (E•), E → E • +T}) = {F → (E•), E → E • +T} = I6 (say)
goto(I2, T) = closure({E → T•, T → T • ∗F}) = I4
goto(I2, F) = closure({T → F•}) = I5.

For I3 we only need to compute goto(I3, +):

goto(I3, +) = closure({E → E + •T}) = {E → E + •T, T → •T ∗ F, T → •F, F → •(E), F → •a} = I7 (say).

For I4 we need only compute goto(I4, ∗):

goto(I4, ∗) = closure({T → T ∗ •F}) = {T → T ∗ •F, F → •(E), F → •a} = I8 (say).

We also see that goto(I5, X) = ∅ for all X, as in the case of I1. Thus, after the second iteration of the loop we have C = {I0, I1, I2, I3, I4, I5, I6, I7, I8}. In the next iteration we need to consider I6, I7 and I8. For I6, we see that we need to compute goto(I6, +) and goto(I6, )):

goto(I6, +) = closure({E → E + •T}) = I7
goto(I6, )) = closure({F → (E)•}) = {F → (E)•} = I9 (say).

For I7, we see that we need to compute goto(I7, X) for X = a, (, T, F:

goto(I7, a) = closure({F → a•}) = I1
goto(I7, () = closure({F → (•E)}) = I2
goto(I7, T) = closure({E → E + T•, T → T • ∗F}) = {E → E + T•, T → T • ∗F} = I10 (say)
goto(I7, F) = closure({T → F•}) = I5.

For I8, we see that we need to compute goto(I8, X) for X = a, (, F:

goto(I8, a) = closure({F → a•}) = I1
goto(I8, () = closure({F → (•E)}) = I2
goto(I8, F) = closure({T → T ∗ F•}) = {T → T ∗ F•} = I11 (say).

Thus, after the third iteration of the loop we have C = {I0, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, I11}. For I9 and I11, we see that goto(I9, X) = goto(I11, X) = ∅ for all X. For I10 we only need to compute goto(I10, ∗):

goto(I10, ∗) = closure({T → T ∗ •F}) = I8.

Thus, C did not change during the fourth iteration of the loop and we have finished(!).

Now, consider an attempt at constructing an LR(0) parsing table from this state set. Specifically, consider the state I4 (we could also have chosen I10). This state contains the items E → T• and T → T • ∗F, and goto(I4, ∗) = I8. This is a shift-reduce conflict: we would attempt to place both s8 and r2 at location (4, ∗) in the table.

Constructing SLR(1) Parsing Tables. In LR(0) parsing a reduce action is performed whenever the parser enters a state i (i.e. Ii) whose (only) item is of the form A → α•, regardless of the next input symbol. In SLR(1) parsing, if a state contains an item of this form then the reduction of α to A is only performed if the next token is in Follow(A) (the set of all terminal symbols that can follow A in a sentential form); otherwise it performs a different action (e.g. a shift, reduce, or error) based on the other items of state Ii.
The rules for constructing SLR(1) parsing tables are the same as those for LR(0) tables given in the previous section, except that we replace rule 2(b):

If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ T ∪ {$}

by:

If A → α• ∈ Ii then action(i, a) = reduce A → α, for each a ∈ Follow(A).

Let's consider the example grammar. The Follow sets are

Follow(E) = {$, +, )}
Follow(T) = {$, +, ∗, )}
Follow(F) = {$, +, ∗, )}.

Recall the items E → T• and T → T • ∗F of I4, where goto(I4, ∗) = I8. As ∗ ∉ Follow(E), we do not choose to reduce in this state when the next token is ∗: we place only s8 in the parsing table at position (4, ∗). Notice that the shift-reduce conflict mentioned above has thereby been resolved. The complete SLR(1) parsing table for this grammar is as shown below:

        action                                      goto
State   a     +     ∗     (     )     $       E     T     F
0       s1                s2                  3     4     5
1             r6    r6          r6    r6
2       s1                s2                  6     4     5
3             s7                      ok
4             r2    s8          r2    r2
5             r4    r4          r4    r4
6             s7                s9
7       s1                s2                        10    5
8       s1                s2                              11
9             r5    r5          r5    r5
10            r1    s8          r1    r1
11            r3    r3          r3    r3

A parse of the string a + a ∗ a is:
Step   Stack               Input       Action
1      0                   a + a ∗ a   shift 1
2      0a1                 +a ∗ a      reduce by F → a
3      0F5                 +a ∗ a      reduce by T → F
4      0T4                 +a ∗ a      reduce by E → T
5      0E3                 +a ∗ a      shift 7
6      0E3 + 7             a ∗ a       shift 1
7      0E3 + 7a1           ∗a          reduce by F → a
8      0E3 + 7F5           ∗a          reduce by T → F
9      0E3 + 7T10          ∗a          shift 8
10     0E3 + 7T10 ∗ 8      a           shift 1
11     0E3 + 7T10 ∗ 8a1    $           reduce by F → a
12     0E3 + 7T10 ∗ 8F11   $           reduce by T → T ∗ F
13     0E3 + 7T10          $           reduce by E → E + T
14     0E3                 $           accept
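The way the SLR(1) rule resolves the conflict in state 4 can be seen in a tiny sketch. This is our own illustration, not code from the notes; the state number, Follow(E) and the resulting table entries are the ones computed above.

```java
import java.util.Set;

// State I4 = {E -> T., T -> T.*F}: reduce by rule 2 (E -> T) only when the
// lookahead is in Follow(E); on '*' we shift to I8 instead.
public class SlrChoice {
    static final Set<Character> FOLLOW_E = Set.of('$', '+', ')');

    static String actionInState4(char lookahead) {
        if (lookahead == '*') return "s8";             // shift item T -> T.*F
        if (FOLLOW_E.contains(lookahead)) return "r2"; // reduce by E -> T
        return "error";
    }

    public static void main(String[] args) {
        System.out.println(actionInState4('*')); // s8
        System.out.println(actionInState4('+')); // r2
        System.out.println(actionInState4('a')); // error
    }
}
```

Compare the output with row 4 of the SLR(1) table above: an LR(0) parser would have had to reduce on every lookahead, creating the conflict at (4, ∗).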
It can be seen from the SLR(1) table construction rules that a grammar is SLR(1) if, and only if, the following conditions are satisfied for each state i:

(a) For any item A → α • aβ, there is no other item B → γ• with a ∈ Follow(B).
(b) For any two items A → α• and B → β•, Follow(A) ∩ Follow(B) = ∅.

Violation of (a) will lead to a shift-reduce conflict: if the parser is in state i and a is the next token, the parser cannot decide whether to shift the a (and goto(i, a)) onto the stack or to reduce the top of the stack using the rule B → γ. Violation of (b) will lead to a reduce-reduce conflict: if the parser is in state i and the next token a is in Follow(A) and Follow(B), the parser cannot decide whether to reduce using rule A → α or rule B → β. For non-SLR(1) grammars that are still LR(1), we need to look at LR(1) and LALR(1) parsers.
5 Javacc

Javacc is a compiler compiler which takes a grammar specification and writes a parser for that language in Java. It comes with a useful tool called Jjtree which in turn produces a Javacc file with extra code added to construct a tree from a parse. By running the output from Jjtree through Javacc and then compiling the resulting Java files we get a full top-down parser for our grammar. The process is shown in figure 1.

[Figure 1: The JavaTree Compilation Process — myFile.jjt (the language specification) is fed to jjtree, which produces the Javacc file myFile.jj; javacc turns this into the parser source myFiles.java, which javac compiles into the finished parser.]

Because the parser is top-down we can only use Javacc on LL(k) grammars, meaning that we have to remove any left-recursion from the grammar before writing the parser.

5.1 Grammar and Coursework

The grammar we will be compiling, both in these notes and in your coursework, is defined here:

program → decList begin statementList | statementList
decList → dec decList | dec
dec → type id
type → float | int
statementList → statement ; statementList | statement
statement → ifStatement | repeatStatement | assignStatement | readStatement | writeStatement
ifStatement → ifPart end | ifPart elsePart end
ifPart → if exp then statementList
elsePart → else statementList
repeatStatement → repeat statementList until exp
assignStatement → id := exp
readStatement → read id
writeStatement → write exp
exp → simpleExp boolOp simpleExp | simpleExp
boolOp → < | =
simpleExp → term addOp simpleExp | term
addOp → + | -
term → factor mulOp factor | factor
mulOp → * | /
factor → id | float | int | ( exp )

The terminals in this grammar are precisely those written in bold in the above list. So we have an optional list of variable declarations, followed by a series of statements. All variables must be declared before use, and each variable can be a floating point number or an integer. There are 5 statements: a conditional, an iterational, an assignment, and a read and write statement. We also have various expressions. There are no boolean variables, but two boolean operators, less than and equals, are available for conditional and repeat statements. Note how the grammar forces all arithmetic expressions to have the correct operator precedence. Other ways of achieving this will be discussed in the lectures.

Your coursework comes in seven parts with a deadline at midnight on the Friday of weeks 2-6, 8 and 11. Each week the coursework will build upon the previous work. The coursework in total is worth 30% of the marks for the module: the first five courseworks are worth 10% each, the last two are worth 25% each. The coursework is designed to take 30 hours to complete in total, so you should spend about 3 hours each week. Although there will not be formal lab classes, you are strongly recommended to use Friday afternoons as such. Your plan each week should be to read the relevant section of this document, try out the examples, ensure that you understand fully what is going on, and complete the coursework for that week. If this takes less than three hours you should start work on the next one. The early parts shouldn't take that long, so you are encouraged to work ahead. I hope to put a model answer on Blackboard by Monday evening each week, together with your grade, so if you achieved a disappointing grade for one assignment you can use the model answer for the next. Because I am giving model answers each week there will be no extensions. If you have a reasonable excuse for not submitting on time you will be exempted from that coursework.
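To make the grammar concrete, here is a small program that conforms to the rules above (the variable names and the particular statements are our own choice, not taken from the notes):

```
int x
float y
begin
x := 3 + 4 * 2 ;
read y ;
if x < 10 then
    write x * x
end ;
repeat
    x := x - 1
until x = 0
```

Note that semicolons separate statements (there is no trailing semicolon after the last one), and that declarations need no separator because decList is defined by simple repetition of dec.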
5.2 Recognising Tokens

Our first job is to build a recogniser for the terminal symbols in the grammar. Let's look at a minimal Javacc file:

options{
    STATIC = true;
}

PARSER_BEGIN(LexicalAnalyser)
class LexicalAnalyser{
    public static void main(String[] args){
        LexicalAnalyser lexan = new LexicalAnalyser(System.in);
        try{
            lexan.start();
        }//end try
        catch(ParseException e){
            System.out.println(e.getMessage());
        }//end catch
        System.out.println("Finished Lexical Analysis");
    }//end main
}//end class
PARSER_END(LexicalAnalyser)

//Ignore all whitespace
SKIP:{" " | "\t" | "\n" | "\r"}

//Declare the tokens
TOKEN:{<INT: (["0"-"9"])+>}
//Now the operators
TOKEN:{<PLUS: "+">}
TOKEN:{<MINUS: "-">}
TOKEN:{<MULT: "*">}
TOKEN:{<DIV: "/">}

void start():
{}
{
    <PLUS> {System.out.println("Plus");}
  | <MINUS> {System.out.println("Minus");}
  | <MULT> {System.out.println("Multiply");}
  | <DIV> {System.out.println("Divide");}
  | <INT> {System.out.println("Integer");}
}
First we have a section for options. Most of the options available default to acceptable values for this example, but we want this program to be static because we are going to write a main method here rather than in another file; therefore we set this option to true because it defaults to false.

Next we have a block of code delimited by PARSER_BEGIN(LexicalAnalyser) and PARSER_END(LexicalAnalyser). The argument is the name which Javacc will give the finished parser. All code within the PARSER_BEGIN and PARSER_END delimiters will be output verbatim by Javacc. Within that there is a class declaration. Then we write the main method, which constructs a new parser and calls its start() method. This has to be placed inside a try-catch block because Javacc automatically generates its own internally defined ParseExceptions and throws them if something goes wrong with the parse. Finally we print a message to the screen announcing that we have finished the parse successfully.

We then have some SKIPped items. In this case we are simply telling the final parser to ignore all whitespace: spaces, tabs, newlines and carriage returns. This is normal, as we usually want to encourage users of our language to use whitespace to indent their programs to make them easier to read. There are languages in which whitespace has meaning, e.g. Haskell, but these tend to be rare.

We then have to declare the tokens or terminal symbols of the grammar. A TOKEN declaration consists of:

• The keyword TOKEN;
• A colon;
• An open brace;
• An open angle bracket;
• The name by which we will be referring to this token;
• A colon;
• A regular expression defining the pattern for recognising this token;
• The close angle bracket;
• The close brace.

The first declaration is simply a regular expression for integers. We can put in square brackets a range of values, separated by a hyphen, in this case 0-9; this means "anything between these two values inclusive". Then, by wrapping the range expression in parentheses and adding a "+" symbol, we say that an INT (our name for integers) is one or more instances of a digit. We then include definitions for the arithmetic operators. For such a simple example this is not necessary, since we could use literals instead, but we will be making this more complex later so it's as well to start off the right way. Besides, it sometimes makes the code difficult to read if we mix literals and references freely. Note that literals are always included in quote marks. We can have multiple token (or skip) declarations in a single line by including the | (or) symbol, as in the skip declarations in the example above.

We finally define the start() method in which all the work is done. A method declaration in Javacc consists of the return type (in this case void), the name of the method (in this case start) and any arguments required, followed by a colon and two (!) sets of braces. The first one contains variables local to this method (in this case none); the second one contains the grammar's production rules. Each statement inside the start() method follows a format whereby a previously defined token is followed by a set of one or more (one in this example) statements inside braces. This states that when the respective token is identified by the parser, the code inside the braces (which must be legal Java code) is executed. Thus, if the parser recognises an int it will print "Integer" to the terminal. In the above example we are saying that a legal program consists of a single instance of one of the grammar terminals.

You are strongly recommended to type in the above code and compile it; you should do this with all the examples given in this chapter. We first run the code through Javacc:

Javacc lexan.jj

(N.B. the filetype .jj is compulsory). This gives us a series of messages, most of which relate to the files being produced by Javacc. We then compile the resulting Java files:

javac *.java

This gives us the .class files, which can be run from the command line with java LexicalAnalyser. Test your parser with both legal and illegal input to see what happens.

At the moment we can only read a single token. The first improvement is to continue to read tokens until either we reach the end of the input or we read illegal input. We can do this by simply creating another regular expression inside the start() method:

void start():
{}
{
    (<PLUS> | <MINUS> | <DIV> | <MULT> | <INT>)+
}

Also we can redirect the input from the console to an input file:
java LexicalAnalyser < input.txt

This file will now, when compiled, produce a parser which recognises any number of our operators and integers, printing to the screen what it has found. It will continue to do this until it reaches the end of input or finds an error. The first time it finds an error it exits.

It is time to do assignment 1. You must write a Javacc file which recognises all legal terminal symbols in the grammar defined on page 77. The deadline for this is Friday midnight of week two.

5.3 JjTree

So far all we have done is create a lexical analyser which merely decides whether each input is a token (terminal symbol) or not. For a complete compiler we need to construct a tree; having done that we can traverse the tree generating machine code for our target machine. Luckily Javacc comes with a tree builder which inserts all the code necessary for the construction of such a tree. We will now build a complete parser using this tool – Jjtree.

First of all it should be noted that Jjtree is a preprocessor for Javacc, i.e. it takes our input file and outputs Javacc code with tree-building code built in. The process is shown in figure 1. The input file to Jjtree needs a filetype of jjt, so let's just change the file we have been using to Lexan.jjt (from Lexan.jj) and add a few lines of code:

options{
    STATIC = true;
}

PARSER_BEGIN(LexicalAnalyser)
class LexicalAnalyser{
    public static void main(String[] args)throws ParseException{
        LexicalAnalyser lexan = new LexicalAnalyser(System.in);
        SimpleNode root = lexan.start();
        System.out.println("Finished Lexical Analysis");
        root.dump("");
    }
}
PARSER_END(LexicalAnalyser)

//Ignore all whitespace
SKIP:{" " | "\t" | "\n" | "\r"}

//Declare the tokens
TOKEN:{<INT: (["0"-"9"])+>}
//Now the operators
TOKEN:{<PLUS: "+">}
TOKEN:{<MINUS: "-">}
TOKEN:{<MULT: "*">}
TOKEN:{<DIV: "/">}

SimpleNode start():
{}
{
    (<PLUS> {System.out.println("Plus");}
   | <MINUS> {System.out.println("Minus");}
   | <MULT> {System.out.println("Multiply");}
   | <DIV> {System.out.println("Divide");}
   | <INT> {System.out.println("Integer");}
    )+
    {return jjtThis;}
}

We have stated that calling lexan.start() will return an object of type SimpleNode, and we have assigned this to a variable called root. SimpleNode is the default type of node returned by JavaTree methods; this will be the root of our tree. The start() method now has a return value of type SimpleNode, and the final line of the method returns jjtThis: jjtThis is the node currently under construction when it is executed. The final line of the main method now calls root.dump(""), passing it an empty string. This is a method defined within the files produced automatically by JavaTree, and its behaviour can be changed if we wish; we will be seeing later how we can change this behaviour. At the moment it prints to the output a text description of the tree which has been built by the parse, if successful.

Running this code (Lexan.jjt) through jjtree produces a file called Lexan.jj, this time with tree-building capabilities built in. Run this through Javacc and we get the usual series of .java files. Finally, invoking javac *.java produces our parser. Running the compiled code with the following input file

123 + - * /

produces the following output:

Integer
Plus
Minus
Multiply
Divide
Finished Lexical Analysis
start

It is the same as before except that we now see the result of printing out the tree (dump("")). It produces the single line start; this is in fact the name of the root of the tree. We have constructed a tree with a single node called start.

Let's add a few production rules to get a clearer idea of what is happening:

SimpleNode start():
{}
{
    (multExp())* {return jjtThis;}
}

void multExp():
{}
{
    intExp() <MULT> intExp()
}

void intExp():
{}
{
    <INT>
}

What we are doing here is stating that all internal nodes in our finished tree take the form of a method. Then we have a series of methods, each of which contains a production rule for that nonterminal symbol. The start() rule expects 0 or more multExp's. A multExp is an intExp followed by a MULT terminal followed by another intExp. An intExp is simply an INT terminal, while all terminal nodes are tokens which we have defined. The grammar in BNF form is as follows:

start → multExp | multExp start | ǫ
multExp → intExp MULT intExp
intExp → INT

where INT and MULT are terminal symbols. The output from this, after being run on the input

1*12
3 *14
45 * 6

is more interesting:

start
 multExp
  intExp
  intExp
 multExp
  intExp
  intExp
 multExp
  intExp
  intExp

Pictorially, the tree we have built from the parse of the input is shown in figure 2.

[Figure 2: The tree built from the parse — a Start node with three MultExp children, each MultExp having two IntExp children.]

We have a start node containing three multExp's, each of which contains two children, both of which are intExp's.

It is time for assignment 2. All you have to do is change the file you produced for assignment 1 into a jjtree file which contains the complete specification for the module grammar given on page 77.

Notice how the nodes take the same name as the method name in which they are defined? Let's change this behaviour:

SimpleNode start() #Root:{}{..}
void multExp() #MultiplicationExpression:{}{..}
void intExp() #IntegerExpression:{}{..}

All we have done is precede the method body with a local name (indicated by the # sign). This gives us the following, more easily understandable output:

Root
 MultiplicationExpression
  IntegerExpression
  IntegerExpression
 MultiplicationExpression
  IntegerExpression
  IntegerExpression
 MultiplicationExpression
  IntegerExpression
  IntegerExpression

To see the full effect of this change we can add two options to the first section of the file:

MULTI = true;
NODE_PREFIX = "";

The first tells the compiler builder to create different types of node, rather than just SimpleNode; we will be using these different names later. If we don't add the second line, the files thus generated will all have a default prefix. Therefore we set this to true because it defaults to false.

It is time for assignment 3. Change all the methods you have written so far so that they produce node names and types which are other than the names of the methods.
All we have done is precede the method body with a local name (indicated by the # sign). With the addition of the above method of changing node names we have more easily understandable output:

Root
 MultiplicationExpression
  IntegerExpression
  IntegerExpression
 MultiplicationExpression
  IntegerExpression
  IntegerExpression
 MultiplicationExpression
  IntegerExpression
  IntegerExpression

To see the full effect of this change we can add two options to the first section of the file:

MULTI = true;
NODE_PREFIX = "";

The first tells the compiler builder to create different types of node, rather than just SimpleNode. If we don't add the second line the files thus generated will all have a default prefix. We will be using these different names later. Change all the methods you have written so far so that they produce node names and types which are other than the names of the methods.

We now have a great deal of information about interior nodes. What about terminal nodes? We know that we have, for example, an identifier or perhaps a floating point number, but what is the name of that identifier, or the actual value of the float? We will need to know that information in order to be able to use it when, for example, generating code.

Looking at the list of files that Javacc generates for us we can see lots of node .java files. Examining the code we can see that all of them extend SimpleNode. Let's look at the superclass:

/* Generated By:JJTree: Do not edit this line. SimpleNode.java */
public class SimpleNode implements Node {
  protected Node parent;
  protected Node[] children;
  protected int id;
  protected LexicalAnalyser parser;

  public SimpleNode(int i) {
    id = i;
  }

  public SimpleNode(LexicalAnalyser p, int i) {
    this(i);
    parser = p;
  }

  public void jjtOpen() {
  }

  public void jjtClose() {
  }

  public void jjtSetParent(Node n) { parent = n; }

  public Node jjtGetParent() { return parent; }

  public void jjtAddChild(Node n, int i) {
    if (children == null) {
      children = new Node[i + 1];
    } else if (i >= children.length) {
      Node c[] = new Node[i + 1];
      System.arraycopy(children, 0, c, 0, children.length);
      children = c;
    }
    children[i] = n;
  }

  public Node jjtGetChild(int i) {
    return children[i];
  }

  public int jjtGetNumChildren() {
    return (children == null) ? 0 : children.length;
  }

  /* You can override these two methods in subclasses of SimpleNode to
     customize the way the node appears when the tree is dumped. If your
     output uses more than one line you should override toString(String),
     otherwise overriding toString() is probably all you need to do. */

  public String toString() {
    return LexicalAnalyserTreeConstants.jjtNodeName[id];
  }

  public String toString(String prefix) { return prefix + toString(); }

  /* Override this method if you want to customize how the node dumps
     out its children. */

  public void dump(String prefix) {
    System.out.println(toString(prefix));
    if (children != null) {
      for (int i = 0; i < children.length; ++i) {
        SimpleNode n = (SimpleNode)children[i];
        if (n != null) {
          n.dump(prefix + " ");
        }
      }
    }
  }
}

The child nodes for this node are kept in an Array. jjtGetChild(i) returns the ith child node, jjtGetNumChildren() returns the number of children this node has, and dump() controls how the nodes are printed to the screen when we call it on the Root node. Notice that dump() in particular recursively invokes dump() on all its children after printing itself, giving a preorder printout of the tree. This can easily be changed to inorder or postorder.

Since all other nodes inherit from SimpleNode, we can add functionality to the node classes by adding it to SimpleNode. Add the following to the SimpleNode file:

protected String type = "Interior node";
protected int intValue;

public void setType(String type){this.type = type;}
public String getType(){return type;}
public void setInt(int value){intValue = value;}
public int getInt(){return intValue;}

By adding these members to the SimpleNode class all subclasses of SimpleNode will inherit them. Now change the code in the factor method as follows:
void factor() #Factor:
{Token t;}
{
    <ID>
  | <FLOAT>
  | t = <INT> {jjtThis.setType("int");
               jjtThis.setInt(Integer.parseInt(t.image));}
  | <OPENPAR> exp() <CLOSEPAR>
}

What we are saying here is that if the method factor() is called then it will be an ID, a float, an int or a parenthesised expression. If it is an int we add to the factor node the type and the value of this particular int. Each time javacc recognises a predefined token it produces an object of type Token. The class Token contains some useful members, of which two are seen here: kind and image.

Now, when we have a factor node we know, for each terminal factor node, exactly which type of factor we are dealing with and, if it is an integer, we will know its value as well. This is exactly the information we need to add to our nodes. We can also, by adding to this code, make use of that information when printing the tree. Look at the dump() method:

public void dump(String prefix) {
  System.out.println(toString(prefix));
  if(toString().equals("Factor")){
    if(getType().equals("int")){
      System.out.println("I am an integer and my value is " + getInt());
    }
  }
  if (children != null) {
    for (int i = 0; i < children.length; ++i) {
      SimpleNode n = (SimpleNode)children[i];
      if (n != null) {
        n.dump(prefix + " ");
      }
    }
  }
}

Now when we call the dump() method of our root node we will get all the information we had previously plus, if a node is an integer, its value as well.

Time for assignment 4. Alter your code (in the factor method and the SimpleNode class) so that variables, integers and floats will all print out what they are, plus their value or name.
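The remark above that dump()'s recursion gives a preorder printout, and that this "can easily be changed", is easy to experiment with in isolation. The sketch below uses DemoNode, an illustrative stand-in for the generated SimpleNode (it is not JJTree output); only the position of the append call changes between the two traversals.

```java
import java.util.ArrayList;
import java.util.List;

public class DumpOrderDemo {
    // Illustrative stand-in for the JJTree SimpleNode: a name plus children.
    static class DemoNode {
        final String name;
        final List<DemoNode> children = new ArrayList<>();
        DemoNode(String name) { this.name = name; }
        DemoNode add(DemoNode c) { children.add(c); return c; }

        // Preorder, like the generated dump(): this node first, then children.
        void dumpPreorder(String prefix, StringBuilder out) {
            out.append(prefix).append(name).append('\n');
            for (DemoNode c : children) c.dumpPreorder(prefix + " ", out);
        }

        // Postorder: children first, then this node.
        void dumpPostorder(String prefix, StringBuilder out) {
            for (DemoNode c : children) c.dumpPostorder(prefix + " ", out);
            out.append(prefix).append(name).append('\n');
        }
    }

    public static void main(String[] args) {
        DemoNode root = new DemoNode("Root");
        DemoNode mult = root.add(new DemoNode("MultiplicationExpression"));
        mult.add(new DemoNode("IntegerExpression"));
        mult.add(new DemoNode("IntegerExpression"));

        StringBuilder pre = new StringBuilder();
        root.dumpPreorder("", pre);
        System.out.print(pre);   // Root appears before its children

        StringBuilder post = new StringBuilder();
        root.dumpPostorder("", post);
        System.out.print(post);  // Root appears after its children
    }
}
```

Moving the append before or after the loop is all that distinguishes the two orders; an inorder variant would interleave the append between child visits.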
5.5 Conditional statements in jjtree

Many of the methods we have defined so far for the internal nodes have an optional part. For example, a statement list is a single statement followed by an optional semicolon and statement list. This use of recursion allows us to have as many statements in our grammar as we want. Unfortunately it means that when the tree is constructed it will not be possible to tell what exactly we have when we are examining a statement list node without counting to see how many child nodes there are. This unnecessarily complicates our code, so let's see how to alter it:

void multExp() #MultiplicationExpression(>1): {}
{
  intExp() (<MULT> intExp())?
}

The production rule now states that a multExp is a single intExp followed by 0 or 1 intExps (indicated by the question mark). The (>1) expression says to return a MultiplicationExpression node only if there is more than one child. If the expression consists of just a single intExp the method returns that, rather than a MultiplicationExpression. This very much simplifies the output of dump(). Now when we visit an expression or statement we know exactly what sort of node we are dealing with, which also makes our job much easier when walking the tree to, for example, type check assignment statements or generate machine code.

5.6 Symbol Tables

Because our terminal nodes are treated differently from interior nodes we need to maintain a symbol table to keep track of them. We will use this during all phases of our finished compiler, so we should look at it now. Of course we will implement the symbol table as a hash table, so that we can access our variables in constant time. This would also be a convenient time to add some demands to our language, such as the requirement to declare variables before using them, and forbidding, or at least detecting, multiple declarations.

Our Variables have a name, a type and a value. This is a trivial class to implement:

public class Variable{
  private String type;
  private String name;
  private int intValue = 0;
  private double floatValue = 0.0;

  public Variable(int value, String name){
    this.name = name;
    intValue = value;
    type = "integer";
  }

  public Variable(double value, String name){
    this.name = name;
    floatValue = value;
    type = "float";
  }

  public String getType(){return type;}
  public String getName(){return name;}
  public int getInt(){return intValue;}
  public double getDouble(){return floatValue;}
}

Now that we have Variables we need to declare a HashMap in which to put them. Once we have done that we add to our LexicalAnalyser class a method for adding variables to our symbol table:

public static void addToSymbolTable(Variable v){
  if(symbolTable.containsKey(v.getName())){
    System.out.println("Variable " + v.getName() + " multiply declared");
  }
  else{
    symbolTable.put(v.getName(), v);
  }
}

This quite simply interrogates the symbol table to see if the Variable we have just read exists or not. If it does we print out an error message, otherwise we add the new variable to the table. All that is left to do is invoke our function when we read a variable declaration:

void dec() #Declaration:
{String tokType; Variable v; Token tokName;}
{
  tokType = type() tokName = <ID>
  {
    if(tokType.equals("integer")) v = new Variable(0, tokName.image);
    else v = new Variable(0.0, tokName.image);
    addToSymbolTable(v);
  }
}

String type() #Type:
{Token t;}
{
    t = <INTEGER> {return "integer";}
  | t = <FPN> {return "float";}
}

Read the above carefully, type it in and make sure you understand exactly what is going on. Then start this week's assignment.

Assignment 5. Alter the code you have written so far so that methods with alternative numbers of children return the appropriate type of node. Then add the symbol table code given above. (Ideally you should declare the main method in a class of its own to avoid having all the extra methods and fields declared as static.) Then add code which detects if the writer is trying to use a variable which has not been declared, printing an error message if (s)he does so.

5.7 Visiting the Tree

In altering the dump() method in the previous example we merely altered the toString() method. But let's think about what we are trying to do. We have a data structure (a tree) which we want to traverse. Let us consider what we have here. Those of you who took CS211 - Programming with Objects and Threads will recognise immediately the Visitor pattern. (See, you always knew that module would come in handy, especially with the abilities you have all demonstrated by progressing thus far in your studies.) Fortunately the writers of Javacc also took that module, so they have already thought of this. If we add a VISITOR switch to the options section:

options{
  MULTI = true;
  STATIC = false;
  VISITOR = true;
  NODE_PREFIX = "";
}

jjtree writes a Visitor interface for us to implement. It defines an accept method for each of the nodes, allowing us to pass our implementation of the Visitor to each node. This is the Visitor interface for our current example:

/* Generated By:JJTree: Do not edit this line. */
public interface LexicalAnalyserVisitor {
  public Object visit(SimpleNode node, Object data);
  public Object visit(Root node, Object data);
  public Object visit(DeclarationList node, Object data);
  public Object visit(Declaration node, Object data);
  public Object visit(Type node, Object data);
  public Object visit(StatementList node, Object data);
  public Object visit(Statement node, Object data);
  public Object visit(AssignmentStatement node, Object data);
  public Object visit(IfStatement node, Object data);
  public Object visit(IfPart node, Object data);
  public Object visit(ElsePart node, Object data);
  public Object visit(RepeatStatement node, Object data);
  public Object visit(ReadStatement node, Object data);
  public Object visit(WriteStatement node, Object data);
  public Object visit(Expression node, Object data);
  public Object visit(SimpleExpression node, Object data);
  public Object visit(Term node, Object data);
  public Object visit(Factor node, Object data);
  public Object visit(BooleanOperator node, Object data);
  public Object visit(AdditionOperator node, Object data);
  public Object visit(MultiplicationOperator node, Object data);
}

Warning 1. Jjtree does not consistently rewrite all of its classes. This means that if you have been writing these classes and experimenting with them as you have read this (highly recommended) you should delete all files except the .jjt file, add the new options and then recompile with Jjtree.

Warning 2. We will be writing an implementation of the Visitor interface, ./LexicalAnalyserVisitor.java. Our implementation is our own file and is not therefore rewritten by Jjtree. Therefore from now on if we alter the .jjt file and then compile byte code with javac *.java we may get errors if we have not checked our implementation of the Visitor first.
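The point of all those generated accept methods is double dispatch: the node hands itself back to the visitor, so the correct visit overload is chosen by the node's class rather than by a chain of instanceof tests. The self-contained sketch below illustrates the mechanism; DemoVisitor, AstNode, NumNode, AddNode and EvalVisitor are illustrative names, not the generated ones.

```java
// Self-contained sketch of the double dispatch behind jjtAccept().
interface DemoVisitor {
    Object visit(NumNode node, Object data);
    Object visit(AddNode node, Object data);
}

abstract class AstNode {
    // Like the generated jjtAccept(): each node hands itself to the visitor.
    abstract Object accept(DemoVisitor v, Object data);
}

class NumNode extends AstNode {
    final int value;
    NumNode(int value) { this.value = value; }
    Object accept(DemoVisitor v, Object data) { return v.visit(this, data); }
}

class AddNode extends AstNode {
    final AstNode left, right;
    AddNode(AstNode l, AstNode r) { left = l; right = r; }
    Object accept(DemoVisitor v, Object data) { return v.visit(this, data); }
}

// One concrete visitor: evaluates the tree without touching the node classes.
class EvalVisitor implements DemoVisitor {
    public Object visit(NumNode n, Object data) { return n.value; }
    public Object visit(AddNode n, Object data) {
        return (Integer) n.left.accept(this, data)
             + (Integer) n.right.accept(this, data);
    }
}

public class VisitorSketch {
    public static void main(String[] args) {
        AstNode tree = new AddNode(new NumNode(2),
                                   new AddNode(new NumNode(3), new NumNode(4)));
        System.out.println(tree.accept(new EvalVisitor(), null)); // prints 9
    }
}
```

A second visitor (a pretty-printer, a type checker) can be added later without changing a single node class, which is exactly why jjtree generates this structure for us.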
Now instead of dump()ing our tree we visit it with a visitor, operating upon each node (for the time being we will merely print out details of the node):

public class MyVisitor implements LexicalAnalyserVisitor{
  public Object visit(Root node, Object data){
    System.out.println("Hello, I am a Root node");
    if(node.jjtGetNumChildren() == 0){
      System.out.println("I have no children");
    }
    else{
      for(int i = 0; i < node.jjtGetNumChildren(); i++){
        node.children[i].jjtAccept(this, data);
      }
    }
    return data;
  }
  // ... and similarly for the other visit methods, one per node type.
}

By implementing the visit() method of each particular type of node in a different way we can do anything we want, without altering the construction of the nodes themselves: exactly what the Visitor pattern was designed to do. Simple isn't it!

Assignment 6. Write a Visitor class, implementing LexicalAnalyserVisitor, which visits each node in turn, printing out the type of node it is and how many children it has. You have three weeks to do this assignment so there will be plenty of help given during lectures, and you have plenty of time to come and see me, or ask Program Advisory for help.

5.8 Semantic Analyses

Now that we have our symbol table and a Visitor interface it is easy to implement a Visitor object which can do anything we wish with the data held in the tree which our parser builds. In the last assignment all we did was print out details of the type of node. For your final assignment you are going to be doing some semantic analysis, to ensure that you do not attempt to assign incorrect values to variables, e.g. a floating point number to an integer variable. In a real compiler we would want eventually to generate machine code for our target machine. We will not be doing that as an assignment, though if you are enjoying this module I hope you might want to do this out of interest.

It should be clear that we only need to write code for assignment statements. Let's look at this:

void assignStatement() #AssignmentStatement: {}
{
  <ID> <ASSIGN> exp()
}

When we read the ID token we can check its type by looking it up in the symbol table as we have done before. All we need to do is decide what the type of the exp is. This is your assignment. Start from the bottom nodes of your tree and work your way up to the expression.
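Whatever type you synthesise for the expression, the final check reduces to one type rule. The helper below is a sketch only; the names "integer" and "float" match the Variable class above, but the widening of an integer into a float variable is an assumption of mine, so adjust it to whatever your language's rules actually are.

```java
public class TypeRules {
    // May an expression of type exprType be assigned to a variable of varType?
    // "integer" and "float" are the type strings used by the Variable class.
    static boolean assignable(String varType, String exprType) {
        if (varType.equals(exprType)) return true;
        // Assumed widening rule: an integer value may go into a float variable.
        if (varType.equals("float") && exprType.equals("integer")) return true;
        // Everything else, e.g. a float into an integer variable, is an error.
        return false;
    }

    public static void main(String[] args) {
        System.out.println(assignable("integer", "float")); // false: report this
        System.out.println(assignable("float", "integer")); // true, if widening is allowed
    }
}
```

Your assignment-statement visit method then only needs to look up the ID's type, compute the exp's type bottom-up, and call one such check, printing an error when it fails.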
6 Tiny

For your coursework you will be writing a compiler which compiles a while language, similar to the one you met in Theory of Programming Languages, into an assembly language called tiny. This section describes the tiny language. A tiny machine simulator can be downloaded from the course website to test your compiler.

6.1 Basic Architecture

tiny consists of a read-only instruction memory, a data memory, and a set of eight general-purpose registers. These all use nonnegative integer addresses, beginning at 0. Register 7 is the program counter and is the only special register, as described below. The following C declarations will be used in the descriptions which follow:

#define IADDR_SIZE ...  /* size of instruction memory */
#define DADDR_SIZE ...  /* size of data memory */
#define NO_REGS 8       /* number of registers */
#define PC_REG 7

Instruction iMem[IADDR_SIZE];
int dMem[DADDR_SIZE];
int reg[NO_REGS];

tiny performs a conventional fetch-execute cycle:

do
  /* fetch */
  currentInstruction = iMem[reg[pcRegNo]++];
  /* execute current instruction */
  ...
while (!(halt || error));

At start-up the tiny machine sets all registers and data memory to 0, then loads the value of the highest legal address (namely DADDR_SIZE - 1) into dMem[0]. This allows memory to easily be added to the machine, since programs can find out during execution how much
. t are legal registers (checked at load time).t where the operands r.RO Instructions Format : opcode r. any reference to dMem[a] generates DMEM ERR if a < 0 or a ≥ DADDR SIZE Eﬀect reg[r] = dMem[a] (load r with memory value at a) reg[r] = a (load address a directly into r) reg[r] = d (load constant d directly into r. similarly for the following) if(reg[r] ≤ 0) reg[PC REG] = a if(reg[r] ≥ 0) reg[PC REG] = a if(reg[r] > 0) reg[PC REG] = a if(reg[r] == 0) reg[PC REG] = a if(reg[r] != 0) reg[PC REG] = a Table 1: The tiny instruction set
memory is available. together with a brief description of the eﬀect of each instruction. (s and t ignored) reg[r] → the standard output (s and t ignored) reg[r] = reg[s] + reg[t] reg[r] = reg[s] . t Opcode HALT IN OUT ADD SUB MUL DIV RM Instructions Format : opcode r. such instructions are three–address. The machine then starts to execute instructions beginning at iMem[0]. d(s) Opcode LD LDA LDC ST JLT JLE JGE JGT JEQ JNE
Eﬀect stop execution (operands ignored) reg[r] ← integer value read from the standard input.reg[t] reg[r] = reg[s] * reg[t] reg[r] = reg[s] / reg[t] (may generate ZERO DIV) a = d + reg[s]. The instruction set of the machine is given in table 1. which occur during instruction execution execution as described below. and all three addresses must be registers. The possible error conditions include IMEM ERR. A register–only instruction has the format opcode r. s is ignored) dMem[a] = reg[r] (store value in r to memory location a) if (reg[r] < 0) reg[PC REG] = a (jump to instruction a if r is negative. The machine stops when a HALT instruction is executed.
In this the machine resembles RISC machines such as the Sun SparcStation. and load operations comes ﬁrst. where the ﬁrst address is always a register and the second address is a memory address a given by a = d + reg[r]. similar to the 80x85 and unlike the Sun SparcStation. In addition there is one store instruction and six conditional jump instructions. (Indeed. IN. On the other hand the machine has only 8 registers. and “load memory” (LD). If a is out of the legal range. A compiler for the machine must therefore maintain any runtime environment organisation entirely manually. (But the language you are compiling has no ﬂoating point variables either). Table 1 and the discussion of the machine up to this point represent the complete tiny architecture. in particular the source and target registers may be the same. then DMEM ERR is generated during execution. No register except the PC is special in any way. No operations (except load and store operations) act directly on memory. all three operands must be present. i. RM instructions include three diﬀerent load instructions corresponding to the three addressing modes “load constant” (LDC). even though some of them may be ignored. iii. and the source register(s) come second.limited to this format. This instruction is a two–address instruction. There are no ﬂoating point operations or ﬂoating point registers.d(s) In this code r and s must be legal registers (checked at load time). The target register in the arithmetic. the machine is adequate. In particular there is no hardware stack or other facilities of any kind. there is no sp or fp. In both RO and RM instructions. This is due to the simple nature of the loader. ii. as are the two primitive input/output instructions. “load address” (LDA). which only distinguishes between the two classes of instruction (RO and RM) and does not allow diﬀerent formats within each class.
98
. while most RISC processors have many more. where a must be a legal address (0 ≤ a < DADDR SIZE). if not comfortable. Although this may be unrealistic it has the advantage that all operations must be generated explicitly as they are needed. (This actually makes code generation easier since only two diﬀerent routines will be needed). A register–memory instruction has the format opcode r. Since the instruction set is minimal we should give some idea of how they can be used to achieve some more complicated programming language operations. All arithmetic operations are restricted to registers. There is no restriction on the use of registers for sources and targets. for even very sophisticated languages). and d is a positive or negative integer representing an oﬀset.
d(s) which has the eﬀect of jumping to the procedure whose entry address is dMem[d + reg[s]]. For example JEQ 0. since there is no unconditional jump.2
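The PC arithmetic behind these jump idioms is easy to check with a few lines of Java. This is a sketch of the relevant registers only, not the full simulator; the method name lda and the instruction location 10 are illustrative.

```java
public class TmJumpSketch {
    static final int PC = 7;        // register 7 is the program counter
    static int[] reg = new int[8];

    // Effect of LDA r,d(s):  reg[r] = d + reg[s]
    static void lda(int r, int d, int s) { reg[r] = d + reg[s]; }

    public static void main(String[] args) {
        // Suppose the instruction at location 10 is LDA 7,-4(7).
        // After the fetch step the PC has already been incremented to 11,
        // so the jump lands at 11 - 4 = 7: three instructions backwards.
        reg[PC] = 10 + 1;
        lda(PC, -4, PC);
        System.out.println(reg[PC]); // 7
    }
}
```

The same arithmetic explains why JEQ 0,4(7) jumps five instructions forward: the displacement is added to the already-incremented PC.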
The machine simulator
The machine accepts text ﬁles containing TM instructions as described above with the following conventions: • An entirely blank line is ignored • A line beginning with an asterisk is considered to be a comment and is ignored. vii. There is also no indirect jump instruction.1(7) which places the current PC value plus one into reg[0].
6. vi.4(7) performs an unconditional jump three instructions backwards.0(1) jumps to the instruction whose address is stored in memory at the location pointed to by register 1. There is no restriction on the use of the PC in any of the instructions. An unconditional jump can also be made relative to the PC by using the PC twice in an LDA instruction. Indeed. v. 99
.). Thus LDA 7. • Any other line must contain an integer instruction location followed by a colon followed by a legal instruction. but it too can be imitated if necessary.4(7) causes the machine to jump ﬁve instructions forward in the code if register 0 is 0. Any text after the instruction is considered to be a comment and is ignored. LD 7. Instead we must write LD 7. The conditional jump instructions (JLT etc. There are no procedure calls or JSUB instruction.iv. For example. by using an LD instruction. it must be simulated by using the PC as the target register in an LDA instruction: LDA 7. Of course we need to remember to save the return address ﬁrst by executing something like LDA 0. d(s) This instruction has the eﬀect of jumping to location d + reg[s]. can be made relative to the current position in the program by using the PC as the second register.
The machine contains no other features; in particular there are no symbolic labels and no macro facilities. Strictly speaking there is no need for the HALT instruction at the end of the code, since the machine sets all instruction locations to HALT before loading the program; however it is useful to put it in as a reminder, and also as a target for jumps which wish to end the program.

We now have a look at a TM program which computes the factorial of a number input by the user:

* This program inputs an integer, computes
* its factorial if it is positive,
* and prints the result
0: IN 0,0,0      r0 = read
1: JLE 0,6(7)    if 0 < r0 then
2: LDC 1,1,0     r1 = 1
3: LDC 2,1,0     r2 = 1
* repeat
4: MUL 1,1,0     r1 = r1 * r0
5: SUB 0,0,2     r0 = r0 - r2
6: JNE 0,-3(7)   until r0 = 0
7: OUT 1,0,0     write r1
8: HALT 0,0,0    halt
* end of program

If this program were saved in the file fact.tm then it can be loaded and executed as in the following sample session. (The simulator assumes the file extension .tm if one is not given.)

tm fact
TM simulation (enter h for help)...
Enter command: g
Enter value for IN instruction: 7
OUT instruction prints: 5040
HALT: 0,0,0
Halted
Enter command: q
Simulation done.

The g command stands for "go", meaning that the program is executed starting at the current contents of the PC (which is 0 after loading), until a HALT instruction is read. The complete list of simulator commands can be obtained by using the command h.
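As a sanity check on the session's output, the register-level algorithm of the factorial program transcribes directly into Java (register names r0-r2 follow the comments in the TM listing):

```java
public class FactorialCheck {
    public static void main(String[] args) {
        int r0 = 7, r1 = 0, r2;   // r0 = the value typed at the IN prompt
        if (r0 > 0) {             // the JLE at instruction 1, inverted
            r1 = 1;               // LDC 1,1,0
            r2 = 1;               // LDC 2,1,0
            do {                  // the loop at instructions 4-6
                r1 = r1 * r0;     // MUL 1,1,0
                r0 = r0 - r2;     // SUB 0,0,2
            } while (r0 != 0);    // JNE 0,-3(7)
        }
        System.out.println(r1);   // 5040, matching the OUT in the session
    }
}
```

Tracing a target program at this level before generating code for it is a useful habit: it separates "is my algorithm right?" from "is my code generation right?".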