Principles, Techniques,
and Tools
Alfred V. Aho
Ravi Sethi
Jeffrey D. Ullman
Digitized by the Internet Archive
in 2010
http://www.archive.org/details/compilersprincipOOahoa
Compilers
ALFRED V. AHO
AT&T Bell Laboratories
Murray Hill, New Jersey
RAVI SETHI
AT&T Bell Laboratories
Murray Hill, New Jersey
JEFFREY D. ULLMAN
Stanford University
Stanford, California
ADDISON-WESLEY PUBLISHING COMPANY
Reading, Massachusetts • Menlo Park, California
Don Mills, Ontario • Wokingham, England • Amsterdam • Sydney
Singapore • Tokyo • Mexico City • Bogota • Santiago • San Juan
Mark S. Dalton/Publisher
James T. DeWolf/Sponsoring Editor
Aho, Alfred V.
Compilers, principles, techniques, and tools.
Bibliography: p.
Includes index.
1. Compiling (Electronic computers) I. Sethi,
Ravi. II. Ullman, Jeffrey D., 1942- . III. Title.
UNIX is a trademark of AT&T Bell Laboratories. DEC, PDP, and VAX are trademarks of Digital Equipment Corporation. Ada is a trademark of the Ada Joint Program Office, Department of Defense, United States Government.
Preface
This book is a descendant of Principles of Compiler Design by Alfred V. Aho and Jeffrey D. Ullman. Like its ancestor, it is intended as a text for a first course in compiler design. The emphasis is on solving problems universally encountered in designing a language translator, regardless of the source or target machine.
Although few people are likely to build or even maintain a compiler for a
major programming language, the reader can profitably apply the ideas and
techniques discussed in this book to general software design. For example,
the string matching techniques for building lexical analyzers have also been
used in text editors, information retrieval systems, and pattern recognition
programs. Context-free grammars and syntax-directed definitions have been used to build many little languages such as the typesetting and figure drawing systems that produced this book. The techniques of code optimization have been used in program verifiers and in programs that produce "structured" programs from unstructured ones.
The major topics in compiler design are covered in depth. The first chapter
introduces the basic structure of a compiler and is essential to the rest of the
book.
Chapter 2 presents a translator from infix to postfix expressions, built using
some of the basic techniques described in this book. Many of the remaining
chapters amplify the material in Chapter 2.
Chapter 6 presents the main ideas for performing static semantic checking.
Type checking and unification are discussed in detail.
Exercises
As before, we rate exercises with stars. Exercises without stars test understanding of definitions, singly starred exercises are intended for more advanced courses, and doubly starred exercises are food for thought.
Acknowledgments
1.1 Compilers 1
2.1 Overview 25
2.2 Syntax definition 26
2.3 Syntax-directed translation 33
2.4 Parsing 40
2.5 A translator for simple expressions 48
2.6 Lexical analysis 54
2.7 Incorporating a symbol table 60
2.8 Abstract stack machines 62
2.9 Putting the techniques together 69
Exercises 78
Bibliographic notes 81
Bibliography 752
Index 780
CHAPTER 1
Introduction
to Compiling
The principles and techniques of compiler writing are so pervasive that the
ideas found in this book will be used many times in the career of a computer
scientist. Compiler writing spans programming languages, machine architecture, language theory, algorithms, and software engineering. Fortunately, a few basic compiler-writing techniques can be used to construct translators for a wide variety of languages and machines. In this chapter, we introduce the subject of compiling by describing the components of a compiler, the environment in which compilers do their job, and some software tools that make it easier to build compilers.
1.1 COMPILERS
Simply stated, a compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. As an important part of this translation process, the compiler reports to its user the presence of errors in the source program.

[Fig. 1.1. A compiler: source program → compiler → target program, with error messages as a side output.]

It is difficult to give an exact date for the first compiler because initially a great deal of experimentation and implementation was done independently by several groups. Much of the early work on compiling dealt with the translation of arithmetic formulas into machine code.
Throughout the 1950's, compilers were considered notoriously difficult programs to write. The first Fortran compiler, for example, took 18 staff-years to implement (Backus et al. [1957]). We have since discovered systematic techniques for handling many of the important tasks that occur during compilation. Good implementation languages, programming environments, and software tools have also been developed. With these advances, a substantial compiler can be implemented even as a student project in a one-semester compiler-design course.
position := initial + rate * 60
Many software tools that manipulate source programs first perform some
kind of analysis. Some examples of such tools include:
recursively to compute the value of the expression rate * 60. It would then add that value to the value of the variable initial.
Interpreters are frequently used to execute command languages, since
each operator executed in a command language is usually an invocation of
a complex routine such as an editor or compiler. Similarly, some "very high-level" languages, like APL, are normally interpreted because there are many things about the data, such as the size and shape of arrays, that cannot be deduced at compile time.
4 INTRODUCTION TO COMPILING SEC. 1.1
Syntax Analysis
Syntax analysis involves grouping the tokens of the source program into grammatical phrases that are used by
the compiler to synthesize output. Usually, the grammatical phrases of the
source program are represented by a parse tree such as the one shown in Fig.
1.4.
1. If identifier1 is an identifier, and expression2 is an expression, then

   identifier1 := expression2

   is a statement.

2. If expression1 is an expression and statement1 is a statement, then

   while ( expression1 ) do statement1
   if ( expression1 ) then statement1

   are statements.
[Fig. 1.5. Syntax trees (a) and (b) for position := initial + rate * 60.]
The parse tree in Fig. 1.4 describes the syntactic structure of the input. A
more common internal representation of this syntactic structure is given by the syntax tree in Fig. 1.5(a). A syntax tree is a compressed representation of the parse tree in which the operators appear as the interior nodes, and the operands of an operator are the children of the node for that operator. The construction of trees such as the one in Fig. 1.5(a) is discussed in Section 5.2.
Semantic Analysis
The semantic analysis phase checks the source program for semantic errors
and gathers type information for the subsequent codegeneration phase. It
uses the hierarchical structure determined by the syntaxanalysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here the
compiler checks that each operator has operands that are permitted by the
source language specification. For example, many programming language
definitions require a compiler to report an error every time a real number is
used to index an array. However, the language specification may permit some
operand coercions, for example, when a binary arithmetic operator is applied
to an integer and real. In this case, the compiler may need to convert the
integer to a real. Type checking and semantic analysis are discussed in
Chapter 6.
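The coercion step described above can be sketched mechanically. The following is an illustrative outline, not the book's algorithm; the node classes (Num, Coerce, BinOp) and the type names are assumptions made for this example.

```python
# Sketch of type checking with int-to-real coercion, as described above.
# Node classes and type names are illustrative assumptions.

class Num:
    def __init__(self, value, typ):
        self.value, self.typ = value, typ

class Coerce:                      # models the inttoreal operation
    def __init__(self, child):
        self.child, self.typ = child, "real"

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right, self.typ = op, left, right, None

def check(node):
    """Return the node's type, inserting coercions where permitted."""
    if not isinstance(node, BinOp):
        return node.typ
    lt, rt = check(node.left), check(node.right)
    if lt == rt:
        node.typ = lt
    elif {lt, rt} == {"integer", "real"}:
        # Coerce the integer operand, as many language definitions permit.
        if lt == "integer":
            node.left = Coerce(node.left)
        else:
            node.right = Coerce(node.right)
        node.typ = "real"
    else:
        raise TypeError("operands %s and %s not permitted" % (lt, rt))
    return node.typ

# rate * 60 with rate declared real and 60 an integer literal:
tree = BinOp("*", Num("rate", "real"), Num(60, "integer"))
```

Checking this tree yields the type real, with a Coerce node wrapped around the literal 60, mirroring the conversion discussed in Example 1.1 below.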
Example 1.1. Inside a machine, the bit pattern representing an integer is generally different from the bit pattern for a real, even if the integer and the real number happen to have the same value. Suppose, for example, that all identifiers in Fig. 1.5 have been declared to be reals and that 60 by itself is assumed to be an integer.
a sub {i sup 2}
results in a_{i^2} (a with the subscript i squared). Grouping the operators sub and sup into tokens is part of the lexical analysis of EQN text. However, the syntactic structure of the text is needed to determine the size and placement of a box.
[Fig. 1.9. Phases of a compiler: source program → lexical analyzer → syntax analyzer → semantic analyzer → intermediate code generator → code optimizer → code generator → target program, with the symbol-table manager and the error handler interacting with all phases.]
The first three phases, forming the bulk of the analysis portion of a compiler, were introduced in the last section. Two other activities, symbol-table management and error handling, are shown interacting with the six phases of lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. Informally, we shall also call the symbol-table manager and the error handler phases.
Symbol-Table Management
In the case of procedure names, attributes may record such things as the number and types of the arguments, the method of passing each argument (e.g., by reference), and the type returned, if any.
A symbol table is a data structure containing a record for each identifier,
with fields for the attributes of the identifier. The data structure allows us to
find the record for each identifier quickly and to store or retrieve data from
that record quickly. Symbol tables are discussed in Chapters 2 and 7.
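The data structure just described can be sketched with a dictionary keyed by identifier. This is a minimal illustration, not the book's implementation; the attribute field shown (a type) is an assumption.

```python
# A minimal symbol table in the spirit described above: one record per
# identifier, with quick entry and lookup. Record fields are illustrative.

class SymbolTable:
    def __init__(self):
        self._records = {}            # identifier -> attribute record

    def enter(self, name):
        """Return the record for name, creating an empty one on first sight."""
        return self._records.setdefault(name, {})

    def lookup(self, name):
        return self._records.get(name)

table = SymbolTable()
table.enter("rate")                   # the lexical analyzer enters the lexeme
table.enter("rate")["type"] = "real"  # a later phase fills in attributes
```

The second call to enter finds the existing record, so a later phase can attach attributes to the identifier the lexical analyzer entered.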
The attributes of an identifier cannot normally be determined during lexical analysis, however. For example, in a declaration such as var position, initial, rate : real, the type real is not known when position, initial, and rate are seen by the lexical analyzer.
The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways. For example, when
doing semantic analysis and intermediate code generation, we need to know
what the types of identifiers are, so we can check that the source program
uses them in valid ways, and so that we can generate the proper operations on
them. The code generator typically enters and uses detailed information about
the storage assigned to identifiers.
Each phase can encounter errors. However, after detecting an error, a phase
must somehow deal with that error, so that compilation can proceed, allowing
further errors in the source program to be detected. A compiler that stops
when it finds the first error is not as helpful as it could be.
The syntax and semantic analysis phases usually handle a large fraction of
the errors detectable by the compiler. The lexical phase can detect errors
where the characters remaining in the input do not form any token of the
language. Errors where the token stream violates the structure rules (syntax)
of the language are determined by the syntax analysis phase. During semantic
analysis the compiler tries to detect constructs that have the right syntactic
structure but no meaning to the operation involved, e.g., if we try to add two
identifiers, one of which is the name of an array, and the other the name of a
procedure. We discuss the handling of errors by each phase in the part of the
book devoted to that phase.
Figure 1.10 shows the representation of this statement after each phase.
The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a keyword (if, while,
etc.), a punctuation character, or a multi-character operator like :=. The
character sequence forming a token is called the lexeme for the token.
Certain tokens will be augmented by a "lexical value." For example, when
an identifier like rate is found, the lexical analyzer not only generates a
token, say id, but also enters the lexeme rate into the symbol table, if it is
not already there. The lexical value associated with this occurrence of id
points to the symboltable entry for rate.
In this section, we shall use id1, id2, and id3 for position, initial, and rate, respectively, to emphasize that the internal representation of an identifier is different from the character sequence forming the identifier. The representation of (1.1) after lexical analysis is therefore suggested by:

    id1 := id2 + id3 * 60
We should also make up tokens for the multi-character operator := and the number 60 to reflect their internal representation, but we defer that until Chapter 2. Lexical analysis is covered in detail in Chapter 3.
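To make the division into lexemes and tokens concrete, here is a small hypothetical tokenizer for the running example. The token names, the regular expressions, and the scheme of numbering identifiers in order of first appearance are assumptions for illustration only.

```python
import re

# Sketch of lexical analysis for the tiny vocabulary of the running
# example: identifiers, numbers, := and the arithmetic operators.

TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z]\w*)|(?P<num>\d+)|(?P<op>:=|[+\-*/]))")

def tokenize(source, symtab):
    tokens, pos = [], 0
    while pos < len(source):
        m = TOKEN_RE.match(source, pos)
        if not m:
            raise SyntaxError("characters form no token: %r" % source[pos:])
        pos = m.end()
        if m.lastgroup == "id":
            lexeme = m.group("id")
            if lexeme not in symtab:        # enter the lexeme on first sight
                symtab[lexeme] = len(symtab) + 1
            tokens.append(("id", symtab[lexeme]))
        elif m.lastgroup == "num":
            tokens.append(("num", int(m.group("num"))))
        else:
            tokens.append((m.group("op"),))
    return tokens

symtab = {}
stream = tokenize("position := initial + rate * 60", symtab)
# stream: [('id', 1), (':=',), ('id', 2), ('+',), ('id', 3), ('*',), ('num', 60)]
```

Each id token carries a pointer (here, an index) into the symbol table rather than the lexeme itself, as described above.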
The second and third phases, syntax and semantic analysis, have also been
introduced in Section 1.2. Syntax analysis imposes a hierarchical structure on
the token stream, which we shall portray by syntax trees as in Fig. 1.11(a). A typical data structure for the tree is shown in Fig. 1.11(b), in which an interior node is a record with a field for the operator and two fields containing pointers to the records for the left and right children. A leaf is a record with two or more fields, one to identify the token at the leaf, and the others to record information about the token.
After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. We can think of this intermediate representation as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce, and easy to translate into the target program.
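One popular intermediate form is three-address code, in which each instruction has at most one operator in addition to an assignment. The following sketch generates such code from an expression tree; the tuple format and the temp1, temp2, ... naming convention are assumptions for illustration.

```python
import itertools

# Emit three-address code from an expression tree of the form
# (op, left, right), with strings (identifiers or literals) at the leaves.

_temps = itertools.count(1)

def gen(node, code):
    """Emit code for node; return the name that holds its value."""
    if isinstance(node, str):         # leaf: identifier or literal
        return node
    op, left, right = node
    l = gen(left, code)
    r = gen(right, code)
    t = "temp%d" % next(_temps)
    code.append("%s := %s %s %s" % (t, l, op, r))
    return t

code = []
code.append("id1 := %s" % gen(("+", "id2", ("*", "id3", "60")), code))
```

For the running example this yields temp1 := id3 * 60, then temp2 := id2 + temp1, then id1 := temp2, one operator per instruction.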
The intermediate representation can have a variety of forms. In Chapter 8,
[Fig. 1.10. Translation of position := initial + rate * 60 through the phases: the lexical analyzer's output id1 := id2 + id3 * 60, the trees built by the syntax and semantic analyzers, and the symbol table containing position, initial, and rate.]
There is nothing wrong with this simple algorithm, since the problem can be
fixed during the code-optimization phase. That is, the compiler can deduce that the conversion of 60 from integer to real representation can be done once and for all at compile time, so the inttoreal operation can be eliminated. Besides, temp3 is used only once, to transmit its value to id1. It then becomes safe to substitute id1 for temp3, whereupon the last statement of (1.3) is not needed and the code of (1.4) results.
There is great variation in the amount of code optimization different compilers perform. In those that do the most, called "optimizing compilers," a significant fraction of the time of the compiler is spent on this phase. However, there are simple optimizations that significantly improve the running time of the target program without slowing down compilation too much. Many of these are discussed in Chapter 9, while Chapter 10 gives the technology used by the most powerful optimizing compilers.
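The compile-time elimination of inttoreal mentioned above is an instance of constant folding, one of the simple optimizations referred to here. A sketch over three-address instructions; the (dest, op, arg1, arg2) tuple format is an assumption for illustration.

```python
# Fold operations on known constants at compile time and propagate the
# results, eliminating the folded instruction, as described above.

def fold_constants(code):
    out = []
    consts = {}                        # temporaries with known constant values
    for dest, op, a, b in code:
        a = consts.get(a, a)           # propagate known constants
        b = consts.get(b, b)
        if op == "inttoreal" and isinstance(a, int):
            consts[dest] = float(a)    # conversion done once, at compile time
            continue                   # instruction eliminated
        out.append((dest, op, a, b))
    return out
```

Applied to temp1 := inttoreal(60) followed by temp2 := id3 * temp1, the first instruction disappears and the second becomes a multiplication by the real constant directly.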
Code Generation
The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then, intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.
For example, using registers 1 and 2, the translation of the code of (1.4)
might become
    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1      (1.5)
    ADDF R2, R1
    MOVF R1, id1
The first and second operands of each instruction specify a source and destination, respectively.
We have sidestepped the important issue of storage allocation for the identifiers in the source program. As we shall see in Chapter 7, the organization of storage at run-time depends on the language being compiled. Storage-allocation decisions are made either during intermediate code generation or during code generation.
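The register-based translation illustrated by (1.5) can be sketched as a small code generator. The round-robin policy of giving each temporary its own register is a deliberate simplification, and the tuple format for three-address instructions is an assumption; real register allocation is far more involved.

```python
# Translate three-address instructions (dest, op, arg1, arg2) into
# MOVF/ADDF/MULF-style assembly, assigning a fresh register to each
# temporary. A simplifying sketch, not a realistic allocator.

OPS = {"+": "ADDF", "*": "MULF"}

def codegen(three_addr):
    asm, reg_of, next_reg = [], {}, [1]

    def reg_for(temp):
        if temp not in reg_of:
            reg_of[temp] = "R%d" % next_reg[0]
            next_reg[0] += 1
        return reg_of[temp]

    for dest, op, a, b in three_addr:
        if op == ":=":                 # plain copy: store dest := a
            asm.append("MOVF %s, %s" % (reg_of.get(a, a), dest))
        else:                          # load first operand, apply op in place
            r = reg_for(dest)
            asm.append("MOVF %s, %s" % (reg_of.get(a, a), r))
            asm.append("%s %s, %s" % (OPS[op], reg_of.get(b, b), r))
    return asm

asm = codegen([("temp1", "*", "id3", "#60.0"),
               ("temp2", "+", "id2", "temp1"),
               ("id1", ":=", "temp2", None)])
```

For the optimized code of (1.4) this reproduces the five instructions of (1.5), up to the choice of register numbers.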
Preprocessors
2. File inclusion. A preprocessor may include header files into the program
text. For example, the C preprocessor causes the contents of the file
Macro processors deal with two kinds of statement: macro definition and
macro use. Definitions are normally indicated by some unique character or
keyword, like define or macro. They consist of a name for the macro
being defined and a body, forming its definition. Often, macro processors
permit formal parameters in their definition, that is, symbols to be replaced by values (a "value" is a string of characters, in this context). The use of a macro consists of naming the macro and supplying actual parameters, that is, values for its formal parameters. The macro processor substitutes the actual parameters for the formal parameters in the body of the macro; the transformed body then replaces the macro use itself.
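The substitution just described can be sketched as a toy macro processor. The #1, #2, ... notation follows the discussion below; the define/expand interface and the sample body are illustrative assumptions, not TeX's actual mechanism.

```python
# A toy macro processor: a definition maps a name to a body in which
# #1, #2, ... stand for the formal parameters; a use substitutes the
# actual parameters into the body.

defs = {}

def define(name, body):
    defs[name] = body

def expand(name, *actuals):
    body = defs[name]
    for i, actual in enumerate(actuals, start=1):
        body = body.replace("#%d" % i, actual)   # actual for formal
    return body

define("JACM", "J. ACM #1:#2, pp. #3.")
expand("JACM", "17", "4", "715-728")
```

The call substitutes 17, 4, and 715-728 for #1, #2, and #3, and the transformed body replaces the use, just as described above.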
\JACM 17;4;715-728.
and expect to see
The portion of the body {\sl J. ACM} calls for an italicized ("slanted") 'J. ACM'. Expression {\bf #1} says that the first actual parameter is to be made boldface; this parameter is intended to be the volume number.
TeX allows any punctuation or string of text to separate the volume, issue, and page numbers in the definition of the \JACM macro. We could even have used no punctuation at all, in which case TeX would take each actual parameter to be a single character or a string surrounded by { }.
Assemblers
    MOV a, R1
    ADD #2, R1      (1.6)
    MOV R1, b
This code moves the contents of the address a into register 1, then adds the
constant 2 to it, treating the contents of register 1 as a fixedpoint number.
Well, almost arbitrary strings, since a simple left-to-right scan of the macro use is made, and as soon as a symbol matching the text following a #1 symbol in the template is found, the preceding string is deemed to match #1. Thus, if we tried to substitute ab;cd for #1, we would find that only ab matched #1 and cd was matched to #2.
and finally stores the result in the location named by b. Thus, it computes
b : = a + 2.
It is customary for assembly languages to have macro facilities that are similar to those in the macro processors discussed above.
TwoPass Assembly
The simplest form of assembler makes two passes over the input, where a pass
consists of reading an input file once. In the first pass, all the identifiers that
denote storage locations are found and stored in a symbol table (separate from
that of the compiler). Identifiers are assigned storage locations as they are
encountered for the first time, so after reading (1.6), for example, the symbol
table might contain the entries shown in Fig. 1.12. In that figure, we have
assumed that a word, consisting of four bytes, is set aside for each identifier,
and that addresses are assigned starting from byte 0.
    Identifier   Address
    a            0
    b            4
In the second pass, the assembler scans the input again. This time, it
translates each operation code into the sequence of bits representing that
operation in machine language, and it translates each identifier representing a
location into the address given for that identifier in the symbol table.
The output of the second pass is usually relocatable machine code, meaning
that it can be loaded starting at any location L in memory; i.e., if L is added
to all addresses in the code, then all references will be correct. Thus, the out
put of the assembler must distinguish those portions of instructions that refer
to addresses that can be relocated.
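The first pass just described can be sketched directly: scan the instructions, and assign the next four-byte word to each identifier on its first occurrence. The instruction tuples and the test for what counts as an identifier are simplifying assumptions.

```python
# Pass 1 of a two-pass assembler: build the symbol table, assigning
# each newly seen identifier the next four-byte word, starting at 0.

def pass_one(instructions):
    table, next_addr = {}, 0
    for op, src, dst in instructions:
        for operand in (src, dst):
            # Treat purely alphabetic operands as identifiers; registers
            # like R1 and immediates like #2 are skipped (an assumption).
            if operand.isalpha() and operand not in table:
                table[operand] = next_addr     # first occurrence
                next_addr += 4                 # one word per identifier
    return table

prog = [("MOV", "a", "R1"), ("ADD", "#2", "R1"), ("MOV", "R1", "b")]
```

Running pass_one on the instructions of (1.6) reproduces the table of Fig. 1.12: a at address 0 and b at address 4. Pass 2 would then substitute these addresses for the identifiers.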
Example 1.3. The following is a hypothetical machine code into which the
assembly instructions (1.6) might be translated.
    0001 01 00 00000000 *
    0011 01 10 00000010      (1.7)
    0010 01 00 00000100 *
We envision a tiny instruction word, in which the first four bits are the
instruction code, with 0001, 0010, and 0011 standing for load, store, and add, respectively. By load and store we mean moves from memory into a
register and vice versa. The next two bits designate a register, and 01 refers
to register 1 in each of the three above instructions. The two bits after that
represent a "tag," with 00 standing for the ordinary address mode, where the
SEC. 1.4 COUSINS OF THE COMPILER 19
last eight bits refer to a memory address. The tag 10 stands for the "immediate" mode, where the last eight bits are taken literally as the operand. This mode appears in the second instruction of (1.7).
We also see in (1.7) a * associated with the first and third instructions.
This * represents the relocation bit that is associated with each operand in
relocatable machine code. Suppose that the address space containing the data
is to be loaded starting at location L. The presence of the * means that L must be added to the address of the instruction. Thus, if L = 00001111, i.e., 15, then a and b would be at locations 15 and 19, respectively, and the instructions of (1.7) would become

    0001 01 00 00001111
    0011 01 10 00000010      (1.8)
    0010 01 00 00010011
Usually, a program called a loader performs the two functions of loading and link-editing. The process of loading consists of taking relocatable machine code, altering the relocatable addresses as discussed in Example 1.3, and placing the altered instructions and data in memory at the proper locations.
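The address-altering step of Example 1.3 can be sketched as follows. Instructions are modeled here as (word, relocation-bit) pairs with the address in the low eight bits, an illustrative encoding matching the hypothetical machine above.

```python
# Relocation as in Example 1.3: for each instruction whose relocation
# bit is set, add the load address L to the 8-bit address field.

def relocate(code, L):
    loaded = []
    for word, reloc in code:
        if reloc:                      # relocation bit set: adjust address
            word = (word & ~0xFF) | ((word + L) & 0xFF)
        loaded.append(word)
    return loaded
```

With the three words of (1.7) (0x1400 and 0x2404 marked for relocation, 0x3602 not) and L = 15, this yields the words of (1.8): the load and store now address locations 15 and 19, while the immediate-mode add is untouched.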
The link-editor allows us to make a single program from several files of relocatable machine code. These files may have been the result of several different compilations, and one or more may be library files of routines provided
by the system and available to any program that needs them.
If the files are to be used together in a useful way, there may be some external references, in which the code of one file refers to a location in another file. This reference may be to a data location defined in one file and used in another, or it may be to the entry point of a procedure that appears in the code for one file and is called from another file. The relocatable machine
code file must retain the information in the symbol table for each data location or instruction label that is referred to externally. If we do not know in advance what might be referred to, we in effect must include the entire assembler symbol table as part of the relocatable machine code.
For example, the code of (1.7) would be preceded by

    a    0
    b    4
If a file loaded with (1.7) referred to b, then that reference would be replaced
by 4 plus the offset by which the data locations in file ( 1.7) were relocated.
Often, the phases are collected into a front end and a back end. The front end
consists of those phases, or parts of phases, that depend primarily on the
source language and are largely independent of the target machine. These
normally include lexical and syntactic analysis, the creation of the symbol
table, semantic analysis, and the generation of intermediate code. A certain
amount of code optimization can be done by the front end as well. The front
end also includes the error handling that goes along with each of these phases.
The back end includes those portions of the compiler that depend on the
target machine, and generally, these portions do not depend on the source
language, just the intermediate language. In the back end, we find aspects of
the code optimization phase, and we find code generation, along with the
necessary error handling and symboltable operations.
It has become fairly routine to take the front end of a compiler and redo its associated back end to produce a compiler for the same source language on a
different machine. If the back end is designed carefully, it may not even be
necessary to redesign too much of the back end; this matter is discussed in
Chapter 9. It is also tempting to compile several different languages into the
same intermediate language and use a common back end for the different
front ends, thereby obtaining several compilers for one machine. However,
because of subtle differences in the viewpoints of different languages, there
has been only limited success in this direction.
Passes
There is great variation in the way the phases of a compiler are grouped into passes, so
we prefer to organize our discussion of compiling around phases rather than
passes. Chapter 12 discusses some representative compilers and mentions the
way they have structured the phases into passes.
As we have mentioned, it is common for several phases to be grouped into
one pass, and for the activity of these phases to be interleaved during the
pass. For example, lexical analysis, syntax analysis, semantic analysis, and
intermediate code generation might be grouped into one pass. If so, the token
stream after lexical analysis may be translated directly into intermediate code.
In more detail, we may think of the syntax analyzer as being "in charge." It attempts to discover the grammatical structure on the tokens it sees; it obtains
tokens as it needs them, by calling the lexical analyzer to find the next token.
As the grammatical structure is discovered, the parser calls the intermediate
It is desirable to have relatively few passes, since it takes time to read and
write intermediate files. On the other hand, if we group several phases into
one pass, we may be forced to keep the entire program in memory, because
one phase may need information in a different order than a previous phase
produces it. The internal form of the program may be considerably larger
than either the source program or the target program, so this space may not
be a trivial matter.
For some phases, grouping into one pass presents few problems. For example, as we mentioned above, the interface between the lexical and syntactic analyzers can often be limited to a single token. On the other hand, it is often very hard to perform code generation until the intermediate representation has been completely generated. For example, languages like PL/I and Algol 68 permit variables to be used before they are declared. We cannot generate the target code for a construct if we do not know the types of variables involved in that construct. Similarly, most languages allow goto's that
jump forward in the code. We cannot determine the target address of such a
jump until we have seen the intervening source code and generated target
code for it.
Recall the two-pass assembler of Section 1.4: its first pass discovered all the identifiers that represent memory locations and deduced their addresses as they were discovered. Then a second pass substituted addresses for identifiers.
GOTO target
we generate a skeletal instruction, with the machine operation code for GOTO
and blanks for the address. All instructions with blanks for the address of
target are kept in a list associated with the symbol-table entry for target.
The blanks are filled in when we finally encounter an instruction that bears the label target, at which point its address becomes known.
In Chapter 11, we shall see how some of these tools can be used to implement a compiler. In addition to these software-development tools, other more specialized tools have been developed for helping implement various phases of a compiler. We mention them briefly in this section; they are covered in detail in the appropriate chapters.
Shortly after the first compilers were written, systems to help with the
compiler-writing process appeared. These systems have often been referred to as compiler-compilers, compiler-generators, or translator-writing systems.
Largely, they are oriented around a particular model of languages, and they
are most suitable for generating compilers of languages similar to the model.
For example, it is tempting to assume that lexical analyzers for all
languages are essentially the same, except for the particular keywords and
signs recognized. Many compiler-compilers do in fact produce fixed lexical-analysis routines for use in the generated compiler. These routines differ only in the list of keywords recognized, and this list is all that needs to be supplied
The "little languages" used to typeset this book, such as PIC (Kernighan [1982]) and EQN, were implemented in a few days using the parser generator described in Section 4.7. Many parser generators utilize powerful parsing algorithms that are too complex to be carried out by hand.
BIBLIOGRAPHIC NOTES
Writing in 1962 on the history of compiler writing, Knuth [1962] observed that, "In this field there has been an unusual amount of parallel discovery of the same technique by people working independently." He continued by
observing that several individuals had in fact discovered "various aspects of a
technique, and it has been polished up through the years into a very pretty
algorithm, which none of the originators fully realized." Ascribing credit for
techniques remains a perilous task; the bibliographic notes in this book are
intended merely as an aid for further study of the literature.
Historical notes on the development of programming languages and compilers until the arrival of Fortran may be found in Knuth and Trabb Pardo [1977]. Wexelblat [1981] contains historical recollections about several programming languages by participants in their development.

Some fundamental early papers on compiling have been collected in Rosen [1967] and Pollack [1972]. The January 1961 issue of the Communications of the ACM provides a snapshot of the state of compiler writing at the time. A detailed account of an early Algol 60 compiler is given by Randell and Russell [1964].
Beginning in the early 1960's with the study of syntax, theoretical studies have had a profound influence on the development of compiler technology, perhaps at least as much influence as in any other area of computer science.
The fascination with syntax has long since waned, but compiling as a whole
continues to be the subject of lively research. The fruits of this research will
become evident when we examine compiling in more detail in the following
chapters.
CHAPTER 2
A Simple
OnePass
Compiler
2.1 OVERVIEW
A programming language can be defined by describing what its programs look
like (the syntax of the language) and what its programs mean (the semantics of
the language). For specifying the syntax of a language, we present a widely
used notation, called context-free grammars or BNF (for Backus-Naur Form).
With the notations currently available, the semantics of a language is much
more difficult to describe than the syntax. Consequently, for specifying the
semantics of a language we shall use informal descriptions and suggestive
examples.
Besides specifying the syntax of a language, a contextfree grammar can be
used to help guide the translation of programs. A grammar-oriented compiling technique, known as syntax-directed translation, is very helpful for organizing a compiler front end and will be used extensively throughout this chapter.
In the course of discussing syntax-directed translation, we shall construct a
compiler that translates infix expressions into postfix form, a notation in
which the operators appear after their operands. For example, the postfix
form of the expression 9-5+2 is 95-2+. Postfix notation can be converted
directly into code for a computer that performs all its computations using a
stack. We begin by constructing a simple program to translate expressions
consisting of digits separated by plus and minus signs into postfix form. As
the basic ideas become clear, we extend the program to handle more general
programming language constructs. Each of our translators is formed by systematically extending the previous one.
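The starting point just described, translating expressions of digits separated by plus and minus signs into postfix, can be sketched in a few lines. This sketch assumes a well-formed input of alternating digits and signs, and emits each digit at once and each operator after its second operand.

```python
# Translate infix expressions of single digits separated by + and -
# into postfix form, e.g. 9-5+2 becomes 95-2+. Input is assumed to be
# well-formed (digit, sign, digit, sign, ..., digit).

def infix_to_postfix(s):
    out = [s[0]]                    # the first digit is emitted at once
    i = 1
    while i < len(s):
        op, digit = s[i], s[i + 1]
        out += [digit, op]          # second operand first, then its operator
        i += 2
    return "".join(out)

infix_to_postfix("9-5+2")   # → '95-2+'
```

Chapter 2 builds this same translator from a grammar and semantic actions rather than by an ad hoc scan; the sketch only previews the input-output behavior.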
26 A SIMPLE COMPILER SEC. 2.2
In our compiler, the lexical analyzer converts the stream of input characters
into a stream of tokens that becomes the input to the following phase, as
shown in Fig. 2.1. The "syntax-directed translator" in the figure is a combination of a syntax analyzer and an intermediate-code generator. One reason for starting with expressions consisting of digits and operators is to make lexical analysis initially very easy; each input character forms a single token.
Later, we extend the language to include lexical constructs such as numbers,
identifiers, and keywords. For this extended language we shall construct a
lexical analyzer that collects consecutive input characters into the appropriate
tokens. The construction of lexical analyzers will be discussed in detail in
Chapter 3.
[Fig. 2.1. character stream → lexical analyzer → token stream → syntax-directed translator → intermediate representation.]
That is, the statement is the concatenation of the keyword if, an opening
parenthesis, an expression, a closing parenthesis, a statement, the keyword
else, and another statement. (In C, there is no keyword then.) Using the
variable expr to denote an expression and the variable stmt to denote a state
ment, this structuring rule can be expressed as

    stmt → if ( expr ) stmt else stmt

in which the arrow may be read as "can have the form." Such a rule is called
a production. In a production, lexical elements like the keyword if and the
parentheses are called tokens. Variables like expr and stmt represent
sequences of tokens and are called nonterminals.
2. A set of nonterminals.
The right sides of the three productions with nonterminal list on the left
According to our conventions, the tokens of the grammar are the symbols
+  -  0  1  2  3  4  5  6  7  8  9
The nonterminals are the italicized names list and digit, with list being the
starting nonterminal because its productions are given first.
¹ Individual italic letters will be used for additional purposes when grammars are studied in detail
in Chapter 4. For example, we shall use X, Y, and Z to talk about a symbol that is either a token
or a nonterminal. However, any italicized name containing two or more characters will continue
to represent a nonterminal.
nonterminal. The token strings that can be derived from the start symbol
form the language defined by the grammar.
Example 2.2. The language defined by the grammar of Example 2.1 consists
The ten productions for the nonterminal digit allow it to stand for any of
the tokens 0, 1, …, 9. From production (2.4), a single digit by itself is a
list. Productions (2.2) and (2.3) express the fact that if we take any list and
follow it by a plus or minus sign and then another digit we have a new list.
It turns out that productions (2.2) to (2.5) are all we need to define the
language we are interested in. For example, we can deduce that 9-5+2 is a
list as follows.
This reasoning is illustrated by the tree in Fig. 2.2. Each node in the tree is
can be replaced by the empty string, so a block can consist of the two-token
string begin end. Notice that the productions for stmt_list are analogous to
those for list in Example 2.1, with semicolon in place of the arithmetic
operator and stmt in place of digit. We have not shown the productions for
stmt. Shortly, we shall discuss the appropriate productions for the various
kinds of statements, such as if-statements, assignment statements, and so on.
Parse Trees
A parse tree pictorially shows how the start symbol of a grammar derives a
string in the language. If nonterminal A has a production A → X Y Z, then a
parse tree may have an interior node labeled A with three children labeled X,
Y, and Z, from left to right:
X Y Z
Formally, given a context-free grammar, a parse tree is a tree with the
following properties:
are the labels of the children of that node from left to right, then
A → X₁ X₂ ⋯ Xₙ is a production. Here, X₁, X₂, …, Xₙ stand for a symbol
that is either a terminal or a nonterminal.
Example 2.4. In Fig. 2.2, the root is labeled list, the start symbol of the
grammar in Example 2.1. The children of the root are labeled, from left to
right, list, +, and digit. Note that this pattern is repeated at the left child
of the root, and the three nodes labeled digit each have one child that is
labeled by a digit.
The leaves of a parse tree read from left to right form the yield of the tree,
which is the string generated or derived from the nonterminal at the root of
the parse tree. In Fig. 2.2, the generated string is 9-5+2. In that figure, all
the leaves are shown at the bottom level. Henceforth, we shall not necessarily
line up the leaves in this way. Any tree imparts a natural left-to-right order
to its leaves, based on the idea that if a and b are two children with the same
parent, and a is to the left of b, then all descendants of a are to the left of
descendants of b.
Ambiguity
Example 2.5. Suppose we did not distinguish between digits and lists as in
Example 2.1. We could have written the grammar
However, Fig. 2.3 shows that an expression like 9-5+2 now has more than
one parse tree. The two trees for 9-5+2 correspond to the two ways of
parenthesizing the expression: (9-5)+2 and 9-(5+2). This second
parenthesization gives the expression the value 2 rather than the customary
value 6. The grammar of Example 2.1 did not permit this interpretation.
Associativity of Operators
[Fig. 2.3: two parse trees for 9-5+2 built from the nonterminal string, one
grouping (9-5)+2 and the other 9-(5+2).]
letter → a | b | ⋯ | z
The contrast between a parse tree for a left-associative operator like - and a
parse tree for a right-associative operator like = is shown by Fig. 2.4. Note
that the parse tree for 9-5-2 grows down towards the left, whereas the parse
tree for a=b=c grows down towards the right.
[Fig. 2.4: two parse trees; the tree for 9-5-2, built from list and digit, grows
down towards the left, while the tree for a=b=c, built from right and letter,
grows down towards the right.]
Precedence of Operators
Consider the expression 9+5*2. There are two possible interpretations of this
expression: (9+5)*2 or 9+(5*2). The associativity of + and * does not
resolve this ambiguity. For this reason, we need to know the relative
precedence of operators when more than one kind of operator is present.
We say that * has higher precedence than + if * takes its operands before +
does. In ordinary arithmetic, multiplication and division have higher
precedence than addition and subtraction. Therefore, 5 is taken by * in both
9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+(5*2) and
(9*5)+2, respectively.
left associative:  +  -
left associative:  *  /
We create two nonterminals expr and term for the two levels of precedence,
and an extra nonterminal factor for generating basic units in expressions. The
basic units in expressions are presently digits and parenthesized expressions.
factor → digit | ( expr )
Now consider the binary operators, * and /, that have the highest
precedence. Since these operators associate to the left, the productions are
similar to those for lists that associate to the left.

    term → term * factor
         | term / factor
         | factor
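The layering of expr, term, and factor can be made concrete with a small evaluator. The sketch below is ours, not the book's code: the left-recursive productions are realized as loops (anticipating the iterative technique used later in this chapter), so operators at the same precedence level group to the left, and factor handles digits and parenthesized expressions.

```c
#include <assert.h>

/* Evaluate an expression of single digits, + - * / and parentheses,
   following the two precedence levels of the grammar above. */
static const char *p;          /* cursor into the input string */

static int expr(void);

static int factor(void)
{
    if (*p == '(') {           /* factor -> ( expr ) */
        int v;
        p++;                   /* skip '(' */
        v = expr();
        p++;                   /* skip ')' */
        return v;
    }
    return *p++ - '0';         /* factor -> digit */
}

static int term(void)          /* term -> term * factor | term / factor | factor */
{
    int v = factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int w = factor();
        v = (op == '*') ? v * w : v / w;
    }
    return v;
}

static int expr(void)          /* expr -> expr + term | expr - term | term */
{
    int v = term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int w = term();
        v = (op == '+') ? v + w : v - w;
    }
    return v;
}

int evaluate(const char *s) { p = s; return expr(); }
```

Because term is completed before expr looks for + or -, multiplication and division automatically take their operands first; evaluate("9+5*2") yields 19.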
stmt → id := expr
     | if expr then stmt
     | if expr then stmt else stmt
     | while expr do stmt
     | begin opt_stmts end
contained in Chapter 5.
Postfix Notation
No parentheses are needed in postfix notation because the position and arity
(number of arguments) of the operators permits only one decoding of a
postfix expression. For example, the postfix notation for (9-5)+2 is 95-2+,
and that for 9-(5+2) is 952+-.
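Because position and arity determine the decoding uniquely, a postfix string can be evaluated in a single left-to-right scan with a stack. A minimal C sketch (the function name eval_postfix is ours) for single digits and the operators + and -:

```c
#include <assert.h>
#include <ctype.h>

/* Evaluate a postfix expression of single digits and the
   operators + and -, using an explicit stack. */
int eval_postfix(const char *s)
{
    int stack[64];
    int top = 0;                      /* number of values on the stack */
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';  /* push operand */
        } else if (*s == '+' || *s == '-') {
            int right = stack[--top]; /* pop the two topmost values */
            int left  = stack[--top];
            stack[top++] = (*s == '+') ? left + right : left - right;
        }
    }
    return stack[top - 1];            /* value of the entire expression */
}
```

For instance, eval_postfix("95-2+") returns 6 and eval_postfix("952+-") returns 2, matching the two parenthesizations (9-5)+2 and 9-(5+2).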
Syntax-Directed Definitions
specified in the following manner. First, construct a parse tree for x. Suppose
a node n in the parse tree is labeled by the grammar symbol X. We write X.a
to denote the value of attribute a of X at that node. The value of X.a at n is
computed using the semantic rule for attribute a associated with the
production for X used at node n. A parse tree showing the attribute values at each
node is called an annotated parse tree.
Synthesized Attributes
Synthesized attributes have the desirable property that they can be evaluated
during a single bottom-up traversal of the parse tree. In this chapter, only
synthesized attributes are used.
[Fig. 2.6: annotated parse tree for 9-5+2; the attribute value expr.t = 95-2+
appears at the root, with values such as expr.t = 9, term.t = 5, and
term.t = 2 at the nodes below.]
Example 2.7. Suppose a robot can be instructed to move one step east, north,
west, or south from its current position. A sequence of such instructions is
[Fig. 2.7: the path taken by a robot for an instruction sequence; positions
marked along the path include (0,0), (-1,-1), (1,0), and (2,1), with moves
labeled north, south, east, and west.]
position. (If x is negative, then the robot is to the west of the starting
position; similarly, if y is negative, then the robot is to the south of the
starting position.)
Let us construct a syntax-directed definition to translate an instruction
sequence into a robot position. We shall use two attributes, seq.x and seq.y,
to keep track of the position resulting from an instruction sequence generated
by the nonterminal seq. Initially, seq generates begin, and seq.x and seq.y are
both initialized to 0, as shown at the leftmost interior node of the parse tree
for begin west south shown in Fig. 2.8.
[Fig. 2.8: annotated parse tree for begin west south; west contributes
instr.dx = -1, instr.dy = 0 and south contributes instr.dx = 0, instr.dy = -1,
giving seq.x = -1 and seq.y = -1 at the root.]
given by attributes instr.dx and instr.dy. For example, if instr derives west,
then instr.dx = -1 and instr.dy = 0. Suppose a sequence seq is formed by
following a sequence seq₁ by a new instruction instr. The new position of the
robot is then given by the rules

    seq.x := seq₁.x + instr.dx
    seq.y := seq₁.y + instr.dy
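The effect of these rules can be sketched in C by folding an instruction sequence into a running position. The names robot and struct position are ours, and instructions are simplified to an array of strings:

```c
#include <assert.h>
#include <string.h>

struct position { int x, y; };

/* Compute the final robot position from a sequence of instructions,
   applying seq.x = seq1.x + instr.dx and seq.y = seq1.y + instr.dy
   one instruction at a time. */
struct position robot(const char *instr[], int n)
{
    struct position pos = {0, 0};   /* "begin" starts at the origin */
    int i;
    for (i = 0; i < n; i++) {
        if      (strcmp(instr[i], "east")  == 0) pos.x += 1;
        else if (strcmp(instr[i], "west")  == 0) pos.x -= 1;
        else if (strcmp(instr[i], "north") == 0) pos.y += 1;
        else if (strcmp(instr[i], "south") == 0) pos.y -= 1;
    }
    return pos;
}
```

For begin west south, the result is (-1,-1), as in Fig. 2.8.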
DepthFirst Traversals
A syntax-directed definition does not impose any specific order for the
evaluation of attributes on a parse tree; any evaluation order that computes an
attribute a after all the other attributes that a depends on is acceptable. In
general, we may have to evaluate some attributes when a node is first reached
during a walk of the parse tree, others after all its children have been visited,
or at some point in between visits to the children of the node. Suitable
evaluation orders are discussed in more detail in Chapter 5.
The translations in this chapter can all be implemented by evaluating the
semantic rules for the attributes in a parse tree in a predetermined order. A
traversal of a tree starts at the root and visits each node of the tree in some
order.
Emitting a Translation
In this chapter, the semantic actions in translation schemes will write the
output of a translation incrementally. The translation of the nonterminal on
the left side of each production is the concatenation of the translations
of the nonterminals on the right, in the same order as in the production, with
some additional strings (perhaps none) interleaved. A syntax-directed
definition with this property is termed simple. For example, consider the first
production and semantic rule from the syntax-directed definition of Fig. 2.5:
Here the translation expr.t is the concatenation of the translations of expr₁
and term, followed by the symbol +. Notice that expr₁ appears before term
on the right side of the production.
An additional string appears between term.t and rest₁.t, but, again, the
nonterminal term appears before rest on the right side.
Example 2.8. Figure 2.5 contained a simple definition for translating
expressions into postfix form. A translation scheme derived from this
definition is given in Fig. 2.13 and a parse tree with actions for 9-5+2 is
shown in Fig. 2.14. Note that although Figures 2.6 and 2.14 represent the
same input-output mapping, the translation in the two cases is constructed differently;
Fig. 2.6 attaches the output to the root of the parse tree, while Fig. 2.14 prints
the output incrementally.
subtree for the right operand term and, finally, the semantic action
{ print('+') } at the extra node.
Since the productions for term have only a digit on the right side, that digit
is printed by the actions for the productions. No output is necessary for the
production expr → term, and only the operator needs to be printed in the
action for the first two productions. When executed during a depth-first
traversal of the parse tree, the actions in Fig. 2.14 print 95-2+.
[Fig. 2.14: parse tree for 9-5+2 with actions; the leaves trigger
{ print('9') }, { print('5') }, and { print('2') }, and the extra nodes trigger
{ print('-') } and { print('+') } after their operands.]
As a general rule, most parsing methods process their input from left to
right in a "greedy" fashion; that is, they construct as much of a parse tree as
possible before reading the next input token. In a simple translation scheme
(one derived from a simple syntaxdirected definition), actions are also done
in a left-to-right order. Therefore, to implement a simple translation scheme
we can execute the semantic actions while we parse; it is not necessary to
construct the parse tree at all.
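As an illustration, the postfix translation of the digit-and-operator language can be produced in one left-to-right pass with no tree at all. This sketch (to_postfix is our name) emits each digit immediately and each operator after its right operand, as the translation scheme prescribes:

```c
#include <assert.h>
#include <string.h>

/* Translate an infix expression of single digits with + and - into
   postfix, writing the result into out, in a single scan. */
void to_postfix(const char *in, char *out)
{
    int i = 0;
    out[i++] = *in++;                 /* first digit: emit immediately */
    while (*in == '+' || *in == '-') {
        char op = *in++;
        out[i++] = *in++;             /* next digit: emit immediately */
        out[i++] = op;                /* operator: emit after its operand */
    }
    out[i] = '\0';
}
```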
2.4 PARSING
Parsing is the process of determining if a string of tokens can be generated by
a grammar. In discussing this problem, it is helpful to think of a parse tree
being constructed, even though a compiler may not actually construct such a
tree. However, a parser must be capable of constructing the tree, or else the
translation cannot be guaranteed correct.
This section introduces a parsing method that can be applied to construct
syntaxdirected translators. A complete C program, implementing the transla
tion scheme of Fig. 2.13, appears in the next section. A viable alternative is
to use a software tool that generates a translator directly from a translation
scheme. The grammars used in practice, however, have a special form. For
any context-free grammar there is a parser that takes at most O(n³) time to
parse a string of n tokens. But cubic time is too expensive to be practical.
Most parsing methods fall into one of two classes, called the top-down and
bottom-up methods. These terms refer to the order in which nodes in the
parse tree are constructed. In the former, construction starts at the root and
proceeds towards the leaves, while, in the latter, construction starts at the
leaves and proceeds towards the root. The popularity of top-down parsers is
due to the fact that efficient parsers can be constructed more easily by hand
using topdown methods. Bottomup parsing, however, can handle a larger
class of grammars and translation schemes, so software tools for generating
parsers directly from grammars have tended to use bottomup methods.
TopDown Parsing
We introduce top-down parsing by considering a grammar that is well-suited
for this class of methods. Later in this section, we consider the construction
of top-down parsers in general. The following grammar generates a subset of
the types of Pascal. We use the token dotdot for ".." to emphasize that the
character sequence is treated as a unit.
type → simple
     | ↑ id
     | array [ simple ] of type
simple → integer
     | char
     | num dotdot num
The top-down construction of a parse tree is done by starting with the root,
labeled with the starting nonterminal, and repeatedly performing the following
two steps (see Fig. 2.15 for an example).
For some grammars, the above steps can be implemented during a single
left-to-right scan of the input string. The current token being scanned in the
input is frequently referred to as the lookahead symbol. Initially, the lookahead
symbol is the first, i.e., leftmost, token of the input string. Figure 2.16
illustrates the process.
Initially, the token array is the lookahead symbol and the known part of the
[Fig. 2.15: steps in the top-down construction of a parse tree, beginning with
the root labeled type and expanding it with the production
type → array [ simple ] of type.]
parse tree consists of the root, labeled with the starting nonterminal type, as
in Fig. 2.16(a).

[Fig. 2.16: in (a) the tree is the root labeled type; in (b) the production
type → array [ simple ] of type has been applied; in (c) an arrow in the tree
and an arrow in the input mark the current node and the lookahead token.]

Fig. 2.16. Top-down parsing while scanning the input from left to right.
When the node being considered in the parse tree is for a terminal that
matches the lookahead symbol, then we advance in both the parse tree and
the input. The next token in the input becomes the new lookahead symbol
and the next child in the parse tree is considered. In Fig. 2.16(c), the
arrow in the parse tree has advanced to the next child of the root and the
arrow in the input has advanced to the next token [. After the next advance,
the arrow in the parse tree will point to the child labeled with nonterminal
simple. When a node labeled with a nonterminal is considered, we repeat the
process of selecting a production for the nonterminal.
In general, the selection of a production for a nonterminal may involve
trial-and-error; that is, we may have to try a production and backtrack to try
another production if the first is found to be unsuitable. A production is
unsuitable if, after using the production, we cannot complete the tree to match
the input string. There is an important special case, however, called
predictive parsing, in which backtracking does not occur.
Predictive Parsing
procedure match(t: token);
begin
    if lookahead = t then
        lookahead := nexttoken
    else error
end;

procedure type;
begin
    if lookahead is in { integer, char, num } then
        simple
    else if lookahead = '↑' then begin
        match('↑'); match(id)
    end
    else if lookahead = array then begin
        match(array); match('['); simple; match(']'); match(of); type
    end
    else error
end;

procedure simple;
begin
    if lookahead = integer then
        match(integer)
    else if lookahead = char then
        match(char)
    else if lookahead = num then begin
        match(num); match(dotdot); match(num)
    end
    else error
end;
The predictive parser in Fig. 2.17 consists of procedures for the nontermi
nals type and simple of grammar (2.8) and an additional procedure match. We
use match to simplify the code for type and simple; it advances to the next
input token if its argument t matches the lookahead symbol. Thus match
changes the variable lookahead, which is the currently scanned input token.
Parsing begins with a call of the procedure for the starting nonterminal type
in our grammar. With the same input as in Fig. 2.16, lookahead is initially
the first token array. Procedure type executes the code
Note that each terminal in the right side is matched with the lookahead
symbol and that each nonterminal in the right side leads to a call of its procedure.
With the input of Fig. 2.16, after the tokens array and [ are matched, the
lookahead symbol is num. At this point procedure simple is called and the
code

    match(num); match(dotdot); match(num)

is executed. In general, if the right side of a production starts with a token,
then the production can be used when the lookahead symbol matches the
token. Now consider a right
side starting with a nonterminal, as in
This production is used if the lookahead symbol can be generated from simple.
For example, during the execution of the code fragment (2.9), suppose the
lookahead symbol is integer when control reaches the procedure call type.
There is no production for type that starts with token integer. However, a
production for simple does, so production (2.10) is used by having type call
the procedure simple. In general, FIRST(α) is the set of tokens that can
appear as the first token of a string generated from α; for example,

    FIRST(↑ id) = { ↑ }
In practice, many production right sides start with tokens, simplifying the
determination of FIRST sets.²

² Productions with ε on the right side complicate the determination of the first symbols generated
by a nonterminal. For example, if nonterminal B can derive the empty string and there is a
production A → B C, then the first symbol generated by C can also be the first symbol generated
by A. If C can also generate ε, then both FIRST(A) and FIRST(BC) contain ε.
Section 4.4.
The FIRST sets must be considered if there are two productions A → α and
A → β. Recursive-descent parsing without backtracking requires FIRST(α)
and FIRST(β) to be disjoint. The lookahead symbol can then be used to
decide which production to use; if the lookahead symbol is in FIRST(α), then
the production A → α is used.
Productions with ε on the right side require special treatment. The
recursive-descent parser will use an ε-production as a default when no other
production can be used. For example, consider the pair of productions

    opt_stmts → stmt_list | ε

The production with ε on the right side is used if the lookahead symbol is
not in the FIRST set for any other right side; here, if the lookahead symbol
is not in FIRST(stmt_list), the ε-production is used. If there is a conflict
between two right sides for any lookahead symbol, then we cannot use this
parsing method on this grammar.
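Selecting a production by FIRST sets, with the ε-production as a default, can be sketched as a small helper. The function choose and its string-of-characters encoding of FIRST sets are our own simplification:

```c
#include <assert.h>
#include <string.h>

/* Decide which production to use for a nonterminal with two non-empty
   right sides, alpha and beta, by testing whether the lookahead symbol
   is in FIRST(alpha) or FIRST(beta).  FIRST sets are encoded as strings
   of single-character tokens.  Returns 0 for the epsilon default. */
int choose(int lookahead, const char *first_alpha, const char *first_beta)
{
    if (strchr(first_alpha, lookahead)) return 1;  /* use alpha */
    if (strchr(first_beta, lookahead))  return 2;  /* use beta */
    return 0;                                      /* epsilon default */
}
```

For the productions rest → + term rest | - term rest | ε, choose('+', "+", "-") selects the first alternative, while a lookahead such as ')' falls through to the ε default.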
2. Copy the actions from the translation scheme into the parser. If an action
appears after grammar symbol X in production p, then it is copied after the
code implementing X. Otherwise, if it appears at the beginning of the
production, then it is copied just before the code implementing the production.
Left Recursion
in which the leftmost symbol on the right side is the same as the nonterminal
on the left side of the production, as in

    A → A α | β        (2.11)
Right-recursive productions lead to trees that grow down towards the right, as
in Fig. 2.18(b). Trees growing down to the right make it harder to translate
expressions containing left-associative operators, such as minus. In the next
section, however, we shall see that the proper translation of expressions into
postfix notation can still be attained by a careful design of the translation
scheme based on a rightrecursive grammar.
In Chapter 4, we consider more general forms of left recursion and show
how all left recursion can be eliminated from a grammar.
expr → expr + term   { print('+') }
expr → expr - term   { print('-') }
expr → term
term → 0             { print('0') }
term → 1             { print('1') }
    ⋯
term → 9             { print('9') }
A useful starting point for thinking about the translation of an input string is
an abstract syntax tree in which each node represents an operator and the
children of the node represent the operands. By contrast, a parse tree is called a
concrete syntax tree, and the underlying grammar is called a concrete syntax
for the language. Abstract syntax trees, or simply syntax trees, differ from
parse trees because superficial distinctions of form, unimportant for
translation, do not appear in syntax trees.
[Fig. 2.20: syntax tree for 9-5+2, with + at the root; its children are the
subtree for 9-5 and the leaf 2.]
For example, the syntax tree for 9-5+2 is shown in Fig. 2.20. Since + and
- have the same precedence, and operators at the same precedence level
are evaluated left to right, the tree shows 9-5 grouped as a subexpression.
Comparing Fig. 2.20 with the corresponding parse tree of Fig. 2.2, we note
that the syntax tree associates an operator with an interior node, rather than
making the operator be one of the children.
It is desirable for a translation scheme to be based on a grammar whose
parse trees are as close to syntax trees as possible. The grouping of
subexpressions by the grammar in Fig. 2.19 is similar to their grouping in syntax
trees. Unfortunately, the grammar of Fig. 2.19 is left-recursive, and hence
not suitable for predictive parsing. It appears there is a conflict; on the one
hand we need a grammar that facilitates parsing, on the other hand we need a
radically different grammar for easy translation. The obvious solution is to
eliminate the left recursion. However, this must be done carefully as the
following example shows.
This grammar has the problem that the operands of the operators generated
by rest → + expr and rest → - expr are not obvious from the productions.
Neither of the following choices for forming the translation rest.t from that of
expr.t is acceptable:
(We have only shown the production and semantic action for the minus
operator.) The translation of 9-5 is 95-. However, if we use the action in
(2.12), then the minus sign appears before expr.t and 9-5 incorrectly remains
9-5 in the translation.
On the other hand, if we use (2.13) and the analogous rule for plus, the
operators consistently move to the right end and 9-5+2 is translated
incorrectly into 952+- (the correct translation is 95-2+).
A → γ R
R → α R  |  β R  |  ε
When semantic actions are embedded in the productions, we carry them along
in the transformation. Here, if we let A = expr, α = + term { print('+') },
β = - term { print('-') }, and γ = term, the transformation above produces
the translation scheme (2.14). The expr productions in Fig. 2.19 have been
transformed into the productions for expr and the new nonterminal rest in
(2.14). The productions for term are repeated from Fig. 2.19. Notice that the
underlying grammar is different from the one in Example 2.9 and the
difference makes the desired translation possible.
Figure 2.21 shows how 9-5+2 is translated using the above grammar.
[Fig. 2.21: parse tree with actions for 9-5+2 under scheme (2.14); a
depth-first traversal executes print('9'), print('5'), print('-'), print('2'),
print('+') in that order, producing 95-2+.]
expr()
{
    term(); rest();
}

rest()
{
    if (lookahead == '+') {
        match('+'); term(); putchar('+'); rest();
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); rest();
    }
    else ;
}

term()
{
    if (isdigit(lookahead)) {
        putchar(lookahead); match(lookahead);
    }
    else error();
}
Fig. 2.22. Functions for the nonterminals expr, rest, and term.
The function match is used, as in Fig. 2.17, to match a token with the
lookahead symbol and advance through the input. Since each token is a single
character in our language, match can be implemented by comparing and
reading characters.
For those unfamiliar with the programming language C, we mention the
salient differences between C and other Algol derivatives such as Pascal, as
we find uses for those features of C. A program in C consists of a sequence
of function definitions, with execution starting at a distinguished function
called main. Function definitions cannot be nested. Parentheses enclosing
function parameter lists are needed even if there are no parameters; hence we
write expr(), term(), and rest(). Functions communicate either by passing
parameters "by value" or by accessing data global to all functions. For
example, the functions term() and rest() examine the lookahead symbol
because control flows to the end of the function body after each of these calls.
We can speed up a program by replacing tail recursion by iteration. For a
procedure without parameters, a tailrecursive call can be simply replaced by a
jump to the beginning of the procedure. The code for rest can be rewritten
as:
rest()
{
L:  if (lookahead == '+') {
        match('+'); term(); putchar('+'); goto L;
    }
    else ;
}
In Fig. 2.22, the only remaining call of rest is from expr (see line 3). The two
functions can therefore be integrated into one, as shown in Fig. 2.23. In C, a
statement stmt can be repeatedly executed by writing
while ( 1 ) stmt
because the condition 1 is always true. We can exit from a loop by executing
a break-statement. The stylized form of the code in Fig. 2.23 allows other
operators to be added conveniently.
expr()
{
    term();
    while(1)
        if (lookahead == '+') {
            match('+'); term(); putchar('+');
        }
        else break;
}
Fig. 2.23. Replacement for functions expr and rest of Fig. 2.22.
global to any functions that are defined after line 2 of Fig. 2.24.
The function match checks tokens; it reads the next input token if the
lookahead symbol is matched and calls the error routine otherwise.
The function error uses the standard library function printf to print the
#include <stdio.h>   /* loads getchar, putchar, printf */
#include <ctype.h>   /* loads the predicate isdigit */
int lookahead;

main()
{
    lookahead = getchar();
    expr();
    putchar('\n');    /* adds trailing newline character */
}

expr()
{
    term();
    while(1)
        if (lookahead == '+') {
            match('+'); term(); putchar('+');
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-');
        }
        else break;
}

term()
{
    if (isdigit(lookahead)) {
        putchar(lookahead); match(lookahead);
    }
    else error();
}

match(t)
int t;
{
    if (lookahead == t)
        lookahead = getchar();
    else error();
}

error()
{
    printf("syntax error\n");    /* print error message */
    exit(1);                     /* then halt */
}
appears in the input stream, the lexical analyzer will pass num to the parser.
The value of the integer will be passed along as an attribute of the token num.
Logically, the lexical analyzer passes both the token and the attribute to the
parser. If we write a token and its attribute as a tuple enclosed between <>,
the input

    31 + 28 + 59

is transformed into the sequence of tuples

    <num, 31> <+, > <num, 28> <+, > <num, 59>
The token + has no attribute. The second components of the tuples, the attri
butes, play no role during parsing, but are needed during translation.
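A <token, attribute> pair maps naturally onto a small struct. The following sketch (struct token and next_token are our names, as is the token code NUM) produces the tuple stream shown above from the input characters:

```c
#include <assert.h>
#include <ctype.h>

enum { NUM = 256 };               /* token code for numbers */

struct token { int kind; int value; };

/* Return the next <token, attribute> pair from *sp, advancing the
   cursor; single-character tokens such as + carry no attribute. */
struct token next_token(const char **sp)
{
    struct token t = {0, 0};
    const char *s = *sp;
    while (*s == ' ') s++;        /* skip blanks between lexemes */
    if (isdigit((unsigned char)*s)) {
        t.kind = NUM;             /* the token is num ... */
        while (isdigit((unsigned char)*s))
            t.value = t.value * 10 + (*s++ - '0');  /* ... with its value */
    } else {
        t.kind = *s++;            /* token is the character itself */
    }
    *sp = s;
    return t;
}
```

On "31 + 28 + 59", successive calls yield <num, 31>, then '+', then <num, 28>, and so on.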
id = id + id ; (2.16)
Many languages use fixed character strings such as begin, end, if, and so
on, as punctuation marks or to identify certain constructs. These character
strings, called keywords, generally satisfy the rules for forming identifiers, so a
mechanism is needed for deciding when a lexeme forms a keyword and when
it forms an identifier. The problem is easier to resolve if keywords are
reserved, i.e., if they cannot be used as identifiers. Then a character string
forms an identifier only if it is not a keyword.
The problem of isolating tokens also arises if the same characters appear in
the lexemes of more than one token, as in <, <=, and <> in Pascal.
Techniques for recognizing such tokens are discussed in Chapter 3.
When a lexical analyzer is inserted between the parser and the input stream, it
interacts with the two in the manner shown in Fig. 2.25. It reads characters
from the input, groups them into lexemes, and passes the tokens formed by
the lexemes, together with their attribute values, to the later stages of the
compiler. In some situations, the lexical analyzer has to read some characters
ahead before it can decide on the token to be returned to the parser. For
example, a lexical analyzer for Pascal must read ahead after it sees the
character >. If the next character is =, then the character sequence >= is the lexeme
forming the token for the "greater than or equal to" operator. Otherwise > is
the lexeme forming the "greater than" operator, and the lexical analyzer has
read one character too many. The extra character has to be pushed back onto
the input, because it can be the beginning of the next lexeme in the input.
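In C, the pushback can be implemented with ungetc. The sketch below (next_relop is our name, and the token codes GT and GE are assumptions) reads one character ahead after > and pushes the extra character back when the lexeme turns out to be > alone:

```c
#include <assert.h>
#include <stdio.h>

/* Assumed token codes for the two comparison operators. */
enum { GT = 256, GE = 257 };

/* Read one token from stream f.  After '>', read one character
   ahead; if the lexeme is just ">", push the extra character back
   with ungetc so it can begin the next lexeme. */
int next_relop(FILE *f)
{
    int c = getc(f);
    if (c == '>') {
        int d = getc(f);
        if (d == '=')
            return GE;            /* lexeme ">=" */
        ungetc(d, f);             /* read one character too many */
        return GT;                /* lexeme ">" */
    }
    return c;                     /* any other character as itself */
}
```

The C standard guarantees one character of pushback on any stream, which is exactly what this style of lexical analysis requires.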
[Fig. 2.25: the lexical analyzer reads characters from the input (pushing back
any character read too far) and passes tokens and their attributes to the parser.]

Fig. 2.25. Inserting a lexical analyzer between the input and the parser.
The lexical analyzer and parser form a producerconsumer pair. The lexical
analyzer produces tokens and the parser consumes them. Produced tokens can
be held in a token buffer until they are consumed. The interaction between
the two is constrained only by the size of the buffer, because the lexical
analyzer cannot proceed when the buffer is full and the parser cannot proceed
when the buffer is empty. Commonly, the buffer holds just one token. In
this case, the interaction can be implemented simply by making the lexical
analyzer be a procedure called by the parser, returning tokens on demand.
The implementation of reading and pushing back characters is usually done
by setting up an input buffer. A block of characters is read into the buffer at
a time; a pointer keeps track of the portion of the input that has been
analyzed. Pushing back a character is implemented by moving back the
pointer. Input characters may also need to be saved for error reporting, since
some indication has to be given of where in the input text the error occurred.
The buffering of input characters can be justified on efficiency grounds alone.
Fetching a block of characters is usually more efficient than fetching one
character at a time. Techniques for input buffering are discussed in Section 3.2.
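A minimal version of such a buffer can be sketched in C. The names fill, next_char, and push_back are ours, and a real implementation would refill the buffer from a file rather than from a string:

```c
#include <assert.h>
#include <string.h>

#define BUFSIZE 4096

/* A block of input is read into buf at once; forward tracks how much
   has been analyzed.  Pushing back a character just moves the
   pointer back one position. */
static char buf[BUFSIZE];
static int  forward = 0;     /* index of the next character to analyze */
static int  limit   = 0;     /* number of valid characters in buf */

void fill(const char *block)          /* load one block of input */
{
    limit = strlen(block);
    memcpy(buf, block, limit);
    forward = 0;
}

int  next_char(void) { return forward < limit ? buf[forward++] : -1; }
void push_back(void) { forward--; }   /* character rejoins the input */
```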
A Lexical Analyzer
[Fig. 2.26: the lexical analyzer uses getchar() to read characters from the input.]
factor → ( expr )
       | num  { print(num.value) }
The C code for factor in Fig. 2.27 is a direct implementation of the
productions above. When lookahead equals NUM, the value of attribute num.value
is given by the global variable tokenval. The action of printing this value is
done by the standard library function printf. The first argument of
printf is a string between double quotes specifying the format to be used for
printing the remaining arguments. Where %d appears in the string, the
decimal representation of the next argument is printed. Thus, the printf
statement in Fig. 2.27 prints a blank followed by the decimal representation of
tokenval followed by another blank.
factor()
{
    if (lookahead == '(') {
        match('('); expr(); match(')');
    }
    else if (lookahead == NUM) {
        printf(" %d ", tokenval);
        match(NUM);
    }
    else error();
}
incremented, thereby keeping track of line numbers in the input, but again no
token is returned. Supplying a line number with an error message helps
pinpoint errors.
The code for reading a sequence of digits is on lines 14-23. The predicate
isdigit(t) from the include-file <ctype.h> is used on lines 14 and 17 to
determine if an incoming character t is a digit. If it is, then its integer value
is given by the expression t-'0' in both ASCII and EBCDIC. With other
character sets, the conversion may need to be done differently. In Section
2.9, we incorporate this lexical analyzer into our expression translator.
In this section, we illustrate how the lexical analyzer of the previous section might interact with a
symbol table.
The symbol-table routines are concerned primarily with saving and retrieving
lexemes. When a lexeme is saved, we also save the token associated with the
lexeme. The following operations will be performed on the symbol table.
The symbol-table routines above can handle any collection of reserved
keywords. For example, consider tokens div and mod with lexemes div and
mod, respectively. We can initialize the symbol table using the calls

    insert("div", div);
    insert("mod", mod);
A SymbolTable Implementation
A data structure for a symbol table is sketched in Fig. 2.29. We do not wish to set aside a fixed amount of space to
hold lexemes forming identifiers; a fixed amount of space may not be large
enough to hold a very long identifier and may be wastefully large for a short
identifier, such as i. In Fig. 2.29, a separate array lexemes holds the
character string forming an identifier. The string is terminated by an
end-of-string character, denoted by EOS, that may not appear in identifiers. Each entry in
the symboltable array symtable is a record consisting of two fields,
lexptr, pointing to the beginning of a lexeme, and token. Additional fields
can hold attribute values, although we shall not do so here.
In Fig. 2.29, the 0th entry is left empty, because lookup returns 0 to
indicate that there is no entry for a string. The 1st and 2nd entries are for the
keywords div and mod. The 3rd and 4th entries are for identifiers count
and i.
[Fig. 2.29: the array symtable, whose entries have fields lexptr and token,
with lexptr pointing into the array lexemes.]
initialized with entries for the keywords div and mod, as shown in Fig. 2.29,
the lookup operation will find these entries if lexbuf contains either div or
mod. If there is no entry for the string in lexbuf, i.e., lookup returns 0,
then lexbuf contains a lexeme for a new identifier. An entry for the new
identifier is created using insert. After the insertion is made, p is the index
of the symbol-table entry for the string in lexbuf. This index is
communicated to the parser by setting tokenval to p, and the token in the token
field of the entry is returned.
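The lookup and insert operations described above might be sketched as follows; the array sizes and token codes are assumptions in the spirit of Fig. 2.29 rather than an exact listing:

```c
#include <assert.h>
#include <string.h>

#define SYMMAX 100                /* size of symtable (assumed) */
#define STRMAX 999                /* size of lexemes (assumed) */

enum { DIV = 256, MOD = 257, ID = 258 };   /* assumed token codes */

struct entry { char *lexptr; int token; };

char lexemes[STRMAX];
int  lastchar = -1;               /* last used position in lexemes */
struct entry symtable[SYMMAX];
int  lastentry = 0;               /* entry 0 is left empty */

/* Return the index of the entry for string s, or 0 if not found. */
int lookup(const char *s)
{
    int p;
    for (p = lastentry; p > 0; p--)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

/* Store string s with token tok; return the new entry's index. */
int insert(const char *s, int tok)
{
    int len = strlen(s);
    lastentry++;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar += len + 1;          /* room for s and its EOS byte */
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
```

After inserting the keywords div and mod, lookup finds them at entries 1 and 2, and a new identifier such as count gets the next free entry.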
The default action is to return the integer encoding of the character as a
token. Since the single-character tokens here have no attributes, tokenval is
set to NONE.
function lexan: integer;
var lexbuf : array [0..100] of char;
    c : char;
begin
    loop begin
        read a character into c;
        if c is a blank or a tab then
            do nothing
        else if c is a newline then
            lineno := lineno + 1
        else if c is a digit then begin
            set tokenval to the value of this and following digits;
            return NUM
        end
        else if c is a letter then begin
            place c and successive letters and digits into lexbuf;
            p := lookup(lexbuf);
            if p = 0 then
                p := insert(lexbuf, ID);
            tokenval := p;
            return the token field of table entry p
        end
        else begin  /* token is a single character */
            set tokenval to NONE;  /* there is no attribute */
            return integer encoding of character c
        end
    end
end
can be generated for it. The machine has separate instruction and data
memories and all arithmetic operations are performed on values on a stack.
The instructions are quite limited and fall into three classes: integer
arithmetic, stack manipulation, and control flow. Figure 2.31 illustrates the
machine. The pointer pc indicates the instruction we are about to execute.
The meanings of the instructions shown will be discussed shortly.
Arithmetic Instructions
[Fig. 2.31. Snapshot of the stack machine after the first four instructions are executed, with pointer pc into the instruction memory.]
1. Stack 1.
2. Stack 3.
3. Add the two topmost elements, pop them, and stack the result, 4.
4. Stack 5.
5. Multiply the two topmost elements, pop them, and stack the result, 20.
The value on top of the stack at the end (here 20) is the value of the entire
expression.
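The machine's behavior on this example can be sketched as a tiny interpreter in C. The instruction encoding (the enum names, the fixed-size stack) is our own choice for the sketch, not part of the text's machine definition.

```c
/* A minimal sketch of the abstract stack machine: only the integer
   arithmetic class is modeled; PUSH/ADD/MUL are our own encoding. */
enum op { PUSH, ADD, MUL };
struct instr { enum op op; int arg; };

int run(const struct instr *code, int n)
{
    int stack[64];
    int top = -1;                        /* index of topmost element */
    for (int pc = 0; pc < n; pc++) {
        switch (code[pc].op) {
        case PUSH:
            stack[++top] = code[pc].arg; /* stack the operand */
            break;
        case ADD:                        /* pop two, stack their sum */
            stack[top-1] = stack[top-1] + stack[top];
            top--;
            break;
        case MUL:                        /* pop two, stack their product */
            stack[top-1] = stack[top-1] * stack[top];
            top--;
            break;
        }
    }
    return stack[top];                   /* value of the entire expression */
}
```

Running the five instructions above (stack 1, stack 3, add, stack 5, multiply) leaves 20 on top of the stack.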
In the intermediate language, all values will be integers, with 0 corresponding
to false and nonzero integers corresponding to true. The boolean
operators and and or require both their arguments to be evaluated.
There is a distinction between the meaning of identifiers on the left and right
sides of an assignment. In each of the assignments

    i := 5;
    i := i + 1;

the right side specifies an integer value, while the left side specifies where the
value is to be stored. Similarly, if p and q are pointers to characters, and

    p↑ := q↑;
SEC. 2.8 ABSTRACT MACHINES 65
the right side q↑ specifies a character, while p↑ specifies where the character
is to be stored. The terms l-value and r-value refer to values that are
appropriate on the left and right sides of an assignment, respectively. That is,
r-values are what we usually think of as "values," while l-values are locations.
Stack Manipulation

Besides the obvious instruction for pushing an integer constant onto the stack
and popping a value from the top of the stack, there are instructions to access
data memory:

    push v      push v onto the stack
    rvalue l    push contents of data location l
    lvalue l    push address of data location l
    pop         throw away value on top of the stack
    :=          the r-value on top is placed in the l-value below it, and both are popped
    copy        push a copy of the top value on the stack
Translation of Expressions
rvalue a
rvalue b
+
In words: push the contents of the data locations for a and b onto the stack;
then pop the top two values on the stack, add them, and push the result onto
the stack.
The translation of assignments into stack-machine code is done as follows:
the l-value of the identifier assigned to is pushed onto the stack, the
expression is evaluated, and its r-value is assigned to the identifier. For example,
the assignment
lvalue day
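As an illustration of this pattern (using a made-up right side, not the book's original example), an assignment such as day := d + 1 would translate into the abstract-machine code:

```
lvalue day      push the address of day
rvalue d        push the contents of d
push 1
+               pop two values, push their sum
:=              store the r-value on top into the l-value below it
```

The lvalue instruction comes first, so that when := executes, the address is below the computed value on the stack.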
Translation of Statements
The layout in Fig. 2.33 sketches the abstract-machine code for conditional and
while statements. The following discussion concentrates on creating labels.
Consider the code layout for if-statements in Fig. 2.33. There can be only
one label out instruction in the translation of a source program; otherwise,
there will be confusion about where control flows to from a goto out statement.
We therefore need some mechanism for consistently replacing out in
the code layout by a unique label every time an if-statement is translated.
Suppose newlabel is a procedure that returns a fresh label every time it is
called. In the following semantic action, the label returned by a call of
newlabel is recorded using a local variable out:
    stmt → if expr then stmt1    { out := newlabel;
                                   stmt.t := expr.t || 'gofalse' out || stmt1.t || 'label' out }
[Fig. 2.33. Code layout for conditional and while statements, using the labels out and test.]
Emitting a Translation
Using the procedure emit, we can write the following instead of (2.18):

    stmt → if
           expr     { out := newlabel; emit('gofalse', out); }
           then
           stmt1    { emit('label', out); }
on the right side of the production in a left-to-right order. For the above
production, the order of actions is as follows: actions during the parsing of expr
are done, out is set to the label returned by newlabel and the gofalse
instruction is emitted, actions during the parsing of stmt1 are done, and,
finally, the label instruction is emitted. Assuming the actions during the
parsing of expr and stmt1 emit the code for these nonterminals, the above
production implements the code layout of Fig. 2.33.
procedure stmt;
var out : integer;  /* for labels */
begin
    ...
    end
    else if lookahead = 'if' then begin
        match('if');
        expr;
        out := newlabel;
        emit('gofalse', out);
        match('then');
        stmt;
        emit('label', out)
    end
    ...
end
requires some thought. Suppose that the labels in the translation are of the
form L1, L2, .... The pseudo-code manipulates such labels using the
integer following L. Thus, out is declared to be an integer, newlabel returns
an integer that becomes the value of out, and emit must be written to print a
label given an integer.
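This label-numbering convention can be sketched in C as follows. The function names mirror the pseudo-code's newlabel and emit, but the C rendering and the emitlabel name are our own.

```c
#include <stdio.h>

static int lastlabel = 0;      /* number of the most recently created label */

int newlabel(void)             /* returns a fresh label number on each call */
{
    lastlabel = lastlabel + 1;
    return lastlabel;
}

/* Prints an instruction that refers to a label, e.g. "gofalse L2". */
void emitlabel(const char *inst, int out)
{
    printf("%s L%d\n", inst, out);
}
```

Translating an if-statement would then call out = newlabel(); emitlabel("gofalse", out); ...; emitlabel("label", out);, so every translated if-statement gets its own label.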
The code layout for while statements in Fig. 2.33 can be converted into
code in a similar fashion. The translation of a sequence of statements is
simply the concatenation of the statements in the sequence, and is left to the
reader.
The translation of most single-entry single-exit constructs is similar to that
of while statements. We illustrate by considering control flow in expressions.
Consider a boolean expression of the form:

    t = blank or t = tab

If t is a blank, then clearly it is not necessary to test if t is a tab, because the
first equality implies that the condition is true. The expression

    expr1 or expr2

can therefore be evaluated by evaluating expr1 and skipping expr2 whenever
expr1 is true. The reader can verify that the following code implements the or operator:

    code for expr1
    copy            /* duplicate the value of expr1 */
    gotrue out
    pop             /* pop the value pushed by copy */
    code for expr2
    label out
separate file. Execution begins in the module main.c that consists of a call
SEC. 2.9 PUTTING THE TECHNIQUES TOGETHER 71
found. The value of the attribute associated with the token is assigned to a
global variable tokenval.
The following tokens are expected by the parser:
[Table of the tokens expected by the parser, with their lexemes and attribute values.]
expr → term moreterms
moreterms → + term { print('+') } moreterms
    | - term { print('-') } moreterms
    | ε
term → factor morefactors
morefactors → * factor { print('*') } morefactors
    | / factor { print('/') } morefactors
    | div factor { print('DIV') } morefactors
    | mod factor { print('MOD') } morefactors
    | ε
factor → ( expr )
    | id { print(id.lexeme) }
    | num { print(num.value) }
Fig. 2.29 of Section 2.7. The entries in the array symtable are pairs
consisting of a pointer to the lexemes array and an integer denoting the token
stored there. The operation insert(s,t) returns the symtable index for
the lexeme s forming the token t. The function lookup(s) returns the
index of the entry in symtable for the lexeme s, or 0 if s is not there.
The module init.c is used to preload symtable with keywords. The
lexeme and token representations for all the keywords are stored in the array
keywords, which has the same type as the symtable array. The function
init() goes sequentially through the keyword array, using the function
insert to put the keywords in the symbol table. This arrangement allows us
to change the representation of the tokens for keywords in a convenient way.
The error module manages the error reporting, which is extremely primitive.
On encountering a syntax error, the compiler prints a message saying that an
error has occurred on the current input line and then halts. A better error
recovery technique might skip to the next semicolon and continue parsing; the
The code for the modules appears in seven files: lexer.c, parser.c,
emitter.c, symbol.c, init.c, error.c, and main.c. The file main.c
contains the main routine in the C program that calls init(), then
parse(), and, upon successful completion, exit(0).
Under the UNIX operating system, the compiler can be created by
executing the command

    cc lexer.c parser.c emitter.c symbol.c init.c error.c main.c

The files can also be compiled separately, using

    cc -c filename.c
The cc command creates a file a.out that contains the translator. The
translator can then be exercised by typing a.out followed by the expressions to be
translated; e.g.,
2+3*5;
12 div 5 mod 2;
The Listing
int lineno;
struct entry { /* form of symbol table entry */
char *lexptr;
int token;
};
int lineno = 1;
int tokenval = NONE;
int lexan()  /* lexical analyzer */
{
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;    /* strip out white space */
        else if (t == '\n')
            lineno = lineno + 1;
        else if (isdigit(t)) {    /* t is a digit */
            ungetc(t, stdin);
            scanf("%d", &tokenval);
            return NUM;
        }
        else if (isalpha(t)) {    /* t is a letter */
            int p, b = 0;
            while (isalnum(t)) {  /* t is alphanumeric */
                lexbuf[b] = t;
                t = getchar();
                b = b + 1;
                if (b >= BSIZE)
                    error("compiler error");
            }
            lexbuf[b] = EOS;
            if (t != EOF)
                ungetc(t, stdin);
            p = lookup(lexbuf);
            if (p == 0)
                p = insert(lexbuf, ID);
            tokenval = p;
            return symtable[p].token;
        }
        else if (t == EOF)
            return DONE;
        else {
            tokenval = NONE;
            return t;
        }
    }
}
parse()  /* parses and translates expression list */
{
    lookahead = lexan();
    while (lookahead != DONE) {
        expr(); match(';');
    }
}
expr()
{
    int t;
    term();
    while (1)
        switch (lookahead) {
        case '+': case '-':
            t = lookahead;
            match(lookahead); term(); emit(t, NONE);
            continue;
        default:
            return;
        }
}
term()
{
    int t;
    factor();
    while (1)
        switch (lookahead) {
        case '*': case '/': case DIV: case MOD:
            t = lookahead;
            match(lookahead); factor(); emit(t, NONE);
            continue;
        default:
            return;
        }
}
factor()
{
    switch (lookahead) {
    case '(':
        match('('); expr(); match(')'); break;
    case NUM:
        emit(NUM, tokenval); match(NUM); break;
    case ID:
        emit(ID, tokenval); match(ID); break;
    default:
        error("syntax error");
    }
}

match(t)
int t;
{
    if (lookahead == t)
        lookahead = lexan();
    else
        error("syntax error");
}
emit(t, tval)  /* generates output */
int t, tval;
{
    switch (t) {
    case '+': case '-': case '*': case '/':
        printf("%c\n", t); break;
    case DIV:
        printf("DIV\n"); break;
    case MOD:
        printf("MOD\n"); break;
    case NUM:
        printf("%d\n", tval); break;
    case ID:
        printf("%s\n", symtable[tval].lexptr); break;
    default:
        printf("token %d, tokenval %d\n", t, tval);
    }
}
char lexemes[STRMAX];
int lastchar = -1;    /* last used position in lexemes */
struct entry symtable[SYMMAX];
int lastentry = 0;    /* last used position in symtable */

int lookup(s)  /* returns position of entry for s */
char s[];
{
    int p;
    for (p = lastentry; p > 0; p = p - 1)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}
int insert(s, tok)  /* returns position of entry for s */
char s[];
int tok;
{
    int len;
    len = strlen(s);    /* strlen computes length of s */
    if (lastentry + 1 >= SYMMAX)
        error("symbol table full");
    if (lastchar + len + 1 >= STRMAX)
        error("lexemes array full");
    lastentry = lastentry + 1;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar = lastchar + len + 1;
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
"div", DIV,
"mod", MOD,
0,
};
}
;
init();
parse ( )
EXERCISES

2.1 Consider the context-free grammar

    S → S S + | S S * | a
a) Show how the string aa+a* can be generated by this grammar.
b) Construct a parse tree for this string.
c) What language is generated by this grammar? Justify your
answer.
b) S → + S S | - S S | a
c) S → S ( S ) S | ε
d) S → a S b S | b S a S | ε
e) S → a | S + S | S S | S * | ( S )
*2.5 a) Show that all binary strings generated by the following grammar
have values divisible by 3. Hint. Use induction on the number of
nodes in a parse tree.
    num → 11 | 1001 | num 0 | num num
b) Does the grammar generate all binary strings with values divisible
by 3?
2.13 The following rules define the translation of an English word into pig
Latin:
a) If the word begins with a nonempty string of consonants, move the
initial consonant string to the back of the word and add the suffix
AY; e.g., pig becomes igpay.
b) If the word begins with a vowel, add the suffix YAY; e.g., owl
becomes owlyay.
c) U following a Q is a consonant.
d) Y at the beginning of a word is a vowel if it is not followed by a
vowel.
    for ( expr1 ; expr2 ; expr3 ) stmt

The first expression is executed before the loop; it is typically used for
initializing the loop index. The second expression is a test made
before each iteration of the loop; the loop is exited if the expression
becomes 0. The loop itself consists of the statement { stmt expr3 ; }.
    for i := 1 step 10 - j until 10 * j do j := j + 1
Three semantic definitions can be given for this statement. One possible
meaning is that the limit 10 * j and increment 10 - j are to be
evaluated once before the loop, as in PL/I. For example, if j = 5
before the loop, we would run through the loop ten times and exit. A
second, completely different, meaning would ensue if we are required
to evaluate the limit and increment every time through the loop. For
example, if j = 5 before the loop, the loop would never terminate. A
third meaning is given by languages such as Algol. When the increment
is negative, the test made for termination of the loop is
i < 10*j, rather than i > 10*j. For each of these three semantic
definitions construct a syntax-directed translation scheme to translate
these for-loops into stack-machine code.
2.16 Consider the following grammar fragment for if-then- and if-then-else-statements:
PROGRAMMING EXERCISES
P2.1 Implement a translator from integers to Roman numerals based on the
syntax-directed translation scheme developed in Exercise 2.9.
P2.2 Modify the translator in Section 2.9 to produce as output code for the
abstract stack machine of Section 2.8.
P2.3 Modify the error recovery module of the translator in Section 2.9 to
skip to the next input expression on encountering an error.
P2.4 Extend the translator in Section 2.9 to handle all Pascal expressions.
Lexical
Analysis
This chapter deals with techniques for specifying and implementing lexical
analyzers. A simple way to build a lexical analyzer is to construct a diagram
that illustrates the structure of the tokens of the source language, and then to
hand-translate the diagram into a program for finding tokens. Efficient
lexical analyzers can be produced in this manner.
The techniques used to implement lexical analyzers can also be applied to
other areas such as query languages and information retrieval systems. In
each application, the underlying problem is the specification and design of
programs that execute actions triggered by patterns in strings. Since
pattern-directed programming is widely useful, we introduce a pattern-action language
called Lex for specifying lexical analyzers. In this language, patterns are
specified by regular expressions, and a compiler for Lex can generate an
efficient finite-automaton recognizer for the regular expressions.
Several other languages use regular expressions to describe patterns. For
example, the patternscanning language AWK uses regular expressions to
select input lines for processing and the UNIX system shell allows a user to
refer to a set of file names by writing a regular expression. The UNIX
command rm *.o, for instance, removes all files with names ending in ".o".¹
A software tool that automates the construction of lexical analyzers allows
people with different backgrounds to use pattern matching in their own
application areas. For example, Jarvis [1976] used a lexical-analyzer generator to
create a program that recognizes imperfections in printed circuit boards. The
circuits are digitally scanned and converted into "strings" of line segments at
different angles. The "lexical analyzer" looked for patterns corresponding to
imperfections in the string of line segments. A major advantage of a
lexical-analyzer generator is that it can utilize the best-known pattern-matching
algorithms and thereby create efficient lexical analyzers for people who are not
experts in pattern-matching techniques.
¹ The expression *.o is a variant of the usual notation for regular expressions. Exercises 3.10
and 3.14 mention some commonly used variants of regular expression notations.
84 LEXICAL ANALYSIS SEC. 3.1
the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction, summarized schematically in
Fig. 3.1, is commonly implemented by making the lexical analyzer be a
subroutine or a coroutine of the parser. Upon receiving a "get next token"
command from the parser, the lexical analyzer reads input characters until it can
identify the next token.
SEC. 3.1 THE ROLE OF THE LEXICAL ANALYZER 85
one or the other of these phases. For example, a parser embodying the
conventions for comments and white space is significantly more complex
than one that can assume comments and white space have already been
removed by a lexical analyzer. If we are designing a new language,
separating the lexical and syntactic conventions can lead to a cleaner
overall language design.
When talking about lexical analysis, we use the terms "token," "pattern," and
"lexeme" with specific meanings. Examples of their use are shown in Fig.
3.2. In general, there is a set of strings in the input for which the same token
is produced as output. This set of strings is described by a rule called a
pattern associated with the token. The pattern is said to match each string in the
set. A lexeme is a sequence of characters in the source program that is
matched by the pattern for a token. For example, in the Pascal statement

    const pi = 3.1416;

the substring pi is a lexeme for the token "identifier."
[Fig. 3.2. Examples of tokens, with sample lexemes and informal descriptions of their patterns.]
just the single string const that spells out the keyword. The pattern for the
token relation is the set of all six Pascal relational operators. To describe
precisely the patterns for more complex tokens like id (for identifier) and num
(for number) we shall use the regular-expression notation developed in Section
3.3.
Certain language conventions impact the difficulty of lexical analysis.
Languages such as Fortran require certain constructs in fixed positions on the
input line. Thus the alignment of a lexeme may be important in determining
the correctness of a source program. The trend in modern language design is
toward free-format input, allowing constructs to be positioned anywhere on
the input line, so this aspect of lexical analysis is becoming less important.
The treatment of blanks varies greatly from language to language. In some
languages, such as Fortran or Algol 68, blanks are not significant except in
literal strings. They can be added at will to improve the readability of a
program. The conventions regarding blanks can greatly complicate the task of
identifying tokens.
A popular example that illustrates the potential difficulty of recognizing
tokens is the DO statement of Fortran. In the statement
DO 5 I = 1.25
we cannot tell until we have seen the decimal point that DO is not a keyword,
but rather part of the identifier D05I. On the other hand, in the statement
DO 5 I = 1,25
we have seven tokens, corresponding to the keyword DO, the statement label
5, the identifier I, the operator =, the constant 1, the comma, and the
constant 25. Here, we cannot be sure until we have seen the comma that DO is a
keyword. To alleviate this uncertainty, Fortran 77 allows an optional comma
between the label and index of the DO statement. The use of this comma is
encouraged because it helps make the DO statement clearer and more
readable.
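The lookahead this forces on a Fortran lexer can be caricatured in a few lines of C. This is a toy of our own, not code from the text: it decides whether DO begins a loop header simply by scanning past the = for a comma, and it ignores complications such as parentheses and character constants that a real Fortran scanner must handle.

```c
#include <string.h>

/* Toy illustration: after "DO <label> <id> =", if a comma appears
   before the end of the statement, DO is a keyword (a loop header);
   otherwise the blanks are insignificant and DO starts an identifier. */
int do_is_keyword(const char *stmt)
{
    const char *p = strchr(stmt, '=');   /* look past the = sign */
    if (p == NULL)
        return 0;
    for (p = p + 1; *p != '\0'; p++)
        if (*p == ',')
            return 1;                    /* "DO 5 I = 1,25": DO-loop */
    return 0;                            /* "DO 5 I = 1.25": assignment */
}
```

On "DO 5 I = 1,25" the comma makes DO a keyword; on "DO 5 I = 1.25" no comma is found and the statement assigns to DO5I.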
In many languages, certain strings are reserved; i.e., their meaning is
predefined and cannot be changed by the user. If keywords are not reserved,
then the lexical analyzer must distinguish between a keyword and a
user-defined identifier. In PL/I, keywords are not reserved; thus, the rules for
distinguishing keywords from identifiers are quite complicated, as the following
PL/I statement illustrates:

    IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;
When more than one pattern matches a lexeme, the lexical analyzer must
provide additional information about the particular lexeme that matched to the
subsequent phases of the compiler. For example, the pattern num matches
both the strings 0 and 1, but it is essential for the code generator to know
what string was actually matched.
The lexical analyzer collects information about tokens into their associated
attributes. The tokens influence parsing decisions; the attributes influence the
translation of tokens. As a practical matter, a token usually has only a single
attribute: a pointer to the symbol-table entry in which the information about
the token is kept. For diagnostic purposes, we may be interested in both the
lexeme for an identifier and the line number on which it was first seen. Both
these items of information can be stored in the symbol-table entry for the
identifier.
Example 3.1. The tokens and associated attribute-values for the Fortran
statement

    E = M * C ** 2
Lexical Errors
Few errors are discernible at the lexical level alone, because a lexical analyzer
has a very localized view of a source program, if the string f i is encountered
in a C program for the first time in the context
fi ( a == f (x) ) •
•
•
3. Write the lexical analyzer in assembly language and explicitly manage the
reading of input.
The three choices are listed in order of increasing difficulty for the
implementor. Unfortunately, the harder-to-implement approaches often yield faster
lexical analyzers. Since the lexical analyzer is the only phase of the compiler
that reads the source program characterbycharacter, it is possible to spend a
considerable amount of time in the lexical analysis phase, even though the
later phases are conceptually more complex.
Thus, the speed of lexical
analysis is a concern in compiler design. While the bulk of the chapter is
devoted to the first approach, the design and use of an automatic generator,
we also consider techniques that are helpful in manual design. Section 3.4
discusses transition diagrams, which are a useful concept for the organization
of a hand-designed lexical analyzer.
Buffer Pairs
For many source languages, there are times when the lexical analyzer needs to
look ahead several characters beyond the lexeme for a pattern before a match
can be announced. The lexical analyzers in Chapter 2 used a function
ungetc to push lookahead characters back into the input stream. Because a
large amount of time can be consumed moving characters, specialized
buffering techniques have been developed to reduce the amount of overhead
required to process an input character. Many buffering schemes can be used,
but since the techniques are somewhat dependent on system parameters, we
shall only outline the principles behind one class of schemes here.
We use a buffer divided into two N-character halves, as shown in Fig. 3.3.
Typically, N is the number of characters on one disk block, e.g., 1024 or
4096.
[Fig. 3.3. An input buffer in two halves, holding E = M * C ** 2 eof, with lexeme_beginning and forward pointers.]
We read N input characters into each half of the buffer with one system
read command, rather than invoking a read command for each input character.
If fewer than N characters remain in the input, then a special character
eof is read into the buffer after the input characters, as in Fig. 3.3. That is,
eof marks the end of the source file and is different from any input character.
Two pointers to the input buffer are maintained. The string of characters
between the two pointers is the current lexeme. Initially, both pointers point
to the first character of the next lexeme to be found. One, called the forward
pointer, scans ahead until a match for a pattern is found. Once the next
lexeme is determined, the forward pointer is set to the character at its right end.
After the lexeme is processed, both pointers are set to the character
immediately past the lexeme. With this scheme, comments and white space can be
treated as patterns that yield no token.
If the forward pointer is about to move past the halfway mark, the right
half is filled with N new input characters. If the forward pointer is about to
move past the right end of the buffer, the left half is filled with N new
characters and the forward pointer wraps around to the beginning of the buffer.
This buffering scheme works quite well most of the time, but with it the
amount of lookahead is limited, and this limited lookahead may make it
impossible to recognize tokens in situations where the distance that the
forward pointer must travel is more than the length of the buffer. For example,
if we see
if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1
Sentinels
[Fig. 3.5. Sentinels eof at the end of each buffer half, with lexeme_beginning and forward pointers.]
With the arrangement of Fig. 3.5, we can use the code shown in Fig. 3.6 to
advance the forward pointer (and test for the end of the source file). Most of
the time the code performs only one test to see whether forward points to an
eof. Only when we reach the end of a buffer half or the end of the file do we
perform more tests. Since N input characters are encountered between eof's,
the average number of tests per input character is very close to 1.
forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end
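The sentinel scheme can be made concrete in C. In this sketch of ours, the "file" is held in a C string and the character '\0' stands in for eof, so the input itself must not contain '\0' (just as the text requires that eof differ from every input character); the struct layout and function names are our own.

```c
#define N 4                      /* characters per buffer half */
#define EOF_CH '\0'              /* sentinel marking a half or file end */

/* Each buffer half ends with a sentinel slot; the common case in
   advance() performs only the single test against the sentinel. */
struct buf {
    char half[2][N + 1];         /* two halves, each with a sentinel slot */
    const char *src;             /* remaining input characters */
    int cur;                     /* which half forward is scanning */
    int pos;                     /* forward's offset within that half */
};

static void reload(struct buf *b, int half)
{
    int i;
    for (i = 0; i < N && *b->src != '\0'; i++)
        b->half[half][i] = *b->src++;
    b->half[half][i] = EOF_CH;   /* sentinel after the characters read */
}

void buf_init(struct buf *b, const char *src)
{
    b->src = src; b->cur = 0; b->pos = 0;
    reload(b, 0);
}

int advance(struct buf *b)       /* next character, or EOF_CH at end */
{
    char c = b->half[b->cur][b->pos++];
    if (c == EOF_CH) {
        if (b->pos - 1 == N) {   /* sentinel at end of half: switch over */
            b->cur = 1 - b->cur; b->pos = 0;
            reload(b, b->cur);
            return advance(b);
        }
        b->pos--;                /* eof within a half: stay on sentinel */
        return EOF_CH;
    }
    return c;
}
```

With N = 4 and input "abcdefghij", advance() yields the ten characters in order and then EOF_CH, reloading a half only at every fourth character.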
We also need to decide how to process the character scanned by the forward
pointer; does it mark the end of a token, does it represent progress in finding
a particular keyword, or what? One way to structure these tests is to use a
case statement, if the implementation language has one. The test
if forward↑ = eof
The term alphabet or character class denotes any finite set of symbols.
Typical examples of symbols are letters and characters. The set {0,1} is the binary
alphabet. ASCII and EBCDIC are two examples of computer alphabets.
A string over some alphabet is a finite sequence of symbols drawn from that
alphabet. In language theory, the terms sentence and word are often used as
synonyms for the term "string." The length of a string s, usually written |s|,
is the number of occurrences of symbols in s.
The term language denotes any set of strings over some fixed alphabet.
This definition is very broad. Abstract languages like ∅, the empty set, or
{ε}, the set containing only the empty string, are languages under this
definition. So too are the set of all syntactically well-formed Pascal programs and
the set of all grammatically correct English sentences, although the latter two
sets are much more difficult to specify. Also note that this definition does not
ascribe any meaning to the strings in a language. Methods for ascribing
meanings to strings are discussed in Chapter 5.
If x and y are strings, then the concatenation of x and y, written xy, is the
string formed by appending y to x. If we think of concatenation as a
"product," we can define string "exponentiation" as follows: define s⁰ to be ε,
and for i > 0 define sⁱ to be sⁱ⁻¹s. Since εs is s itself, s¹ = s. Then, s² = ss,
s³ = sss, and so on.
SEC. 3.3 SPECIFICATION OF TOKENS 93
[Fig. 3.7. Terms for parts of a string: prefix, suffix, substring, proper prefix/suffix/substring, and subsequence.]
[Fig. 3.8. Definitions of operations on languages: union, concatenation, Kleene closure, and positive closure.]
3. Suppose r and s are regular expressions denoting the languages L(r) and
L(s). Then,

    a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
    b) (r)(s) is a regular expression denoting L(r)L(s).
    c) (r)* is a regular expression denoting (L(r))*.
    d) (r) is a regular expression denoting L(r).

Unnecessary parentheses can be avoided in regular expressions if we adopt
the conventions that:

1. the unary operator * has the highest precedence and is left associative,
2. concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.
2. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all
strings of a's and b's of length two. Another regular expression for this
same set is aa | ab | ba | bb.

3. The regular expression a* denotes the set of all strings of zero or more
a's, i.e., {ε, a, aa, aaa, … }.

4. The regular expression (a|b)* denotes the set of all strings containing
zero or more instances of an a or b, that is, the set of all strings of a's
and b's. Another regular expression for this set is (a*b*)*.
and t.

This rule says that extra pairs of parentheses may be placed around regular expressions if we desire.
[Fig. 3.9. Algebraic laws that hold for regular expressions r, s, and t.]
    digit → 0 | 1 | ··· | 9
    digits → digit digit*
    optional-fraction → . digits | ε
    optional-exponent → ( E ( + | - | ε ) digits ) | ε
    num → digits optional-fraction optional-exponent
Notational Shorthands
1. One or more instances. The unary postfix operator + means "one or
more instances of." If r is a regular expression, then (r)+ is a regular
expression that denotes the language (L(r))+.
2. Zero or one instance. The unary postfix operator ? means "zero or one
instance of." The notation r? is a shorthand for re. If r is a regular
expression, then (r)? is a regular expression that denotes the language
L{r) U {e}. For example, using the ^ and ? operators, we can rewrite
the regular definition for num in Example 3.5 as
    digit → 0 | 1 | ··· | 9
    digits → digit+
    optional-fraction → ( . digits )?
    optional-exponent → ( E ( + | - )? digits )?
    num → digits optional-fraction optional-exponent
3. Character classes. The notation [abc], where a, b, and c are alphabet
symbols, denotes the regular expression a | b | c. An abbreviated character
class such as [a-z] denotes the regular expression a | b | ··· | z. Using
character classes, we can describe identifiers as being strings generated by
the regular expression

    [A-Za-z][A-Za-z0-9]*
Nonregular Sets
To see the limits of the descriptive power of regular expressions, here we give
examples of programming language constructs that cannot be described by regular
expressions. Proofs of these assertions can be found in the references.
Regular expressions cannot be used to describe balanced or nested
constructs. For example, the set of all strings of balanced parentheses cannot be
described by a regular expression. On the other hand, this set can be
specified by a context-free grammar.
Repeating strings cannot be described by regular expressions. The set

    { wcw | w is a string of a's and b's }

cannot be described by a regular expression, nor can it be specified by a
context-free grammar.

Consider the following grammar fragment for branching statements:

    stmt → if expr then stmt
         | if expr then stmt else stmt
         | ε
    expr → term relop term
         | term
    term → id
         | num
where the terminals if, then, else, relop, id, and num generate sets of strings
given by the following regular definitions:
    if → if
    then → then
    else → else
    relop → < | <= | = | <> | > | >=
    id → letter ( letter | digit )*
    num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?
SEC. 3.4 RECOGNITION OF TOKENS 99
    delim → blank | tab | newline
    ws → delim+
If a match for ws is found, the lexical analyzer does not return a token to the
parser. Rather, it proceeds to find a token following the white space and
returns that to the parser.
Our goal is to construct a lexical analyzer that will isolate the lexeme for
the next token in the input buffer and produce as output a pair consisting of
the appropriate token and attributevalue, using the translation table given in
Fig. 3.10. The attributevalues for the relational operators are given by the
symbolic constants LT, LE, EQ, NE, GT, GE.
[Fig. 3.10. Table of regular-expression patterns with the token and attribute-value returned for each.]
depict the actions that take place when a lexical analyzer is called by the
parser to get the next token, as suggested by Fig. 3.1. Suppose the input
buffer is as in Fig. 3.3 and the lexeme-beginning pointer points to the
character following the last lexeme found. We use a transition diagram to keep
track of information about characters that are seen as the forward pointer
scans the input. We do so by moving from position to position in the diagram
as characters are read.
Positions in a transition diagram are drawn as circles and are called states.
The states are connected by arrows, called edges. Edges leaving state s have
labels indicating the input characters that can next appear after the transition
diagram has reached state s. The label other refers to any character that is
not indicated by any of the other edges leaving s.
We assume the transition diagrams of this section are deterministic; that is,
no symbol can match the labels of two edges leaving one state. Starting in
Section 3.5, we shall relax this condition, making life much simpler for the
designer of the lexical analyzer and, with proper tools, no harder for the
implementor.
One state is labeled the start state; it is the initial state of the transition
diagram where control resides when we begin to recognize a token. Certain
states may have actions that are executed when the flow of control reaches
that state. On entering a state we read the next input character. If there is
an edge from the current state whose label matches this input character, we
then go to the state pointed to by the edge. Otherwise, we indicate failure.
Figure 3.11 shows a transition diagram for the patterns >= and >. The
transition diagram works as follows. Its start state is state 0. In state 0, we
read the next input character. The edge labeled > from state 0 is to be
followed to state 6 if this input character is >.

[Fig. 3.11. Transition diagram for the pattern >=, with start state 0.]
On reaching state 6 we read the next input character. The edge labeled =
from state 6 is to be followed to state 7 if this input character is an =.
Otherwise, the edge labeled other indicates that we are to go to state 8. The double
circle on state 7 indicates that it is an accepting state, a state in which the
token >= has been found.
Notice that the character > and another extra character are read as we
follow the sequence of edges from the start state to the accepting state 8. Since
the extra character is not a part of the relational operator >, we must retract
the forward pointer one character. We use a * to indicate states on which this
input retraction must take place.
Example 3.7. A transition diagram for the token relop is shown in Fig. 3.12.
Notice that Fig. 3.11 is a part of this more complex transition diagram.
[Fig. 3.12. Transition diagram for relop; among its accepting states, state 2 returns (relop, LE), state 4 (with retraction) returns (relop, LT), and state 8 (with retraction) returns (relop, GT).]
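A hand translation of the relop transition diagram into C might look as follows. The enum values and the *len out-parameter are our own stand-ins for the (token, attribute) pair and for the forward-pointer retraction; the starred states simply consume one character fewer.

```c
/* Hand-coded transition diagram for the Pascal relational operators. */
enum relop_attr { LT, LE, EQ, NE, GT, GE, NONE_REL };

/* Scans the start of s for a relational operator; stores in *len how
   many characters were consumed (the retracting states consume one). */
enum relop_attr relop(const char *s, int *len)
{
    switch (s[0]) {
    case '<':                                    /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }   /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }   /* state 3 */
        *len = 1; return LT;                        /* state 4, retract */
    case '=':
        *len = 1; return EQ;                        /* state 5 */
    case '>':                                    /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }   /* state 7 */
        *len = 1; return GT;                        /* state 8, retract */
    default:
        *len = 0; return NONE_REL;
    }
}
```

On input "<=" the function reports LE after two characters; on "<a" it reports LT after one character, mirroring the retraction on the starred state.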
Example 3.8. Since keywords are sequences of letters, they are exceptions to
the rule that a sequence of letters and digits starting with a letter is an
identifier. Rather than encode the exceptions into a transition diagram, a useful
trick is to treat keywords as special identifiers, as in Section 2.7. When the
accepting state in Fig. 3.13 is reached, we execute some code to determine if
[Fig. 3.13. Transition diagram for identifiers and keywords: from the start state, an edge labeled letter leads to a state with a loop labeled letter or digit; an edge labeled other leads to the accepting state.]
The transition diagram in Fig. 3.13 uses gettoken() and install_id() to obtain
the token and attribute-value, respectively, to be returned. The procedure
install_id() has access to the buffer, where the identifier lexeme has been
located. The symbol table is consulted to determine whether the lexeme is a
keyword or a genuine identifier. This trick pays off when the lexical analyzer
is coded by hand. Without doing so, the number of states in a lexical analyzer
for a typical programming language is several hundred, while using the trick,
fewer than a hundred states will probably suffice.
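As a sketch of the trick, suppose the symbol table is preloaded with the keywords before lexical analysis begins; a single lookup routine then serves keywords and identifiers alike. All names and token codes below are illustrative, not the book's.

```c
#include <assert.h>
#include <string.h>

/* Illustrative token codes. */
enum { ID = 1, IF = 2, THEN = 3, ELSE = 4 };

struct entry { const char *lexeme; int token; };

/* Symbol table preloaded with the keywords, each marked with its
   token code, so the transition diagram needs no special cases. */
static struct entry symtable[100] = {
    { "if", IF }, { "then", THEN }, { "else", ELSE }
};
static int nentries = 3;

/* Return the token for a lexeme: a keyword code if it was preloaded,
   otherwise ID (entering new identifiers into the table). The caller
   must keep the lexeme string alive, since only the pointer is stored. */
int lookup_token(const char *lexeme)
{
    for (int i = 0; i < nentries; i++)
        if (strcmp(symtable[i].lexeme, lexeme) == 0)
            return symtable[i].token;
    symtable[nentries].lexeme = lexeme;    /* new identifier */
    symtable[nentries].token = ID;
    nentries++;
    return ID;
}
```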
[Fig. 3.14. Transition diagrams for unsigned numbers, with edges labeled digit for the integer, fraction, and exponent parts.]
Note that the definition is of the form digits fraction? exponent? in which
fraction and exponent are optional.
The lexeme for a given token must be the longest possible. For example,
the lexical analyzer must not stop after seeing 12 or even 12.3 when the
input is 12.3E4. Starting at states 25, 20, and 12 in Fig. 3.14, accepting
states will be reached after 12, 12.3, and 12.3E4 are seen, respectively,
provided 12.3E4 is followed by a non-digit in the input. The transition
diagrams with start states 25, 20, and 12 are for digits, digits fraction, and
digits fraction? exponent?, respectively, so the start states must be tried in the
reverse order 12, 20, 25.
The action when any of the accepting states 19, 24, or 27 is reached is to
call a procedure install_num that enters the lexeme into a table of numbers
and returns a pointer to the created entry. The lexical analyzer returns the
token num with this pointer as the lexical value. □
Information about the language that is not in the regular definitions of the
tokens can be used to pinpoint errors in the input. For example, on input
1. < x, we fail in states 14 and 22 in Fig. 3.14 with next input character <.
Rather than returning the number 1, we may wish to report an error and
continue as if the input were 1.0 < x. Such knowledge can also be used to
simplify the transition diagrams, because error-handling may be used to
recover from some situations that would otherwise lead to failure.
There are several ways in which the redundant matching in the transition
diagrams of Fig. 3.14 can be avoided. One approach is to rewrite the
transition diagrams by combining them into one, a nontrivial task in general.
Another is to change the response to failure during the process of following a
diagram. An approach explored later in this chapter allows us to pass through
several accepting states; we revert to the last accepting state that we passed
through when failure occurs.
If there is an edge from the current state labeled by the character read, then
control is transferred to the code for the state pointed to by that edge. If
there is no such edge, and the current state is not one that indicates a token
has been found, then a routine fail() is invoked to retract the forward pointer
to the position of the beginning pointer and to initiate a search for a token
specified by the next transition diagram. If there are no other transition
diagrams to try, fail() calls an error-recovery routine.
To return tokens we use a global variable lexical_value, which is
assigned the pointers returned by functions install_id() and install_num().
We use a case statement to find the start state of the next transition
diagram. In the C implementation in Fig. 3.15, two variables state and
start keep track of the present state and the starting state of the current
transition diagram. The state numbers in the code are for the transition
diagrams of Figures 3.12-3.14.
Edges in transition diagrams are traced by repeatedly selecting the code
fragment for a state and executing the code fragment to determine the next
state, as shown in Fig. 3.16. We show the code for state 0, as modified in
Example 3.10 to handle white space, and the code for two of the transition
diagrams from Figs. 3.13 and 3.14. Note that the C construct

    while (1) stmt

repeats stmt forever.
* A more efficient implementation would use an inline macro in place of the function nextchar().
    int fail()
    {
        forward = token_beginning;
        switch (start) {
            case 0:  start = 9;  break;
            case 9:  start = 12; break;
            case 12: start = 20; break;
            case 20: start = 25; break;
            case 25: recover();  break;
            default: /* compiler error */
                break;
        }
        return start;
    }
If the implementation language does not have a case statement, we can
create an array for each state, indexed by characters. If state_s is such an
array, then state_s[c] is a pointer to a piece of code that must be executed
whenever the lookahead character is c. This code would normally end with a
goto to code for the next state. The array for state s is referred to as the
indirect transfer table for s.
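Standard C has no portable way to store code addresses in such an array, so the sketch below stores the next state rather than a pointer to code; in a language with label-valued variables the entries could point directly at code. The tiny recognizer (for the pattern ab*) is purely illustrative.

```c
#include <assert.h>

/* A per-state array indexed by characters, in the spirit of an
   indirect transfer table: transfer[s][c] names where to go next. */
#define NSTATES 3
#define NCHARS  128

static int transfer[NSTATES][NCHARS];

static void init_transfer(void)
{
    for (int s = 0; s < NSTATES; s++)
        for (int c = 0; c < NCHARS; c++)
            transfer[s][c] = 2;          /* state 2 = dead state */
    transfer[0]['a'] = 1;                /* start --a--> accepting */
    transfer[1]['b'] = 1;                /* accepting --b--> accepting */
}

/* Run the table on a string; accept exactly the strings in ab*. */
int accepts(const char *s)
{
    init_transfer();
    int st = 0;
    for (; *s; s++)
        st = transfer[st][(unsigned char)*s];
    return st == 1;
}
```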
    token nexttoken()
    {   while (1) {
            switch (state) {
            case 0:  c = nextchar();
                     /* c is lookahead character */
                     if (c == blank || c == tab || c == newline) {
                         state = 0;
                         lexeme_beginning++;
                         /* advance beginning of lexeme */
                     }
                     else if (c == '<') state = 1;
                     else if (c == '=') state = 5;
                     else if (c == '>') state = 6;
                     else state = fail();
                     break;
            /* ... cases for states 1 through 8 ... */
            case 9:  c = nextchar();
                     if (isletter(c)) state = 10;
                     else state = fail();
                     break;
            /* ... cases for states 10 through 26 ... */
            case 11: retract(1); install_id();
                     return ( gettoken() );
            case 27: retract(1); install_num();
                     return ( NUM );
            }
        }
    }
working program using the transition diagram techniques of the previous section.
Lex is generally used in the manner depicted in Fig. 3.17. First, a specifi
cation of a lexical analyzer is prepared by creating a program lex.l in the
Lex language. Then, lex.l is run through the Lex compiler to produce a C
program lex.yy.c. The program lex.yy.c consists of a tabular represen
tation of a transition diagram constructed from the regular expressions of
lex.l, together with a standard routine that uses the table to recognize lex
emes. The actions associated with regular expressions in lex.l are pieces of
C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run
through the C compiler to produce an object program a.out, which is the
lexical analyzer that transforms an input stream into a sequence of tokens.
[Fig. 3.17. Creating a lexical analyzer with Lex. The Lex source program lex.l is run through the Lex compiler to produce lex.yy.c; lex.yy.c is run through the C compiler to produce a.out; and a.out maps an input stream to a sequence of tokens.]
Lex Specifications
declarations
%%
translation rules
%%
auxiliary procedures
    p_1    { action_1 }
    p_2    { action_2 }
     ...
    p_n    { action_n }
where each p_i is a regular expression and each action_i is a program fragment
describing what action the lexical analyzer should take when pattern p_i
matches a lexeme. In Lex, the actions are written in C; in general, however,
they can be in any implementation language.
The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and
loaded with the lexical analyzer.
A lexical analyzer created by Lex behaves in concert with a parser in the
following manner. When activated by the parser, the lexical analyzer begins
reading its remaining input, one character at a time, until it has found the
longest prefix of the input that is matched by one of the regular expressions
p_i. Then, it executes action_i. Typically, action_i will return control to the
parser. However, if it does not, then the lexical analyzer proceeds to find
more lexemes, until an action causes control to return to the parser. The
repeated search for lexemes until an explicit return allows the lexical analyzer
to process white space and comments conveniently.
The lexical analyzer returns a single quantity, the token, to the parser. To
pass an attribute value with information about the lexeme, we can set a global
variable that the lexical analyzer and the parser share.
Example 3.11. Figure 3.18 is a Lex program that recognizes the tokens of
Fig. 3.10 and returns the token found. A few observations about the code
will introduce us to many of the important features of Lex.
In the declarations section, we see (a place for) the declaration of certain
manifest constants used by the translation rules. These declarations are
surrounded by the special brackets %{ and %}.
It is common for the program lex.yy.c to be used as a subroutine of a parser generated by
Yacc, a parser generator to be discussed in Chapter 4. In this case, the declaration of the manifest
constants would be provided by the parser, when it is compiled with the program lex.yy.c.
    %{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE, ... */
    %}

    /* regular definitions */
    delim    [ \t\n]
    ws       {delim}+
    letter   [A-Za-z]
    digit    [0-9]
    id       {letter}({letter}|{digit})*
    number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

    %%

    {ws}     {/* no action and no return */}
    if       {return(IF);}
    then     {return(THEN);}
    else     {return(ELSE);}
    {id}     {yylval = install_id(); return(ID);}
    {number} {yylval = install_num(); return(NUMBER);}
The translation rules follow the first %%. The first rule says that if we see
ws, that is, any maximal sequence of blanks, tabs, and newlines, we take no
action. In particular, we do not return to the parser. Recall that the structure
of the lexical analyzer is such that it keeps trying to recognize tokens, until
the action associated with one found causes a return.
The second rule says that if the letters if are seen, return the token IF,
which is a manifest constant representing some integer understood by the
parser to be the token if. The next two rules handle keywords then and
else similarly.
In the rule for id, we see two statements in the associated action. First, the
variable yylval is set to the value returned by procedure install_id; the
definition of that procedure is in the third section, yylval is a variable
^ Actually, Lex handles the character class [+-] correctly without the backslash, because the
minus sign appearing at the end cannot represent a range.
** We did so because < and > are Lex metasymbols; they surround the names of "states," enabling
Lex to change state when encountering certain tokens, such as comments or quoted strings, that
must be treated differently from the usual text. There is no need to surround the equal sign by
quotes, but neither is it forbidden.
whose definition appears in the Lex output lex.yy.c, and which is also
available to the parser. The purpose of yylval is to hold the lexical value
returned, since the second statement of the action, return (ID), can only
return a code for the token class.
We do not show the details of the code for install_id. However, we
may suppose that it looks in the symbol table for the lexeme matched by the
pattern id. Lex makes the lexeme available to routines appearing in the third
section through two variables yytext and yyleng. The variable yytext
corresponds to the variable that we have been calling lexeme-beginning, that
is, a pointer to the first character of the lexeme; yyleng is an integer telling
how long the lexeme is. For example, if install_id fails to find the
identifier in the symbol table, it might create a new entry for it. The yyleng
characters of the input, starting at yytext, might be copied into a character
array and delimited by an end-of-string marker as in Section 2.7. The new
symbol table entry would point to the beginning of this copy.
Numbers are treated similarly by the next rule, and for the last six rules,
yylval is used to return a code for the particular relational operator found,
while the actual return value is the code for token relop in each case.
Suppose the lexical analyzer resulting from the program of Fig. 3.18 is
given an input consisting of two tabs, the letters if, and a blank. The two
tabs are the longest initial prefix of the input matched by a pattern, namely
the pattern ws. The action for ws is to do nothing, so the lexical analyzer
moves the lexeme-beginning pointer, yytext, to the i and begins to search
for another token.
The next lexeme to be matched is if. Note that the patterns if and {id}
both match this lexeme, and no pattern matches a longer string. Since the
pattern for keyword if precedes the pattern for identifiers in the list of Fig.
3.18, the conflict is resolved in favor of the keyword. In general, this
ambiguity-resolving strategy makes it easy to reserve keywords by listing them
ahead of the pattern for identifiers.
For another example, suppose <= are the first two characters read. While
pattern < matches the first character, it is not the longest pattern matching a
prefix of the input. Thus Lex's strategy of selecting the longest prefix
matched by a pattern makes it easy to resolve the conflict between < and <=
in the expected manner, by choosing <= as the next token.
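The two disambiguation rules can be illustrated in miniature. The sketch below uses literal strings instead of regular expressions; the scanning order guarantees that a strictly longer match wins and that, among matches of equal length, the first-listed pattern is kept. The names are illustrative.

```c
#include <assert.h>
#include <string.h>

/* Patterns in listed order, as in a Lex specification. */
static const char *patterns[] = { "if", "i", "<=", "<" };

/* Return the index of the winning pattern for a match at the start
   of input, or -1 if none matches: the longest prefix wins, and on
   ties the first-listed pattern is kept, because only a strictly
   longer match displaces the current best. */
int best_match(const char *input)
{
    int best = -1;
    size_t bestlen = 0;
    for (int i = 0; i < 4; i++) {
        size_t len = strlen(patterns[i]);
        if (strncmp(input, patterns[i], len) == 0 && len > bestlen) {
            best = i;
            bestlen = len;
        }
    }
    return best;
}
```

On input "if x" the keyword pattern "if" beats the shorter "i"; on "<=y" the two-character "<=" beats "<".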
Consider the two Fortran statements

    DO 5 I = 1.25
    DO 5 I = 1,25

In Fortran, blanks are not significant outside of comments and Hollerith
strings, so suppose that all removable blanks are stripped before lexical
analysis begins. The above statements then appear to the lexical analyzer as

    DO5I=1.25
    DO5I=1,25

In the first statement, we cannot tell until we see the decimal point that the
string DO is part of the identifier DO5I. In the second statement, DO is a
keyword by itself.
In Lex, we can write a pattern of the form r1/r2, where r1 and r2 are
regular expressions, meaning match a string in r1, but only if followed by a
string in r2. The regular expression r2 after the lookahead operator /
indicates the right context for a match; it is used only to restrict a match, not
to be part of the match. For example, a Lex specification that recognizes the
keyword DO in the context above is

    DO/({letter}|{digit})*=({letter}|{digit})*,
Example 3.12. The lookahead operator can be used to cope with another
difficult lexical analysis problem in Fortran: distinguishing keywords from
identifiers. For example, the input

    IF(I,J) = 3

is a perfectly good Fortran assignment statement in which IF names an
array, not a keyword. In contrast, IF is a keyword in conditional statements
such as

    IF ( condition ) statement

and

    IF ( condition ) THEN
        then-block
    ELSE
        else-block
    END IF
We note that every unlabeled Fortran statement begins with a letter and that
every right parenthesis used for subscripting or operand grouping must be
followed by a letter. A Lex pattern that uses this right context for the
keyword IF is

    IF / \( .* \) {letter}
The dot stands for "any character but newline" and the backslashes in front of
the parentheses tell Lex to treat them literally, not as metasymbols for
grouping in regular expressions (see Exercise 3.10). □
Another way to attack the problem posed by if-statements in Fortran is,
after seeing IF(, to determine whether IF has been declared an array. We
scan for the full pattern indicated above only if it has been so declared. Such
tests make the automatic implementation of a lexical analyzer from a Lex
specification harder, and they may even cost time in the long run, since
frequent checks must be made by the program simulating a transition diagram
to determine whether any such tests must be made. It should be noted that
tokenizing Fortran is such an irregular task that it is frequently easier to write
an ad hoc lexical analyzer for Fortran in a conventional programming
language than it is to use an automatic lexical analyzer generator.
any sequence of characters ending in */, with the added requirement that no
proper prefix ends in */.
1. a set of states S

2. a set of input symbols Σ (the input symbol alphabet)

3. a transition function move that maps state-symbol pairs to sets of states

4. a state s0 that is distinguished as the start (or initial) state

5. a set of states F distinguished as accepting (or final) states

An NFA can be represented diagrammatically by a labeled directed graph,
called a transition graph, in which the nodes are the states and the labeled
edges represent the transition function. This graph looks like a transition
diagram, but the same character can label two or more transitions out of one
state, and edges can be labeled by the special symbol ε as well as by input
symbols.
The transition graph for an NFA that recognizes the language (a \b)*abb is
shown in Fig. 3.19. The set of states of the NFA is {0, 1, 2, 3} and the input
symbol alphabet is {a, b}. State 0 in Fig. 3.19 is distinguished as the start
state, and the accepting state 3 is indicated by a double circle.
[Fig. 3.19. Transition graph of an NFA accepting (a|b)*abb: state 0 loops on a and b, with a path labeled a, b, b through states 1 and 2 to the accepting state 3.]
In a transition-table representation of an NFA, the entry for state i and input
symbol a is the set of states (or more likely in practice, a pointer to the set
of states) that can be reached by a transition from state i on input a. The
transition table for the NFA of Fig. 3.19 is shown in Fig. 3.20.
The transition table representation has the advantage that it provides fast
access to the transitions of a given state on a given character; its disadvantage
is that it can take up a lot of space when the input alphabet is large and most
transitions are to the empty set. Adjacency list representations of the
2. for each state s and input symbol a, there is at most one edge labeled a
leaving s.
A deterministic finite automaton has at most one transition from each state
on any input. It is therefore easy to determine whether the automaton
accepts an input string, since there is at most one path from the start
state labeled by that string. The following algorithm shows how to simulate
the behavior of a DFA on an input string.
Method. Apply the algorithm in Fig. 3.22 to the input string x. The function
move(s, c) gives the state to which there is a transition from state s on input
character c. The function nextchar returns the next character of the input
string x.
    s := s0;
    c := nextchar;
    while c ≠ eof do begin
        s := move(s, c);
        c := nextchar
    end;
    if s is in F then
        return "yes"
    else return "no";
Example 3.14. In Fig. 3.23, we see the transition graph of a deterministic
finite automaton accepting the same language (a|b)*abb as that accepted by
the NFA of Fig. 3.19. With this DFA and the input string ababb, Algorithm
3.1 follows the sequence of states 0, 1, 2, 1, 2, 3 and returns "yes".
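The algorithm of Fig. 3.22 is short enough to write out concretely. The sketch below hard-codes the transition table of the DFA of Fig. 3.23 for (a|b)*abb, encoding input a as column 0 and b as column 1; the names are illustrative.

```c
#include <assert.h>

/* Transition table of the DFA of Fig. 3.23 accepting (a|b)*abb. */
static const int dfa_move[4][2] = {
    /* a  b */
    {  1, 0 },   /* state 0 (start) */
    {  1, 2 },   /* state 1 */
    {  1, 3 },   /* state 2 */
    {  1, 0 }    /* state 3 (accepting) */
};

/* Algorithm 3.1: return 1 ("yes") if s is accepted, 0 ("no") otherwise. */
int dfa_accepts(const char *s)
{
    int state = 0;                          /* s := s0 */
    for (; *s; s++)
        state = dfa_move[state][*s == 'b']; /* s := move(s, c) */
    return state == 3;                      /* is s in F? */
}
```

On input ababb the state sequence is 0, 1, 2, 1, 2, 3, exactly as in Example 3.14.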
Input. An NFA N.

Output. A DFA D accepting the same language.

Method. Each state of D is a set of states that N could be in after reading
some sequence of input symbols, including all possible ε-transitions before or
after symbols are read. The start state of D is ε-closure(s0). States and
transitions are added to D using the algorithm of Fig. 3.25. A state of D is
an accepting state if it is a set of NFA states containing at least one accepting
state of N.
initialize eclosure(T) to T;
Example 3.15. Figure 3.27 shows another NFA N accepting the language
(a|b)*abb. (It happens to be the one in the next section, which will be
mechanically constructed from the regular expression.) Let us apply
Algorithm 3.2 to N. The start state of the equivalent DFA is ε-closure(0),
which is A = {0, 1, 2, 4, 7}, since these are exactly the states reachable from
state 0 via a path in which every edge is labeled ε. Note that a path can have
no edges, so state 0 is reached from itself by such a path.
The input symbol alphabet here is {a, b}. The algorithm of Fig. 3.25 tells
us to mark A and then to compute

    ε-closure(move(A, a))

Among the states in A, only 2 and 7 have transitions on a, to 3 and 8, so

    B = ε-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}

and Dtran[A, a] = B. Among the states in A, only 4 has a transition on b,
to 5, so

    C = ε-closure({5}) = {1, 2, 4, 5, 6, 7}

Thus, Dtran[A, b] = C.
If we continue this process with the now unmarked sets B and C, we
eventually reach the point where all sets that are states of the DFA are
marked. This is certain since there are "only" 2^11 different subsets of a set
of eleven states, and a set, once marked, is marked forever. The five
different sets of states we actually construct are:

    A = {0, 1, 2, 4, 7}          D = {1, 2, 4, 5, 6, 7, 9}
    B = {1, 2, 3, 4, 6, 7, 8}    E = {1, 2, 4, 5, 6, 7, 10}
    C = {1, 2, 4, 5, 6, 7}
State A is the start state, and state E is the only accepting state. The complete
transition table Dtran is shown in Fig. 3.28.
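The two operations the subset construction relies on, ε-closure and move, can be sketched with bitsets. The transition lists below are transcribed from the Thompson-construction NFA of Fig. 3.27 as described in Examples 3.15 and 3.17; the fixed-point loop is a simpler (if slightly less efficient) alternative to the stack-based computation of Fig. 3.26.

```c
#include <assert.h>

/* NFA of Fig. 3.27 for (a|b)*abb: states 0-10 as bits of an unsigned. */
#define NSTATES 11

static const unsigned eps[NSTATES] = {   /* ε-transitions */
    1u<<1 | 1u<<7,   /* 0 -> 1, 7 */
    1u<<2 | 1u<<4,   /* 1 -> 2, 4 */
    0,               /* 2 */
    1u<<6,           /* 3 -> 6 */
    0,               /* 4 */
    1u<<6,           /* 5 -> 6 */
    1u<<1 | 1u<<7,   /* 6 -> 1, 7 */
    0, 0, 0, 0       /* 7-10 */
};
static const unsigned on_a[NSTATES] = { 0,0,1u<<3,0,0,0,0,1u<<8,0,0,0 };
static const unsigned on_b[NSTATES] = { 0,0,0,0,1u<<5,0,0,0,1u<<9,1u<<10,0 };

/* ε-closure of a set T, computed by iterating to a fixed point. */
unsigned eclosure(unsigned T)
{
    unsigned old;
    do {
        old = T;
        for (int s = 0; s < NSTATES; s++)
            if (T & (1u << s))
                T |= eps[s];
    } while (T != old);
    return T;
}

/* move(T, c): states reachable from T by one transition on c. */
unsigned moveset(unsigned T, char c)
{
    const unsigned *tab = (c == 'a') ? on_a : on_b;
    unsigned U = 0;
    for (int s = 0; s < NSTATES; s++)
        if (T & (1u << s))
            U |= tab[s];
    return U;
}
```

Starting from ε-closure({0}) = A, one step on a yields the set B and one step on b yields the set C of Example 3.15.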
    State    a    b
      A      B    C
      B      B    D
      C      B    C
      D      B    E
      E      B    C

Fig. 3.28. Transition table Dtran for the DFA.
Here i is a new start state and f a new accepting state. Clearly, this NFA
recognizes {ε}.

Again i is a new start state and f a new accepting state. This machine
recognizes {a}.
3. Suppose N(s) and N(t) are NFA's for regular expressions s and t.
Here i is a new start state and f a new accepting state. In the
composite NFA, we can go from i to f directly, along an edge labeled ε,
representing the fact that ε is in (L(s))*, or we can go from i to f
passing through N(s) one or more times. Clearly, the composite
NFA recognizes (L(s))*.
d) For the parenthesized regular expression (s), use N(s) itself as the
NFA.
Every time we construct a new state, we give it a distinct name. In this way,
no two states of any component NFA can have the same name. Even if the
same symbol appears several times in r, we create for each instance of that
symbol a separate NFA with its own states.
We can verify that each step of the construction of Algorithm 3.3 produces
an NFA that recognizes the correct language. In addition, the construction
produces an NFA N{r) with the following properties.
1. N(r) has at most twice as many states as the number of symbols and
operators in r. This follows from the fact that each step of the construction
creates at most two new states.
2. N(r) has exactly one start state and one accepting state. The accepting
state has no outgoing transitions. This property holds for each of the
constituent automata as well.
[Parse tree for (a|b)*abb, decomposing it into subexpressions r1, ..., r11.]
The NFA for (r3) is the same as that for r3, and the NFA for (r3)* is
obtained by the star construction. To obtain the automaton for the
concatenation that follows, we merge states 7 and 7', calling the resulting
state 7. Continuing in this fashion we obtain the NFA for r11 = (a|b)*abb
that was first exhibited in Fig. 3.27. □
can be implemented to run in time proportional to |N| × |x|, where |N| is the
number of states in N and |x| is the length of x.
Method. Apply the algorithm sketched in Fig. 3.31 to the input string x. The
algorithm in effect performs the subset construction at run time. It computes
a transition from the current set of states S to the next set of states in two
stages. First, it determines move(S, a), all states that can be reached from a
state in S by a transition on a, the current input character. Then, it computes
the ε-closure of move(S, a), that is, all states that can be reached from
move(S, a) by zero or more ε-transitions. The algorithm uses the function
nextchar to read the characters of x, one at a time. When all characters of x
have been seen, the algorithm returns "yes" if an accepting state is in the set
S of current states; "no", otherwise.
    S := ε-closure({s0});
    a := nextchar;
    while a ≠ eof do begin
        S := ε-closure(move(S, a));
        a := nextchar
    end;
    if S ∩ F ≠ ∅ then
        return "yes"
    else return "no";
Algorithm 3.4 can be efficiently implemented using two stacks and a bit
vector indexed by NFA states. We use one stack to keep track of the current
set of nondeterministic states and the other stack to compute the next set of
nondeterministic states. We can use the algorithm in Fig. 3.26 to compute the
ε-closure. The bit vector can be used to determine in constant time whether a
nondeterministic state is already on a stack, so that we do not add it twice
during a transition. Let us write |N| for the number of states of N. Since
there can be at most |N| states on a stack, the computation of the next set of
states from the current set of states can be done in time proportional to |N|.
Thus, the total time needed to simulate the behavior of N on input x is
proportional to |N| × |x|.
Example 3.17. Let N be the NFA of Fig. 3.27 and let x be the string
consisting of the single character a. The start state is
ε-closure({0}) = {0, 1, 2, 4, 7}. On input symbol a there is a transition
from 2 to 3 and from 7 to 8. Thus, T is {3, 8}. Taking the ε-closure of T
gives us the next state {1, 2, 3, 4, 6, 7, 8}. Since none of these
nondeterministic states is accepting, the algorithm returns "no."
Notice that Algorithm 3.4 does the subset construction at runtime. For
example, compare the above transitions with the states of the DFA in Fig.
3.29 constructed from the NFA of Fig. 3.27. The start and next state sets on
input a correspond to states A and B of the DFA.
Time-Space Tradeoffs
One approach is to use Algorithm 3.3 to construct an NFA N from r. This
construction can be done in O(|r|) time, where |r| is the length of r. N has
at most twice as many states as the number of symbols and operators in r,
and at most two transitions from each state, so a transition table for N can
be stored in O(|r|) space. We can then use Algorithm 3.4 to determine
whether N accepts x in O(|r| × |x|) time. Thus, using this approach, we can
determine whether x is in L(r) in total time proportional to the length of r
times the length of x. This approach has been used in a number of text
editors to search for regular expression patterns, since the target string x is
generally not very long.
A second approach is to construct a DFA from the regular expression r by
applying Thompson's construction to r and then the subset construction, Algo
rithm 3.2, to the resulting NFA. (An implementation that avoids constructing
the intermediate NFA explicitly is given in Section 3.9.) Implementing the
transition function with a transition table, we can use Algorithm 3.1 to
simulate the DFA on input x in time proportional to the length of x,
independent of the number of states in the DFA. This approach has often
been used in pattern-matching programs that search text files for regular
expression patterns. Once the finite automaton has been constructed, the
searching can
proceed very rapidly, so this approach is advantageous when the target string
X is very long.
There are, however, certain regular expressions whose smallest DFA has a
number of states that grows exponentially with the size of the expression.
For example, the regular expression (a|b)*a(a|b)(a|b)···(a|b), where there
are n-1 (a|b)'s at the end, has no DFA with fewer than 2^n states. This
regular expression denotes any string of a's and b's in which the nth
character from the right end is an a. It is not hard to prove that any DFA
for this expression must keep track of the last n characters it sees on the
input; otherwise, it may give an erroneous answer. Clearly, at least 2^n
states are required to keep track of all possible sequences of n a's and b's.
Fortunately, expressions such as this do not occur frequently in lexical
analysis applications, but there are applications where similar expressions
do arise.
A third approach is to use a DFA, but avoid constructing all of the
transition table by using a technique called "lazy transition evaluation."
Here, transitions are computed at run time but a transition from a given state
on a given character is not determined until it is actually needed. The
computed transitions are stored in a cache. Each time a transition is about to
be made, the cache is consulted. If the transition is not there, it is computed
and stored in the cache. If the cache becomes full, we can erase some
previously computed transition to make room for the new one.
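A sketch of the cache logic follows. Here compute_transition is a hypothetical stand-in for the move/ε-closure computation (stubbed with the DFA for (a|b)*abb so the code runs); next_state consults the cache first and fills it only on a miss.

```c
#include <assert.h>

#define MAXSTATES 16
#define ALPHABET  2            /* 0 = a, 1 = b */

static int cache[MAXSTATES][ALPHABET];    /* memoized transitions */
static int cached[MAXSTATES][ALPHABET];   /* 1 if entry is valid */

/* Stub: in a real lazy matcher this would apply move and ε-closure
   to the set of NFA states that the DFA state represents. */
static int compute_transition(int state, int sym)
{
    static const int dfa[4][2] = { {1,0}, {1,2}, {1,3}, {1,0} };
    return dfa[state][sym];
}

/* Lazy transition evaluation: compute on a cache miss, then memoize. */
int next_state(int state, int sym)
{
    if (!cached[state][sym]) {
        cache[state][sym] = compute_transition(state, sym);
        cached[state][sym] = 1;
    }
    return cache[state][sym];
}
```

A full implementation would also evict entries when the cache fills, as the text describes; that bookkeeping is omitted here.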
    p_1    { action_1 }
    p_2    { action_2 }
     ...
    p_n    { action_n }
where, as in Section 3.5, each pattern p_i is a regular expression and each
action_i is a program fragment that is to be executed whenever a lexeme
matched by p_i is found in the input.
Our problem is to construct a recognizer that looks for lexemes in the input
buffer. If more than one pattern matches, the recognizer is to choose the
longest lexeme matched. If there are two or more patterns that match the
longest lexeme, the firstlisted matching pattern is chosen.
A finite automaton is a natural model around which to build a lexical
analyzer, and the one constructed by our Lex compiler has the form shown in
Fig. 3.33(b). There is an input buffer with two pointers to it, a lexeme
beginning and a forward pointer, as discussed in Section 3.2. The Lex com
piler constructs a transition table for a finite automaton from the regular
expression patterns in the Lex specification. The lexical analyzer itself
consists of a finite automaton simulator that uses this transition table to look
for lexemes in the input buffer.

[Fig. 3.33. The Lex compiler constructs a transition table from the patterns in the Lex specification; the lexical analyzer is a finite-automaton (FA) simulator that uses this table.]
As we simulate the NFA, whenever we enter a set of states containing an
accepting state, we record the current input position and the pattern p_i
corresponding to this accepting state. If the current set of states already
contains an accepting state, then only the pattern that appears first in the Lex
specification is recorded.
Second, we continue making transitions until we reach termination. Upon
termination, we retract the forward pointer to the position at which the last
match occurred. The pattern making this match identifies the token found,
and the lexeme matched is the string between the lexeme-beginning and
forward pointers.
Usually, the Lex specification is such that some pattern, possibly an error
pattern, will always match. If no pattern matches, however, we have an error
condition for which no provision was made, and the lexical analyzer should
transfer control to some default error recovery routine.
    a      { }
    abb    { }
    a*b+   { }
The three tokens above are recognized by the automata of Fig. 3.35(a). We
have simplified the third automaton somewhat from what would be produced
by Algorithm 3.3. As indicated above, we can convert the NFA's of Fig.
3.35(a) into one combined NFA N shown in Fig. 3.35(b).

Let us now consider the behavior of N on the input string aaba using our
modification of Algorithm 3.4. Figure 3.36 shows the sets of states and
patterns that match as each character of the input aaba is processed. This
figure
shows that the initial set of states is {0, 1, 3, 7}. States 1, 3, and 7 each have
a transition on a, to states 2, 4, and 7, respectively. Since state 2 is the
accepting state for the first pattern, we record the fact that the first pattern
matches after reading the first a.
On the second a, only state 7 has a transition, to itself; on the following b,
there is a transition from state 7 to state 8. State 8 is the accepting state for
the third pattern. Once we reach state 8, there are no transitions possible on
the next input character a, so we have reached termination. Since the last
match occurred after we read the third input character, we report that the
third pattern has matched the lexeme aab.
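The run just described can be reproduced with a small bitset simulation of the combined NFA, using the state numbering given above (accepting states 2, 6, and 8 for the patterns a, abb, and a*b+, respectively). The function names are illustrative.

```c
#include <assert.h>

/* Accepting states of the combined NFA of Fig. 3.35(b). */
#define ACC1 (1u<<2)   /* pattern a */
#define ACC2 (1u<<6)   /* pattern abb */
#define ACC3 (1u<<8)   /* pattern a*b+ */

/* One step of the combined NFA on character c. */
static unsigned step(unsigned S, char c)
{
    unsigned T = 0;
    if (c == 'a') {
        if (S & (1u<<1)) T |= 1u<<2;
        if (S & (1u<<3)) T |= 1u<<4;
        if (S & (1u<<7)) T |= 1u<<7;   /* a*-loop */
    } else if (c == 'b') {
        if (S & (1u<<4)) T |= 1u<<5;
        if (S & (1u<<5)) T |= 1u<<6;
        if (S & (1u<<7)) T |= 1u<<8;
        if (S & (1u<<8)) T |= 1u<<8;   /* b+-loop */
    }
    return T;
}

/* Scan a lexeme from the start of input: remember the last position
   at which any pattern accepted, preferring earlier-listed patterns
   at the same position. Returns the pattern number (0 if none) and
   sets *len to the lexeme length. */
int scan(const char *input, int *len)
{
    unsigned S = (1u<<1) | (1u<<3) | (1u<<7);   /* ε-closure of {0} */
    int pat = 0;
    *len = 0;
    for (int i = 0; input[i] && S; i++) {
        S = step(S, input[i]);
        if      (S & ACC1) { pat = 1; *len = i + 1; }
        else if (S & ACC2) { pat = 2; *len = i + 1; }
        else if (S & ACC3) { pat = 3; *len = i + 1; }
    }
    return pat;
}
```

On input aaba the sketch reports pattern 3 with lexeme length 3, matching the run traced in the text; on abb it reports pattern 2, since abb is listed before a*b+.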
The role of action_i associated with the pattern p_i is the same as in the
NFA-based recognizer. When the DFA simulation finds that several patterns
accept in the same state, the pattern listed first
in the Lex specification has priority. As in the NFA simulation, the only
other modification we need to perform is to continue making state transitions
until we reach a state with no next state (i.e., the state ∅) for the current
input symbol. To find the lexeme matched, we return to the last input
position at which the DFA entered an accepting state.
Recall from Section 3.4 that the lookahead operator / is necessary in some
situations, since the pattern that denotes a particular token may need to
describe some trailing context for the actual lexeme. When converting a
pattern with / to an NFA, we can treat the / as if it were ε, so that we do not
actually look for a / on the input.
Example 3.20. The NFA recognizing the pattern for IF given in Example
3.12 is shown in Fig. 3.38. State 6 indicates the presence of keyword IF;
however, we find the token IF by scanning backwards to the last occurrence
of state 2. □
Call a state of an NFA important if it has a non-ε out-transition; in the
subset construction, only the important states of a subset are relevant. During
the construction, two subsets can be identified if they have the same
important states, and either both or neither include accepting states of the
NFA.
states of the NFA. Nonimportant states are named by upper case letters in
Fig. 3.39(c).
The DFA in Fig. 3.39(b) can be obtained from the NFA in Fig. 3.39(c) if
we apply the subset construction and identify subsets containing the same
important states. The identification results in one fewer state being con
structed, as a comparison with Fig. 3.29 shows.
[Fig. 3.39(a). Syntax tree for (a|b)*abb#, with leaves a, b, a, b, b, # at positions 1 through 6.]
Remembering the equivalence between the important NFA states and the
positions of the leaves in the syntax tree of the regular expression, we can
shortcircuit the construction of the NFA by building the DFA whose states
correspond to sets of positions in the tree. The etransitions of the NFA
represent some fairly complicated structure of the positions; in particular, they
encode the information regarding when one position can follow another. That
is, each symbol in an input string to a DFA can be matched by certain
positions in the tree. The rules for computing the functions nullable and
firstpos are given in Fig. 3.40. The rules for lastpos(n) are the same as those
for firstpos(n), but with c_1 and c_2 reversed, and are not shown.
The first rule for nullable states that if n is a leaf labeled ε, then nullable(n)
is surely true. The second rule states that if n is a leaf labeled by an alphabet
symbol, then nullable(n) is false. In this case, each leaf corresponds to a
single input symbol, and therefore cannot generate ε. The last rule for
nullable states that if n is a star-node with child c_1, then nullable(n) is true,
because the closure of an expression generates a language that includes ε.

As another example, the fourth rule for firstpos states that if n is a cat-node
with left child c_1 and right child c_2, and if nullable(c_1) is true, then
firstpos(n) is firstpos(c_1) ∪ firstpos(c_2); otherwise, firstpos(n) is just
firstpos(c_1).
[Fig. 3.41. firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#.]
The node labeled * is the only nullable node. Thus, by the if-condition of
the fourth rule, firstpos for the parent of this node (the one representing
expression (a|b)*a) is the union of {1, 2} and {3}, which are the firstpos's of
its left and right children. On the other hand, the else-condition applies for
lastpos of this node, since the leaf at position 3 is not nullable. Thus, the
parent of the star-node has lastpos containing only 3.
Let us now compute followpos bottom up for each node of the syntax tree of Fig. 3.41. At the star-node, we add both 1 and 2 to followpos(1) and to followpos(2) using rule (2). At the parent of the star-node, we add 3 to followpos(1) and followpos(2) using rule (1). At the next cat-node, we add 4 to followpos(3) using rule (1). At the next two cat-nodes, we add 5 to followpos(4) and 6 to followpos(5) using the same rule. This completes the construction of followpos. Figure 3.42 summarizes followpos.
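The bottom-up pass just traced can also be written out. The following Python sketch is ours, not the book's: it encodes the syntax tree for (a|b)*abb# as nested tuples (the node tags LEAF, CAT, OR, STAR are our own names) and computes nullable, firstpos, lastpos, and followpos in one recursive pass.

```python
# Syntax tree for (a|b)*abb#, with leaf positions 1..6 as in Fig. 3.39(a).
# Node shapes: (LEAF, pos), (CAT, c1, c2), (OR, c1, c2), (STAR, c1).
LEAF, CAT, OR, STAR = 'leaf', 'cat', 'or', 'star'

def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) of node, filling in followpos."""
    kind = node[0]
    if kind == LEAF:
        pos = node[1]
        return False, {pos}, {pos}
    if kind == OR:
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == CAT:
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for p in l1:                 # rule (1): lastpos(c1) feeds firstpos(c2)
            followpos[p] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    # STAR node
    n1, f1, l1 = analyze(node[1], followpos)
    for p in l1:                     # rule (2): lastpos(n) feeds firstpos(n)
        followpos[p] |= f1
    return True, f1, l1

# ((((a|b)* a) b) b) #  with positions 1..6
star = (STAR, (OR, (LEAF, 1), (LEAF, 2)))
root = (CAT, (CAT, (CAT, (CAT, star, (LEAF, 3)), (LEAF, 4)), (LEAF, 5)),
        (LEAF, 6))

followpos = {p: set() for p in range(1, 7)}
nullable, firstpos, lastpos = analyze(root, followpos)
```

Running the sketch reproduces Figs. 3.41 and 3.42: firstpos of the root is {1, 2, 3}, and followpos(1) = followpos(2) = {1, 2, 3}, followpos(3) = {4}, followpos(4) = {5}, followpos(5) = {6}.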
[Fig. 3.42. The function followpos: followpos(1) = followpos(2) = {1, 2, 3}; followpos(3) = {4}; followpos(4) = {5}; followpos(5) = {6}; followpos(6) = ∅.]
140 LEXICAL ANALYSIS SEC. 3.9
Method.
1. Construct a syntax tree for the augmented regular expression (r)#, where # is a unique endmarker appended to (r).
2. Compute the functions nullable, firstpos, lastpos, and followpos for the nodes of the syntax tree.
3. Construct Dstates, the set of states of D, and Dtran, the transition table for D, by the procedure in Fig. 3.44. The states in Dstates are sets of positions; initially, each state is "unmarked," and a state becomes "marked" just before we consider its out-transitions. The start state of D is firstpos(root), and the accepting states are all those containing the position associated with the endmarker #.
Example 3.23. Let us construct a DFA for the regular expression (a|b)*abb. The syntax tree for ((a|b)*abb)# is shown in Fig. 3.39(a). nullable is true only for the node labeled *. The functions firstpos and lastpos are shown in Fig. 3.41, and followpos is shown in Fig. 3.42.
From Fig. 3.41, firstpos of the root is {1, 2, 3}. Let this set be A and consider input symbol a. Positions 1 and 3 are for a, so let B = followpos(1) ∪ followpos(3) = {1, 2, 3, 4}.
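With followpos in hand, the loop that fills Dstates and Dtran can be sketched as follows. This is our Python rendering of the procedure of Fig. 3.44, not the book's code; followpos and the symbols at positions 1 through 6 are taken from Figs. 3.39 and 3.42.

```python
# followpos from Fig. 3.42, and the symbol at each leaf position.
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
symbol_at = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b', 6: '#'}

start = frozenset({1, 2, 3})              # firstpos of the root
dstates, unmarked, dtran = [start], [start], {}
while unmarked:
    T = unmarked.pop()                    # "mark" T
    for a in 'ab':
        U = set()
        for p in T:                       # union of followpos(p) over the
            if symbol_at[p] == a:         # positions in T whose symbol is a
                U |= followpos[p]
        U = frozenset(U)
        if U and U not in dstates:
            dstates.append(U)             # add U as an unmarked state
            unmarked.append(U)
        dtran[T, a] = U

accepting = [S for S in dstates if 6 in S]   # 6 is the position of #
```

The loop produces exactly the four states A = {1,2,3}, B = {1,2,3,4}, C = {1,2,3,5}, D = {1,2,3,6}, with D accepting: one fewer state than the DFA of Fig. 3.29.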
initially, the only unmarked state in Dstates is firstpos(root);
while there is an unmarked state T in Dstates do begin
    mark T;
    for each input symbol a do begin
        let U be the union of followpos(p) for all positions p in T
            such that the symbol at position p is a;
        if U is not empty and is not in Dstates then
            add U as an unmarked state to Dstates;
        Dtran[T, a] := U
    end
end

Fig. 3.44. Construction of DFA.
Starting the DFA M in state s and feeding it input bb, we end up in an accepting state, but starting in state t and feeding it the same input, we end up in a nonaccepting state; the string bb thus distinguishes s from t, and the two states must be placed in different groups of the current partition. Suppose, for example, that s1 and s2 go to states t1 and t2 on input a, and t1 and t2 are in different groups of the partition. Then we must split the group containing s1 and s2 into at least two subsets so that one subset contains s1 and the other s2. Note that t1 and t2 are distinguished by some string w, so s1 and s2 are distinguished by string aw.
We repeat this process of splitting groups in the current partition until no more groups need to be split. While we have justified why states that have been split into different groups really can be distinguished, we have not indicated why states that are not split into different groups are certain not to be distinguishable by any input string. Such is the case, however, and we leave a proof of that fact to the reader interested in the theory (see, for example, Hopcroft and Ullman [1979]). Also left to the interested reader is a proof that the DFA constructed by taking one state for each group of the final partition and then throwing away the dead state and states not reachable from the start state has as few states as any DFA accepting the same language.
Output. A DFA M' accepting the same language as M and having as few
states as possible.
Method.
1. Construct an initial partition Π of the set of states with two groups: the accepting states F and the nonaccepting states S − F.
2. Apply the procedure of Fig. 3.45 to Π to construct a new partition Πnew.
3. If Πnew = Π, let Πfinal = Π and proceed to step (4); otherwise, repeat step (2) with Π := Πnew.
4. Choose one state in each group of the partition Πfinal as the representative
for that group. The representatives will be the states of the reduced DFA M'. Let s be a representative state, and suppose on input a there is a transition of M from s to t. Let r be the representative of t's group (r may be t). Then M' has a transition from s to r on a. Let the start state of M' be the representative of the group containing the start state s0 of M, and let the accepting states of M' be the representatives that are in F. Note that each group of Πfinal either consists only of states in F or has no states in F.

If M' has a dead state, that is, a state d that is not accepting and that has transitions to itself on all input symbols, then remove d from M'. Also remove any states not reachable from the start state. Any transitions to d from other states become undefined.
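The refinement loop can be sketched in Python; the code below is our illustration, not the book's, and it assumes the transition table of the DFA in Fig. 3.29, whose only accepting state is E.

```python
# DFA of Fig. 3.29: states A..E over alphabet {a, b}, accepting state E.
dtran = {
    ('A', 'a'): 'B', ('A', 'b'): 'C',
    ('B', 'a'): 'B', ('B', 'b'): 'D',
    ('C', 'a'): 'B', ('C', 'b'): 'C',
    ('D', 'a'): 'B', ('D', 'b'): 'E',
    ('E', 'a'): 'B', ('E', 'b'): 'C',
}
states, accepting, alphabet = 'ABCDE', {'E'}, 'ab'

# Initial partition: accepting vs. nonaccepting states.
partition = [frozenset(accepting), frozenset(set(states) - accepting)]
changed = True
while changed:
    group_of = {s: g for g in partition for s in g}
    new = []
    for G in partition:
        # States stay together iff, for every input symbol, their
        # successors lie in the same group of the current partition.
        buckets = {}
        for s in G:
            key = tuple(group_of[dtran[s, a]] for a in alphabet)
            buckets.setdefault(key, set()).add(s)
        new.extend(frozenset(b) for b in buckets.values())
    changed = len(new) != len(partition)
    partition = new
```

The loop stabilizes at the partition (AC)(B)(D)(E): A and C are indistinguishable, and the reduced DFA has four states.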
Example 3.24. Let us reconsider the DFA represented in Fig. 3.29. The initial partition Π consists of two groups: (E), the accepting state, and (ABCD), the nonaccepting states. To construct Πnew, the algorithm of Fig. 3.45 first considers (E). Since this group consists of a single state, it cannot be split further, so (E) is placed in Πnew. The algorithm then considers the group (ABCD). On input a, each of these states has a transition to B, so they could all remain in one group as far as input a is concerned. On input b, however, A, B, and C go to members of the group (ABCD) of Π, while D goes to E, a member of another group. Thus, in Πnew the group (ABCD) must be split into (ABC) and (D), so Πnew is (ABC)(D)(E).
A transition table implemented as a two-dimensional array, indexed by states and characters, provides the fastest access, but it can take up too much space (say several hundred states by 128 characters). A more compact but slower scheme is to use a linked list to store the transitions out of each state, with a "default" transition at the end of the list. The most frequently occurring transition is one obvious choice for the default.
There is a more subtle implementation that combines the fast access of the array representation with the compactness of the list structures. Here we use a data structure consisting of four arrays indexed by state numbers, as depicted in Fig. 3.47. The base array is used to determine the base location of the entries for each state stored in the next and check arrays. The default array is used to determine an alternative base location in case the current base location is invalid.
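The lookup implied by the four arrays can be sketched as follows. The Python below is our illustration with made-up states, targets, and a three-symbol alphabet, not the book's data; check validates an entry, and default supplies the fallback state on a miss.

```python
# Toy data for the scheme of Fig. 3.47: state 0 stores an entry for every
# symbol (so it needs no default); state 1 stores only its transition on
# symbol 1 and defaults to state 0 for everything else.  All target state
# numbers here are made up for illustration.
NSYMS = 3
base    = {0: 0, 1: NSYMS}
default = {1: 0}
next_   = [10, 11, 12, -1, 99, -1]   # -1 marks an unused slot
check   = [ 0,  0,  0, -1,  1, -1]

def nextstate(s, a):
    """Transition of state s on symbol a (0 <= a < NSYMS)."""
    if check[base[s] + a] == s:          # entry really belongs to s
        return next_[base[s] + a]
    return nextstate(default[s], a)      # fall back to the default state
```

For instance, nextstate(1, 1) finds state 1's own entry, while nextstate(1, 0) falls through to state 0's entry because check disowns the slot.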
For example, state q, the default for state s, might be the state that says we are "working on an identifier," such as state 10 in Fig. 3.13. Perhaps s is entered after seeing th, a prefix of the keyword then as well as a prefix of an identifier. On input character e we must go to a special state that remembers we have seen the, but otherwise state s behaves as state q does. Thus, we set default[s] = q and enter the transition of s on e explicitly in the next and check arrays.
EXERCISES
3.2 What are the conventions regarding the use of blanks in each of the
languages of Exercise 3.1?
3.3 Identify the lexemes that make up the tokens in the following programs. Give reasonable attribute values for the tokens.
a) Pascal
b) C
int max(i, j) int i, j;
/* return maximum of integers i and j */
{
c) Fortran 77
FUNCTION MAX(I, J)
3.9 Specify the lexical form of identifiers and keywords in the languages
of Exercise 3.1.
3.10 The regular expression constructs permitted by Lex are listed in Fig.
3.48 in decreasing order of precedence. In this table, c stands for any
single character, r for a regular expression, and s for a string.
3.11 Write a Lex program that copies a file, replacing each nonnull sequence of white space by a single blank.
3.15 Modify Algorithm 3.1 to find the longest prefix of the input that is accepted by the DFA.
3.16 Construct nondeterministic finite automata for the following regular expressions using Algorithm 3.3. Show the sequence of moves made by each in processing the input string ababbab.
a) (a|b)*
b) (a*|b*)*
3.24 Construct the representation of Fig. 3.47 for the transition table of
Exercise 3.19. Pick default states and try the following two methods
of constructing the next array and compare the amounts of space used:
a) Starting with the densest states (those with the largest number of
entries differing from their default states) first, place the entries
for the states in the next array.
b) Place the entries for the states in the next array in a random order.
transition from state s − 1 to state s on symbol b_s. The start and final states are states 0 and m, respectively.
[Figure: transition diagram for a single keyword b1···bm, with states 0 through m.]
3.27 Algorithm KMP in Fig. 3.51 uses the failure function f constructed as in Exercise 3.26 to determine whether keyword b1···bm is a substring of a target string a1···an.

/* does a1···an contain b1···bm as a substring? */
s := 0;
for i := 1 to n do begin
    while s > 0 and a[i] ≠ b[s+1] do s := f(s);
    if a[i] = b[s+1] then s := s + 1;
    if s = m then return "yes"
end;
return "no"

Fig. 3.51. Algorithm KMP.

*a) Show that the algorithm correctly determines whether b1···bm is a substring of a1···an.
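A 0-indexed Python rendering of the failure function of Exercise 3.26 and of Algorithm KMP (ours, not the book's code) is:

```python
def failure(b):
    """f[s] = length of the longest proper prefix of b[:s] that is also
    a suffix of b[:s]; f[0] and f[1] are 0."""
    m = len(b)
    f = [0] * (m + 1)
    t = 0
    for s in range(2, m + 1):
        while t > 0 and b[s - 1] != b[t]:
            t = f[t]                 # fall back along the failure chain
        if b[s - 1] == b[t]:
            t += 1
        f[s] = t
    return f

def kmp(a, b):
    """Does a contain the nonempty keyword b as a substring?"""
    f, s = failure(b), 0
    for c in a:
        while s > 0 and c != b[s]:
            s = f[s]
        if c == b[s]:
            s += 1
        if s == len(b):              # all of b matched
            return True
    return False
```

Because s only falls back along precomputed failure values, the scan over a takes time proportional to len(a) plus len(b), independent of how often partial matches are abandoned.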
expressed as (uv)^k u, for some k ≥ 0, where |uv| = p and v is not the empty string. For example, 2 and 4 are periods of the string abababa.
a) Show that p is a period of a string s if and only if st = us for some strings t and u of length p.
The Fibonacci strings are defined as follows:
s1 = b
s2 = a
sk = s_{k−1} s_{k−2} for k > 2.
*d) Using induction, show that the failure function for s_n can be expressed by f(j) = j − |s_{k−1}|, where k is such that |s_k| ≤ j + 1 ≤ |s_{k+1}|, for 1 ≤ j ≤ |s_n|.
3.31 We can extend the trie and failure function concepts of Exercise 3.26 from a single keyword to a set of keywords as follows. Each state in the trie corresponds to a prefix of one or more keywords. The start state corresponds to the empty string, and a state that corresponds to a complete keyword is a final state. Additional states may be made final during the computation of the failure function. The transition diagram for the set of keywords {he, she, his, hers} is shown in Fig. 3.52.
all input symbols a that are not the initial symbol of any keyword.
We then set g(s, a) = fail for any transition not defined. Note that
there are no fail transitions for the start state.
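The goto, failure, and output functions of Exercises 3.31 and 3.32 can be sketched in Python. The code below is our rendering of the construction, in the style of Aho and Corasick [1975], not the book's code; state 0 is the start state, and missing goto entries at state 0 are treated as transitions back to state 0.

```python
from collections import deque

def build(keywords):
    """Trie (goto), failure, and output functions for a set of keywords."""
    goto, output = [{}], [set()]
    for w in keywords:                    # enter each keyword into the trie
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({})
                output.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        output[s].add(w)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())       # depth-one states fail to state 0
    while queue:                          # breadth-first over the trie
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(c, 0)
            output[t] |= output[fail[t]]  # states may become "final" here
    return goto, fail, output

def matches(text, keywords):
    """All keywords of the set occurring as substrings of text."""
    goto, fail, output = build(keywords)
    s, found = 0, set()
    for c in text:
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        found |= output[s]
    return found
```

On the keyword set {he, she, his, hers} of Fig. 3.52, scanning "ushers" reports she, he, and hers, with he found via the failure link out of the state for "sh".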
*b) Show that the algorithm in Fig. 3.53 correctly computes the failure function.
*c) Show that the failure function can be computed in time proportional to the sum of the lengths of the keywords.
3.32 Let g be the transition function and f the failure function of Exercise 3.31 for a set of keywords K = {y1, y2, ..., yk}. Algorithm AC in Fig. 3.54 uses g and f to determine whether a target string a1···an contains one of the keywords as a substring.
3.33 Use the algorithm in Exercise 3.32 to construct a lexical analyzer for the keywords in Pascal.
3.34 Define lcs(x, y), a longest common subsequence of two strings x and y, to be a string that is a subsequence of both x and y and is as long as any such subsequence. For example, tie is a longest common subsequence of striped and tiger. Define d(x, y), the distance between x and y, to be the minimum number of insertions and deletions required to transform x into y. For example, d(striped, tiger) = 6.
a) Show that for any two strings x and y, the distance between x and y and the length of their longest common subsequence are related by d(x, y) = |x| + |y| − 2·|lcs(x, y)|.
*b) Write an algorithm that takes two strings x and y as input and produces a longest common subsequence of x and y as output.
3.35 Define e(x, y), the edit distance between two strings x and y, to be the minimum number of character insertions, deletions, and replacements that are required to transform x into y. Let x = a1···am and y = b1···bn. e(x, y) can be computed by a dynamic programming algorithm using a distance array d[0..m, 0..n] in which d[i, j] is the edit distance between a1···ai and b1···bj. The algorithm in Fig. 3.55 can be used to compute the d matrix. The function repl is just the cost of a character replacement: repl(ai, bj) = 0 if ai = bj, 1 otherwise.
for i := 0 to m do d[i, 0] := i;
for j := 1 to n do d[0, j] := j;
for i := 1 to m do
    for j := 1 to n do
        d[i, j] := min(d[i−1, j−1] + repl(ai, bj),
                       d[i−1, j] + 1,
                       d[i, j−1] + 1)

Fig. 3.55. Algorithm to compute the edit-distance matrix d.
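The algorithm of Fig. 3.55 carries over directly to a 0-indexed Python sketch (ours, not the book's code):

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions, and replacements
    needed to transform x into y (Fig. 3.55, 0-indexed)."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):               # transforming x[:i] into ""
        d[i][0] = i
    for j in range(1, n + 1):            # transforming "" into y[:j]
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            repl = 0 if x[i - 1] == y[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + repl,   # replace (or match)
                          d[i - 1][j] + 1,          # delete x[i-1]
                          d[i][j - 1] + 1)          # insert y[j-1]
    return d[m][n]
```

Note the contrast with Exercise 3.34's distance: with replacements allowed, striped can be turned into tiger in 4 steps (delete s, delete r, replace p by g, replace d by r), whereas insertions and deletions alone require 6.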
3.36 Give an algorithm that takes as input a string x and a regular expression r, and produces as output a string y in L(r) such that d(x, y) is as small as possible, where d is the distance function in Exercise 3.34.
PROGRAMMING EXERCISES
P3.1 Write a lexical analyzer in Pascal or C for the tokens shown in Fig.
3.10.
P3.2 Write a specification for the tokens of Pascal and from this specification construct transition diagrams. Use the transition diagrams to implement a lexical analyzer for Pascal in a language like C or Pascal.
P3.3 Complete the Lex program in Fig. 3.18. Compare the size and speed
of the resulting lexical analyzer produced by Lex with the program
written in Exercise P3.1.
P3.4 Write a Lex specification for the tokens of Pascal and use the Lex
compiler to construct a lexical analyzer for Pascal.
P3.5 Write a program that takes as input a regular expression and the
name of a file, and produces as output all lines of the file that contain
a substring denoted by the regular expression.
P3.6 Add an error recovery scheme to the Lex program in Fig. 3.18 to
enable it to continue to look for tokens in the presence of errors.
P3.7 Program a lexical analyzer from the DFA constructed in Exercise 3.18
and compare this lexical analyzer with that constructed in Exercises
P3.1 and P3.3.
BIBLIOGRAPHIC NOTES
The restrictions imposed on the lexical aspects of a language are often determined by the environment in which the language was created. When Fortran was designed in 1954, punched cards were a common input medium. Blanks were ignored in Fortran partially because keypunchers, who prepared cards from handwritten notes, tended to miscount blanks (Backus [1981]). Algol 58's separation of the hardware representation from the reference language was a compromise reached after a member of the design committee insisted, "No! I will never use a period for a decimal point." (Wegstein [1981]).
Knuth [1973a] presents additional techniques for buffering input. Feldman [1979b] discusses the practical difficulties of token recognition in Fortran 77.
Regular expressions were first studied by Kleene [1956], who was interested in describing the events that could be represented by the McCulloch and Pitts [1943] finite automaton model of nervous activity. The minimization of finite automata was first studied by Huffman [1954] and Moore [1956]. The equivalence of deterministic and nondeterministic automata as far as their ability to recognize languages was shown by Rabin and Scott [1959]. McNaughton and Yamada [1960] describe an algorithm to construct a DFA directly from a regular expression. More of the theory of regular expressions can be found in Hopcroft and Ullman [1979].
It was quickly appreciated that tools to build lexical analyzers from regular-expression specifications would be useful. The table-compression scheme of Section 3.9 is due to S. C. Johnson, who first used it in the implementation of the Yacc parser generator (Johnson [1975]). Other table-compression schemes are discussed and evaluated in Dencker, Dürre, and Heuft [1984].
The problem of compact implementation of transition tables has been theoretically studied in a general setting by Tarjan and Yao [1979] and by Fredman, Komlós, and Szemerédi [1984]. Cormack, Horspool, and Kaiserswerth [1985] present a perfect hashing algorithm based on this work.
Regular expressions and finite automata have been used in many applications other than compiling. Many text editors use regular expressions for context searches. The expressions recognized by egrep are similar to those in Lex, except for iteration and lookahead. egrep uses a DFA with lazy state construction to look for its regular expression patterns, as outlined in Section 3.7. fgrep looks for patterns consisting of sets of keywords using the algorithm in Aho and Corasick [1975], which is discussed in Exercises 3.31 and 3.32. Aho [1980] discusses the relative performance of these programs.
Regular expressions have been widely used in text retrieval systems, in database query languages, and in file processing languages like AWK (Aho, Kernighan, and Weinberger [1979]). Jarvis [1976] used regular expressions to describe imperfections in printed circuits. Cherry [1982] used the keyword-matching algorithm in Exercise 3.32 to look for poor diction in manuscripts.
The string pattern matching algorithm in Exercises 3.26 and 3.27 is from Knuth, Morris, and Pratt [1977]. This paper also contains a good discussion of periods in strings. Another efficient algorithm for string matching was invented by Boyer and Moore [1977], who showed that a substring match can usually be determined without having to examine all characters of the target string. Hashing has also proven to be an effective technique for string pattern matching (Harrison [1971]).
The notion of a longest common subsequence discussed in Exercise 3.34 has been used in the design of the UNIX system file comparison program diff (Hunt and McIlroy [1976]). An efficient practical algorithm for computing longest common subsequences is described in Hunt and Szymanski [1977]. The algorithm for computing minimum edit distances in Exercise 3.35 is from Wagner and Fischer [1974]. Wagner [1974] contains a solution to Exercise 3.36. Sankoff and Kruskal [1983] contains a fascinating discussion of the broad range of applications of minimum-distance recognition algorithms, from the study of patterns in genetic sequences to problems in speech processing.
CHAPTER 4
Syntax Analysis
Every programming language has rules that prescribe the syntactic structure of well-formed programs. In Pascal, for example, a program is made out of blocks, a block out of statements, a statement out of expressions, an expression out of tokens, and so on. The syntax of programming language constructs can be described by context-free grammars or BNF (Backus-Naur Form) notation, introduced in Section 2.2. Grammars offer significant advantages to both language designers and compiler writers.

The bulk of this chapter is devoted to parsing methods that are typically used in compilers. We first present the basic concepts, then techniques that are suitable for hand implementation, and finally algorithms that have been used in automated tools. Since programs may contain syntactic errors, we extend the parsing methods so they recover from commonly occurring errors.
160 SYNTAX ANALYSIS SEC. 4.1
• It should recover from each error quickly enough to be able to detect subsequent errors.
(6) begin
(7)     if i > j then max := i
(8)     else max := j
(9) end;
(10) begin
(11)     readln(x, y);
How should an error handler report the presence of an error? At the very
least, it should report the place in the source program where an error is
detected because there is a good chance that the actual error occurred within
the previous few tokens. A common strategy employed by many compilers is to print the offending line with a pointer to the position at which an error is detected.
Once an error is detected, how should the parser recover? As we shall see, there are a number of general strategies, but no one method clearly dominates. In most cases, it is not adequate for the parser to quit after detecting the first error, because subsequent processing of the input may reveal additional errors. Usually, there is some form of error recovery in which the parser attempts to restore itself to a state where processing of the input can continue with a reasonable hope that correct input will be parsed and otherwise handled correctly by the compiler.
An inadequate job of recovery may introduce an annoying avalanche of
"spurious" errors, those that were not made by the programmer, but were
introduced by the changes made to the parser state during error recovery. In
a similar vein, syntactic error recovery may introduce spurious semantic errors
that will later be detected by the semantic analysis or code generation phases.
For example, in recovering from an error, the parser may skip a declaration of some variable, say zap. When zap is later encountered in expressions, there is nothing syntactically wrong, but since there is no symbol-table entry for zap, a message "zap undefined" is generated.
A conservative strategy for a compiler is to inhibit error messages that stem from errors uncovered too close together in the input stream. After discovering one syntax error, the compiler should require several tokens to be parsed successfully before permitting another error message. In some cases, there may be too many errors for the compiler to continue sensible processing. (For example, how should a Pascal compiler respond to a Fortran program as input?)
In fact, with the increasing emphasis on interactive computing and good programming environments, the trend seems to be toward simple error-recovery mechanisms.
ErrorRecovery Strategies
There are many different general strategies that a parser can employ to
recover from a syntactic error. Although no one strategy has proven itself to
be universally acceptable, a few methods have broad applicability. Here we
introduce the following strategies:
• panic mode
• phrase level
• error productions
• global correction
Phrase-level recovery. On discovering an error, a parser may perform local correction on the remaining input; that is, it may replace a prefix of the remaining input by some string that allows the parser to continue. A typical local correction would be to replace a comma by a semicolon, delete an extraneous semicolon, or insert a missing semicolon. The choice of the local correction is left to the compiler designer. Of course, we must be careful to choose replacements that do not lead to infinite loops, as would be the case, for example, if we always inserted something on the input ahead of the current input symbol.
This type of replacement can correct any input string and has been used in several error-repairing compilers. The method was first used with top-down parsing. Its major drawback is the difficulty it has in coping with situations in which the actual error has occurred before the point of detection.
SEC. 4.2 CONTEXT-FREE GRAMMARS 165
Error productions. If we have a good idea of the common errors that might be encountered, we can augment the grammar for the language at hand with productions that generate the erroneous constructs. We then use the grammar augmented by these error productions to construct a parser. If an error production is used by the parser, we can generate appropriate error diagnostics to indicate the erroneous construct that has been recognized in the input.

Global correction. Ideally, we would like a compiler to make as few changes as possible in processing an incorrect input string. There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost correction. Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a related string y, such that the number of insertions, deletions, and changes of tokens required to transform x into y is as small as possible. Unfortunately, these methods are in general too costly to implement in terms of time and space, so these techniques are currently only of theoretical interest.

We should point out that a closest correct program may not be what the programmer had in mind. Nevertheless, the notion of least-cost correction does provide a yardstick for evaluating error-recovery techniques, and it has been used for finding optimal replacement strings for phrase-level recovery.
1. Terminals are the basic symbols from which strings are formed. The word "token" is a synonym for "terminal" when we are talking about grammars for programming languages. In (4.2), the keywords if, then, and else are terminals.
Example 4.2. The grammar with the following productions defines simple arithmetic expressions.

expr → expr op expr
expr → ( expr )
expr → − expr
expr → id
op → +
op → −
op → *
op → /
op → ↑

In this grammar, the terminal symbols are id + − * / ↑ ( ). The nonterminal symbols are expr and op, and expr is the start symbol.
Notational Conventions
To avoid always having to state that "these are the terminals," "these are the
nonterminals," and so on, we shall employ the following notational conven
tions with regard to grammars throughout the remainder of this book.
ii) The letter S, which, when it appears, is usually the start symbol.
iii) Lowercase italic names such as expr or stmt.
7. Unless otherwise stated, the left side of the first production is the start
symbol.
Example 4.3. Using these shorthands, we could write the grammar of Example 4.2 concisely as

E → E A E | ( E ) | − E | id
A → + | − | * | / | ↑

Our notational conventions tell us that E and A are nonterminals, with E the start symbol. The remaining symbols are terminals.
Derivations

There are several ways to view the process by which a grammar defines a language. In Section 2.2, we viewed this process as one of building parse trees, but there is also a related derivational view that we frequently find useful. In fact, this derivational view gives a precise description of the top-down construction of a parse tree. The central idea here is that a production is treated as a rewriting rule in which the nonterminal on the left is replaced by the string on the right side of the production. For example, the production E → −E allows us to replace any instance of E by −E; in the simplest case, we can write

E ⇒ −E

which is read "E derives −E." The production E → (E) tells us that we could also replace one instance of an E in any string of grammar symbols by (E); e.g., E*E ⇒ (E)*E or E*E ⇒ E*(E).
We can take a single E and repeatedly apply productions in any order to obtain a sequence of replacements. We call such a sequence a derivation. More generally, if α1 ⇒ α2 ⇒ ··· ⇒ αn, we say α1 derives αn. The symbol ⇒ means "derives in one step." Often we wish to say "derives in zero or more steps"; for this purpose we can use the symbol ⇒*.
Example 4.4. The string −(id + id) is a sentence of grammar (4.3) because there is the derivation

E ⇒ −E ⇒ −(E) ⇒ −(E + E) ⇒ −(id + E) ⇒ −(id + id)    (4.4)

The strings appearing in this derivation are sentential forms of this grammar. We write E ⇒* −(id + id) to indicate that −(id + id) can be derived from E.
We can show by induction on the length of a derivation that every sentence in the language of grammar (4.3) is an arithmetic expression involving the binary operators + and *, the unary operator −, parentheses, and the operand id.
To see the relationship between derivations and parse trees, consider any derivation α1 ⇒ α2 ⇒ ··· ⇒ αn, where α1 is a single nonterminal A. For each sentential form αi in the derivation, we construct a parse tree whose yield is αi. The process is an induction on i. For the basis, the tree for α1 = A is a single node labeled A. To do the induction, suppose we have already constructed a parse tree whose yield is α_{i−1} = X1 X2 ··· Xk. (Recalling our conventions, each Xj is either a nonterminal or a terminal.) Suppose αi is derived from α_{i−1} by replacing Xj, a nonterminal, by β = Y1 Y2 ··· Yr. That is, at the ith step of the derivation, production Xj → β is applied to α_{i−1} to derive αi = X1 X2 ··· X_{j−1} β X_{j+1} ··· Xk.

To model this step of the derivation, we find the jth leaf from the left in the current parse tree. This leaf is labeled Xj. We give this leaf r children, labeled Y1, Y2, ..., Yr, from the left. As a special case, if r = 0, i.e., β = ε, we give the jth leaf one child labeled ε.
[Figure: parse tree for −(id + id), with root E.]
Example 4.5. Consider derivation (4.4). The sequence of parse trees constructed from this derivation is shown in Fig. 4.3. In the first step of the derivation, E ⇒ −E. To model this step, we add two children, labeled − and E, to the root E of the initial tree to create the second tree.
It is not hard to see that every parse tree has associated with it a unique leftmost and a unique rightmost derivation. In what follows, we shall frequently parse by producing a leftmost or rightmost derivation, understanding that instead of this derivation we could produce the parse tree itself. However, we should not assume that every sentence necessarily has only one parse tree or only one leftmost or rightmost derivation.
Example 4.6. Let us again consider the arithmetic expression grammar (4.3). The sentence id + id * id has the two distinct leftmost derivations:

E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

and

E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

[Fig. 4.4. Two parse trees for id + id * id, (a) and (b).]

Note that the parse tree of Fig. 4.4(a) reflects the commonly assumed precedence of + and *, while the tree of Fig. 4.4(b) does not. That is, it is customary to treat operator * as having higher precedence than +, corresponding to the fact that we would normally evaluate an expression like a + b * c as a + (b * c), rather than as (a + b) * c.
Ambiguity

A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Put another way, an ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for the same sentence. For certain types of parsers, it is desirable that the grammar be made unambiguous.
A0 → a A0 | b A0 | a A1
A1 → b A2
A2 → b A3
A3 → ε

This grammar and the regular expression (a|b)*abb describe the same language, the set of strings of a's and b's ending in abb.
We can mechanically convert a nondeterministic finite automaton (NFA) into a grammar that generates the same language as recognized by the NFA. The grammar above was constructed from the NFA of Fig. 3.23 using the following construction: For each state i of the NFA, create a nonterminal symbol Ai. If state i has a transition to state j on symbol a, introduce the production Ai → aAj. If state i goes to state j on input ε, introduce the production Ai → Aj. If i is an accepting state, introduce Ai → ε. If i is the start state, make Ai be the start symbol of the grammar.
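This construction is mechanical enough to carry out in a few lines. The Python below is our sketch, not the book's; it encodes the NFA of Fig. 3.23 as a list of labeled transitions, with state 0 the start state and state 3 accepting, and writes e for the empty string.

```python
def nfa_to_grammar(transitions, epsilon_moves, accepting, start):
    """One production per NFA transition, per the construction in the text."""
    prods = set()
    for i, a, j in transitions:          # state i goes to j on symbol a
        prods.add(f"A{i} -> {a} A{j}")
    for i, j in epsilon_moves:           # state i goes to j on epsilon
        prods.add(f"A{i} -> A{j}")
    for i in accepting:                  # accepting states derive e
        prods.add(f"A{i} -> e")
    return prods, f"A{start}"

# NFA of Fig. 3.23 for (a|b)*abb: states 0..3, state 3 accepting.
transitions = [(0, 'a', 0), (0, 'b', 0), (0, 'a', 1),
               (1, 'b', 2), (2, 'b', 3)]
prods, start_symbol = nfa_to_grammar(transitions, [], [3], 0)
```

The result is exactly the grammar displayed above: A0 → aA0 | bA0 | aA1, A1 → bA2, A2 → bA3, A3 → e, with start symbol A0.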
Since every regular set is a contextfree language, we may reasonably ask,
"Why use regular expressions to define the lexical syntax of a language?"
There are several reasons.
There are no firm guidelines as to what to put into the lexical rules, as
opposed to the syntactic rules. Regular expressions are most useful for
describing the structure of lexical constructs such as identifiers, constants,
keywords, and so forth. Grammars, on the other hand, are most useful in describing nested structures such as balanced parentheses, matching begin-end's, corresponding if-then-else's, and so on. As we have noted, these nested structures cannot be described by regular expressions.
S → ( S ) S | ε    (4.6)
It may not be initially apparent, but this simple grammar generates all strings of balanced parentheses, and only such strings. To see this, we shall show first that every sentence derivable from S is balanced, and then that every balanced string is derivable from S. To show that every sentence derivable from S is balanced, we use an inductive proof on the number of steps in a derivation. For the basis step, we note that the only string of terminals derivable from S in one step is the empty string, which surely is balanced.
Now assume that all derivations of fewer than n steps produce balanced sentences, and consider a leftmost derivation of exactly n steps. Such a derivation must be of the form

S ⇒ (S)S ⇒* (x)S ⇒* (x)y

The derivations of x and y from S take fewer than n steps, so, by the inductive hypothesis, x and y are balanced. Therefore, the string (x)y must be balanced.

We have thus shown that any string derivable from S is balanced. We must
next show that every balanced string is derivable from S. To do this we use induction on the length of a string. For the basis step, the empty string is derivable from S.

Now assume that every balanced string of length less than 2n is derivable from S, and consider a balanced string w of length 2n, n ≥ 1. Surely w begins with a left parenthesis. Let (x) be the shortest prefix of w having an equal number of left and right parentheses. Then w can be written as (x)y where both x and y are balanced. Since x and y are of length less than 2n, they are derivable from S by the inductive hypothesis. Thus, we can find a derivation of the form S ⇒ (S)S ⇒* (x)S ⇒* (x)y, proving that w = (x)y is also derivable from S.
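Grammar (4.6) also serves directly as a recognizer. The recursive-descent sketch below is ours, not the book's: parse_S consumes '(', an inner S, ')', and a trailing S whenever the next symbol is '(', and otherwise applies S → ε.

```python
def balanced(w):
    """Recognize L(S) for S -> ( S ) S | epsilon by recursive descent."""
    def parse_S(i):
        # Use S -> ( S ) S if the next symbol is '('; otherwise S -> epsilon.
        if i < len(w) and w[i] == '(':
            i = parse_S(i + 1)                       # inner S
            if i is None or i >= len(w) or w[i] != ')':
                return None                          # missing ')': reject
            return parse_S(i + 1)                    # trailing S
        return i                                     # S -> epsilon: consume nothing
    return parse_S(0) == len(w)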
Eliminating Ambiguity
Here "other" stands for any other statement. According to this grammar, the
compound conditional statement
has the parse tree shown in Fig. 4.5. Grammar (4.7) is ambiguous since the
string
[Fig. 4.6. Two parse trees for an ambiguous sentence.]
stmt → matched_stmt
     | unmatched_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
     | other                                            (4.9)
unmatched_stmt → if expr then stmt
     | if expr then matched_stmt else unmatched_stmt

This grammar generates the same set of strings as (4.7), but it allows only one parsing for string (4.8), namely the one that associates each else with the closest previous unmatched then.
A left-recursive pair of productions A → Aα | β can be replaced by the non-left-recursive productions

A → βA'
A' → αA' | ε

without changing the set of strings derivable from A. This rule by itself suffices in many grammars.
E → E + T | T
T → T * F | F            (4.10)
F → ( E ) | id

E → T E'
E' → + T E' | ε
T → F T'                 (4.11)
T' → * F T' | ε
F → ( E ) | id
A → Aα1 | Aα2 | ··· | Aαm | β1 | β2 | ··· | βn

where no βi begins with an A. Then we replace the A-productions by

A → β1A' | β2A' | ··· | βnA'
A' → α1A' | α2A' | ··· | αmA' | ε
The nonterminal A generates the same strings as before but is no longer left recursive. This procedure eliminates all immediate left recursion from the A and A' productions (provided no αi is ε), but it does not eliminate left recursion involving derivations of two or more steps. For example, consider the grammar

S → Aa | b
A → Ac | Sd | ε          (4.12)
Method. Apply the algorithm in Fig. 4.7 to G. Note that the resulting non-left-recursive grammar may have ε-productions.

1. Arrange the nonterminals in some order A1, A2, ..., An.
2. for i := 1 to n do begin
       for j := 1 to i − 1 do begin
           replace each production of the form Ai → Aj γ
           by the productions Ai → δ1 γ | δ2 γ | ··· | δk γ,
           where Aj → δ1 | δ2 | ··· | δk are all the current Aj-productions;
       end
       eliminate the immediate left recursion among the Ai-productions
   end

Fig. 4.7. Algorithm to eliminate left recursion from a grammar.
The reason the procedure in Fig. 4.7 works is that after the i − 1st iteration
of the outer for loop, any production of the form Aₖ → Aₘα, where k < i,
must have m > k. As a result, on the next iteration, the inner loop (on j)
progressively raises the lower limit on m in any production Aᵢ → Aₘα, until
we must have m ≥ i. Then, eliminating immediate left recursion for the
Aᵢ-productions forces m to be greater than i.
A → Ac | Aad | bd | ε

Eliminating the immediate left recursion among the A-productions yields the
following grammar:
S  → Aa | b
A  → bdA' | A'
A' → cA' | adA' | ε
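The transformation just illustrated is easy to sketch in a few lines of Python. The following is our illustration of the technique, not code from the book: production bodies are tuples of symbols, the empty tuple () stands for ε, and the fresh nonterminal is named by appending a prime.

```python
def eliminate_immediate_left_recursion(head, bodies):
    """Replace A -> A a1 | ... | A am | b1 | ... | bn by
       A -> b1 A' | ... | bn A' and A' -> a1 A' | ... | am A' | ε."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # A-recursive tails
    betas = [b for b in bodies if not b or b[0] != head]
    if not alphas:
        return {head: bodies}            # no immediate left recursion
    new = head + "'"                     # the fresh nonterminal A'
    return {
        head: [beta + (new,) for beta in betas],
        new: [alpha + (new,) for alpha in alphas] + [()],    # () is ε
    }
```

Applied to the A-productions A → Ac | Aad | bd | ε of the example above, it returns A → bdA' | A' and A' → cA' | adA' | ε, matching the grammar just derived.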
Left Factoring
on seeing the input token if, we cannot immediately tell which production to
choose to expand stmt. In general, if A → αβ₁ | αβ₂ are two A-productions,
and the input begins with a nonempty string derived from α, we do not know
whether to expand A to αβ₁ or to αβ₂. However, we may defer the decision
by expanding A to αA'. Then, after seeing the input derived from α, we
expand A' to β₁ or to β₂. That is, left-factored, the original productions
become

A  → αA'
A' → β₁ | β₂
Input. Grammar G.
Method. For each nonterminal A, find the longest prefix α common to two or
more of its alternatives. If α ≠ ε, i.e., there is a nontrivial common prefix,
replace all of the A-productions A → αβ₁ | αβ₂ | ··· | αβₙ | γ, where γ
represents all alternatives that do not begin with α, by

A  → αA' | γ
A' → β₁ | β₂ | ··· | βₙ

Here A' is a new nonterminal. Repeatedly apply this transformation until no
two alternatives for a nonterminal have a common prefix.
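One round of this transformation can be sketched in Python as follows. This is an illustration rather than the book's algorithm: bodies are tuples of grammar symbols, () stands for ε, and the helper names are ours.

```python
def longest_common_prefix(bodies):
    """Longest prefix of symbols shared by all of the given bodies."""
    prefix = []
    for symbols in zip(*bodies):
        if any(s != symbols[0] for s in symbols):
            break
        prefix.append(symbols[0])
    return tuple(prefix)

def left_factor_once(head, bodies):
    """Factor out one nontrivial common prefix, if alternatives share one."""
    groups = {}
    for b in bodies:
        groups.setdefault(b[:1], []).append(b)    # group by first symbol
    for first, alts in groups.items():
        if first and len(alts) >= 2:              # a nontrivial prefix exists
            alpha = longest_common_prefix(alts)
            new = head + "'"                      # the new nonterminal A'
            gamma = [b for b in bodies if b not in alts]
            return {head: [alpha + (new,)] + gamma,
                    new: [b[len(alpha):] for b in alts]}   # the beta_i
    return {head: bodies}                          # nothing to factor
```

On the S-productions S → iEtS | iEtSeS | a of the dangling-else grammar below, one round produces S → iEtSS' | a and S' → ε | eS, the left-factored form, up to the order of alternatives.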
S → iEtS | iEtSeS | a          (4.13)
E → b

Here i, t, and e stand for if, then, and else; E and S stand for "expression"
and "statement." Left-factored, this grammar becomes:
S  → iEtSS' | a
S' → eS | ε               (4.14)
E  → b
Thus, we may expand S to iEtSS' on input i, and wait until iEtS has been seen
to decide whether to expand S' to eS or to ε. Of course, grammars (4.13) and
(4.14) are both ambiguous, and on input e, it will not be clear which
alternative for S' should be chosen. Example 4.19 discusses a way out of this
dilemma. □
Example 4.12. The language L₃ = {aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1} is not
context free. That is, L₃ consists of strings in the language generated by the
regular expression a*b*c*d* such that the number of a's and c's are equal and
the number of b's and d's are equal. (Recall aⁿ means a written n times.) L₃
abstracts the problem of checking that the number of formal parameters in the
declaration of a procedure agrees with the number of actual parameters in a
use of the procedure. That is, aⁿ and bᵐ could represent the formal parameter
lists of two procedures declared to have n and m arguments, respectively.
Then cⁿ and dᵐ represent the actual parameter lists in calls to these two
procedures.
Again note that the typical syntax of procedure definitions and uses does
not concern itself with counting the number of parameters. For example, the
CALL statement in a Fortran-like language might be described by productions
such as

stmt → call id ( expr_list )
expr_list → expr_list , expr | expr
with suitable productions for expr. Checking that the number of actual
parameters in the call is correct is usually done during the semantic analysis
phase.
S → aSd | aAd
A → bAc | bc
Since D has only k different states, at least two states in the sequence
s₀, s₁, ..., sₖ must be the same, say sᵢ and sⱼ. From state sᵢ, a sequence of
i b's takes D to an accepting state f, since aⁱbⁱ is in L'₃. But then there is
also a path from the initial state s₀ to sᵢ to f labeled aʲbⁱ, as shown in
Fig. 4.8. Thus, D also accepts aʲbⁱ, which is not in L'₃, contradicting the
assumption that L'₃ is the language accepted by D.

Colloquially, we say that "a finite automaton cannot keep count," meaning
that a finite automaton cannot accept a language like L'₃ which would require
it to keep count of the number of a's before it sees the b's. Similarly, we say
"a grammar can keep count of two items but not three," since with a grammar
we can define L'₃ but not L₃.
Recursive-Descent Parsing
a parse tree for the input starting from the root and creating the nodes of the
parse tree in preorder. In Section 2.4, we discussed the special case of
recursive-descent parsing, called predictive parsing, where no backtracking is
required. To see general recursive-descent parsing with backtracking,
consider the grammar

S → cAd
A → ab | a          (4.15)

and the input string w = cad. To construct a parse tree for this string
top-down, we initially create a tree consisting of a single node labeled S. An
input pointer points to c, the first symbol of w. We then use the first
production for S to expand the tree and obtain the tree of Fig. 4.9(a).
[Fig. 4.9. Steps (a), (b), (c) in a top-down parse of w = cad.]
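The parse just described can be sketched as one Python function per nonterminal. This is our illustration, not the book's code; for this tiny grammar, trying A's longer alternative first and falling back to the shorter one has the same net effect as the backtracking described above.

```python
def parse_S(s, i):
    """Match S -> cAd starting at position i; return the position after
    the match, or None on failure."""
    if i < len(s) and s[i] == 'c':
        j = parse_A(s, i + 1)
        if j is not None and j < len(s) and s[j] == 'd':
            return j + 1
    return None

def parse_A(s, i):
    """Match A -> ab | a, preferring the longer alternative."""
    if s[i:i + 2] == 'ab':
        return i + 2
    if s[i:i + 1] == 'a':
        return i + 1
    return None

def accepts(s):
    # The whole input must be consumed for the parse to succeed.
    return parse_S(s, 0) == len(s)
```

Here accepts('cad') and accepts('cabd') hold, while accepts('cax') fails, matching the behavior traced in the text.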
Predictive Parsers
then the keywords if, while, and begin tell us which alternative is the only one
that could possibly succeed if we are to find a statement.
2. For each production A → X₁X₂···Xₙ, create a path from the initial to
   the final state, with edges labeled X₁, X₂, ..., Xₙ.
The predictive parser working off the transition diagrams behaves as follows.
It begins in the start state for the start symbol. If after some actions it is in
state s with an edge labeled by terminal a to state t, and if the next input
symbol is a, then the parser moves the input cursor one position right and
goes to state t. If, on the other hand, the edge is labeled by a nonterminal A,
the parser instead goes to the start state for A, without moving the input
cursor. If it ever reaches the final state for A, it immediately goes to state t,
in effect having "read" A from the input during the time it moved from state s
to t. Finally, if there is an edge from s to t labeled ε, then from state s the
parser immediately goes to state t, without advancing the input.
there is a transition on a nonterminal out of s, and popping the stack when the
final state for a nonterminal is reached. We shall discuss the implementation
of transition diagrams in more detail shortly.
The above approach works if the given transition diagram does not have
nondeterminism, in the sense that there is more than one transition from a
state on the same input. If ambiguity occurs, we may be able to resolve it in
an ad-hoc way, as in the next example. If the nondeterminism cannot be
eliminated, we cannot build a predictive parser, but we could build a
recursive-descent parser using backtracking to systematically try all
possibilities, if that were the best parsing strategy we could find.
removed, and we can write a predictive parsing program for grammar (4.11). □
[Fig. 4.11. Simplified transition diagrams, panels (a)-(d).]
Figure 4.11(b) shows an equivalent transition diagram for E'. We may then
substitute the diagram of Fig. 4.11(b) for the transition on E' in the diagram
for E in Fig. 4.10, yielding the diagram of Fig. 4.11(c). Lastly, we observe
that the first and third nodes in Fig. 4.11(c) are equivalent and we merge
them. The result, Fig. 4.11(d), is repeated as the first diagram in Fig. 4.12.
The same techniques apply to the diagrams for T and T'.
[Figure: the complete set of transition diagrams, and the model of a
nonrecursive predictive parser with its input buffer, stack, and parsing table.]
assume that the parser just prints the production used; any other code
could be executed here. If M[X, a] = error, the parser calls an error
recovery routine.

if X is a terminal or $ then
    if X = a then pop X from the stack and advance the input pointer
    else error()
that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).

Define FOLLOW(A), for nonterminal A, to be the set of terminals a that
can appear immediately to the right of A in some sentential form; that is, the
set of terminals a such that there exists a derivation of the form S ⇒* αAaβ
for some α and β. Note that there may, at some time during the derivation,
have been symbols between A and a, but if so, they derived ε and
disappeared. If A can be the rightmost symbol in some sentential form, then $
is in FOLLOW(A).

To compute FIRST(X) for all grammar symbols X, apply the following rules
until no more terminals or ε can be added to any FIRST set:

1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X → Y₁Y₂···Yₖ is a production, then place a in
   FIRST(X) if, for some i, a is in FIRST(Yᵢ) and ε is in all of
   FIRST(Y₁), ..., FIRST(Yᵢ₋₁). If ε is in FIRST(Yⱼ) for all j = 1, ..., k,
   then add ε to FIRST(X).
Now, we can compute FIRST for any string X₁X₂···Xₙ as follows. Add to
FIRST(X₁X₂···Xₙ) all the non-ε symbols of FIRST(X₁). Also add the non-ε
symbols of FIRST(X₂) if ε is in FIRST(X₁), the non-ε symbols of FIRST(X₃)
if ε is in both FIRST(X₁) and FIRST(X₂), and so on. Finally, add ε to
FIRST(X₁X₂···Xₙ) if ε is in FIRST(Xᵢ) for all i.
To compute FOLLOW(A) for all nonterminals A, apply the following rules
until nothing can be added to any FOLLOW set:

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input
   endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε
   is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where
   FIRST(β) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Example 4.17. Consider again grammar (4.11), repeated below:

E  → T E'
E' → + T E' | ε
T  → F T'
T' → * F T' | ε
F  → ( E ) | id
Then:
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(F) = {+, *, ), $}
For example, id and left parenthesis are added to FIRST(F) by rule (3) in
the definition of FIRST with i = 1 in each case, since FIRST(id) = {id} and
FIRST( ( ) = { ( } by rule (1). Then by rule (3) with i = 1, the production
T → FT' implies that id and left parenthesis are in FIRST(T) as well. As
another example, ε is in FIRST(E') by rule (2).
To compute FOLLOW sets, we put $ in FOLLOW(E) by rule (1) for FOLLOW.
By rule (2) applied to production F → (E), the right parenthesis is also
in FOLLOW(E). By rule (3) applied to production E → TE', $ and right
parenthesis are in FOLLOW(E'). Since E' ⇒* ε, they are also in
FOLLOW(T). For a last example of how the FOLLOW rules are applied, the
production E → TE' implies, by rule (2), that everything other than ε in
FIRST(E') must be in FOLLOW(T). □
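The FIRST and FOLLOW rules lend themselves to a straightforward fixed-point computation. The sketch below is our Python, not the book's pseudo-code; run on grammar (4.11), it reproduces the sets discussed in this example.

```python
EPS = 'ε'
GRAMMAR = {                       # nonterminal -> list of bodies (tuples)
    'E':  [('T', "E'")],
    "E'": [('+', 'T', "E'"), ()],
    'T':  [('F', "T'")],
    "T'": [('*', 'F', "T'"), ()],
    'F':  [('(', 'E', ')'), ('id',)],
}

def first_of(seq, FIRST):
    """FIRST of a string of symbols; {ε} for the empty string."""
    out = set()
    for X in seq:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    return out | {EPS}            # every symbol can derive ε

def compute_first(grammar):
    FIRST = {A: set() for A in grammar}
    for bodies in grammar.values():
        for body in bodies:
            for s in body:
                if s not in grammar:
                    FIRST[s] = {s}          # a terminal derives only itself
    changed = True
    while changed:                          # reapply rules until nothing grows
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                add = first_of(body, FIRST)
                if not add <= FIRST[A]:
                    FIRST[A] |= add
                    changed = True
    return FIRST

def compute_follow(grammar, FIRST, start):
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add('$')                  # rule (1)
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in grammar:
                        continue            # only nonterminals have FOLLOW
                    f = first_of(body[i + 1:], FIRST)
                    new = f - {EPS}         # rule (2)
                    if EPS in f:
                        new |= FOLLOW[A]    # rule (3)
                    if not new <= FOLLOW[B]:
                        FOLLOW[B] |= new
                        changed = True
    return FOLLOW

FIRST = compute_first(GRAMMAR)
FOLLOW = compute_follow(GRAMMAR, FIRST, 'E')
```

The loops simply reapply the rules until no set grows, which must terminate because the sets only increase and are bounded by the finite vocabulary.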
The following algorithm can be used to construct a predictive parsing table for
a grammar G. The idea behind the algorithm is the following. Suppose
A → α is a production with a in FIRST(α). Then, the parser will expand A by
α when the current input symbol is a. The only complication occurs when
α = ε or α ⇒* ε. In this case, we should again expand A by α if the current
input symbol is in FOLLOW(A), or if the $ on the input has been reached and
$ is in FOLLOW(A).
Input. Grammar G.
Method.

1. For each production A → α of the grammar, do steps 2 and 3.
2. For each terminal a in FIRST(α), add A → α to M[A, a].
3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in
   FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to
   M[A, $].
4. Make each undefined entry of M be error.
The parsing table produced by Algorithm 4.4 for grammar (4.11) was
shown in Fig. 4.15. □
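The table-building rules of Algorithm 4.4 can be sketched as follows. The FIRST and FOLLOW sets for grammar (4.11) are written out by hand (they are the sets of Example 4.17); the code is our illustration, not the book's algorithm verbatim.

```python
EPS = 'ε'
PRODS = [('E', ('T', "E'")), ("E'", ('+', 'T', "E'")), ("E'", ()),
         ('T', ('F', "T'")), ("T'", ('*', 'F', "T'")), ("T'", ()),
         ('F', ('(', 'E', ')')), ('F', ('id',))]
FIRST = {'E': {'(', 'id'}, "E'": {'+', EPS}, 'T': {'(', 'id'},
         "T'": {'*', EPS}, 'F': {'(', 'id'},
         '+': {'+'}, '*': {'*'}, '(': {'('}, ')': {')'}, 'id': {'id'}}
FOLLOW = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
          "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def first_of(seq):
    """FIRST of a symbol string; contains ε only if every symbol derives ε."""
    out = set()
    for X in seq:
        out |= FIRST[X] - {EPS}
        if EPS not in FIRST[X]:
            return out
    return out | {EPS}

def build_table(prods, follow):
    M = {}
    for A, body in prods:
        f = first_of(body)
        for a in f - {EPS}:              # each terminal a in FIRST(body)
            M.setdefault((A, a), []).append((A, body))
        if EPS in f:                     # body can derive ε: use FOLLOW(A)
            for b in follow[A]:          # FOLLOW already contains $ if needed
                M.setdefault((A, b), []).append((A, body))
    return M

M = build_table(PRODS, FOLLOW)
```

A grammar is LL(1) exactly when no entry of M holds more than one production; grammar (4.11) passes this check.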
LL(1) Grammars
Algorithm 4.4 can be applied to any grammar G to produce a parsing table M.
For some grammars, however, M may have some entries that are multiply
defined. For example, if G is left recursive or ambiguous, then M will have at
least one multiply-defined entry.
Example 4.19. Let us consider grammar (4.13) from Example 4.10 again; in
its left-factored form (4.14), it is:

S  → iEtSS' | a
S' → eS | ε
E  → b
decisions. It can be shown that Algorithm 4.4 produces for every LL(1)
grammar G a parsing table that parses all and only the sentences of G.
LL(1) grammars have several distinctive properties. No ambiguous or left
recursive grammar can be LL(1). It can also be shown that a grammar G is
LL(1) if and only if whenever A → α | β are two distinct productions of G,
the following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in
   FOLLOW(A).
Clearly, grammar (4.11) for arithmetic expressions is LL(1). Grammar
(4.13), modeling ifthenelse statements, is not.
There remains the question of what should be done when a parsing table
has multiplydefined entries. One recourse is to transform the grammar by
eliminating all left recursion and then left factoring whenever possible, hoping
to produce a grammar for which the parsing table has no multiplydefined
entries. Unfortunately, there are some grammars for which no amount of
alteration will yield an LL(1) grammar. Grammar (4.13) is one such example;
its language has no LL(1) grammar at all. As we saw, we can still parse
(4.13) with a predictive parser by arbitrarily making M[S', e] = {S' → eS}. In
general, there are no universal rules by which multiply-defined entries can be
made single-valued without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for
the source language such that a predictive parser can be constructed from the
grammar. Although left-recursion elimination and left factoring are easy to
do, they make the resulting grammar hard to read and difficult to use for
translation purposes. To alleviate some of this difficulty, a common
organization for a parser in a compiler is to use a predictive parser for control
constructs and to use operator precedence (discussed in Section 4.6) for
expressions. However, if an LR parser generator, as discussed in Section 4.9,
is available, one can get all the benefits of predictive parsing and operator
precedence automatically.
The stack of a nonrecursive predictive parser makes explicit the terminals and
nonterminals that the parser hopes to match with the remainder of the input.
We shall therefore refer to symbols on the parser stack in the following dis
cussion. An error is detected during predictive parsing when the terminal on
top of the stack does not match the next input symbol or when nonterminal A
is on top of the stack, a is the next input symbol, and the parsing table entry
M[A, a] is empty.
4. If a nonterminal can generate the empty string, then the production deriv
ing e can be used as a default. Doing so may postpone some error detec
tion, but cannot cause an error to be missed. This approach reduces the
number of nonterminals that have to be considered during error recovery.
attempt to resume parsing. If a token on top of the stack does not match the
input symbol, then we pop the token from the stack, as mentioned above.
On the erroneous input ) id * + id the parser and error recovery mechanism
of Fig. 4.18 behave as in Fig. 4.19. □
S → aABe
A → Abc | b
B → d

The sentence abbcde can be reduced to S by the following steps:

abbcde
aAbcde
aAde
aABe
S
We scan abbcde looking for a substring that matches the right side of some
production. The substrings b and d qualify. Let us choose the leftmost b and
replace it by A, the left side of the production A → b; we thus obtain the string
aAbcde. Now the substrings Abc, b, and d match the right side of some
production. Although b is the leftmost substring that matches the right side of
some production, we choose to replace the substring Abc by A, the left side of
the production A → Abc. We now obtain aAde. Then replacing d by B, the
left side of the production B → d, we obtain aABe. We can now replace this
entire string by S. Thus, by a sequence of four reductions we are able to
reduce abbcde to S. These reductions, in fact, trace out the following
rightmost derivation in reverse:

S ⇒rm aABe ⇒rm aAde ⇒rm aAbcde ⇒rm abbcde          □
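Each line in the sequence above should follow from the one before it by replacing a single occurrence of some production's right side with its left side. A few lines of Python, purely our illustration, can check the sequence mechanically:

```python
PRODS = {'aABe': 'S', 'Abc': 'A', 'b': 'A', 'd': 'B'}   # right side -> left side

def is_one_reduction(before, after):
    """True if `after` results from `before` by one reduction step."""
    for rhs, lhs in PRODS.items():
        i = before.find(rhs)
        while i != -1:                       # try every occurrence of rhs
            if before[:i] + lhs + before[i + len(rhs):] == after:
                return True
            i = before.find(rhs, i + 1)
    return False

steps = ['abbcde', 'aAbcde', 'aAde', 'aABe', 'S']
assert all(is_one_reduction(x, y) for x, y in zip(steps, steps[1:]))
```

Note that the checker accepts any legal reduction, not only handle reductions; choosing the wrong substring, such as the b of aAbcde, leads to strings like aAAcde from which S is unreachable.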
Handles

Informally, a handle of a string is a substring that matches the right side of
a production, and whose reduction to the nonterminal on the left side of the
production represents one step along the reverse of a rightmost derivation. In
many cases the leftmost substring β that matches the right side of some
production A → β is not a handle, because a reduction by the production A → β
yields a string that cannot be reduced to the start symbol. In Example 4.21, if
we replaced b by A in the second string aAbcde we would obtain the string
aAAcde that cannot be subsequently reduced to S. For this reason, we must
give a more precise definition of a handle.
Formally, a handle of a right-sentential form γ is a production A → β and a
position of γ where the string β may be found and replaced by A to produce
the previous right-sentential form in a rightmost derivation of γ. That is, if
S ⇒*rm αAw ⇒rm αβw, then A → β in the position following α is a handle of
αβw. The string w to the right of the handle contains only terminal symbols.
Note we say "a handle" rather than "the handle" because the grammar could
be ambiguous, with more than one rightmost derivation of αβw. If a grammar
is unambiguous, then every right-sentential form of the grammar has exactly
one handle.
In the example above, abbcde is a right-sentential form whose handle is
A → b at position 2. Likewise, aAbcde is a right-sentential form whose handle
is A → Abc at position 2. Sometimes we say "the substring β is a handle of
αβw" if the position of β and the production A → β we have in mind are
clear.
Figure 4.20 portrays the handle A → β in the parse tree of a right-sentential
form αβw. The handle represents the leftmost complete subtree consisting of
a node and all its children. In Fig. 4.20, A is the bottommost leftmost interior
node with all its children in the tree. Reducing β to A in αβw can be thought
of as "pruning the handle," that is, removing the children of A from the parse
tree.
(1) E → E + E
(2) E → E * E          (4.16)
(3) E → ( E )
(4) E → id
and the rightmost derivation

E ⇒rm E + E
  ⇒rm E + E * E
  ⇒rm E + E * id₃
  ⇒rm E + id₂ * id₃
  ⇒rm id₁ + id₂ * id₃

We have subscripted the id's for notational convenience and underlined a
handle of each right-sentential form. For example, id₁ is a handle of the
right-sentential form id₁ + id₂ * id₃ because id is the right side of the
production E → id.
Because grammar (4.16) is ambiguous, there is a second rightmost derivation
of the same string:

E ⇒rm E * E
  ⇒rm E * id₃
  ⇒rm E + E * id₃
  ⇒rm E + id₂ * id₃
  ⇒rm id₁ + id₂ * id₃

These derivations correspond to the two parse trees of Example 4.6. The
first derivation gives * a higher precedence than +, whereas the second gives
+ the higher precedence. □
Handle Pruning
Example 4.23. Consider the grammar (4.16) of Example 4.22 and the input
string id₁ + id₂ * id₃. The sequence of reductions shown in Fig. 4.21
reduces id₁ + id₂ * id₃ to the start symbol E. The reader should observe that
the sequence of right-sentential forms in this example is just the reverse of the
sequence in the first rightmost derivation in Example 4.22. □
side of the appropriate production. The parser repeats this cycle until it has
detected an error or until the stack contains the start symbol and the input is
empty:

Stack      Input
$S         $
After entering this configuration, the parser halts and announces successful
completion of parsing.
Example 4.24. Let us step through the actions a shift-reduce parser might
make in parsing the input string id₁ + id₂ * id₃ according to grammar (4.16),
using the first derivation of Example 4.22. The sequence is shown in Fig.
4.22. Note that because grammar (4.16) has two rightmost derivations for
this input, there is another sequence of steps a shift-reduce parser might take.
that right side is replaced by y. In case (2), A is again replaced first, but this
time the right side is a string y of terminals only. The next rightmost nonter
minal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just
reached the configuration

Stack      Input
$αβγ       yz$

The parser reduces the handle γ to B, entering the configuration

Stack      Input
$αβB       yz$

Since B is the rightmost nonterminal in αβByz, the right end of the handle of
αβByz cannot occur inside the stack. The parser can therefore shift the string
y to get the next handle βBy on top of the stack, and then reduce it to A:

Stack      Input
$αβBy      z$

In case (2), the parser reaches the configuration

Stack      Input
$αγ        xyz$

with the handle γ on top of the stack. After reducing the handle γ to B, the
parser can shift the string xy to get the next handle y on top of the stack:

Stack      Input
$αBxy      z$

Now the parser reduces y to A.
Viable Prefixes
The set of prefixes of right-sentential forms that can appear on the stack of a
shift-reduce parser are called viable prefixes. An equivalent definition of a
viable prefix is that it is a prefix of a right-sentential form that does not
continue past the right end of the rightmost handle of that sentential form. By
this definition, it is always possible to add terminal symbols to the end of a
viable prefix to obtain a right-sentential form.
Example 4.25. An ambiguous grammar can never be LR. For example,
consider the dangling-else grammar (4.7) of Section 4.3. If we have a
shift-reduce parser in the configuration

Stack                            Input
$ ··· if expr then stmt          else ··· $

we cannot tell whether if expr then stmt is the handle, no matter what appears
below it on the stack: the parser might instead have to shift the else and wait
for another stmt to complete the alternative if expr then stmt else stmt. Thus,
we cannot tell whether to shift or reduce in this case, so the grammar is not
LR(1). More generally, no ambiguous grammar, as this one certainly is, can
be LR(k) for any k.
However, if such a shift/reduce conflict is always resolved in favor of
shifting, the parser will behave naturally. We discuss parsers for such
ambiguous grammars in Section 4.8.
Example 4.26. Suppose we have a lexical analyzer that returns token id for
all identifiers, regardless of usage. Suppose also that our language invokes
procedures by giving their names, with parameters surrounded by parentheses,
and that arrays are referenced by the same syntax. Since the translation of
indices in array references and parameters in procedure calls are different, we
want to use different productions to generate lists of actual parameters and
indices. Our grammar might therefore have (among others) productions such
as:
Stack            Input
··· id ( id      , id ) ···
It is evident that the id on top of the stack must be reduced, but by which pro
duction? The correct choice is production (5) if A is a procedure and produc
tion (7) if A is an array. The stack does not tell which; information in the
symbol table obtained from the declaration of A has to be used.
One solution is to change the token id in production (1) to procid and to
use a more sophisticated lexical analyzer that returns token procid when it
recognizes an identifier that is the name of a procedure. Doing so would
change the configuration above to:

Stack                  Input
··· procid ( id        , id ) ···
E → E A E | ( E ) | -E | id
A → + | - | * | / | ↑

is not an operator grammar, because the right side EAE has two (in fact three)
consecutive nonterminals. However, if we substitute for A each of its
alternatives, we obtain the following operator grammar:

E → E + E | E - E | E * E | E / E | E ↑ E | ( E ) | -E | id          (4.17)
pair of terminals and between the endmost terminals and the $'s marking the
ends of the string. For example, suppose we initially have the right-sentential
form id + id * id and the precedence relations are those given in Fig. 4.23.
These relations are some of those that we would choose to parse according to
grammar (4.17).
must be scanned at each step to find the handle. Such is not the case if we
use a stack to store the input symbols already seen and if the precedence rela
tions are used to guide the actions of a shiftreduce parser. If the precedence
relation <• or = holds between the topmost terminal symbol on the stack and
the next input symbol, the parser shifts; it has not yet found the right end of
the handle. If the relation •> holds, a reduction is called for. At this point
the parser has found the right end of the handle, and the precedence relations
can be used to find the left end of the handle in the stack.
Method. Initially, the stack contains $ and the input buffer the string w$. To
parse, we execute the program of Fig. 4.24.
We are always free to create operatorprecedence relations any way we see fit
and hope that the operatorprecedence parsing algorithm will work correctly
when guided by them. For a language of arithmetic expressions such as that
generated by grammar (4.17) we can use the following heuristic to produce a
proper set of precedence relations. Note that grammar (4.17) is ambiguous,
and rightsentential forms could have many handles. Our rules are designed
to select the "proper" handles to reflect a given set of associativity and pre
cedence rules for binary operators.
1. If operator θ₁ has higher precedence than operator θ₂, make θ₁ •> θ₂ and
   θ₂ <• θ₁. For example, if * has higher precedence than +, make
   * •> + and + <• *. These relations ensure that, in an expression of the
   form E + E * E + E, the central E * E is the handle that will be reduced
   first.
( = )
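The shift-reduce loop driven by these relations can be sketched in Python. This is an illustration in the spirit of Fig. 4.24, not the book's program: the table encodes relations among id, +, *, and $ obtained from rule (1) with * given precedence over + and both left associative, and nonterminals are treated anonymously, so the stack holds terminals only.

```python
LT, EQ, GT = '<', '=', '>'
REL = {('id', '+'): GT, ('id', '*'): GT, ('id', '$'): GT,
       ('+', 'id'): LT, ('+', '+'): GT, ('+', '*'): LT, ('+', '$'): GT,
       ('*', 'id'): LT, ('*', '+'): GT, ('*', '*'): GT, ('*', '$'): GT,
       ('$', 'id'): LT, ('$', '+'): LT, ('$', '*'): LT}

def parse(tokens):
    """Return the terminal strings of the handles, in the order reduced."""
    stack, toks, i, handles = ['$'], tokens + ['$'], 0, []
    while not (stack == ['$'] and toks[i] == '$'):
        rel = REL.get((stack[-1], toks[i]))
        if rel in (LT, EQ):
            stack.append(toks[i])        # shift
            i += 1
        elif rel == GT:                  # right end of a handle found
            handle = [stack.pop()]
            while REL.get((stack[-1], handle[0])) != LT:
                handle.insert(0, stack.pop())   # pop back to the <.
            handles.append(''.join(handle))
        else:
            raise SyntaxError('no precedence relation holds')
    return handles
```

On id + id * id it reduces the three id's, then the *, then the +, reflecting the higher precedence of *. As the text notes later, nothing here checks that a popped string is really a production's right side; that check must be added before semantic actions can be attached.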
numerical comparison between f(a) and g(b). Note, however, that error
entries in the precedence matrix are obscured, since one of (1), (2), or (3)
holds no matter what f(a) and g(b) are. The loss of error detection capability
is generally not considered serious enough to prevent the use of precedence
functions where possible; errors can still be caught when a reduction is called
for and no handle can be found.
Example 4.29. The precedence table of Fig. 4.25 has the following pair of
precedence functions.
beginning at the group of fₐ; let g(a) be the length of the longest path
beginning at the group of gₐ. □
Example 4.30. Consider the matrix of Fig. 4.23. There are no = relationships,
so each symbol is in a group by itself. Figure 4.26 shows the graph
constructed using Algorithm 4.6.
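Algorithm 4.6's graph construction and longest-path computation can be sketched as follows. The relation table for id, +, *, and $ corresponds to the matrix of Fig. 4.23; the code itself is our illustration, not the book's.

```python
from functools import lru_cache

REL = {('id', '+'): '>', ('id', '*'): '>', ('id', '$'): '>',
       ('+', 'id'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '$'): '>',
       ('*', 'id'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '$'): '>',
       ('$', 'id'): '<', ('$', '+'): '<', ('$', '*'): '<'}

edges = {}                       # node -> successors; nodes are f_a and g_a
for (a, b), r in REL.items():
    edges.setdefault(('f', a), set())
    edges.setdefault(('g', b), set())
    if r == '<':                 # f(a) < g(b): edge from g_b down to f_a
        edges[('g', b)].add(('f', a))
    elif r == '>':               # f(a) > g(b): edge from f_a down to g_b
        edges[('f', a)].add(('g', b))
    # an = relation would merge the two nodes into one group (none occur here)

@lru_cache(maxsize=None)
def longest(node):
    """Length of the longest path from node; the graph must be acyclic."""
    return max((1 + longest(s) for s in edges[node]), default=0)

f = {a: longest(('f', a)) for a in ('id', '+', '*', '$')}
g = {a: longest(('g', a)) for a in ('id', '+', '*', '$')}
```

The resulting functions satisfy f(a) < g(b) whenever a <• b and f(a) > g(b) whenever a •> b, which is all the parser needs; if the graph had a cycle, no precedence functions would exist.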
Although nonterminals are treated anonymously, they still have places held
for them on the parsing stack. Note also that we never allow adjacent
terminal symbols on the stack in Fig. 4.24 unless they are related by <• or =.
Thus steps (10)-(12) must succeed in making a reduction.
Just because we find a sequence of symbols a <• b₁ = b₂ = ··· = bₖ on
the stack, however, does not mean that b₁b₂···bₖ is the string of terminal
symbols on the right side of some production. We did not check for this
condition in Fig. 4.24, but we clearly can do so, and in fact we must do so if
we wish to associate semantic rules with reductions. Thus we have an
opportunity to detect errors in Fig. 4.24, modified at steps (10)-(12) to determine
what production is the handle in a reduction.
We may divide the error detection and recovery routine into several pieces.
One piece handles errors of type (2). For example, this routine might pop
symbols off the stack just as in steps (10)-(12) of Fig. 4.24. However, as there
is no production to reduce by, no semantic actions are taken; a diagnostic
message is printed instead. To determine what the diagnostic should say, the
routine handling case (2) must decide what production the right side being
popped "looks like." For example, suppose abc is popped, and there is no
production right side consisting of a, b, and c together with zero or more
nonterminals. Then we might consider whether deletion of one of a, b, and c
yields a legal right side (nonterminals omitted). For example, if there were a
right side aEcE, we might issue the diagnostic
We may also find that there is a right side with the proper sequence of ter
minals, but the wrong pattern of nonterminals. For example, if abc is popped
off the stack with no intervening or surrounding nonterminals, and abc is not
a right side but aEbc is, we might issue a diagnostic
The strings that may be popped are those popped in steps (10)-(12) of Fig.
4.24. These are evident in the directed graph whose nodes represent the
terminals, with an edge from a to b if and only if a = b. Then the possible
strings are the labels of the nodes along paths in this graph. Paths consisting
of a single node are possible. However, in order for a path b₁b₂···bₖ to be
"poppable" on some input, there must be a symbol a (possibly $) such that
a <• b₁. Call such a b₁ initial. Also, there must be a symbol c (possibly $)
such that bₖ •> c. Call such a bₖ final. Only then could a reduction take
place.
[Fig. 4.27. Graph for precedence matrix of Fig. 4.25.]
E → E + E | E - E | E * E | E / E | E ↑ E | ( E ) | -E | id
The precedence matrix for this grammar was shown in Fig. 4.25, and its
graph is given in Fig. 4.27. There is only one edge, because the only pair
related by = is the left and right parenthesis. All but the right parenthesis
are initial, and all but the left parenthesis are final. Thus the only paths from
an initial to a final node are the single-node paths +, -, *, /, ↑, and id, and
the path from ( to ) of length two. There are but a finite number, and each
corresponds to the terminals of some production's right side in the grammar.
Thus the error checker for reductions need only check that the proper set of
1. If a binary operator is reduced, it checks that nonterminals appear on
   both sides. If not, it can warn

   missing operand

2. If id is reduced, it checks that there is no nonterminal to the right or left.
   If there is, it can warn

   missing operator

3. If ( ) is reduced, it checks that there is a nonterminal between the
   parentheses. If not, it can say

   no expression between parentheses
If there are an infinity of strings that may be popped, error messages cannot
be tabulated on a case-by-case basis. We might use a general routine to
determine whether some production right side is close to the popped string
(say distance 1 or 2, where distance is measured in symbols inserted, deleted,
or changed), and if so, issue a specific diagnostic on the assumption that that
production was intended. If no production is close to the popped string, we
can issue a general diagnostic to the effect that "something is wrong in the
current line."
We must now discuss the other way in which the operatorprecedence parser
detects errors. When consulting the precedence matrix to decide whether to
shift or reduce (lines (6) and (9) of Fig. 4.24), we may find that no relation
holds between the top stack symbol and the first input symbol. For example,
suppose a and b are the two top stack symbols (b is at the top), c and d are
the next two input symbols, and there is no precedence relation between b
and c.
and c. To recover, we must modify the stack, input or both. We may change
symbols, insert symbols onto the input or stack, or delete symbols from the
input or stack. If we insert or change, we must be careful that we do not get
into an infinite loop, where, for example, we perpetually insert symbols at the
beginning of the input without being able to reduce or to shift any of the
inserted symbols.
One approach that will assure us no infinite loops is to guarantee that after
recovery the current input symbol can be shifted (if the current input is $,
guarantee that no symbol is placed on the input, and the stack is eventually
shortened). For example, given ab on the stack and cd on the input, if
a <• c or a = c, we might pop b from the stack. Another choice is to delete c
from the input if b <• d or b = d. A third choice is to find a symbol e such
that b <• e <• c and insert e in front of c on the input. More generally, we
might insert a string of symbols e₁e₂···eₙ, chosen so that each adjacent pair
is related by <• or =, if a single symbol for insertion could not be found.
should reflect the compiler designer's intuition regarding what error is likely
in each case.
For each blank entry in the precedence matrix we must specify an error
recovery routine; the same routine could be used in several places. Then
when the parser consults the entry for a and b in step (6) of Fig. 4.24, and no
precedence relation holds between a and b, it finds a pointer to the error
recovery routine for this error.
Example 4.32. Consider the precedence matrix of Fig. 4.25 again. In Fig.
4.28, we show the rows and columns of this matrix that have one or more
blank entries, and we have filled in these blanks with the names of
error-handling routines.
erroneous input id + ). The first actions taken by the parser are to shift id,
reduce it to E (we again use E for anonymous nonterminals on the stack), and
then to shift the +. We now have the configuration

Stack      Input
$E+        )$

Since + •> ), a reduction is called for, and the handle is +. The error
checker for reductions is required to inspect for E's to the left and right.
Finding one missing, it issues the diagnostic

missing operand

and does the reduction anyway.

Our configuration is now

Stack      Input
$E         )$

There is no precedence relation between $ and ), and the entry in Fig. 4.28
for this pair of symbols is e2. Routine e2 issues its diagnostic and deletes the
offending right parenthesis, leaving the configuration

Stack      Input
$E         $

□
4.7 LR PARSERS

This section presents an efficient, bottom-up syntax analysis technique that
can be used to parse a large class of context-free grammars. The technique is
called LR(k) parsing; the "L" is for left-to-right scanning of the input, the
"R" for constructing a rightmost derivation in reverse, and the k for the
number of input symbols of lookahead that are used in making parsing
decisions. When (k) is omitted, k is assumed to be 1. LR parsing is attractive
for a variety of reasons.
The principal drawback of the method is that it is too much work to
construct an LR parser by hand for a typical programming-language grammar.
One needs a specialized tool, an LR parser generator. Fortunately, many
such generators are available, and we shall discuss the design and use of one,
Yacc, in Section 4.9. With such a generator, one can write a contextfree
grammar and have the generator automatically produce a parser for that
grammar. If the grammar contains ambiguities or other constructs that are
difficult to parse in a lefttoright scan of the input, then the parser generator
can locate these constructs and inform the compiler designer of their presence.
After discussing the operation of an LR parser, we present three techniques
for constructing an LR parsing table for a grammar. The first method, called
simple LR (SLR for short), is the easiest to implement, but the least powerful
of the three. It may fail to produce a parsing table for certain grammars on
which the other methods succeed. The second method, called canonical LR,
is the most powerful, and the most expensive. The third method, called look
ahead LR (LALR for short), is intermediate in power and cost between the
other two. The LALR method will work on most programminglanguage
grammars and, with some effort, can be implemented efficiently. Some tech
niques for compressing the size of an LR parsing table are considered later in
this section.
and the combination of the state symbol on top of the stack and the current
input symbol are used to index the parsing table and determine the shift-reduce
parsing decision. In an implementation, the grammar symbols need
not appear on the stack; however, we shall always include them in our discussions
to help explain the behavior of an LR parser.
The parsing table consists of two parts, a parsing action function action and
a goto function goto. The program driving the LR parser behaves as follows.
It determines s_m, the state currently on top of the stack, and a_i, the current
input symbol. It then consults action[s_m, a_i], the parsing action table entry for
state s_m and input a_i, which can have one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
[Fig. 4.29. Model of an LR parser: a stack of the form s_0 X_1 s_1 ... X_m s_m, an input buffer a_1 ... a_n $, and the LR parsing program driven by the action and goto tables.]
The function goto takes a state and grammar symbol as arguments and produces
a state. We shall see that the goto function of a parsing table constructed
from a grammar G using the SLR, canonical LR, or LALR method is
essentially the transition function of a deterministic finite automaton that
recognizes the viable prefixes of G.
A configuration of an LR parser is a pair whose first component is the
stack contents and whose second component is the unexpended input:
(s_0 X_1 s_1 X_2 s_2 ... X_m s_m, a_i a_{i+1} ... a_n $)
This configuration represents the right-sentential form
X_1 X_2 ... X_m a_i a_{i+1} ... a_n
in essentially the same way as a shift-reduce parser would; only the presence
of states on the stack is new.
The next move of the parser is determined by reading a_i, the current input
symbol, and s_m, the state on top of the stack, and then consulting the parsing
action table entry action[s_m, a_i]. The configurations resulting after each of
the four types of move are as follows:
1. If action[s_m, a_i] = shift s, the parser executes a shift move, entering the
configuration
(s_0 X_1 s_1 ... X_m s_m a_i s, a_{i+1} ... a_n $)
Here the parser has shifted both the current input symbol a_i and the next
state s, which is given in action[s_m, a_i], onto the stack; a_{i+1} becomes the
current input symbol.
218 SYNTAX ANALYSIS SEC. 4.7
2. If action[s_m, a_i] = reduce A → β, then the parser executes a reduce
move, entering the configuration
(s_0 X_1 s_1 ... X_{m-r} s_{m-r} A s, a_i a_{i+1} ... a_n $)
where s = goto[s_{m-r}, A] and r is the length of β, the right side of the
production. Here the parser first popped 2r symbols off the stack (r state
symbols and r grammar symbols), exposing state s_{m-r}. The parser then
pushed both A, the left side of the production, and s, the entry for
goto[s_{m-r}, A], onto the stack. The current input symbol is not changed
in a reduce move. For the LR parsers we shall construct,
X_{m-r+1} ... X_m, the sequence of grammar symbols popped off the stack,
will always match β, the right side of the reducing production.
3. If action[s_m, a_i] = accept, parsing is completed.
4. If action[s_m, a_i] = error, the parser has discovered an error and calls an
error recovery routine.
Algorithm 4.7. LR parsing algorithm.
Input. An input string w and an LR parsing table with functions action and
goto for a grammar G.
Output. If w is in L(G), a bottom-up parse for w; otherwise, an error indication.
Method. Initially, the parser has s_0 on its stack, where s_0 is the initial state,
and w$ in the input buffer. The parser then executes the program in Fig.
4.30 until an accept or error action is encountered. □
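The driving program just described can be sketched in executable form. The following is our own illustrative Python, not the book's notation; the table is supplied as two dictionaries, and the stack holds states only since, as noted above, the grammar symbols need not appear on the stack.

```python
# A sketch of the LR parsing program, assuming the parsing table is given as
# two dictionaries: action[(state, terminal)] and goto_[(state, nonterminal)].
# Entries are ('shift', s), ('reduce', (A, body)), ('accept',), or absent (error).

def lr_parse(tokens, action, goto_):
    stack = [0]                       # holds states; grammar symbols are implicit
    tokens = tokens + ['$']           # append the right endmarker
    i = 0
    output = []                       # productions used, reverse rightmost order
    while True:
        s, a = stack[-1], tokens[i]
        entry = action.get((s, a))
        if entry is None:             # no entry: the parser has found an error
            raise SyntaxError(f"error in state {s} on input {a!r}")
        if entry[0] == 'shift':
            stack.append(entry[1])    # push the next state; advance the input
            i += 1
        elif entry[0] == 'reduce':
            head, body = entry[1]
            del stack[len(stack) - len(body):]   # pop one state per body symbol
            stack.append(goto_[(stack[-1], head)])
            output.append((head, body))
        else:                         # accept
            return output
```

For instance, with a two-state table for the toy grammar S → a (states and entries made up for illustration), lr_parse(['a'], action, goto_) returns the single reduction by S → a.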
Example 4.33. Figure 4.31 shows the parsing action and goto functions of an
LR parsing table for the following grammar for arithmetic expressions with
binary operators + and *:
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id
The codes for the actions in Fig. 4.31 are: si means shift and stack state i,
rj means reduce by the production numbered j, acc means accept, and blank
means error.
Note that the value of goto[s, a] for terminal a is found in the action field
connected with the shift action on input a for state s. The goto field gives
goto[s, A] for nonterminals A. Also, bear in mind that we have not yet
explained how the entries for Fig. 4.31 were selected; we shall deal with this
issue shortly.
On input id * id + id, the sequence of stack and input contents is shown in
Fig. 4.32. For example, at line (1) the LR parser is in state 0 with id the first
input symbol. The action in row 0 and column id of the action field of Fig.
4.31 is s5, meaning shift and cover the stack with state 5. That is what has
happened at line (2): the first token id and the state symbol 5 have both been
pushed onto the stack, and id has been removed from the input.
Then, * becomes the current input symbol, and the action of state 5 on
input * is to reduce by F → id. Two symbols are popped off the stack (one
state symbol and one grammar symbol). State 0 is then exposed. Since the
goto of state 0 on F is 3, F and 3 are pushed onto the stack. We now have
the configuration in line (3). Each of the remaining moves is determined
similarly. □
[Fig. 4.32. Moves of the LR parser of Fig. 4.31 on input id * id + id.]
An LR parser does not have to scan the entire stack to know when the handle
appears on top. Rather, the state symbol on top of the stack contains all
the information it needs. In principle k, the number of lookahead symbols, can
be arbitrary, but k = 0 or k = 1 covers the cases
of practical interest, and we shall only consider LR parsers with k ≤ 1 here.
For example, the action table in Fig. 4.31 uses one symbol of lookahead. A
grammar that can be parsed by an LR parser examining up to k input symbols
on each move is called an LR(k) grammar.
There is a significant difference between LL and LR grammars. For a
grammar to be LR(k), we must be able to recognize the occurrence of the
right side of a production, having seen all of what is derived from that right
side with k input symbols of lookahead. This requirement is far less stringent
than that for LL(k) grammars where we must be able to recognize the use of
a production seeing only the first k symbols of what its right side derives.
Thus, LR grammars can describe more languages than LL grammars.
An LR(0) item (item for short) of a grammar G is a production of G with a
dot at some position of the right side. Thus, production A → XYZ yields the
four items
A → ·XYZ
A → X·YZ
A → XY·Z
A → XYZ·
The production A → ε generates only one item, A → ·.
An item can be represented by a pair of integers, the first giving the number
of the production and the second the position of the dot. Intuitively, an item
indicates how much of a production we have seen at a given point in the
parsing process.
If G is a grammar with start symbol S, then G', the augmented grammar for G,
is G with a new start symbol S' and production S' → S. The purpose of
this new starting production is to indicate to the parser when it should stop
parsing and announce acceptance of the input. That is, acceptance occurs
when and only when the parser is about to reduce by S' → S.
If I is a set of items for a grammar G, then closure(I) is the set of items
constructed from I by the two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α·Bβ is in closure(I) and B → γ is a production, then add the
item B → ·γ to closure(I), if it is not already there. We apply this rule
until no more new items can be added to closure(I).
For example, consider the augmented expression grammar:
E' → E
E → E + T | T     (4.19)
T → T * F | F
F → (E) | id
If I is the set of one item {[E' → ·E]}, then closure(I) contains the items
E' → ·E
E → ·E + T
E → ·T
T → ·T * F
T → ·F
F → ·(E)
F → ·id
kernel items, of course. Thus, we can represent the sets of items we are
really interested in with very little storage if we throw away all nonkernel
items, since they can be regenerated by the closure process.
The second useful function is goto(I, X) where I is a set of items and X is a
grammar symbol. goto(I, X) is defined to be the closure of the set of all items
[A → αX·β] such that [A → α·Xβ] is in I. Intuitively, if I is the set of items
that are valid for some viable prefix γ, then goto(I, X) is the set of items that
are valid for the viable prefix γX.
As an example, if I is the set of two items {[E' → E·], [E → E· + T]},
then goto(I, +) consists of
E → E + ·T
T → ·T * F
T → ·F
F → ·(E)
F → ·id
We computed goto(I, +) by examining I for items with + immediately to the
right of the dot. E' → E· is not such an item, but E → E· + T is. We moved
the dot over the + to get {E → E + ·T} and then took the closure of this set.
We are now ready to give the algorithm to construct C, the canonical collection
of sets of LR(0) items for an augmented grammar G'; the algorithm is as
follows:
procedure items(G');
begin
    C := {closure({[S' → ·S]})};
    repeat
        for each set of items I in C and each grammar symbol X
                such that goto(I, X) is not empty and not in C do
            add goto(I, X) to C
    until no more sets of items can be added to C
end
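The sets-of-items construction can be sketched in Python. This is illustrative code under our own naming (E1 stands for E', and items are (head, body, dot) triples), not an implementation from the book; it builds the canonical LR(0) collection for grammar (4.19).

```python
# LR(0) closure, goto, and canonical-collection construction, sketched for
# the augmented expression grammar (4.19).

GRAMMAR = [('E1', ('E',)),                        # augmenting production E' -> E
           ('E', ('E', '+', 'T')), ('E', ('T',)),
           ('T', ('T', '*', 'F')), ('T', ('F',)),
           ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMINALS = {'E1', 'E', 'T', 'F'}

def closure(items):
    """Add B -> .gamma for every item with the dot before nonterminal B."""
    items = set(items)
    while True:
        new = {(h, b, 0)
               for (head, body, dot) in items
               if dot < len(body) and body[dot] in NONTERMINALS
               for (h, b) in GRAMMAR if h == body[dot]}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(i, x):
    """Move the dot over x in every applicable item, then take the closure."""
    return closure({(h, b, d + 1) for (h, b, d) in i
                    if d < len(b) and b[d] == x})

def items():
    """The canonical collection: start from closure of E' -> .E, add gotos."""
    symbols = {s for (_, body) in GRAMMAR for s in body}
    c = {closure({('E1', ('E',), 0)})}
    while True:
        new = {g for i in c for x in symbols if (g := goto(i, x))}
        if new <= c:
            return c
        c |= new
```

Running items() on this grammar yields the twelve sets I_0 through I_11 of Fig. 4.35.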
The canonical collection of sets of LR(0) items for grammar (4.19), as shown
in Fig. 4.35, is:
I_0: E' → ·E
     E → ·E + T
     E → ·T
     T → ·T * F
     T → ·F
     F → ·(E)
     F → ·id
I_1: E' → E·
     E → E· + T
I_2: E → T·
     T → T· * F
I_3: T → F·
I_4: F → (·E)
     E → ·E + T
     E → ·T
     T → ·T * F
     T → ·F
     F → ·(E)
     F → ·id
I_5: F → id·
I_6: E → E + ·T
     T → ·T * F
     T → ·F
     F → ·(E)
     F → ·id
I_7: T → T * ·F
     F → ·(E)
     F → ·id
I_8: F → (E·)
     E → E· + T
I_9: E → E + T·
     T → T· * F
I_10: T → T * F·
I_11: F → (E)·
[Fig. 4.36. Transition diagram of the DFA for viable prefixes of grammar (4.19); its states are the sets of items of Fig. 4.35 and its transitions are given by the goto function.]
and goto function are exhibited in Fig. 4.35 and 4.36. Clearly, the string
E + T * is a viable prefix of (4.19). The automaton of Fig. 4.36 will be in
state I_7 after having read E + T *. State I_7 contains the items
T → T * ·F
F → ·(E)
F → ·id
which are precisely the items valid for E + T *. To see this, consider the following
three rightmost derivations
action, the parsing action function, and goto, the goto function, from C using
the following algorithm. It requires us to know FOLLOW(A) for each nonterminal
A of a grammar (see Section 4.4).
Algorithm 4.8. Constructing an SLR parsing table.
Input. An augmented grammar G'.
Output. The SLR parsing table functions action and goto for G'.
Method.
1. Construct C = {I_0, I_1, ..., I_n}, the collection of sets of LR(0) items
for G'.
2. State i is constructed from I_i. The parsing actions for state i are determined
as follows:
a) If [A → α·aβ] is in I_i and goto(I_i, a) = I_j, then set action[i, a] to
"shift j". Here a must be a terminal.
b) If [A → α·] is in I_i, then set action[i, a] to "reduce A → α" for all
a in FOLLOW(A); here A may not be S'.
c) If [S' → S·] is in I_i, then set action[i, $] to "accept".
If any conflicting actions are generated by the above rules, we say the grammar
is not SLR(1). The algorithm fails to produce a parser in this case.
3. The goto transitions for state i are constructed for all nonterminals A
using the rule: If goto(I_i, A) = I_j, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set of items
containing [S' → ·S]. □
The parsing table consisting of the parsing action and goto functions determined
by Algorithm 4.8 is called the SLR(1) table for G. An LR parser using
the SLR(1) table for G is called the SLR(1) parser for G, and a grammar having
an SLR(1) parsing table is said to be SLR(1). We usually omit the "(1)"
after the "SLR," since we shall not deal here with parsers having more than
one symbol of lookahead.
Example 4.38. Let us construct the SLR table for grammar (4.19). The
canonical collection of sets of LR(0) items for (4.19) was shown in Fig. 4.35.
First consider I_0:
E' → ·E
E → ·E + T
E → ·T
T → ·T * F
T → ·F
F → ·(E)
F → ·id
The item F → ·(E) gives rise to the entry action[0, (] = shift 4, and the item
F → ·id to the entry action[0, id] = shift 5. The other items in I_0 yield no
actions. Now consider I_1:
E' → E·
E → E· + T
The first item yields action[1, $] = accept, the second yields action[1, +] =
shift 6. Next consider I_2:
E → T·
T → T· * F
Since FOLLOW(E) = {$, +, )}, the first item makes action[2, $] =
action[2, +] = action[2, )] = reduce E → T. The second item makes
action[2, *] = shift 7. Continuing in this fashion we obtain the parsing action
and goto tables that were shown in Fig. 4.31. In that figure, the numbers of
productions in reduce actions are the same as the order in which they appear
in the grammar above.
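The reduce entries above rely on FOLLOW sets, which can be computed by a simple fixed-point iteration. The sketch below is our own illustration for grammar (4.19), not the book's code; because this grammar has no ε-productions, FIRST of a symbol string is just FIRST of its first symbol.

```python
# Fixed-point computation of FIRST and FOLLOW (Section 4.4) for the
# expression grammar (4.19), which has no epsilon-productions.

GRAMMAR = [('E', ('E', '+', 'T')), ('E', ('T',)),
           ('T', ('T', '*', 'F')), ('T', ('F',)),
           ('F', ('(', 'E', ')')), ('F', ('id',))]
NONTERMINALS = {'E', 'T', 'F'}

def follow_sets(start='E'):
    # FIRST of a terminal is itself; nonterminals start empty.
    first = {a: {a} for (h, b) in GRAMMAR for a in b if a not in NONTERMINALS}
    for a in NONTERMINALS:
        first.setdefault(a, set())
    changed = True
    while changed:                       # FIRST(h) includes FIRST of b[0]
        changed = False
        for h, b in GRAMMAR:
            if not first[b[0]] <= first[h]:
                first[h] |= first[b[0]]
                changed = True
    follow = {a: set() for a in NONTERMINALS}
    follow[start].add('$')               # the endmarker follows the start symbol
    changed = True
    while changed:
        changed = False
        for h, b in GRAMMAR:
            for i, x in enumerate(b):
                if x in NONTERMINALS:
                    # what follows x: FIRST of the next symbol, or FOLLOW(h)
                    trailer = first[b[i + 1]] if i + 1 < len(b) else follow[h]
                    if not trailer <= follow[x]:
                        follow[x] |= trailer
                        changed = True
    return follow
```

For this grammar, follow_sets() yields FOLLOW(E) = {+, ), $}, matching the reduce entries of state 2 above.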
Example 4.39. Every SLR(1) grammar is unambiguous, but there are many
unambiguous grammars that are not SLR(1). Consider the grammar with productions
S → L = R
S → R
L → *R     (4.20)
L → id
R → L
We may think of L and R as standing for l-value and r-value, respectively, and
* as an operator indicating "contents of." The canonical collection of sets of
LR(0) items for grammar (4.20) is shown in Fig. 4.37.
Consider the set of items I_2:
S → L· = R
R → L·
The first item makes action[2, =] = shift 6. But since FOLLOW(R) contains
=, the second item calls for action[2, =] = reduce R → L. Thus, entry
action[2, =] is multiply defined, and there is a shift/reduce conflict on
input symbol =.
Grammar (4.20) is not ambiguous. This shift/reduce conflict arises from
the fact that the SLR parser construction method is not powerful enough to
remember enough left context to decide what action the parser should take on
input =, having seen a string reducible to L. The canonical and LALR
methods, to be discussed next, will succeed on a larger collection of grammars,
including grammar (4.20). It should be pointed out, however, that
there are unambiguous grammars for which every LR parser construction
method will produce a parsing action table with parsing action conflicts.
Fortunately, such grammars can generally be avoided in programming language
applications. □
We shall now present the most general technique for constructing an LR parsing
table from a grammar. Recall that in the SLR method, state i calls for
reduction by A → α if the set of items I_i contains item [A → α·] and input a
is in FOLLOW(A). In some situations, however, the viable prefix on the
stack is such that the reduction by A → α would be invalid on input a.
Example 4.40. Let us reconsider Example 4.39, where in state 2 we had item
R → L·, which could correspond to A → α· above, and a could be the = sign,
which is in FOLLOW(R). However, state 2, which is
the state corresponding to viable prefix L only, should not really call for
reduction of that L to R. □
It is possible to carry more information in the state that will allow us to rule
out some of these invalid reductions by A → α. The extra information is
incorporated into the state by redefining items to include a terminal symbol
as a second component. The general form of an item becomes [A → α·β, a],
where A → αβ is a production and a is a terminal or the right endmarker $.
We call such an object an LR(1) item; the 1 refers
to the length of the second component, called the lookahead of the item. The
lookahead has no effect in an item of the form [A → α·β, a], where β is not
ε, but an item of the form [A → α·, a] calls for a reduction by A → α only if
the next input symbol is a and the item [A → α·, a] is in the state
on top of the stack. The set of such a's will always be a subset of
FOLLOW(A), but it could be a proper subset, as in Example 4.40.
(Lookaheads that are strings of length greater than one are possible, of
course, but we shall not consider them here.)
Formally, we say LR(1) item [A → α·β, a] is valid for a viable prefix γ if
there is a derivation S ⇒*_rm δAw ⇒_rm δαβw, where
1. γ = δα, and
2. either a is the first symbol of w, or w is ε and a is $.
For example, consider the grammar
S → BB
B → aB | b
There is a rightmost derivation S ⇒*_rm aaBab ⇒_rm aaaBab. We see that item
[B → a·B, a] is valid for a viable prefix γ = aaa by letting δ = aa, A = B,
w = ab, α = a, and β = B in the above definition.
There is also a rightmost derivation S ⇒*_rm BaB ⇒_rm BaaB. From this
derivation we see that item [B → a·B, $] is valid for viable prefix Baa.
The method for constructing the collection of sets of valid LR(1) items is
essentially the same as the way we built the canonical collection of sets of
LR(0) items. We only need to modify the two procedures closure and goto.
To appreciate the new definition of the closure operation, consider an item
of the form [A → α·Bβ, a] in the set of items valid for some viable prefix γ.
Then there is a rightmost derivation S ⇒*_rm δAax ⇒_rm δαBβax, where
γ = δα.
Suppose βax derives terminal string by. Then for each production of the form
B → η for some η, we have derivation S ⇒*_rm γBby ⇒_rm γηby. Thus,
[B → ·η, b] is valid for γ. Note that b can be the first terminal derived from
β, or it is possible that β derives ε in the derivation βax ⇒*_rm by, and b can
therefore be a. To summarize both possibilities we say that b can be any terminal
in FIRST(βax), where FIRST is the function from Section 4.4. Note
that x cannot contain the first terminal of by, so FIRST(βax) = FIRST(βa).
We now give the LR(1) sets of items construction.
Algorithm 4.9. Construction of the sets of LR(1) items.
Input. An augmented grammar G'.
Output. The sets of LR(1) items that are the set of items valid for one or
more viable prefixes of G'.
Method. The procedures closure and goto and the main routine items for
constructing the sets of items are shown in Fig. 4.38.
Example 4.42. Consider the following augmented grammar:
S' → S
S → CC     (4.21)
C → cC | d
function closure(I);
begin
    repeat
        for each item [A → α·Bβ, a] in I, each production
                B → γ in G', and each terminal b in FIRST(βa)
                such that [B → ·γ, b] is not in I do
            add [B → ·γ, b] to I
    until no more items can be added to I;
    return I
end;

function goto(I, X);
begin
    let J be the set of items [A → αX·β, a] such that
        [A → α·Xβ, a] is in I;
    return closure(J)
end;

procedure items(G');
begin
    C := {closure({[S' → ·S, $]})};
    repeat
        for each set of items I in C and each grammar symbol X
                such that goto(I, X) is not empty and not in C do
            add goto(I, X) to C
    until no more sets of items can be added to C
end
The initial set of items is
I_0: S' → ·S, $
     S → ·CC, $
     C → ·cC, c/d
     C → ·d, c/d
The brackets have been omitted for notational convenience, and we use the
notation [C → ·cC, c/d] as a shorthand for the two items [C → ·cC, c] and
[C → ·cC, d].
Now we compute goto(I_0, X) for the various values of X. For X = S we
must close the item [S' → S·, $]. No additional closure is possible, since the
dot is at the right end. Thus we have the next set of items:
I_1: S' → S·, $
For X = C we close [S → C·C, $], adding the C-productions
with second component $, yielding:
I_2: S → C·C, $
     C → ·cC, $
     C → ·d, $
Next, let X = c. We must close {[C → c·C, c/d]}. We add the C-productions
with second component c/d, yielding:
I_3: C → c·C, c/d
     C → ·cC, c/d
     C → ·d, c/d
Finally, let X = d, and we wind up with the set of items:
I_4: C → d·, c/d
We have finished considering goto on I_0. We get no new sets from I_1, but
I_2 has goto's on C, c, and d. On C we get:
I_5: S → CC·, $
On c we take the closure of {[C → c·C, $]}, obtaining:
I_6: C → c·C, $
     C → ·cC, $
     C → ·d, $
Note that I_6 differs from I_3 only in second components. We shall see that it
is common for several sets of LR(1) items for a grammar to have the same
first components and differ in their second components. When we construct
the collection of sets of LR(0) items for the same grammar, each set of LR(0)
items will coincide with the set of first components of one or more sets of
LR(1) items. We shall have more to say about this phenomenon when we
discuss LALR parsing.
Continuing with the goto function for I_2, goto(I_2, d) is seen to be:
I_7: C → d·, $
Turning now to I_3, the goto's of I_3 on c and d are I_3 and I_4, respectively,
and goto(I_3, C) is:
I_8: C → cC·, c/d
I_4 and I_5 have no goto's. The goto's of I_6 on c and d are I_6 and I_7,
respectively, and goto(I_6, C) is:
I_9: C → cC·, $
The remaining sets of items yield no goto's, so we are done. Figure 4.39
shows the ten sets of items with their goto's. □
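The LR(1) construction of Algorithm 4.9 can be sketched for grammar (4.21). This is illustrative Python under our own naming (S1 stands for S', and items are (head, body, dot, lookahead) tuples), not the book's code; the FIRST sets are hard-coded since no symbol of this grammar derives ε.

```python
# LR(1) closure, goto, and canonical-collection construction for
# grammar (4.21): S -> CC, C -> cC | d.

GRAMMAR = [('S1', ('S',)), ('S', ('C', 'C')), ('C', ('c', 'C')), ('C', ('d',))]
NONTERMINALS = {'S1', 'S', 'C'}
FIRST = {'S': {'c', 'd'}, 'C': {'c', 'd'}, 'c': {'c'}, 'd': {'d'}}

def closure(items):
    items = set(items)
    while True:
        new = set()
        for (head, body, dot, a) in items:
            if dot < len(body) and body[dot] in NONTERMINALS:
                # lookaheads come from FIRST(beta a); nothing here derives
                # epsilon, so when beta is empty the lookahead is just a
                beta = body[dot + 1:]
                las = FIRST[beta[0]] if beta else {a}
                new |= {(h, b, 0, la) for (h, b) in GRAMMAR
                        if h == body[dot] for la in las}
        if new <= items:
            return frozenset(items)
        items |= new

def goto(i, x):
    return closure({(h, b, d + 1, a) for (h, b, d, a) in i
                    if d < len(b) and b[d] == x})

def items():
    symbols = {s for (_, b) in GRAMMAR for s in b}
    c = {closure({('S1', ('S',), 0, '$')})}
    while True:
        new = {g for i in c for x in symbols if (g := goto(i, x))}
        if new <= c:
            return c
        c |= new
```

Running items() yields exactly the ten sets I_0 through I_9 derived by hand above.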
We now give the rules whereby the LR(1) parsing action and goto functions
are constructed from the sets of LR(1) items. The action and goto functions
are represented by a table as before. The only difference is in the values of
the entries.
Algorithm 4.10. Construction of the canonical LR parsing table.
Input. An augmented grammar G'.
Output. The canonical LR parsing table functions action and goto for G'.
Method.
1. Construct C = {I_0, I_1, ..., I_n}, the collection of sets of LR(1) items
for G'.
2. State i of the parser is constructed from I_i. The parsing actions for state i
are determined as follows:
a) If [A → α·aβ, b] is in I_i and goto(I_i, a) = I_j, then set action[i, a] to
"shift j". Here a must be a terminal.
b) If [A → α·, a] is in I_i, A ≠ S', then set action[i, a] to "reduce A → α".
c) If [S' → S·, $] is in I_i, then set action[i, $] to "accept".
If a conflict results from the above rules, the grammar is said not to be
LR(1), and the algorithm is said to fail.
3. The goto transitions for state i are determined as follows: If
goto(I_i, A) = I_j, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error."
5. The initial state of the parser is the one constructed from the set containing
item [S' → ·S, $]. □
The table formed from the parsing action and goto functions produced by
Algorithm 4.10 is called the canonical LR(1) parsing table. An LR parser
using this table is called a canonical LR(1) parser. If the parsing action
function has no multiply defined entries, then the given grammar is called an
LR(1) grammar.
Example 4.43. The canonical parsing table for the grammar (4.21) is shown
in Fig. 4.40. Productions 1, 2, and 3 are S → CC, C → cC, and C → d,
respectively.
requirement that c or d follow makes sense, since these are the symbols that
could begin strings in c*d. If $ follows the first d, we have an input like ccd,
which is not in the language, and state 4 correctly declares an error if $ is the
next input.
The parser enters state 7 after reading the second d. Then, the parser must
see $ on the input, or it started with a string not of the form c*dc*d. It thus
makes sense that state 7 should reduce by C → d on input $ and declare error
on inputs c or d.
Let us now replace I_4 and I_7 by I_47, the union of I_4 and I_7, consisting of
the set of three items represented by [C → d·, c/d/$]. The goto's on d to I_4
or I_7 from I_0, I_2, I_3, and I_6 now enter I_47. The action of state 47 is to
reduce on any input. The revised parser behaves essentially like the original,
although it might reduce d to C in circumstances where the original would
declare error, for example, on input like ccd or cdcdc. The error will eventually
be caught; in fact, it will be caught before any more input symbols are
shifted.
More generally, we can look for sets of LR(1) items having the same core,
that is, set of first components, and we may merge these sets with common
cores into one set of items. For example, in Fig. 4.39, I_4 and I_7 form such a
pair, with core {C → d·}. Similarly, I_8 and I_9 form another pair, with core
{C → cC·}. Note that, in general, a core is a set of LR(0) items for the grammar
at hand, and that an LR(1) grammar may produce more than two sets of
items with the same core.
Since the core of goto(I, X) depends only on the core of I, the goto's of
merged sets can themselves be merged. Thus, there is no problem revising
the goto function as we merge sets of items. The action functions are modified
to reflect the non-error actions of all sets of items in the merger.
Suppose we have an LR(1) grammar, that is, one whose sets of LR(1) items
produce no parsing action conflicts. If we replace all states having the same
core with their union, it is possible that the resulting union will have a conflict,
but it is unlikely for the following reason: Suppose in the union there is a
conflict on lookahead a because there is an item [A → α·, a] calling for a
reduction by A → α, and there is another item [B → β·aγ, b] calling for a
shift. Then some set of items from which the union was formed has item
[A → α·, a], and since the cores of all these states are the same, it must have
an item [B → β·aγ, c] for some c. But then this state has the same
shift/reduce conflict on a, and the grammar was not LR(1) as we assumed.
Thus, the merging of states with common cores can never produce a
shift/reduce conflict that was not present in one of the original states, because
shift actions depend only on the core, not the lookahead.
It is possible, however, that a merger will produce a reduce/reduce conflict,
as the following example shows.
Example 4.44. Consider the grammar
S' → S
S → aAd | bBd | aBe | bAe
A → c
B → c
which generates the four strings acd, ace, bcd, and bce. The reader can
check that the grammar is LR(1) by constructing the sets of items. Upon
doing so, we find the set of items {[A → c·, d], [B → c·, e]} valid for viable
prefix ac and {[A → c·, e], [B → c·, d]} valid for bc. Neither of these sets
generates a conflict, and their cores are the same. However, their union,
which is
A → c·, d/e
B → c·, d/e
generates a reduce/reduce conflict, since reductions by both A → c and
B → c are called for on inputs d and e. □
We are now prepared to give the first of two LALR table construction algorithms.
The general idea is to construct the sets of LR(1) items, and if no
conflicts arise, merge sets with common cores. We then construct the parsing
table from the collection of merged sets of items. The method we are about
to describe serves primarily as a definition of LALR(1) grammars. Constructing
the entire collection of LR(1) sets of items requires too much space and
time to be useful in practice.
Algorithm 4.11. An easy, but space-consuming LALR table construction.
Input. An augmented grammar G'.
Output. The LALR parsing table functions action and goto for G'.
Method.
1. Construct C = {I_0, I_1, ..., I_n}, the collection of sets of LR(1) items.
2. For each core present among the sets of LR(1) items, find all sets having
that core, and replace these sets by their union.
3. Let C' = {J_0, J_1, ..., J_m} be the resulting sets of LR(1) items. The
parsing actions for state i are constructed from J_i in the same manner as
in Algorithm 4.10. If there is a parsing action conflict, the algorithm
fails to produce a parser, and the grammar is said not to be LALR(1).
4. The goto table is constructed as follows. If J is the union of one or more
sets of LR(1) items, that is, J = I_1 ∪ I_2 ∪ ... ∪ I_k, then the cores of
goto(I_1, X), goto(I_2, X), ..., goto(I_k, X) are the same, since I_1, I_2,
..., I_k all have the same core. Let K be the union of all sets of items
having the same core as goto(I_1, X). Then goto(J, X) = K. □
The table produced by Algorithm 4.11 is called the LALR parsing table for
G. If there are no parsing action conflicts, then the given grammar is said to
be LALR(1).
Example 4.45. Again consider grammar (4.21) whose goto graph was
shown in Fig. 4.39. As we mentioned, there are three pairs of sets of items
that can be merged. I_3 and I_6 are replaced by their union:
I_36: C → c·C, c/d/$
      C → ·cC, c/d/$
      C → ·d, c/d/$
I_4 and I_7 are replaced by their union:
I_47: C → d·, c/d/$
and I_8 and I_9 are replaced by their union:
I_89: C → cC·, c/d/$
The LALR action and goto functions for the condensed sets of items are
shown in Fig. 4.41.
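The merging step can be illustrated concretely. Below, the ten sets of Fig. 4.39 are written out as (item, lookahead) pairs in an ad hoc string notation of our own, and merged by core; this is an illustration, not the book's representation.

```python
# The ten LR(1) sets of items of Fig. 4.39 for grammar (4.21), as
# (item, lookahead) pairs, and the core-merging step of Algorithm 4.11.

SETS = {
    0: {("S'->.S", '$'), ('S->.CC', '$'),
        ('C->.cC', 'c'), ('C->.cC', 'd'), ('C->.d', 'c'), ('C->.d', 'd')},
    1: {("S'->S.", '$')},
    2: {('S->C.C', '$'), ('C->.cC', '$'), ('C->.d', '$')},
    3: {('C->c.C', 'c'), ('C->c.C', 'd'),
        ('C->.cC', 'c'), ('C->.cC', 'd'), ('C->.d', 'c'), ('C->.d', 'd')},
    4: {('C->d.', 'c'), ('C->d.', 'd')},
    5: {('S->CC.', '$')},
    6: {('C->c.C', '$'), ('C->.cC', '$'), ('C->.d', '$')},
    7: {('C->d.', '$')},
    8: {('C->cC.', 'c'), ('C->cC.', 'd')},
    9: {('C->cC.', '$')},
}

def merge_by_core(sets):
    """Group the sets by core (the set of first components) and union them."""
    merged = {}                        # core -> union of the item sets
    for s in sets.values():
        core = frozenset(item for (item, _) in s)
        merged.setdefault(core, set()).update(s)
    return merged
```

Merging collapses the three pairs {I_3, I_6}, {I_4, I_7}, and {I_8, I_9}, leaving seven sets; the merged d-state carries the combined lookaheads c/d/$, as in I_47 above.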
On the erroneous input ccd$, the canonical LR parser of Fig. 4.40 will put
0 c 3 c 3 d 4
on the stack, and in state 4 will discover an error, because $ is the next input
symbol and state 4 has action error on $. In contrast, the LALR parser of
Fig. 4.41 will make the corresponding moves, putting
0 c 36 c 36 d 47
on the stack. But state 47 on input $ has action reduce C → d. The LALR
parser will thus change its stack to
0 c 36 c 36 C 89
Now the action of state 89 on input $ is reduce C → cC. The stack becomes
0 c 36 C 89
whereupon state 89 again calls for reduction by C → cC, and the stack becomes
0 C 2
Finally, state 2 has action error on input $, so the error is now discovered.
There are several modifications we can make to Algorithm 4.11 to avoid constructing
the full collection of sets of LR(1) items in the process of creating an
LALR(1) parsing table. The first observation is that we can represent a set of
items I by its kernel, that is, by those items that are either the initial item
[S' → ·S, $], or that have the dot somewhere other than at the beginning of
the right side.
Second, we can compute the parsing actions generated by I from the kernel
alone. Any item calling for a reduction by A → α will be in the kernel unless
α = ε. Reduction by A → ε is called for on input a if and only if there is a
kernel item [B → γ·Cδ, b] such that C ⇒* Aη for some η, and a is in
FIRST(ηδb). Similarly, a shift on input a is called for if there is a kernel
item [B → γ·Cδ, b] with C ⇒* ax in a derivation in which the last step does
not use an ε-production. The set of such a's can also be precomputed for
each C.
Here is how the goto transitions for I can be computed from the kernel. If
[B → γ·Xδ, b] is in the kernel of I, then [B → γX·δ, b] is in the kernel of
goto(I, X). In addition, if [B → γ·Cδ, b] is in the kernel of I, C ⇒* Aη for
some η, and A → Xβ is a production,
then [A → X·β, b] will also be in goto(I, X). We say, in this case, that
lookaheads propagate from B → γ·Cδ to A → X·β. A simple method to determine
when an LR(1) item in I generates a lookahead in goto(I, X) spontaneously,
and when lookaheads propagate, is contained in the next algorithm.
Algorithm 4.12. Determining lookaheads.
Input. The kernel K of a set of LR(0) items I and a grammar symbol X.
Output. The lookaheads spontaneously generated by items in I for kernel
items in goto(I, X) and the items in I from which lookaheads are propagated
to kernel items in goto(I, X).
Method.
for each item B → γ·δ in K do begin
    J := closure({[B → γ·δ, #]});
    if [A → α·Xβ, a] is in J, and a is not #, then
        lookahead a is generated spontaneously for item
            A → αX·β in goto(I, X);
    if [A → α·Xβ, #] is in J, then
        lookaheads propagate from B → γ·δ in I to item
            A → αX·β in goto(I, X)
end
Here # is a dummy lookahead symbol not in the grammar. □
There are many different approaches to computing the lookaheads, all of
which in some sense keep track of "new" lookaheads that have propagated to
an item but which have not yet propagated out. The next algorithm describes
one technique to propagate lookaheads to all items.
Algorithm 4.13. Efficient computation of the kernels of the LALR(1) collection
of sets of items.
Input. An augmented grammar G'.
Output. The kernels of the LALR(1) collection of sets of items for G'.
Method.
1. Using the method outlined above, construct the kernels of the sets of
LR(0) items for G'.
2. Apply Algorithm 4.12 to the kernel of each set of LR(0) items and grammar
symbol X to determine which lookaheads are spontaneously generated
for kernel items in goto(I, X), and from which items in I lookaheads
are propagated to kernel items in goto(I, X).
3. Initialize a table that gives, for each kernel item in each set of items, the
associated lookaheads. Initially, each item has associated with it only
those lookaheads that we determined in (2) were generated spontaneously.
4. Make repeated passes over the kernel items in all sets. When we visit an
item i, we look up the kernel items to which i propagates its lookaheads,
using information tabulated in (2). The current set of lookaheads for i is
added to those already associated with each of the items to which i propagates
its lookaheads. We continue making passes over the kernel items
until no more new lookaheads are propagated. □
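Step (4) is a fixed-point computation, which can be sketched generically. The propagation graph below (items A, B, C) is hypothetical, made up purely for illustration; in practice it would come from step (2).

```python
# Step (4) of Algorithm 4.13: repeated passes that push each item's current
# lookaheads along its propagation links until nothing changes.

def propagate(spontaneous, links):
    """spontaneous: item -> initial lookaheads (from step (3));
    links: item -> list of items it propagates to (from step (2))."""
    look = {item: set(las) for item, las in spontaneous.items()}
    changed = True
    while changed:                 # one pass over all kernel items
        changed = False
        for src, dests in links.items():
            for dest in dests:
                if not look[src] <= look[dest]:
                    look[dest] |= look[src]
                    changed = True
    return look

# Hypothetical propagation graph: A feeds B, and B feeds C.
look = propagate({'A': {'$'}, 'B': {'='}, 'C': set()},
                 {'A': ['B'], 'B': ['C'], 'C': []})
```

Here the spontaneous lookahead $ of A reaches B on the first pass and C through B, so the final table is A: {$}, B: {=, $}, C: {=, $}, mirroring the way $ spreads in the example that follows.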
Example 4.47. Let us construct the kernels of the LALR(1) items for the
grammar in the previous example. The kernels of the LR(0) items were
shown in Fig. 4.42. When we apply Algorithm 4.12 to the kernel of set of
items I_0, we first compute closure({[S' → ·S, #]}), which is
S' → ·S, #
S → ·L = R, #
S → ·R, #
L → ·*R, #/=
L → ·id, #/=
R → ·L, #
Two items in this closure cause lookaheads to be generated spontaneously.
Item [L → ·*R, =] causes lookahead = to be spontaneously generated for
kernel item L → *·R in I_4, and item [L → ·id, =] causes = to be spontaneously
generated for kernel item L → id· in I_5. The propagation of lookaheads
determined in step (2) is summarized in Fig. 4.44 for I_1 through I_8.
In Fig. 4.45, we show steps (3) and (4) of Algorithm 4.13. The column
labeled INIT shows the spontaneously generated lookaheads for each kernel
item. On the first pass, the lookahead $ propagates from S' → ·S in I_0 to the
six items listed in Fig. 4.44. The lookahead = propagates from L → *·R in I_4
and L → id· in I_5, but these lookaheads are already present. In the second and
third passes, the only new lookahead propagated is $, discovered for the successors
of I_2 and I_4 on pass 2 and for the successor of I_6 on pass 3. No new
lookaheads are propagated on pass 4, so the final set of lookaheads is shown
in the rightmost column of Fig. 4.45.
* s7
any r2
State 3 has only error and r4 entries. We can replace the former by the latter,
so the list for state 3 consists of only the pair (any, r4). States 5, 10, and 11
SEC. 4.8 USING AMBIGUOUS GRAMMARS 247
If the reader totals up the number of entries in the lists created in this
example and the previous one, and then adds the pointers from states to
action lists and from nonterminals to next-state lists, he will not be impressed
with the space savings over the matrix implementation of Fig. 4.31. We
should not be misled by this small example, however. For practical grammars,
the space needed for the list representation is typically less than ten percent
of that needed for the matrix representation.
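The list representation can be illustrated in miniature with state 2 of Fig. 4.31: a shift on *, with every other input defaulting to the reduction, since, as noted above, error entries may safely be replaced by reductions, the error still being caught before the next shift. The encoding below is our own sketch.

```python
# A parsing-table row as a list of (input, action) pairs ending in a default.
# State 2 of Fig. 4.31: shift 7 on *, otherwise reduce by production 2.

STATE2 = [('*', 's7'), ('any', 'r2')]

def lookup(row, symbol):
    """Scan the list; 'any' acts as the default entry for the state."""
    for key, act in row:
        if key == symbol or key == 'any':
            return act
    return 'error'        # a row with no default falls through to error
```

Thus lookup(STATE2, '*') gives s7 while every other input gives r2, matching the two-pair list for state 2 shown above.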
We should also point out that the table-compression methods for finite
automata that were discussed in Section 3.9 can also be used to represent LR
parsing tables. Application of these methods is discussed in the exercises.
In all cases we specify disambiguating rules that allow only one parse tree for
each sentence. In this way, the overall language specification still remains
unambiguous. We also stress that ambiguous constructs should be used sparingly
and in a strictly controlled fashion; otherwise, there can be no guarantee
as to what language is recognized by a parser.