Compilers
Principles, Techniques, and Tools

ALFRED V. AHO
AT&T Bell Laboratories
Murray Hill, New Jersey

RAVI SETHI
AT&T Bell Laboratories
Murray Hill, New Jersey

JEFFREY D. ULLMAN
Stanford University
Stanford, California

ADDISON-WESLEY PUBLISHING COMPANY
Reading, Massachusetts • Menlo Park, California
Don Mills, Ontario • Wokingham, England • Amsterdam • Sydney
Singapore • Tokyo • Mexico City • Bogota • Santiago • San Juan
Mark S. Dalton/Publisher
James T. DeWolf/Sponsoring Editor

Bette J. Aaronson/Production Supervisor


Hugh Crawford/Manufacturing Supervisor
Jean Depoian/Cover Design and Illustration
Karen Guardino/Managing Editor

This book is in the Addison-Wesley series in Computer Science


Michael A. Harrison/Consulting Editor

Library of Congress Cataloging in Publication Data

Aho, Alfred V.
Compilers, principles, techniques, and tools.

Bibliography: p.
Includes index.
1. Compiling (Electronic computers) I. Sethi,
Ravi. II. Ullman, Jeffrey D., 1942- . III. Title.

QA76.76.C65A37 1985 005.4'53 85-15647


ISBN 0-201-10088-6

Reprinted with corrections March, 1988

Reproduced by Addison-Wesley from camera-ready copy supplied by the authors.

Copyright © 1986 by Bell Telephone Laboratories, Incorporated.

All rights reserved. No part of this publication may be reproduced, stored in a


retrieval system, or transmitted, in any form or by any means, electronic, mechanical,
photocopying, recording, or otherwise, without the prior written permission of the pub-
lisher. Printed in the United States of America. Published simultaneously in Canada.

UNIX is a trademark of AT&T Bell Laboratories. DEC, PDP, and VAX are trade-
marks of Digital Equipment Corporation. Ada is a trademark of the Ada Joint Pro-
gram Office, Department of Defense, United States Government.

Preface

This book is a descendant of Principles of Compiler Design by Alfred V. Aho

and Jeffrey D. Ullman. Like its ancestor, it is intended as a text for a first
course in compiler design. The emphasis is on solving problems universally
encountered in designing a language translator, regardless of the source or tar-
get machine.
Although few people are likely to build or even maintain a compiler for a
major programming language, the reader can profitably apply the ideas and
techniques discussed in this book to general software design. For example,
the string matching techniques for building lexical analyzers have also been
used in text editors, information retrieval systems, and pattern recognition
programs. Context-free grammars and syntax-directed definitions have been
used to build many little languages such as the typesetting and figure drawing
systems that produced this book. The techniques of code optimization have
been used in program verifiers and in programs that produce "structured"
programs from unstructured ones.

Use of the Book

The major topics in compiler design are covered in depth. The first chapter
introduces the basic structure of a compiler and is essential to the rest of the
book.
Chapter 2 presents a translator from infix to postfix expressions, built using
some of the basic techniques described in this book. Many of the remaining
chapters amplify the material in Chapter 2.

Chapter 3 covers lexical analysis, regular expressions, finite-state machines,


and scanner-generator tools. The material in this chapter is broadly applicable
to text-processing.
Chapter 4 covers the major parsing techniques in depth, ranging from the
recursive-descent methods that are suitable for hand implementation to the
computationally more intensive LR techniques that have been used in parser
generators.
Chapter 5 introduces the principal ideas in syntax-directed translation. This
chapter is used in the remainder of the book for both specifying and imple-

menting translations.
Chapter 6 presents the main ideas for performing static semantic checking.
Type checking and unification are discussed in detail.

Chapter 7 discusses storage organizations used to support the run-time


environment of a program.
Chapter 8 begins with a discussion of intermediate languages and then
shows how common programming language constructs can be translated into
intermediate code.
Chapter 9 covers target code generation. Included are the basic "on-the-
fly" code generation methods, as well as optimal methods for generating code
for expressions. Peephole optimization and code-generator generators are also
covered.
Chapter 10 is a comprehensive treatment of code optimization. Data-flow
analysis methods are covered in detail, as well as the principal methods for
global optimization.
Chapter 11 discusses some pragmatic issues that arise in implementing a
compiler. Software engineering and testing are particularly important in com-
piler construction.
Chapter 12 presents case studies of compilers that have been constructed
using some of the techniques presented in this book.
Appendix A describes a simple language, a "subset" of Pascal, that can be
used as the basis of an implementation project.
The authors have taught both introductory and advanced courses, at the
undergraduate and graduate levels, from the material in this book at AT&T
Bell Laboratories, Columbia, Princeton, and Stanford.


An introductory compiler course might cover material from the following
sections of this book:


on type equivalence, overloading, polymorphism, and unification in Chapter


6, the material on run-time storage organization in Chapter 7, the pattern-
directed code generation methods discussed in Chapter 9, and material on
code optimization from Chapter 10.

Exercises

As before, we rate exercises with stars. Exercises without stars test under-
standing of definitions, singly starred exercises are intended for more
advanced courses, and doubly starred exercises are food for thought.

Acknowledgments

At various stages in the writing of this book, a number of people have given
us invaluable comments on the manuscript. In this regard we owe a debt of
gratitude to Bill Appelbe, Nelson Beebe, Jon Bentley, Lois Bogess, Rodney
Farrow, Stu Feldman, Charles Fischer, Chris Fraser, Art Gittelman, Eric
Grosse, Dave Hanson, Fritz Henglein, Robert Henry, Gerard Holzmann,
Steve Johnson, Brian Kernighan, Ken Kubota, Daniel Lehmann, Dave Mac-
Queen, Dianne Maki, Alan Martin, Doug McIlroy, Charles McLaughlin, John
Mitchell, Elliott Organick, Robert Paige, Phil Pfeiffer, Rob Pike, Kari-Jouko
Raiha, Dennis Ritchie, Sriram Sankar, Paul Stoecker, Bjarne Stroustrup, Tom
Szymanski, Kim Tracy, Peter Weinberger, Jennifer Widom, and Reinhard
Wilhelm.
This book was phototypeset by the authors using the excellent software
available on the UNIX system. The typesetting command read

pic files | tbl | eqn | troff -ms


pic is Brian Kernighan's language for typesetting figures; we owe Brian a
special debt of gratitude for accommodating our special and extensive figure-
drawing needs so cheerfully. tbl is Mike Lesk's language for laying out
tables. eqn is Brian Kernighan and Lorinda Cherry's language for typesetting
mathematics. troff is Joe Ossana's program for formatting text for a photo-
typesetter, which in our case was a Mergenthaler Linotron 202/N. The ms
package of troff macros was written by Mike Lesk. In addition, we
managed the text using make due to Stu Feldman. Cross references within
the text were maintained using awk created by Al Aho, Brian Kernighan, and
Peter Weinberger, and sed created by Lee McMahon.
The authors would particularly like to acknowledge Patricia Solomon for
helping prepare the manuscript for photocomposition. Her cheerfulness and
expert typing were greatly appreciated. J. D. Ullman was supported by an
Einstein Fellowship of the Israeli Academy of Arts and Sciences during part of
the time in which this book was written. Finally, the authors would like to
thank AT&T Bell Laboratories for its support during the preparation of the
manuscript.
A. V. A., R. S., J. D. U.
Contents

Chapter 1 Introduction to Compiling 1

1.1 Compilers 1

1.2 Analysis of the source program 4


1.3 The phases of a compiler 10
1.4 Cousins of the compiler 16
1.5 The grouping of phases 20
1.6 Compiler-construction tools 22
Bibliographic notes 23

Chapter 2 A Simple One-Pass Compiler 25

2.1 Overview 25
2.2 Syntax definition 26
2.3 Syntax-directed translation 33
2.4 Parsing 40
2.5 A translator for simple expressions 48
2.6 Lexical analysis 54
2.7 Incorporating a symbol table 60
2.8 Abstract stack machines 62
2.9 Putting the techniques together 69
Exercises 78
Bibliographic notes 81

Chapter 3 Lexical Analysis 83

3.1 The role of the lexical analyzer 84


3.2 Input buffering 88
3.3 Specification of tokens 92
3.4 Recognition of tokens 98
3.5 A language for specifying lexical analyzers 105
3.6 Finite automata 113
3.7 From a regular expression to an NFA 121
3.8 Design of a lexical analyzer generator 128
3.9 Optimization of DFA-based pattern matchers 134
Exercises 146
Bibliographic notes 157

Chapter 4 Syntax Analysis 159

4.1 The role of the parser 160


4.2 Context-free grammars 165
4.3 Writing a grammar 172
4.4 Top-down parsing 181
4.5 Bottom-up parsing 195
4.6 Operator-precedence parsing 203
4.7 LR parsers 215
4.8 Using ambiguous grammars 247
4.9 Parser generators 257
Exercises 267
Bibliographic notes 277

Chapter 5 Syntax-Directed Translation 279

5.1 Syntax-directed definitions 280


5.2 Construction of syntax trees 287
5.3 Bottom-up evaluation of S-attributed definitions 293
5.4 L-attributed definitions 296
5.5 Top-down translation 302
5.6 Bottom-up evaluation of inherited attributes 308
5.7 Recursive evaluators 316
5.8 Space for attribute values at compile time 320
5.9 Assigning space at compiler-construction time 323
5.10 Analysis of syntax-directed definitions 329
Exercises 336
Bibliographic notes 340

Chapter 6 Type Checking 343

6.1 Type systems 344


6.2 Specification of a simple type checker 348
6.3 Equivalence of type expressions 352
6.4 Type conversions 359
6.5 Overloading of functions and operators 361
6.6 Polymorphic functions 364
6.7 An algorithm for unification 376
Exercises 381
Bibliographic notes 386

Chapter 7 Run-Time Environments 389

7.1 Source language issues 389


7.2 Storage organization 396
7.3 Storage-allocation strategies 401
7.4 Access to nonlocal names 411

7.5 Parameter passing 424


7.6 Symbol tables 429
7.7 Language facilities for dynamic storage allocation 440
7.8 Dynamic storage allocation techniques 442
7.9 Storage allocation in Fortran 446
Exercises 455
Bibliographic notes 461

Chapter 8 Intermediate Code Generation 463

8.1 Intermediate languages 464


8.2 Declarations 473
8.3 Assignment statements 478
8.4 Boolean expressions 488
8.5 Case statements 497
8.6 Backpatching 500
8.7 Procedure calls 506
Exercises 508
Bibliographic notes 511

Chapter 9 Code Generation 513

9.1 Issues in the design of a code generator 514


9.2 The target machine 519
9.3 Run-time storage management 522
9.4 Basic blocks and flow graphs 528
9.5 Next-use information 534
9.6 A simple code generator 535
9.7 Register allocation and assignment 541
9.8 The dag representation of basic blocks 546
9.9 Peephole optimization 554
9.10 Generating code from dags 557
9.11 Dynamic programming code-generation algorithm 567
9.12 Code-generator generators 572
Exercises 580
Bibliographic notes 583

Chapter 10 Code Optimization 585

10.1 Introduction 586


10.2 The principal sources of optimization 592
10.3 Optimization of basic blocks 598
10.4 Loops in flow graphs 602
10.5 Introduction to global data-flow analysis 608
10.6 Iterative solution of data-flow equations 624
10.7 Code-improving transformations 633
10.8 Dealing with aliases 648

10.9 Data-flow analysis of structured flow graphs 660


10.10 Efficient data-flow algorithms 671
10.11 A tool for data-flow analysis 680
10.12 Estimation of types 694
10.13 Symbolic debugging of optimized code 703
Exercises 711
Bibliographic notes 718

Chapter 11 Want to Write a Compiler? 723

11.1 Planning a compiler 723


11.2 Approaches to compiler development 725
11.3 The compiler-development environment 729
11.4 Testing and maintenance 731

Chapter 12 A Look at Some Compilers 733

12.1 EQN, a preprocessor for typesetting mathematics 733


12.2 Compilers for Pascal 734
12.3 The C compilers 735
12.4 The Fortran H compilers 737
12.5 The Bliss/11 compiler 740
12.6 Modula-2 optimizing compiler 742

Appendix A Compiler Project 745

A.l Introduction 745


A.2 A Pascal subset 745
A.3 Program structure 745
A.4 Lexical conventions 748
A.5 Suggested exercises 749
A.6 Evolution of the interpreter 750
A. 7 Extensions 751

Bibliography 752

Index 780
CHAPTER 1

Introduction
to Compiling

The principles and techniques of compiler writing are so pervasive that the
ideas found in this book will be used many times in the career of a computer
scientist. Compiler writing spans programming languages, machine architec-
ture, language theory, algorithms, and software engineering. Fortunately, a
few basic compiler-writing techniques can be used to construct translators for
a wide variety of languages and machines. In this chapter, we introduce the
subject of compiling by describing the components of a compiler, the environ-
ment in which compilers do their job, and some software tools that make it
easier to build compilers.

1.1 COMPILERS
Simply stated, a compiler is a program that reads a program written in one

language - the source language - and translates it into an equivalent program


in another language - the target language (see Fig. 1.1). As an important part
of this translation process, the compiler reports to its user the presence of
errors in the source program.

source program → compiler → target program, with error messages as a second output

Fig. 1.1. A compiler.

At first glance, the variety of compilers may appear overwhelming. There


are thousands of source languages, ranging from traditional programming
languages such as Fortran and Pascal to specialized languages that have arisen
in virtually every area of computer application. Target languages are equally
as varied; a target language may be another programming language, or the
machine language of any computer between a microprocessor and a

supercomputer. Compilers are sometimes classified as single-pass, multi-pass,


load-and-go, debugging, or optimizing, depending on how they have been con-
structed or on what function they are supposed to perform. Despite this
apparent complexity, the basic tasks that any compiler must perform are
essentially the same. By understanding these tasks, we can construct com-
pilers for a wide variety of source languages and target machines using the
same basic techniques.
Our knowledge about how to organize and write compilers has increased
vastly since the first compilers started to appear in the early 1950's. It is diffi-

cult to give an exact date for the first compiler because initially a great deal of
experimentation and implementation was done independently by several
groups. Much of the early work on compiling dealt with the translation of
arithmetic formulas into machine code.
Throughout the 1950's, compilers were considered notoriously difficult pro-
grams to write. The first Fortran compiler, for example, took 18 staff-years
to implement (Backus et al. [1957]). We have since discovered systematic
techniques for handling many of the important tasks that occur during compi-
lation. Good implementation languages, programming environments, and
software tools have also been developed. With these advances, a substantial
compiler can be implemented even as a student project in a one-semester
compiler-design course.

The Analysis-Synthesis Model of Compilation


There are two parts to compilation: analysis and synthesis. The analysis part
breaks up the source program into constituent pieces and creates an intermedi-
ate representation of the source program. The synthesis part constructs the
desired target program from the intermediate representation. Of the two
parts, synthesis requires the most specialized techniques. We shall consider
analysis informally in Section 1.2 and outline the way target code is syn-
thesized in a standard compiler in Section 1.3.
During analysis, the operations implied by the source program are deter-
mined and recorded in a hierarchical structure called a tree. Often, a special
kind of tree called a syntax tree is used, in which each node represents an
operation and the children of a node represent the arguments of the operation.
For example, a syntax tree for an assignment statement is shown in Fig. 1.2.

Fig. 1.2. Syntax tree for position := initial + rate * 60.



Many software tools that manipulate source programs first perform some
kind of analysis. Some examples of such tools include:

1. Structure editors. A structure editor takes as input a sequence of com-
mands to build a source program. The structure editor not only performs
the text-creation and modification functions of an ordinary text editor,
but it also analyzes the program text, putting an appropriate hierarchical
structure on the source program. Thus, the structure editor can perform
additional tasks that are useful in the preparation of programs. For
example, it can check that the input is correctly formed, can supply key-
words automatically (e.g., when the user types while, the editor supplies
the matching do and reminds the user that a conditional must come
between them), and can jump from a begin or left parenthesis to its
matching end or right parenthesis. Further, the output of such an editor
is often similar to the output of the analysis phase of a compiler.

2. Pretty printers. A pretty printer analyzes a program and prints it in such
a way that the structure of the program becomes clearly visible. For
example, comments may appear in a special font, and statements may
appear with an amount of indentation proportional to the depth of their
nesting in the hierarchical organization of the statements.

3. Static checkers. A static checker reads a program, analyzes it, and


attempts to discover potential bugs without running the program. The
analysis portion is often similar to that found in optimizing compilers of
the type discussed in Chapter 10. For example, a static checker may
detect that parts of the source program can never be executed, or that a
certain variable might be used before being defined. In addition, it can
catch logical errors such as trying to use a real variable as a pointer,
employing the type-checking techniques discussed in Chapter 6.

4. Interpreters. Instead of producing a target program as a translation, an


interpreter performs the operations implied by the source program. For
an assignment statement, for example, an interpreter might build a tree
like Fig. 1.2, and then carry out the operations at the nodes as it "walks"
the tree. At the root it would discover it had an assignment to perform,
so it would call a routine to evaluate the expression on the right, and then
store the resulting value in the location associated with the identifier
position. At the right child of the root, the routine would discover it
had to compute the sum of two expressions. It would call itself recur-
sively to compute the value of the expression rate * 60. It would then
add that value to the value of the variable initial.
Interpreters are frequently used to execute command languages, since
each operator executed in a command language is usually an invocation of
a complex routine such as an editor or compiler. Similarly, some "very
high-level" languages, like APL, are normally interpreted because there
are many things about the data, such as the size and shape of arrays, that
cannot be deduced at compile time.
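As a concrete sketch of the tree walk just described, the following C fragment evaluates a tree like the one in Fig. 1.2. The node layout, the little table of variable values, and all of the names here are invented for this illustration; they are not taken from any particular interpreter.

    #include <stdio.h>
    #include <string.h>

    /* One node of a syntax tree: an operator ('=', '+', '*'), a number,
       or an identifier naming a variable (layout invented for this sketch). */
    enum Kind { OP, NUM, ID };

    struct Node {
        enum Kind kind;
        char op;                  /* '=', '+', or '*' when kind == OP */
        double value;             /* constant when kind == NUM        */
        const char *name;         /* identifier when kind == ID       */
        struct Node *left, *right;
    };

    /* A tiny table of variable values, standing in for real storage. */
    static struct { const char *name; double value; } env[16];
    static int nenv = 0;

    static double lookup(const char *name) {
        for (int i = 0; i < nenv; i++)
            if (strcmp(env[i].name, name) == 0) return env[i].value;
        return 0.0;
    }

    static void store(const char *name, double v) {
        for (int i = 0; i < nenv; i++)
            if (strcmp(env[i].name, name) == 0) { env[i].value = v; return; }
        env[nenv].name = name; env[nenv].value = v; nenv++;
    }

    /* Walk the tree: evaluate the children of a node, then apply its operator;
       '=' stores the value of the right subtree under the left child's name. */
    static double eval(struct Node *n) {
        switch (n->kind) {
        case NUM: return n->value;
        case ID:  return lookup(n->name);
        case OP:
            if (n->op == '=') {
                double v = eval(n->right);
                store(n->left->name, v);
                return v;
            }
            if (n->op == '+') return eval(n->left) + eval(n->right);
            return eval(n->left) * eval(n->right);   /* '*' */
        }
        return 0.0;
    }

    int main(void) {
        /* The tree of Fig. 1.2: position := initial + rate * 60 */
        struct Node sixty   = { NUM, 0,   60.0, NULL,       NULL, NULL };
        struct Node rate    = { ID,  0,   0,    "rate",     NULL, NULL };
        struct Node times   = { OP,  '*', 0,    NULL,       &rate, &sixty };
        struct Node initial = { ID,  0,   0,    "initial",  NULL, NULL };
        struct Node plus    = { OP,  '+', 0,    NULL,       &initial, &times };
        struct Node pos     = { ID,  0,   0,    "position", NULL, NULL };
        struct Node assign  = { OP,  '=', 0,    NULL,       &pos, &plus };

        store("initial", 10.0);
        store("rate", 2.0);
        eval(&assign);
        printf("position = %g\n", lookup("position"));   /* prints 130 */
        return 0;
    }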

Traditionally, we think of a compiler as a program that translates a source


language like Fortran into the assembly or machine language of some com-
puter. However, there are seemingly unrelated places where compiler technol-
ogy is regularly used. The analysis portion in each of the following examples
is similar to that of a conventional compiler.

1. Text formatters. A text formatter takes input that is a stream of charac-


ters, most of which is text to be typeset, but some of which includes com-
mands to indicate paragraphs, figures, or mathematical structures like
subscripts and superscripts. We mention some of the analysis done by
text formatters in the next section.

2. Silicon compilers. A silicon compiler has a source language that is similar


or identical to a conventional programming language. However, the vari-
ables of the language represent, not locations in memory, but logical sig-
nals (0 or 1) or groups of signals in a switching circuit. The output is a
circuit design in an appropriate language. See Johnson [1983], Ullman
[1984], or Trickey [1985] for a discussion of silicon compilation.

3. Query interpreters. A query interpreter translates a predicate containing
relational and boolean operators into commands to search a database for
records satisfying that predicate. (See Ullman [1982] or Date [1986].)

The Context of a Compiler

In addition to a compiler, several other programs may be required to create an
executable target program. A source program may be divided into modules
stored in separate files. The task of collecting the source program is some-
times entrusted to a distinct program, called a preprocessor. The preprocessor
may also expand shorthands, called macros, into source language statements.
Figure 1.3 shows a typical "compilation." The target program created by
the compiler may require further processing before it can be run. The com-
piler in Fig. 1.3 creates assembly code that is translated by an assembler into
machine code and then linked together with some library routines into the
code that actually runs on the machine.
We shall consider the components of a compiler in the next two sections;
the remaining programs in Fig. 1.3 are discussed in Section 1.4.

1.2 ANALYSIS OF THE SOURCE PROGRAM


In this section, we introduce analysis and illustrate its use in some text-

formatting languages. The subject is treated in more detail in Chapters 2-4


and 6. In compiling, analysis consists of three phases:

1. Linear analysis, in which the stream of characters making up the source
program is read from left-to-right and grouped into tokens that are
sequences of characters having a collective meaning.

Syntax Analysis

Hierarchical analysis is called parsing or syntax analysis. It involves grouping
the tokens of the source program into grammatical phrases that are used by
the compiler to synthesize output. Usually, the grammatical phrases of the
source program are represented by a parse tree such as the one shown in Fig.
1.4.

1. If identifier₁ is an identifier, and expression₂ is an expression, then

       identifier₁ := expression₂

   is a statement.

2. If expression₁ is an expression and statement₂ is a statement, then

       while ( expression₁ ) do statement₂
       if ( expression₁ ) then statement₂

   are statements.

The division between lexical and syntactic analysis is somewhat arbitrary.


We usually choose a division that simplifies the overall task of analysis. One
factor in determining the division is whether a source language construct is

inherently recursive or not. Lexical constructs do not require recursion, while


syntactic constructs often do. Context-free grammars are a formalization of
recursive rules that can be used to guide syntactic analysis. They are intro-
duced in Chapter 2 and studied extensively in Chapter 4.
For example, recursion is not required to recognize identifiers, which are
typically strings of letters and digits beginning with a letter. We would nor-
mally recognize identifiers by a simple scan of the input stream, waiting until
a character that was neither a letter nor a digit was found, and then grouping
all the letters and digits found up to that point into an identifier token. The
characters so grouped are recorded in a table, called a symbol table, and
removed from the input so that processing of the next token can begin.
On the other hand, this kind of linear scan is not powerful enough to
analyze expressions or statements. For example, we cannot properly match
parentheses in expressions, or begin and end in statements, without putting
some kind of hierarchical or nesting structure on the input.
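A minimal C sketch of such a scan appears below; the function and variable names are invented for the illustration. It collects the letters and digits of one identifier into a lexeme and advances the input past them.

    #include <ctype.h>
    #include <stdio.h>

    /* If an identifier (a letter followed by letters and digits) starts at *p,
       copy its lexeme into lexeme[], advance *p past it, and return 1. */
    static int get_identifier(const char **p, char *lexeme, int max) {
        const char *s = *p;
        int n = 0;
        if (!isalpha((unsigned char)*s))
            return 0;                            /* must begin with a letter */
        while ((isalpha((unsigned char)*s) || isdigit((unsigned char)*s)) && n + 1 < max)
            lexeme[n++] = *s++;                  /* group letters and digits */
        lexeme[n] = '\0';
        *p = s;                                  /* consume the grouped characters */
        return 1;
    }

    int main(void) {
        const char *input = "rate60 + initial";
        char lexeme[64];
        if (get_identifier(&input, lexeme, sizeof lexeme))
            printf("identifier token, lexeme %s\n", lexeme);   /* prints rate60 */
        return 0;
    }

In a real lexical analyzer, the lexeme collected in this way would then be recorded in the symbol table, as described above.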

Fig. 1.5. Semantic analysis inserts a conversion from integer to real.

The parse tree in Fig. 1.4 describes the syntactic structure of the input. A
more common internal representation of this syntactic structure is given by the
syntax tree in Fig. 1.5(a). A syntax tree is a compressed representation of the
parse tree in which the operators appear as the interior nodes, and the
operands of an operator are the children of the node for that operator. The
construction of trees such as the one in Fig. 1.5(a) is discussed in Section 5.2.

We shall take up in Chapter 2, and in more detail in Chapter 5, the subject of


syntax-directed translation, in which the compiler uses the hierarchical struc-
ture on the input to help generate the output.

Semantic Analysis

The semantic analysis phase checks the source program for semantic errors
and gathers type information for the subsequent code-generation phase. It
uses the hierarchical structure determined by the syntax-analysis phase to
identify the operators and operands of expressions and statements.
An important component of semantic analysis is type checking. Here the
compiler checks that each operator has operands that are permitted by the
source language specification. For example, many programming language
definitions require a compiler to report an error every time a real number is

used to index an array. However, the language specification may permit some
operand coercions, for example, when a binary arithmetic operator is applied
to an integer and real. In this case, the compiler may need to convert the
integer to a real. Type checking and semantic analysis are discussed in
Chapter 6.

Example 1.1. Inside a machine, the bit pattern representing an integer is gen-
erally different from the bit pattern for a real, even if the integer and the real
number happen to have the same value. Suppose, for example, that all iden-
tifiers in Fig. 1.5 have been declared to be reals and that 60 by itself is

assumed to be an integer. Type checking of Fig. 1.5(a) reveals that * is


applied to a real, rate, and an integer, 60. The general approach is to con-
vert the integer into a real. This has been achieved in Fig. 1.5(b) by creating
an extra node for the operator inttoreal that explicitly converts an integer into
a real. Alternatively, since the operand of inttoreal is a constant, the com-
piler may instead replace the integer constant by an equivalent real constant.
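The coercion step of Example 1.1 can be sketched in C as follows; the node layout and the helper names are invented for the illustration and stand in for whatever representation a real compiler would use.

    #include <stdlib.h>

    enum Type { INTEGER, REAL };
    enum Op   { LEAF, MUL, INTTOREAL };

    struct Expr {
        enum Op   op;
        enum Type type;
        struct Expr *left, *right;
    };

    /* Wrap an integer-valued operand in an explicit inttoreal node. */
    static struct Expr *inttoreal(struct Expr *e) {
        struct Expr *n = malloc(sizeof *n);
        n->op = INTTOREAL;
        n->type = REAL;               /* the conversion yields a real */
        n->left = e;
        n->right = NULL;
        return n;
    }

    /* Type-check a '*' node: if one operand is real and the other an integer,
       convert the integer, as in Fig. 1.5(b); the result takes the common type. */
    static void check_mul(struct Expr *mul) {
        if (mul->left->type == REAL && mul->right->type == INTEGER)
            mul->right = inttoreal(mul->right);   /* rate * 60 -> rate * inttoreal(60) */
        else if (mul->left->type == INTEGER && mul->right->type == REAL)
            mul->left = inttoreal(mul->left);
        mul->type = mul->left->type;
    }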

Analysis in Text Formatters

It is useful to regard the input to a text formatter as specifying a hierarchy of


boxes that are rectangular regions to be filled by some bit pattern, represent-
ing light and dark pixels to be printed by the output device.
For example, the TeX system (Knuth [1984a]) views its input this way.
Each character that is not part of a command represents a box containing the
bit pattern for that character in the appropriate font and size. Consecutive
characters not separated by "white space" (blanks or newline characters) are
grouped into words, consisting of a sequence of horizontally arranged boxes,
shown schematically in Fig. 1.6. The grouping of characters into words (or
commands) is the linear or lexical aspect of analysis in a text formatter.
Boxes in TeX may be built from smaller boxes by arbitrary horizontal and
vertical combinations. For example,

\hbox{ <list of boxes> }



a sub {i sup 2}

results in a_{i²}. Grouping the operators sub and sup into tokens is part of the
lexical analysis of EQN text. However, the syntactic structure of the text is
needed to determine the size and placement of a box.

1.3 THE PHASES OF A COMPILER


Conceptually, a compiler operates in phases, each of which transforms the
source program from one representation to another. A typical decomposition

of a compiler is shown in Fig. 1.9. In practice, some of the phases may be


grouped together, as mentioned in Section 1.5, and the intermediate represen-
tations between the grouped phases need not be explicitly constructed.

source program → lexical analyzer → syntax analyzer → semantic analyzer →
intermediate code generator → code optimizer → code generator → target program
(the symbol-table manager and the error handler interact with all of these phases)

Fig. 1.9. Phases of a compiler.

The first three phases, forming the bulk of the analysis portion of a com-
piler, were introduced in the last section. Two other activities, symbol-table
management and error handling, are shown interacting with the six phases of
lexical analysis, syntax analysis, semantic analysis, intermediate code genera-
tion, code optimization, and code generation. Informally, we shall also call
the symbol-table manager and the error handler "phases."



Symbol-Table Management

An essential function of a compiler is to record the identifiers used in the


source program and collect information about various attributes of each iden-
tifier. These attributes may provide information about the storage allocated
for an identifier, its type, its scope (where in the program it is valid), and, in

the case of procedure names, such things as the number and types of its argu-
ments, the method of passing each argument (e.g., by reference), and the type
returned, if any.
A symbol table is a data structure containing a record for each identifier,
with fields for the attributes of the identifier. The data structure allows us to
find the record for each identifier quickly and to store or retrieve data from
that record quickly. Symbol tables are discussed in Chapters 2 and 7.

When an identifier in the source program is detected by the lexical


analyzer, the identifier is entered into the symbol table. However, the attri-

butes of an identifier cannot normally be determined during lexical analysis.


For example, in a Pascal declaration like

var position, initial, rate : real ;

the type real is not known when position, initial, and rate are seen by
the lexical analyzer.
The remaining phases enter information about identifiers into the symbol
table and then use this information in various ways. For example, when
doing semantic analysis and intermediate code generation, we need to know
what the types of identifiers are, so we can check that the source program
uses them in valid ways, and so that we can generate the proper operations on
them. The code generator typically enters and uses detailed information about
the storage assigned to identifiers.
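As a small illustration, a symbol table of this kind might be sketched in C as below. The particular fields, their sizes, and the linear-search lookup are only assumptions made for the example; Chapters 2 and 7 discuss more realistic organizations.

    #include <string.h>

    #define MAXSYM 512

    struct SymEntry {
        char name[32];     /* the lexeme                                   */
        char type[16];     /* e.g. "real"; filled in by later phases       */
        int  offset;       /* storage location, set by the code generator  */
    };

    static struct SymEntry symtab[MAXSYM];
    static int nsyms = 0;

    /* Return the index of name, entering it if it is not already present. */
    int sym_lookup_or_insert(const char *name) {
        for (int i = 0; i < nsyms; i++)
            if (strcmp(symtab[i].name, name) == 0)
                return i;
        strncpy(symtab[nsyms].name, name, sizeof symtab[nsyms].name - 1);
        symtab[nsyms].type[0] = '\0';    /* type not yet known during lexical analysis */
        symtab[nsyms].offset  = -1;
        return nsyms++;
    }

The lexical analyzer would call sym_lookup_or_insert when it first sees position, initial, or rate; processing the declaration var position, initial, rate : real would later fill in the type field, and the code generator would record the storage assigned to each identifier.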

Error Detection and Reporting

Each phase can encounter errors. However, after detecting an error, a phase
must somehow deal with that error, so that compilation can proceed, allowing
further errors in the source program to be detected. A compiler that stops
when it finds the first error is not as helpful as it could be.
The syntax and semantic analysis phases usually handle a large fraction of
the errors detectable by the compiler. The lexical phase can detect errors
where the characters remaining in the input do not form any token of the
language. Errors where the token stream violates the structure rules (syntax)
of the language are determined by the syntax analysis phase. During semantic
analysis the compiler tries to detect constructs that have the right syntactic
structure but no meaning to the operation involved, e.g., if we try to add two
identifiers, one of which is the name of an array, and the other the name of a
procedure. We discuss the handling of errors by each phase in the part of the
book devoted to that phase.

The Analysis Phases


As translation progresses, the compiler's internal representation of the source
program changes. We illustrate these representations by considering the

translation of the statement

position := initial + rate * 60                                (1.1)

Figure 1.10 shows the representation of this statement after each phase.
The lexical analysis phase reads the characters in the source program and
groups them into a stream of tokens in which each token represents a logically
cohesive sequence of characters, such as an identifier, a keyword (if, while,
etc.), a punctuation character, or a multi-character operator like :=. The
character sequence forming a token is called the lexeme for the token.
Certain tokens will be augmented by a "lexical value." For example, when
an identifier like rate is found, the lexical analyzer not only generates a
token, say id, but also enters the lexeme rate into the symbol table, if it is

not already there. The lexical value associated with this occurrence of id
points to the symbol-table entry for rate.
In this section, we shall use id1, id2, and id3 for position, initial, and
rate, respectively, to emphasize that the internal representation of an identif-
ier is different from the character sequence forming the identifier. The
representation of (1.1) after lexical analysis is therefore suggested by:

id1 := id2 + id3 * 60                                          (1.2)

We should also make up tokens for the multi-character operator := and the
number 60 to reflect their internal representation, but we defer that until
Chapter 2. Lexical analysis is covered in detail in Chapter 3.
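As an illustration of what the token stream (1.2) might look like inside the compiler, the following C declarations give each token a kind and, for an id, a lexical value pointing into the symbol table. The names and the particular symbol-table indices are invented for the example.

    enum TokenKind { TOK_ID, TOK_ASSIGN, TOK_PLUS, TOK_TIMES, TOK_NUM };

    struct Token {
        enum TokenKind kind;
        int symbol;        /* symbol-table index when kind == TOK_ID */
        long value;        /* numeric value when kind == TOK_NUM     */
    };

    /* position := initial + rate * 60 after lexical analysis */
    static struct Token stream[] = {
        { TOK_ID,     0, 0 },    /* id1: entry for position */
        { TOK_ASSIGN, 0, 0 },
        { TOK_ID,     1, 0 },    /* id2: entry for initial  */
        { TOK_PLUS,   0, 0 },
        { TOK_ID,     2, 0 },    /* id3: entry for rate     */
        { TOK_TIMES,  0, 0 },
        { TOK_NUM,    0, 60 },
    };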
The second and third phases, syntax and semantic analysis, have also been
introduced in Section 1.2. Syntax analysis imposes a hierarchical structure on
the token stream, which we shall portray by syntax trees as in Fig. 1.11(a). A
typical data structure for the tree is shown in Fig. 1.11(b) in which an interior
node is a record with a field for the operator and two fields containing
pointers to the records for the left and right children. A leaf is a record with
two or more fields, one to identify the token at the leaf, and the others to

record information about the token. Additional information about language


constructs can be kept by adding more fields to the records for nodes. We
discuss syntax and semantic analysis in Chapters 4 and 6, respectively.

Intermediate Code Generation

After syntax and semantic analysis, some compilers generate an explicit inter-
mediate representation of the source program. We can think of this inter-
mediate representation as a program for an abstract machine. This intermedi-
ate representation should have two important properties: it should be easy to
produce, and easy to translate into the target program.
The intermediate representation can have a variety of forms. In Chapter 8,
[Fig. 1.10: the statement position := initial + rate * 60 as it is transformed by
each phase, beginning with the lexical analyzer (id1 := id2 + id3 * 60), the syntax
analyzer, and the semantic analyzer; the symbol table holds position, initial,
and rate. Fig. 1.11: the syntax tree for the statement and a corresponding data
structure.]

    temp1 := id3 * 60.0
    id1 := id2 + temp1                                         (1.4)

There is nothing wrong with this simple algorithm, since the problem can be
fixed during the code-optimization phase. That is, the compiler can deduce
that the conversion of 60 from integer to real representation can be done once
and for all at compile time, so the inttoreal operation can be eliminated.
Moreover, temp3 is used only once, to transmit its value to id1. It then
becomes safe to substitute id1 for temp3, whereupon the last statement of
(1.3) is not needed and the code of (1.4) results.
There is great variation in the amount of code optimization different com-
pilers perform. In those that do the most, called "optimizing compilers," a
significant fraction of the time of the compiler is spent on this phase. How-
ever, there are simple optimizations that significantly improve the running
time of the target program without slowing down compilation too much.
Many of these are discussed in Chapter 9, while Chapter 10 gives the technol-
ogy used by the most powerful optimizing compilers.

Code Generation
The final phase of the compiler is the generation of target code, consisting
normally of relocatable machine code or assembly code. Memory locations
are selected for each of the variables used by the program. Then, intermedi-
ate instructions are each translated into a sequence of machine instructions
that perform the same task. A crucial aspect is the assignment of variables to
registers.
For example, using registers 1 and 2, the translation of the code of (1.4)
might become

    MOVF id3, R2
    MULF #60.0, R2
    MOVF id2, R1                                               (1.5)
    ADDF R2, R1
    MOVF R1, id1

The first and second operands of each instruction specify a source and destina-
tion, respectively. The F in each instruction tells us that instructions deal with
floating-point numbers. This code moves the contents of the address¹ id3
into register 2, then multiplies it with the real-constant 60.0. The # signifies
that 60.0 is to be treated as a constant. The third instruction moves id2 into
register 1 and adds to it the value previously computed in register 2. Finally,
the value in register 1 is moved into the address of id1, so the code imple-
ments the assignment in Fig. 1.10. Chapter 9 covers code generation.

¹ We have side-stepped the important issue of storage allocation for the identifiers in the source
program. As we shall see in Chapter 7, the organization of storage at run-time depends on the
language being compiled. Storage-allocation decisions are made either during intermediate code
generation or during code generation.

1.4 COUSINS OF THE COMPILER


As we saw in Fig. 1.3, the input to a compiler may be produced by one or
more preprocessors, and further processing of the compiler's output may be
needed before running machine code is obtained. In this section, we discuss
the context in which a compiler typically operates.

Preprocessors

Preprocessors produce input to compilers. They may perform the following


functions:

1. Macro processing. A preprocessor may allow a user to define macros that


are shorthands for longer constructs.

2. File inclusion. A preprocessor may include header files into the program
text. For example, the C preprocessor causes the contents of the file

<global.h> to replace the statement #include <global.h> when it

processes a file containing this statement.

3. "Rational" preprocessors. These processors augment older languages


with more modern flow-of-control and data-structuring facilities. For
example, such a preprocessor might provide the user with built-in macros
for constructs like while-statements or if-statements, where none exist in
the programming language itself.

4. Language extensions. These processors attempt to add capabilities to the


language by what amounts to built-in macros. For example, the language
Equel (Stonebraker et al. [1976]) is a database query language embedded
in C. Statements beginning with ## are taken by the preprocessor to be
database-access statements, unrelated to C, and are translated into pro-
cedure calls on routines that perform the database access.

Macro processors deal with two kinds of statement: macro definition and
macro use. Definitions are normally indicated by some unique character or
keyword, like define or macro. They consist of a name for the macro
being defined and a body, forming its definition. Often, macro processors
permit formal parameters in their definition, that is, symbols to be replaced by
values (a "value" is a string of characters, in this context). The use of a
macro consists of naming the macro and supplying actual parameters, that is,
values for its formal parameters. The macro processor substitutes the actual
parameters for the formal parameters in the body of the macro; the
transformed body then replaces the macro use itself.

Example 1.2. The TeX typesetting system mentioned in Section 1.2 contains a


general macro facility. Macro definitions take the form

\define < macro name> <template> {<body>}


A macro name is any string of letters preceded by a backslash. The template

is any string of characters, with strings of the form #1, #2, ... , #9
regarded as formal parameters. These symbols may also appear in the body,
any number of times. For example, the following macro defines a citation for
the Journal of the ACM.
\define\JACM #1;#2;#3.
{{\sl J. ACM} {\bf #1}:#2, pp. #3.}

The macro name is \JACM, and the template is "#1;#2;#3."; semicolons


separate the parameters and the last parameter is followed by a period. A use
of this macro must take the form of the template, except that arbitrary strings
may be substituted for the formal parameters." Thus, we may write

\JACM 17;4;715-728.
and expect to see

J. ACM 17:4, pp. 715-728.

The portion of the body {\sl J. ACM} calls for an italicized ("slanted") "J.
ACM". Expression {\bf #1} says that the first actual parameter is to be
made boldface; this parameter is intended to be the volume number.
TeX allows any punctuation or string of text to separate the volume, issue,
and page numbers in the definition of the \JACM macro. We could even have
used no punctuation at all, in which case TeX would take each actual parame-
ter to be a single character or a string surrounded by { }.
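The substitution step that a macro processor performs can be sketched in C as follows. The function below replaces the formal parameters #1 through #9 in a macro body with supplied actual parameters; the names, the fixed-size buffer, and the plain-text body (loosely modeled on the \JACM example above) are all invented for the illustration.

    #include <stdio.h>
    #include <string.h>

    /* Copy body into out, replacing each #1..#9 by the corresponding actual
       parameter.  out must be able to hold max characters including the '\0'. */
    static void expand(const char *body, const char *actual[], int nactual,
                       char *out, int max) {
        int n = 0;
        for (const char *p = body; *p && n + 1 < max; p++) {
            if (p[0] == '#' && p[1] >= '1' && p[1] <= '9' && p[1] - '1' < nactual) {
                const char *a = actual[p[1] - '1'];
                int len = (int)strlen(a);
                if (n + len < max) { memcpy(out + n, a, len); n += len; }
                p++;                       /* skip the digit of the formal parameter */
            } else {
                out[n++] = *p;             /* ordinary text is copied unchanged */
            }
        }
        out[n] = '\0';
    }

    int main(void) {
        const char *actual[] = { "17", "4", "715-728" };
        char out[128];
        expand("J. ACM #1:#2, pp. #3.", actual, 3, out, sizeof out);
        printf("%s\n", out);               /* J. ACM 17:4, pp. 715-728. */
        return 0;
    }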

Assemblers

Some compilers produce assembly code, as in (1.5), that is passed to an


assembler for further processing. Other compilers perform the job of the
assembler, producing relocatable machine code that can be passed directly to
the loader/link-editor. We assume the reader has some familiarity with what
an assembly language looks like and what an assembler does; here we shall
review the relationship between assembly and machine code.
Assembly code is a mnemonic version of machine code, in which names are
used instead of binary codes for operations, and names are also given to
memory addresses. A typical sequence of assembly instructions might be

MOV a, R1
ADD #2, R1 (1.6)
    MOV R1, b

This code moves the contents of the address a into register 1, then adds the
constant 2 to it, treating the contents of register 1 as a fixed-point number.

² Well, almost arbitrary strings, since a simple left-to-right scan of the macro use is made, and as
soon as a symbol matching the text following a #i symbol in the template is found, the preceding
string is deemed to match #i. Thus, if we tried to substitute ab;cd for #1, we would find that
only ab matched #1 and cd was matched to #2.

and finally stores the result in the location named by b. Thus, it computes
b := a + 2.
It is customary for assembly languages to have macro facilities that are simi-

lar to those in the macro preprocessors discussed above.

Two-Pass Assembly

The simplest form of assembler makes two passes over the input, where a pass
consists of reading an input file once. In the first pass, all the identifiers that
denote storage locations are found and stored in a symbol table (separate from
that of the compiler). Identifiers are assigned storage locations as they are
encountered for the first time, so after reading (1.6), for example, the symbol
table might contain the entries shown in Fig. 1.12. In that figure, we have
assumed that a word, consisting of four bytes, is set aside for each identifier,
and that addresses are assigned starting from byte 0.

Identifier Address
a 0
b 4

Fig. 1.12. An assembler's symbol table with identifiers of (1.6).

In the second pass, the assembler scans the input again. This time, it

translates each operation code into the sequence of bits representing that
operation in machine language, and it translates each identifier representing a
location into the address given for that identifier in the symbol table.
The output of the second pass is usually relocatable machine code, meaning
that it can be loaded starting at any location L in memory; i.e., if L is added

to all addresses in the code, then all references will be correct. Thus, the out-
put of the assembler must distinguish those portions of instructions that refer
to addresses that can be relocated.
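The first pass can be sketched in C as follows; the names and sizes are invented for the illustration, with one four-byte word set aside for each identifier as in Fig. 1.12.

    #include <string.h>

    struct AsmSym { char name[16]; int address; };

    static struct AsmSym asmtab[256];
    static int nasm = 0;

    /* Enter name in the assembler's symbol table the first time it is seen,
       assigning it the next free word; return its address in either case. */
    int assign_address(const char *name) {
        for (int i = 0; i < nasm; i++)
            if (strcmp(asmtab[i].name, name) == 0)
                return asmtab[i].address;
        strncpy(asmtab[nasm].name, name, sizeof asmtab[nasm].name - 1);
        asmtab[nasm].address = 4 * nasm;        /* words at bytes 0, 4, 8, ... */
        return asmtab[nasm++].address;
    }

Applied to the operands of (1.6), this gives a the address 0 and b the address 4, as in Fig. 1.12; the second pass then replaces each identifier by the address recorded here.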

Example 1.3. The following is a hypothetical machine code into which the
assembly instructions (1.6) might be translated.

    0001 01 00 00000000 *
    0011 01 10 00000010                                        (1.7)
    0010 01 00 00000100 *

We envision a tiny instruction word, in which the first four bits are the
instruction code, with 0001, 0010, and 0011 standing for load, store, and
add, respectively. By load and store we mean moves from memory into a
register and vice versa. The next two bits designate a register, and 01 refers
to register 1 in each of the three above instructions. The two bits after that
represent a "tag," with 00 standing for the ordinary address mode, where the

last eight bits refer to a memory address. The tag 10 stands for the "immedi-
ate" mode, where the last eight bits are taken literally as the operand. This
mode appears in the second instruction of (1.7).
We also see in (1.7) a * associated with the first and third instructions.
This * represents the relocation bit that is associated with each operand in

relocatable machine code. Suppose that the address space containing the data
is to be loaded starting at location L. The presence of the * means that L
must be added to the address of the instruction. Thus, if L = 00001111,
i.e., 15, then a and b would be at locations 15 and 19, respectively, and the

instructions of (1.7) would appear as

    0001 01 00 00001111
0011 01 10 00000010 (1.8)
0010 01 00 00010011

in absolute, or unrelocatable, machine code. Note that there is no * associ-
ated with the second instruction in (1.7), so L has not been added to its
address in (1.8), which is exactly right because the bits represent the constant
2, not the location 2. □

Loaders and Link-Editors

Usually, a program called a loader performs the two functions of loading and
link-editing. The process of loading consists of taking relocatable machine
code, altering the relocatable addresses as discussed in Example 1.3, and plac-
ing the altered instructions and data in memory at the proper locations.
The link-editor allows us to make a single program from several files of
relocatable machine code. These files may have been the result of several dif-
ferent compilations, and one or more may be library files of routines provided
by the system and available to any program that needs them.
If the files are to be used together in a useful way, there may be some

external references, in which the code of one file refers to a location in

another file. This reference may be to a data location defined in one file and
used in another, or it may be to the entry point of a procedure that appears in

the code for one file and is called from another file. The relocatable machine
code file must retain the information in the symbol table for each data loca-
tion or instruction label that is referred to externally. If we do not know in
advance what might be referred to, we in effect must include the entire assem-
bler symbol table as part of the relocatable machine code.
For example, the code of (1.7) would be preceded by

    a 0
    b 4

If a file loaded with (1.7) referred to b, then that reference would be replaced
by 4 plus the offset by which the data locations in file (1.7) were relocated.
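The relocation arithmetic itself is simple. The following C sketch, with an instruction representation invented for the illustration, adds the load address L to every address field whose relocation bit is set, as in Example 1.3.

    struct Instr {
        unsigned op, reg, tag, address;
        int relocatable;            /* the relocation bit ('*' in (1.7)) */
    };

    /* Load the program at address L: adjust every relocatable address field. */
    void relocate(struct Instr *prog, int n, unsigned L) {
        for (int i = 0; i < n; i++)
            if (prog[i].relocatable)
                prog[i].address += L;    /* e.g. a at 0 becomes 15 when L = 15 */
    }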

1.5 THE GROUPING OF PHASES


The discussion of phases in Section 1.3 deals with the logical organization of a
compiler. In an implementation, activities from more than one phase are
often grouped together.

Front and Back Ends

Often, the phases are collected into a front end and a back end. The front end
consists of those phases, or parts of phases, that depend primarily on the
source language and are largely independent of the target machine. These
normally include lexical and syntactic analysis, the creation of the symbol
table, semantic analysis, and the generation of intermediate code. A certain

amount of code optimization can be done by the front end as well. The front

end also includes the error handling that goes along with each of these phases.
The back end includes those portions of the compiler that depend on the
target machine, and generally, these portions do not depend on the source
language, just the intermediate language. In the back end, we find aspects of
the code optimization phase, and we find code generation, along with the
necessary error handling and symbol-table operations.
It has become fairly routine to take the front end of a compiler and redo its
associated back end to produce a compiler for the same source language on a
different machine. If the back end is designed carefully, it may not even be
necessary to redesign too much of the back end; this matter is discussed in
Chapter 9. It is also tempting to compile several different languages into the
same intermediate language and use a common back end for the different
front ends, thereby obtaining several compilers for one machine. However,
because of subtle differences in the viewpoints of different languages, there
has been only limited success in this direction.

Passes

Several phases of compilation are usually implemented in a single pass consist-


ing of reading an input file and writing an output file. In practice, there is

great variation in the way the phases of a compiler are grouped into passes, so
we prefer to organize our discussion of compiling around phases rather than
passes. Chapter 12 discusses some representative compilers and mentions the
way they have structured the phases into passes.
As we have mentioned, it is common for several phases to be grouped into
one pass, and for the activity of these phases to be interleaved during the
pass. For example, lexical analysis, syntax analysis, semantic analysis, and
intermediate code generation might be grouped into one pass. If so, the token
stream after lexical analysis may be translated directly into intermediate code.
In more detail, we may think of the syntax analyzer as being "in charge." It
attempts to discover the grammatical structure on the tokens it sees; it obtains
tokens as it needs them, by calling the lexical analyzer to find the next token.
As the grammatical structure is discovered, the parser calls the intermediate

code generator to perform semantic analysis and generate a portion of the


code. A compiler organized this way is presented in Chapter 2.

Reducing the Number of Passes

It is desirable to have relatively few passes, since it takes time to read and
write intermediate files. On the other hand, if we group several phases into
one pass, we may be forced to keep the entire program in memory, because
one phase may need information in a different order than a previous phase
produces it. The internal form of the program may be considerably larger
than either the source program or the target program, so this space may not
be a trivial matter.
For some phases, grouping into one pass presents few problems. For exam-
ple, as we mentioned above, the interface between the lexical and syntactic
analyzers can often be limited to a single token. On the other hand, it is

often very hard to perform code generation until the intermediate representa-
tion has been completely generated. For example, languages like PL/I and
Algol 68 permit variables to be used before they are declared. We cannot
generate the target code for a construct if we do not know the types of vari-
ables involved in that construct. Similarly, most languages allow goto's that
jump forward in the code. We cannot determine the target address of such a
jump until we have seen the intervening source code and generated target
code for it.

In some cases, it is possible to leave a blank slot for missing information,


and fill in the slot when the information becomes available. In particular,
intermediate and target code generation can often be merged into one pass
using a technique called "backpatching." While we cannot explain all the
details until we have seen intermediate-code generation in Chapter 8, we can
illustrate backpatching in terms of an assembler. Recall that in the previous
section we discussed a two-pass assembler, where the first pass discovered all

the identifiers that represent memory locations and deduced their addresses as
they were discovered. Then a second pass substituted addresses for identif-
iers.

We can combine the action of the passes as follows. On encountering an


assembly statement that is a forward reference, say

GOTO target
we generate a skeletal instruction, with the machine operation code for GOTO
and blanks for the address. All instructions with blanks for the address of
target are kept in a list associated with the symbol-table entry for target.
The blanks are filled in when we finally encounter an instruction such as

target: MOV foobar, R1


and determine the value of target; it is the address of the current instruc-
tion. We then "backpatch." by going down the list for target of all the
instructions that need its address, substituting the address of target for the

blanks in the address fields of those instructions. This approach is easy to


implement if the instructions can be kept in memory until all target addresses
can be determined.
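The bookkeeping behind this can be sketched in C as follows; the representation of instructions and labels, and all of the names, are invented for the illustration.

    #define MAXCODE 1000

    static int code[MAXCODE];       /* the address field of each emitted instruction */
    static int next_instr = 0;

    struct Label {
        int address;                /* -1 until the label is finally encountered */
        int patch_list[64];         /* instructions still waiting for the address */
        int npatch;
    };

    /* Emit a GOTO whose target may not be defined yet. */
    void emit_goto(struct Label *target) {
        if (target->address >= 0)
            code[next_instr] = target->address;   /* address already known */
        else {
            code[next_instr] = -1;                /* leave the slot blank   */
            target->patch_list[target->npatch++] = next_instr;
        }
        next_instr++;
    }

    /* Called when "target:" is finally seen: backpatch every waiting blank. */
    void define_label(struct Label *target) {
        target->address = next_instr;
        for (int i = 0; i < target->npatch; i++)
            code[target->patch_list[i]] = target->address;
        target->npatch = 0;
    }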
This approach is a reasonable one for an assembler that can keep all its out-
put in memory. Since the intermediate and final representations of code for
an assembler are roughly the same, and surely of approximately the same
length, backpatching over the length of the entire assembly program is not
infeasible. However, in a compiler with a space-consuming intermediate
code, we may need to be careful about the distance over which backpatching
occurs.

1.6 COMPILER-CONSTRUCTION TOOLS


The compiler writer, like any programmer, can profitably use software tools
such as debuggers, version managers, profilers, and so on. In Chapter 11, we
shall see how some of these tools can be used to implement a compiler. In
addition to these software-development tools, other more specialized tools
have been developed for helping implement various phases of a compiler. We
mention them briefly in this section; they are covered in detail in the appropri-
ate chapters.
Shortly after the first compilers were written, systems to help with the
compiler-writing process appeared. These systems have often been referred to
as compiler-compilers, compiler-generators, or translator-writing systems.
Largely, they are oriented around a particular model of languages, and they
are most suitable for generating compilers of languages similar to the model.
For example, it is tempting to assume that lexical analyzers for all

languages are essentially the same, except for the particular keywords and
signs recognized. Many compiler-compilers do in fact produce fixed lexical

analysis routines for use in the generated compiler. These routines differ only
in the list of keywords recognized, and this list is all that needs to be supplied

by the user. The approach is valid, but may be unworkable if it is required to


recognize nonstandard tokens, such as identifiers that may include certain
characters other than letters and digits.
Some general tools have been created for the automatic design of specific
compiler components. These tools use specialized languages for specifying
and implementing the component, and many use algorithms that are quite
sophisticated. The most successful tools are those that hide the details of the
generation algorithm and produce components that can be easily integrated
into the remainder of a compiler. The following is a list of some useful
compiler-construction tools:

1. Parser generators. These produce syntax analyzers, normally from input


that is based on a context-free grammar. In early compilers, syntax
analysis consumed not only a large fraction of the running time of a com-
piler, but a large fraction of the intellectual effort of writing a compiler.
This phase is now considered one of the easiest to implement. Many of

the "little languages" used to typeset this book, such as PIC (Kernighan
[1982]) and EQN, were implemented in a few days using the parser gen-
erator described in Section 4.7. Many parser generators utilize powerful
parsing algorithms that are too complex to be carried out by hand.

2. Scanner generators. These automatically generate lexical analyzers, nor-
mally from a specification based on regular expressions, discussed in
Chapter 3. The basic organization of the resulting lexical analyzer is in
effect a finite automaton; a small hand-coded sketch of such an automaton
appears after this list. A typical scanner generator and its implemen-
tation are discussed in Sections 3.5 and 3.8.

3. Syntax-directed translation engines. These produce collections of routines


that walk the parse tree, such as Fig. 1.4, generating intermediate code.
The basic idea one or more "translations" are associated with each
is that
node of the parse tree, and each translation is defined in terms of transla-
tions at its neighbor nodes in the tree. Such engines are discussed in

Chapter 5.

4. Automatic code generators. Such a tool takes a collection of rules that


define the translation of each operation of the intermediate language into
the machine language for the target machine. The rules must include suf-
ficient detail that we can handle the different possible access methods for
data; e.g., variables may be in registers, in a fixed (static) location in
memory, or may be allocated a position on a stack. The basic technique
is "template matching." The intermediate code statements are replaced
by "templates" that represent sequences of machine instructions, in such
a way that the assumptions about storage of variables match from tem-
plate to template. Since there are usually many options regarding where
variables are to be placed (e.g., in one of several registers or in memory),
there are many possible ways to "tile" intermediate code with a given set
of templates, and it is necessary to select a good tiling without a combina-
torial explosion in running time of the compiler. Tools of this nature are
covered in Chapter 9.

5. Data-flow engines. Much of the information needed to perform good code


optimization involves "data-flow analysis," the gathering of information
about how values are transmitted from one part of a program to each
other part. Different tasks of this nature can be performed by essentially
the same routine, with the user supplying details of the relationship
between intermediate code statements and the information being gath-
ered. A tool of this nature is described in Section 10.11.
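As a concrete illustration of item 2 above, the following is a minimal hand-coded
sketch in C of the kind of finite automaton a scanner generator might emit for
two token classes, identifiers and numbers. The token codes and the function
name next_token are assumptions made only for the illustration; no overflow
checking is done on the lexeme buffer.

#include <ctype.h>
#include <stdio.h>

enum { TOK_ID = 256, TOK_NUM, TOK_EOF };     /* assumed token codes */

/* A tiny hand-coded automaton: the states are implicit in the control
   flow.  Real generated scanners drive an explicit transition table.  */
int next_token(FILE *in, char *lexeme)
{
    int c = getc(in), i = 0;
    while (c == ' ' || c == '\t' || c == '\n')   /* skip white space */
        c = getc(in);
    if (c == EOF)
        return TOK_EOF;
    if (isalpha(c)) {                            /* identifier state */
        do { lexeme[i++] = c; c = getc(in); } while (isalnum(c));
        ungetc(c, in); lexeme[i] = '\0';
        return TOK_ID;
    }
    if (isdigit(c)) {                            /* number state */
        do { lexeme[i++] = c; c = getc(in); } while (isdigit(c));
        ungetc(c, in); lexeme[i] = '\0';
        return TOK_NUM;
    }
    lexeme[0] = c; lexeme[1] = '\0';             /* single-character token */
    return c;
}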

BIBLIOGRAPHIC NOTES
Writing in 1962 on the history of compiler writing, Knuth [1962] observed
that, "In this field there has been an unusual amount of parallel discovery of
the same technique by people working independently." He continued by
observing that several individuals had in fact discovered "various aspects of a

technique, and it has been polished up through the years into a very pretty
algorithm, which none of the originators fully realized." Ascribing credit for
techniques remains a perilous task; the bibliographic notes in this book are
intended merely as an aid for further study of the literature.
Historical notes on the development of programming languages and com-
pilers until the arrival of Fortran may be found in Knuth and Trabb Pardo
[1977]. Wexelblat [1981] contains historical recollections about several pro-
gramming languages by participants in their development.
Some fundamental early papers on compiling have been collected in Rosen
[1967] and Pollack [1972]. The January 1961 issue of the Communications of
the ACM provides a snapshot of the state of compiler writing at the time. A
detailed account of an early Algol 60 compiler is given by Randell and
Russell [1964].
Beginning in the early 1960's with the study of syntax, theoretical studies
have had a profound influence on the development of compiler technology,
perhaps at least as much influence as in any other area of computer science.
The fascination with syntax has long since waned, but compiling as a whole
continues to be the subject of lively research. The fruits of this research will
become evident when we examine compiling in more detail in the following
chapters.
CHAPTER 2

A Simple
One-Pass
Compiler

This chapter is an introduction to the material in Chapters 3 through 8 of this


book. It presents a number of basic compiling techniques that are illustrated
by developing a working C program that translates infix expressions into post-
fix form. Here, the emphasis is on the front end of a compiler, that is, on
lexical analysis, parsing, and intermediate code generation. Chapters 9 and 10
cover code generation and optimization.

2.1 OVERVIEW
A programming language can be defined by describing what its programs look

like (the syntax of the language) and what its programs mean (the semantics of
the language). For specifying the syntax of a language, we present a widely
used notation, called context-free grammars or BNF (for Backus-Naur Form).
With the notations currently available, the semantics of a language is much
more difficult to describe than the syntax. Consequently, for specifying the
semantics of a language we shall use informal descriptions and suggestive
examples.
Besides specifying the syntax of a language, a context-free grammar can be
used to help guide the translation of programs. A grammar-oriented compil-
ing technique, known as syntax-directed translation, is very helpful for organiz-
ing a compiler front end and will be used extensively throughout this chapter.
In the course of discussing syntax-directed translation, we shall construct a
compiler that translates infix expressions into postfix form, a notation in

which the operators appear after their operands. For example, the postfix
form of the expression 9-5+2 is 95-2+. Postfix notation can be converted

directly into code for a computer that performs all its computations using a
stack. We begin by constructing a simple program to translate expressions
consisting of digits separated by plus and minus signs into postfix form. As
the basic ideas become clear, we extend the program to handle more general
programming language constructs. Each of our translators is formed by sys-
tematically extending the previous one.

In our compiler, the lexical analyzer converts the stream of input characters
into a stream of tokens that becomes the input to the following phase, as
shown in Fig. 2.1. The "syntax-directed translator" in the figure is a combi-
nation of a syntax analyzer and an intermediate-code generator. One reason
for starting with expressions consisting of digits and operators is to make lexi-
cal analysis initially very easy; each input character forms a single token.
Later, we extend the language to include lexical constructs such as numbers,
identifiers, and keywords. For this extended language we shall construct a
lexical analyzer that collects consecutive input characters into the appropriate
tokens. The construction of lexical analyzers will be discussed in detail in

Chapter 3.

character stream → [ lexical analyzer ] → token stream → [ syntax-directed translator ] → intermediate representation

Fig. 2.1. Structure of our compiler front end.

2.2 SYNTAX DEFINITION


In this section, we introduce a notation, called a context-free grammar (gram-
mar, for short), for specifying the syntax of a language. It will be used
throughout this book as part of the specification of the front end of a com-
piler.

A grammar naturally describes the hierarchical structure of many program-


ming language constructs. For example, an if-else statement in C has the
form

if ( expression ) statement else statement

That is, the statement is the concatenation of the keyword if, an opening
parenthesis, an expression, a closing parenthesis, a statement, the keyword
else, and another statement. (In C, there is no keyword then.) Using the
variable expr to denote an expression and the variable stmt to denote a state-
ment, this structuring rule can be expressed as

stmt → if ( expr ) stmt else stmt      (2.1)

in which the arrow may be read as "can have the form." Such a rule is called
a production. In a production lexical elements like the keyword if and the
parentheses are called tokens. Variables like expr and stmt represent
sequences of tokens and are called nonterminals.

A context-free grammar has four components:

1. A set of tokens, known as terminal symbols.



2. A set of nonterminals.

3. A set of productions where each production consists of a nonterminal,


called the left side of the production, an arrow, and a sequence of tokens
and/or nonterminals, called the right side of the production.

4. A designation of one of the nonterminals as the start symbol.

We follow the convention of specifying grammars by listing their produc-


tions, with the productions for the start symbol listed first. We assume that
digits, signs such as <=, and boldface strings such as while are terminals. An
italicized name is a nonterminal and any nonitalicized name or symbol may be
assumed to be a token.¹ For notational convenience, productions with the
same nonterminal on the left can have their right sides grouped, with the
alternative right sides separated by the symbol |, which we read as "or."

Example 2.1. Several examples in this chapter use expressions consisting of


digits and plus and minus signs, e.g., 9-5 + 2, 3-1, and 7. Since a plus or
minus sign must appear between two digits, we refer to such expressions as
"lists of digits separated by plus or minus signs." The following grammar
describes the syntax of these expressions. The productions are:

list → list + digit      (2.2)
list → list - digit      (2.3)
list → digit             (2.4)
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9      (2.5)

The right sides of the three productions with nonterminal list on the left

side can equivalently be grouped:

list → list + digit | list - digit | digit

According to our conventions, the tokens of the grammar are the symbols

+ -0123456789
The nonterminals are the italicized names list and digit, with list being the
starting nonterminal because its productions are given first.

We say a production is for a nonterminal if the nonterminal appears on the
left side of the production. A string of tokens is a sequence of zero or more
tokens. The string containing zero tokens, written as ε, is called the empty
string.
A grammar derives strings by beginning with the start symbol and repeat-
edly replacing a nonterminal by the right side of a production for that

¹ Individual italic letters will be used for additional purposes when grammars are studied in detail
in Chapter 4. For example, we shall use X, Y, and Z to talk about a symbol that is either a token
or a nonterminal. However, any italicized name containing two or more characters will continue
to represent a nonterminal.

nonterminal. The token strings that can be derived from the start symbol
form the language defined by the grammar.

Example 2.2. The language defined by the grammar of Example 2.1 consists
of lists of digits separated by plus and minus signs.
The ten productions for the nonterminal digit allow it to stand for any of
the tokens 0, 1, ..., 9. From production (2.4), a single digit by itself is a
list. Productions (2.2) and (2.3) express the fact that if we take any list and
follow it by a plus or minus sign and then another digit we have a new list.

It turns out that productions (2.2) to (2.5) are all we need to define the

language we are interested in. For example, we can deduce that 9-5 + 2 is a
list as follows.

a) 9 is a list by production (2.4), since 9 is a digit.

b) 9-5 is a list by production (2.3), since 9 is a list and 5 is a digit.

c) 9-5 + 2 is a list by production (2.2), since 9-5 is a list and 2 is a digit.

This reasoning is illustrated by the tree in Fig. 2.2. Each node in the tree is

labeled by a grammar symbol. An interior node and its children correspond


to a production; the interior node corresponds to the left side of the produc-
tion, the children to the right side. Such trees are called parse trees and are
discussed below.

Fig. 2.2. Parse tree for 9-5+2 according to the grammar in Example 2.1.

Note that the second possible right side for optstmts ("optional statement
list") is ε, which stands for the empty string of symbols. That is, optstmts
can be replaced by the empty string, so a block can consist of the two-token
string begin end. Notice that the productions for stmt_list are analogous to
those for list in Example 2.1, with semicolon in place of the arithmetic opera-
tor and stmt in place of digit. We have not shown the productions for stmt.
Shortly, we shall discuss the appropriate productions for the various kinds of
statements, such as if-statements, assignment statements, and so on.

Parse Trees

A parse tree pictorially shows how the start symbol of a grammar derives a
string in the language. If nonterminal A has a production A → XYZ, then a
parse tree may have an interior node labeled A with three children labeled X,
Y, and Z, from left to right:

        A
      / | \
     X  Y  Z

Formally, given a context-free grammar, a parse tree is a tree with the fol-
lowing properties:

1. The root is labeled by the start symbol.

2. Each leaf is labeled by a token or by ε.

3. Each interior node is labeled by a nonterminal.

4. If A is the nonterminal labeling some interior node and X₁, X₂, ..., Xₙ
are the labels of the children of that node from left to right, then
A → X₁ X₂ ... Xₙ is a production. Here, X₁, X₂, ..., Xₙ each stand for a
symbol that is either a terminal or a nonterminal. As a special case, if
A → ε, then a node labeled A may have a single child labeled ε.

Example 2.4. In Fig. 2.2, the root is labeled list, the start symbol of the
grammar in Example 2.1. The children of the root are labeled, from left to
right, list, +, and digit. Note that

list → list + digit

is a production in the grammar of Example 2.1. The same pattern with - is
repeated at the left child of the root, and the three nodes labeled digit each
have one child that is labeled by a digit.

The leaves of a parse tree read from left to right form the yield of the tree,
which is the string generated or derived from the nonterminal at the root of
the parse tree. In Fig. 2.2, the generated string is 9-5+2. In that figure, all
the leaves are shown at the bottom level. Henceforth, we shall not necessarily

line up the leaves in this way. Any tree imparts a natural left-to-right order
to its leaves, based on the idea that if a and b are two children with the same
parent, and a is to the left of b, then all descendants of a are to the left of
descendants of b.

Another definition of the language generated by a grammar is as the set of


strings that can be generated by some parse tree. The process of finding a
parse tree for a given string of tokens is called parsing that string.

Ambiguity

We have to be careful in talking about the structure of a string according to a


grammar. While it is clear that each parse tree derives exactly the string read
off its leaves, a grammar can have more than one parse tree generating a
given string of tokens. Such a grammar is said to be ambiguous. To show
that a grammar is ambiguous, all we need to do is find a token string that has
more than one parse tree. Since a string with more than one parse tree usu-
ally has more than one meaning, for compiling applications we need to design
unambiguous grammars, or to use ambiguous grammars with additional rules
to resolve the ambiguities.

Example 2.5. Suppose we did not distinguish between digits and lists as in
Example 2.1. We could have written the grammar

string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Merging the notion of digit and list into the nonterminal string makes superfi-
cial sense, because a single digit is a special case of a list.

However, Fig. 2.3 shows that an expression like 9-5 + 2 now has more than
one parse tree. The two trees for 9-5 + 2 correspond to the two ways of
parenthesizing the expression: (9-5) +2 and 9- (5 + 2). This second
parenthesization gives the expression the value 2 rather than the customary
value 6. The grammar of Example 2.1 did not permit this interpretation.

Associativity of Operators

By convention, 9 + 5 + 2 is equivalent to (9 + 5) +2 and 9-5-2 is equivalent to


(9-5) -2. When an operand like 5 has operators to its left and right, con-
ventions are needed for deciding which operator takes that operand. We say
that the operator + associates to the left because an operand with plus signs on
both sides of it is taken by the operator to its left. In most programming
languages the four arithmetic operators, addition, subtraction, multiplication,
and division are left associative.
Some common operators such as exponentiation are right associative. As
another example, the assignment operator = in C is right associative; in C, the
expression a=b=c is treated in the same way as the expression a=(b=c).
Strings like a=b=c with a right-associative operator are generated by the
following grammar:


Fig. 2.3. Two parse trees for 9-5 + 2.

right → letter = right | letter
letter → a | b | ... | z

The contrast between a parse tree for a left-associative operator like - and a
parse tree for a right-associative operator like = is shown by Fig. 2.4. Note
that the parse tree for 9-5-2 grows down towards the left, whereas the parse
tree for a=b=c grows down towards the right.

Fig. 2.4. Parse trees for left- and right-associative operators.

Precedence of Operators

Consider the expression 9+5*2. There are two possible interpretations of this
expression: (9+5)*2 or 9+(5*2). The associativity of + and * does not
resolve this ambiguity. For this reason, we need to know the relative pre-
cedence of operators when more than one kind of operator is present.
We say that * has higher precedence than + if * takes its operands before +
does. In ordinary arithmetic, multiplication and division have higher pre-
cedence than addition and subtraction. Therefore, 5 is taken by * in both
9+5*2 and 9*5+2; i.e., the expressions are equivalent to 9+(5*2) and
(9*5)+2, respectively.

Syntax of expressions. A grammar for arithmetic expressions can be



constructed from a table showing the associativity and precedence of opera-


tors. We start with the four common arithmetic operators and a precedence
table, showing the operators in order of increasing precedence with operators
at the same precedence level on the same line:

left associative: + -
left associative: * /

We create two nonterminals expr and term for the two levels of precedence,
and an extra nonterminal factor for generating basic units in expressions. The
basic units in expressions are presently digits and parenthesized expressions.

factor → digit | ( expr )

Now consider the binary operators, * and /, that have the highest pre-
cedence. Since these operators associate to the left, the productions are simi-
lar to those for lists that associate to the left.

term → term * factor
     | term / factor
     | factor

Similarly, expr generates lists of terms separated by the additive operators.

expr → expr + term
     | expr - term
     | term

The resulting grammar is therefore

expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → digit | ( expr )

This grammar treats an expression as a list of terms separated by either + or -
signs, and a term as a list of factors separated by * or / signs. Notice that
any parenthesized expression is a factor, so with parentheses we can develop
expressions that have arbitrarily deep nesting (and also arbitrarily deep trees).
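For instance, consider 9+5*2 once more under this grammar. We can deduce
that 9 is an expr, because 9 is a digit, hence a factor, hence a term, hence an
expr; similarly, 5*2 is a term, because 5 and 2 are factors. The only production
that can then complete the parse is expr → expr + term, so 5*2 is necessarily
grouped as a unit and the expression is interpreted as 9+(5*2), in agreement
with the precedence table above.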

Syntax of statements. Keywords allow us to recognize statements in most


languages. All Pascal statements begin with a keyword except assignments
and procedure calls. Some Pascal statements are defined by the following
(ambiguous) grammar in which the token id represents an identifier.

stmt → id := expr
     | if expr then stmt
     | if expr then stmt else stmt
     | while expr do stmt
     | begin optstmts end

The nonterminal optstmts generates a possibly empty list of statements


separated by semicolons using the productions in Example 2.3.

2.3 SYNTAX-DIRECTED TRANSLATION


To translate a programming language construct, a compiler may need to keep
track of many quantities besides the code generated for the construct. For
example, the compiler may need to know the type of the construct, or the
location of the first instruction in the target code, or the number of instruc-
tions generated. We therefore talk abstractly about attributes associated with
constructs. An attribute may represent any quantity, e.g., a type, a string, a
memory location, or whatever.
In this section, we present a formalism called a syntax-directed definition
for specifying translations for programming language constructs. A syntax-
directed definition specifies the translation of a construct in terms of attributes
associated with its syntactic components. In later chapters, syntax-directed
definitions are used to specify many of the translations that take place in the
front end of a compiler.
We also introduce a more procedural notation, called a translation scheme,
for specifying translations. Throughout this chapter, we use translation
schemes for translating infix expressions into postfix notation. A more
detailed discussion of syntax-directed definitions and their implementation is

contained in Chapter 5.

Postfix Notation

The postfix notation for an expression E can be defined inductively as follows:

1. If E is a variable or constant, then the postfix notation for E is E itself.

2. If E is an expression of the form E₁ op E₂, where op is any binary opera-
tor, then the postfix notation for E is E₁′ E₂′ op, where E₁′ and E₂′ are
the postfix notations for E₁ and E₂, respectively.

3. If E is an expression of the form ( E₁ ), then the postfix notation for E₁
is also the postfix notation for E.

No parentheses are needed in postfix notation because the position and arity
(number of arguments) of the operators permits only one decoding of a post-
fix expression. For example, the postfix notation for (9-5) +2 is 95-2+ and

the postfix notation for 9- (5+2) is 952 + -.
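The inductive definition translates directly into a recursive routine over
expression trees; each case of the definition becomes one branch. The
following C sketch assumes a small Expr structure of our own devising, made
up only for this illustration.

#include <stdio.h>

/* An expression is a constant leaf or a binary operator node;
   parenthesized expressions need no node of their own (rule 3). */
struct Expr {
    char op;                       /* '+' or '-', or 0 for a leaf    */
    char value;                    /* digit, used only at a leaf     */
    struct Expr *left, *right;     /* operands, used only at a node  */
};

/* Print the postfix notation for e: rule 1 for leaves, rule 2 for
   operator nodes (E1' E2' op). */
void postfix(struct Expr *e)
{
    if (e->op == 0) {              /* rule 1: a constant             */
        putchar(e->value);
        return;
    }
    postfix(e->left);              /* rule 2: E1' ...                */
    postfix(e->right);             /* ... E2' ...                    */
    putchar(e->op);                /* ... op                         */
}

int main(void)                     /* builds (9-5)+2 and prints 95-2+ */
{
    struct Expr nine = {0, '9'}, five = {0, '5'}, two = {0, '2'};
    struct Expr minus = {'-', 0, &nine, &five};
    struct Expr plus  = {'+', 0, &minus, &two};
    postfix(&plus);
    putchar('\n');
    return 0;
}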

Syntax-Directed Definitions

A syntax-directed definition uses a context-free grammar to specify the syntac-


tic structure of the input. With each grammar symbol, it associates a set of
attributes, and with each production, a set of semantic rules for computing
values of the attributes associated with the symbols appearing in that produc-
tion. The grammar and the set of semantic rules constitute the syntax-
directed definition.
A translation is an input-output mapping. The output for each input x is

specified in the following manner. First, construct a parse tree for x. Suppose

a node n in the parse tree is labeled by the grammar symbol X. We write X.a
to denote the value of attribute a of X at that node. The value of X.a at n is
computed using the semantic rule for attribute a associated with the X-
production used at node n. A parse tree showing the attribute values at each
node is called an annotated parse tree.

Synthesized Attributes

An attribute is said to be synthesized if its value at a parse-tree node is deter-


mined from attribute values at the children of the node. Synthesized attri-

butes have the desirable property that they can be evaluated during a single
bottom-up traversal of the parse tree. In this chapter, only synthesized attri-

butes are used; "inherited" attributes are considered in Chapter 5.

Example 2.6. A syntax-directed definition for translating expressions consist-


ing of digits separated by plus or minus signs into postfix notation is shown in

Fig. 2.5. Associated with each nonterminal is a string-valued attribute t that


represents the postfix notation for the expression generated by that nontermi-
nal in a parse tree.

Fig. 2.5. Syntax-directed definition for translating expressions into postfix notation (productions paired with semantic rules defining the attribute t).

The symbol || in the semantic rules represents string concatenation.


Figure 2.6 contains the annotated parse tree corresponding to the tree of
Fig. 2.2. The value of the t-attribute at each node has been computed using
the semantic rule associated with the production used at that node. The value
of the attribute at the root is the postfix notation for the string generated by
the parse tree.

Fig. 2.6. Attribute values at nodes in a parse tree (for the input 9-5+2; the attribute at the root is expr.t = 95-2+).

Example 2.7. Suppose a robot can be instructed to move one step east, north,
west, or south from its current position. A sequence of such instructions is

generated by the following grammar:

seq → seq instr | begin
instr → east | north | west | south

Changes in the position of the robot on input

begin west south east east east north north

are shown in Fig. 2.7.

Fig. 2.7. Keeping track of a robot's position.

In the figure, a position is marked by a pair (x, y), where x and y represent
the number of steps to the east and north, respectively, from the starting
position. (If x is negative, then the robot is to the west of the starting posi-
tion; similarly, if y is negative, then the robot is to the south of the starting
position.)
Let us construct a syntax-directed definition to translate an instruction
sequence into a robot position. We shall use two attributes, seq.x and seq.y,
to keep track of the position resulting from an instruction sequence generated
by the nonterminal seq. Initially, seq generates begin, and seq.x and seq.y are
both initialized to 0, as shown at the leftmost interior node of the parse tree
for begin west south shown in Fig. 2.8.

Fig. 2.8. Annotated parse tree for begin west south.

The change in position due to an individual instruction derived from instr is
given by attributes instr.dx and instr.dy. For example, if instr derives west,
then instr.dx = -1 and instr.dy = 0. Suppose a sequence seq is formed by
following a sequence seq₁ by a new instruction instr. The new position of the
robot is then given by the rules

seq.x := seq₁.x + instr.dx
seq.y := seq₁.y + instr.dy

A syntax-directed definition for translating an instruction sequence into a


robot position is shown in Fig. 2.9.
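Because seq.x and seq.y are synthesized attributes, they can be computed by a
single left-to-right pass over the instruction sequence, applying the two rules
above at each step. The C sketch below is illustrative only; the function name
robot_position and the representation of the instruction sequence as an array
of strings are assumptions made for the example.

#include <stdio.h>
#include <string.h>

/* Compute the final (x, y) position, applying
   seq.x := seq1.x + instr.dx and seq.y := seq1.y + instr.dy
   for each instruction after the initial keyword "begin". */
void robot_position(const char *instrs[], int n, int *x, int *y)
{
    int i;
    *x = 0; *y = 0;                       /* seq -> begin: start at (0,0) */
    for (i = 0; i < n; i++) {
        if      (strcmp(instrs[i], "east")  == 0) (*x)++;
        else if (strcmp(instrs[i], "west")  == 0) (*x)--;
        else if (strcmp(instrs[i], "north") == 0) (*y)++;
        else if (strcmp(instrs[i], "south") == 0) (*y)--;
    }
}

int main(void)
{
    const char *program[] = { "west", "south", "east", "east",
                              "east", "north", "north" };
    int x, y;
    robot_position(program, 7, &x, &y);
    printf("(%d,%d)\n", x, y);            /* prints (2,1) for this input */
    return 0;
}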

Depth-First Traversals

A syntax-directed definition does not impose any specific order for the evalua-
tion of attributes on a parse tree; any evaluation order that computes an attri-
bute a after all the other attributes that a depends on is acceptable. In gen-
eral, we may have to evaluate some attributes when a node is first reached
during a walk of the parse tree, others after all its children have been visited,
or at some point in between visits to the children of the node. Suitable
evaluation orders are discussed in more detail in Chapter 5.
The translations in this chapter can all be implemented by evaluating the
semantic rules for the attributes in a parse tree in a predetermined order. A
traversal of a tree starts at the root and visits each node of the tree in some
order; in this chapter, we use a depth-first traversal, illustrated in Fig. 2.11,
in which the work at a node is done only after all of its children have been visited.
Fig. 2.9. Syntax-directed definition of the robot's position (productions for seq and instr paired with semantic rules for seq.x, seq.y, instr.dx, and instr.dy).
Fig. 2.11. Example of a depth-first traversal of a tree.
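The depth-first traversal used for the translations in this chapter can be
sketched as follows. This is a minimal illustration in C, not a figure from the
text; the Node structure and the action field are assumptions, and the essential
point is only that a node's work is performed after its children have been visited.

#define MAXCHILD 4

/* A parse-tree node with at most MAXCHILD children; 'action' stands in
   for whatever work is attached to the node (a semantic rule or action). */
struct Node {
    struct Node *child[MAXCHILD];
    int nchildren;
    void (*action)(struct Node *);   /* evaluated after the children */
};

/* Depth-first traversal: visit the children from left to right, then
   perform this node's work, so synthesized attributes are available. */
void visit(struct Node *n)
{
    int i;
    for (i = 0; i < n->nchildren; i++)
        visit(n->child[i]);
    if (n->action != 0)
        n->action(n);
}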

Translation Schemes

A translation scheme is a context-free grammar in which program fragments
called semantic actions are embedded within the right sides of productions. The
position at which an action is to be executed is shown by enclosing it between
braces and writing it within the right side of a production, as in

rest → + term { print('+') } rest₁

A translation scheme generates an output for each sentence x generated by


the underlying grammar by executing the actions in the order they appear dur-
ing a depth-first traversal of a parse tree for x. For example, consider a parse
tree with a node labeled rest representing this production. The action
{ print('+') } will be performed after the subtree for term is traversed but
before the child for rest₁ is visited.

(The node labeled rest has four children: +, term, an extra leaf for
{ print('+') }, and rest₁.)

Fig. 2.12. An extra leaf is constructed for a semantic action.

When drawing a parse tree for a translation scheme, we indicate an action
by constructing for it an extra child, connected by a dashed line to the node
for its production. For example, the portion of the parse tree for the above
production and action is drawn as in Fig. 2.12. The node for a semantic
action has no children, so the action is performed when that node is first seen.

Emitting a Translation

In this chapter, the semantic actions in translation schemes will write the out-

put of a translation into a file, a string or character at a time. For example,


we translate 9-5 + 2 into 95-2+ by printing each character in 9-5 + 2 exactly
once, without using any storage for the translation of subexpressions. When
the output is created incrementally in this fashion, the order in which the
characters are printed is important.
Notice that the syntax-directed definitions mentioned so far have the follow-
ing important property: the string representing the translation of the nontermi-
nal on the left side of each production is the concatenation of the translations

of the nonterminals on the right, in the same order as in the production, with
some additional strings (perhaps none) interleaved. A syntax-directed defini-
tion with this property is termed simple. For example, consider the first pro-
duction and semantic rule from the syntax-directed definition of Fig. 2.5:

Production                     Semantic Rule

expr → expr₁ + term            expr.t := expr₁.t || term.t || '+'        (2.6)

Here the translation expr.t is the concatenation of the translations of expr₁ and
term, followed by the symbol +. Notice that expr₁ appears before term on the
right side of the production.
An additional string appears between term.t and rest₁.t in

Production                     Semantic Rule

rest → + term rest₁            rest.t := term.t || '+' || rest₁.t        (2.7)

but, again, the nonterminal term appears before rest₁ on the right side.

Simple syntax-directed definitions can be implemented with translation


schemes in which actions print the additional strings in the order they appear
in the definition. The actions in the following productions print the additional
strings in (2.6) and (2.7), respectively:

expr → expr₁ + term { print('+') }

rest → + term { print('+') } rest₁

Example 2.8. Figure 2.5 contained a simple definition for translating expres-
sions into postfix form. A translation scheme derived from this definition is

given in Fig. 2.13 and a parse tree with actions for 9-5 + 2 is shown in Fig.
2.14. Note that although Figures 2.6 and 2.14 represent the same input-
output mapping, the translation in the two cases is constructed differently;
Fig. 2.6 attaches the output to the root of the parse tree, while Fig. 2.14 prints
the output incrementally.

(Fig. 2.13, the translation scheme derived from the definition of Fig. 2.5, is repeated as Fig. 2.19 in Section 2.5.)

In Fig. 2.14, a depth-first traversal performs the actions in the subtree for
the left operand first, then those in the subtree for the right operand term
and, finally, the semantic action { print('+') } at the extra node.
Since the productions for term have only a digit on the right side, that digit
is printed by the actions for the productions. No output is necessary for the
production expr → term, and only the operator needs to be printed in the
action for the first two productions. When executed during a depth-first
traversal of the parse tree, the actions in Fig. 2.14 print 95-2+.

Fig. 2.14. Actions translating 9-5 + 2 into 95-2 + .

As a general rule, most parsing methods process their input from left to
right in a "greedy" fashion; that is, they construct as much of a parse tree as
possible before reading the next input token. In a simple translation scheme
(one derived from a simple syntax-directed definition), actions are also done
in a left-to-right order. Therefore, to implement a simple translation scheme
we can execute the semantic actions while we parse; it is not necessary to con-
struct the parse tree at all.

2.4 PARSING
Parsing is the process of determining if a string of tokens can be generated by
a grammar. In discussing this problem, it is helpful to think of a parse tree
being constructed, even though a compiler may not actually construct such a
tree. However, a parser must be capable of constructing the tree, or else the
translation cannot be guaranteed correct.
This section introduces a parsing method that can be applied to construct
syntax-directed translators. A complete C program, implementing the transla-
tion scheme of Fig. 2.13, appears in the next section. A viable alternative is

to use a software tool to generate a translator directly from a translation


scheme. See Section 4.9 for the description of such a tool; it can implement
the translation scheme of Fig. 2.13 without modification.
A parser can be constructed for any grammar. Grammars used in practice,

however, have a special form. For any context-free grammar there is a parser
that takes at most O(n³) time to parse a string of n tokens. But cubic time is

too expensive. Given a programming language, we can generally construct a


grammar that can be parsed quickly. Linear algorithms suffice to parse essen-
tially all languages that arise in practice. Programming language parsers
almost always make a single left-to-right scan over the input, looking ahead
one token at a time.

Most parsing methods fall into one of two classes, called the top-down and
bottom-up methods. These terms refer to the order in which nodes in the
parse tree are constructed. In the former, construction starts at the root and
proceeds towards the leaves, while, in the latter, construction starts at the
leaves and proceeds towards the root. The popularity of top-down parsers is
due to the fact that efficient parsers can be constructed more easily by hand
using top-down methods. Bottom-up parsing, however, can handle a larger
class of grammars and translation schemes, so software tools for generating
parsers directly from grammars have tended to use bottom-up methods.

Top-Down Parsing
We introduce top-down parsing by considering a grammar that is well-suited
for this class of methods. Later in this section, we consider the construction
of top-down parsers in general. The following grammar generates a subset of
the types of Pascal. We use the token dotdot for ". ." to emphasize that the
character sequence is treated as a unit.

type → simple
     | ↑ id
     | array [ simple ] of type                    (2.8)
simple → integer
     | char
     | num dotdot num
The top-down construction of a parse tree is done by starting with the root,
labeled with the starting nonterminal, and repeatedly performing the following
two steps (see Fig. 2.15 for an example).

1. At node n, labeled with nonterminal A, select one of the productions for


A and construct children at n for the symbols on the right side of the pro-
duction.

2. Find the next node at which a subtree is to be constructed.

For some grammars, the above steps can be implemented during a single left-
to-right scan of the input string. The current token being scanned in the input
is frequently referred to as the lookahead symbol. Initially, the lookahead
symbol is the first, i.e., leftmost, token of the input string. Figure 2.16 illus-

trates the parsing of the string

array [ num dotdot num ] of integer

Initially, the token array is the lookahead symbol and the known part of the

Fig. 2.15. Steps in the top-down construction of a parse tree.

parse tree consists of the root, labeled with the starting nonterminal type in

Fig. 2.16(a). The objective is to construct the remainder of the parse tree in
such a way that the string generated by the parse tree matches the input
string.
For a match to occur, nonterminal type in Fig. 2.16(a) must derive a string
that starts with the lookahead symbol array. In grammar (2.8), there is just
one production for type that can derive such a string, so we select it, and con-
struct the children of the root labeled with the symbols on the right side of the
production.
Each of the three snapshots in Fig. 2.16 has arrows marking the lookahead
symbol in the input and the node in the parse tree that is being considered.
When children are constructed at a node, we next consider the leftmost child.
In Fig. 2.16(b), children have just been constructed at the root, and the left-

most child labeled with array is being considered.


When the node being considered in the parse tree is for a terminal and the

Fig. 2.16. Top-down parsing while scanning the input from left to right.

terminal matches the lookahead symbol, then we advance in both the parse
tree and the input. The next token in the input becomes the new lookahead
symbol and the next child in the parse tree is considered. In Fig. 2.16(c), the
arrow in the parse tree has advanced to the next child of the root and the
arrow in the input has advanced to the next token [. After the next advance,
the arrow in the parse tree will point to the child labeled with nonterminal
simple. When a node labeled with a nonterminal is considered, we repeat the
process of selecting a production for the nonterminal.
In general, the selection of a production for a nonterminal may involve
trial-and-error; that is, we may have to try a production and backtrack to try
another production if the first is found to be unsuitable. A production is

unsuitable if, after using the production, we cannot complete the tree to match
the input string. There is an important special case, however, called predic-
tive parsing, in which backtracking does not occur.

Predictive Parsing

Recursive-descent parsing is a top-down method of syntax analysis in which we


execute a set of recursive procedures to process the input. A procedure is

associated with each nonterminal of a grammar. Here, we consider a special


form of recursive-descent parsing, called predictive parsing, in which the look-
ahead symbol unambiguously determines the procedure selected for each non-
terminal. The sequence of procedures called in processing the input implicitly
defines a parse tree for the input.

procedure match ( t : token );
begin
  if lookahead = t then
    lookahead := nexttoken
  else error
end;

procedure type ;
begin
  if lookahead is in { integer, char, num } then
    simple
  else if lookahead = '↑' then begin
    match ('↑'); match (id)
  end
  else if lookahead = array then begin
    match (array); match ('['); simple ; match (']'); match (of); type
  end
  else error
end;

procedure simple ;
begin
  if lookahead = integer then
    match (integer)
  else if lookahead = char then
    match (char)
  else if lookahead = num then begin
    match (num); match (dotdot); match (num)
  end
  else error
end;

Fig. 2.17. Pseudo-code for a predictive parser.

The predictive parser in Fig. 2.17 consists of procedures for the nontermi-
nals type and simple of grammar (2.8) and an additional procedure match. We

use match to simplify the code for type and simple; it advances to the next
input token if its argument t matches the lookahead symbol. Thus match
changes the variable lookahead, which is the currently scanned input token.
Parsing begins with a call of the procedure for the starting nonterminal type
in our grammar. With the same input as in Fig. 2.16, lookahead is initially
the first token array. Procedure type executes the code

match (array); match ('['); simple ; match (']'); match (of); type      (2.9)

corresponding to the right side of the production

type → array [ simple ] of type

Note that each terminal in the right side is matched with the lookahead sym-
bol and that each nonterminal in the right side leads to a call of its procedure.
With the input of Fig. 2.16, after the tokens array and [ are matched, the
lookahead symbol is num. At this point procedure simple is called and the
code

match (num); match (dotdot); match (num)

in its body is executed.


The lookahead symbol guides the selection of the production to be used. If

the right side of a production starts with a token, then the production can be
used when the lookahead symbol matches the token. Now consider a right
side starting with a nonterminal, as in

type → simple      (2.10)

This production is used if the lookahead symbol can be generated from simple.
For example, during the execution of the code fragment (2.9), suppose the
lookahead symbol is integer when control reaches the procedure call type.
There is no production for type that starts with token integer. However, a
production for simple does, so production (2.10) is used by having type call

procedure simple on lookahead integer.


Predictive parsing relies on information about what first symbols can be
generated by the right side of a production. More precisely, let α be the right
side of a production for nonterminal A. We define FIRST(α) to be the set of
tokens that appear as the first symbols of one or more strings generated from
α. If α is ε or can generate ε, then ε is also in FIRST(α).² For example,

FIRST(simple) = { integer, char, num }
FIRST(↑ id) = { ↑ }
FIRST(array [ simple ] of type) = { array }

In practice, many production right sides start with tokens, simplifying the

² Productions with ε on the right side complicate the determination of the first symbols generated
by a nonterminal. For example, if nonterminal B can derive the empty string and there is a pro-
duction A → BC, then the first symbol generated by C can also be the first symbol generated by A.
If C can also generate ε, then both FIRST(A) and FIRST(BC) contain ε.

construction of FIRST sets. An algorithm for computing FIRST is given in

Section 4.4.
The FIRST sets must be considered if there are two productions A → α and
A → β. Recursive-descent parsing without backtracking requires FIRST(α)
and FIRST(β) to be disjoint. The lookahead symbol can then be used to
decide which production to use; if the lookahead symbol is in FIRST(α), then
α is used. Otherwise, if the lookahead symbol is in FIRST(β), then β is
used.

When to Use ε-Productions

Productions with ε on the right side require special treatment. The recursive-
descent parser will use an ε-production as a default when no other production
can be used. For example, consider:

stmt → begin optstmts end
optstmts → stmt_list | ε

While parsing optstmts, if the lookahead symbol is not in FIRST(stmt_list),
then the ε-production is used. This choice is exactly right if the lookahead
symbol is end. Any lookahead symbol other than end will result in an error,
detected during the parsing of stmt.
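In a recursive-descent parser this default shows up as a final branch that
simply does nothing. A minimal C sketch, assuming a helper that tests
membership in FIRST(stmt_list) (the helper name and the token representation
are assumptions, not part of the text), might look like this:

extern int lookahead;                        /* current input token       */
extern int in_first_of_stmt_list(int tok);   /* assumed FIRST-set test    */
extern void stmt_list(void);

/* The default for the e-production: if the lookahead cannot begin a
   stmt_list, parse nothing and leave the lookahead symbol untouched.    */
void optstmts(void)
{
    if (in_first_of_stmt_list(lookahead))
        stmt_list();
    /* else: the e-production; an erroneous token will be detected later,
       while parsing the enclosing stmt, which expects 'end' next.       */
}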

Designing a Predictive Parser

A predictive parser is a program consisting of a procedure for every nontermi-


nal. Each procedure does two things.

1. It decides which production to use by looking at the lookahead symbol.


The production with right side α is used if the lookahead symbol is in
FIRST(α). If there is a conflict between two right sides for any look-
ahead symbol, then we cannot use this parsing method on this grammar.
A production with ε on the right side is used if the lookahead symbol is
not in the FIRST set for any other right hand side.

2. The procedure uses a production by mimicking the right side. A nonter-


minal results in a call to the procedure for the nonterminal, and a token
matching the lookahead symbol results in the next input token being read.
If at some point the token in the production does not match the look-
ahead symbol, an error is declared. Figure 2.17 is the result of applying
these rules to grammar (2.8).

Just as a translation scheme is formed by extending a grammar, a syntax-


directed translator can be formed by extending a predictive parser. An algo-
rithm for this purpose is given in Section 5.5. The following limited construc-
tion suffices for the present because the translation schemes implemented in

this chapter do not associate attributes with nonterminals:

1. Construct a predictive parser, ignoring the actions in productions.



2. Copy the actions from the translation scheme into the parser. If an action
appears after grammar symbol X in production p, then it is copied after
the code implementing X. Otherwise, if it appears at the beginning of the
production, then it is copied just before the code implementing the pro-
duction.

We shall construct such a translator in the next section.

Left Recursion

It is possible for a recursive-descent parser to loop forever. A problem arises


with left-recursive productions like

expr → expr + term

in which the leftmost symbol on the right side is the same as the nonterminal
on the left side of the production. Suppose the procedure for expr decides to
apply this production. The right side begins with expr so the procedure for
expr is called recursively, and the parser loops forever. Note that the look-
ahead symbol changes only when a terminal in the right side is matched.
Since the production begins with the nonterminal expr, no changes to the input
take place between recursive calls, causing the infinite loop.

Fig. 2.18. Left- and right-recursive ways of generating a string.

Consider a nonterminal A with two productions A → Aα and A → β, where α
and β are sequences of terminals and nonterminals that do not start with A.
For the productions expr → expr + term and expr → term, we have
A = expr, α = + term, and β = term.


The nonterminal A is left recursive because the production A → Aα has A
itself as the leftmost symbol on the right side. Repeated application of this
production builds up a sequence of α's to the right of A, as in Fig. 2.18(a).
When A is finally replaced by β, we have a β followed by a sequence of zero
or more α's.
The same effect can be achieved, as in Fig. 2.18(b), by rewriting the pro-
ductions for A in the following manner.

A → βR
R → αR | ε                                         (2.11)

Here R is a new nonterminal. The production R → αR is right recursive
because this production for R has R itself as the last symbol on the right side.

Right-recursive productions lead to trees that grow down towards the right, as
in Fig. 2.18(b). Trees growing down to the right make it harder to translate
expressions containing left-associative operators, such as minus. In the next

section, however, we shall see that the proper translation of expressions into
postfix notation can still be attained by a careful design of the translation
scheme based on a right-recursive grammar.
In Chapter 4, we consider more general forms of left recursion and show
how all left recursion can be eliminated from a grammar.
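As a small worked instance of (2.11), take the left-recursive pair
expr → expr + term | term, so that A = expr, α = + term, and β = term.
The rewriting yields

expr → term rest
rest → + term rest | ε

where rest plays the role of R. Handling expr → expr - term as well simply
adds the alternative - term rest to rest; this is exactly the grammar underlying
the translation scheme developed in the next section.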

2.5 A TRANSLATOR FOR SIMPLE EXPRESSIONS


Using the techniques of the last three sections, we now construct a syntax-
directed translator, in the form of a working C program, that translates arith-
metic expressions into postfix form. To keep the initial program manageably
small, we start off with expressions consisting of digits separated by plus and
minus signs. The language is extended in the next two sections to include
numbers, identifiers, and other operators. Since expressions appear as a con-
struct in so many languages, it is worth studying their translation in detail.

expr → expr + term    { print('+') }
expr → expr - term    { print('-') }
expr → term
term → 0              { print('0') }
term → 1              { print('1') }
  ...
term → 9              { print('9') }

Fig. 2.19. Initial specification of infix-to-postfix translator.

A syntax-directed translation scheme can often serve as the specification for


a translator. We use the scheme in Fig. 2.19 (repeated from Fig. 2.13) as the

definition of the translation to be performed. As is often the case, the under-


lying grammar of a given scheme has to be modified before it can be parsed
with a predictive parser. In particular, the grammar underlying the scheme in
Fig. 2.19 is left-recursive, and as we saw in the last section, a predictive
parser cannot handle a left-recursive grammar. By eliminating the left-
recursion, we can obtain a grammar suitable for use in a predictive recursive-
descent translator.

Abstract and Concrete Syntax

A useful starting point for thinking about the translation of an input string is

an abstract syntax tree in which each node represents an operator and the chil-
dren of the node represent the operands. By contrast, a parse tree is called a
concrete syntax tree, and the underlying grammar is called a concrete syntax
for the language. Abstract syntax trees, or simply syntax trees, differ from
parse trees because superficial distinctions of form, unimportant for transla-
tion, do not appear in syntax trees.

      +
     / \
    -   2
   / \
  9   5

Fig. 2.20. Syntax tree for 9-5+2.

For example, the syntax tree for 9-5+2 is shown in Fig. 2.20. Since + and
- have the same precedence level, and operators at the same precedence level
are evaluated left to right, the tree shows 9-5 grouped as a subexpression.
Comparing Fig. 2.20 with the corresponding parse tree of Fig. 2.2, we note
that the syntax tree associates an operator with an interior node, rather than
making the operator be one of the children.
It is desirable for a translation scheme to be based on a grammar whose

parse trees are as close to syntax trees as possible. The grouping of subex-
pressions by the grammar in Fig. 2.19 is similar to their grouping in syntax
trees. Unfortunately, the grammar of and hence
Fig. 2.19 is left-recursive,
not suitable for predictive parsing. It on the one
appears there is a conflict;
hand we need a grammar that facilitates parsing, on the other hand we need a
radically different grammar for easy translation. The obvious solution is to
eliminate the left-recursion. However, this must be done carefully as the fol-
lowing example shows.

Example 2.9. The following grammar is unsuitable for translating expressions


into postfix form, even though it generates exactly the same language as the
grammar in Fig. 2.19 and can be used for recursive-descent parsing.

expr → term rest
rest → + expr | - expr | ε
term → 0 | 1 | ... | 9

This grammar has the problem that the operands of the operators generated
by rest → + expr and rest → - expr are not obvious from the productions.
Neither of the following choices for forming the translation rest.t from that of

expr.t is acceptable:

rest → - expr { rest.t := '-' || expr.t }          (2.12)
rest → - expr { rest.t := expr.t || '-' }          (2.13)

(We have only shown the production and semantic action for the minus opera-
tor.) The translation of 9-5 is 95-. However, if we use the action in (2.12),
then the minus sign appears before expr.t and 9-5 incorrectly remains 9-5 in

translation.
On the other hand, if we use (2.13) and the analogous rule for plus, the
operators consistently move to the right end and 9-5 + 2 is translated
incorrectly into 952 + - (the correct translation is 95-2 + ).

Adapting the Translation Scheme

The left-recursion elimination technique sketched in Fig. 2.18 can also be


applied to productions containing semantic actions. We extend the transfor-
mation in Section 5.5 to take synthesized attributes into account. The tech-
nique transforms the productions A → Aα | Aβ | γ into

A → γR
R → αR | βR | ε

When semantic actions are embedded in the productions, we carry them along
in the transformation. Here, if we let A = expr, α = + term { print('+') },
β = - term { print('-') }, and γ = term, the transformation above produces
the translation scheme (2.14). The expr productions in Fig. 2.19 have been
transformed into the productions for expr and the new nonterminal rest in
(2.14). The productions for term are repeated from Fig. 2.19. Notice that the
underlying grammar is different from the one in Example 2.9 and the differ-
ence makes the desired translation possible.

expr → term rest
rest → + term { print('+') } rest
     | - term { print('-') } rest
     | ε                                           (2.14)
term → 0 { print('0') }
term → 1 { print('1') }
  ...
term → 9 { print('9') }
term -* 9 { print ('9') }

Figure 2.21 shows how 9-5 + 2 is translated using the above grammar.

Fig. 2.21. Translation of 9-5 + 2 into 95-2 + .

Procedures for the Nonterminals expr, term, and rest

We now implement a translator in C using the syntax-directed translation


scheme (2.14). The essence of the translator is the C code in Fig. 2.22 for the
functions expr, term, and rest. These functions implement the
corresponding nonterminals in (2.14).

expr()
{
    term(); rest();
}

rest()
{
    if (lookahead == '+') {
        match('+'); term(); putchar('+'); rest();
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); rest();
    }
    else ;
}

term()
{
    if (isdigit(lookahead)) {
        putchar(lookahead); match(lookahead);
    }
    else error();
}

Fig. 2.22. Functions for the nonterminals expr, rest, and term.

The function match, presented later, is the C counterpart of the code in



Fig. 2.17 to match a token with the lookahead symbol and advance through
the input. Since each token is a single character in our language, match can
be implemented by comparing and reading characters.
For those unfamiliar with the programming language C, we mention the
salient differences between C and other Algol derivatives such as Pascal, as
we find uses for those features of C. A program in C consists of a sequence
of function definitions, with execution starting at a distinguished function
called main. Function definitions cannot be nested. Parentheses enclosing
function parameter lists are needed even if there are no parameters; hence we
write expr(), term(), and rest(). Functions communicate either by pass-
ing parameters "by value" or by accessing data global to all functions. For
example, the functions term() and rest() examine the lookahead symbol
using the global identifier lookahead.


C and Pascal use the following symbols for assignments and equality tests:

Operation          C        Pascal
assignment         =        :=
equality test      ==       =
inequality test    !=       <>

Optimizing the Translator

Certain recursive calls can be replaced by iteration. When the last statement
executed in a procedure body is a recursive call of the same procedure, the
call is said to be tail recursive. For example, the calls of rest() at the end
of each branch of function rest() in Fig. 2.22 are tail recursive,
because control flows to the end of the function body after each of these calls.
We can speed up a program by replacing tail recursion by iteration. For a
procedure without parameters, a tail-recursive call can be simply replaced by a
jump to the beginning of the procedure. The code for rest can be rewritten
as:

rest()
{
L:  if (lookahead == '+') {
        match('+'); term(); putchar('+'); goto L;
    }
    else if (lookahead == '-') {
        match('-'); term(); putchar('-'); goto L;
    }
    else ;
}

As long as the lookahead symbol is a plus or minus sign, procedure rest


matches the sign, calls term to match a digit, and repeats the process. Note
that since match removes the sign each time it is called, this cycle occurs only
on alternating sequences of signs and digits. If this change is made in Fig.

2.22, the only remaining call of rest is from expr (see line 3). The two
functions can therefore be integrated into one, as shown in Fig. 2.23. In C, a
statement stmt can be repeatedly executed by writing

while ( 1 ) stmt

because the condition 1 is always true. We can exit from a loop by executing
a break-statement. The stylized form of the code in Fig. 2.23 allows other
operators to be added conveniently.

expr()
{
    term();
    while(1)
        if (lookahead == '+') {
            match('+'); term(); putchar('+');
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-');
        }
        else break;
}

Fig. 2.23. Replacement for functions expr and rest of Fig. 2.22.

The Complete Program


The complete C program for our translator is shown in Fig. 2.24. The first

line, beginning with #include, loads <ctype.h>, a file of standard routines


that contains the code for the predicate isdigit.
Tokens, consisting of single characters, are supplied by the standard library
routine getchar that reads the next character from the input file. However,
lookahead is declared to be an integer on line 2 of Fig. 2.24 to anticipate
the additional tokens that are not single characters that will be introduced in

later sections. Since lookahead is declared outside any of the functions, it is

global to any functions that are defined after line 2 of Fig. 2.24.
The function match checks tokens; it reads the next input token if the look-
ahead symbol is matched and calls the error routine otherwise.

The function error uses the standard library function printf to print the message "syntax error" and then terminates execution by the call exit(1) to another standard library function.

2.6 LEXICAL ANALYSIS


We shall now add to the translator of the previous section a lexical analyzer
that reads and converts the input into a stream of tokens to be analyzed by the
parser. Recall from the definition of a grammar in Section 2.2 that the sen-
tences of a language consist of strings of tokens. A sequence of input charac-
ters that comprises a single token is called a lexeme. A lexical analyzer can
insulate a parser from the lexeme representation of tokens. We begin by list-
ing some of the functions we might want a lexical analyzer to perform.

Removal of White Space and Comments


The expression translator in the last section sees every character in the input,
so extraneous characters, such as blanks, will cause it to fail. Many languages
allow "white space" (blanks, tabs, and newlines) to appear between tokens.
Comments can likewise be ignored by the parser and translator, so they may
also be treated as white space.
If white space is eliminated by the lexical analyzer, the parser will never
have to consider it. The alternative of modifying the grammar to incorporate
white space into the syntax is not nearly as easy to implement.

Constants

Anytime a single digit appears in an expression, it seems reasonable to allow an arbitrary integer constant in its place. Since an integer constant is a sequence of digits, integer constants can be allowed either by adding productions to the grammar for expressions, or by creating a token for such constants. The job of collecting digits into integers is generally given to a lexical analyzer because numbers can be treated as single units during translation.
Let num be the token representing an integer. When a sequence of digits


    #include <ctype.h>   /* loads file with predicate isdigit */
    int lookahead;

    main()
    {
        lookahead = getchar();
        expr();
        putchar('\n');        /* adds trailing newline character */
    }

    expr()
    {
        term();
        while(1)
            if (lookahead == '+') {
                match('+'); term(); putchar('+');
            }
            else if (lookahead == '-') {
                match('-'); term(); putchar('-');
            }
            else break;
    }

    term()
    {
        if (isdigit(lookahead)) {
            putchar(lookahead); match(lookahead);
        }
        else error();
    }

    match(t)
        int t;
    {
        if (lookahead == t)
            lookahead = getchar();
        else error();
    }

    error()
    {
        printf("syntax error\n");   /* print error message */
        exit(1);                    /* then halt */
    }

Fig. 2.24. C program to translate an infix expression into postfix form.



appears in the input stream, the lexical analyzer will pass num to the parser.
The value of the integer will be passed along as an attribute of the token num.
Logically, the lexical analyzer passes both the token and the attribute to the
parser. If we write a token and its attribute as a tuple enclosed between <>,
the input

31 + 28 + 59

is transformed into the sequence of tuples

    <num, 31>  <+, >  <num, 28>  <+, >  <num, 59>
The token + has no attribute. The second components of the tuples, the attri-
butes, play no role during parsing, but are needed during translation.

Recognizing Identifiers and Keywords

Languages use identifiers as names of variables, arrays, functions, and the


like. A grammar for a language often treats an identifier as a token. A
parser based on such a grammar wants to see the same token, say id, each
time an identifier appears in the input. For example, the input

count = count + increment; (2.15)

would be converted by the lexical analyzer into the token stream

id = id + id ; (2.16)

This token stream is used for parsing.


When talking about the lexical analysis of the input line (2.15), it is useful to distinguish between the token id and the lexemes count and increment associated with instances of this token. The translator needs to know that the lexeme count forms the first two instances of id in (2.16) and that the lexeme increment forms the third instance of id.
When a lexeme forming an identifier is seen in the input, some mechanism
is needed to determine if the lexeme has been seen before. As mentioned in
Chapter 1, a symbol table is used as such a mechanism. The lexeme is stored
in the symbol table and a pointer to this symbol-table entry becomes an attri-

bute of the token id.

Many languages use fixed character strings such as begin, end, if, and so on, as punctuation marks or to identify certain constructs. These character strings, called keywords, generally satisfy the rules for forming identifiers, so a mechanism is needed for deciding when a lexeme forms a keyword and when it forms an identifier. The problem is easier to resolve if keywords are reserved, i.e., if they cannot be used as identifiers. Then a character string forms an identifier only if it is not a keyword.
The problem of isolating tokens also arises if the same characters appear in the lexemes of more than one token, as in <, <=, and <> in Pascal. Techniques for recognizing such tokens efficiently are discussed in Chapter 3.



Interface to the Lexical Analyzer

When a lexical analyzer is inserted between the parser and the input stream, it

interacts with the two in the manner shown in Fig. 2.25. It reads characters
from the input, groups them into lexemes, and passes the tokens formed by
the lexemes, together with their attribute values, to the later stages of the
compiler. In some situations, the lexical analyzer has to read some characters
ahead before it can decide on the token to be returned to the parser. For
example, a lexical analyzer for Pascal must read ahead after it sees the charac-
ter >. If the next character is =, then the character sequence >= is the lexeme

forming the token for the "greater than or equal to" operator. Otherwise > is
the lexeme forming the "greater than" operator, and the lexical analyzer has
read one character too many. The extra character has to be pushed back onto
the input, because it can be the beginning of the next lexeme in the input.

    (Figure: the lexical analyzer sits between the input and the parser; it reads characters from the input, pushes back characters when necessary, and passes each token and its attributes to the parser.)

Fig. 2.25. Inserting a lexical analyzer between the input and the parser.

The lexical analyzer and parser form a producer-consumer pair. The lexical
analyzer produces tokens and the parser consumes them. Produced tokens can
be held in a token buffer until they are consumed. The interaction between
the two is constrained only by the size of the buffer, because the lexical
analyzer cannot proceed when the buffer is full and the parser cannot proceed
when the buffer is empty. Commonly, the buffer holds just one token. In this case, the interaction can be implemented simply by making the lexical analyzer be a procedure called by the parser, returning tokens on demand.
The implementation of reading and pushing back characters is usually done by setting up an input buffer. A block of characters is read into the buffer at a time; a pointer keeps track of the portion of the input that has been analyzed. Pushing back a character is implemented by moving back the
pointer. Input characters may also need to be saved for error reporting, since
some indication has to be given of where in the input text the error occurred.
The buffering of input characters can be justified on efficiency grounds alone.
Fetching a block of characters is usually more efficient than fetching one char-
acter at a time. Techniques for input buffering are discussed in Section 3.2.
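The following is a minimal sketch, not the book's code, of such an input buffer: a block of characters is read at a time, and pushing back the last character read is just a matter of moving an index back. The buffer size and the helper names nextchar and pushback are assumptions made for illustration.

    #include <stdio.h>

    #define BUFSIZE 4096

    static char buf[BUFSIZE];
    static int  len  = 0;          /* number of characters currently in buf */
    static int  next = 0;          /* index of the next unread character    */

    int nextchar()                 /* return the next character, or EOF     */
    {
        if (next == len) {         /* buffer exhausted: refill it */
            len = fread(buf, 1, BUFSIZE, stdin);
            next = 0;
            if (len == 0)
                return EOF;
        }
        return buf[next++];
    }

    void pushback()                /* un-read the character just returned   */
    {
        if (next > 0)
            next = next - 1;
    }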


A Lexical Analyzer

We now construct a rudimentary lexical analyzer for the expression translator


of Section 2.5. The purpose of the lexical analyzer is to allow white space and
numbers to appear within expressions. In the next section, we extend the lexi-
cal analyzer to allow identifiers as well.

    (Figure: the lexical analyzer uses getchar() to read a character from the input and pushes back a character c using ungetc(c, stdin).)


Allowing numbers within expressions requires a change in the grammar in


Fig. 2.19. We replace the individual digits by the nonterminal factor and
introduce the following productions and semantic actions:

    factor  →  ( expr )
            |  num     { print(num.value) }

The C code for factor in Fig. 2.27 is a direct implementation of the produc-
tions above. When lookahead equals NUM, the value of attribute num.value
is given by the global variable tokenval. The action of printing this value is
done by the standard library function printf. The first argument of
printf is a string between double quotes specifying the format to be used for
printing the remaining arguments. Where %d appears in the string, the
decimal representation of the next argument is printed. Thus, the printf
statement in Fig. 2.27 prints a blank followed by the decimal representation of
tokenval followed by another blank.

    factor()
    {
        if (lookahead == '(') {
            match('('); expr(); match(')');
        }
        else if (lookahead == NUM) {
            printf(" %d ", tokenval); match(NUM);
        }
        else error();
    }

Fig. 2.27. C code for factor when operands can be numbers.

The implementation of function lexan is shown in Fig. 2.28. Every time


the body of the while statement on lines 8-28 is executed, a character is read
into t on line 9. If the character is a blank or a tab (written '\t'), then no
token is returned to the parser; we merely go around the while loop again. If

the character is a newline (written '\n'), then a global variable lineno is

incremented, thereby keeping track of line numbers in the input, but again no
token is returned. Supplying a line number with an error message helps pin-
point errors.
The code for reading a sequence of digits is on lines 14-23. The predicate
isdigit(t) from the include-file <ctype.h> is used on lines 14 and 17 to
determine if an incoming character t is a digit. If it is, then its integer value
is given by the expression t-'O' in both ASCII and EBCDIC. With other
character sets, the conversion may need to be done differently. In Section
2.9, we incorporate this lexical analyzer into our expression translator.
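The following is a sketch of such a routine; the line numbers cited above refer to Fig. 2.28 itself, not to this sketch, and the small driver at the end is an illustrative assumption rather than part of the translator. NUM and NONE are the token codes used elsewhere in the chapter.

    #include <stdio.h>
    #include <ctype.h>

    #define NUM  256
    #define NONE -1

    int lineno = 1;
    int tokenval = NONE;

    int lexan()
    {
        int t;
        while (1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;                            /* strip out blanks and tabs */
            else if (t == '\n')
                lineno = lineno + 1;         /* keep track of line numbers */
            else if (isdigit(t)) {
                tokenval = t - '0';          /* value of the first digit */
                t = getchar();
                while (isdigit(t)) {         /* collect the following digits */
                    tokenval = tokenval * 10 + t - '0';
                    t = getchar();
                }
                ungetc(t, stdin);            /* read one character too many */
                return NUM;
            }
            else {
                tokenval = NONE;
                return t;                    /* the character itself is the token */
            }
        }
    }

    int main()                               /* driver: print the token stream */
    {
        int t;
        while ((t = lexan()) != EOF)
            if (t == NUM) printf("NUM %d\n", tokenval);
            else          printf("%c\n", t);
        return 0;
    }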
2.7 INCORPORATING A SYMBOL TABLE

In this section, we illustrate how the lexical analyzer of the previous section might interact with a symbol table.

The Symbol-Table Interface

The symbol-table routines are concerned primarily with saving and retrieving
lexemes. When a lexeme is saved, we also save the token associated with the
lexeme. The following operations will be performed on the symbol table.

    insert(s, t):  Returns index of new entry for string s, token t.
    lookup(s):     Returns index of the entry for string s, or 0 if s is not found.

The lexical analyzer uses the lookup operation to determine whether there is an entry for a lexeme in the symbol table. If no entry exists, then it uses the insert operation to create one. We shall discuss an implementation in which the lexical analyzer and parser both know about the format of symbol-table entries.

Handling Reserved Keywords

The symbol-table routines above can handle any collection of reserved keywords. For example, consider tokens div and mod with lexemes div and mod, respectively. We can initialize the symbol table using the calls

    insert("div", div);
    insert("mod", mod);

Any subsequent call lookup("div") returns the token div, so div cannot
be used as an identifier.
Any collection of reserved keywords can be handled in this way by
appropriately initializing the symbol table.

A Symbol-Table Implementation

The data structure for a particular implementation of a symbol table is

sketched in Fig. 2.29. We do not wish to set aside a fixed amount of space to
hold lexemes forming identifiers; a fixed amount of space may not be large
enough to hold a very long identifier and may be wastefully large for a short
identifier, such as i. In Fig. 2.29, a separate array lexemes holds the char-
acter string forming an identifier. The string is terminated by an end-of-string
character, denoted by EOS, that may not appear in identifiers. Each entry in
the symbol-table array symtable is a record consisting of two fields,
lexptr, pointing to the beginning of a lexeme, and token. Additional fields
can hold attribute values, although we shall not do so here.
In Fig. 2.29, the 0th entry is left empty, because lookup returns 0 to indicate that there is no entry for a string. The 1st and 2nd entries are for the keywords div and mod. The 3rd and 4th entries are for identifiers count and i.

    (Figure: the array symtable holds entries consisting of a lexptr field, pointing into the array lexemes, and a token field. Entry 0 is empty; entries 1 and 2 point to the lexemes div and mod with tokens div and mod; entries 3 and 4 point to count and i with token id. The lexemes array holds  d i v EOS m o d EOS c o u n t EOS i EOS.)

Fig. 2.29. Symbol table and array for storing strings.

Pseudo-code for a lexical analyzer that handles identifiers is shown in Fig.

2.30; a C implementation appears in Section 2.9. White space and integer


constants are handled by the lexical analyzer in the same manner as in Fig.
2.28 in the last section.
When our present lexical analyzer reads a letter, it starts saving letters and
digits in a buffer lexbuf . The string collected in lexbuf is then looked up
in the symbol table, using the lookup operation. Since the symbol table is

initialized with entries for the keywords div and mod, as shown in Fig. 2.29,
the lookup operation will find these entries if lexbuf contains either div or
mod. If there is no entry for the string in lexbuf, i.e., lookup returns 0, then lexbuf contains a lexeme for a new identifier. An entry for the new
identifier is created using insert. After the insertion is made, p is the index
of the symbol-table entry for the string in lexbuf. This index is communi-
cated to the parser by setting tokenval to p, and the token in the token
field of the entry is returned.
The default action is to return the integer encoding of the character as a
token. Since the single character tokens here have no attributes, tokenval is

set to NONE.

2.8 ABSTRACT STACK MACHINES


The front end of a compiler constructs an intermediate representation of the source program from which the back end generates the target program. One popular form of intermediate representation is code for an abstract stack machine. As mentioned in Chapter 1, partitioning a compiler into a front end and a back end makes it easier to modify a compiler to run on a new machine. In this section, we present an abstract stack machine and show how code can be generated for it.

    function lexan: integer;
        var lexbuf: array [0..100] of char;
            c: char;
        begin
            loop begin
                read a character into c;
                if c is a blank or a tab then
                    do nothing
                else if c is a newline then
                    lineno := lineno + 1
                else if c is a digit then begin
                    set tokenval to the value of this and following digits;
                    return NUM
                end
                else if c is a letter then begin
                    place c and successive letters and digits into lexbuf;
                    p := lookup(lexbuf);
                    if p = 0 then
                        p := insert(lexbuf, ID);
                    tokenval := p;
                    return the token field of table entry p
                end
                else begin   /* token is a single character */
                    set tokenval to NONE;   /* there is no attribute */
                    return integer encoding of character c
                end
            end
        end

Fig. 2.30. Pseudo-code for a lexical analyzer.

The machine has separate instruction and data memories and all arithmetic operations are performed on values on a stack. The instructions are quite limited and fall into three classes: integer arithmetic, stack manipulation, and control flow. Figure 2.31 illustrates the machine. The pointer pc indicates the instruction we are about to execute. The meanings of the instructions shown will be discussed shortly.

Arithmetic Instructions

The abstract machine must implement each operator in the intermediate


language. A basic operation, such as addition or subtraction, is supported
directly by the abstract machine. A more
complex operation, however, may
need to be implemented as a sequence of abstract machine instructions. We
simplify the description of the machine by assuming that there is an instruction for each arithmetic operator.

    (Figure: instruction memory holding push 5, rvalue 2, +, rvalue 3, ...; the stack and data memory as they appear after the first four instructions have been executed, with pc indicating the next instruction.)

Fig. 2.31. Snapshot of the stack machine after the first four instructions are executed.



The abstract machine code for an arithmetic expression simulates the
evaluation of a postfix representation for that expression using a stack. The
evaluation proceeds by processing the postfix representation from left to right,

pushing each operand onto the stack as it is encountered. When a k-ary operator is encountered, its leftmost argument is k-1 positions below the top of the stack and its rightmost argument is at the top. The evaluation applies the operator to the top k values on the stack, pops the operands, and pushes the result onto the stack. For example, in the evaluation of the postfix expression 1 3 + 5 *, the following actions are performed:

1. Stack 1.

2. Stack 3.

3. Add the two topmost elements, pop them, and stack the result 4.
4. Stack 5.

5. Multiply the two topmost elements, pop them, and stack the result 20.

The value on top of the stack at the end (here 20) is the value of the entire
expression.
In the intermediate language, all values will be integers, with 0 corresponding to false and nonzero integers corresponding to true. The boolean operators and and or require both their arguments to be evaluated.

L-values and R-values

There is a distinction between the meaning of identifiers on the left and right
sides of an assignment. In each of the assignments

    i := 5;
    i := i + 1;

the right side specifies an integer value, while the left side specifies where the
value is to be stored. Similarly, if p and q are pointers to characters, and

    p↑ := q↑;
SEC. 2.8 ABSTRACT MACHINES 65

the right side q↑ specifies a character, while p↑ specifies where the character
is to be stored. The terms l-value and r-value refer to values that are
appropriate on the left and right sides of an assignment, respectively. That is,

r-values are what we usually think of as "values," while l-values are locations.

Stack Manipulation

Besides the obvious instruction for pushing an integer constant onto the stack
and popping a value from the top of the stack, there are instructions to access
data memory:

    push v      push v onto the stack
    rvalue l    push contents of data location l
    lvalue l    push address of data location l
    pop         throw away value on top of the stack
    :=          the r-value on top is placed in the l-value below it
                and both are popped
    copy        push a copy of the top value on the stack
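None of the following appears in the text; it is a minimal sketch, in C, of an interpreter for the stack-manipulation instructions above together with a single arithmetic instruction, under an assumed encoding of one operation code and one integer argument per instruction.

    #include <stdio.h>

    enum op { PUSH, RVALUE, LVALUE, POP, ASSIGN, COPY, ADD };

    struct instr { enum op op; int arg; };

    int stack[100], top = -1;              /* evaluation stack */
    int data[100];                         /* data memory, locations 0..99 */

    void exec(struct instr *code, int n)
    {
        int pc, r, l;
        for (pc = 0; pc < n; pc++) {
            struct instr i = code[pc];
            switch (i.op) {
            case PUSH:   stack[++top] = i.arg;        break;
            case RVALUE: stack[++top] = data[i.arg];  break;
            case LVALUE: stack[++top] = i.arg;        break;   /* push the address */
            case POP:    top = top - 1;               break;
            case ASSIGN: r = stack[top--];            /* r-value on top ...   */
                         l = stack[top--];            /* ... l-value below it */
                         data[l] = r;                 break;
            case COPY:   stack[top+1] = stack[top]; top++;          break;
            case ADD:    r = stack[top--]; stack[top] = stack[top] + r;  break;
            }
        }
    }

    int main()          /* evaluate a := 2 + 3, taking a to be data location 0 */
    {
        struct instr code[] = {
            { LVALUE, 0 }, { PUSH, 2 }, { PUSH, 3 }, { ADD, 0 }, { ASSIGN, 0 }
        };
        exec(code, 5);
        printf("%d\n", data[0]);           /* prints 5 */
        return 0;
    }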

Translation of Expressions

Code to evaluate an expression on a stack machine is closely related to postfix


notation for that expression. By definition, the postfix form of expression E + F is the concatenation of the postfix form of E, the postfix form of F, and +. Similarly, stack-machine code to evaluate E + F is the concatenation of the code to evaluate E, the code to evaluate F, and the instruction to add their values. The translation of expressions into stack-machine code can
therefore be done by adapting the translators in Sections 2.6 and 2.7.
Here we generate stack code for expressions in which data locations are
addressed symbolically. (The allocation of data locations for identifiers is dis-
cussed in Chapter 7.) The expression a+b translates into:

rvalue a
rvalue b
+

In words: push the contents of the data locations for a and b onto the stack;
then pop the top two values on the stack, add them, and push the result onto
the stack.
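One way to carry out this adaptation is sketched below; it is not code from the text (see also programming exercise P2.2 at the end of this chapter). The emit function of the complete translator in Section 2.9 is changed so that it prints stack-machine instructions instead of postfix symbols: numbers become push instructions, identifiers become rvalue instructions, and each operator becomes the corresponding machine instruction. The declarations of global.h in Section 2.9 are assumed.

    emit(t, tval)          /* sketch: generate stack-machine code, not postfix */
        int t, tval;
    {
        switch(t) {
        case '+': case '-': case '*': case '/':
            printf("%c\n", t); break;        /* one instruction per operator */
        case DIV:
            printf("div\n"); break;
        case MOD:
            printf("mod\n"); break;
        case NUM:
            printf("push %d\n", tval); break;
        case ID:
            printf("rvalue %s\n", symtable[tval].lexptr); break;
        default:
            printf("token %d, tokenval %d\n", t, tval);
        }
    }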
The translation of assignments into stack-machine code is done as follows:

the /-value of the identifier assigned to is pushed onto the stack, the expres-
sion is evaluated, and its r-value is assigned to the identifier. For example,
the assignment

day := (1461*y) div 4 + (153*m + 2) div 5 + d (2.17)

translates into the code in Fig. 2.32.



    lvalue day
    push 1461
    rvalue y
    *
    push 4
    div
    push 153
    rvalue m
    *
    push 2
    +
    push 5
    div
    +
    rvalue d
    +
    :=

Fig. 2.32. Translation of day := (1461*y) div 4 + (153*m + 2) div 5 + d.

Translation of Statements

The layout in Fig. 2.33 sketches the abstract-machine code for conditional and
while statements. The following discussion concentrates on creating labels.
Consider the code layout for if-statements in Fig. 2.33. There can only be
one label out instruction in the translation of a source program; otherwise,
there will be confusion about where control flows to from a goto out state-
ment. We therefore need some mechanism for consistently replacing out in
the code layout by a unique label every time an if-statement is translated.
Suppose newlabel is a procedure that returns a fresh label every time it is called. In the following semantic action, the label returned by a call of newlabel is recorded using a local variable out:

    stmt  →  if expr then stmt1    { out := newlabel;                          (2.18)
                                     stmt.t := expr.t ∥ 'gofalse' out ∥
                                               stmt1.t ∥ 'label' out }

         If                         While

    code for expr              label test
    gofalse out                code for expr
    code for stmt1             gofalse out
    label out                  code for stmt1
                               goto test
                               label out

Fig. 2.33. Code layout for conditional and while statements.

Emitting a Translation

The expression translators in Section 2.5 used print statements to incremen-


tally generate the translation of an expression. Similar print statements can be
used to emit the translation of statements. Instead of print statements, we use
a procedure emit to hide printing details. For example, emit can worry about
whether each abstract-machine instruction needs to be on a separate line.

Using the procedure emit, we can write the following instead of (2.18):

    stmt  →  if
             expr     { out := newlabel; emit('gofalse', out); }
             then
             stmt1    { emit('label', out); }

When semantic actions appear within a production, we consider the elements



on the right side of the production in a left-to-right order. For the above production, the order of actions is as follows: actions during the parsing of expr are done, out is set to the label returned by newlabel and the gofalse instruction is emitted, actions during the parsing of stmt1 are done, and, finally, the label instruction is emitted. Assuming the actions during the parsing of expr and stmt1 emit the code for these nonterminals, the above production implements the code layout of Fig. 2.33.

    procedure stmt;
        var test, out: integer;   /* for labels */
        begin
            if lookahead = id then begin
                emit('lvalue', tokenval); match(id); match(':='); expr
            end
            else if lookahead = 'if' then begin
                match('if');
                expr;
                out := newlabel;
                emit('gofalse', out);
                match('then');
                stmt;
                emit('label', out)
            end
            /* code for remaining statements goes here */
            else error
        end

Fig. 2.34. Pseudo-code for translating statements.

Pseudo-code for translating assignment and conditional statements is shown


in Fig. 2.34. Since variable out is local to procedure stmt, its value is not
affected by the calls to procedures expr and stmt. The generation of labels

requires some thought. Suppose that the labels in the translation are of the
form LI, L2, .... The pseudo-code manipulates such labels using the
integer following L. Thus, out is declared to be an integer, newlahel returns
an integer that becomes the value of out, and emit must be written to print a
label given an integer.
The code layout for while statements in Fig. 2.33 can be converted into
code in a similar fashion. The translation of a sequence of statements is sim-
ply the concatenation of the statements in the sequence, and is left to the

reader.
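The following small program is a sketch, not the book's code, of how labels of this form might be generated and how the while-statement layout of Fig. 2.33 then comes out; the helper names newlabel, emitlabel, and emitjump are illustrative assumptions.

    #include <stdio.h>

    static int lastlabel = 0;              /* counter behind newlabel */

    int newlabel()                         /* return a fresh label number */
    {
        lastlabel = lastlabel + 1;
        return lastlabel;
    }

    void emitlabel(int n)  { printf("label L%d\n", n); }
    void emitjump(const char *op, int n) { printf("%s L%d\n", op, n); }

    int main()                             /* emit the layout for a while statement */
    {
        int test = newlabel();
        int out  = newlabel();
        emitlabel(test);                   /* label test   */
        printf("...code for expr...\n");
        emitjump("gofalse", out);          /* gofalse out  */
        printf("...code for stmt1...\n");
        emitjump("goto", test);            /* goto test    */
        emitlabel(out);                    /* label out    */
        return 0;
    }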
The translation of most single-entry single-exit constructs is similar to that
of while statements. We illustrate by considering control flow in expressions.

Example 2.10. The lexical analyzer in Section 2.7 contains a conditional of



the form:

    if t = blank or t = tab then

If t is a blank, then clearly it is not necessary to test if t is a tab, because the first equality implies that the condition is true. The expression

    expr1 or expr2

can therefore be implemented as

    if expr1 then true else expr2

The reader can verify that the following code implements the or operator:

    code for expr1
    copy             /* copy value of expr1 */
    gotrue out
    pop              /* pop value of expr1 */
    code for expr2
    label out

Recall that the gotrue and gofalse instructions pop the value on top of the stack to simplify code generation for conditional and while statements. By copying the value of expr1 we ensure that the value on top of the stack is true if the gotrue instruction leads to a jump.

2.9 PUTTING THE TECHNIQUES TOGETHER


In this chapter, we have presented a number of syntax-directed techniques for
constructing a compiler front end. To summarize these techniques, in this
section we put together a C program that functions as an infix-to-postfix trans-
lator for a language consisting of sequences of expressions terminated by semi-
colons. The expressions consist of numbers, identifiers, and the operators +,
-, *, /, div, and mod. The output of the translator is a postfix representa-
tion for each expression. The translator is an extension of the programs
developed in Sections 2.5-2.7. A listing of the complete C program is given at
the end of this section.

Description of the Translator

The translator is designed using the syntax-directed translation scheme in Fig.

2.35. The token id represents a nonempty sequence of letters and digits


beginning with a letter, num a sequence of digits, and eof an end-of-file char-
acter. Tokens are separated by sequences of blanks, tabs, and newlines
("white space"). The attribute lexeme of the token id gives the character
string forming the token; the attribute value of the token num gives the
integer represented by the num.
The code for the translator is arranged into seven modules, each stored in a

separate file. Execution begins in the module main.c, which consists of a call to init() to initialize the symbol table, followed by a call to parse() to do the translation.

    start  →  list eof

    list   →  expr ; list
           |  ε

    expr   →  expr + term      { print('+') }
           |  expr - term      { print('-') }
           |  term

    term   →  term * factor    { print('*') }
           |  term / factor    { print('/') }
           |  term div factor  { print('DIV') }
           |  term mod factor  { print('MOD') }
           |  factor

    factor →  ( expr )
           |  id       { print(id.lexeme) }
           |  num      { print(num.value) }

Fig. 2.35. Syntax-directed translation scheme for the infix-to-postfix translator.

The Lexical Analysis Module lexer.c


The lexical analyzer is a routine called lexan( ) that is called by the parser to
find tokens. Implemented from the pseudo-code in Fig. 2.30, the routine
reads the input one character at a time and returns to the parser the token it

found. The value of the attribute associated with the token is assigned to a
global variable tokenval.
The following tokens are expected by the parser:

+ - * / DIV MOD ( ) ID NUM DONE

Here ID represents an identifier, NUM a number, and DONE the end-of-file


character. White space is silently stripped out by the lexical analyzer. The
table in Fig. 2.37 shows the token and attribute value produced by the lexical
analyzer for each source language lexeme.

    Lexeme                                    Token             Attribute Value

    sequence of digits                        NUM               numeric value of the sequence
    div                                       DIV
    mod                                       MOD
    other sequence of letters and digits
      beginning with a letter                 ID                index into symtable
    end-of-file character                     DONE
    any other character                       that character    NONE

Fig. 2.37. Lexemes, tokens, and attribute values.

    start  →  list eof

    list   →  expr ; list
           |  ε

    expr   →  term moreterms

    moreterms   →  + term    { print('+') }    moreterms
                |  - term    { print('-') }    moreterms
                |  ε

    term   →  factor morefactors

    morefactors  →  * factor    { print('*') }    morefactors
                 |  / factor    { print('/') }    morefactors
                 |  div factor  { print('DIV') }  morefactors
                 |  mod factor  { print('MOD') }  morefactors
                 |  ε

    factor →  ( expr )
           |  id       { print(id.lexeme) }
           |  num      { print(num.value) }

Fig. 2.38. Syntax-directed translation scheme after eliminating left-recursion.

The Emitter Module emitter.c


The emitter module consists of a single function emit(t,tval) that gen-
erates the output for token t with attribute value tval.

The Symbol-Table Modules symbol.c and init.c

The symbol-table module symbol.c implements the data structure shown in Fig. 2.29 of Section 2.7. The entries in the array symtable are pairs consisting of a pointer to the lexemes array and an integer denoting the token stored there. The operation insert(s,t) returns the symtable index for the lexeme s forming the token t. The function lookup(s) returns the index of the entry in symtable for the lexeme s or 0 if s is not there.
The module init.c is used to preload symtable with keywords. The lexeme and token representations for all the keywords are stored in the array keywords, which has the same type as the symtable array. The function init() goes sequentially through the keyword array, using the function insert to put the keywords in the symbol table. This arrangement allows us to change the representation of the tokens for keywords in a convenient way.

The Error Module error.c

The error module manages the error reporting, which is extremely primitive.
On encountering a syntax error, the compiler prints a message saying that an
error has occurred on the current input line and then halts. A better error
recovery technique might skip to the next semicolon and continue parsing; the

reader is encouraged to make this modification to the translator. More


sophisticated error recovery techniques are presented in Chapter 4.

Creating the Compiler

The code for the modules appears in seven files: lexer.c, parser.c, emitter.c, symbol.c, init.c, error.c, and main.c. The file main.c contains the main routine in the C program that calls init(), then parse(), and upon successful completion exit(0).
Under the UNIX operating system, the compiler can be created by executing the command

    cc lexer.c parser.c emitter.c symbol.c init.c error.c main.c

or by separately compiling the files, using

    cc -c filename.c

and linking the resulting filename.o files:

    cc lexer.o parser.o emitter.o symbol.o init.o error.o main.o

The cc command creates a file a.out that contains the translator. The translator can then be exercised by typing a.out followed by the expressions to be translated; e.g.,

2+3*5;
12 div 5 mod 2;

or whatever other expressions you like. Try it.

The Listing

Here is a listing of the C program implementing the translator. Shown is the


global header file global.h, followed by the seven source files. For clarity, the program has been written in an elementary C style.

    /****  global.h  ************************************************/

    #include <stdio.h>    /* load i/o routines */
    #include <ctype.h>    /* load character test routines */

    #define BSIZE  128    /* buffer size */
    #define NONE   -1
    #define EOS    '\0'

    #define NUM    256
    #define DIV    257
    #define MOD    258
    #define ID     259
    #define DONE   260

    int tokenval;         /* value of token attribute */
    int lineno;

    struct entry {        /* form of symbol table entry */
        char *lexptr;
        int  token;
    };

    struct entry symtable[];   /* symbol table */

    /****  lexer.c  *************************************************/

    #include "global.h"

    char lexbuf[BSIZE];
    int lineno = 1;
    int tokenval = NONE;

    int lexan()    /* lexical analyzer */
    {
        int t;
        while(1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;             /* strip out white space */
            else if (t == '\n')
                lineno = lineno + 1;
            else if (isdigit(t)) {      /* t is a digit */
                ungetc(t, stdin);
                scanf("%d", &tokenval);
                return NUM;
            }
            else if (isalpha(t)) {      /* t is a letter */
                int p, b = 0;
                while (isalnum(t)) {    /* t is alphanumeric */
                    lexbuf[b] = t;
                    t = getchar();
                    b = b + 1;
                    if (b >= BSIZE)
                        error("compiler error");
                }
                lexbuf[b] = EOS;
                if (t != EOF)
                    ungetc(t, stdin);
                p = lookup(lexbuf);
                if (p == 0)
                    p = insert(lexbuf, ID);
                tokenval = p;
                return symtable[p].token;
            }
            else if (t == EOF)
                return DONE;
            else {
                tokenval = NONE;
                return t;
            }
        }
    }

    /****  parser.c  ************************************************/

    #include "global.h"

    int lookahead;

    parse()    /* parses and translates expression list */
    {
        lookahead = lexan();
        while (lookahead != DONE) {
            expr(); match(';');
        }
    }

    expr()
    {
        int t;
        term();
        while(1)
            switch (lookahead) {
            case '+': case '-':
                t = lookahead;
                match(lookahead); term(); emit(t, NONE);
                continue;
            default:
                return;
            }
    }

    term()
    {
        int t;
        factor();
        while(1)
            switch (lookahead) {
            case '*': case '/': case DIV: case MOD:
                t = lookahead;
                match(lookahead); factor(); emit(t, NONE);
                continue;
            default:
                return;
            }
    }

    factor()
    {
        switch(lookahead) {
        case '(':
            match('('); expr(); match(')'); break;
        case NUM:
            emit(NUM, tokenval); match(NUM); break;
        case ID:
            emit(ID, tokenval); match(ID); break;
        default:
            error("syntax error");
        }
    }

    match(t)
        int t;
    {
        if (lookahead == t)
            lookahead = lexan();
        else error("syntax error");
    }

    /****  emitter.c  ***********************************************/

    #include "global.h"

    emit(t, tval)    /* generates output */
        int t, tval;
    {
        switch(t) {
        case '+': case '-': case '*': case '/':
            printf("%c\n", t); break;
        case DIV:
            printf("DIV\n"); break;
        case MOD:
            printf("MOD\n"); break;
        case NUM:
            printf("%d\n", tval); break;
        case ID:
            printf("%s\n", symtable[tval].lexptr); break;
        default:
            printf("token %d, tokenval %d\n", t, tval);
        }
    }

    /****  symbol.c  ************************************************/

    #include "global.h"

    #define STRMAX 999    /* size of lexemes array */
    #define SYMMAX 100    /* size of symtable */

    char lexemes[STRMAX];
    int lastchar = -1;    /* last used position in lexemes */
    struct entry symtable[SYMMAX];
    int lastentry = 0;    /* last used position in symtable */

    int lookup(s)         /* returns position of entry for s */
        char s[];
    {
        int p;
        for (p = lastentry; p > 0; p = p - 1)
            if (strcmp(symtable[p].lexptr, s) == 0)
                return p;
        return 0;
    }

    int insert(s, tok)    /* returns position of entry for s */
        char s[];
        int tok;
    {
        int len;
        len = strlen(s);  /* strlen computes length of s */
        if (lastentry + 1 >= SYMMAX)
            error("symbol table full");
        if (lastchar + len + 1 >= STRMAX)
            error("lexemes array full");
        lastentry = lastentry + 1;
        symtable[lastentry].token = tok;
        symtable[lastentry].lexptr = &lexemes[lastchar + 1];
        lastchar = lastchar + len + 1;
        strcpy(symtable[lastentry].lexptr, s);
        return lastentry;
    }

    /****  init.c  **************************************************/

    #include "global.h"

    struct entry keywords[] = {
        "div", DIV,
        "mod", MOD,
        0,     0
    };

    init()    /* loads keywords into symtable */
    {
        struct entry *p;
        for (p = keywords; p->token; p++)
            insert(p->lexptr, p->token);
    }

    /****  error.c  *************************************************/

    #include "global.h"

    error(m)    /* generates all error messages */
        char *m;
    {
        fprintf(stderr, "line %d: %s\n", lineno, m);
        exit(1);    /* unsuccessful termination */
    }

    /****  main.c  **************************************************/

    #include "global.h"

    main()
    {
        init();
        parse();
        exit(0);    /* successful termination */
    }

EXERCISES

2.1 Consider the context-free grammar

    S  →  S S +  |  S S *  |  a
a) Show how the string aa+a* can be generated by this grammar.
b) Construct a parse tree for this string.
c) What language is generated by this grammar? Justify your
answer.

2.2 What language is generated by the following grammars? In each case


justify your answer.
a) S → 0 S 1 | 0 1
b) S → + S S | - S S | a
c) S → S ( S ) S | ε
d) S → a S b S | b S a S | ε
e) S → a | S + S | S S | S * | ( S )

2.3 Which of the grammars in Exercise 2.2 are ambiguous?

2.4 Construct unambiguous context-free grammars for each of the follow-


ing languages. In each case show that your grammar is correct.

a) Arithmetic expressions in postfix notation.


b) Left-associative lists of identifiers separated by commas.
c) Right-associative lists of identifiers separated by commas.
d) Arithmetic expressions of integers and identifiers with the four
binary operators +, -, *, /.
e) Add unary plus and minus to the arithmetic operators of (d).

*2.5 a) Show that all binary strings generated by the following grammar
have values divisible by 3. Hint. Use induction on the number of
nodes in a parse tree.

    num  →  11  |  1001  |  num 0  |  num num
b) Does the grammar generate all binary strings with values divisible
by 3?

2.6 Construct a context-free grammar for roman numerals.

2.7 Construct a syntax-directed translation scheme that translates arith-


metic expressions from infix notation into prefix notation in which an operator appears before its operands; e.g., -xy is the prefix notation for x-y. Give annotated parse trees for the inputs 9-5+2 and 9-5*2.

2.8 Construct a syntax-directed translation scheme that translates arith-


metic expressions from postfix notation into infix notation. Give
annotated parse trees for the inputs 95-2* and 952*-.
2.9 Construct a syntax-directed translation scheme that translates integers
into roman numerals.

2.10 Construct a syntax-directed translation scheme that translates roman


numerals into integers.

2.11 Construct recursive-descent parsers for the grammars in Exercise 2.2


(a), (b), and (c).

2.12 Construct a syntax-directed translator that verifies that the


parentheses in an input string are properly balanced.

2.13 The following rules define the translation of an English word into pig
Latin:
a) If the word begins with a nonempty string of consonants, move the
initial consonant string to the back of the word and add the suffix
AY; e.g.,pig becomes igpay.
b) If the word begins with a vowel, add the suffix YAY; e.g., owl becomes owlyay.
c) U following a Q is a consonant.
d) Y at the beginning of a word is a vowel if it is not followed by a
vowel.

e) One-letter words are not changed.

Construct a syntax-directed translation scheme for pig Latin.

2.14 In the programming language C the for-statement has the form:

    for ( expr1 ; expr2 ; expr3 ) stmt

The first expression is executed before the loop; it is typically used for
initializing the loop index. The second expression is a test made
before each iteration of the loop; the loop is exited if the expression
becomes 0. The loop itself consists of the statement { stmt expr3 ; }.

The third expression is executed at the end of each iteration; it is typi-


cally used to increment the loop index. The meaning of the for-
statement is similar to

    expr1 ;  while ( expr2 ) { stmt expr3 ; }

Construct a syntax-directed translation scheme to translate C for-


statements into stack-machine code.

*2.15 Consider the following for-statement:

    for i := 1 step 10 - j until 10 * j do j := j + 1

Three semantic definitions can be given for this statement. One possible meaning is that the limit 10 * j and increment 10 - j are to be evaluated once before the loop, as in PL/I. For example, if j = 5 before the loop, we would run through the loop ten times and exit. A second, completely different, meaning would ensue if we are required to evaluate the limit and increment every time through the loop. For example, if j = 5 before the loop, the loop would never terminate. A third meaning is given by languages such as Algol. When the increment is negative, the test made for termination of the loop is i < 10 * j, rather than i > 10 * j. For each of these three semantic
definitions construct a syntax-directed translation scheme to translate
these for-loops into stack-machine code.

2.16 Consider the following grammar fragment for if-then- and if-then-
else-statements:

    stmt  →  if expr then stmt
          |  if expr then stmt else stmt
          |  other

where other stands for the other statements in the language.


a) Show that this grammar is ambiguous.
b) Construct an equivalent unambiguous grammar that associates
each else with the closest previous unmatched then.

c) Construct a syntax-directed translation scheme based on this gram-


mar to translate conditional statements into stack machine code.
2.17 Construct a syntax-directed translation scheme that translates arith-
metic expressions in infix notation into arithmetic expressions in infix
notation having no redundant parentheses. Show the annotated parse
tree for the input (((1 + 2) * (3 4)) + 5).

PROGRAMMING EXERCISES
P2.1 Implement a translator from integers to roman numerals based on the
syntax-directed translation scheme developed in Exercise 2.9.

P2.2 Modify the translator in Section 2.9 to produce as output code for the
abstract stack machine of Section 2.8.

P2.3 Modify the error recovery module of the translator in Section 2.9 to
skip to the next input expression on encountering an error.

P2.4 Extend the translator in Section 2.9 to handle all Pascal expressions.

P2.5 Extend the compiler of Section 2.9 to translate into stack-machine


code statements generated by the following grammar:

stmt
BIBLIOGRAPHIC NOTES

Context-free grammars were introduced by Chomsky [1956] as part of a study of natural languages. Their use in specifying the syntax of program-


ming languages arose independently. While working with a draft of Algol 60, John Backus "hastily adapted [Emil Post's productions] to that use" (Wexelblat [1981, p. 162]). The resulting notation was a variant of context-free grammars. The scholar Panini devised an equivalent syntactic notation to specify the rules of Sanskrit grammar between 400 B.C. and 200 B.C. (Ingerman [1967]).
The proposal that BNF, which began as an abbreviation of Backus Normal Form, be read as Backus-Naur Form, to recognize Naur's contributions as editor of the Algol 60 report (Naur [1963]), is contained in a letter by Knuth [1964].
Syntax-directed definitions are a form of inductive definitions in which the induction is on the syntactic structure. As such they have long been used informally in mathematics. Their application to programming languages came with the use of a grammar to structure the Algol 60 report. Shortly thereafter, Irons [1961] constructed a syntax-directed compiler.
Recursive-descent parsing has been used since the early 1960's. Bauer [1976] attributes the method to Lucas [1961]. Hoare [1962b, p. 128] describes an Algol compiler organized as "a set of procedures, each of which is capable of processing one of the syntactic units of the Algol 60 report." Foster [1968] discusses the elimination of left recursion from productions containing semantic actions that do not affect attribute values.
McCarthy [1963] advocated that the translation of a language be based on abstract syntax. In the same paper McCarthy [1963, p. 24] left "the reader to convince himself" that a tail-recursive formulation of the factorial function is equivalent to an iterative program.


The benefits of partitioning a compiler into a front end and a back end were explored in a committee report by Strong et al. [1958]. The report coined the name UNCOL (from universal computer oriented language) for a universal intermediate language. The concept has remained an ideal.
A good way to learn about implementation techniques is to read the code of existing compilers. Unfortunately, code is not often published. Randell and Russell [1964] give a comprehensive account of an early Algol compiler. Compiler code may also be seen in McKeeman, Horning, and Wortman [1970]. Barron [1981] is a collection of papers on Pascal implementation, including implementation notes distributed with the Pascal P compiler (Nori et al. [1981]), code generation details (Ammann [1977]), and the code for an implementation of Pascal S, a Pascal subset designed by Wirth [1981] for student use. Knuth [1985] gives an unusually clear and detailed description of the TeX translator.
Kernighan and Pike [1984] describe in detail how to build a desk calculator program around a syntax-directed translation scheme using the compiler-construction tools available on the UNIX operating system. Equation (2.17) is from Tantzen [1963].
CHAPTER 3

Lexical Analysis

This chapter deals with techniques for specifying and implementing lexical
analyzers. A simple way to build a lexical analyzer is to construct a diagram
that illustrates the structure of the tokens of the source language, and then to
hand-translate the diagram into a program for finding tokens. Efficient lexi-
cal analyzers can be produced in this manner.
The techniques used to implement lexical analyzers can also be applied to
other areas such as query languages and information retrieval systems. In
each application, the underlying problem is the specification and design of
programs that execute actions triggered by patterns in strings. Since pattern-
directed programming is widely useful, we introduce a pattern-action language
called Lex for specifying lexical analyzers. In this language, patterns are
specified by regular expressions, and a compiler for Lex can generate an effi-
cient finite-automaton recognizer for the regular expressions.
Several other languages use regular expressions to describe patterns. For
example, the pattern-scanning language AWK uses regular expressions to
select input lines for processing and the UNIX system shell allows a user to
refer to a set of file names by writing a regular expression. The UNIX com-
mand rm *.o, for instance, removes all files with names ending in ".o".'
A software tool that automates the construction of lexical analyzers allows
people with different backgrounds to use pattern matching in their own appli-
cation areas. For example, Jarvis [1976] used a lexical-analyzer generator to
create a program that recognizes imperfections in printed circuit boards. The
circuits are digitally scanned and converted into "strings" of line segments at
different angles. The "lexical analyzer" looked for patterns corresponding to
imperfections in the string of line segments. A major advantage of a lexical-
analyzer generator is that it can utilize the best-known pattern-matching algo-
rithms and thereby create efficient lexical analyzers for people who are not
experts in pattern-matching techniques.

¹ The expression *.o is a variant of the usual notation for regular expressions. Exercises 3.10 and 3.14 mention some commonly used variants of regular expression notations.

3.1 THE ROLE OF THE LEXICAL ANALYZER


The lexical analyzer is the first phase of a compiler. Its main task is to read

the input characters and produce as output a sequence of tokens that the
parser uses for syntax analysis. This interaction, summarized schematically in
Fig. 3.1, is commonly implemented by making the lexical analyzer be a sub-
routine or a coroutine of the parser. Upon receiving a "get next token" com-
mand from the parser, the lexical analyzer reads input characters until it can
identify the next token.
There are several reasons for separating the analysis phase of compiling into lexical analysis and parsing.

1. Simpler design is perhaps the most important consideration. The separation of lexical analysis from syntax analysis often allows us to simplify one or the other of these phases. For example, a parser embodying the
conventions for comments and white space is significantly more complex
than one that can assume comments and white space have already been
removed by a lexical analyzer. If we are designing a new language,
separating the lexical and syntactic conventions can lead to a cleaner
overall language design.

2. Compiler efficiency is improved. A separate lexical analyzer allows us to


construct a specialized and potentially more efficient processor for the
task. A large amount of time is spent reading the source program and
partitioning it into tokens. Specialized buffering techniques for reading
input characters and processing tokens can significantly speed up the per-
formance of a compiler.

3. Compiler portability is enhanced. Input alphabet peculiarities and other


device-specific anomalies can be restricted to the lexical analyzer. The
representation of special or non-standard symbols, such as ↑ in Pascal, can be isolated in the lexical analyzer.

Specialized tools have been designed to help automate the construction of


lexical analyzers and parsers when they are separated. We shall see several
examples of such tools in this book.

Tokens, Patterns, Lexemes

When talking about lexical analysis, we use the terms "token," "pattern," and
"lexeme" with specific meanings. Examples of their use are shown in Fig.
3.2. In general, there is a set of strings in the input for which the same token
is produced as output. This set of strings is described by a rule called a pattern associated with the token. The pattern is said to match each string in the set. A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, in the Pascal statement

const pi = 3.1416;
the substring pi is a lexeme for the token "identifier."

    (Fig. 3.2 lists example tokens such as const, relation, id, and num, together with sample lexemes and informal descriptions of their patterns.)

We treat tokens as terminal symbols in the grammar for the source


language, using boldface names to represent tokens. The lexemes matched by
the pattern for the token represent strings of characters in the source program
that can be treated together as a lexical unit.
In most programming languages, the following constructs are treated as
tokens: keywords, operators, identifiers, constants, literal strings, and punc-
tuation symbols such as parentheses, commas, and semicolons. In the exam-
ple above, when the character sequence pi appears in the source program, a
token representing an identifier is returned to the parser. The returning of a
token is often implemented by passing an integer corresponding to the token.
It is this integer that is referred to in Fig. 3.2 as boldface id.
A pattern is a rule describing the set of lexemes that can represent a partic-
ular token in source programs. The pattern for the token const in Fig. 3.2 is

just the single string const that spells out the keyword. The pattern for the
token relation is the set of all six Pascal relational operators. To describe pre-
cisely the patterns for more complex tokens like id (for identifier) and num
(for number) we shall use the regular-expression notation developed in Section
3.3.
Certain language conventions impact the difficulty of lexical analysis.
Languages such as Fortran require certain constructs in fixed positions on the
input line. Thus the alignment of a lexeme may be important in determining
the correctness of a source program. The trend in modern language design is
toward free-format input, allowing constructs to be positioned anywhere on
the input line, so this aspect of lexical analysis is becoming less important.
The treatment of blanks varies greatly from language to language. In some
languages, such as Fortran or Algol 68, blanks are not significant except in

literal strings. They can be added at will to improve the readability of a pro-
gram. The conventions regarding blanks can greatly complicate the task of
identifying tokens.
A popular example that illustrates the potential difficulty of recognizing
tokens is the DO statement of Fortran. In the statement

DO 5 I = 1.25

we cannot tell until we have seen the decimal point that DO is not a keyword,
but rather part of the identifier DO5I. On the other hand, in the statement

DO 5 I = 1,25
we have seven tokens, corresponding to the keyword DO, the statement label
5, the identifier I, the operator =, the constant 1, the comma, and the con-
stant 25. Here, we cannot be sure until we have seen the comma that DO is a
keyword. To alleviate this uncertainty, Fortran 77 allows an optional comma
between the label and index of the DO statement. The use of this comma is
encouraged because it helps make the DO statement clearer and more read-
able.
In many languages, certain strings are reserved; i.e., their meaning is

predefined and cannot be changed by the user. If keywords are not reserved,
then the lexical analyzer must distinguish between a keyword and a user-
defined identifier. In PL/I, keywords are not reserved; thus, the rules for dis-
tinguishing keywords from identifiers are quite complicated as the following
PL/I statement illustrates:

IF THEN THEN THEN = ELSE; ELSE ELSE = THEN;

Attributes for Tokens

When more than one pattern matches a lexeme, the lexical analyzer must pro-
vide additional information about the particular lexeme that matched to the
subsequent phases of the compiler. For example, the pattern num matches
both the strings 0 and 1, but it is essential for the code generator to know
what string was actually matched.
The lexical analyzer collects information about tokens into their associated
attributes. The tokens influence parsing decisions; the attributes influence the
translation of tokens. As a practical matter, a token has usually only a single
attribute — a pointer to the symbol-table entry in which the information about
the token is kept; the pointer becomes the attribute for the token. For diag-
nostic purposes, we may be interested in both the lexeme for an identifier and
the line number on which it was first seen. Both these items of information
can be stored in the symbol-table entry for the identifier.

Example 3.1. The tokens and associated attribute-values for the Fortran
statement

    E = M * C ** 2

are written below as a sequence of pairs:

<id, pointer to symbol-table entry for E>


<assign_op, >
<id, pointer to symbol-table entry for M>
<mult_op, >
<id, pointer to symbol-table entry for C>
<exp_op, >
<num, integer value 2>
Note that in certain pairs there is no need for an attribute value; the first com-
ponent is sufficient to identify the lexeme. In this small example, the token
num has been given an integer-valued attribute. The compiler may store the
character string that forms a number in a symbol table and let the attribute of
token num be a pointer to the table entry.

Lexical Errors

Few errors are discernible at the lexical level alone, because a lexical analyzer has a very localized view of a source program. If the string fi is encountered in a C program for the first time in the context

    fi ( a == f(x) ) ...

a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or


an undeclared function identifier. Since fi is a valid identifier, the lexical
analyzer must return the token for an identifier and let some other phase of
the compiler handle any error.
But, suppose a situation does arise in which the lexical analyzer is unable to
proceed because none of the patterns for tokens matches a prefix of the
remaining input. Perhaps the simplest recovery strategy is "panic mode"
recovery. We delete successive characters from the remaining input until the
lexical analyzer can find a well-formed token. This recovery technique may
occasionally confuse the parser, but in an interactive computing environment it

may be quite adequate.


Other possible error-recovery actions are:

1. deleting an extraneous character


2. inserting a missing character
3. replacing an incorrect character by a correct character
4. transposing two adjacent characters.

Error transformations like these may be tried in an attempt to repair the


input. The simplest such strategy is to see whether a prefix of the remaining
input can be transformed into a valid lexeme by just a single error transforma-
tion. This strategy assumes most lexical errors are the result of a single error
transformation, an assumption usually, but not always, borne out in practice.
One way of finding the errors in a program is to compute the minimum
number of error transformations required to transform the erroneous program
into one that is syntactically well-formed. We say that the erroneous program
has k errors if the shortest sequence of error transformations that will map it

into some valid program has length k. Minimum-distance error correction is a

convenient theoretical yardstick, but it is not generally used in practice

because it is too costly to implement. However, a few experimental compilers


have used the minimum-distance criterion to make local corrections.

3.2 INPUT BUFFERING


This section covers some efficiency issues concerned with the buffering of
input. We first mention a two-buffer input scheme that is useful when look-

ahead on the input is necessary to identify tokens. Then we introduce some


useful techniques for speeding up the lexical analyzer, such as the use of "sen-
tinels" to mark the buffer end.

There are three general approaches to the implementation of a lexical


analyzer.

1. Use a lexical-analyzer generator, such as the Lex compiler discussed in

Section 3.5, to produce the lexical analyzer from a regular-expression-


based specification. In this case, the generator provides routines for read-
ing and buffering the input.

2. Write the lexical analyzer in a conventional systems-programming


language, using the I/O facilities of that language to read the input.

3. Write the lexical analyzer in assembly language and explicitly manage the
reading of input.

The three choices are listed in order of increasing difficulty for the imple-
mentor. Unfortunately, the harder-to-implement approaches often yield faster
lexical analyzers. Since the lexical analyzer is the only phase of the compiler
that reads the source program character-by-character, it is possible to spend a
considerable amount of time in the lexical analysis phase, even though the
later phases are conceptually more complex.
Thus, the speed of lexical
analysis is a concern in compiler design. While the bulk of the chapter is
devoted to the first approach, the design and use of an automatic generator,
we also consider techniques that are helpful in manual design. Section 3.4
discusses transition diagrams, which are a useful concept for the organization
of a hand-designed lexical analyzer.

Buffer Pairs

For many source languages, there are times when the lexical analyzer needs to
look ahead several characters beyond the lexeme for a pattern before a match
can be announced. The lexical analyzers in Chapter 2 used a function
ungetc to push lookahead characters back into the input stream. Because a
large amount of time can be consumed moving characters, specialized buffer-
ing techniques have been developed to reduce the amount of overhead
required to process an input character. Many buffering schemes can be used,
but since the techniques are somewhat dependent on system parameters, we
shall only outline the principles behind one class of schemes here.
We use a buffer divided into two N-character halves, as shown in Fig. 3.3.
Typically, N is the number of characters on one disk block, e.g., 1024 or
4096.

[Figure: a buffer in two N-character halves holding the input characters followed by eof, with the lexeme_beginning and forward pointers marking the current lexeme.]

Fig. 3.3. An input buffer in two halves.


We read N input characters into each half of the buffer with one system
read command, rather than invoking a read command for each input charac-
ter. If fewer than N characters remain in the input, then a special character
eof is read into the buffer after the input characters, as in Fig. 3.3. That is,
eof marks the end of the source file and is different from any input character.
Two pointers to the input buffer are maintained. The string of characters
between the two pointers is the current lexeme. Initially, both pointers point
to the first character of the next lexeme to be found. One, called the forward
pointer, scans ahead until a match for a pattern is found. Once the next lex-
eme is determined, the forward pointer is set to the character at its right end.
After the lexeme is processed, both pointers are set to the character immedi-
ately past the lexeme. With this scheme, comments and white space can be
treated as patterns that yield no token.
If the forward pointer is about to move past the halfway mark, the right
half is filled with N new input characters. If the forward pointer is about to
move past the right end of the buffer, the left half is filled with N new charac-
ters and the forward pointer wraps around to the beginning of the buffer.
This buffering scheme works quite well most of the time, but with it the
amount of lookahead is limited, and this limited lookahead may make it
impossible to recognize tokens in situations where the distance that the for-
ward pointer must travel is more than the length of the buffer. For example,
if we see

     DECLARE ( ARG1, ARG2, ... , ARGn )

in a PL/I program, we cannot determine whether DECLARE is a keyword or an
array name until we see the character that follows the right parenthesis. In
either case, the lexeme ends at the second E, but the amount of lookahead
needed is proportional to the number of arguments, which in principle is
unbounded.

if forward at end of first half then begin
    reload second half;
    forward := forward + 1
end
else if forward at end of second half then begin
    reload first half;
    move forward to beginning of first half
end
else forward := forward + 1

Fig. 3.4. Code to advance forward pointer.
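In C, the test-per-advance scheme of Fig. 3.4 might be realized as follows. This
is a sketch under assumed declarations: buf holds the two N-character halves
contiguously, forward is the scanning pointer of Fig. 3.3, and fill_buffer()
stands for the system read that loads one half; these names are illustrative, not
the text's own.

    /* Sketch of Fig. 3.4 in C (assumed declarations, not from the text). */
    #define N 1024                      /* characters per buffer half            */
    char buf[2*N];                      /* the two halves, stored contiguously   */
    char *forward;                      /* scanning pointer of Fig. 3.3          */
    void fill_buffer(char *half);       /* assumed: one system read into a half  */

    void advance_forward(void)
    {
        if (forward == &buf[N-1]) {          /* at end of first half  */
            fill_buffer(&buf[N]);            /* reload second half    */
            forward++;
        } else if (forward == &buf[2*N-1]) { /* at end of second half */
            fill_buffer(&buf[0]);            /* reload first half     */
            forward = &buf[0];               /* wrap around           */
        } else {
            forward++;
        }
    }

Note that two comparisons are made on every advance; the sentinel scheme
described next reduces this to a single test in the common case.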



Sentinels

If we use the scheme of Fig. 3.3 exactly as shown, we must check each time
we move the forward pointer that we have not moved off one half of the
buffer; if we do, then we must reload the other half. That is, our code for
advancing the forward pointer performs tests like those shown in Fig. 3.4.
Except at the ends of the buffer halves, the code in Fig. 3.4 requires two
tests for each advance of the forward pointer. We can reduce the two tests to
one if we extend each buffer half to hold a sentinel character at the end. The
sentinel is a special character that cannot be part of the source program. A
natural choice is eof; Fig. 3.5 shows the same buffer arrangement as Fig. 3.3,
with the sentinels added.

[Figure: the same buffer as Fig. 3.3, with an eof sentinel appended at the end of each half and the lexeme_beginning and forward pointers shown.]

Fig. 3.5. Sentinels at end of each buffer half.

With the arrangement of Fig. 3.5, we can use the code shown in Fig. 3.6 to
advance the forward pointer (and test for the end of the source file). Most of
the time the code performs only one test to see whether forward points to an
eof. Only when we reach the end of a buffer half or the end of the file do we
perform more tests. Since N input characters are encountered between eof's,
the average number of tests per input character is very close to 1.

forward := forward + 1;
if forward↑ = eof then begin
    if forward at end of first half then begin
        reload second half;
        forward := forward + 1
    end
    else if forward at end of second half then begin
        reload first half;
        move forward to beginning of first half
    end
    else /* eof within a buffer signifying end of input */
        terminate lexical analysis
end

Fig. 3.6. Lookahead code with sentinels.
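A C rendering of Fig. 3.6 is sketched below. The buffer layout differs from the
previous sketch in that each half is followed by one extra slot holding the
sentinel; the names and the choice of '\0' as the sentinel value are assumptions
made for illustration only.

    /* Sketch of Fig. 3.6 in C (assumed declarations, not from the text). */
    #define N 1024                      /* characters per buffer half                */
    #define EOF_CHAR '\0'               /* assumed sentinel; never occurs in source  */
    char buf[2*N + 2];                  /* two halves, each followed by a sentinel   */
    char *forward;                      /* scanning pointer of Fig. 3.5              */
    void fill_buffer(char *half);       /* assumed: one system read into a half      */

    int advance_with_sentinel(void)
    {
        forward++;
        if (*forward == EOF_CHAR) {
            if (forward == &buf[N]) {            /* sentinel ending the first half  */
                fill_buffer(&buf[N+1]);          /* reload second half              */
                forward++;                       /* step past the sentinel          */
            } else if (forward == &buf[2*N+1]) { /* sentinel ending the second half */
                fill_buffer(&buf[0]);            /* reload first half               */
                forward = &buf[0];               /* wrap to start of first half     */
            }
            /* otherwise eof lies within a half and marks the true end of input */
        }
        return *forward != EOF_CHAR;    /* zero means the end of input was reached */
    }

In the common case only the single comparison against EOF_CHAR is executed,
matching the analysis above.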



We also need to decide how to process the character scanned by the forward
pointer; does it mark the end of a token, does it represent progress in finding
a particular keyword, or what? One way to structure these tests is to use a
case statement, if the implementation language has one. The test

if forward↑ = eof

can then be implemented as one of the different cases.

3.3 SPECIFICATION OF TOKENS

Regular expressions are an important notation for specifying patterns. Each


pattern matches a set of strings, so regular expressions will serve as names for
sets of strings. Section 3.5 extends this notation into a pattern-directed
language for lexical analysis.

Strings and Languages

The term alphabet or character class denotes any finite set of symbols. Typi-
cal examples of symbols are letters and characters. The set {0,1} is the binary
alphabet. ASCII and EBCDIC are two examples of computer alphabets.
A string over some alphabet is a finite sequence of symbols drawn from that
alphabet. In language theory, the terms sentence and word are often used as
synonyms for the term "string." The length of a string s, usually written |s|,
is the number of occurrences of symbols in s. For example, banana is a


string of length six. The empty string, denoted e, is a special string of length
zero. Some common terms associated with parts of a string are summarized
in Fig. 3.7.

The term language denotes any set of strings over some fixed alphabet.
This definition is very broad. Abstract languages like 0, the empty set, or
{e}, the set containing only the empty string, are languages under this defini-
tion. So too are the set of all syntactically well-formed Pascal programs and
the set of all grammatically correct English sentences, although the latter two

sets are much more difficult to specify. Also note that this definition does not
ascribe any meaning to the strings in a language. Methods for ascribing
meanings to strings are discussed in Chapter 5.
If x and y are strings, then the concatenation of x and y, written xy, is the
string formed by appending y to x. For example, if x = dog and y = house,
then xy = doghouse. The empty string is the identity element under con-
catenation. That is, xe = ex = x.
If we think of concatenation as a "product", we can define string "exponen-
tiation" as follows. Define x^0 to be e, and for i>0 define x^i to be x^(i-1)x.
Since ex is x itself, x^1 = x. Then, x^2 = xx, x^3 = xxx, and so on.

[Fig. 3.7. Terms for parts of a string, such as prefix, suffix, substring, and subsequence, with their definitions.]

[Fig. 3.8. Definitions of operations on languages: union, concatenation, Kleene closure, and positive closure.]

3. Suppose r and s are regular expressions denoting the languages L(r) and
   L(s). Then,

   a) (r)|(s) is a regular expression denoting L(r) ∪ L(s).
   b) (r)(s) is a regular expression denoting L(r)L(s).
   c) (r)* is a regular expression denoting (L(r))*.
   d) (r) is a regular expression denoting L(r).²

A language denoted by a regular expression is said to be a regular set.
The specification of a regular expression is an example of a recursive defini-
tion. Rules (1) and (2) form the basis of the definition; we use the term basic
symbol to refer to e or a symbol in Σ appearing in a regular expression. Rule
(3) provides the inductive step.


Unnecessary parentheses can be avoided in regular expressions if we adopt
the conventions that:

1. the unary operator * has the highest precedence and is left associative,
2. concatenation has the second highest precedence and is left associative,
3. | has the lowest precedence and is left associative.

Under these conventions, (a)|((b)*(c)) is equivalent to a|b*c. Both expres-
sions denote the set of strings that are either a single a or zero or more b's
followed by one c.

Example 3.3. Let Σ = {a, b}.

1. The regular expression a|b denotes the set {a, b}.

2. The regular expression (a|b)(a|b) denotes {aa, ab, ba, bb}, the set of all
   strings of a's and b's of length two. Another regular expression for this
   same set is aa | ab | ba | bb.

3. The regular expression a* denotes the set of all strings of zero or more
   a's, i.e., {e, a, aa, aaa, ... }.

4. The regular expression (a|b)* denotes the set of all strings containing
   zero or more instances of an a or b, that is, the set of all strings of a's
   and b's. Another regular expression for this set is (a*b*)*.

5. The regular expression a | a*b denotes the set containing the string a and
   all strings consisting of zero or more a's followed by a b.

If two regular expressions r and s denote the same language, we say r and s
are equivalent and write r = s. For example, (a|b) = (b|a).

There are a number of algebraic laws obeyed by regular expressions and


these can be used to manipulate regular expressions into equivalent forms.
Figure 3.9 shows some algebraic laws that hold for regular expressions r, s,

and t.

² This rule says that extra pairs of parentheses may be placed around regular expressions if we
desire.

[Fig. 3.9. Algebraic laws that hold for regular expressions r, s, and t.]

6.336E4, or 1.894E-4. The following regular definition provides a precise
specification for this class of strings:

     digit → 0 | 1 | ··· | 9
     digits → digit digit*
     optional-fraction → . digits | e
     optional-exponent → ( E ( + | - | e ) digits ) | e
     num → digits optional-fraction optional-exponent

This definition says that an optional-fraction is either a decimal point fol-


lowed by one or more digits, or it is missing (the empty string). An
optional-exponent, if it is not missing, is an E followed by an optional + or -
sign, followed by one or more digits. Note that at least one digit must follow
the period, so num does not match 1 . but it does match 1.0.

Notational Shorthands

Certain constructs occur so frequently in regular expressions that it is con-
venient to introduce notational shorthands for them.

1. One or more instances. The unary postfix operator + means "one or
   more instances of." If r is a regular expression that denotes the language
   L(r), then (r)+ is a regular expression that denotes the language
   (L(r))+. Thus, the regular expression a+ denotes the set of all strings of
   one or more a's. The operator + has the same precedence and associa-
   tivity as the operator *. The two algebraic identities r* = r+ | e and
   r+ = rr* relate the Kleene and positive closure operators.

2. Zero or one instance. The unary postfix operator ? means "zero or one
   instance of." The notation r? is a shorthand for r|e. If r is a regular
   expression, then (r)? is a regular expression that denotes the language
   L(r) ∪ {e}. For example, using the + and ? operators, we can rewrite
   the regular definition for num in Example 3.5 as

        digit → 0 | 1 | ··· | 9
        digits → digit+
        optional-fraction → ( . digits )?
        optional-exponent → ( E ( + | - )? digits )?
        num → digits optional-fraction optional-exponent

3. Character classes. The notation [abc], where a, b, and c are alphabet
   symbols, denotes the regular expression a | b | c. An abbreviated char-
   acter class such as [a-z] denotes the regular expression a | b | ··· | z.
   Using character classes, we can describe identifiers as being strings gen-
   erated by the regular expression

        [A-Za-z][A-Za-z0-9]*

Nonregular Sets

Some languages cannot be described by any regular expression. To illustrate

the limits of the descriptive power of regular expressions, here we give exam-
ples of programming language constructs that cannot be described by regular
expressions. Proofs of these assertions can be found in the references.
Regular expressions cannot be used to describe balanced or nested con-
structs. For example, the set of all strings of balanced parentheses cannot be
described by a regular expression. On the other hand, this set can be speci-
fied by a context-free grammar.
Repeating strings cannot be described by regular expressions. The set

{ wcw | w is a string of a's and b's }

cannot be denoted by any regular expression, nor can it be described by a


context-free grammar.
Regular expressions can be used to denote only a fixed number of repeti-
tions or an unspecified number of repetitions of a given construct. Two arbi-
trary numbers cannot be compared to see whether they are the same. Thus,
we cannot describe Hollerith strings of the form nHa1a2···an from early
versions of Fortran with a regular expression, because the number of charac-
ters following H must match the decimal number before H.

3.4 RECOGNITION OF TOKENS


In the previous section, we considered the problem of how to specify tokens.

In this section, we address the question of how to recognize them.


Throughout this section, we use the language generated by the following
grammar as a running example.

Example 3.6. Consider the following grammar fragment:

     stmt → if expr then stmt
          | if expr then stmt else stmt
          | e
     expr → term relop term
          | term
     term → id
          | num

where the terminals if, then, else, relop, id, and num generate sets of strings
given by the following regular definitions:

     if → if
     then → then
     else → else
     relop → < | <= | = | <> | > | >=
     id → letter ( letter | digit )*
     num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

where letter and digit are as defined previously.


For this language fragment the lexical analyzer will recognize the keywords
if, then, else, as well as the lexemes denoted by relop, id, and num. To
simplify matters, we assume keywords are reserved; that is, they cannot be
used as identifiers. As in Example 3.5, num represents the unsigned integer
and real numbers of Pascal.
In addition, we assume lexemes are separated by white space, consisting of
nonnull sequences of blanks, tabs, and newlines. Our lexical analyzer will
strip out white space. It will do so by comparing a string against the regular
definition ws, below.

     delim → blank | tab | newline
     ws → delim+

If a match for ws is found, the lexical analyzer does not return a token to the
parser. Rather, it proceeds to find a token following the white space and
returns that to the parser.
Our goal is to construct a lexical analyzer that will isolate the lexeme for
the next token in the input buffer and produce as output a pair consisting of
the appropriate token and attribute-value, using the translation table given in
Fig. 3.10. The attribute-values for the relational operators are given by the
symbolic constants LT, LE, EQ, NE, GT, GE.

[Fig. 3.10. Regular-expression patterns for tokens: a table giving, for each pattern (ws, if, then, else, id, num, and the relational operators), the token returned and its attribute-value.]

Transition Diagrams

Transition diagrams depict the actions that take place when a lexical analyzer is called by the
parser to get the next token, as suggested by Fig. 3.1. Suppose the input
buffer is as in Fig. 3.3 and the lexeme-beginning pointer points to the charac-
ter following the last lexeme found. We use a transition diagram to keep
track of information about characters that are seen as the forward pointer
scans the input. We do so by moving from position to position in the diagram
as characters are read.
Positions in a transition diagram are drawn as circles and are called states.
The states are connected by arrows, called edges. Edges leaving state s have
labels indicating the input characters that can next appear after the transition
diagram has reached state s. The label other refers to any character that is
not indicated by any of the other edges leaving s.

We assume the transition diagrams of this section are deterministic; that is,

no symbol can match the labels of two edges leaving one state. Starting in

Section 3.5, we shall relax this condition, making life much simpler for the
designer of the lexical analyzer and, with proper tools, no harder for the
implementor.
One state is labeled the start state; it is the initial state of the transition
diagram where control resides when we begin to recognize a token. Certain
states may have actions that are executed when the flow of control reaches
that state. On entering a state we read the next input character. If there is
an edge from the current state whose label matches this input character, we
then go to the state pointed to by the edge. Otherwise, we indicate failure.
Figure 3.11 shows a transition diagram for the patterns >= and >. The
transition diagram works as follows. Its start state is state 0. In state 0, we
read the next input character. The edge labeled > from state 0 is to be fol-
lowed to state 6 if this input character is >. Otherwise, we have failed to
recognize either > or >=.

[Transition diagram: from start state 0, an edge labeled > leads to state 6; from state 6, an edge labeled = leads to accepting state 7, and an edge labeled other leads to accepting state 8, which retracts one character.]

Fig. 3.11. Transition diagram for >=.

On reaching state 6 we read the next input character. The edge labeled =
from state 6 is to be followed to state 7 if this input character is an =. Other-
wise, the edge labeled other indicates that we are to go to state 8. The double
circleon state 7 indicates that it is an accepting state, a state in which the
token >= has been found.
Notice that the character > and another extra character are read as we fol-
low the sequence of edges from the start state to the accepting state 8. Since
the extra character is not a part of the relational operator >, we must retract

the forward pointer one character. We use a * to indicate states on which this

input retraction must take place.


In general, there may be several transition diagrams, each specifying a
group of tokens. If failure occurs while we are following one transition
diagram, then we retract the forward pointer to where it was in the start state
of this diagram, and activate the next transition diagram. Since the lexeme-
beginning and forward pointers marked the same position in the start state of
the diagram, the forward pointer is retracted to the position marked by the
lexeme-beginning pointer. If failure occurs in all transition diagrams, then a
lexical error has been detected and we invoke an error-recovery routine.

Example 3.7. A transition diagram for the token relop is shown in Fig. 3.12.
Notice that Fig. 3.11 is a part of this more complex transition diagram.

[Transition diagram for relop: from start state 0, the edge labeled < leads to state 1, the edge labeled = leads to accepting state 5 (return (relop, EQ)), and the edge labeled > leads to state 6. From state 1, = leads to accepting state 2 (return (relop, LE)), > leads to accepting state 3 (return (relop, NE)), and other leads to accepting state 4, marked * (return (relop, LT)). From state 6, = leads to accepting state 7 (return (relop, GE)) and other leads to accepting state 8, marked * (return (relop, GT)).]

Fig. 3.12. Transition diagram for relational operators.

Example 3.8. Since keywords are sequences of letters, they are exceptions to
the rule that a sequence of letters and digits starting with a letter
is an identi-

fier. Rather than encode the exceptions into a transition diagram, a useful
trick is to treat keywords as special identifiers, as in Section 2.7. When the
accepting state in Fig. 3.13 is reached, we execute some code to determine if

the lexeme leading to the accepting state is a keyword or an identifier.

[Transition diagram: from start state 9, an edge labeled letter leads to state 10, which has a self-loop labeled letter or digit; an edge labeled other leads from state 10 to accepting state 11, marked *, with the action return(gettoken(), install_id()).]

Fig. 3.13. Transition diagram for identifiers and keywords.



A simple technique for separating keywords from identifiers is to initialize

appropriately the symbol table in which information about identifiers is saved.


For the tokens of Fig. 3.10 we need to enter the strings if, then, and else
into the symbol table before any characters in the input are seen. We also
make a note in the symbol table of the token to be returned when one of these
strings is recognized. The return statement next to the accepting state in Fig.

3.13 uses gettokenO and instaU-id{) to obtain the token and attribute-value,
respectively, to be returned. The procedure install-idO has access to the
buffer, where the identifier lexeme has been located. The symbol table is

examined and if the lexeme is found there marked as a keyword, install^idi)


returns 0. If the lexeme is found and is a program variable. instalLidi)
returns a pointer to the symbol table entry. If the lexeme is not found in the
symbol table, it is installed as a variable and a pointer to the newly created
entry is returned.
The procedure gettoken() similarly looks for the lexeme in the symbol table.

If the lexeme is a keyword, the corresponding token is returned; otherwise,


the token id is returned.
Note that the transition diagram does not change if additional keywords are
to be recognized; we simply initialize the symbol table with the strings and
tokens of the additional keywords. □

The technique of placing keywords in the symbol table is almost essential if

the lexical analyzer is coded by hand. Without doing so, the number of states
in a lexical analyzer for a typical programming language is several hundred,
while using the trick, fewer than a hundred states will probably suffice.
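A sketch in C of this initialization and of the lookup performed by install_id
and gettoken follows. The symbol-table representation, the names lookup() and
insert(), and the token codes are assumptions chosen for illustration; only the
idea of pre-entering keywords comes from the text.

    /* Sketch of keyword handling via the symbol table (assumed helpers). */
    #include <stddef.h>

    #define ID   256        /* assumed token codes */
    #define IF   257
    #define THEN 258
    #define ELSE 259

    struct entry { char *lexeme; int token; };
    struct entry *lookup(char *lexeme);             /* assumed: find entry or NULL   */
    struct entry *insert(char *lexeme, int token);  /* assumed: add entry, return it */

    void init_keywords(void)          /* called once, before any input is seen */
    {
        insert("if", IF);
        insert("then", THEN);
        insert("else", ELSE);
    }

    int token_for(char *lexeme)       /* plays the role of gettoken() in Fig. 3.13 */
    {
        struct entry *e = lookup(lexeme);
        if (e == NULL)
            e = insert(lexeme, ID);   /* new identifier: install as a variable */
        return e->token;              /* a keyword token, or ID for an identifier */
    }

Adding a keyword then requires only one more call in init_keywords; the
transition diagram itself is unchanged.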

[Three transition diagrams: one with start state 12 recognizing digits fraction? exponent (accepting state 19, marked *); one with start state 20 recognizing digits fraction (accepting state 24, marked *); and one with start state 25 recognizing digits alone (accepting state 27, marked *). Each reads digits along self-loops labeled digit, the optional fraction and exponent parts follow edges labeled . , E, +, -, and digit, and each accepting state is reached on other with retraction of one character.]

Fig. 3.14. Transition diagrams for unsigned numbers in Pascal.

Example 3.9. A number of issues arise when we construct a recognizer for


unsigned numbers given by the regular definition

     num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

Note that the definition is of the form digits fraction? exponent? in which
fraction and exponent are optional.
The lexeme for a given token must be the longest possible. For example,
the lexical analyzer must not stop after seeing 12 or even 12.3 when the
input is 12.3E4. Starting at states 25, 20, and 12 in Fig. 3.14, accepting
states will be reached after 12, 12.3, and 12.3E4 are seen, respectively,
provided 12.3E4 is followed by a non-digit in the input. The transition
diagrams with start states 25, 20, and 12 are for digits, digits fraction, and
digits fraction? exponent, respectively, so the start states must be tried in the
reverse order 12, 20, 25.
The action when any of the accepting states 19, 24, or 27 is reached is to
call a procedure install-num that enters the lexeme into a table of numbers and
returns a pointer to the created entry. The lexical analyzer returns the token
num with this pointer as the lexical value. □

Information about the language that is not in the regular definitions of the
tokens can be used to pinpoint errors in the input. For example, on input
1.<x, we fail in states 14 and 22 in Fig. 3.14 with next input character <.
Rather than returning the number 1, we may wish to report an error and con-
tinue as if the input were 1.0<x. Such knowledge can also be used to sim-
plify the transition diagrams, because error-handling may be used to recover
from some situations that would otherwise lead to failure.
There are several ways in which the redundant matching in the transition
diagrams of Fig. 3.14 can be avoided. One approach is to rewrite the transi-
tion diagrams by combining them into one, a nontrivial task in general.
Another is to change the response to failure during the process of following a
diagram. An approach explored later in this chapter allows us to pass through
several accepting states; we revert back to the last accepting state that we
passed through when failure occurs.

Example 3.10. A sequence of transition diagrams for all tokens of Example


3.6 is obtained if we put together the transition diagrams of Fig. 3.12, 3.13,
and 3.14. Lower-numbered start states are to be attempted before higher-
numbered states.
The only remaining issue concerns white space. The treatment of ws,
representing white space, is different from that of the patterns discussed above
because nothing is returned to the parser when white space is found in the
input. A transition diagram recognizing ws by itself is

[Transition diagram: from a start state, an edge labeled delim leads to a state with a self-loop on delim; an edge labeled other leads to an accepting state, marked *, that retracts one character.]

Nothing is returned when the accepting state is reached; we merely go back to


the start state of the first transition diagram to look for another pattern.


Whenever possible, it is better to look for frequently occurring tokens


before less frequently occurring ones, because a transition diagram is reached
only after we fail on all earlier diagrams. Since white space is expected to
occur frequently, putting the transition diagram for white space near the
beginning should be an improvement over testing for white space at the end.

Implementing a Transition Diagram

A sequence of transition diagrams can be converted into a program to look for
the tokens specified by the diagrams. We adopt a systematic approach that
works for all transition diagrams and constructs programs whose size is pro-
portional to the number of states and edges in the diagrams.
Each state gets a segment of code. If there are edges leaving a state, then
its code reads a character and selects an edge to follow, if possible. A func-
tion nextchar( ) is used to read the next character from the input buffer,
advance the forward pointer at each call, and return the character read. If
there is an edge labeled by the character read, or labeled by a character class

containing the character read, then control is transferred to the code for the
state pointed to by that edge. If there is no such edge, and the current state is

not one that indicates a token has been found, then a routine fail() is

invoked to retract the forward pointer to the position of the beginning pointer
and to initiate a search for a token specified by the next transition diagram.
If there are no other transition diagrams to try, fail() calls an error-
recovery routine.
To return tokens we use a global variable lexical_value, which is
assigned the pointers returned by functions install_id() and
install_num() when an identifier or number, respectively, is found. The
token class is returned by the main procedure of the lexical analyzer, called
nexttoken().

We use a case statement to find the start state of the next transition
diagram. In the C implementation in Fig. 3.15, two variables state and
start keep track of the present state and the starting state of the current
transition diagram. The state numbers in the code are for the transition
diagrams of Figures 3.12-3.14.
Edges in transition diagrams are traced by repeatedly selecting the code
fragment for a state and executing the code fragment to determine the next
state as shown in Fig. 3.16. We show the code for state 0, as modified in
Example 3.10 to handle white space, and the code for two of the transition
diagrams from Fig. 3.13 and 3.14. Note that the C construct

while ( 1 ) stmt

repeats stmt "forever," i.e., until a return occurs.

* A more efficient implementation would use an in-line macro in place of the function
nextchar().


int state = 0, start = 0;
int lexical_value;
    /* to "return" second component of token */

int fail()
{
    forward = token_beginning;
    switch (start) {
    case 0:   start = 9;  break;
    case 9:   start = 12; break;
    case 12:  start = 20; break;
    case 20:  start = 25; break;
    case 25:  recover();  break;
    default:  break;      /* compiler error */
    }
    return start;
}

Fig. 3.15. C code to find next start state.

Since C does not allow both a token and an attribute-value to be returned,
install_id() and install_num() appropriately set some global variable
to the attribute-value corresponding to the table entry for the id or num in
question.
If the implementation language does not have a case statement, we can
create an array for each state s, indexed by characters. If state_s is such an
array, then state_s[c] is a pointer to a piece of code that must be executed
whenever the lookahead character is c. This code would normally end with a
goto to code for the next state. The array for state s is referred to as the
indirect transfer table for s.
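In C, an indirect transfer table can be approximated by an array of function
pointers indexed by the lookahead character, with each entry performing the
action for one edge and yielding the next state. The sketch below is an assumed
design for illustration; the names are not the text's own.

    /* Sketch of an indirect transfer table (assumed design, not from the text). */
    #include <limits.h>

    typedef int (*action)(int c);   /* does the work for one edge; returns next state */

    static int to_state_10(int c) { return 10; }   /* e.g., the edge on a letter */
    static int no_edge(int c)     { return -1; }   /* stands in for fail()       */

    static action state_9[UCHAR_MAX + 1];   /* one such table per state */

    static void init_state_9(void)
    {
        int c;
        for (c = 0; c <= UCHAR_MAX; c++)
            state_9[c] = no_edge;            /* default: no edge labeled c */
        for (c = 'a'; c <= 'z'; c++)
            state_9[c] = to_state_10;        /* edges labeled letter       */
        for (c = 'A'; c <= 'Z'; c++)
            state_9[c] = to_state_10;
        /* tables for the other states are filled in the same way */
    }

Dispatching then amounts to next = state_9[c](c): one indexed load and one
indirect call per character.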

3.5 A LANGUAGE FOR SPECIFYING LEXICAL ANALYZERS


Several tools have been built for constructing lexical analyzers from special-
purpose notations based on regular expressions. We have already seen the use
of regular expressions for specifying token patterns. Before we consider algo-
rithms for compiling regular expressions into pattern-matching programs, we
give an example of a tool that might use such an algorithm.
In this section, we describe a particular tool, called Lex, that has been
widely used to specify lexical analyzers for a variety of languages. We refer
to the tool as the Lex compiler, and to its input specification as the Lex
language. Discussion of an existing tool will allow us to show how the specifi-
cation of patterns using regular expressions can be combined with actions,
e.g., making entries into a symbol table, that a lexical analyzer may be
required to perform. Lex-like specifications can be used even if a Lex


token nexttoken()
{   while (1) {
        switch (state) {
        case 0:  c = nextchar();
                 /* c is lookahead character */
                 if (c == blank || c == tab || c == newline) {
                     state = 0;
                     lexeme_beginning++;
                     /* advance beginning of lexeme */
                 }
                 else if (c == '<') state = 1;
                 else if (c == '=') state = 5;
                 else if (c == '>') state = 6;
                 else state = fail();
                 break;
        . . .    /* cases 1-8 here */
        case 9:  c = nextchar();
                 if (isletter(c)) state = 10;
                 else state = fail();
                 break;
        case 10: c = nextchar();
                 if (isletter(c)) state = 10;
                 else if (isdigit(c)) state = 10;
                 else state = 11;
                 break;
        case 11: retract(1); install_id();
                 return ( gettoken() );
        . . .    /* cases 12-24 here */
        case 25: c = nextchar();
                 if (isdigit(c)) state = 26;
                 else state = fail();
                 break;
        case 26: c = nextchar();
                 if (isdigit(c)) state = 26;
                 else state = 27;
                 break;
        case 27: retract(1); install_num();
                 return ( NUM );
        }
    }
}

Fig. 3.16. C code for lexical analyzer.
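For concreteness, the cases elided as "cases 1-8" would trace the relop diagram
of Fig. 3.12 along the following lines. This is only an illustrative sketch in
the same style as Fig. 3.16; it assumes that the token code RELOP and the
attribute constants LT, LE, EQ, NE, GT, GE of Fig. 3.10 are available and that the
attribute is passed back through lexical_value.

    /* Sketch of the elided relop cases (illustrative, not the text's own code). */
    case 1:  c = nextchar();
             if (c == '=')      state = 2;
             else if (c == '>') state = 3;
             else               state = 4;
             break;
    case 2:  lexical_value = LE; return ( RELOP );
    case 3:  lexical_value = NE; return ( RELOP );
    case 4:  retract(1); lexical_value = LT; return ( RELOP );
    case 5:  lexical_value = EQ; return ( RELOP );
    case 6:  c = nextchar();
             if (c == '=') state = 7;
             else          state = 8;
             break;
    case 7:  lexical_value = GE; return ( RELOP );
    case 8:  retract(1); lexical_value = GT; return ( RELOP );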


c c


compiler is not available; the specifications can be manually transcribed into a

working program using the transition diagram techniques of the previous sec-
tion.
Lex is generally used in the manner depicted in Fig. 3.17. First, a specifi-
cation of a lexical analyzer is prepared by creating a program lex.l in the
Lex language. Then, lex.l is run through the Lex compiler to produce a C
program lex.yy.c. The program lex.yy.c consists of a tabular represen-
tation of a transition diagram constructed from the regular expressions of
lex.l, together with a standard routine that uses the table to recognize lex-
emes. The actions associated with regular expressions in lex.l are pieces of
C code and are carried over directly to lex.yy.c. Finally, lex.yy.c is run
through the C compiler to produce an object program a.out, which is the
lexical analyzer that transforms an input stream into a sequence of tokens.

[Diagram: the Lex source program lex.l is run through the Lex compiler to produce lex.yy.c; lex.yy.c is run through the C compiler to produce a.out; an input stream is run through a.out to produce a sequence of tokens.]

Fig. 3.17. Creating a lexical analyzer with Lex.

Lex Specifications

A Lex program consists of three parts:

declarations
%%
translation rules
%%
auxiliary procedures

The declarations section includes declarations of variables, manifest constants,


and regular definitions. (A manifest constant is an identifier that is declared
to represent a constant.) The regular definitions are statements similar to
those given in Section 3.3 and are used as components of the regular expres-
sions appearing in the translation rules.

The translation rules of a Lex program are statements of the form

     p1    { action1 }
     p2    { action2 }
       ...
     pn    { actionn }

where each pi is a regular expression and each actioni is a program fragment
describing what action the lexical analyzer should take when pattern pi
matches a lexeme. In Lex, the actions are written in C; in general, however,
they can be in any implementation language.

The third section holds whatever auxiliary procedures are needed by the
actions. Alternatively, these procedures can be compiled separately and
loaded with the lexical analyzer.
A lexical analyzer created by Lex behaves in concert with a parser in the
following manner. When activated by the parser, the lexical analyzer begins
reading its remaining input, one character at a time, until it has found the
longest prefix of the input that is matched by one of the regular expressions
pi. Then, it executes actioni. Typically, actioni will return control to the
parser. However, if it does not, then the lexical analyzer proceeds to find
more lexemes, until an action causes control to return to the parser. The
repeated search for lexemes until an explicit return allows the lexical analyzer
to process white space and comments conveniently.
The lexical analyzer returns a single quantity, the token, to the parser. To
pass an attribute value with information about the lexeme, we can set a global

variable called yylval.

Example 3.11. Figure 3.18 is a Lex program that recognizes the tokens of
Fig. 3.10 and returns the token found. A few observations about the code
will introduce us to many of the important features of Lex.
In the declarations section, we see (a place for) the declaration of certain
manifest constants used by the translation rules. These declarations are sur-
"^

rounded by the special brackets %{ and %}. Anything appearing between


these brackets is copied directly into the lexical analyzer lex.yy.c, and is
not treated as part of the regular definitions or the translation rules. Exactly
the same treatment is accorded the auxiliary procedures in the third section.
In Fig. 3.18, there are two procedures, install_id and install_num, that
are used by the translation rules; these procedures will be copied into
lex.yy.c verbatim.
Also included in the definitions section are some regular definitions. Each
such definition consists of a name and a regular expression denoted by that
name. For example, the first name defined is delim; it stands for the

* It is common for the program lex.yy.c to be used as a subroutine of a parser generated by
Yacc, a parser generator to be discussed in Chapter 4. In this case, the declaration of the manifest
constants would be provided by the parser, when it is compiled with the program lex.yy.c.


%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?

%%

{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = install_id(); return(ID);}
{number} {yylval = install_num(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}

%%

install_id() {
    /* procedure to install the lexeme, whose
       first character is pointed to by yytext and
       whose length is yyleng, into the symbol table
       and return a pointer thereto */
}

install_num() {
    /* similar procedure to install a lexeme that
       is a number */
}

Fig. 3.18. Lex program for the tokens of Fig. 3.10.



character class [ \t\n], that is, any of the three symbols blank, tab
(represented by \t), or newline (represented by \n). The second definition is
(represented by \t), or newline (represented by \n). The second definition is


of white space, denoted by the name ws. White space is any sequence of one
or more delimiter characters. Notice that the word delim must be sur-
rounded by braces in Lex, to distinguish it from the pattern consisting of the
five letters delim.
In the definition of letter, we see the use of a character class. The short-
hand [A-Za-z] means any of the capital letters A through Z or the lower-
case letters a through z. The fifth definition, of id, uses parentheses, which
are metasymbols in Lex, with their natural meaning as groupers. Similarly,
the vertical bar is a Lex metasymbol representing union.
In the last regular definition, of number, we observe a few more details.
We see ? used as a metasymbol, with its customary meaning of "zero or one
occurrences of." We also note the backslash used as an escape, to let a char-
acter that is a Lex metasymbol have its natural meaning. In particular, the
decimal point in the definition of number is expressed by \. because a dot by
itself represents the character class of all characters except the newline, in Lex
as in many UNIX system programs that deal with regular expressions. In the
character class [+\-], we placed a backslash before the minus sign because
the minus sign standing for itself could be confused with its use to denote a
range, as in [A-Z] .^

There is another way to cause characters to have their natural meaning,


even if they are metasymbols of Lex: surround them with quotes. We have
shown an example of this convention in the translation rules section, where
the six relational operators are surrounded by quotes.^
Now, let us consider the translation rules in the section following the first

%%. The first rule says that if we see ws, that is, any maximal sequence of
blanks, tabs, and newlines, we take no action. In particular, we do not return
to the parser. Recall that the structure of the lexical analyzer is such that it

keeps trying to recognize tokens, until the action associated with one found
causes a return.
The second rule says that if the letters if are seen, return the token IF,
which is a manifest constant representing some integer understood by the
parser to be the token if. The next two rules handle keywords then and
else similarly.
In the rule for id, we see two statements in the associated action. First, the
variable yylval is set to the value returned by procedure install_id; the
definition of that procedure is in the third section, yylval is a variable

^ Actually, Lex handles the character class [+-] correctly without the backslash, because the
minus sign appearing at the end cannot represent a range.

* We did so because < and > are Lex metasymbols; they surround the names of "states," enabling
Lex to change state when encountering certain tokens, such as comments or quoted strings, that
must be treated differently from the usual text. There is no need to surround the equal sign by
quotes, but neither is it forbidden.

whose definition appears in the Lex output lex.yy.c, and which is also

available to the parser. The purpose of yylval is to hold the lexical value
returned, since the second statement of the action, return (ID), can only
return a code for the token class.
We do not show the details of the code for install_id. However, we
may suppose that it looks in the symbol table for the lexeme matched by the
pattern id. Lex makes the lexeme available to routines appearing in the third
section through two variables yytext and yyleng. The variable yytext
corresponds to the variable that we have been calling lexeme-beginning, that
is, a pointer to the first character of the lexeme; yyleng is an integer telling
how long the lexeme is. For example, if install_id fails to find the identi-
fier in the symbol table, it might create a new entry for it. The yyleng char-
acters of the input, starting at yytext, might be copied into a character array
and delimited by an end-of-string marker as in Section 2.7. The new symbol-
table entry would point to the beginning of this copy.
Numbers are treated similarly by the next rule, and for the last six rules,
yylval is used to return a code for the particular relational operator found,
while the actual return value is the code for token relop in each case.
Suppose the lexical analyzer resulting from the program of Fig. 3.18 is

given an input consisting of two tabs, the letters if, and a blank. The two
tabs are the longest initial prefix of the input matched by a pattern, namely
the pattern ws. The action for ws is to do nothing, so the lexical analyzer
moves the lexeme-beginning pointer, yytext, to the i and begins to search
for another token.
The next lexeme to be matched is if. Note that the patterns if and {id}
both match this lexeme, and no pattern matches a longer string. Since the
pattern for keyword if precedes the pattern for identifiers in the list of Fig.
3.18, the conflict is resolved in favor of the keyword. In general, this
ambiguity-resolving strategy makes it easy to reserve keywords by listing them
ahead of the pattern for identifiers.
For another example, suppose <= are the first two characters read. While
the pattern < matches the first character, it is not the longest pattern matching
a prefix of the input. Thus Lex's strategy of selecting the longest prefix
matched by a pattern makes it easy to resolve the conflict between < and <=
in the expected manner - by choosing <= as the next token.

The Lookahead Operator

As we saw in Section 3.1, lexical analyzers for certain programming language


constructs need to look ahead beyond the end of a lexeme before they can
determine a token with certainty. Recall the example from Fortran of the pair
of statements

DO 5 I = 1.25
DO 5 I = 1,25

In Fortran, blanks are not significant outside of comments and Hollerith



strings, so suppose that all removable blanks are stripped before lexical
analysis begins. The above statements then appear to the lexical analyzer as

     DO5I=1.25
     DO5I=1,25
In the first statement, we cannot tell until we see the decimal point that the
string DO is part of the identifier DO5I. In the second statement, DO is a key-
word by itself.

In Lex, we can write a pattern of the form r1/r2, where r1 and r2 are reg-
ular expressions, meaning match a string in r1, but only if followed by a
string in r2. The regular expression r2 after the lookahead operator / indi-
cates the right context for a match; it is used only to restrict a match, not to
be part of the match. For example, a Lex specification that recognizes the
keyword DO in the context above is

     DO/({letter}|{digit})*=({letter}|{digit})*,


With this specification, the lexical analyzer will look ahead in its input buffer
for a sequence of letters and digits followed by an equal sign followed by
letters and digits followed by a comma to be sure that it did not have an
assignment statement. Then only the characters D and O, preceding the looka-
head operator /, would be part of the lexeme that was matched. After a suc-
cessful match, yytext points to the D and yyleng = 2. Note that this sim-
ple lookahead pattern allows DO to be recognized when followed by garbage,
like Z4 = 6Q, but it will never recognize DO that is part of an identifier.

Example 3.12. The lookahead operator can be used to cope with another dif-
ficult lexical analysis problem in Fortran: distinguishing keywords from identi-
fiers. For example, the input

     IF(I,J) = 3

is a perfectly good Fortran assignment statement, not a logical if-statement.
One way to specify the keyword IF using Lex is to define its possible right
contexts using the lookahead operator. The simple form of the logical if-
statement is
statement is

IF ( condition ) statement

Fortran 77 introduced another form of the logical if-statement:

IF ( condition ) THEN
then-block
ELSE
else-block
END IF
We note that every unlabeled Fortran statement begins with a letter and that
every right parenthesis used for subscripting or operand grouping must be fol-

lowed by an operator symbol such as =, +, or comma, another right



parenthesis, or the end of the statement. Such a right parenthesis cannot be


followed by a letter. In this situation, to confirm that IF is a keyword rather
than an array name, we scan forward looking for a right parenthesis followed
by a letter before seeing a newline character (we assume continuation cards
"cancel" the previous newline character). This pattern for the keyword IF
can be written as

IF / \( .* \) {letter}
The dot stands for "any character but newline" and the backslashes in front of
the parentheses tell Lex to treat them literally, not as metasymbols for group-
ing in regular expressions (see Exercise 3.10). □
Another way to attack the problem posed by if-statements in Fortran is,
after seeing IF(, to determine whether IF has been declared an array. We
scan for the full pattern indicated above only if it has been so declared. Such
tests make the automatic implementation of a lexical analyzer from a Lex
specification harder, and they may even cost time in the long run, since fre-
quent checks must be made by the program simulating a transition diagram to
determine whether any such tests must be made. It should be noted that tok-
enizing Fortran is such an irregular task that it is frequently easier to write an
ad hoc lexical analyzer for Fortran in a conventional programming language
than it is to use an automatic lexical analyzer generator.

3.6 FINITE AUTOMATA


A recognizer for a language is a program that takes as input a string x and
answers "yes" if x is a sentence of the language and "no" otherwise. We
compile a regular expression into a recognizer by constructing a generalized
transition diagram called a finite automaton. A finite automaton can be deter-
ministic or nondeterministic, where "nondeterministic" means that more than
one transition out of a state may be possible on the same input symbol.
Both deterministic and nondeterministic finite automata are capable of
recognizing precisely the regular sets. Thus they both can recognize exactly
what regular expressions can denote. However, there is a time-space tradeoff;
while deterministic finite automata can lead to faster recognizers than non-
deterministic automata, a deterministic finite automaton can be much bigger
than an equivalent nondeterministic automaton. In the next section, we
present methods for converting regular expressions into both kinds of finite
automata. The conversion into a nondeterministic automaton is more direct
so we discuss these automata first.
The examples in this section and the next deal primarily with the language
denoted by the regular expression (a|b)*abb, consisting of the set of all
strings of a's and b's ending in abb. Similar languages arise in practice. For
example, a regular expression for the names of all files that end in .o is of
the form (.|o|c)*.o, with c representing any character other than a dot or
an o. As another example, after the opening /*, comments in C consist of


any sequence of characters ending in */, with the added requirement that no
proper prefix ends in */.

Nondeterministic Finite Automata

A nondeterministic finite automaton (NFA, for short) is a mathematical model


that consists of

1. a set of states S
2. a set of input symbols Σ (the input symbol alphabet)
3. a transition function move that maps state-symbol pairs to sets of states
4. a state s0 that is distinguished as the start (or initial) state
5. a set of states F distinguished as accepting (or final) states
An NFA can be represented diagrammatically by a labeled directed graph,


called a transition graph, in which the nodes are the states and the labeled

edges represent the transition function. This graph looks like a transition
diagram, but the same character can label two or more transitions out of one
state, and edges can be labeled by the special symbol e as well as by input
symbols.
The transition graph for an NFA that recognizes the language (a|b)*abb is

shown in Fig. 3.19. The set of states of the NFA is {0, 1, 2, 3} and the input
symbol alphabet is {a, b}. State 0 in Fig. 3.19 is distinguished as the start
state, and the accepting state 3 is indicated by a double circle.

[Transition graph: start state 0 has edges labeled a and b back to itself and an edge labeled a to state 1; state 1 has an edge labeled b to state 2; state 2 has an edge labeled b to accepting state 3, drawn as a double circle.]

Fig. 3.19. A nondeterministic finite automaton.

When describing an NFA, we use the transition graph representation. In a


computer, the transition function of an NFA can be implemented in several
different ways, as we shall see. The easiest implementation is a transition
table in which there is a row for each state and a column for each input sym-
bol and e, if necessary. The entry for row i and symbol a in the table is the
set of states (or more likely in practice, a pointer to the set of states) that can
be reached by a transition from state i on input a. The transition table for the
NFA of Fig. 3.19 is shown in Fig. 3.20.
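For a small alphabet, such a transition table can be held directly in C as a
two-dimensional array whose entries are sets of states represented as bit
vectors. The sketch below encodes the NFA of Fig. 3.19 in this way; the array
and macro names are illustrative assumptions, not the text's own.

    /* Sketch: transition table for the NFA of Fig. 3.19 (assumed representation). */
    #define NSTATES 4
    enum { SYM_A, SYM_B, NSYMS };            /* input symbols a and b */

    typedef unsigned stateset;               /* bit i set means state i is in the set */
    #define SET(i) (1u << (i))

    /* nfa_move[s][x] = set of states reachable from state s on symbol x */
    static const stateset nfa_move[NSTATES][NSYMS] = {
        /* state 0 */ { SET(0) | SET(1), SET(0) },  /* on a: {0,1}; on b: {0}    */
        /* state 1 */ { 0,               SET(2) },  /* on b: {2}                 */
        /* state 2 */ { 0,               SET(3) },  /* on b: {3}                 */
        /* state 3 */ { 0,               0      },  /* accepting state, no moves */
    };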
The transition table representation has the advantage that it provides fast
access to the transitions of a given state on a given character; its disadvantage
is that it can take up a lot of space when the input alphabet is large and most
transitions are to the empty set. Adjacency list representations of the

Fig. 3.21. NFA accepting aa*|bb*.

2. for each state s and input symbol a, there is at most one edge labeled a
   leaving s.

A deterministic finite automaton has at most one transition from each state

on any input. If we are using a transition table to represent the transition


function of a DFA, then each entry in the transition table is a single state. As
a consequence, it is very easy to determine whether a deterministic finite auto-

maton accepts an input string, since there is at most one path from the start

state labeled by that string. The following algorithm shows how to simulate

the behavior of a DFA on an input string.

Algorithm 3.1. Simulating a DFA.


Input. An input string x terminated by an end-of-file character eof. A DFA D
with start state s0 and set of accepting states F.

Output. The answer "yes" if D accepts x; "no" otherwise.

Method. Apply the algorithm in Fig. 3.22 to the input string x. The function
move(s, c) gives the state to which there is a transition from state s on input
character c. The function nextchar returns the next character of the input
string x. □

     s := s0;
     c := nextchar;
     while c ≠ eof do begin
         s := move(s, c);
         c := nextchar
     end;
     if s is in F then
         return "yes"
     else return "no";

Fig. 3.22. Simulating a DFA.
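A direct C rendering of Fig. 3.22, specialized to the DFA for (a|b)*abb that
appears as Fig. 3.23 below, is sketched here. The table layout, the symbol
encoding, and the use of '\0' in place of eof are assumptions made for
illustration.

    /* Sketch of Algorithm 3.1 in C (assumed table layout, not from the text). */
    #include <stdio.h>

    #define NSTATES 4
    #define NSYMS   2                          /* symbol 0 stands for a, 1 for b */

    /* dtran[s][x] = state reached from state s on symbol x */
    static const int dtran[NSTATES][NSYMS] = {
        { 1, 0 },   /* state 0 */
        { 1, 2 },   /* state 1 */
        { 1, 3 },   /* state 2 */
        { 1, 0 },   /* state 3, the accepting state */
    };

    static int accepts(const char *x)          /* x plays the role of the input string */
    {
        int s = 0;                             /* start in state s0 = 0 */
        for ( ; *x != '\0'; x++)               /* '\0' plays the role of eof */
            s = dtran[s][*x == 'a' ? 0 : 1];
        return s == 3;                         /* "yes" iff we stop in an accepting state */
    }

    int main(void)
    {
        printf("%s\n", accepts("ababb") ? "yes" : "no");   /* prints yes */
        return 0;
    }

On the input ababb the state sequence is 0, 1, 2, 1, 2, 3, as traced in
Example 3.14.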



Example 3.14. In Fig. 3.23, we see the transition graph of a deterministic fin-
ite automaton accepting the same language (a|b)*abb as that accepted by the
NFA of Fig. 3.19. With this DFA and the input string ababb, Algorithm 3.1
follows the sequence of states 0, 1, 2, 1, 2, 3 and returns "yes".

Fig. 3.23. DFA accepting (a|b)*abb.

Conversion of an NFA into a DFA


Note that the NFA of Fig. 3.19 has two transitions from state 0 on input a;
that is, it may go to state 0 or 1. Similarly, the NFA of Fig. 3.21 has two
transitions on e from state 0. While we have not shown an example of it, a
situation where we could choose a transition on e or on a real input symbol
also causes ambiguity. These situations, in which the transition function is
multivalued, make it hard to simulate an NFA with a computer program. The
definition of acceptance merely asserts that there must be some path labeled
by the input string in question leading from the start state to an accepting
state. But if there are many paths that spell out the same input string, we
may have to consider them all before we find one that leads to acceptance or
discover that no path leads to an accepting state.
We now present an algorithm for constructing from an NFA a DFA that
recognizes the same language. This algorithm, often called the subset con-
struction, is useful for simulating an NFA by a computer program. A closely
related algorithm plays a fundamental role in the construction of LR parsers
in the next chapter.
In the transition table of an NFA, each entry is a set of states; in the transi-
tion table of a DFA, each entry is just a single state. The general idea behind
the NFA-to-DFA construction is that each DFA state corresponds to a set of
NFA states. The DFA uses its state to keep track of all possible states the
NFA can be in after reading each input symbol. That is to say, after reading
input a1 a2 ··· an, the DFA is in a state that represents the subset T of the
states of the NFA that are reachable from the NFA's start state along some
path labeled a1 a2 ··· an. The number of states of the DFA can be exponen-
tial in the number of states of the NFA, but in practice this worst case occurs
rarely.

Algorithm 3.2. (Subset construction.) Constructing a DFA from an NFA.

Input. An NFA N.

Output. A DFA D accepting the same language.

Method. Our algorithm constructs a transition table Dtran for D. Each DFA
state is a set of NFA states, and we construct Dtran so that D will simulate "in
parallel" all possible moves N can make on a given input string.
We use the operations in Fig. 3.24 to keep track of sets of NFA states (s
represents an NFA state and T a set of NFA states).

[Fig. 3.24. Operations on NFA states: e-closure(s), the set of NFA states reachable from state s on e-transitions alone; e-closure(T), the set of NFA states reachable from some state s in T on e-transitions alone; and move(T, a), the set of NFA states to which there is a transition on input symbol a from some state s in T.]

States that N could be in after reading some sequence of input symbols includ-
ing all possible e-transitions before or after symbols are read. The start state
of D is e-closure(s0). States and transitions are added to D using the algo-
rithm of Fig. 3.25. A state of D is an accepting state if it is a set of NFA
states containing at least one accepting state of N.

push all states in T onto stack;
initialize e-closure(T) to T;
while stack is not empty do begin
    pop t, the top element, off of stack;
    for each state u with an edge from t to u labeled e do
        if u is not in e-closure(T) then begin
            add u to e-closure(T);
            push u onto stack
        end
end

Fig. 3.26. Computation of e-closure.

The computation of e-closure(T) is a typical process of searching a graph for


nodes reachable from a given set of nodes. In this case the states of T are the
given set of nodes, and the graph consists of just the e-labeled edges of the
NFA. A simple algorithm to compute €-closure{T) uses a stack to hold states
whose edges have not been checked for e-labeled transitions. Such a pro-
cedure is shown in Fig. 3.26.
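A C version of the computation in Fig. 3.26 might look as follows, again
representing a set of states as a bit vector. The table eps_moves, which records
the e-edges of the NFA, and the set macros are illustrative assumptions, not the
text's own code.

    /* Sketch of e-closure(T) in C (assumed data structures, not from the text). */
    #define MAXSTATES 32
    typedef unsigned stateset;                 /* bit i set means state i is in the set */
    #define SET(i)         (1u << (i))
    #define CONTAINS(S, i) (((S) >> (i)) & 1u)

    static stateset eps_moves[MAXSTATES];      /* eps_moves[s] = states reached from s
                                                  by a single e-labeled edge            */
    static int nstates;                        /* number of states of the NFA           */

    stateset e_closure(stateset T)
    {
        int stack[MAXSTATES], top = 0;
        stateset closure = T;                  /* initialize e-closure(T) to T */
        int s, t, u;

        for (s = 0; s < nstates; s++)          /* push all states in T onto the stack */
            if (CONTAINS(T, s))
                stack[top++] = s;

        while (top > 0) {
            t = stack[--top];                  /* pop t, the top element */
            for (u = 0; u < nstates; u++)
                if (CONTAINS(eps_moves[t], u) && !CONTAINS(closure, u)) {
                    closure |= SET(u);         /* add u to e-closure(T) */
                    stack[top++] = u;          /* push u onto the stack */
                }
        }
        return closure;
    }

Since each state is pushed at most once, the stack of size MAXSTATES cannot
overflow, and the procedure runs in time proportional to the number of states
and e-edges examined.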

Example 3.15. Figure 3.27 shows another NFA N accepting the language
(a|b)*abb. (It happens to be the one in the next section, which will be
mechanically constructed from the regular expression.) Let us apply Algo-
rithm 3.2 to N. The start state of the equivalent DFA is e-closure(0), which is
A = {0, 1, 2, 4, 7}, since these are exactly the states reachable from state 0 via
a path in which every edge is labeled e. Note that a path can have no edges,
so state 0 is reached from itself by such a path.
The input symbol alphabet here is {a, b}. The algorithm of Fig. 3.25 tells
us to mark A and then to compute

e-closure(move(A, a)).

We first compute move{A, a), the set of states of N having transitions on a

from members of A. Among the states 0, 1, 2, 4 and 7, only 2 and 7 have


such transitions, to 3 and 8, so

e-closure(move({0, 1, 2, 4, 7}, a)) = e-closure({3, 8}) = {1, 2, 3, 4, 6, 7, 8}

Let us call this set B. Thus, Dtran[A, a] = B.


Among the states in A, only 4 has a transition on b to 5, so the DFA has a
transition on b from A to

[Transition graph of N: from start state 0, e-edges lead to states 1 and 7; from 1, e-edges lead to 2 and 4; 2 goes to 3 on a; 4 goes to 5 on b; 3 and 5 have e-edges to 6; 6 has e-edges back to 1 and 7; 7 goes to 8 on a; 8 goes to 9 on b; 9 goes to 10 on b; state 10 is accepting.]

Fig. 3.27. NFA N for (a|b)*abb.

C = e-closure({5}) = {1, 2, 4, 5, 6, 7}

Thus, Dtran[A, b] = C.
If we continue this process with the now unmarked sets B and C, we even-
tually reach the point where all sets that are states of the DFA are marked.
This is certain since there are "only" 2^11 different subsets of a set of eleven
states, and a set, once marked, is marked forever. The five different sets of
states we actually construct are:

A = {0, 1, 2, 4, 7} D = {1, 2, 4, 5, 6, 7, 9}
B = {1, 2, 3, 4, 6, 7, 8}    E = {1, 2, 4, 5, 6, 7, 10}
C = {1, 2, 4, 5, 6, 7}

State A is the start state, and state E is the only accepting state. The complete
transition table Dtran is shown in Fig. 3.28.

        State    a    b
          A      B    C
          B      B    D
          C      B    C
          D      B    E
          E      B    C

    Fig. 3.28. Transition table Dtran for the DFA.

Fig. 3.29. Result of applying the subset construction to Fig. 3.27.

fewer state. We discuss the question of minimization of the number of states
of a DFA in Section 3.9. □
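For concreteness, here is a Python sketch of the subset construction of Algorithm
3.2, applied to the NFA of Fig. 3.27. The dictionary representation of the NFA
(symbol transitions in nfa_moves, e-transitions in eps) is an assumption made for
the sketch, not part of the algorithm itself.

```python
from collections import deque

def subset_construction(nfa_moves, eps, start, accepting, alphabet):
    """Sketch of Algorithm 3.2: DFA states are frozensets of NFA states;
    Dtran is built by processing unmarked states until none remain."""
    def closure(T):
        stack, result = list(T), set(T)
        while stack:
            t = stack.pop()
            for u in eps.get(t, []):
                if u not in result:
                    result.add(u)
                    stack.append(u)
        return frozenset(result)

    def move(T, a):
        return {u for s in T for u in nfa_moves.get((s, a), [])}

    start_state = closure({start})
    dstates, dtran = [start_state], {}
    unmarked = deque([start_state])
    while unmarked:                     # while there is an unmarked state
        T = unmarked.popleft()          # mark T
        for a in alphabet:
            U = closure(move(T, a))
            if U and U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            if U:
                dtran[(T, a)] = U
    accepts = [S for S in dstates if S & accepting]
    return dstates, dtran, start_state, accepts

# NFA of Fig. 3.27 for (a|b)*abb (transitions read off the figure)
nfa_moves = {(2,'a'): [3], (7,'a'): [8], (4,'b'): [5], (8,'b'): [9], (9,'b'): [10]}
eps = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}
dstates, dtran, A, accepts = subset_construction(nfa_moves, eps, 0, {10}, 'ab')
print(len(dstates), len(accepts))   # 5 DFA states, 1 accepting, as in Example 3.15
```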

3.7 FROM A REGULAR EXPRESSION TO AN NFA


There are many strategies for building a recognizer from a regular expression,
each with its own strengths and weaknesses. One strategy that has been used
in a number of text-editing programs is to construct an NFA from a regular
expression and then to simulate the behavior of the NFA on an input string
using Algorithms 3.3 and 3.4 of this section. If run-time speed is essential,
we can convert the NFA into a DFA using the subset construction of the pre-
vious section. In Section 3.9, we see an alternative implementation of a DFA
from a regular expression in which an intervening NFA is not explicitly con-
structed. This section concludes with a discussion of time-space tradeoffs in
the implementation of recognizers based on NFA and DFA.

Construction of an NFA from a Regular Expression

We now give an algorithm to construct an NFA from a regular expression.


There are many variants of this algorithm, but here we present a simple ver-
sion that is easy to implement. The algorithm is syntax-directed in that it uses
the syntactic structure of the regular expression to guide the construction pro-
cess. The cases in the algorithm follow the cases in the definition of a regular
expression. We first show how to construct automata to recognize e and any
symbol in the alphabet. Then, we show how to construct automata for expres-
sions containing an alternation, concatenation, or Kleene closure operator.
For example, for the expression r|s, we construct an NFA inductively from
the NFA's for r and s.

As the construction proceeds, each step introduces at most two new states,
so the resulting NFA constructed for a regular expression has at most twice as
many states as there are symbols and operators in the regular expression.

Algorithm 3.3. (Thompson's construction.) An NFA from a regular expres-
sion.

Input. A regular expression r over an alphabet Σ.

Output. An NFA N accepting L(r).

Method. We first parse r into its constituent subexpressions. Then, using


rules (1) and (2) below, we construct NFA's for each of the basic symbols in r
(those that are either € or an alphabet symbol). The basic symbols correspond
to parts (1) and (2) in the definition of a regular expression. It is important
to understand that if a symbol a occurs several times in r, a separate NFA is

constructed for each occurrence.


Then, guided by the syntactic structure of the regular expression r, we com-
bine these NFA's inductively using rule (3) below until we obtain the NFA for
the entire expression. Each intermediate NFA produced during the course of
the construction corresponds to a subexpression of r and has several important
properties: it has exactly one final state, no edge enters the start state, and no
edge leaves the final state.

1.  For e, construct the NFA

        start -> i --e--> f

    Here i is a new start state and f a new accepting state. Clearly, this NFA
    recognizes {e}.

2.  For a in Σ, construct the NFA

        start -> i --a--> f

    Again i is a new start state and f a new accepting state. This machine
    recognizes {a}.

3.  Suppose N(s) and N(t) are NFA's for regular expressions s and t.

    a)  For the regular expression s|t, construct the following composite
        NFA N(s|t):

        Here i is a new start state and f a new accepting state. There is a
        transition on e from i to the start states of N(s) and N(t). There is
        a transition on e from the accepting states of N(s) and N(t) to the
        new accepting state f. The start and accepting states of N(s) and
        N(t) are not start or accepting states of N(s|t). Note that any path
        from i to f must pass through either N(s) or N(t) exclusively. Thus,
        we see that the composite NFA recognizes L(s) ∪ L(t).
    b)  For the regular expression st, construct the composite NFA N(st):
        the start state of N(s) becomes the start state of the composite NFA,
        and the accepting state of N(t) becomes the accepting state of the
        composite NFA. The accepting state of N(s) is merged with the
        start state of N(t); that is, all transitions from the start state of N(t)
        become transitions from the accepting state of N(s). The new
        merged state loses its status as a start or accepting state in the com-
        posite NFA. A path from i to f must go first through N(s) and then
        through N(t), so the label of that path will be a string in L(s)L(t).
        Since no edge enters the start state of N(t) or leaves the accepting
        state of N(s), there can be no path from i to f that travels from N(t)
        back to N(s). Thus, the composite NFA recognizes L(s)L(t).
    c)  For the regular expression s*, construct the composite NFA N(s*):
        here i is a new start state and f a new accepting state. In the com-
        posite NFA, we can go from i to f directly, along an edge labeled e,
        representing the fact that e is in (L(s))*, or we can go from i to f
        passing through N(s) one or more times. Clearly, the composite
        NFA recognizes (L(s))*.

    d)  For the parenthesized regular expression (s), use N(s) itself as the
        NFA.

Every time we construct a new state, we give it a distinct name. In this way,
no two states of any component NFA can have the same name. Even if the
same symbol appears several times in r, we create for each instance of that
symbol a separate NFA with its own states.

We can verify that each step of the construction of Algorithm 3.3 produces
an NFA that recognizes the correct language. In addition, the construction
produces an NFA N(r) with the following properties.

1.  N(r) has at most twice as many states as the number of symbols and
    operators in r. This follows from the fact that each step of the construction
    creates at most two new states.

2.  N(r) has exactly one start state and one accepting state. The accepting
    state has no outgoing transitions. This property holds for each of the
    constituent automata as well.

3.  Each state of N(r) has either one outgoing transition on a symbol in Σ or
    at most two outgoing e-transitions.

Example 3.16. Let us use Algorithm 3.3 to construct an NFA for the regular
expression r11 = (a|b)*abb, working up from its constituent subexpressions
r1, r2, ..., r11. (Figure 3.30, which shows this decomposition and the inter-
mediate NFA's, is not reproduced here.)

The NFA for r3 = r1|r2 is built from the NFA's for r1 = a and r2 = b by
rule (3a). The NFA for (r3) is the same as that for r3, and the NFA for (r3)*
is then obtained by rule (3c). The NFA for r6 = a is constructed by rule (2).
To obtain the automaton for r5 r6, we merge states 7 and 7', calling the
resulting state 7. Continuing in this fashion, we obtain the NFA for
r11 = (a|b)*abb that was first exhibited in Fig. 3.27. □
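The inductive construction is easy to program. Below is a Python sketch of
Thompson's construction; the tuple representation of the parsed regular
expression (e.g. ('star', ('or', ('sym','a'), ('sym','b')))) and the helper
new_state counter are assumptions made for the sketch, not part of Algorithm 3.3.

```python
def thompson(tree):
    """Build an NFA (transitions dict, start, accept) from a parsed regular
    expression, following the rules of Algorithm 3.3.  The empty string ''
    plays the role of the e label."""
    trans = {}                      # (state, label) -> list of target states
    counter = [0]

    def new_state():
        counter[0] += 1
        return counter[0] - 1

    def add(s, label, t):
        trans.setdefault((s, label), []).append(t)

    def build(node):
        kind = node[0]
        if kind == 'sym':                        # rule (1)/(2): e or a symbol
            i, f = new_state(), new_state()
            add(i, node[1], f)
            return i, f
        if kind == 'or':                         # rule (3a)
            s_i, s_f = build(node[1]); t_i, t_f = build(node[2])
            i, f = new_state(), new_state()
            add(i, '', s_i); add(i, '', t_i)
            add(s_f, '', f); add(t_f, '', f)
            return i, f
        if kind == 'cat':                        # rule (3b): merge accept of N(s)
            s_i, s_f = build(node[1]); t_i, t_f = build(node[2])
            for (p, lab), targets in list(trans.items()):
                if p == t_i:                     # move N(t)'s start transitions
                    for q in targets:
                        add(s_f, lab, q)
                    del trans[(p, lab)]
            return s_i, t_f
        if kind == 'star':                       # rule (3c)
            s_i, s_f = build(node[1])
            i, f = new_state(), new_state()
            add(i, '', s_i); add(i, '', f)
            add(s_f, '', s_i); add(s_f, '', f)
            return i, f
        raise ValueError(kind)

    start, accept = build(tree)
    return trans, start, accept

# (a|b)*abb as a parse tree of tuples
r = ('cat', ('cat', ('cat',
        ('star', ('or', ('sym', 'a'), ('sym', 'b'))),
        ('sym', 'a')), ('sym', 'b')), ('sym', 'b'))
trans, start, accept = thompson(r)
print(len({s for k in trans for s in (k[0], *trans[k])}))   # 11 states, as in Fig. 3.27
```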

Two-Stack Simulation of an NFA


We now present an algorithm that, given an NFA N constructed by Algorithm
3.3 and an input string x, determines whether N accepts x. The algorithm
works by reading the input one character at a time and computing the com-
plete set of states that N could be in after having read each prefix of the input.
The algorithm takes advantage of the special properties of the NFA produced
by Algorithm 3.3 to compute each set of nondeterministic states efficiently. It
can be implemented to run in time proportional to |N| × |x|, where |N| is the
number of states in N and |x| is the length of x.

Algorithm 3.4. Simulating an NFA.

Input. An NFA N constructed by Algorithm 3.3 and an input string x. We
assume x is terminated by an end-of-file character eof. N has start state s0
and set of accepting states F.

Output. The answer "yes" if N accepts x; "no" otherwise.

Method. Apply the algorithm sketched in Fig. 3.31 to the input string x. The
algorithm in effect performs the subset construction at run time. It computes
a transition from the current set of states S to the next set of states in two
stages. First, it determines move(S, a), all states that can be reached from a
state in S by a transition on a, the current input character. Then, it computes
the e-closure of move(S, a), that is, all states that can be reached from
move(S, a) by zero or more e-transitions. The algorithm uses the function
nextchar to read the characters of x, one at a time. When all characters of x
have been seen, the algorithm returns "yes" if an accepting state is in the set
S of current states; "no", otherwise.

    S := e-closure({s0});
    a := nextchar;
    while a ≠ eof do begin
        S := e-closure(move(S, a));
        a := nextchar
    end;
    if S ∩ F ≠ ∅ then
        return "yes"
    else return "no";

    Fig. 3.31. Simulating the NFA of Algorithm 3.3.

Algorithm 3.4 can be efficiently implemented using two stacks and a bit
vector indexed by NFA states. We use one stack to keep track of the current
set of nondeterministic states and the other stack to compute the next set of
nondeterministic states. We can use the algorithm in Fig. 3.26 to compute the
e-closure. The bit vector can be used to determine in constant time whether a
nondeterministic state is already on a stack so that we do not add it twice.
Once we have computed the next state set on the second stack, we can inter-
change the roles of the two stacks. Since each nondeterministic state has at
most two out-transitions, each state can give rise to at most two new states in
a transition. Let us write |N| for the number of states of N. Since there can
be at most |N| states on a stack, the computation of the next set of states
from the current set of states can be done in time proportional to |N|. Thus,
the total time needed to simulate the behavior of N on input x is proportional
to |N| × |x|.

Example 3.17. Let N be the NFA of Fig. 3.27 and let x be the string consist-
ing of the single character a. The start state is e-closure({0}) = {0, 1, 2, 4, 7}.
On input symbol a there is a transition from 2 to 3 and from 7 to 8. Thus, T
is {3, 8}. Taking the e-closure of T gives us the next state {1, 2, 3, 4, 6, 7, 8}.
Since none of these nondeterministic states is accepting, the algorithm returns
"no."

Notice that Algorithm 3.4 does the subset construction at run time. For
example, compare the above transitions with the states of the DFA in Fig.
3.29 constructed from the NFA of Fig. 3.27. The start and next state sets on
input a correspond to states A and B of the DFA. □
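The on-the-fly simulation of Fig. 3.31 can be sketched in Python as follows; as
before, the dictionary encoding of the NFA is an assumption of the sketch, with
the transitions read off Fig. 3.27.

```python
def simulate_nfa(x, moves, eps, start, accepting):
    """Sketch of Algorithm 3.4: track the set S of states the NFA could be
    in after each prefix of x, alternating move and e-closure steps."""
    def closure(T):
        stack, result = list(T), set(T)
        while stack:
            t = stack.pop()
            for u in eps.get(t, []):
                if u not in result:
                    result.add(u)
                    stack.append(u)
        return result

    S = closure({start})                        # S := e-closure({s0})
    for a in x:                                 # a := nextchar, until end of input
        S = closure({u for s in S for u in moves.get((s, a), [])})
    return "yes" if S & accepting else "no"     # is S ∩ F nonempty?

# NFA of Fig. 3.27 for (a|b)*abb
moves = {(2,'a'): [3], (7,'a'): [8], (4,'b'): [5], (8,'b'): [9], (9,'b'): [10]}
eps = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}
print(simulate_nfa("a", moves, eps, 0, {10}))     # "no", as in Example 3.17
print(simulate_nfa("abb", moves, eps, 0, {10}))   # "yes"
```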

Time-Space Tradeoffs

Given a regular expression r and an input string x, we now have two methods
for determining whether x is in L(r). One approach is to use Algorithm 3.3
to construct an NFA N from r. This construction can be done in O(|r|) time,
where |r| is the length of r. N has at most twice as many states as |r|, and at
most two transitions from each state, so a transition table for N can be stored
in O(|r|) space. We can then use Algorithm 3.4 to determine whether N
accepts x in O(|r| × |x|) time. Thus, using this approach, we can determine
whether x is in L(r) in total time proportional to the length of r times the
length of x. This approach has been used in a number of text editors to
search for regular expression patterns when the target string x is generally not
very long.
A second approach is to construct a DFA from the regular expression r by
applying Thompson's construction to r and then the subset construction, Algo-
rithm 3.2, to the resulting NFA. (An implementation that avoids constructing
the intermediate NFA explicitly is given in Section 3.9.) Implementing the
transition function with a transition table, we can use Algorithm 3.1 to simu-
late the DFA on input x in time proportional to the length of x, independent
of the number of states in the DFA. This approach has often been used in
pattern-matching programs that search text files for regular expression pat-
terns. Once the finite automaton has been constructed, the searching can
proceed very rapidly, so this approach is advantageous when the target string
x is very long.
There are, however, certain regular expressions whose smallest DFA has a
number of states that is exponential in the size of the regular expression. For
example, the regular expression (a|b)*a(a|b)(a|b)···(a|b), where there
are n−1 (a|b)'s at the end, has no DFA with fewer than 2^n states. This reg-
ular expression denotes any string of a's and b's in which the nth character
from the right end is an a. It is not hard to prove that any DFA for this
expression must keep track of the last n characters it sees on the input; other-
wise, it may give an erroneous answer. Clearly, at least 2^n states are required
to keep track of all possible sequences of n a's and b's. Fortunately, expres-
sions such as this do not occur frequently in lexical analysis applications, but
there are applications where similar expressions do arise.
A third approach is to use a DFA, but avoid constructing all of the transi-
tion table by using a technique called "lazy transition evaluation." Here,
transitions are computed at run time but a transition from a given state on a
given character is not determined until it is actually needed. The computed
transitions are stored in a cache. Each time a transition is about to be made,
the cache is consulted. If the transition is not there, it is computed and stored
in the cache. If the cache becomes full, we can erase some previously com-

puted transition to make room for the new transition.
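The following Python sketch illustrates one way lazy transition evaluation might
be organized; the small cache with arbitrary eviction and the overall class shape
are illustrative assumptions, not a prescribed implementation.

```python
class LazyDFA:
    """Sketch of lazy transition evaluation: DFA states are sets of NFA
    states, and a transition is computed only the first time it is needed."""
    def __init__(self, moves, eps, start, cache_limit=1024):
        self.moves, self.eps = moves, eps
        self.start = frozenset(self._closure({start}))
        self.cache = {}                  # (DFA state, symbol) -> DFA state
        self.cache_limit = cache_limit

    def _closure(self, T):
        stack, result = list(T), set(T)
        while stack:
            t = stack.pop()
            for u in self.eps.get(t, []):
                if u not in result:
                    result.add(u)
                    stack.append(u)
        return result

    def transition(self, S, a):
        key = (S, a)
        if key not in self.cache:                     # compute on demand
            if len(self.cache) >= self.cache_limit:   # evict to make room
                self.cache.pop(next(iter(self.cache)))
            U = self._closure({u for s in S for u in self.moves.get((s, a), [])})
            self.cache[key] = frozenset(U)
        return self.cache[key]

    def accepts(self, x, accepting):
        S = self.start
        for a in x:
            S = self.transition(S, a)
        return bool(S & accepting)

# Same NFA as before (Fig. 3.27); transitions are filled into the cache lazily.
moves = {(2,'a'): [3], (7,'a'): [8], (4,'b'): [5], (8,'b'): [9], (9,'b'): [10]}
eps = {0: [1, 7], 1: [2, 4], 3: [6], 5: [6], 6: [1, 7]}
print(LazyDFA(moves, eps, 0).accepts("aababb", {10}))   # True
```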


Figure 3.32 summarizes the worst-case space and time requirements for
determining whether an input string x is in the language denoted by a regular
expression r using recognizers constructed from nondeterministic and deter-
ministic finite automata. The "lazy" technique combines the space require-
ment of the NFA method with the time requirement of the DFA approach.
Its space requirement is the size of the regular expression plus the size of the
cache; its observed running time is almost as fast as that of a deterministic
recognizer. In some applications, the "lazy" technique is considerably faster
than the DFA approach, because no time is wasted computing state transitions
that are never used.

    Automaton    Space       Time
    NFA          O(|r|)      O(|r| × |x|)
    DFA          O(2^|r|)    O(|x|)

    Fig. 3.32. Worst-case space and time to recognize a regular expression r
    on an input string x.
3.8 DESIGN OF A LEXICAL ANALYZER GENERATOR

In this section, we consider the design of a lexical analyzer generator such as
the Lex compiler. Its input is a specification consisting of a list of patterns and
associated actions of the form

    p1    { action1 }
    p2    { action2 }
     ...      ...
    pn    { actionn }

where, as in Section 3.5, each pattern pi is a regular expression and each
action actioni is a program fragment that is to be executed whenever a lexeme
matched by pi is found in the input.
Our problem is to construct a recognizer that looks for lexemes in the input
buffer. If more than one pattern matches, the recognizer is to choose the

longest lexeme matched. If there are two or more patterns that match the
longest lexeme, the first-listed matching pattern is chosen.
A finite automaton is a natural model around which to build a lexical
analyzer, and the one constructed by our Lex compiler has the form shown in
Fig. 3.33(b). There is an input buffer with two pointers to it, a lexeme-
beginning and a forward pointer, as discussed in Section 3.2. The Lex com-
piler constructs a transition table for a finite automaton from the regular
expression patterns in the Lex specification. The lexical analyzer itself con-
sists of a finite automaton simulator that uses this transition table to look for

the regular expression patterns in the input buffer.

Fig. 3.33. Model of Lex compiler: (a) the Lex compiler produces a transition
table from a Lex specification; (b) the schematic lexical analyzer, a finite
automaton simulator that uses this transition table to find the next lexeme in
the input buffer.

The remainder of this section shows that the implementation of a Lex



compiler can be based on either nondeterministic or deterministic automata.


At the end of the last section we saw that the transition table of an NFA for a
regular expression pattern can be considerably smaller than that of a DFA,
but the DFA has the decided advantage of being able to recognize patterns
faster than the NFA.

Pattern Matching Based on NFA's

One method is to construct the transition table of a nondeterministic finite
automaton N for the composite pattern p1|p2|···|pn. This can be done by
first creating an NFA N(pi) for each pattern pi using Algorithm 3.3, then
adding a new start state s0, and finally linking s0 to the start state of each
N(pi) with an e-transition, as shown in Fig. 3.34.

Fig. 3.34. NFA constructed from Lex specification.

To simulate this NFA we can use a modification of Algorithm 3.4. The
modification ensures that the combined NFA recognizes the longest prefix of
the input that is matched by a pattern. In the combined NFA, there is an
accepting state for each pattern pi. When we simulate the NFA using Algo-
rithm 3.4, we construct the sequence of sets of states that the combined NFA
can be in after seeing each input character. Even if we find a set of states
that contains an accepting state, to find the longest match we must continue to
simulate the NFA until it reaches termination, that is, a set of states from
which there are no transitions on the current input symbol.

We presume that the Lex specification is designed so that a valid source
program cannot entirely fill the input buffer without having the NFA reach
termination. For example, each compiler puts some restriction on the length
of an identifier, and violations of this limit will be detected when the input
buffer overflows, if not sooner.
To find the correct match, we make two modifications to Algorithm 3.4.
First, whenever we add an accepting state to the current set of states, we
record the current input position and the pattern pi corresponding to this
accepting state. If the current set of states already contains an accepting state,
then only the pattern that appears first in the Lex specification is recorded.
Second, we continue making transitions until we reach termination. Upon ter-
mination, we retract the forward pointer to the position at which the last
match occurred. The pattern making this match identifies the token found,
and the lexeme matched is the string between the lexeme-beginning and for-
ward pointers.

Usually, the Lex specification is such that some pattern, possibly an error
pattern, will always match. If no pattern matches, however, we have an error
condition for which no provision was made, and the lexical analyzer should
transfer control to some default error recovery routine.
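A Python sketch of this longest-match bookkeeping is given below; the
representation of the combined NFA (a map accept_of from each accepting state
to the index of its pattern in the specification) is an assumption made for the
sketch. The example that follows it walks through the same idea by hand.

```python
def longest_match(text, pos, moves, eps, start, accept_of):
    """Modified Algorithm 3.4: scan forward from text[pos], remembering the
    last position at which some pattern's accepting state was present.
    accept_of maps an NFA accepting state to its pattern's index in the
    specification (lower index = listed earlier, hence higher priority)."""
    def closure(T):
        stack, result = list(T), set(T)
        while stack:
            t = stack.pop()
            for u in eps.get(t, []):
                if u not in result:
                    result.add(u)
                    stack.append(u)
        return result

    S = closure({start})
    last_end, last_pattern = None, None
    i = pos
    while i < len(text) and S:
        hits = sorted(accept_of[s] for s in S if s in accept_of)
        if hits:                                  # record first-listed pattern
            last_end, last_pattern = i, hits[0]
        S = closure({u for s in S for u in moves.get((s, text[i]), [])})
        i += 1
    hits = sorted(accept_of[s] for s in S if s in accept_of)
    if hits:
        last_end, last_pattern = i, hits[0]
    return last_pattern, (text[pos:last_end] if last_end is not None else "")
```

On the combined NFA of Fig. 3.35(b) and the input aaba, a loop like this would
report the third pattern and the lexeme aab, exactly as traced in the example
below.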

Example 3.18. A simple example illustrates the above ideas. Suppose we
have the following Lex program consisting of three regular expressions and no
regular definitions.

    a       { }   /* actions are omitted here */
    abb     { }
    a*b+    { }

The three tokens above are recognized by the automata of Fig. 3.35(a). We
have simplified the third automaton somewhat from what would be produced
by Algorithm 3.3. As indicated above, we can convert the NFA's of Fig.
3.35(a) into one combined NFA N, shown in Fig. 3.35(b).

Let us now consider the behavior of N on the input string aaba using our
modification of Algorithm 3.4. Figure 3.36 shows the sets of states and pat-
terns that match as each character of the input aaba is processed. This figure
shows that the initial set of states is {0, 1, 3, 7}. States 1, 3, and 7 each have
a transition on a, to states 2, 4, and 7, respectively. Since state 2 is the
accepting state for the first pattern, we record the fact that the first pattern
matches after reading the first a.

However, there is a transition from state 7 to state 7 on the second input
character a, so we must continue making transitions. There is a transition
from state 7 to state 8 on the input character b. State 8 is the accepting state
for the third pattern. Once we reach state 8, there are no transitions possible
on the next input character a, so we have reached termination. Since the last
match occurred after we read the third input character, we report that the
third pattern has matched the lexeme aab. □

The role of the action actioni associated with the pattern pi in the Lex
specification is as follows. When an instance of pi is recognized, the lexical
analyzer executes the associated program actioni. Note that actioni is not
executed just because the NFA enters a state that includes the accepting state
for pi; actioni is only executed if pi turns out to be the pattern yielding the
longest match.

(a) NFA's for a, abb, and a*b+.

(b) Combined NFA.

Fig. 3.35. NFA recognizing three different patterns.

Fig. 3.36. Sequence of sets of states entered in processing the input aaba.

Pattern Matching Based on DFA's

If we convert the combined NFA into a DFA using the subset construction of
Algorithm 3.2, then there may be several accepting states in a given subset of
nondeterministic states. In such a situation, the accepting state corresponding
to the pattern listed first in the Lex specification has priority. As in the NFA
simulation, the only other modification we need to perform is to continue
making state transitions until we reach a state with no next state (i.e., the
state ∅) for the current input symbol. To find the lexeme matched, we return
to the last input position at which the DFA entered an accepting state.


Implementing the Lookahead Operator

Recall from Section 3.4 that the lookahead operator / is necessary in some
situations, since the pattern that denotes a particular token may need to
describe some trailing context for the actual lexeme. When converting a pat-
tern with / to an NFA, we can treat the / as if it were e, so that we do not
actually look for / on the input. However, if a string denoted by this regular
expression is recognized in the input buffer, the end of the lexeme is not the
position of the NFA's accepting state. Rather, it is at the last occurrence of
the state of this NFA having a transition on the (imaginary) /.

Example 3.20. The NFA recognizing the pattern for IF given in Example
3.12 is shown in Fig. 3.38. State 6 indicates the presence of keyword IF;
however, we find the token IF by scanning backwards to the last occurrence
of state 2. □

Fig. 3.38. NFA recognizing Fortran keyword IF.

3.9 OPTIMIZATION OF DFA-BASED PATTERN MATCHERS


In this section, we present three algorithms that have been used to implement
and optimize pattern matchers constructed from regular expressions. The first
algorithm is suitable for inclusion in a Lex compiler because it constructs a
DFA directly from a regular expression, without constructing an intermediate
NFA along the way.
The second algorithm minimizes the number of states of any DFA, so it can
be used to reduce the size of a DFA-based pattern matcher. The algorithm is
efficient; its running time is O(n log n), where n is the number of states in the
DFA. The third algorithm can be used to produce fast but more compact
representations for the transition table of a DFA than a straightforward two-
dimensional table.

Important States of an NFA


Let us call a state of an NFA important if it has a non-e out-transition. The
subset construction in Fig. 3.25 uses only the important states in a subset T
when it determines e-closure(move(T, a)), the set of states that is reachable
from T on input a. The set move(s, a) is nonempty only if state s is impor-
tant. During the construction, two subsets can be identified if they have the
same important states, and either both or neither include accepting states of
the NFA.
SEC. 3.9 OPTIMIZATION OF DFA-BASED PATTERN MATCHERS 135

When the subset construction is applied to an NFA obtained from a regular


expression by Algorithm 3.3, we can exploit the special properties of the NFA
to combine the two constructions. The combined construction relates the
important states of the NFA with the symbols in the regular expression.
Thompson's construction builds an important state exactly when a symbol in

the alphabet appears in a regular expression. For example, important states


will be constructed for each a and b in (a|b)*abb.
Moreover, the resulting NFA has exactly one accepting state, but the
accepting state is not important because it has no transitions leaving it. By
concatenating a unique right-end marker # to a regular expression r, we give
the accepting state of r a transition on #, making it an important state of the
NFA for r#. In other words, by using the augmented regular expression (r)#
we can forget about accepting states as the subset construction proceeds; when
the construction is complete, any DFA state with a transition on # must be an
accepting state.
We represent an augmented regular expression by a syntax tree with basic
symbols at the leaves and operators at the interior nodes. We refer to an inte-
rior node as a cat-node, or-node, or star-node if it is labeled by a concatena-
tion, |, or
* operator, respectively. Figure 3.39(a) shows a syntax tree for an
augmented regular expression with cat-nodes marked by dots. The syntax tree
for a regular expression can be constructed in the same manner as a syntax
tree for an arithmetic expression (see Chapter 2).
Leaves in the syntax tree for a regular expression are labeled by alphabet
symbols or by €. To each leaf not labeled by e we attach a unique integer and
refer to this integer as the position of the leaf and also as a position of its sym-
bol. A repeated symbol therefore has several positions. Positions are shown
below the symbols in the syntax tree of Fig. 3.39(a). The numbered states in
the NFA of Fig. 3.39(c) correspond to the positions of the leaves in the syntax
tree in Fig. 3.39(a). It is no coincidence that these states are the important
states of the NFA. Non-important states are named by upper-case letters in
Fig. 3.39(c).
The DFA in Fig. 3.39(b) can be obtained from the NFA in Fig. 3.39(c) if
we apply the subset construction and identify subsets containing the same
important states. The identification results in one fewer state being con-
structed, as a comparison with Fig. 3.29 shows.

From a Regular Expression to a DFA


In this section, we show how to construct a DFA directly from an augmented
regular expression (r)#. We begin by constructing a syntax tree T for (r)#
and then computing four functions: nullable, firstpos, lastpos, and followpos,
by making traversals over T. Finally, we construct the DFA from followpos.
The functions nullable, firstpos, and lastpos are defined on the nodes of the
syntax tree and are used to compute followpos, which is defined on the set of
positions.

(a) Syntax tree for (a|b)*abb#.

(b) Resulting DFA.

(c) Underlying NFA.

Fig. 3.39. DFA and NFA constructed from (a|b)*abb#.

Remembering the equivalence between the important NFA states and the
positions of the leaves in the syntax tree of the regular expression, we can
short-circuit the construction of the NFA by building the DFA whose states
correspond to sets of positions in the tree. The e-transitions of the NFA
represent some fairly complicated structure of the positions; in particular, they
encode the information regarding when one position can follow another. That
is, each symbol in an input string to a DFA can be matched by certain posi-

tions. An input symbol c can only be matched by positions at which there is a


c, but not every position with a c can necessarily match a particular
occurrence of c in the input stream.
The notion of a position matching an input symbol will be defined in terms


of the function followpos on positions of the syntax tree. If i is a position,
then followpos(i) is the set of positions j such that there is some input string
···cd··· such that i corresponds to this occurrence of c and j to this
occurrence of d.

Example 3.21. In Fig. 3.39(a), followpos(1) = {1, 2, 3}. The reasoning is
that if we see an a corresponding to position 1, then we have just seen an
occurrence of a|b in the closure (a|b)*. We could next see the first position
of another occurrence of a|b, which explains why 1 and 2 are in followpos(1).
We could also next see the first position of what follows (a|b)*, that is, posi-
tion 3. □

In order to compute the function followpos, we need to know what positions


can match the first or last symbol of a string generated by a given subexpres-

sion of a regular expression. (Such information was used informally in Exam-


ple 3.21.) If r* is such a subexpression, then every position that can be first

in r follows every position that can be last in r. Similarly, if rs is a subexpres-


sion, then every first position of s follows every last position of r.

At each node n of the syntax tree of a regular expression, we define a func-


tion firstpos(n) that gives the set of positions that can match the first symbol
of a string generated by the subexpression rooted at n. Likewise, we define a
function lastpos(n) that gives the set of positions that can match the last sym-
bol in such a string. For example, if n is the root of the whole tree in Fig.
3.39(a), then firstpos(n) = {1, 2, 3} and lastpos(n) = {6}. We give an algo-
rithm for computing these functions momentarily.
In order to compute firstpos and lastpos, we need to know which nodes are
the roots of subexpressions that generate languages that include the empty
string. Such nodes are called nullable, and we define nullable(n) to be true if

node n is nullable, false otherwise.


We can now give the rules to compute the functions nullable, firstpos, last-
pos, and followpos. For the first three functions, we have a basis rule that
tells about expressions of a basic symbol, and then three inductive rules that
allow us to determine the value of the functions working up the syntax tree
from the bottom; in each case the inductive rules correspond to the three
operators, union, concatenation, and closure. The rules for nullable and first-

pos are given in Fig. 3.40. The rules for lastpos(n) are the same as those for
firstpos(n), but with c1 and c2 reversed, and are not shown.

The first rule for nullable states that if n is a leaf labeled e, then nullable(n)
is surely true. The second rule states that if n is a leaf labeled by an alphabet
symbol, then nullable(n) is false. In this case, each leaf corresponds to a sin-
gle input symbol, and therefore cannot generate e. The last rule for nullable
states that if n is a star-node with child c1, then nullable(n) is true, because
the closure of an expression generates a language that includes e.

As another example, the fourth rule for firstpos states that if n is a cat-node
with left child c1 and right child c2, and if nullable(c1) is true, then

    firstpos(n) = firstpos(c1) ∪ firstpos(c2);

otherwise, firstpos(n) = firstpos(c1).

    Node n                            nullable(n)                 firstpos(n)
    leaf labeled e                    true                        ∅
    leaf with position i              false                       {i}
    or-node with children c1, c2      nullable(c1) or             firstpos(c1) ∪ firstpos(c2)
                                        nullable(c2)
    cat-node with children c1, c2     nullable(c1) and            if nullable(c1) then
                                        nullable(c2)                firstpos(c1) ∪ firstpos(c2)
                                                                  else firstpos(c1)
    star-node with child c1           true                        firstpos(c1)

    Fig. 3.40. Rules for computing nullable and firstpos.

Fig. 3.41. firstpos and lastpos for nodes in the syntax tree for (a|b)*abb#.

The node labeled * in Fig. 3.41 is the only nullable node. Thus, by the
if-condition of the fourth rule, firstpos for the parent of this node (the one
representing expression (a|b)*a) is the union of {1, 2} and {3}, which are the
firstpos's of its left and right children. On the other hand, the else-condition
applies for lastpos of this node, since the leaf at position 3 is not nullable.
Thus, the parent of the star-node has lastpos containing only 3.

Let us now compute followpos bottom up for each node of the syntax tree
of Fig. 3.41. There are only two ways one position can come to follow
another: (1) if n is a cat-node with left child c1 and right child c2, then every
position in lastpos(c1) is followed by every position in firstpos(c2); and (2) if
n is a star-node, then every position in lastpos(n) is followed by every position
in firstpos(n). At the star-node, we add both 1 and 2 to followpos(1) and to
followpos(2) using rule (2). At the parent of the star-node, we add 3 to fol-
lowpos(1) and followpos(2) using rule (1). At the next cat-node, we add 4 to
followpos(3) using rule (1). At the next two cat-nodes we add 5 to fol-
lowpos(4) and 6 to followpos(5) using the same rule. This completes the con-
struction of followpos. Figure 3.42 summarizes followpos.

    Node n    followpos(n)
    1         {1, 2, 3}
    2         {1, 2, 3}
    3         {4}
    4         {5}
    5         {6}
    6         ∅

    Fig. 3.42. The function followpos.

Fig. 3.43. Directed graph for the function followpos.

It is interesting to note that this diagram would become an NFA without e-


transitions for the regular expression in question if we:

1.  make all positions in firstpos of the root be start states,

2.  label each directed edge (i, j) by the symbol at position j, and

3.  make the position associated with # be the only accepting state.

It should therefore come as no surprise that we can convert the followpos


graph into a DFA using the subset construction. The entire construction can
be carried out on the positions, using the following algorithm.

Algorithm 3.5. Construction of a DFA from a regular expression r.

Input. A regular expression r.

Output. A DFA D that recognizes L(r).

Method.

1. Construct a syntax tree for the augmented regular expression (r)#, where
# is a unique endmarker appended to (r).

2.  Construct the functions nullable, firstpos, lastpos, and followpos by mak-
    ing depth-first traversals of T.

3.  Construct Dstates, the set of states of D, and Dtran, the transition table
    for D, by the procedure in Fig. 3.44. The states in Dstates are sets of
    positions; initially, each state is "unmarked," and a state becomes
    "marked" just before we consider its out-transitions. The start state of D
    is firstpos(root), and the accepting states are all those containing the posi-
    tion associated with the endmarker #. □

Example 3.23. Let us construct a DFA for the regular expression (a|b)*abb.
The syntax tree for ((a|b)*abb)# is shown in Fig. 3.39(a). nullable is true
only for the node labeled *. The functions firstpos and lastpos are shown in
Fig. 3.41, and followpos is shown in Fig. 3.42.

From Fig. 3.41, firstpos of the root is {1, 2, 3}. Let this set be A and

    initially, the only unmarked state in Dstates is firstpos(root),
        where root is the root of the syntax tree for (r)#;
    while there is an unmarked state T in Dstates do begin
        mark T;
        for each input symbol a do begin
            let U be the set of positions that are in followpos(p)
                for some position p in T,
                such that the symbol at position p is a;
            if U is not empty and is not in Dstates then
                add U as an unmarked state to Dstates;
            Dtran[T, a] := U
        end
    end

    Fig. 3.44. Construction of DFA.

consider input symbol a. Positions 1 and 3 are for a, so let B =
followpos(1) ∪ followpos(3) = {1, 2, 3, 4}. Since this set has not yet been
seen, we set Dtran[A, a] := B.

When we consider input b, we note that of the positions in A, only 2 is
associated with b, so we must consider the set followpos(2) = {1, 2, 3}. Since
this set has already been seen, we do not add it to Dstates, but we add the
transition Dtran[A, b] := A.

We now continue with B = {1, 2, 3, 4}. The states and transitions we
finally obtain are the same as those that were shown in Fig. 3.39(b). □
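As a small illustration of Algorithm 3.5, the following Python sketch hard-codes
the positions of ((a|b)*abb)# (the symbol at each position and the followpos
table of Fig. 3.42) and runs the loop of Fig. 3.44. The data structures are
assumptions made for the sketch.

```python
from collections import deque

# Positions of ((a|b)*abb)#: symbol at each position, and followpos (Fig. 3.42).
symbol = {1: 'a', 2: 'b', 3: 'a', 4: 'b', 5: 'b', 6: '#'}
followpos = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: {5}, 5: {6}, 6: set()}
root_firstpos = frozenset({1, 2, 3})       # firstpos of the root (Fig. 3.41)

def build_dfa(alphabet):
    """The loop of Fig. 3.44: DFA states are sets of positions."""
    dstates, dtran = [root_firstpos], {}
    unmarked = deque([root_firstpos])
    while unmarked:
        T = unmarked.popleft()             # mark T
        for a in alphabet:
            U = frozenset().union(*([followpos[p] for p in T if symbol[p] == a]
                                    or [frozenset()]))
            if U and U not in dstates:
                dstates.append(U)
                unmarked.append(U)
            if U:
                dtran[(T, a)] = U
    accepting = [S for S in dstates if 6 in S]   # states containing # position
    return dstates, dtran, accepting

dstates, dtran, accepting = build_dfa('ab')
print(len(dstates), len(accepting))        # 4 states, 1 accepting, as in Fig. 3.39(b)
```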

Minimizing the Number of States of a DFA


An important theoretical result is that every regular set is recognized by a
minimum-state DFA that is unique up to state names. In this section, we
show how to construct this minimum-state DFA by reducing the number of
states in a given DFA to the bare minimum without affecting the language
that is being recognized. Suppose that we have a DFA M with set of states S
and input symbol alphabet Σ. We assume that every state has a transition on
every input symbol. If that were not the case, we can introduce a new "dead
state" d, with transitions from d to d on all inputs, and add a transition from
state s to d on input a if there was no transition from s on a.

We say that string w distinguishes state s from state t if, by starting with the
DFA M in state s and feeding it input w, we end up in an accepting state, but
starting in state t and feeding it input w, we end up in a nonaccepting state, or
vice versa. For example, e distinguishes any accepting state from any nonac-
cepting state, and in the DFA of Fig. 3.29, states A and B are distinguished by
the input bb, since A goes to the nonaccepting state C on input bb, while B
goes to the accepting state E on that same input.

Our algorithm for minimizing the number of states of a DFA works by


finding all groups of states that can be distinguished by some input string.
Each group of states that cannot be distinguished is then merged into a single
state. The algorithm works by maintaining and refining a partition of the set
of states. Each group of states within the partition consists of states that have
not yet been distinguished from one another, and any pair of states chosen
from different groups have been found distinguishable by some input.
Initially, the partition consists of two groups: the accepting states and the

nonaccepting states. The fundamental step is to take some group of states,


say A = {s1, s2, ..., sk}, and some input symbol a, and look at what transi-
tions states s1, ..., sk have on input a. If these transitions are to states that
fall into two or more different groups of the current partition, then we must
split A so that the transitions from the subsets of A are all confined to a single
group of the current partition. Suppose, for example, that s1 and s2 go to
states t1 and t2 on input a, and t1 and t2 are in different groups of the parti-
tion. Then we must split A into at least two subsets so that one subset con-
tains s1 and the other s2. Note that t1 and t2 are distinguished by some
string w, so s1 and s2 are distinguished by string aw.
We repeat this process of splitting groups in the current partition until no
more groups need to be split. While we have justified why states that have
been split into different groups really can be distinguished, we have not indi-
cated why states that are not split into different groups are certain not to be
distinguishable by any input string. Such is the case, however, and we leave a
proof of that fact to the reader interested in the theory (see, for example,
Hopcroft and Ullman [1979]). Also left to the interested reader is a proof
that the DFA constructed by taking one state for each group of the final parti-
tion and then throwing away the dead state and states not reachable from the
start state has as few states as any DFA accepting the same language.

Algorithm 3.6. Minimizing the number of states of a DFA.


Input. A DFA M with set of states S, set of inputs Σ, transitions defined for
all states and inputs, start state s0, and set of accepting states F.

Output. A DFA M' accepting the same language as M and having as few
states as possible.

Method.

1.  Construct an initial partition Π of the set of states with two groups: the
    accepting states F and the nonaccepting states S − F.

2.  Apply the procedure of Fig. 3.45 to Π to construct a new partition Πnew.

3.  If Πnew = Π, let Πfinal = Π and continue with step (4). Otherwise, repeat
    step (2) with Π := Πnew.

4.  Choose one state in each group of the partition Πfinal as the representative
    for that group. The representatives will be the states of the reduced DFA
    M'. Let s be a representative state, and suppose on input a there is a
    transition of M from s to t. Let r be the representative of t's group (r
    may be t). Then M' has a transition from s to r on a. Let the start state
    of M' be the representative of the group containing the start state s0 of
    M, and let the accepting states of M' be the representatives that are in F.
    Note that each group of Πfinal either consists only of states in F or has no
    states in F.

5.  If M' has a dead state, that is, a state d that is not accepting and that has
    transitions to itself on all input symbols, then remove d from M'. Also
    remove any states not reachable from the start state. Any transitions to d
    from other states become undefined. □

    for each group G of Π do begin
        partition G into subgroups such that two states s and t
            of G are in the same subgroup if and only if for all
            input symbols a, states s and t have transitions on a
            to states in the same group of Π;
        /* at worst, a state will be in a subgroup by itself */
        replace G in Πnew by the set of all subgroups formed
    end

    Fig. 3.45. Construction of Πnew.

Example 3.24. Let us reconsider the DFA represented in Fig. 3.29. The ini-
tial partition Π consists of two groups: (E), the accepting state, and (ABCD),
the nonaccepting states. To construct Πnew, the algorithm of Fig. 3.45 first
considers (E). Since this group consists of a single state, it cannot be split
further, so (E) is placed in Πnew. The algorithm then considers the group
(ABCD). On input a, each of these states has a transition to B, so they could
all remain in one group as far as input a is concerned. On input b, however,
A, B, and C go to members of the group (ABCD) of Π, while D goes to E, a
member of another group. Thus, in Πnew the group (ABCD) must be split into
two new groups (ABC) and (D); Πnew is thus (ABC)(D)(E).

In the next pass through the algorithm of Fig. 3.45, we again have no split-
ting on input a, but (ABC) must be split into two new groups (AC)(B), since
on input b, A and C each have a transition to C, while B has a transition to D,
a member of a group of the partition different from that of C. Thus the next
value of Π is (AC)(B)(D)(E).

In the next pass through the algorithm of Fig. 3.45, we cannot split any of
the groups consisting of a single state. The only possibility is to try to split
(AC). But A and C go to the same state B on input a, and they go to the same
state C on input b. Hence, after this pass, Πnew = Π. Πfinal is thus
(AC)(B)(D)(E).
144 LEXICAL ANALYSIS SEC. 3.9

If we choose A as the representative for the group (AC), and choose B, D,


and E for the singleton groups, then we obtain the reduced automaton whose
transition table is shown in Fig. 3.46. State A is the start state and state E is

the only accepting state.

        State    a    b
          A      B    A
          B      B    D
          D      B    E
          E      B    A

    Fig. 3.46. Transition table of reduced DFA.
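The partition-refinement idea of Algorithm 3.6 can be sketched in Python as
follows. The sketch omits step (5), removal of the dead state and unreachable
states, and the dictionary encoding of the DFA is an assumption made for
illustration.

```python
def minimize_dfa(states, alphabet, delta, accepting):
    """Sketch of Algorithm 3.6: refine the partition {F, S - F} until no
    group can be split, then read off one representative per group.
    delta maps (state, symbol) to a state and is assumed to be total."""
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        new_partition = []
        for G in partition:
            # group states of G by the tuple of groups their transitions enter
            subgroups = {}
            for s in G:
                key = tuple(
                    next(i for i, grp in enumerate(partition) if delta[(s, a)] in grp)
                    for a in alphabet)
                subgroups.setdefault(key, set()).add(s)
            new_partition.extend(subgroups.values())
        if len(new_partition) == len(partition):   # no group was split
            break
        partition = new_partition
    rep = {s: min(g) for g in partition for s in g}      # pick representatives
    reduced = {(rep[s], a): rep[delta[(s, a)]] for s in rep for a in alphabet}
    return reduced

# DFA of Fig. 3.29 (Example 3.24): states A-E, accepting state E.
delta = {('A','a'):'B', ('A','b'):'C', ('B','a'):'B', ('B','b'):'D',
         ('C','a'):'B', ('C','b'):'C', ('D','a'):'B', ('D','b'):'E',
         ('E','a'):'B', ('E','b'):'C'}
print(minimize_dfa('ABCDE', 'ab', delta, {'E'}))   # the 4-state DFA of Fig. 3.46
```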

Table-Compression Methods

A two-dimensional table, indexed by states and characters, provides the fastest
access, but it can take up too much space (say several hundred states by 128
characters). A more compact but slower scheme is to use a linked list to store
the transitions out of each state, with a "default" transition at the end of the
list. The most frequently occurring transition is one obvious choice for the
default.

There is a more subtle implementation that combines the fast access of the
array representation with the compactness of the list structures. Here we use
a data structure consisting of four arrays indexed by state numbers, as depicted
in Fig. 3.47. The base array is used to determine the base location of the

entries for each state stored in the next and check arrays. The default array is
used to determine an alternative base location in case the current base location
is invalid.

Fig. 3.47. Data structure for representing transition tables: the four arrays
default, base, next, and check, indexed by state numbers.


example, state q, the default for state s, might be the state that says we are
"working on an identifier," such as state 10 in Fig. 3.13. Perhaps s is entered
after seeing th, a prefix of the keyword then as well as a prefix of an identif-
ier. On input character e we must go to a special state that remembers we
have seen the, but otherwise state s behaves as state q does. Thus, we set
check[base[s] + e] to s and next[base[s] + e] to the state for the.

While we may not be able to choose base values so that no next-check


entries remain unused, experience shows that the simple strategy of setting the
base to the lowest number such that the special entries can be filled in without
conflicting with existing entries is fairly good and utilizes little more space
than the minimum possible.
We can shorten check into an array indexed by states if the DFA has the
property that the incoming edges to each state t all have the same label a. To
implement this scheme, we set check\t\ = a and replace the test on line 2 of
procedure next state by

\{ check\next\base[s\ + a]] = a then
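The nextstate lookup supported by these arrays can be sketched in Python as
follows. Since the book's nextstate procedure is not reproduced in the text
above, this is a reconstruction of the usual default/base/next/check scheme,
and the tiny data set below is hypothetical, constructed only for illustration.

```python
def nextstate(s, a, base, next_, check, default):
    """Look up the transition from state s on character code a: if the
    check entry confirms the slot belongs to s, use it; otherwise defer
    to s's default state.  In a real table the chain of defaults always
    ends at a state whose entries are all present."""
    slot = base[s] + a
    if 0 <= slot < len(check) and check[slot] == s:
        return next_[slot]
    return nextstate(default[s], a, base, next_, check, default)

# Hypothetical example: state 1 is the "default" state with entries for
# character codes 0..2; state 0 overrides only character code 2.
base    = [0, 3]
default = [1, 1]
next_   = [0, 0, 5, 7, 8, 9]     # next_[0+2] = 5 is state 0's special entry
check   = [-1, -1, 0, 1, 1, 1]   # check records which state owns each slot
print(nextstate(0, 2, base, next_, check, default))   # 5 (state 0's own entry)
print(nextstate(0, 1, base, next_, check, default))   # 8 (falls back to state 1)
```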

EXERCISES

3.1 What is the input alphabet of each of the following languages?


a) Pascal
b) C
c) Fortran 77
d) Ada
e) Lisp

3.2 What are the conventions regarding the use of blanks in each of the
languages of Exercise 3.1?

3.3 Identify the lexemes that make up the tokens in the following pro-
grams. Give reasonable attribute values for the tokens.
     a) Pascal

        function max(i, j: integer): integer;
        { return maximum of integers i and j }
        begin
            if i > j then max := i
            else max := j
        end;

     b) C

        int max(i, j)
        int i, j;
        /* return maximum of integers i and j */
        {
            return i > j ? i : j;
        }

CHAPTER 3 EXERCISES 147

c) Fortran 77

        FUNCTION MAX(I, J)

C RETURN MAXIMUM OF INTEGERS I AND J


IF (I .GT. J) THEN
MAX = I
ELSE
MAX = J
END IF
RETURN
3.4 Write a program for the function nextchar( ) of Section 3.4 using
the buffering scheme with sentinels described in Section 3.2.

3.5 In a string of length n, how many of the following are there?


a) prefixes
b) suffixes
c) substrings
d) proper prefixes
e) subsequences

*3.6 Describe the languages denoted by the following regular expressions:


a) 0(0|1)*0
b) ((€|0)1*)*
c) (0|1)*0(0|1)(0|1)
d) 0*10*10*10*
     e) (00|11)*((01|10)(00|11)*(01|10)(00|11)*)*

*3.7 Write regular definitions for the following languages.


a) All strings of letters that contain the five vowels in order.
b) All strings of letters in which the letters are in ascending lexico-
graphic order.
     c) Comments consisting of a string surrounded by /* and */, without
        an intervening */, unless it appears inside the quotes " and ".
    *d) All strings of digits with no repeated digit.
     e) All strings of digits with at most one repeated digit.
     f) All strings of 0's and 1's with an even number of 0's and an odd
        number of 1's.
     g) The set of chess moves, such as p-k4 or kbp×qn.
     h) All strings of 0's and 1's that do not contain the substring 011.
     i) All strings of 0's and 1's that do not contain the subsequence 011.

3.8 Specify the lexical form of numeric constants in the languages of


Exercise 3.1.

3.9 Specify the lexical form of identifiers and keywords in the languages
of Exercise 3.1.
148 LEXICAL ANALYSIS CHAPTER 3

3.10 The regular expression constructs permitted by Lex are listed in Fig.
3.48 in decreasing order of precedence. In this table, c stands for any
single character, r for a regular expression, and s for a string.

     [Fig. 3.48, the table of Lex regular expression constructs, is not reproduced here.]

     c) The regular expression r{m,n} matches from m to n occurrences
        of the pattern r. For example, a{1,5} matches a string of one to
        five a's. Show that for every regular expression containing repeti-
        tion operators there is an equivalent regular expression without
        repetition operators.
     d) The operator ^ matches the leftmost end of a line. This is the
        same operator that introduces a complemented character class, but
        the context in which ^ appears will always determine a unique
        meaning for this operator. The operator $ matches the rightmost
        end of a line. For example, ^[^aeiou]*$ matches any line that
        does not contain a lower-case vowel. For every regular expression
        containing the ^ and $ operators, is there an equivalent regular
        expression without these operators?

3.11 Write a Lex program that copies a file, replacing each nonnull
     sequence of white space by a single blank.

3.12 Write a Lex program that copies a Fortran program, replacing all
     instances of DOUBLE PRECISION by REAL.

3.13 Use your specification for keywords and identifiers for Fortran 77
     from Exercise 3.9 to identify the tokens in the following statements:

         IF(I) = TOKEN
         IF(I) ASSIGN5TOKEN
         IF(I) 10,20,30
         IF(I) GOTO 15
         IF(I) THEN

     Can you write your specification for keywords and identifiers in Lex?
3.14 In the UNIX system, the shell command sh uses the operators in Fig.
     3.49 in filename expressions to describe sets of filenames. For exam-
     ple, the filename expression *.o matches all filenames ending in .o;
     sort.? matches all filenames of the form sort.c, where c is any
     character. Character classes may be abbreviated as in [a-z].
     Show how shell filename expressions can be represented by regular
     expressions.

3.15 Modify Algorithm 3.1 to find the longest prefix of the input that is
     accepted by the DFA.

3.16 Construct nondeterministic finite automata for the following regular
     expressions using Algorithm 3.3. Show the sequence of moves made
     by each in processing the input string ababbab.
     a) (a|b)*
     b) (a*|b*)*

[Fig. 3.49, the table of shell filename expression operators, is not reproduced here.]

3.24 Construct the representation of Fig. 3.47 for the transition table of
Exercise 3.19. Pick default states and try the following two methods
of constructing the next array and compare the amounts of space used:
a) Starting with the densest states (those with the largest number of
entries differing from their default states) first, place the entries
for the states in the next array.
b) Place the entries for the states in the next array in a random order.

3.25 A variant of the table compression scheme of Section 3.9 would be to


avoid a recursive nextstate procedure by using a fixed default location
for each state. Construct the representation of Fig. 3.47 for the tran-
sition table of Exercise 3.19 using this nonrecursive technique. Com-
pare the space requirements with those of Exercise 3.24.

3.26 Let b1 b2 ··· bm be a pattern string, called a keyword. A trie for a
     keyword is a transition diagram with m + 1 states in which each state
     corresponds to a prefix of the keyword. For 1 ≤ s ≤ m, there is a
     transition from state s − 1 to state s on symbol b_s. The start and final
     states correspond to the empty string and the complete keyword,
     respectively. The trie for the keyword ababaa is:

         0 --a--> 1 --b--> 2 --a--> 3 --b--> 4 --a--> 5 --a--> 6


     /* compute failure function f for b1 ··· bm */
     t := 0; f(1) := 0;
     for s := 1 to m − 1 do begin
         while t > 0 and b_{s+1} ≠ b_{t+1} do t := f(t);
         if b_{s+1} = b_{t+1} then begin t := t + 1; f(s+1) := t end
         else f(s+1) := 0
     end

     Fig. 3.50. Algorithm to compute failure function for Exercise 3.26.

3.27 Algorithm KMP in Fig. 3.51 uses the failure function f constructed as
     in Exercise 3.26 to determine whether keyword b1 ··· bm is a sub-
     string of a target string a1 ··· an. States in the trie for b1 ··· bm
     are numbered from 0 to m as in Exercise 3.26(b).

         /* does a1 ··· an contain b1 ··· bm as a substring */
         s := 0;
         for i := 1 to n do begin
             while s > 0 and a_i ≠ b_{s+1} do s := f(s);
             if a_i = b_{s+1} then s := s + 1;
             if s = m then return "yes"
         end;
         return "no"

     Fig. 3.51. Algorithm KMP.
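For concreteness, the two figures above translate directly into Python
(0-indexed strings rather than the 1-indexed notation of the figures). This
transliteration is offered as an illustration of the given algorithms, not as the
intended solution to the exercise parts that follow.

```python
def failure(b):
    """Failure function of Fig. 3.50: f[s] is the length of the longest
    proper prefix of b[0:s] that is also a suffix of b[0:s]."""
    m = len(b)
    f = [0] * (m + 1)          # f[1..m]; f[0] unused
    t = 0
    for s in range(1, m):
        while t > 0 and b[s] != b[t]:
            t = f[t]
        if b[s] == b[t]:
            t += 1
            f[s + 1] = t
        else:
            f[s + 1] = 0
    return f

def kmp(a, b):
    """Algorithm KMP of Fig. 3.51: is keyword b a substring of a?"""
    f, s, m = failure(b), 0, len(b)
    for c in a:
        while s > 0 and c != b[s]:
            s = f[s]
        if c == b[s]:
            s += 1
        if s == m:
            return "yes"
    return "no"

print(failure("ababaa")[1:])       # [0, 0, 1, 2, 3, 1]
print(kmp("ababab", "ababaa"))     # "no"
print(kmp("ababaa", "ababaa"))     # "yes"
```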

      a) Apply Algorithm KMP to determine whether ababaa is a substring
         of abababaab.
     *b) Prove that Algorithm KMP returns "yes" if and only if b1 ··· bm
         is a substring of a1 ··· an.
     *c) Show that Algorithm KMP runs in O(m + n) time.
     *d) Given a keyword y, show that the failure function can be used to
         construct, in O(|y|) time, a DFA with |y| + 1 states for the regu-
         lar expression .*y.*, where . stands for any input character.

**3.28 Define the period of a string s to be an integer p such that s can be
       expressed as (uv)^k u, for some k > 0, where |uv| = p and v is not the
       empty string. For example, 2 and 4 are periods of the string abababa.
       a) Show that p is a period of a string s if and only if st = us for some
          strings t and u of length p.
       b) Show that if p and q are periods of a string s and if p + q ≤
          |s| + gcd(p, q), then gcd(p, q) is a period of s, where gcd(p, q) is
          the greatest common divisor of p and q.


       c) Let sp(s_j) be the smallest period of the prefix s_j of length j of a
          string s. Show that the failure function f has the property that
          f(j) = j − sp(s_j).

*3.29 Let the shortest repeating prefix of a string s be the shortest prefix u of
      s such that s = u^k for some k ≥ 1. For example, ab is the shortest
      repeating prefix of abababab and aba is the shortest repeating prefix
      of aba. Construct an algorithm that finds the shortest repeating pre-
      fix of a string s in O(|s|) time. Hint: Use the failure function of
      Exercise 3.26.

3.30 A Fibonacci string is defined as follows:

         s1 = b
         s2 = a
         sk = s_{k−1} s_{k−2}   for k > 2.

     For example, s3 = ab, s4 = aba, and s5 = abaab.

      a) What is the length of sn?
    **b) What is the smallest period of sn?
      c) Construct the failure function for s6.
     *d) Using induction, show that the failure function for sn can be
         expressed by f(j) = j − |s_{k−1}|, where k is such that
         |s_k| ≤ j + 1 < |s_{k+1}|, for 1 ≤ j ≤ |s_n|.
      e) Apply Algorithm KMP to determine whether s6 is a substring of
         the target string s7.
      f) Construct a DFA for the regular expression .*s6.*.
    **g) In Algorithm KMP, what is the maximum number of consecutive
         applications of the failure function executed in determining
         whether s_k is a substring of the target string s_{k+1}?

3.31 We can extend the trie and failure function concepts of Exercise 3.26
     from a single keyword to a set of keywords as follows. Each state in
     the trie corresponds to a prefix of one or more keywords. The start
     state corresponds to the empty string, and a state that corresponds to
     a complete keyword is a final state. Additional states may be made
     final during the computation of the failure function. The transition
     diagram for the set of keywords {he, she, his, hers} is shown in
     Fig. 3.52.

     For the trie we define a transition function g that maps state-symbol
     pairs to states such that g(s, b_{j+1}) = s' if state s corresponds to a
     prefix b1 ··· bj of some keyword and s' corresponds to the prefix
     b1 ··· bj b_{j+1}. If s0 is the start state, we define g(s0, a) = s0 for
     all input symbols a that are not the initial symbol of any keyword.
     We then set g(s, a) = fail for any transition not defined. Note that
     there are no fail transitions for the start state.


Fig. 3.52. Trie for keywords {he, she, his, hers}.

     Suppose states s and t represent prefixes u and v of some keywords.
     Then, we define f(s) = t if and only if v is the longest proper suffix
     of u that is also the prefix of some keyword. The failure function f
     for the transition diagram above is

         s      1  2  3  4  5  6  7  8  9
         f(s)   0  0  0  1  2  0  3  0  3

*b) Show that the algorithm in Fig. 3.53 correctly computes the failure
function.
*c) Show that the failure function can be computed in time propor-
tional to the sum of the lengths of the keywords.

3.32 Let g be the transition function and f the failure function of Exercise
     3.31 for a set of keywords K = {y1, y2, ..., yk}. Algorithm AC in
     Fig. 3.54 uses g and f to determine whether a target string a1 ··· an
     contains a substring that is a keyword. State s0 is the start state of
     the transition diagram for K, and F is the set of final states.

         /* does a1 ··· an contain a keyword as a substring */
         s := s0;
         for i := 1 to n do begin
             while g(s, a_i) = fail do s := f(s);
             s := g(s, a_i);
             if s is in F then return "yes"
         end;
         return "no"

     Fig. 3.54. Algorithm AC.

      a) Apply Algorithm AC to the input string ushers using the transi-
         tion and failure functions of Exercise 3.31.
     *b) Prove that Algorithm AC returns "yes" if and only if some key-
         word yi is a substring of a1 ··· an.
     *c) Show that Algorithm AC makes at most 2n state transitions in pro-
         cessing an input string of length n.
     *d) Show that from the transition diagram and failure function for a
         set of keywords {y1, y2, ..., yk} a DFA with at most Σi |yi| + 1
         states can be constructed in linear time for the regular expression
         .*(y1|y2|···|yk).*.
      e) Modify Algorithm AC to print out each keyword found in the tar-
         get string.

3.33 Use the algorithm in Exercise 3.32 to construct a lexical analyzer for
     the keywords in Pascal.

3.34 Define lcs(x, y), a longest common subsequence of two strings x and
     y, to be a string that is a subsequence of both x and y and is as long
     as any such subsequence. For example, tie is a longest common
     subsequence of striped and tiger. Define d(x, y), the distance
     between x and y, to be the minimum number of insertions and dele-
     tions required to transform x into y. For example, d(striped,
     tiger) = 6.

      a) Show that for any two strings x and y, the distance between x and
         y and the length of their longest common subsequence are related
         by d(x, y) = |x| + |y| − 2·|lcs(x, y)|.
     *b) Write an algorithm that takes two strings x and y as input and pro-
         duces a longest common subsequence of x and y as output.

3.35 Define e(x, y), the edit distance between two strings x and y, to be
the minimum number of character insertions, deletions, and replacements
that are required to transform x into y. Let x = a₁ ··· aₘ and
y = b₁ ··· bₙ. e(x, y) can be computed by a dynamic programming
algorithm using a distance array d[0..m, 0..n] in which d[i, j] is the
edit distance between a₁ ··· aᵢ and b₁ ··· bⱼ. The algorithm in Fig.
3.55 can be used to compute the d matrix. The function repl is just
the cost of a character replacement: repl(aᵢ, bⱼ) = 0 if aᵢ = bⱼ, 1 otherwise.

    for i := 0 to m do d[i, 0] := i;
    for j := 1 to n do d[0, j] := j;
    for i := 1 to m do
        for j := 1 to n do
            d[i, j] := min(d[i-1, j-1] + repl(aᵢ, bⱼ),
                           d[i-1, j] + 1,
                           d[i, j-1] + 1)

Fig. 3.55. Algorithm to compute edit distance between two strings.
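The following C sketch is one way to realize the algorithm of Fig. 3.55; the function name edit_distance and the fixed array bound are assumptions made for the example.

    #include <string.h>

    #define MAXLEN 100

    static int min3(int a, int b, int c)
    {
        int m = a < b ? a : b;
        return m < c ? m : c;
    }

    /* Edit distance between x = a1...am and y = b1...bn, as in Fig. 3.55. */
    int edit_distance(const char *x, const char *y)
    {
        int m = strlen(x), n = strlen(y);
        int d[MAXLEN + 1][MAXLEN + 1];

        if (m > MAXLEN || n > MAXLEN) return -1;   /* sketch: refuse long inputs */

        for (int i = 0; i <= m; i++) d[i][0] = i;
        for (int j = 1; j <= n; j++) d[0][j] = j;

        for (int i = 1; i <= m; i++)
            for (int j = 1; j <= n; j++) {
                int repl = (x[i-1] == y[j-1]) ? 0 : 1;  /* cost of replacing a_i by b_j */
                d[i][j] = min3(d[i-1][j-1] + repl,      /* replace (or keep) */
                               d[i-1][j] + 1,           /* delete a_i */
                               d[i][j-1] + 1);          /* insert b_j */
            }
        return d[m][n];
    }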

a) What is the relation between the distance metric of Exercise 3.34
   and edit distance?

b) Use the algorithm in Fig. 3.55 to compute the edit distance
   between ababb and babaaa.

c) Construct an algorithm that prints out the minimal sequence of
   editing transformations required to transform x into y.

3.36 Give an algorithm that takes as input a string x and a regular expression
r, and produces as output a string y in L(r) such that d(x, y) is
as small as possible, where d is the distance function in Exercise 3.34.

PROGRAMMING EXERCISES
P3.1 Write a lexical analyzer in Pascal or C for the tokens shown in Fig.
3.10.

P3.2 Write a specification for the tokens of Pascal and from this specifica-
tion construct transition diagrams. Use the transition diagrams to
implement a lexical analyzer for Pascal in a language like C or Pascal.

P3.3 Complete the Lex program in Fig. 3.18. Compare the size and speed
of the resulting lexical analyzer produced by Lex with the program
written in Exercise P3.1.

P3.4 Write a Lex specification for the tokens of Pascal and use the Lex
compiler to construct a lexical analyzer for Pascal.

P3.5 Write a program that takes as input a regular expression and the
name of a file, and produces as output all lines of the file that contain
a substring denoted by the regular expression.

P3.6 Add an error recovery scheme to the Lex program in Fig. 3.18 to
enable it to continue to look for tokens in the presence of errors.

P3.7 Program a lexical analyzer from the DFA constructed in Exercise 3.18
and compare this lexical analyzer with that constructed in Exercises
P3.1 and P3.3.

P3.8 Construct a tool that produces a lexical analyzer from a regular


expression description of a set of tokens.

BIBLIOGRAPHIC NOTES
The restrictions imposed on the lexical aspects of a language are often deter-
mined by the environment in which the language was created. When Fortran
was designed in 1954, punched cards were a common input medium. Blanks
were ignored in Fortran partially because keypunchers, who prepared cards
from handwritten notes, tended to miscount blanks (Backus [1981]). Algol
58's separation of the hardware representation from the reference language
was a compromise reached after a member of the design committee insisted,
"No! I will never use a period for a decimal point." (Wegstein [1981]).
Knuth [1973a] presents additional techniques for buffering input. Feldman
[1979b] discusses the practical difficulties of token recognition in Fortran 77.
Regular expressions were first studied by Kleene [1956], who was interested
in describing the events that could be represented by the McCulloch and Pitts
[1943] finite automaton model of nervous activity. The minimization of finite
automata was first studied by Huffman [1954] and Moore [1956]. The
equivalence of deterministic and nondeterministic automata as far as their
ability to recognize languages was shown by Rabin and Scott [1959].
McNaughton and Yamada [1960] describe an algorithm to construct a DFA
directly from a regular expression. More of the theory of regular expressions
can be found in Hopcroft and Ullman [1979].
It was quickly appreciated that tools to build lexical analyzers from regular

expression specifications would be useful in the implementation of compilers.


Johnson et al. [1968] discuss an early such system. Lex, the language discussed
in this chapter, is due to Lesk [1975], and has been used to construct
lexical analyzers for many compilers using the UNIX system. The compact
implementation scheme in Section 3.9 for transition tables is due to S. C.

Johnson, who first used it in the implementation of the Yacc parser generator
(Johnson [1975]). Other table-compression schemes are discussed and
evaluated in Dencker, Dürre, and Heuft [1984].
The problem of compact implementation of transition tables has been
theoretically studied in a general setting by Tarjan and Yao [1979] and by
Fredman, Komlós, and Szemerédi [1984]. Cormack, Horspool, and
Kaiserswerth [1985] present a perfect hashing algorithm based on this work.
Regular expressions and finite automata have been used in many applica-
tions other than compiling. Many text editors use regular expressions for con-

text searches. Thompson [1968], for example, describes the construction of an
NFA from a regular expression (Algorithm 3.3) in the context of the QED
text editor. The UNIX system has three general-purpose regular expression
searching programs: grep, egrep, and fgrep. grep does not allow union
or parentheses for grouping in its regular expressions, but it does allow a limited
form of backreferencing as in Snobol. grep employs Algorithms 3.3 and
3.4 to search for its regular expression patterns. The regular expressions in
egrep are similar to those in Lex, except for iteration and lookahead.
egrep uses a DFA with lazy state construction to look for its regular expression
patterns, as outlined in Section 3.7. fgrep looks for patterns consisting
of sets of keywords using the algorithm in Aho and Corasick [1975], which is
discussed in Exercises 3.31 and 3.32. Aho [1980] discusses the relative performance
of these programs.
Regular expressions have been widely used in text retrieval systems, in
database query languages, and in file processing languages like AWK (Aho,
Kernighan, and Weinberger [1979]). Jarvis [1976] used regular expressions to
describe imperfections in printed circuits. Cherry [1982] used the keyword-matching
algorithm in Exercise 3.32 to look for poor diction in manuscripts.
The string pattern matching algorithm in Exercises 3.26 and 3.27 is from
Knuth, Morris, and Pratt [1977]. This paper also contains a good discussion
of periods in strings. Another efficient algorithm for string matching was
invented by Boyer and Moore [1977], who showed that a substring match can
usually be determined without having to examine all characters of the target
string. Hashing has also been shown to be an effective technique for string pattern
matching (Harrison [1971]).
The notion of a longest common subsequence discussed in Exercise 3.34 has
been used in the design of the UNIX system file comparison program diff
(Hunt and McIlroy [1976]). An efficient practical algorithm for computing
longest common subsequences is described in Hunt and Szymanski [1977].
The algorithm for computing minimum edit distances in Exercise 3.35 is from
Wagner and Fischer [1974]. Wagner [1974] contains a solution to Exercise
3.36. Sankoff and Kruskal [1983] contains a fascinating discussion of the
broad range of applications of minimum distance recognition algorithms, from
the study of patterns in genetic sequences to problems in speech processing.
CHAPTER 4

Syntax Analysis

Every programming language has rules that prescribe the syntactic structure of
well-formed programs. In Pascal, for example, a program is made out of
blocks, a block out of statements, a statement out of expressions, an expres-
sion out of tokens, and so on. The syntax of programming language con-
structs can be described by context-free grammars or BNF (Backus-Naur
Form) notation, introduced in Section 2.2. Grammars offer significant advan-
tages to both language designers and compiler writers.

• A grammar gives a precise, yet easy-to-understand, syntactic specification


of a programming language.

• From certain classes of grammars we can automatically construct an effi-

cient parser that determines if a source program is syntactically well


formed. As an additional benefit, the parser construction process can
reveal syntactic ambiguities and other difficult-to-parse constructs that
might otherwise go undetected in the initial design phase of a language
and its compiler.

• A properly designed grammar imparts a structure to a programming


language that is useful for the translation of source programs into correct
object code and for the detection of errors. Tools are available for con-
verting grammar-based descriptions of translations into working pro-
grams.

• Languages evolve over a period of time, acquiring new constructs and


performing additional tasks. These new constructs can be added to a
language more easily when there is an existing implementation based on a
grammatical description of the language.

The bulk of this chapter is devoted to parsing methods that are typically
used in compilers. We first present the basic concepts, then techniques that
are suitable for hand implementation, and finally algorithms that have been
used in automated tools. Since programs may contain syntactic errors, we
extend the parsing methods so they recover from commonly occurring errors.

4.1 THE ROLE OF THE PARSER


In our compiler model, the parser obtains a string of tokens from the lexical
analyzer, as shown in Fig. 4.1, and verifies that the string can be generated by
the grammar for the source language. We expect the parser to report any
syntax errors in an intelligible fashion. It should also recover from commonly
occurring errors so that it can continue processing the remainder of its input.

In the remainder of this section, we consider the nature of syntactic errors


and general strategies for error recovery. Two of these strategies, called
panic-mode and phrase-level recovery, are discussed in more detail together
with the individual parsing methods. The implementation of each strategy
calls upon the compiler writer's judgment, but we shall give some hints

regarding approach.

Syntax Error Handling

If a compiler had to process only correct programs,


its design and implementa-

tion would be greatly simplified. But programmers frequently write incorrect


programs, and a good compiler should assist the programmer in identifying
and locating errors. It is striking that although errors are so commonplace,
few languages have been designed with error handling in mind. Our civiliza-
tion would be radically different if spoken languages had the same require-
ment for syntactic accuracy as computer languages. Most programming
language specifications do not describe how a compiler should respond to
errors; the response is left to the compiler designer. Planning the error han-
dling right from the start can both simplify the structure of a compiler and
improve its response to errors.
We know that programs can contain errors at many different levels. For
example, errors can be

• lexical, such as misspelling an identifier, keyword, or operator
• syntactic, such as an arithmetic expression with unbalanced parentheses
• semantic, such as an operator applied to an incompatible operand
• logical, such as an infinitely recursive call

Often much of the error detection and recovery in a compiler is centered


around the syntax analysis phase. One reason for this is that many errors are
syntactic in nature or are exposed when the stream of tokens coming from the
lexical analyzer disobeys the grammatical rules defining the programming
language. Another is the precision of modern parsing methods; they can
detect the presence of syntactic errors in programs very efficiently. Accu-
rately detecting the presence of semantic and logical errors at compile time is
a much more difficult task. In this section, we present a few basic techniques
for recovering from syntax errors; their implementation is discussed in con-
junction with the parsing methods in this chapter.
The error handler in a parser has simple-to-state goals:

• It should report the presence of errors clearly and accurately.

• It should recover from each error quickly enough to be able to detect sub-
sequent errors.

• It should not significantly slow down the processing of correct programs.

The effective realization of these goals presents difficult challenges.


Fortunately, common errors are simple ones and a relatively straightforward

error-handling mechanism often suffices. In some cases, however, an error


may have occurred long before the position at which its presence is detected,
and the precise nature of the error may be very difficult to deduce. In diffi-
cult cases, the error handler may have to guess what the programmer had in
mind when the program was written.
Several parsing methods, such as the LL and LR methods, detect an error
as soon as possible. More precisely, they have the viable-prefix property,
meaning they detect that an error has occurred as soon as they see a prefix of
the input that is not a prefix of any string in the language.

Example 4.1. To gain an appreciation of the kinds of errors that occur in


practice, let us examine the errors Ripley and Druseikis [1978] found in a
sample of student Pascal programs.
They discovered that errors do not occur that frequently; 60% of the pro-
grams compiled were syntactically and semantically correct. Even when
errors did occur, they were quite sparse; 80% of the statements having errors
had only one error, 13% had two. Finally, most errors were trivial; 90%
were single token errors.
Many of the errors could be classified simply: 60% were punctuation errors,
20% operator and operand errors, 15% keyword errors, and the remaining
five per cent other kinds. The bulk of the punctuation errors revolved around
the incorrect use of semicolons.
For some concrete examples, consider the following Pascal program.

(1)   program prmax(input, output);
(2)   var
(3)       x, y: integer;

(4)   function max(i: integer; j: integer): integer;
(5)       { return maximum of integers i and j }
(6)   begin
(7)       if i > j then max := i
(8)       else max := j
(9)   end;

(10)  begin
(11)      readln(x, y);
(12)      writeln(max(x, y))
(13)  end.

A common punctuation error is to use a comma in place of the semicolon in
the argument list of a function declaration (e.g., using a comma in place of
the first semicolon on line (4)); another is to leave out a mandatory semicolon
at the end of a line (e.g., the semicolon at the end of line (4)); another is to
put an extraneous semicolon at the end of a line before an else (e.g., putting
in a semicolon at the end of line (7)).


Perhaps one reason why semicolon errors are so common is that the use of
semicolons varies greatly from one language to another. In Pascal, a

semicolon is a statement separator; in PL/I and C, it is a statement terminator.
Some studies have suggested that the latter usage is less error prone
(Gannon and Horning [1975]).
A typical example of an operator error is to leave out the colon from :=.
Misspellings of keywords are usually rare, but leaving out the i from
writeln would be a representative example.
Many Pascal compilers have no difficulty handling common insertion, dele-
tion, and mutation errors. In fact, several Pascal compilers will correctly com-
pile the above program with a common punctuation or operator error; they
will issue only a warning diagnostic, pinpointing the offending construct.
However, another common type of error is much more difficult to repair
correctly. This is a missing begin or end (e.g., line (9) missing). Most
compilers would not try to repair this kind of error.

How should an error handler report the presence of an error? At the very
least, it should report the place in the source program where an error is

detected because there is a good chance that the actual error occurred within
the previous few tokens. A common strategy employed by many compilers is

to print the offending line with a pointer to the position at which an error is

detected. If there is a reasonable likelihood of what the error actually is, an


informative, understandable diagnostic message is also included; e.g., "semi-
colon missing at this position."

Once an error is detected, how should the parser recover? As we shall see,
there are a number of general strategies, but no one method clearly dom-
inates. In most cases, it is not adequate for the parser to quit after detecting
the first error, because subsequent processing of the input may reveal addi-
tional errors. Usually, there is some form of error recovery in which the
parser attempts to restore itself to a state where processing of the input can
continue with a reasonable hope that correct input will be parsed and other-
wise handled correctly by the compiler.
An inadequate job of recovery may introduce an annoying avalanche of
"spurious" errors, those that were not made by the programmer, but were
introduced by the changes made to the parser state during error recovery. In
a similar vein, syntactic error recovery may introduce spurious semantic errors
that will later be detected by the semantic analysis or code generation phases.
For example, in recovering from an error, the parser may skip a declaration
of some variable, say zap. When zap is later encountered in expressions,
there is nothing syntactically wrong, but since there is no symbol-table entry
for zap, a message "zap undefined" is generated.
A conservative strategy for a compiler is to inhibit error messages that stem
from errors uncovered too close together in the input stream. After discovering
one syntax error, the compiler should require several tokens to be parsed
successfully before permitting another error message. In some cases, there
may be too many errors for the compiler to continue sensible processing. (For
example, how should a Pascal compiler respond to a Fortran program as

input?) It seems that an error-recovery strategy has to be a carefully con-


sidered compromise, taking into account the kinds of errors that are likely to
occur and reasonable to process.
As we have mentioned, some compilers attempt error repair, a process in
which the compiler attempts to guess what the programmer intended to write.
The PL/C compiler (Conway and Wilcox [1973]) is an example of this type of
compiler. Except possibly in an environment of short programs written by
beginning students, extensive error repair is not likely to be cost effective. In

fact, with the increasing emphasis on interactive computing and good pro-
gramming environments, the trend seems to be toward simple error-recovery
mechanisms.

Error-Recovery Strategies

There are many different general strategies that a parser can employ to
recover from a syntactic error. Although no one strategy has proven itself to
be universally acceptable, a few methods have broad applicability. Here we
introduce the following strategies:

• panic mode
• phrase level
• error productions
• global correction

Panic-mode recovery. This is the simplest method to implement and can be


used by most parsing methods. On discovering an error, the parser discards
input symbols one at a time until one of a designated set of synchronizing
tokens is found. The synchronizing tokens are usually delimiters, such as
semicolon or end, whose role in the source program is clear. The compiler
designer must select the synchronizing tokens appropriate for the source
language, of course. While panic-mode correction often skips a considerable
amount of input without checking it for additional errors, it has the advantage
of simplicity and, unlike some other methods to be considered later, it is
guaranteed not to go into an infinite loop. In situations where multiple errors
in the same statement are rare, this method may be quite adequate.
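As a rough illustration of panic-mode recovery, the following C fragment shows the skeleton of such a skip loop; the token names, the next_token routine, and the particular choice of synchronizing tokens are invented for the sketch and are not from the text.

    /* Hypothetical token codes; SEMI and END serve as synchronizing tokens. */
    enum token { SEMI, END, EOF_TOK, OTHER };

    extern enum token next_token(void);   /* assumed lexical-analyzer interface */

    /* On detecting an error, discard input tokens until a synchronizing
     * token (or end of input) is found, then resume parsing from there. */
    enum token recover(enum token lookahead)
    {
        while (lookahead != SEMI && lookahead != END && lookahead != EOF_TOK)
            lookahead = next_token();
        return lookahead;
    }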
Phrase-level recovery. On discovering an error, a parser may perform local

correction on the remaining input; that is, it may replace a prefix of the
remaining input by some string that allows the parser to continue. A typical
local correction would be to replace a comma by a semicolon, delete an
extraneous semicolon, or insert a missing semicolon. The choice of the local
correction is left to the compiler designer. Of course, we must be careful to
choose replacements that do not lead to infinite loops, as would be the case,
for example, if we always inserted something on the input ahead of the
current input symbol.
This type of replacement can correct any input string and has been used in
several error-repairing compilers. The method was first used with top-down
parsing. Its major drawback is the difficulty it has in coping with situations in

which the actual error has occurred before the point of detection.
Error productions. If we have a good idea of the common errors that might
be encountered, we can augment the grammar for the language at hand with
productions that generate the erroneous constructs. We then use the grammar
augmented by these error productions to construct a parser. If an error pro-
duction is used by the parser, we can generate appropriate error diagnostics to
indicate the erroneous construct that has been recognized in the input.
Global correction. Ideally, we would like a compiler to make as few
changes as possible in processing an incorrect input string. There are algo-
rithms for choosing a minimal sequence of changes to obtain a globally least-
cost correction. Given an incorrect input string x and grammar G, these algo-
rithms will find a parse tree for a related string y, such that the number of
insertions, deletions, and changes of tokens required to transform x into y is
as small as possible. Unfortunately, these methods are in general too costly to
implement in terms of time and space, so these techniques are currently only
of theoretical interest.
We should point out that a closest correct program may not be what the
programmer had in mind. Nevertheless, the notion of least-cost correction
does provide a yardstick for evaluating error-recovery techniques, and it has
been used for finding optimal replacement strings for phrase-level recovery.

4.2 CONTEXT-FREE GRAMMARS


Many programming language constructs have an inherently recursive structure
that can be defined by context-free grammars. For example, we might have a
conditional statement defined by a rule such as

    If S₁ and S₂ are statements and E is an expression, then
    "if E then S₁ else S₂" is a statement.                                  (4.1)

This form of conditional statement cannot be specified using the notation of
regular expressions; in Chapter 3, we saw that regular expressions can specify
the lexical structure of tokens. On the other hand, using the syntactic variable
stmt to denote the class of statements and expr the class of expressions, we can
readily express (4.1) using the grammar production

    stmt → if expr then stmt else stmt                                      (4.2)

In this section, we review the definition of a context-free grammar and


introduce terminology for talking about parsing. From Section 2.2, a context-
free grammar (grammar for short) consists of terminals, nonterminals, a start
symbol, and productions.

1. Terminals are the basic symbols from which strings are formed. The
   word "token" is a synonym for "terminal" when we are talking about
   grammars for programming languages. In (4.2), each of the keywords if,
   then, and else is a terminal.



2. Nonterminals are syntactic variables that denote sets of strings. In (4.2),
   stmt and expr are nonterminals. The nonterminals define sets of strings
   that help define the language generated by the grammar. They also
   impose a hierarchical structure on the language that is useful for both
   syntax analysis and translation.

3. In a grammar, one nonterminal is distinguished as the start symbol, and


the set of strings it denotes is the language defined by the grammar.

4. The productions of a grammar specify the manner in which the terminals


and nonterminals can be combined to form strings. Each production con-
sists of a nonterminal, followed by an arrow (sometimes the symbol ::=

is used in place of the arrow), followed by a string of nonterminals and

terminals.

Example 4.2. The grammar with the following productions defines simple
arithmetic expressions.

    expr → expr op expr
    expr → ( expr )
    expr → - expr
    expr → id
    op → +
    op → -
    op → *
    op → /
    op → ↑

In this grammar, the terminal symbols are

    id   +   -   *   /   ↑   (   )

The nonterminal symbols are expr and op, and expr is the start symbol.

Notational Conventions

To avoid always having to state that "these are the terminals," "these are the
nonterminals," and so on, we shall employ the following notational conven-
tions with regard to grammars throughout the remainder of this book.

1. These symbols are terminals:

   i)   Lower-case letters early in the alphabet such as a, b, c.
   ii)  Operator symbols such as +, -, etc.
   iii) Punctuation symbols such as parentheses, comma, etc.
   iv)  The digits 0, 1, ..., 9.
   v)   Boldface strings such as id or if.

2. These symbols are nonterminals:

   i)   Upper-case letters early in the alphabet such as A, B, C.



   ii)  The letter S, which, when it appears, is usually the start symbol.
   iii) Lower-case italic names such as expr or stmt.

3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar
   symbols, that is, either nonterminals or terminals.

4. Lower-case letters late in the alphabet, chiefly u, v, ..., z, represent
   strings of terminals.

5. Lower-case Greek letters, α, β, γ, for example, represent strings of
   grammar symbols. Thus, a generic production could be written as
   A → α, indicating that there is a single nonterminal A on the left of the
   arrow (the left side of the production) and a string of grammar symbols α
   to the right of the arrow (the right side of the production).

6. If A → α₁, A → α₂, ..., A → αₖ are all productions with A on the left
   (we call them A-productions), we may write A → α₁ | α₂ | ··· | αₖ. We
   call α₁, α₂, ..., αₖ the alternatives for A.

7. Unless otherwise stated, the left side of the first production is the start
   symbol.

Example 4.3. Using these shorthands, we could write the grammar of Example
4.2 concisely as

    E → E A E | ( E ) | - E | id
    A → + | - | * | / | ↑

Our notational conventions tell us that E and A are nonterminals, with E the
start symbol. The remaining symbols are terminals.

Derivations

There are several ways to view the process by which a grammar defines a
language. In Section 2.2, we viewed this process as one of building parse
trees, but there is also a related derivational view that we frequently find use-
ful. In fact, this derivational view gives a precise description of the top-down
construction of a parse tree. The central idea here is that a production is

treated as a rewriting rule in which the nonterminal on the left is replaced by


the string on the right side of the production.
For example, consider the following grammar for arithmetic expressions,
with the nonterminal E representing an expression.

    E → E + E | E * E | ( E ) | - E | id                                    (4.3)

The production E → - E signifies that an expression preceded by a minus sign
is also an expression. This production can be used to generate more complex
expressions from simpler expressions by allowing us to replace any instance of
an E by - E. In the simplest case, we can replace a single E by - E. We
can describe this action by writing

    E ⇒ - E

which is read "E derives - E." The production E → ( E ) tells us that we could
also replace one instance of an E in any string of grammar symbols by ( E );
e.g., E * E ⇒ ( E ) * E or E * E ⇒ E * ( E ).
We can take a single E and repeatedly apply productions in any order to
obtain a sequence of replacements. For example,

    E ⇒ -E ⇒ -(E) ⇒ -(id)

We call such a sequence of replacements a derivation of -(id) from E. This
derivation provides a proof that one particular instance of an expression is the
string -(id).
In a more abstract setting, we say that αAβ ⇒ αγβ if A → γ is a production
and α and β are arbitrary strings of grammar symbols. If
α₁ ⇒ α₂ ⇒ ··· ⇒ αₙ, we say α₁ derives αₙ. The symbol ⇒ means
"derives in one step." Often we wish to say "derives in zero or more steps."
For this purpose we can use the symbol ⇒*. Thus,

1. α ⇒* α for any string α, and

2. if α ⇒* β and β ⇒ γ, then α ⇒* γ.

Likewise, we use ⇒⁺ to mean "derives in one or more steps."


Given a grammar G with start symbol S, we can use the ⇒⁺ relation to
define L(G), the language generated by G. Strings in L(G) may contain only
terminal symbols of G. We say a string of terminals w is in L(G) if and only
if S ⇒⁺ w. The string w is called a sentence of G. A language that can be
generated by a grammar is said to be a context-free language. If two grammars
generate the same language, the grammars are said to be equivalent.
If S ⇒* α, where α may contain nonterminals, then we say that α is a sentential
form of G. A sentence is a sentential form with no nonterminals.

Example 4.4. The string -(id+id) is a sentence of grammar (4.3) because
there is the derivation

    E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)                             (4.4)

The strings E, -E, -(E), ..., -(id+id) appearing in this derivation are all
sentential forms of this grammar. We write E ⇒* -(id+id) to indicate that
-(id+id) can be derived from E.
We can show by induction on the length of a derivation that every sentence
in the language of grammar (4.3) is an arithmetic expression involving the
binary operators + and *, the unary operator -, parentheses, and the
operand id. Similarly, we can show by induction on the length of an arithmetic
expression that all such expressions can be generated by this grammar.
Thus, grammar (4.3) generates precisely the set of all arithmetic expressions
involving binary + and *, unary -, parentheses, and the operand id.  □

At each step in a derivation, there are two choices to be made. We need to


choose which nonterminal to replace, and having made this choice, which

alternative to use for that nonterminal. For example, derivation (4.4) of
Example 4.4 could continue from -(E+E) as follows:

    -(E+E) ⇒ -(E+id) ⇒ -(id+id)                                             (4.5)

Each nonterminal in (4.5) is replaced by the same right side as in Example
4.4, but the order of replacements is different.
To understand how certain parsers work we need to consider derivations in
which only the leftmost nonterminal in any sentential form is replaced at each
step. Such derivations are termed leftmost. If α ⇒ β by a step in which the
leftmost nonterminal in α is replaced, we write α ⇒_lm β. Since derivation
(4.4) is leftmost, we can rewrite it as:

    E ⇒_lm -E ⇒_lm -(E) ⇒_lm -(E+E) ⇒_lm -(id+E) ⇒_lm -(id+id)

Using our notational conventions, every leftmost step can be written
wAγ ⇒_lm wδγ, where w consists of terminals only, A → δ is the production
applied, and γ is a string of grammar symbols. To emphasize the fact that α
derives β by a leftmost derivation, we write α ⇒*_lm β. If S ⇒*_lm α, then we say
α is a left-sentential form of the grammar at hand.
Analogous definitions hold for rightmost derivations in which the rightmost
nonterminal is replaced at each step. Rightmost derivations are sometimes
called canonical derivations.

Parse Trees and Derivations

A parse tree may be viewed as a graphical representation for a derivation that


filters out the choice regarding replacement order.
Recall from Section 2.2
that each interior node of a parse tree is labeled by some nonterminal A, and
that the children of the node are labeled, from left to right, by the symbols in
the right side of the production by which this A was replaced in the derivation.
The leaves of the parse tree are labeled by nonterminals or terminals and,
read from left to right, they constitute a sentential form, called the yield or
frontier of the tree. For example, the parse tree for -(id + id) implied by
derivation (4.4) is shown in Fig. 4.2.

To see the relationship between derivations and parse trees, consider any
derivation α₁ ⇒ α₂ ⇒ ··· ⇒ αₙ, where α₁ is a single nonterminal A. For
each sentential form αᵢ in the derivation, we construct a parse tree whose
yield is αᵢ. The process is an induction on i. For the basis, the tree for
α₁ = A is a single node labeled A. To do the induction, suppose we have
already constructed a parse tree whose yield is αᵢ₋₁ = X₁X₂ ··· Xₖ. (Recalling
our conventions, each Xᵢ is either a nonterminal or a terminal.) Suppose
αᵢ is derived from αᵢ₋₁ by replacing Xⱼ, a nonterminal, by β = Y₁Y₂ ··· Yᵣ.
That is, at the ith step of the derivation, production Xⱼ → β is applied to αᵢ₋₁
to derive αᵢ = X₁X₂ ··· Xⱼ₋₁βXⱼ₊₁ ··· Xₖ.
To model this step of the derivation, we find the jth leaf from the left in
the current parse tree. This leaf is labeled Xⱼ. We give this leaf r children,
labeled Y₁, Y₂, ..., Yᵣ, from the left. As a special case, if r = 0, i.e.,

Fig. 4.2. Parse tree for -(id+id).

β = ε, then we give the jth leaf one child labeled ε.

Example 4.5. Consider derivation (4.4). The sequence of parse trees constructed
from this derivation is shown in Fig. 4.3. In the first step of the
derivation, E ⇒ -E. To model this step, we add two children, labeled -
and E, to the root E of the initial tree to create the second tree.

It is not hard to see that every parse tree has associated with it a unique leftmost
and a unique rightmost derivation. In what follows, we shall frequently parse
by producing a leftmost or rightmost derivation, understanding that instead of
this derivation we could produce the parse tree itself. However, we should
not assume that every sentence necessarily has only one parse tree or only one
leftmost or rightmost derivation.

Example 4.6. Let us again consider the arithmetic expression grammar (4.3).
The sentence id+id*id has the two distinct leftmost derivations:

    E ⇒ E + E                     E ⇒ E * E
      ⇒ id + E                      ⇒ E + E * E
      ⇒ id + E * E                  ⇒ id + E * E
      ⇒ id + id * E                 ⇒ id + id * E
      ⇒ id + id * id                ⇒ id + id * id

with the two corresponding parse trees shown in Fig. 4.4.


Fig. 4.4. Two parse trees for id+id*id.

Note that the parse tree of Fig. 4.4(a) reflects the commonly assumed precedence
of + and *, while the tree of Fig. 4.4(b) does not. That is, it is customary
to treat operator * as having higher precedence than +, corresponding
to the fact that we would normally evaluate an expression like a+b*c as
a+(b*c), rather than as (a+b)*c.

Ambiguity

A grammar that produces more than one parse tree for some sentence is said
to be ambiguous. Put another way, an ambiguous grammar is one that produces
more than one leftmost or more than one rightmost derivation for the
same sentence. For certain types of parsers, it is desirable that the grammar
be made unambiguous, for if it is not, we cannot uniquely determine which
parse tree to select for a sentence. For some applications we shall also consider
methods whereby we can use certain ambiguous grammars, together with
disambiguating rules that "throw away" undesirable parse trees, leaving us
with only one tree for each sentence.

4.3 WRITING A GRAMMAR


Grammars are capable of describing most, but not all, of the syntax of pro-
gramming languages. A limited amount of syntax analysis is done by a lexical
analyzer as it produces the sequence of tokens from the input characters. Cer-
tain constraints on the input, such as the requirement that identifiers be
declared before they are used, cannot be described by a context-free grammar.
Therefore, the sequences of tokens accepted by a parser form a superset of a
programming language; subsequent phases must analyze the output of the
parser to ensure compliance with rules that are not checked by the parser (see
Chapter 6).
We begin this section by considering the division of work between a lexical
analyzer and a parser. Because each parsing method can handle grammars
only of a certain form, the initial grammar may have to be rewritten to make
it parsable by the method chosen. Suitable grammars for expressions can
often be constructed using associativity and precedence information, as in Sec-
tion 2.2. In this section, we consider transformations that are useful for
rewriting grammars so they become suitable for top-down parsing. We con-
so they
clude this section by considering some programming language constructs that
cannot be described by any grammar.

Regular Expressions vs. Context-Free Grammars


Every construct that can be described by a regular expression can also be
described by a grammar. For example, the regular expression (a|b)*abb and
the grammar

    A₀ → a A₀ | b A₀ | a A₁
    A₁ → b A₂
    A₂ → b A₃
    A₃ → ε

describe the same language, the set of strings of a's and b's ending in abb.
We can mechanically convert a nondeterministic finite automaton (NFA)
into a grammar that generates the same language as recognized by the NFA.
The grammar above was constructed from the NFA of Fig. 3.23 using the following
construction: For each state i of the NFA, create a nonterminal symbol
Aᵢ. If state i has a transition to state j on symbol a, introduce the production
Aᵢ → a Aⱼ. If state i goes to state j on input ε, introduce the production
Aᵢ → Aⱼ. If i is an accepting state, introduce Aᵢ → ε. If i is the start state,
make Aᵢ the start symbol of the grammar.
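A sketch of this conversion in C follows; the NFA representation (a list of transitions, with symbol 0 standing for an ε-move) and the names used are assumptions of the example, not part of the text.

    #include <stdio.h>

    /* A hypothetical NFA representation: each transition is (from, symbol, to);
     * symbol 0 stands for an epsilon-transition. */
    struct trans { int from; char symbol; int to; };

    void nfa_to_grammar(const struct trans *t, int ntrans,
                        const int *accepting, int nstates, int start)
    {
        printf("start symbol: A%d\n", start);
        for (int i = 0; i < ntrans; i++) {
            if (t[i].symbol)                   /* A_i -> a A_j */
                printf("A%d -> %c A%d\n", t[i].from, t[i].symbol, t[i].to);
            else                               /* A_i -> A_j for an epsilon-move */
                printf("A%d -> A%d\n", t[i].from, t[i].to);
        }
        for (int s = 0; s < nstates; s++)
            if (accepting[s])                  /* A_i -> epsilon for accepting states */
                printf("A%d -> epsilon\n", s);
    }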
Since every regular set is a context-free language, we may reasonably ask,
"Why use regular expressions to define the lexical syntax of a language?"
There are several reasons.

1. The lexical rules of a language are frequently quite simple, and to


describe them we do not need a notation as powerful as grammars.

2. Regular expressions generally provide a more concise and easier to under-


stand notation for tokens than grammars.

3. More efficient lexical analyzers can be constructed automatically from


regular expressions than from arbitrary grammars.

4. Separating the syntactic structure of a language into lexical and nonlexical


parts provides a convenient way of modularizing the front end of a com-
piler into two manageable-sized components.

There are no firm guidelines as to what to put into the lexical rules, as
opposed to the syntactic rules. Regular expressions are most useful for
describing the structure of lexical constructs such as identifiers, constants,
keywords, and so forth. Grammars, on the other hand, are most useful in
describing nested structures such as balanced parentheses, matching begin-
end's, corresponding if-then-else's, and so on. As we have noted, these
nested structures cannot be described by regular expressions.

Verifying the Language Generated by a Grammar


Although compiler designers rarely do it for a complete programming
language grammar, it is important to be able to reason that a given set of pro-
ductions generates a particular language. Troublesome constructs can be stu-
died by writing a concise, abstract grammar and studying the language that it
generates. We shall construct such a grammar for conditionals below.
A proof that a grammar G generates a language L has two parts: we must
show that every string generated by G is in L, and conversely that every string
in L can indeed be generated by G.

Example 4.7. Consider the grammar

    S → ( S ) S | ε                                                         (4.6)

It may not be initially apparent, but this simple grammar generates all strings
of balanced parentheses, and only such strings. To see this, we shall show
first that every sentence derivable from S is balanced, and then that every bal-
anced string is derivable from S. To show that every sentence derivable from
S is balanced, we use an inductive proof on the number of steps in a deriva-
tion. For the basis step, we note that the only string of terminals derivable
from S in one step is the empty string, which surely is balanced.
Now assume that all derivations of fewer than n steps produce balanced sen-
tences, and consider a leftmost derivation of exactly n steps. Such a deriva-
tion must be of the form

    S ⇒ (S)S ⇒* (x)S ⇒* (x)y

The derivations of x and y from S take fewer than n steps so, by the inductive
hypothesis, x and y are balanced. Therefore, the string (x)y must be bal-
anced.
We have thus shown that any string derivable from S is balanced. We must

next show that every balanced string is derivable from S. To do this we use
induction on the length of a string. For the basis step, the empty string is
derivable from S.
Now assume that every balanced string of length less than 2n is derivable
from S, and consider a balanced string w of length 2n, n ≥ 1. Surely w
begins with a left parenthesis. Let (x) be the shortest prefix of w having an
equal number of left and right parentheses. Then w can be written as (x)y
where both x and y are balanced. Since x and y are of length less than 2n,
they are derivable from S by the inductive hypothesis. Thus, we can find a
derivation of the form

    S ⇒ (S)S ⇒* (x)S ⇒* (x)y

proving that w = (x)y is also derivable from S.  □
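The grammar S → ( S ) S | ε also translates directly into a recursive recognizer. The following C sketch (its function and variable names are the example's own) accepts exactly the balanced strings, mirroring the structure of the two productions.

    #include <stdio.h>

    static const char *p;     /* next unread character of the input */

    /* S -> ( S ) S | epsilon : consume one S; return 0 on a mismatch. */
    static int S(void)
    {
        if (*p == '(') {            /* try the alternative S -> ( S ) S */
            p++;
            if (!S()) return 0;
            if (*p != ')') return 0;
            p++;
            return S();
        }
        return 1;                   /* otherwise use S -> epsilon */
    }

    int balanced(const char *s)
    {
        p = s;
        return S() && *p == '\0';   /* the whole input must be consumed */
    }

    int main(void)
    {
        printf("%d %d\n", balanced("(()())"), balanced("(()"));  /* prints 1 0 */
        return 0;
    }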

Eliminating Ambiguity

Sometimes an ambiguous grammar can be rewritten to eliminate the ambi-


guity. As an example, we shall eliminate the ambiguity from the following
"dangling-else" grammar:

    stmt → if expr then stmt
         | if expr then stmt else stmt                                      (4.7)
         | other

Here "other" stands for any other statement. According to this grammar, the
compound conditional statement

    if E₁ then S₁ else if E₂ then S₂ else S₃

has the parse tree shown in Fig. 4.5. Grammar (4.7) is ambiguous since the
string

    if E₁ then if E₂ then S₁ else S₂                                        (4.8)

has the two parse trees shown in Fig. 4.6.


Fig. 4.5. Parse tree for conditional statement.



Fig. 4.6. Two parse trees for an ambiguous sentence.

In all programming languages with conditional statements of this form, the
first parse tree is preferred. The general rule is, "Match each else with the
closest previous unmatched then." This disambiguating rule can be incorporated
directly into the grammar. For example, we can rewrite grammar
(4.7) as the following unambiguous grammar. The idea is that a statement
appearing between a then and an else must be "matched;" i.e., it must not
end with an unmatched then followed by any statement, for the else would
then be forced to match this unmatched then. A matched statement is either
an if-then-else statement containing no unmatched statements or it is any other
kind of unconditional statement. Thus, we may use the grammar

    stmt → matched_stmt
         | unmatched_stmt
    matched_stmt → if expr then matched_stmt else matched_stmt
         | other                                                            (4.9)
    unmatched_stmt → if expr then stmt
         | if expr then matched_stmt else unmatched_stmt

This grammar generates the same set of strings as (4.7), but it allows only one
parsing for string (4.8), namely the one that associates each else with the
closest previous unmatched then.

Elimination of Left Recursion

A grammar is left recursive if it has a nonterminal A such that there is a
derivation A ⇒⁺ Aα for some string α. Top-down parsing methods cannot
handle left-recursive grammars, so a transformation that eliminates left recursion
is needed. In Section 2.4, we discussed simple left recursion, where there
was one production of the form A → Aα. Here we study the general case. In
Section 2.4, we showed how the left-recursive pair of productions A → Aα | β
could be replaced by the non-left-recursive productions

    A  → βA'
    A' → αA' | ε

without changing the set of strings derivable from A. This rule by itself suffices
in many grammars.


Example 4.8. Consider the following grammar for arithmetic expressions.

    E → E + T | T
    T → T * F | F                                                           (4.10)
    F → ( E ) | id

Eliminating the immediate left recursion (productions of the form A → Aα) from
the productions for E and then for T, we obtain

    E  → T E'
    E' → + T E' | ε
    T  → F T'                                                               (4.11)
    T' → * F T' | ε
    F  → ( E ) | id                                                         □
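One payoff of the transformation is that the non-left-recursive grammar (4.11) can be turned directly into recursive procedures. In the C sketch below (the match and error routines and the token representation are assumptions of the example), the tail nonterminal E' shows up as the familiar while loop over + signs.

    /* Assumed interface to the lexical analyzer and error handling. */
    extern int lookahead;                 /* current token: '+', '*', '(', ')', ID, ... */
    extern void match(int token);         /* advance if lookahead == token, else error */
    extern void error(const char *msg);
    #define ID 256

    void E(void);
    void T(void);
    void F(void);

    void E(void)            /* E -> T E'   with   E' -> + T E' | epsilon */
    {
        T();
        while (lookahead == '+') {        /* each iteration plays one use of E' -> + T E' */
            match('+');
            T();
        }
    }

    void T(void)            /* T -> F T'   with   T' -> * F T' | epsilon */
    {
        F();
        while (lookahead == '*') {
            match('*');
            F();
        }
    }

    void F(void)            /* F -> ( E ) | id */
    {
        if (lookahead == '(') { match('('); E(); match(')'); }
        else if (lookahead == ID) match(ID);
        else error("expected ( or id");
    }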

No matter how many A-productions there are, we can eliminate immediate
left recursion from them by the following technique. First, we group the A-productions
as

    A → Aα₁ | Aα₂ | ··· | Aαₘ | β₁ | β₂ | ··· | βₙ

where no βᵢ begins with an A. Then, we replace the A-productions by

    A  → β₁A' | β₂A' | ··· | βₙA'
    A' → α₁A' | α₂A' | ··· | αₘA' | ε

The nonterminal A generates the same strings as before but is no longer left
recursive. This procedure eliminates all immediate left recursion from the A
and A' productions (provided no αᵢ is ε), but it does not eliminate left recursion
involving derivations of two or more steps. For example, consider the
grammar

    S → A a | b
    A → A c | S d | ε                                                       (4.12)

The nonterminal S is left recursive because S ⇒ Aa ⇒ Sda, but it is not



immediately left recursive.


Algorithm 4.1. below, will systematically eliminate left recursion from a
grammar. It is guaranteed to work if the grammar has no cycles (derivations
of the form A =>/\) or e-productions (productions of the form A — e).
Cycles can be systematically eliminated from a grammar as can e-productions
(see Exercises 4.20 and 4.22).

Algorithm 4.1. Eliminating left recursion.

Input. Grammar G with no cycles or e-productions.

Output. An equivalent grammar with no left recursion.

Method. Apply the algorithm in Fig. 4.7 to G. Note that the resulting non-left-recursive
grammar may have ε-productions.

1. Arrange the nonterminals in some order A₁, A₂, ..., Aₙ.

2. for i := 1 to n do begin
       for j := 1 to i - 1 do begin
           replace each production of the form Aᵢ → Aⱼγ
           by the productions Aᵢ → δ₁γ | δ₂γ | ··· | δₖγ,
           where Aⱼ → δ₁ | δ₂ | ··· | δₖ are all the current Aⱼ-productions;
       end
       eliminate the immediate left recursion among the Aᵢ-productions
   end

Fig. 4.7. Algorithm to eliminate left recursion from a grammar.

The reason the procedure in Fig. 4.7 works is that after the i-1st iteration
of the outer for loop in step (2), any production of the form Aₖ → Aₗα, where
k < i, must have l > k. As a result, on the next iteration, the inner loop (on j)
progressively raises the lower limit on m in any production Aᵢ → Aₘα, until we
must have m ≥ i. Then, eliminating immediate left recursion for the Aᵢ-productions
forces m to be greater than i.

Example 4.9. Let us apply this procedure to grammar (4.12). Technically,
Algorithm 4.1 is not guaranteed to work, because of the ε-production, but in
this case the production A → ε turns out to be harmless.
We order the nonterminals S, A. There is no immediate left recursion
among the S-productions, so nothing happens during step (2) for the case i = 1.
For i = 2, we substitute the S-productions in A → S d to obtain the following
A-productions.

    A → A c | A a d | b d | ε

Eliminating the immediate left recursion among the A-productions yields the
following grammar.

    S  → A a | b
    A  → b d A' | A'
    A' → c A' | a d A' | ε                                                  □

Left Factoring

Left factoring is a grammar transformation that is useful for producing a
grammar suitable for predictive parsing. The basic idea is that when it is not
clear which of two alternative productions to use to expand a nonterminal A,
we may be able to rewrite the A-productions to defer the decision until we
have seen enough of the input to make the right choice.
For example, if we have the two productions

    stmt → if expr then stmt else stmt
         | if expr then stmt

on seeing the input token if, we cannot immediately tell which production to
choose to expand stmt. In general, if A → αβ₁ | αβ₂ are two A-productions,
and the input begins with a nonempty string derived from α, we do not know
whether to expand A to αβ₁ or to αβ₂. However, we may defer the decision
by expanding A to αA'. Then, after seeing the input derived from α, we
expand A' to β₁ or to β₂. That is, left-factored, the original productions
become

    A  → αA'
    A' → β₁ | β₂

Algorithm 4.2. Left factoring a grammar.

Input. Grammar G.

Output. An equivalent left-factored grammar.

Method. For each nonterminal A, find the longest prefix α common to two or
more of its alternatives. If α ≠ ε, i.e., there is a nontrivial common prefix,
replace all the A-productions A → αβ₁ | αβ₂ | ··· | αβₙ | γ, where γ
represents all alternatives that do not begin with α, by

    A  → αA' | γ
    A' → β₁ | β₂ | ··· | βₙ

Here A' is a new nonterminal. Repeatedly apply this transformation until no
two alternatives for a nonterminal have a common prefix.
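The heart of the transformation is finding a common prefix among alternatives. The C sketch below handles only the simplest case, in which every alternative of A shares the prefix, and treats each grammar symbol as one character; these simplifications and the function names are assumptions of the example.

    #include <stdio.h>
    #include <string.h>

    /* Length of the longest prefix common to all n alternatives. */
    static int common_prefix(const char *alt[], int n)
    {
        int len = strlen(alt[0]);
        for (int i = 1; i < n; i++) {
            int j = 0;
            while (j < len && alt[i][j] && alt[i][j] == alt[0][j])
                j++;
            len = j;
        }
        return len;
    }

    /* Print the left-factored productions for A -> alt[0] | ... | alt[n-1]. */
    void left_factor(char A, const char *alt[], int n)
    {
        int len = common_prefix(alt, n);
        if (len == 0) return;                  /* nothing to factor */
        printf("%c -> %.*s%c'\n", A, len, alt[0], A);
        for (int i = 0; i < n; i++)
            printf("%c' -> %s\n", A, alt[i][len] ? alt[i] + len : "epsilon");
    }

For instance, with the two alternatives iEtS and iEtSeS of the dangling-else abstraction in Example 4.10 below, this sketch should print S → iEtSS' together with the two S' alternatives ε and eS; the alternative a, which shares no prefix, would simply be carried over unchanged by the full algorithm.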

Example 4.10. The following grammar abstracts the dangling-else problem:

    S → i E t S | i E t S e S | a                                           (4.13)
    E → b

Here i, t, and e stand for if, then, and else; E and S stand for "expression" and
"statement." Left-factored, this grammar becomes:

    S  → i E t S S' | a
    S' → e S | ε                                                            (4.14)
    E  → b

Thus, we may expand S to iEtSS' on input i, and wait until iEtS has been seen
to decide whether to expand S' to eS or to ε. Of course, grammars (4.13) and
(4.14) are both ambiguous, and on input e, it will not be clear which alternative
for S' should be chosen. Example 4.19 discusses a way out of this
dilemma.  □

Non-Context-Free Language Constructs

It should come as no surprise that some languages cannot be generated by any


grammar. In fact, a few syntactic constructs found in many programming
languages cannot be specified using grammars alone. In this section, we shall
present several of these constructs, using simple abstract languages to illus-
trate the difficulties.

Example 4.11. Consider the abstract language L₁ = {wcw | w is in (a|b)*}.
L₁ consists of all words composed of a repeated string of a's and b's separated
by a c, such as aabcaab. It can be proven that this language is not context free.
This language abstracts the problem of checking that identifiers are declared
before their use in a program. That is, the first w in wcw represents the
declaration of an identifier w. The second w represents its use. While it is
beyond the scope of this book to prove it, the non-context-freedom of L₁
directly implies the non-context-freedom of programming languages like Algol
and Pascal, which require declaration of identifiers before their use, and
which allow identifiers of arbitrary length.
For this reason, a grammar for the syntax of Algol or Pascal does not
specify the characters in an identifier. Instead, all identifiers are represented
by a token such as id in the grammar. In a compiler for such a language, the
semantic analysis phase checks that identifiers have been declared before their
use.  □

Example 4.12. The language L₂ = {aⁿbᵐcⁿdᵐ | n ≥ 1 and m ≥ 1} is not context
free. That is, L₂ consists of strings in the language generated by the regular
expression a*b*c*d* such that the number of a's and c's are equal and
the number of b's and d's are equal. (Recall aⁿ means a written n times.) L₂
abstracts the problem of checking that the number of formal parameters in the
declaration of a procedure agrees with the number of actual parameters in a
use of the procedure. That is, aⁿ and bᵐ could represent the formal parameter
lists in two procedures declared to have n and m arguments, respectively.
Then cⁿ and dᵐ represent the actual parameter lists in calls to these two procedures.
Again note that the typical syntax of procedure definitions and uses does
not concern itself with counting the number of parameters. For example, the
CALL statement in a Fortran-like language might be described

    stmt → call id ( expr_list )
    expr_list → expr_list , expr
              | expr

with suitable productions for expr. Checking that the number of actual
parameters in the call is correct is usually done during the semantic analysis
phase.  □

Example 4.13. The language L₃ = {aⁿbⁿcⁿ | n ≥ 0}, that is, strings in
L(a*b*c*) with equal numbers of a's, b's, and c's, is not context free. An
example of a problem that embeds L₃ is the following. Typeset text uses italics
where ordinary typed text uses underlining. In converting a file of text
destined to be printed on a line printer to text suitable for a phototypesetter,
one has to replace underlined words by italics. An underlined word is a string
of letters followed by an equal number of backspaces and an equal number of
underscores. If we regard a as any letter, b as backspace, and c as underscore,
the language L₃ represents underlined words. The conclusion is that
we cannot use a grammar to describe underlined words in this fashion. On
the other hand, if we represent an underlined word as a sequence of letter-backspace-underscore
triples, then we can represent underlined words with the
regular expression (abc)*.  □

It is interesting to note that languages very similar to L₁, L₂, and L₃ are
context free. For example, L₁' = {wcwᴿ | w is in (a|b)*}, where wᴿ stands
for w reversed, is context free. It is generated by the grammar

    S → a S a | b S b | c

The language L₂' = {aⁿbᵐcᵐdⁿ | n ≥ 1 and m ≥ 1} is context free, with grammar

    S → a S d | a A d
    A → b A c | b c

Also, L₂'' = {aⁿbⁿcᵐdᵐ | n ≥ 1 and m ≥ 1} is context free, with grammar

    S → A B
    A → a A b | a b
    B → c B d | c d

Finally, L₃' = {aⁿbⁿ | n ≥ 1} is context free, with grammar

    S → a S b | a b

It is worth noting that L₃' is the prototypical example of a language not definable
by any regular expression. To see this, suppose L₃' were the language
defined by some regular expression. Equivalently, suppose we could construct
a DFA D accepting L₃'. D must have some finite number of states, say k.
Consider the sequence of states s₀, s₁, s₂, ..., sₖ entered by D having read
ε, a, aa, ..., aᵏ. That is, sᵢ is the state entered by D having read i a's.


Fig. 4.8. DFA D accepting aⁱbⁱ and aʲbⁱ.

Since D has only k different states, at least two states in the sequence
s₀, s₁, ..., sₖ must be the same, say sᵢ and sⱼ. From state sᵢ a sequence of
i b's takes D to an accepting state f, since aⁱbⁱ is in L₃'. But then there is also
a path from the initial state s₀ to sᵢ to f labeled aʲbⁱ, as shown in Fig. 4.8.
Thus, D also accepts aʲbⁱ, which is not in L₃', contradicting the assumption
that L₃' is the language accepted by D.
Colloquially, we say that "a finite automaton cannot keep count," meaning
that a finite automaton cannot accept a language like L₃' which would require
it to keep count of the number of a's before it sees the b's. Similarly, we say
"a grammar can keep count of two items but not three," since with a grammar
we can define L₃' but not L₃.

4.4 TOP-DOWN PARSING


In this section, we introduce the basic ideas behind top-down parsing and
show how to construct an efficient non-backtracking form of top-down parser
called a predictive parser. We define the class of LL(1) grammars from
which predictive parsers can be constructed automatically. Besides formaliz-
ing the discussion of predictive parsers in Section 2.4, we consider nonrecur-
sive predictive parsers. This section concludes with a discussion of error
recovery. Bottom-up parsers are discussed in Sections 4.5 - 4.7.

Recursive-Descent Parsing

Top-down parsing can be viewed as an attempt to find a leftmost derivation


for an input string. Equivalently, it can be viewed as an attempt to construct

a parse tree for the input starting from the root and creating the nodes of the
parse tree in preorder. In Section 2.4, we discussed the special case of
recursive-descent parsing, called predictive parsing, where no backtracking is

required. We now consider a general form of top-down parsing, called recur-


sive descent, that may involve backtracking, that is, making repeated scans of
the input. However, backtracking parsers are not seen frequently. One rea-
son is that backtracking is rarely needed to parse programming language con-
structs. In situations like natural language parsing, backtracking is still not
very efficient, and tabular methods such as the dynamic programming algo-
rithm of Exercise 4.63 or the method of Earley [1970] are preferred. See Aho
and Ullman [1972b] for a description of general parsing methods.

Backtracking is required in the next example, and we shall suggest a way of


keeping track of the input when backtracking takes place.

Example 4.14. Consider the grammar

    S → c A d
    A → a b | a                                                             (4.15)

and the input string w = cad. To construct a parse tree for this string top-
down, we initially create a tree consisting of a single node labeled S. An
input pointer points to c, the first symbol of w. We then use the first produc-
tion for S to expand the tree and obtain the tree of Fig. 4.9(a).

Fig. 4.9. Steps in top-down parse.

The leftmost leaf, labeled c, matches the first symbol of w, so we now


advance the input pointer to a, the second symbol of w, and consider the next
leaf, labeled A. We can then expand A using the first alternative for A to
obtain the tree of Fig. 4.9(b). We now have a match for the second input
symbol so we advance the input pointer to d, the third input symbol, and com-
pare d against the next leaf, labeled b. Since b does not match d, we report
failure and go back to A to see whether there is another alternative for A that
we have not tried but that might produce a match.
In going back to A, we must reset the input pointer to position 2, the posi-
tion it had when we first came to A, which means that the procedure for A
(analogous to the procedure for nonterminals in Fig. 2.17) must store the
input pointer in a local variable. We now try the second alternative for A to
obtain the tree of Fig. 4.9(c). The leaf a matches the second symbol of w and
the leaf d matches the third symbol. Since we have produced a parse tree for
w, we halt and announce successful completion of parsing.    □
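To make the bookkeeping concrete, here is a minimal Python sketch of such a backtracking recursive-descent parser for grammar (4.15); the procedure names (parse, parse_S, parse_A, match) and the way the saved input position is restored are illustrative choices, not taken from the text.

# Backtracking recursive-descent parser for grammar (4.15):
#     S -> c A d        A -> a b | a
# A sketch only: names and structure are illustrative.
def parse(w):
    pos = 0                          # input pointer

    def match(t):
        nonlocal pos
        if pos < len(w) and w[pos] == t:
            pos += 1
            return True
        return False

    def parse_A():
        nonlocal pos
        saved = pos                          # remember the input pointer
        if match('a') and match('b'):        # first alternative: A -> a b
            return True
        pos = saved                          # backtrack
        return match('a')                    # second alternative: A -> a

    def parse_S():
        return match('c') and parse_A() and match('d')

    return parse_S() and pos == len(w)

print(parse("cad"))    # True: succeeds only after backtracking on A
print(parse("cabd"))   # True: first alternative for A succeeds
print(parse("cd"))     # False

On input cad, the first alternative for A consumes the a, fails to match b against d, restores the saved position, and the second alternative then succeeds, mirroring the steps of Example 4.14.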

A left-recursive grammar can cause a recursive-descent parser, even one


with backtracking, to go into an infinite loop. That is, when we try to expand
A, we may eventually find ourselves again trying to expand A without having
consumed any input.

Predictive Parsers

In many cases, by carefully writing a grammar, eliminating left recursion from


it, and left factoring the resulting grammar, we can obtain a grammar that can

be parsed by a recursive-descent parser that needs no backtracking, i.e., a


predictive parser, as discussed in Section 2.4. To construct a predictive
parser, we must know, given the current input symbol a and the nonterminal
A to be expanded, which one of the alternatives of production
A → α1 | α2 | ··· | αn is the unique alternative that derives a string beginning
with a. That is, the proper alternative must be detectable by looking at only
the first symbol it derives. Flow-of-control constructs in most programming
languages, with their distinguishing keywords, are usually detectable in this
way. For example, if we have the productions

    stmt → if expr then stmt else stmt
         |  while expr do stmt
         |  begin stmt_list end

then the keywords if, while, and begin tell us which alternative is the only one
that could possibly succeed if we are to find a statement.

Transition Diagrams for Predictive Parsers

In Section 2.4, we discussed the implementation of predictive parsers by recur-


sive procedures, e.g., those of Fig. 2.17. Just as a transition diagram was
seen in Section 3.4 to be a useful plan or flowchart for a lexical analyzer, we
can create a transition diagram as a plan for a predictive parser.
Several differences between the transition diagrams for a lexical analyzer
and a predictive parser are immediately apparent. In the case of the parser,
there is one diagram for each nonterminal. The labels of edges are tokens
and nonterminals. A transition on a token (terminal) means we should take
that transition if that token is the next input symbol. A transition on a non-
terminal A is a call of the procedure for A.
To construct the transition diagram of a predictive parser from a grammar,
first eliminate left recursion from the grammar, and then left factor the gram-
mar. Then for each nonterminal A do the following:

1. Create an initial and final (return) state.

2. For each production A → X1 X2 ··· Xn, create a path from the initial to
   the final state, with edges labeled X1, X2, ..., Xn.

The predictive parser working off the transition diagrams behaves as fol-
lows. It begins in the start state for the start symbol. If after some actions it

is in state s with an edge labeled by terminal a to state t, and if the next input
symbol is a, then the parser moves the input cursor one position right and
goes to state t. If, on the other hand, the edge is labeled by a nonterminal A,
the parser instead goes to the start state for A, without moving the input cur-
sor. If it ever reaches the final state for A, it immediately goes to state t, in
effect having "read" A from the input during the time it moved from state s
to t. Finally, if there is an edge from s to t labeled ε, then from state s the
parser immediately goes to state t, without advancing the input.

A predictive parsing program based on a transition diagram attempts to


match terminal symbols against the input, and makes a potentially recursive
procedure call whenever it has to follow an edge labeled by a nonterminal. A
nonrecursive implementation can be obtained by stacking the states s when
there is a transition on a nonterminal out of s, and popping the stack when the
final state for a nonterminal is reached. We shall discuss the implementation
of transition diagrams in more detail shortly.
The above approach works if the given transition diagram does not have
nondeterminism, in the sense that there is more than one transition from a
state on the same input. If ambiguity occurs, we may be able to resolve it in
an ad-hoc way, as in the next example. If the nondeterminism cannot be
eliminated, we cannot build a predictive parser, but we could build a
recursive-descent parser using backtracking to systematically try all possibili-
ties, if that were the best parsing strategy we could find.

Example 4.15. Figure 4.10 contains a collection of transition diagrams for


grammar (4.11). The only ambiguities concern whether or not to take an ε-
edge. If we interpret the edges out of the initial state for E' as saying take
the transition on + whenever that is the next input and take the transition on
ε otherwise, and make the analogous assumption for T', then the ambiguity is
removed, and we can write a predictive parsing program for grammar (4.11). □
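To illustrate the kind of parsing program that results, the following Python sketch gives mutually recursive procedures for grammar (4.11), resolving the ε-edges exactly as just described; the procedure and token names are illustrative choices.

# Recursive predictive parser for grammar (4.11):
#   E -> T E'      E' -> + T E' | e
#   T -> F T'      T' -> * F T' | e
#   F -> ( E ) | id
# Sketch; token handling and error reporting are illustrative.
tokens = []
pos = 0

def lookahead():
    return tokens[pos] if pos < len(tokens) else '$'

def match(t):
    global pos
    if lookahead() == t:
        pos += 1
    else:
        raise SyntaxError("expected %s, found %s" % (t, lookahead()))

def E():
    T(); Eprime()

def Eprime():
    if lookahead() == '+':           # take the + edge
        match('+'); T(); Eprime()
    # otherwise take the e-edge: do nothing

def T():
    F(); Tprime()

def Tprime():
    if lookahead() == '*':
        match('*'); F(); Tprime()

def F():
    if lookahead() == '(':
        match('('); E(); match(')')
    else:
        match('id')

def parse(token_list):
    global tokens, pos
    tokens, pos = token_list, 0
    E()
    match('$')                       # the entire input must be consumed

parse(['id', '+', 'id', '*', 'id', '$'])    # accepts id + id * id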

Transition diagrams can be simplified by substituting diagrams in one


another; these substitutions are similar to the transformations on grammars
used in Section 2.5. For example, in Fig. 4.11(a), the call of E' on itself has
been replaced by a jump to the beginning of the diagram for E'.

Fig. 4.10. Transition diagrams for grammar (4.11).



Fig. 4.11. Simplified transition diagrams.

Figure 4.11(b) shows an equivalent transition diagram for E'. We may then
substitute the diagram of Fig. 4.11(b) for the transition on E' in the diagram
for E in Fig. 4.10, yielding the diagram of Fig. 4.11(c). Lastly, we observe
that the first and third nodes in Fig. 4.11(c) are equivalent and we merge
them. The result, Fig. 4.11(d), is repeated as the first diagram in Fig. 4.12.
The same techniques apply to the diagrams for T and T'. The complete set of
resulting diagrams is shown in Fig. 4.12. A C implementation of this predic-
tive parser runs 20-25% faster than a C implementation of Fig. 4.10.

Fig. 4.12. Simplified transition diagrams for arithmetic expressions.



Nonrecursive Predictive Parsing

It is possible to build a nonrecursive predictive parser by maintaining a stack


explicitly, rather than implicitly via recursive calls. The key problem during
predictive parsing is that of determining the production to be applied for a
nonterminal. The nonrecursive parser in Fig. 4.13 looks up the production to
be applied in a parsing table. In what follows, we shall see how the table can
be constructed directly from certain grammars.

Fig. 4.13. Model of a nonrecursive predictive parser: input buffer, stack, parsing table M, and output (diagram not reproduced).

assume that the parser just prints the production used; any other code
could be executed here. If M[X, a] = error, the parser calls an error
recovery routine.

The behavior of the parser can be described in terms of its configurations,


which give the stack contents and the remaining input.

Algorithm 4.3. Nonrecursive predictive parsing.

Input. A string w and a parsing table M for grammar G.


Output. If w is in L(G), a leftmost derivation of w; otherwise, an error indi-
cation.

Method. Initially, the parser is in a configuration in which it has $S on the
stack with S, the start symbol of G, on top, and w$ in the input buffer. The
program that utilizes the predictive parsing table M to produce a parse for the
input is shown in Fig. 4.14.    □

    set ip to point to the first symbol of w$;
    repeat
        let X be the top stack symbol and a the symbol pointed to by ip;
        if X is a terminal or $ then
            if X = a then
                pop X from the stack and advance ip
            else error()
        else    /* X is a nonterminal */
            if M[X, a] = X → Y1 Y2 ··· Yk then begin
                pop X from the stack;
                push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
                output the production X → Y1 Y2 ··· Yk
            end
            else error()
    until X = $    /* stack is empty */

Fig. 4.14. Predictive parsing program.
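The following Python sketch is one possible rendering of this program for grammar (4.11); the dictionary M below encodes the entries of the parsing table of Fig. 4.15, and the data layout and names are illustrative choices.

# Nonrecursive predictive parser (Algorithm 4.3) for grammar (4.11).
# The table M encodes the entries of Fig. 4.15; absent entries are errors.
M = {
    ('E',  'id'): ['T', "E'"],   ('E',  '('): ['T', "E'"],
    ("E'", '+'):  ['+', 'T', "E'"],
    ("E'", ')'):  [],            ("E'", '$'): [],            # E' -> e
    ('T',  'id'): ['F', "T'"],   ('T',  '('): ['F', "T'"],
    ("T'", '+'):  [],            ("T'", '*'): ['*', 'F', "T'"],
    ("T'", ')'):  [],            ("T'", '$'): [],            # T' -> e
    ('F',  'id'): ['id'],        ('F',  '('): ['(', 'E', ')'],
}
NONTERMINALS = {'E', "E'", 'T', "T'", 'F'}

def parse(tokens):
    tokens = tokens + ['$']
    stack = ['$', 'E']                     # start symbol on top of $
    ip = 0
    while stack[-1] != '$':
        X, a = stack[-1], tokens[ip]
        if X not in NONTERMINALS:          # X is a terminal
            if X == a:
                stack.pop(); ip += 1       # match and advance
            else:
                raise SyntaxError("expected " + X)
        elif (X, a) in M:
            rhs = M[(X, a)]
            stack.pop()
            stack.extend(reversed(rhs))    # push Yk ... Y1, with Y1 on top
            print(X, '->', ' '.join(rhs) or 'e')
        else:
            raise SyntaxError("no entry M[%s, %s]" % (X, a))
    if tokens[ip] != '$':
        raise SyntaxError("input not fully consumed")

parse(['id', '+', 'id', '*', 'id'])        # prints a leftmost derivation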

Example 4.16. Consider the grammar (4.11) from Example 4.8. A predictive
parsing table for this grammar is shown in Fig. 4.15. Blanks are error entries;
non-blanks indicate a production with which to expand the top nonterminal on
the stack. Note that we have not yet indicated how these entries could be
selected, but we shall do so shortly.
With input id + id * id the predictive parser makes the sequence of moves
in Fig. 4.16. The input pointer points to the leftmost symbol of the string in
the Input column. If we observe the actions of this parser carefully, we see
that it is tracing out a leftmost derivation for the input, that is, the produc-
tions output are those of a leftmost derivation. The input symbols that have
already been scanned, followed by the grammar symbols on the stack, corre-
spond to the left-sentential forms in this derivation.    □

(Fig. 4.15, the parsing table M for grammar (4.11), and Fig. 4.16, the parser's moves on id + id * id, are not reproduced here.)

FIRST and FOLLOW

The construction of a predictive parser is aided by two functions associated
with a grammar G: FIRST and FOLLOW. Define FIRST(α), where α is any
string of grammar symbols, to be the set of terminals

that begin the strings derived from α. If α ⇒* ε, then ε is also in FIRST(α).
Define FOLLOW(A), for nonterminal A, to be the set of terminals a that
can appear immediately to the right of A in some sentential form, that is, the
set of terminals a such that there exists a derivation of the form S ⇒* αAaβ
for some α and β. Note that there may, at some time during the derivation,
have been symbols between A and a, but if so, they derived ε and disap-
peared. If A can be the rightmost symbol in some sentential form, then $ is in
FOLLOW(A).
To compute FIRST(X) for all grammar symbols X, apply the following rules
until no more terminals or ε can be added to any FIRST set.

1. If X is terminal, then FIRST(X) is {X}.

2. If X → ε is a production, then add ε to FIRST(X).

3. If X is nonterminal and X → Y1 Y2 ··· Yk is a production, then place a in
   FIRST(X) if for some i, a is in FIRST(Yi), and ε is in all of
   FIRST(Y1), ..., FIRST(Yi-1); that is, Y1 ··· Yi-1 ⇒* ε. If ε is in
   FIRST(Yj) for all j = 1, 2, ..., k, then add ε to FIRST(X). For exam-
   ple, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not
   derive ε, then we add nothing more to FIRST(X), but if Y1 ⇒* ε, then we
   add FIRST(Y2), and so on.

Now, we can compute FIRST for any string X1 X2 ··· Xn as follows. Add
to FIRST(X1 X2 ··· Xn) all the non-ε symbols of FIRST(X1). Also add the
non-ε symbols of FIRST(X2) if ε is in FIRST(X1), the non-ε symbols of
FIRST(X3) if ε is in both FIRST(X1) and FIRST(X2), and so on. Finally, add
ε to FIRST(X1 X2 ··· Xn) if, for all i, FIRST(Xi) contains ε.

To compute FOLLOW(A) for all nonterminals A, apply the following rules
until nothing can be added to any FOLLOW set.

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input
   right endmarker.

2. If there is a production A → αBβ, then everything in FIRST(β) except for
   ε is placed in FOLLOW(B).

3. If there is a production A → αB, or a production A → αBβ where
   FIRST(β) contains ε (i.e., β ⇒* ε), then everything in FOLLOW(A) is in
   FOLLOW(B).
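Both computations are simple fixed-point iterations; the following Python sketch applies them to grammar (4.11), with 'e' standing for ε and the representation of productions an illustrative choice.

# Iterative computation of FIRST and FOLLOW for grammar (4.11).
grammar = {                      # nonterminal -> list of right sides
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], []],          # [] is the e-production
    'T':  [['F', "T'"]],
    "T'": [['*', 'F', "T'"], []],
    'F':  [['(', 'E', ')'], ['id']],
}
start = 'E'
nonterminals = set(grammar)

FIRST = {A: set() for A in grammar}

def first_of(alpha):                       # FIRST of a string of symbols
    result = set()
    for X in alpha:
        fx = FIRST[X] if X in nonterminals else {X}
        result |= fx - {'e'}
        if 'e' not in fx:
            return result
    result.add('e')                        # every symbol could derive e
    return result

changed = True
while changed:                             # rules (1)-(3) for FIRST, to a fixed point
    changed = False
    for A, bodies in grammar.items():
        for body in bodies:
            for a in first_of(body):
                if a not in FIRST[A]:
                    FIRST[A].add(a); changed = True

FOLLOW = {A: set() for A in grammar}
FOLLOW[start].add('$')                     # rule (1) for FOLLOW
changed = True
while changed:                             # rules (2) and (3), to a fixed point
    changed = False
    for A, bodies in grammar.items():
        for body in bodies:
            for i, B in enumerate(body):
                if B not in nonterminals:
                    continue
                rest = first_of(body[i+1:])
                new = (rest - {'e'}) | (FOLLOW[A] if 'e' in rest else set())
                if not new <= FOLLOW[B]:
                    FOLLOW[B] |= new; changed = True

print(FIRST)    # e.g. FIRST(E) = {'(', 'id'},  FIRST(E') = {'+', 'e'}
print(FOLLOW)   # e.g. FOLLOW(E) = {')', '$'},  FOLLOW(F) = {'+', '*', ')', '$'}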
Example 4.17. Consider again grammar (4.11), repeated below:

    E  → T E'
    E' → + T E' | ε
    T  → F T'
    T' → * F T' | ε
    F  → ( E ) | id

Then:

    FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
    FIRST(E') = { +, ε }
    FIRST(T') = { *, ε }
    FOLLOW(E) = FOLLOW(E') = { ), $ }
    FOLLOW(T) = FOLLOW(T') = { +, ), $ }
    FOLLOW(F) = { +, *, ), $ }

For example, id and left parenthesis are added to FIRST(F) by rule (3) in
the definition of FIRST with i = 1 in each case, since FIRST(id) = {id} and
FIRST('(') = { ( } by rule (1). Then by rule (3) with i = 1, the production
T → FT' implies that id and left parenthesis are in FIRST(T) as well. As
another example, ε is in FIRST(E') by rule (2).
To compute FOLLOW sets, we put $ in FOLLOW(E) by rule (1) for FOL-
LOW. By rule (2) applied to production F → (E), the right parenthesis is also
in FOLLOW(E). By rule (3) applied to production E → TE', $ and right
parenthesis are in FOLLOW(E'). Since E' ⇒* ε, they are also in
FOLLOW(T). For a last example of how the FOLLOW rules are applied, the
production E → TE' implies, by rule (2), that everything other than ε in
FIRST(E') must be placed in FOLLOW(T). We have already seen that $ is in
FOLLOW(T).    □

Construction of Predictive Parsing Tables

The following algorithm can be used to construct a predictive parsing table for
a grammar G. The idea behind the algorithm is the following. Suppose
A → α is a production with a in FIRST(α). Then, the parser will expand A by
α when the current input symbol is a. The only complication occurs when
α = ε or α ⇒* ε. In this case, we should again expand A by α if the current
input symbol is in FOLLOW(A), or if the $ on the input has been reached and
$ is in FOLLOW(A).

Algorithm 4.4. Construction of a predictive parsing table.

Input. Grammar G.

Output. Parsing table M.

Method.

1. For each production A → α of the grammar, do steps 2 and 3.

2. For each terminal a in FIRST(α), add A → α to M[A, a].

3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in
   FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α
   to M[A, $].

4. Make each undefined entry of M be error.
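Continuing the Python sketch above, Algorithm 4.4 becomes a short loop over the productions; grammar, first_of, and FOLLOW are the objects computed earlier, and the conflict check reports a multiply-defined entry, anticipating the LL(1) discussion below.

# Algorithm 4.4: build the predictive parsing table M[A, a].
# Uses grammar, first_of, and FOLLOW from the previous sketch.
def build_table(grammar, FOLLOW):
    M = {}
    def add(A, a, body):
        if (A, a) in M and M[(A, a)] != body:
            raise ValueError("not LL(1): conflict at M[%s, %s]" % (A, a))
        M[(A, a)] = body
    for A, bodies in grammar.items():
        for body in bodies:
            fs = first_of(body)
            for a in fs - {'e'}:               # step 2
                add(A, a, body)
            if 'e' in fs:                      # step 3 ($ is already in FOLLOW sets)
                for b in FOLLOW[A]:
                    add(A, b, body)
    return M                                   # undefined entries are errors (step 4)

M = build_table(grammar, FOLLOW)
print(M[('E', 'id')])      # ['T', "E'"], i.e. E -> T E'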



Example 4.18. Let us apply Algorithm 4.4 to grammar (4.11). Since
FIRST(TE') = FIRST(T) = {(, id}, production E → TE' causes M[E, (] and
M[E, id] to acquire the entry E → TE'.
Production E' → +TE' causes M[E', +] to acquire E' → +TE'. Production
E' → ε causes M[E', )] and M[E', $] to acquire E' → ε, since
FOLLOW(E') = {), $}.

The parsing table produced by Algorithm 4.4 for grammar (4.11) was
shown in Fig. 4.15.    □

LL(1) Grammars
Algorithm 4.4 can be applied to any grammar G to produce a parsing table M.
For some grammars, however, M may have some entries that are multiply
defined. For example, if G is left recursive or ambiguous, then M will have at
least one multiply-defined entry.

Example 4.19. Let us consider grammar (4.13) from Example 4.10 again; it

is repeated here for convenience.

    S  → iEtSS' | a
    S' → eS | ε
    E  → b

The parsing table for this grammar is shown in Fig. 4.17.

(Fig. 4.17, the parsing table for grammar (4.13), is not reproduced here. Its entry M[S', e] is multiply defined, since both S' → eS and S' → ε apply.) A grammar whose parsing table has no multiply-defined entries is said to be LL(1); the first "L" stands for scanning the input from left to right, the second "L" for producing a leftmost derivation, and the "1" for using one input symbol of lookahead at each step to make parsing action
decisions. It can be shown that Algorithm 4.4 produces for every LL(1)
grammar G a parsing table that parses all and only the sentences of G.
LL(1) grammars have several distinctive properties. No ambiguous or left-
recursive grammar can be LL(1). It can also be shown that a grammar G is
LL(1) if and only if whenever A → α | β are two distinct productions of G the
following conditions hold:

1. For no terminal a do both α and β derive strings beginning with a.

2. At most one of α and β can derive the empty string.

3. If β ⇒* ε, then α does not derive any string beginning with a terminal in
   FOLLOW(A).
Clearly, grammar (4.11) for arithmetic expressions is LL(1). Grammar
(4.13), modeling if-then-else statements, is not.
There remains the question of what should be done when a parsing table
has multiply-defined entries. One recourse is to transform the grammar by
eliminating all left recursion and then left factoring whenever possible, hoping
to produce a grammar for which the parsing table has no multiply-defined
entries. Unfortunately, there are some grammars for which no amount of
alteration will yield an LL(1) grammar. Grammar (4.13) is one such exam-
ple; its language has no LL(1) grammar at all. As we saw, we can still parse
(4.13) with a predictive parser by arbitrarily making M[S', e] = {S' → eS}. In
general, there are no universal rules by which multiply-defined entries can be
made single-valued without affecting the language recognized by the parser.
The main difficulty in using predictive parsing is in writing a grammar for
the source language such that a predictive parser can be constructed from the
grammar. Although left-recursion elimination and left factoring are easy to
do, they make the resulting grammar hard to read and difficult to use for
translation purposes. To alleviate some of this difficulty, a common organiza-
tion for a parser in a compiler is to use a predictive parser for control con-
structs and to use operator precedence (discussed in Section 4.6) for expres-
sions. However, if an LR parser generator, as discussed in Section 4.9, is
available, one can get all the benefits of predictive parsing and operator pre-
cedence automatically.

Error Recovery in Predictive Parsing

The stack of a nonrecursive predictive parser makes explicit the terminals and
nonterminals that the parser hopes to match with the remainder of the input.
We shall therefore refer to symbols on the parser stack in the following dis-
cussion. An error is detected during predictive parsing when the terminal on
top of the stack does not match the next input symbol or when nonterminal A
is on top of the stack, a is the next input symbol, and the parsing table entry
M[A, a\ is empty.

Panic-mode error recovery is based on the idea of skipping symbols on the


input until a token in a selected set of synchronizing tokens appears. Its

effectiveness depends on the choice of synchronizing set. The sets should be


chosen so that the parser recovers quickly from errors that are likely to occur
in practice. Some heuristics are as follows:

1. As a starting point, we can place all symbols in FOLLOW(A) into the


synchronizing set for nonterminal A. If we skip tokens until an element
of FOLLOW(A) is seen and pop A from the stack, it is likely that parsing
can continue.

2. It is not enough to use FOLLOW(A) as the synchronizing set for A. For


example, if semicolons terminate statements, as in C, then keywords that
begin statements may not appear in the FOLLOW set of the nonterminal
generating expressions. A missing semicolon after an assignment may
therefore result in the keyword beginning the next statement being
skipped. Often, there is a hierarchical structure on constructs in a
language; e.g., expressions appear within statements, which appear within
   blocks, and so on. We can add to the synchronizing set of a lower con-
   struct the symbols that begin higher constructs. For example, we might
   add keywords that begin statements to the synchronizing sets for the non-
   terminals generating expressions.

3. If we add symbols in FIRST(A) to the synchronizing set for nonterminal


A, then it may be possible to resume parsing according to A if a symbol in

   FIRST(A) appears in the input.

4. If a nonterminal can generate the empty string, then the production deriv-
ing e can be used as a default. Doing so may postpone some error detec-
tion, but cannot cause an error to be missed. This approach reduces the
number of nonterminals that have to be considered during error recovery.

5. If a terminal on top of the stack cannot be matched, a simple idea is to


pop the terminal, issue a message saying that the terminal was inserted,
and continue parsing. In effect, this approach takes the synchronizing set

of a token to consist of all other tokens.

Example 4.20. Using FOLLOW and FIRST symbols as synchronizing tokens


works reasonably well when expressions are parsed according to grammar
(4.11). The parsing table for this grammar in Fig. 4.15 is repeated in Fig.
4.18, with "synch" indicating synchronizing tokens obtained from the FOL-
LOW set of the nonterminal in question. The FOLLOW sets for the nonter-
minal are obtained from Example 4.17.
The table in Fig. 4.18 is to be used as follows. If the parser looks up entry
M[A, a\ and finds that it is blank, then the input symbol a is skipped. If the
entry is synch, then the nonterminal on top of the stack popped in an is

attempt to resume parsing. If a token on top of the stack does not match the
input symbol, then we pop the token from the stack, as mentioned above.
On the erroneous input ) id * + id the parser and error recovery mechanism
of Fig. 4.18 behave as in Fig. 4. 19. D
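The following Python sketch layers this recovery scheme on the table-driven parser given earlier, reusing M, FOLLOW, and NONTERMINALS from the previous sketches; the recovery policy is a simplification of the heuristics above, and the sample input models a missing operand.

# Panic-mode recovery added to the nonrecursive predictive parser.
# Entries in FOLLOW(A) act as "synch" entries; a simplified sketch.
def parse_with_recovery(tokens, M, FOLLOW, start='E'):
    tokens = tokens + ['$']
    stack = ['$', start]
    ip = 0
    while stack[-1] != '$':
        X, a = stack[-1], tokens[ip]
        if X not in NONTERMINALS:                  # terminal on top of the stack
            if X == a:
                stack.pop(); ip += 1
            else:                                  # mismatch: "insert" the terminal
                print("error: missing", X); stack.pop()
        elif (X, a) in M:                          # normal expansion
            stack.pop(); stack.extend(reversed(M[(X, a)]))
        elif a in FOLLOW[X]:                       # synch entry: pop the nonterminal
            print("error: synch, popping", X); stack.pop()
        else:                                      # blank entry: skip the input symbol
            print("error: skipping", a); ip += 1

# Missing operand between * and +; recovery pops F on the synch entry M[F, +].
parse_with_recovery(['id', '*', '+', 'id'], M, FOLLOW)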

(Fig. 4.18, the parsing table with its "synch" entries, and Fig. 4.19, the parser's moves on the erroneous input, are not reproduced here.)

In any event, we must be sure that there is no possibility of an infinite loop.


Checking that any recovery action eventually results in an input symbol being
consumed (or the stack being shortened if the end of the input has been
reached) is a good way to protect against such loops.

4.5 BOTTOM-UP PARSING


In this section, we introduce a general style of bottom-up syntax analysis,
known as shift-reduce parsing. An easy-to-implement form of shift-reduce
parsing, called operator-precedence parsing, is presented in Section 4.6. A
much more general method of shift-reduce parsing, called LR parsing, is dis-
cussed in Section 4.7. LR parsing is used in a number of automatic parser
generators.
Shift-reduce parsing attempts to construct a parse tree for an input string
beginning at the leaves (the bottom) and working up towards the root (the
top). We can think of this process as one of "reducing" a string w to the start
symbol of a grammar. At each reduction step a particular substring matching
the right side of a production is replaced by the symbol on the left of that pro-
duction, and if the substring is chosen correctly at each step, a rightmost
derivation is traced out in reverse.

Example 4.21. Consider the grammar

    S → aABe
    A → Abc | b
    B → d

The sentence abbcde can be reduced to S by the following steps:

    abbcde
    aAbcde
    aAde
    aABe
    S

We scan abbcde looking for a substring that matches the right side of some
production. The substrings b and d qualify. Let us choose the leftmost b and
replace it by A, the left side of the production A → b; we thus obtain the string
aAbcde. Now the substrings Abc, b, and d match the right side of some pro-
duction. Although b is the leftmost substring that matches the right side of
some production, we choose to replace the substring Abc by A, the left side of
the production A → Abc. We now obtain aAde. Then replacing d by B, the
left side of the production B → d, we obtain aABe. We can now replace this
entire string by S. Thus, by a sequence of four reductions we are able to
reduce abbcde to S. These reductions, in fact, trace out the following right-
most derivation in reverse:

    S ⇒rm aABe ⇒rm aAde ⇒rm aAbcde ⇒rm abbcde                 □

Handles

Informally, a "handle" of a string is a substring that matches the right side of

a production, and whose reduction to the nonterminal on the left side of the
production represents one step along the reverse of a rightmost derivation. In
many cases the leftmost substring β that matches the right side of some pro-
duction A → β is not a handle, because a reduction by the production A → β
yields a string that cannot be reduced to the start symbol. In Example 4.21, if
we replaced b by A in the second string aAbcde we would obtain the string
aAAcde that cannot be subsequently reduced to S. For this reason, we must
give a more precise definition of a handle.
Formally, a handle of a right-sentential form γ is a production A → β and a
position of γ where the string β may be found and replaced by A to produce
the previous right-sentential form in a rightmost derivation of γ. That is, if

    S ⇒*rm αAw ⇒rm αβw

then A → β in the position following α is a handle of αβw. The string w to
the right of the handle contains only terminal symbols. Note we say "a
handle" rather than "the handle" because the grammar could be ambiguous,
with more than one rightmost derivation of αβw. If a grammar is unambi-
guous, then every right-sentential form of the grammar has exactly one handle.
In the example above, abbcde is a right-sentential form whose handle is
A → b at position 2. Likewise, aAbcde is a right-sentential form whose handle
is A → Abc at position 2. Sometimes we say "the substring β is a handle of
αβw" if the position of β and the production A → β we have in mind are
clear.
Figure 4.20 portrays the handle A → β in the parse tree of a right-sentential
form αβw. The handle represents the leftmost complete subtree consisting of
a node and all its children. In Fig. 4.20, A is the bottommost leftmost interior
node with all its children in the tree. Reducing β to A in αβw can be thought
of as "pruning the handle," that is, removing the children of A from the parse
tree.

Example 4.22. Consider the following grammar

    (1) E → E + E
    (2) E → E * E                                              (4.16)
    (3) E → ( E )
    (4) E → id

and the rightmost derivation

    E ⇒rm E + E
      ⇒rm E + E * E
      ⇒rm E + E * id3
      ⇒rm E + id2 * id3
      ⇒rm id1 + id2 * id3


Fig. 4.20. The handle A → β in the parse tree for αβw.

We have subscripted the id's for notational convenience. For example, id1 is
a handle of the right-sentential form id1 + id2 * id3 because id is the right side
of the production E → id, and replacing id1 by E produces the previous right-
sentential form E + id2 * id3. Note that the string appearing to the right of a
handle contains only terminal symbols.
Because grammar (4.16) is ambiguous, there is another rightmost derivation
of the same string:

    E ⇒rm E * E
      ⇒rm E * id3
      ⇒rm E + E * id3
      ⇒rm E + id2 * id3
      ⇒rm id1 + id2 * id3

Consider the right-sentential form E + E * id3. In this derivation, E + E is
a handle of E + E * id3, whereas id3 by itself is a handle of this same right-
sentential form according to the derivation above.
The two rightmost derivations in this example are analogs of the two left-
most derivations in Example 4.6. The first derivation gives * a higher pre-
cedence than +, whereas the second gives + the higher precedence.

Handle Pruning

A rightmost derivation in reverse can be obtained by "handle pruning." That


is, we start with a string of terminals w that we wish to parse. If w is a sen-
tence of the grammar at hand, then w = γn, where γn is the nth right-sentential
form of some as yet unknown rightmost derivation

    S = γ0 ⇒rm γ1 ⇒rm γ2 ⇒rm ··· ⇒rm γn-1 ⇒rm γn = w

To reconstruct this derivation in reverse order, we locate the handle βn in γn
and replace βn by the left side of some production An → βn to obtain the
(n-1)st right-sentential form γn-1. Note that we do not yet know how han-
dles are to be found, but we shall see methods of doing so shortly.
We then repeat this process. That is, we locate the handle βn-1 in γn-1
and reduce this handle to obtain the right-sentential form γn-2. If by continu-
ing this process we produce a right-sentential form consisting only of the start
symbol S, then we halt and announce successful completion of parsing. The
reverse of the sequence of productions used in the reductions is a rightmost
derivation for the input string.

Example 4.23. Consider the grammar (4.16) of Example 4.22 and the input
string id1 + id2 * id3. The sequence of reductions shown in Fig. 4.21
reduces id1 + id2 * id3 to the start symbol E. The reader should observe that
the sequence of right-sentential forms in this example is just the reverse of the
sequence in the first rightmost derivation in Example 4.22.    □

(Fig. 4.21, the sequence of right-sentential forms and the handles reduced at each step, is not reproduced here.)

side of the appropriate production. The parser repeats this cycle until it has
detected an error or until the stack contains the start symbol and the input is

empty:

    Stack        Input
    $S           $

After entering this configuration, the parser halts and announces successful
completion of parsing.

Example 4.24. Let us step through the actions a shift-reduce parser might
make in parsing the input string id1 + id2 * id3 according to grammar (4.16),
using the first derivation of Example 4.22. The sequence is shown in Fig.
4.22. Note that because grammar (4.16) has two rightmost derivations for
this input there is another sequence of steps a shift-reduce parser might take.

(Fig. 4.22, the stack contents, remaining input, and action at each step of this parse, is not reproduced here.)
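The stack mechanics of such a parse can be replayed in a few lines of Python; the shift/reduce decisions below are simply the ones corresponding to the first derivation of Example 4.22, since choosing them automatically is the business of the table-driven methods of Section 4.7.

# Replaying the shift-reduce parse of id1 + id2 * id3 (grammar (4.16)).
# The shift/reduce decisions are given explicitly here; an LR table
# (Section 4.7) is what would normally supply them.
actions = [
    ('shift', None), ('reduce', ('E', ['id'])),             # id1 -> E
    ('shift', None),                                         # +
    ('shift', None), ('reduce', ('E', ['id'])),             # id2 -> E
    ('shift', None),                                         # *
    ('shift', None), ('reduce', ('E', ['id'])),             # id3 -> E
    ('reduce', ('E', ['E', '*', 'E'])),
    ('reduce', ('E', ['E', '+', 'E'])),
]

tokens = ['id', '+', 'id', '*', 'id', '$']
stack, ip = ['$'], 0
for kind, prod in actions:
    if kind == 'shift':
        stack.append(tokens[ip]); ip += 1
    else:
        lhs, rhs = prod
        assert stack[-len(rhs):] == rhs        # the handle is on top of the stack
        del stack[-len(rhs):]
        stack.append(lhs)
    print(''.join(stack).ljust(12), ''.join(tokens[ip:]))
print('accept' if stack == ['$', 'E'] and tokens[ip] == '$' else 'error')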

There is an important fact that justifies the use of a stack in shift-reduce


parsing: the handle will always eventually appear on top of the stack, never
inside. This fact becomes obvious when we consider the possible forms of two
successive steps in any rightmost derivation. These two steps can be of the
form

    (1)  S ⇒*rm αAz ⇒rm αβByz ⇒rm αβγyz

    (2)  S ⇒*rm αBxAz ⇒rm αBxyz ⇒rm αγxyz

In case (1), A is replaced by βBy, and then the rightmost nonterminal B in
that right side is replaced by γ. In case (2), A is again replaced first, but this
time the right side is a string y of terminals only. The next rightmost nonter-
minal B will be somewhere to the left of y.
Let us consider case (1) in reverse, where a shift-reduce parser has just
reached the configuration

    Stack          Input
    $αβγ           yz$

The parser now reduces the handle γ to B to reach the configuration

    Stack          Input
    $αβB           yz$

Since B is the rightmost nonterminal in αβByz, the right end of the handle of
αβByz cannot occur inside the stack. The parser can therefore shift the string
y onto the stack to reach the configuration

    Stack          Input
    $αβBy          z$

in which βBy is the handle, and it gets reduced to A.

In case (2), in configuration

    Stack          Input
    $αγ            xyz$

the handle γ is on top of the stack. After reducing the handle γ to B, the
parser can shift the string xy to get the next handle y on top of the stack:

    Stack          Input
    $αBxy          z$

Now the parser reduces y to A.


In both cases, after making a reduction the parser had to shift zero or more
symbols to get the next handle onto the stack. It never had to go into the
stack to find the handle. It is this aspect of handle pruning that makes a stack
a particularly convenient data structure for implementing a shift-reduce
parser. We still must explain how choices of action are to be made so the
shift-reduce parser works correctly. Operator precedence and LR parsers are
two such techniques that we shall discuss shortly.

Viable Prefixes

The set of prefixes of right sentential forms that can appear on the stack of a
shift-reduce parser are called viable prefixes. An equivalent definition of a
viable prefix is that it is a prefix of a right-sentential form that does not con-
tinue past the right end of the rightmost handle of that sentential form. By
this definition, it is always possible to add terminal symbols to the end of a

viable prefix to obtain a right-sentential form. Therefore, there is apparently


no error as long as the portion of the input seen to a given point can be
reduced to a viable prefix.

Conflicts During Shift-Reduce Parsing

There are context-free grammars for which shift-reduce parsing cannot be


used. Every shift-reduce parser for such a grammar can reach a configuration
in which the parser, knowing the entire stack contents and the next input sym-

bol, cannot decide whether to shift or to reduce (a shift/reduce conflict), or


cannot decide which of several reductions to make (a reduce/ reduce conflict).
We now give some examples of syntactic constructs that give rise to such
grammars. Technically, these grammars are not in the LR(k) class of gram-
mars defined in Section 4.7; we refer to them as non-LR grammars. The k in
LR(k) refers to the number of symbols of lookahead on the input. Grammars
used in compiling usually fall in the LR(1) class, with one symbol of lookahead.

Example 4.25. An ambiguous grammar can never be LR. For example, con-
sider the dangling-else grammar (4.7) of Section 4.3:

    stmt → if expr then stmt
         |  if expr then stmt else stmt
         |  other

If we have a shift-reduce parser in configuration

    Stack                          Input
    $ ··· if expr then stmt        else ··· $

we cannot tell whether if expr then stmt is the handle, no matter what appears

below it on the stack. Here there is a shift/reduce conflict. Depending on


what follows the else on the input, it might be correct to reduce
if expr then stmt to stmt, or it might be correct to shift else and then to look

for another stmt to complete the alternative if expr then stmt else stmt. Thus,
we cannot tell whether to shift or reduce in this case, so the grammar is not
LR(1). More generally, no ambiguous grammar, as this one certainly is, can
be LR(k) for any k.

We should mention, however, that shift-reduce parsing can be easily


adapted to parse certain ambiguous grammars, such as the if-then-else gram-
mar above. When we construct such a parser for a grammar containing the
two productions above, there will be a shift/reduce conflict: on else, either
shift, or reduce by stmt → if expr then stmt. If we resolve the conflict in favor

of shifting, the parser will behave naturally. We discuss parsers for such
ambiguous grammars in Section 4.8.

Another common cause of non-LR-ness occurs when we know we have a


handle, but the stack contents and the next input symbol are not sufficient to
determine which production should be used in a reduction. The next example
illustrates this situation.

Example 4.26. Suppose we have a lexical analyzer that returns token id for
all identifiers, regardless of usage. Suppose also that our language invokes
procedures by giving their names, with parameters surrounded by parentheses,
and that arrays are referenced by the same syntax. Since the translation of
indices in array references and parameters in procedure calls are different, we
want to use different productions to generate lists of actual parameters and
indices. Our grammar might therefore have (among others) productions such
as:

    (1) stmt → id ( parameter_list )
    (2) stmt → expr := expr
    (3) parameter_list → parameter_list , parameter
    (4) parameter_list → parameter
    (5) parameter → id
    (6) expr → id ( expr_list )
    (7) expr → id
    (8) expr_list → expr_list , expr
    (9) expr_list → expr

A statement beginning with A{I,J) would appear as the token stream


id(id, id) to the parser. After shifting the first three tokens onto the stack, a
shift-reduce parser would be in configuration

    Stack              Input
    ··· id ( id        , id ) ···

It is evident that the id on top of the stack must be reduced, but by which pro-
duction? The correct choice is production (5) if A is a procedure and produc-
tion (7) if A is an array. The stack does not tell which; information in the
symbol table obtained from the declaration of A has to be used.
One solution is to change the token id in production (1) to procid and to

use a more sophisticated lexical analyzer that returns token procid when it

recognizes an identifier which is the name of a procedure. Doing this would


require the lexical analyzer to consult the symbol table before returning a
token.
If we made this modification, then on processing A(I, J) the parser would
be either in the configuration

    Stack                  Input
    ··· procid ( id        , id ) ···


or in the configuration above. In the former case, we choose reduction by
production (5); in the latter case by production (7). Notice how the symbol
third from the top of the stack determines the reduction to be made, even
though it is not involved in the reduction. Shift-reduce parsing can utilize
information far down in the stack to guide the parse.    □

4.6 OPERATOR-PRECEDENCE PARSING


The largest class of grammars for which shift-reduce parsers can be built suc-
cessfully - the LR grammars - will be discussed in Section 4.7. However, for
a small but important class of grammars we can easily construct efficient
shift-reduce parsers by hand. These grammars have the property (among
other essential requirements) that no production right side is e or has two
adjacent nonterminals. A grammar with the latter property is called an
operator grammar.

Example 4.27. The following grammar for expressions

    E → E A E | ( E ) | -E | id
    A → + | - | * | / | ↑

is not an operator grammar, because the right side EAE has two (in fact three)
consecutive nonterminals. However, if we substitute for A each of its alterna-
tives, we obtain the following operator grammar:

    E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id            (4.17)

We now describe an easy-to-implement parsing technique called operator-


precedence parsing. Historically, the technique was first described as a mani-
pulation on tokens without any reference to an underlying grammar. In fact,
once we finish building an operator-precedence parser from a grammar, we
may effectively ignore the grammar, using the nonterminals on the stack only
as placeholders for attributes associated with the nonterminals.
As a general parsing technique, operator-precedence parsing has a number
of disadvantages. For example, it is hard to handle tokens like the minus
sign, which has two different precedences (depending on whether it is unary
or binary). Worse, since the relationship between a grammar for the language
being parsed and the operator-precedence parser itself is tenuous, one cannot
always be sure the parser accepts exactly the desired language. Finally, only a
small class of grammars can be parsed using operator-precedence techniques.
Nevertheless, because of its simplicity, numerous compilers using operator-

precedence parsing techniques for expressions have been built successfully.


Often these parsers use recursive descent, described in Section 4.4, for state-
ments and higher-level constructs. Operator-precedence parsers have even
been built for entire languages.
In operator-precedence parsing, we define three disjoint precedence rela-
tions, <•, =, and •>, between certain pairs of terminals. These precedence
relations guide the selection of handles and have the following meanings:

    a <· b      a "yields precedence to" b
    a =  b      a "has the same precedence as" b
    a ·> b      a "takes precedence over" b

pair of terminals and between the endmost terminals and the $'s marking the
ends of the string. For example, suppose we initially have the right-sentential
form id + id * id and the precedence relations are those given in Fig. 4.23.
These relations are some of those that we would choose to parse according to
grammar (4.17).

the stack of a shift-reduce parser to indicate placeholders for attribute values.


It may appear from the discussion above that the entire right-sentential form

must be scanned at each step to find the handle. Such is not the case if we
use a stack to store the input symbols already seen and if the precedence rela-
tions are used to guide the actions of a shift-reduce parser. If the precedence
relation <• or = holds between the topmost terminal symbol on the stack and
the next input symbol, the parser shifts; it has not yet found the right end of
the handle. If the relation ·> holds, a reduction is called for. At this point
the parser has found the right end of the handle, and the precedence relations
can be used to find the left end of the handle in the stack.
If no precedence relation holds between a pair of terminals (indicated by a
blank entry in Fig. 4.23), then a syntactic error has been detected and an
error recovery routine must be invoked, as discussed later in this section. The
error recovery routine must be invoked, as discussed later in this section. The
above ideas can be formalized by the following algorithm.

Algorithm 4.5. Operator-precedence parsing algorithm.

Input. An input string w and a table of precedence relations.

Output. If w is well formed, a skeletal parse tree, with a placeholder nonter-


minal E labeling all interior nodes; otherwise, an error indication.

Method. Initially, the stack contains $ and the input buffer the string w$. To
parse, we execute the program of Fig. 4.24.

    (1)  set ip to point to the first symbol of w$;
    (2)  repeat forever
    (3)      if $ is on top of the stack and ip points to $ then
    (4)          return
             else begin
    (5)          let a be the topmost terminal symbol on the stack
                     and let b be the symbol pointed to by ip;
    (6)          if a <· b or a = b then begin
    (7)              push b onto the stack;
    (8)              advance ip to the next input symbol
                 end
    (9)          else if a ·> b then    /* reduce */
    (10)             repeat
    (11)                 pop the stack
    (12)             until the top stack terminal is related by <·
                         to the terminal most recently popped
    (13)         else error()
             end

Fig. 4.24. Operator-precedence parsing algorithm.
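A runnable Python rendering of Fig. 4.24 follows; the relation table is the fragment of Fig. 4.23 for id, +, * and $, and, as a simplification, nonterminal placeholders are not kept on the stack.

# Operator-precedence parsing (Fig. 4.24) for expressions over id, +, *.
# '<', '=', '>' stand for the relations <., =, .> respectively.
REL = {
    ('$', 'id'): '<', ('$', '+'): '<', ('$', '*'): '<',
    ('id', '+'): '>', ('id', '*'): '>', ('id', '$'): '>',
    ('+', 'id'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '$'): '>',
    ('*', 'id'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '$'): '>',
}

def parse(tokens):
    tokens = tokens + ['$']
    stack = ['$']                 # only terminals kept; placeholders omitted
    ip = 0
    while True:
        if stack == ['$'] and tokens[ip] == '$':
            return                                   # accept
        a, b = stack[-1], tokens[ip]                 # topmost terminal, lookahead
        rel = REL.get((a, b))
        if rel in ('<', '='):                        # shift
            stack.append(b); ip += 1
        elif rel == '>':                             # reduce
            popped = []
            while True:
                popped.append(stack.pop())
                if REL.get((stack[-1], popped[-1])) == '<':
                    break
            print('reduce handle:', ' '.join(reversed(popped)))
        else:
            raise SyntaxError("no relation between %r and %r" % (a, b))

parse(['id', '+', 'id', '*', 'id'])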



Operator-Precedence Relations from Associativity and Precedence

We are always free to create operator-precedence relations any way we see fit

and hope that the operator-precedence parsing algorithm will work correctly
when guided by them. For a language of arithmetic expressions such as that
generated by grammar (4.17) we can use the following heuristic to produce a
proper set of precedence relations. Note that grammar (4.17) is ambiguous,
and right-sentential forms could have many handles. Our rules are designed
to select the "proper" handles to reflect a given set of associativity and pre-
cedence rules for binary operators.

1. If operator θ1 has higher precedence than operator θ2, make θ1 ·> θ2 and
   θ2 <· θ1. For example, if * has higher precedence than +, make
   * ·> + and + <· *. These relations ensure that, in an expression of the
   form E+E*E+E, the central E*E is the handle that will be reduced
   first.

2. If θ1 and θ2 are operators of equal precedence (they may in fact be the
   same operator), then make θ1 ·> θ2 and θ2 ·> θ1 if the operators are
   left-associative, or make θ1 <· θ2 and θ2 <· θ1 if they are right-
   associative. For example, if + and - are left-associative, then make
   + ·> +, + ·> -, - ·> -, and - ·> +. If ↑ is right-associative, then
   make ↑ <· ↑. These relations ensure that E - E + E will have the handle
   E - E selected and E ↑ E ↑ E will have the last E ↑ E selected.

3. Make θ <· id, id ·> θ, θ <· (, ( <· θ, ) ·> θ, θ ·> ), θ ·> $, and
   $ <· θ for all operators θ. Also, let

       ( = )        $ <· (       ( <· (       ) ·> $
       ( <· id      id ·> )      $ <· id      id ·> $
       ) ·> )
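These rules mechanize easily; the Python sketch below derives the <·, =, ·> relations for binary operators from a table of precedence levels and associativities (the operators and levels shown are those of grammar (4.17), with parentheses and unary minus omitted for brevity).

# Building operator-precedence relations from precedence and associativity
# (rules 1-3 above).  '^' stands for the exponentiation operator.
PREC  = {'+': 1, '-': 1, '*': 2, '/': 2, '^': 3}
RIGHT = {'^'}                                       # right-associative operators

def build_relations():
    rel = {}
    ops = list(PREC)
    for t1 in ops:
        for t2 in ops:
            if PREC[t1] > PREC[t2]:
                rel[(t1, t2)] = '>'                 # rule 1
            elif PREC[t1] < PREC[t2]:
                rel[(t1, t2)] = '<'
            else:                                   # rule 2: equal precedence
                rel[(t1, t2)] = '<' if t1 in RIGHT else '>'
    for t in ops:                                   # rule 3: id and $
        rel[(t, 'id')] = '<'; rel[('id', t)] = '>'
        rel[(t, '$')]  = '>'; rel[('$', t)]  = '<'
    rel[('$', 'id')] = '<'; rel[('id', '$')] = '>'
    return rel

rel = build_relations()
print(rel[('+', '*')], rel[('*', '+')], rel[('^', '^')])    # <  >  <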
Precedence Functions

Compilers using operator-precedence parsers need not store the table of precedence relations. In most cases, the table can be encoded by two precedence functions f and g that map terminal symbols to integers. We attempt to select f and g so that, for symbols a and b,

1. f(a) < g(b) whenever a <· b,
2. f(a) = g(b) whenever a = b,
3. f(a) > g(b) whenever a ·> b.

Thus the precedence relation between a and b can be determined by a

numerical comparison between f(a) and g(b). Note, however, that error
entries in the precedence matrix are obscured, since one of (1), (2), or (3)
holds no matter what f(a) and g(b) are. The loss of error detection capabil-
ity is generally not considered serious enough to prevent the use of pre-
cedence functions where possible; errors can still be caught when a reduction
is called for and no handle can be found.

Not every table of precedence relations has precedence functions to encode
it, but in practical cases the functions usually exist.

Example 4.29. The precedence table of Fig. 4.25 has the following pair of
precedence functions.

beginning at the group of f_a; let g(a) be the length of the longest path
from the group of g_a.    □

Example 4.30. Consider the matrix of Fig. 4.23. There are no = relation-
ships, so each symbol is in a group by itself. Figure 4.26 shows the graph
constructed using Algorithm 4.6.

Fig. 4.26. Graph representing precedence functions.

There are no cycles, so precedence functions exist. As f_$ and g_$ have no
out-edges, f($) = g($) = 0. The longest path from g_+ has length 1, so
g(+) = 1. There is a path from g_id to f_* to g_* to f_+ to g_+ to f_$, so
g(id) = 5. The resulting precedence functions are:

             +    *    id   $
        f    2    4    4    0
        g    1    3    5    0
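The longest-path computation can be sketched in a few lines of Python; the relation table is again the id/+/*/$ fragment, and the edge directions (from g_b to f_a when a <· b, from f_a to g_b when a ·> b) are the ones consistent with the paths traced in Example 4.30.

# Precedence functions by longest paths (Algorithm 4.6) for the
# relations over id, +, * and $.  Groups are trivial here because
# there are no = relationships.  A sketch only.
REL = {
    ('id', '+'): '>', ('id', '*'): '>', ('id', '$'): '>',
    ('+', 'id'): '<', ('+', '+'): '>', ('+', '*'): '<', ('+', '$'): '>',
    ('*', 'id'): '<', ('*', '+'): '>', ('*', '*'): '>', ('*', '$'): '>',
    ('$', 'id'): '<', ('$', '+'): '<', ('$', '*'): '<',
}
terminals = ['id', '+', '*', '$']

edges = {('f', a): [] for a in terminals}
edges.update({('g', a): [] for a in terminals})
for (a, b), r in REL.items():
    if r == '<':
        edges[('g', b)].append(('f', a))      # a <. b : g_b must exceed f_a
    elif r == '>':
        edges[('f', a)].append(('g', b))      # a .> b : f_a must exceed g_b

def longest(node, visiting=()):
    if node in visiting:
        raise ValueError("cycle: no precedence functions exist")
    return max((1 + longest(n, visiting + (node,)) for n in edges[node]),
               default=0)

f = {a: longest(('f', a)) for a in terminals}
g = {a: longest(('g', a)) for a in terminals}
print(f)    # {'id': 4, '+': 2, '*': 4, '$': 0}
print(g)    # {'id': 5, '+': 1, '*': 3, '$': 0}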

Error Recovery in Operator-Precedence Parsing

There are two points in the parsing process at which an operator-precedence
parser can discover a syntactic error: (1) if no precedence relation holds
between the terminal on top of the stack and the current input symbol, and
(2) if a handle has been found, but there is no production with this handle as
a right side. Although nonterminals are treated anonymously, they still have
places held for them on the parsing stack. Thus when we talk in (2) above
about a handle matching a


production's right side, we mean that the terminals are the same and the posi-
tions occupied by nonterminals are the same.
We should observe that, besides (1) and (2) above, there are no other
points at which errors could be detected. When scanning down the stack to
find the left end of the handle in steps (10-12) of the operator-precedence
parsing algorithm of Fig. 4.24, we are sure to find a <· relation, since $ marks
the bottom of the stack and is related by <· to any symbol that could appear

immediately above it on the stack. Note also that we never allow adjacent
symbols on the stack in Fig. 4.24 unless they are related by <• or =. Thus
steps (10-12) must succeed in making a reduction.
Just because we find a sequence of symbols a <· b1 = b2 = ··· = bk on
the stack, however, does not mean that b1 b2 ··· bk is the string of terminal

symbols on the right side of some production. We did not check for this con-
dition in Fig. 4.24, but we clearly can do so, and in fact we must do so if we
wish to associate semantic rules with reductions. Thus we have an opportun-
ity to detect errors in Fig. 4.24, modified at steps (10-12) to determine what
production is the handle in a reduction.

Handling Errors During Reductions

We may divide the error detection and recovery routine into several pieces.
One piece handles errors of type (2). For example, this routine might pop
symbols off the stack just as in steps (10-12) of Fig. 4.24. However, as there
is no production to reduce by, no semantic actions are taken; a diagnostic mes-

sage is printed instead. To determine what the diagnostic should say, the rou-
tine handling case (2) must decide what production the right side being
popped "looks like." For example, suppose abc is popped, and there is no
production right side consisting of a, b and c together with zero or more non-
terminals. Then we might consider if deletion of one of a, b, and c yields a
legal right side (nonterminals omitted). For example, if there were a right
side aEcE, we might issue the diagnostic

illegal b on line (line containing b)

We might also consider changing or inserting a terminal. Thus if abEdc were


a right side, we might issue a diagnostic
missing d on line (line containing c)

We may also find that there is a right side with the proper sequence of ter-
minals, but the wrong pattern of nonterminals. For example, if abc is popped
off the stack with no intervening or surrounding nonterminals, and abc is not
a right side but aEbc is, we might issue a diagnostic

missing E on line (line containing b)



Here E stands for an appropriate syntactic category represented by nontermi-


nal E. For example, if a, b, or c is an operator, we might say "expression;" if
a is a keyword like if, we might say "conditional."
In general, the difficulty of determining appropriate diagnostics when no
legal right side is found depends upon whether there are a finite or infinite
number of possible strings that could be popped in lines (10-12) of Fig. 4.24.
Any such string b1 b2 ··· bk must have = relations holding between adjacent
symbols, so b1 = b2 = ··· = bk. If an operator-precedence table tells us


that there are only a finite number of sequences of terminals related by =,
then we can handle these strings on a case-by-case basis. For each such string
X we can determine in advance a minimum-distance legal right side y and issue
a diagnostic implying that x was found when y was intended.
It is easy to determine all strings that could be popped from the stack in
steps (10-12) of Fig. 4.24. These are evident in the directed graph whose
nodes represent the terminals, with an edge from a to b if and only if a = b.
Then the possible strings are the labels of the nodes along paths in this graph.
Paths consisting of a single node are possible. However, in order for a path
b1 b2 ··· bk to be "poppable" on some input, there must be a symbol a (pos-
sibly $) such that a <· b1. Call such a b1 initial. Also, there must be a sym-
bol c (possibly $) such that bk ·> c. Call bk final. Only then could a reduction
be called for and b1 b2 ··· bk be the sequence of symbols popped. If the
graph has a path from an initial to a final node containing a cycle, then there
are an infinity of strings that might be popped; otherwise, there are only a fin-
ite number.

Fig. 4.27. Graph for precedence matrix of Fig. 4.25.

Example 4.31. Let us reconsider grammar (4.17):

    E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id

The precedence matrix for this grammar was shown in Fig. 4.25, and its

graph is given in Fig. 4.27. There is only one edge, because the only pair
related by = is the left and right parenthesis. All but the right parenthesis
are initial, and all but the left parenthesis are final. Thus the only paths from
an initial to a final node are the paths +, -, *, /, id, and ↑ of length one, and
the path from ( to ) of length two. There are but a finite number, and each
the path from ( to ) of length two. There are but a finite number, and each
corresponds to the terminals of some production's right side in the grammar.
Thus the error checker for reductions need only check that the proper set of

nonterminal markers appears among the terminal strings being reduced.


Specifically, the checker does the following:

1. If +, -, *, /, or ↑ is reduced, it checks that nonterminals appear on both


sides. If not, it issues the diagnostic

missing operand
2. If id is reduced, it checks that there is no nonterminal to the right or left.
If there is, it can warn

missing operator
3. If ( ) is reduced, it checks that there is a nonterminal between the
parentheses. If not, it can say

no expression between parentheses


Also it must check that no nonterminal appears on either side of the
parentheses. If one does, it issues the same diagnostic as in (2).

If there are an infinity of strings that may be popped, error messages cannot
be tabulated on a case-by-case basis. We might use a general routine to deter-
mine whether some production right side is close (say distance 1 or 2, where
distance is measured in terms of tokens, rather than characters, inserted,

deleted, or changed) to the popped string and if so, issue a specific diagnostic
on the assumption that that production was intended. If no production is close
to the popped string, we can issue a general diagnostic to the effect that
"something is wrong in the current line."

Handling Shift/Reduce Errors

We must now discuss the other way in which the operator-precedence parser
detects errors. When consulting the precedence matrix to decide whether to
shift or reduce (lines (6) and (9) of Fig. 4.24), we may find that no relation
holds between the top stack symbol and the first input symbol. For example,
suppose a and b are the two top stack symbols (b is at the top), c and d are

the next two input symbols, and there is no precedence relation between b
and c. To recover, we must modify the stack, input or both. We may change
symbols, insert symbols onto the input or stack, or delete symbols from the
input or stack. If we insert or change, we must be careful that we do not get
into an infinite loop, where, for example, we perpetually insert symbols at the
beginning of the input without being able to reduce or to shift any of the
inserted symbols.
One approach that will assure us no infinite loops is to guarantee that after
recovery the current input symbol can be shifted (if the current input is $,
guarantee that no symbol is placed on the input, and the stack is eventually
shortened). For example, given ab on the stack and cd on the input, if a ≤ c
(we use ≤ to mean <· or =) we might pop b from the stack. Another choice
is to delete c from the input if b ≤ d. A third choice is to find a symbol e
such that b ≤ e ≤ c and insert e in front of c on the input. More generally,
we might insert a string of symbols e1 e2 ··· en such that

    b ≤ e1 ≤ e2 ≤ ··· ≤ en ≤ c

if a single symbol for insertion could not be found. The exact action chosen
should reflect the compiler designer's intuition regarding what error is likely

in each case.
For each blank entry in the precedence matrix we must specify an error-
recovery routine; the same routine could be used in several places. Then
when the parser consults the entry for a and b in step (6) of Fig. 4.24, and no
precedence relation holds between a and b, it finds a pointer to the error-
recovery routine for this error.

Example 4.32. Consider the precedence matrix of Fig. 4.25 again. In Fig.
4.28, we show the rows and columns of this matrix that have one or more
blank entries, and we have filled in these blanks with the names of error han-
dling routines. (Fig. 4.28, with its blank entries filled in by the names of the
error-handling routines, is not reproduced here.) To see how these routines
work, consider the

erroneous input id + ). The first actions taken by the parser are to shift id,
reduce it to E (we again use E for anonymous nonterminals on the stack), and
then to shift the + . We now have configuration

    Stack        Input
    $E +         ) $

Since + ·> ), a reduction is called for, and the handle is +. The error
checker for reductions is required to inspect for E's to the left and right.
Finding one missing, it issues the diagnostic

    missing operand

and does the reduction anyway. Our configuration is now

    Stack        Input
    $E           ) $

There is no precedence relation between $ and ), and the entry in Fig. 4.28
for this pair of symbols is e2. Routine e2 causes diagnostic

unbalanced right parenthesis


to be printed and removes the right parenthesis from the input. We are now
left with the final configuration for the parser:

    Stack        Input
    $E           $                                              □

4.7 LR PARSERS
This section presents an efficient, bottom-up syntax analysis technique that
can be used to parse a large class of context-free grammars. The technique is
called LR(k) parsing; the "L" is for left-to-right scanning of the input, the
"R" for constructing a rightmost derivation in reverse, and the k for the
number of input symbols of lookahead that are used in making parsing deci-
sions. When (k) is omitted, k is assumed to be 1. LR parsing is attractive for
a variety of reasons.

• LR parsers can be constructed to recognize virtually all programming-


language constructs for which context-free grammars can be written.

• The LR parsing method is the most general nonbacktracking shift-reduce


parsing method known, yet it can be implemented as efficiently as other
shift-reduce methods.

• The class of grammars that can be parsed using LR methods is a proper


superset of the class of grammars that can be parsed with predictive
parsers.

• An LR parser can detect a syntactic error as soon as it is possible to do so


on a left-to-right scan of the input.

The principal drawback of the method is that it is too much work to con-
struct an LR parser by hand for a typical programming-language grammar.
One needs a specialized tool - an LR parser generator. Fortunately, many
such generators are available, and we shall discuss the design and use of one,
Yacc, in Section 4.9. With such a generator, one can write a context-free

grammar and have the generator automatically produce a parser for that
grammar. If the grammar contains ambiguities or other constructs that are
difficult to parse in a left-to-right scan of the input, then the parser generator
can locate these constructs and inform the compiler designer of their presence.
After discussing the operation of an LR parser, we present three techniques
for constructing an LR parsing table for a grammar. The first method, called
simple LR (SLR for short), is the easiest to implement, but the least powerful
of the three. It may fail to produce a parsing table for certain grammars on

which the other methods succeed. The second method, called canonical LR,
is the most powerful, and the most expensive. The third method, called look-
ahead LR (LALR for short), is intermediate in power and cost between the
other two. The LALR method will work on most programming-language
grammars and, with some effort, can be implemented efficiently. Some tech-
niques for compressing the size of an LR parsing table are considered later in
this section.

The LR Parsing Algorithm

The schematic form of an LR parser is shown in Fig. 4.29. It consists of an


input, an output, a stack, a driver program, and a parsing table that has two
parts (action and goto). The driver program is the same for all LR parsers;
only the parsing table changes from one parser to another. The parsing pro-
gram reads characters from an input buffer one at a time. The program uses
a stack to store a string of the form s0 X1 s1 X2 s2 ··· Xm sm, where sm is on
top. Each Xi is a grammar symbol and each si is a symbol called a state.


Each state symbol summarizes the information contained in the stack below it,

and the combination of the state symbol on top of the stack and the current
input symbol are used to index the parsing table and determine the shift-
reduce parsing decision. In an implementation, the grammar symbols need
not appear on the stack; however, we shall always include them in our discus-
sions to help explain the behavior of an LR parser.
The parsing table consists of two parts, a parsing action function action and
a goto function goto. The program driving the LR parser behaves as follows.
It determines sm, the state currently on top of the stack, and ai, the current
input symbol. It then consults action[sm, ai], the parsing action table entry for
state sm and input ai, which can have one of four values:

1. shift s, where s is a state,


2. reduce by a grammar production A → β,
3. accept, and
4. error.

Fig. 4.29. Model of an LR parser: the input buffer, the stack of alternating
states and grammar symbols, the LR parsing program, the parsing table with
its action and goto parts, and the output.

The function goto takes a state and grammar symbol as arguments and pro-
duces a state. We shall see that the goto function of a parsing table con-
structed from a grammar G using the SLR, canonical LR, or LALR method is

the transition function of a deterministic finite automaton that recognizes the


viable prefixes of G. Recall that the viable prefixes of G are those prefixes of
right-sentential forms that can appear on the stack of a shift-reduce parser,
because they do not extend past the rightmost handle. The initial state of this
DFA is the state initially put on top of the LR parser stack.
A configuration of an LR parser is a pair whose first component is the stack
contents and whose second component is the unexpended input:

(s_0 X_1 s_1 X_2 s_2 ... X_m s_m, a_i a_{i+1} ... a_n $)

This configuration represents the right-sentential form

X_1 X_2 ... X_m a_i a_{i+1} ... a_n

in essentially the same way as a shift-reduce parser would; only the presence
of states on the stack is new.
The next move of the parser is determined by reading a_i, the current input
symbol, and s_m, the state on top of the stack, and then consulting the parsing
action table entry action[s_m, a_i]. The configurations resulting after each of
the four types of move are as follows:

1. If action[s_m, a_i] = shift s, the parser executes a shift move, entering the
   configuration

        (s_0 X_1 s_1 X_2 s_2 ... X_m s_m a_i s, a_{i+1} ... a_n $)

   Here the parser has shifted both the current input symbol a_i and the next
   state s, which is given in action[s_m, a_i], onto the stack; a_{i+1} becomes the
   current input symbol.

2. If action[s_m, a_i] = reduce A → β, then the parser executes a reduce
   move, entering the configuration

        (s_0 X_1 s_1 X_2 s_2 ... X_{m-r} s_{m-r} A s, a_i a_{i+1} ... a_n $)

   where s = goto[s_{m-r}, A] and r is the length of β, the right side of the
   production. Here the parser first popped 2r symbols off the stack (r state
   symbols and r grammar symbols), exposing state s_{m-r}. The parser then
   pushed both A, the left side of the production, and s, the entry for
   goto[s_{m-r}, A], onto the stack. The current input symbol is not changed
   in a reduce move. For the LR parsers we shall construct,
   X_{m-r+1} ... X_m, the sequence of grammar symbols popped off the stack,
   will always match β, the right side of the reducing production.

The output of an LR parser is generated after a reduce move by execut-


ing the semantic action associated with the reducing production. For the
time being, we shall assume the output consists of just printing the reduc-
ing production.

3. If action[s_m, a_i] = accept, parsing is completed.

4. If action[s_m, a_i] = error, the parser has discovered an error and calls an
   error recovery routine.

The LR parsing algorithm is summarized below. All LR parsers behave in


this fashion; the only difference between one LR parser and another is the
information in the parsing action and goto fields of the parsing table.

Algorithm 4.7. LR parsing algorithm.

Input. An input string w and an LR parsing table with functions action and
goto for a grammar G.

Output. If w is in L(G), a bottom-up parse for w; otherwise, an error indica-


tion.

Method. Initially, the parser has s_0 on its stack, where s_0 is the initial state,
and w$ in the input buffer. The parser then executes the program in Fig.
4.30 until an accept or error action is encountered. □

Example 4.33. Figure 4.31 shows the parsing action and goto functions of an
LR parsing table for the following grammar for arithmetic expressions with
binary operators + and *:

    (1)  E → E + T
    (2)  E → T
    (3)  T → T * F                                     (4.18)
    (4)  T → F
    (5)  F → ( E )
    (6)  F → id


set ip to point to the first symbol of w$;
repeat forever begin
    let s be the state on top of the stack and
        a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a then s' on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s' be the state now on top of the stack;
        push A then goto[s', A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then
        return
    else error()
end

Fig. 4.30. LR parsing program.
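As a concrete illustration of the program in Fig. 4.30, a minimal Python sketch
of the driver might look as follows. The dictionary-based encoding of the tables
and the names lr_parse, prods, and goto_table are assumptions made for the
sketch, not part of the algorithm itself; following the remark above, only states
are kept on the stack.

    # A sketch of the LR driver of Fig. 4.30 (hypothetical table encoding).
    # action maps (state, terminal) to ('shift', s), ('reduce', p), or
    # ('accept',); missing entries mean error.  goto_table maps
    # (state, nonterminal) to a state.  prods[p] = (head, body_length).

    def lr_parse(tokens, action, goto_table, prods):
        stack = [0]                      # only states; grammar symbols are implicit
        tokens = tokens + ['$']
        ip = 0
        output = []                      # the productions used, i.e., the parse
        while True:
            s, a = stack[-1], tokens[ip]
            act = action.get((s, a))
            if act is None:
                raise SyntaxError('error on %r in state %d' % (a, s))
            if act[0] == 'shift':
                stack.append(act[1])     # push the new state and advance the input
                ip += 1
            elif act[0] == 'reduce':
                head, length = prods[act[1]]
                if length:
                    del stack[-length:]  # pop one state per symbol of the right side
                stack.append(goto_table[(stack[-1], head)])
                output.append(act[1])    # report the reducing production
            else:                        # accept
                return output

With the table of Fig. 4.31 encoded this way, lr_parse on the token list
['id', '*', 'id', '+', 'id'] would report the same sequence of reductions as the
moves of Fig. 4.32.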

State |          action                 |      goto
      |  id    +     *     (     )   $ |   E    T    F
  0   |  s5               s4           |   1    2    3
  1   |        s6                  acc |
  2   |        r2    s7         r2  r2 |
  3   |        r4    r4         r4  r4 |
  4   |  s5               s4           |   8    2    3
  5   |        r6    r6         r6  r6 |
  6   |  s5               s4           |        9    3
  7   |  s5               s4           |             10
  8   |        s6               s11    |
  9   |        r1    s7         r1  r1 |
 10   |        r3    r3         r3  r3 |
 11   |        r5    r5         r5  r5 |

Fig. 4.31. Parsing table for expression grammar (4.18). Here si means shift
and stack state i, rj means reduce by the production numbered j, acc means
accept, and blank means error.

Note that the value of goto[s, a] for terminal a is found in the action field
connected with the shift action on input a for state s. The goto field gives
goto[s, A] for nonterminals A. Also, bear in mind that we have not yet
explained how the entries for Fig. 4.31 were selected; we shall deal with this
issue shortly.
On input id * id + id, the sequence of stack and input contents is shown in
Fig. 4.32. For example, at line (1) the LR parser is in state 0 with id the first
input symbol. The action in row 0 and column id of the action field of Fig.
4.31 is s5, meaning shift and cover the stack with state 5. That is what has
happened at line (2): the first token id and the state symbol 5 have both been
pushed onto the stack, and id has been removed from the input.
Then, * becomes the current input symbol, and the action of state 5 on
input * is to reduce by F → id. Two symbols are popped off the stack (one
state symbol and one grammar symbol). State 0 is then exposed. Since the
goto of state 0 on F is 3, F and 3 are pushed onto the stack. We now have
the configuration in line (3). Each of the remaining moves is determined
similarly. □

         Stack                   Input               Action
(1)      0                       id * id + id $      shift
(2)      0 id 5                  * id + id $         reduce by F → id
(3)      0 F 3                   * id + id $         reduce by T → F
(4)      0 T 2                   * id + id $         shift
(5)      0 T 2 * 7               id + id $           shift
(6)      0 T 2 * 7 id 5          + id $              reduce by F → id
(7)      0 T 2 * 7 F 10          + id $              reduce by T → T * F
(8)      0 T 2                   + id $              reduce by E → T
(9)      0 E 1                   + id $              shift
(10)     0 E 1 + 6               id $                shift
(11)     0 E 1 + 6 id 5          $                   reduce by F → id
(12)     0 E 1 + 6 F 3           $                   reduce by T → F
(13)     0 E 1 + 6 T 9           $                   reduce by E → E + T
(14)     0 E 1                   $                   accept

Fig. 4.32. Moves of LR parser on id * id + id.

An LR parser does not have to scan the entire stack to know when the han-
dle appears on top. Rather, the state symbol on top of the stack contains all

the information it needs. It is a remarkable fact that if it is possible to recog-


nize a handle knowing only the grammar symbols on the stack, then there is a
finite automaton that can, by reading the grammar symbols on the stack from
top to bottom, determine what handle, if any, is on top of the stack. The
goto function of an LR parsing table is essentially such a finite automaton.
The automaton need not, however, read the stack on every move. The state
symbol stored on top of the stack is the state the handle-recognizing finite
automaton would be in if it had read the grammar symbols of the stack from
bottom to top. Thus, the LR parser can determine from the state on top of
the stack everything that it needs to know about what is in the stack.
Another source of information that an LR parser can use to help make its
shift-reduce decisions is the next k input symbols. The cases k = 0 and k = 1 are
of practical interest, and we shall only consider LR parsers with k ≤ 1 here.
For example, the action table in Fig. 4.31 uses one symbol of lookahead. A
grammar that can be parsed by an LR parser examining up to k input symbols
on each move is called an LR(k) grammar.
There is a significant difference between LL and LR grammars. For a
grammar to be LR(k), we must be able to recognize the occurrence of the
right side of a production, having seen all of what is derived from that right
side with k input symbols of lookahead. This requirement is far less stringent
than that for LL(k) grammars where we must be able to recognize the use of
a production seeing only the first k symbols of what its right side derives.
Thus, LR grammars can describe more languages than LL grammars.

Constructing SLR Parsing Tables

We now show how to construct from a grammar an LR parsing table. We
shall give three methods, varying in their power and ease of implementation.
The first, called "simple LR" or SLR for short, is the weakest of the three in
terms of the number of grammars for which it succeeds, but is the easiest to
implement. We shall refer to the parsing table constructed by this method as
an SLR table, and to an LR parser using an SLR parsing table as an SLR
parser. A grammar for which an SLR parser can be constructed is said to be
an SLR grammar. The other two methods augment the SLR method with
lookahead information, so the SLR method is a good starting point for study-
ing LR parsing.
An LR(0) item (item for short) of a grammar G is a production of G with a
dot at some position of the right side. Thus, production A → XYZ yields the
four items

    A → ·XYZ
    A → X·YZ
    A → XY·Z
    A → XYZ·

The production A → ε generates only one item, A → ·. An item can be

represented by a pair of integers, the first giving the number of the production
and the second the position of the dot. Intuitively, an item indicates how

much of a production we have seen at a given point in the parsing process.


For example, the first item above indicates that we hope to see a string deriv-
able from XYZ next on the input. The second item indicates that we have just
seen on the input a string derivable from X and that we hope next to see a

string derivable from YZ.


The central idea in the SLR method is first to construct from the grammar a
deterministic finite automaton to recognize viable prefixes. We group items
together into sets, which give rise to the states of the SLR parser. The items
can be viewed as the states of an NFA recognizing viable prefixes, and the
"grouping together" is really the subset construction discussed in Section 3.6.
One collection of sets of LR(0) items, which we call the canonical LR(0)
collection, provides the basis for constructing SLR parsers. To construct the
canonical LR(0) collection for a grammar, we define an augmented grammar
and two functions, closure and goto.
If G is a grammar with start symbol S, then G', the augmented grammar for
G, is G with a new start symbol S' and production S' → S. The purpose of
this new starting production is to indicate to the parser when it should stop
parsing and announce acceptance of the input. That is, acceptance occurs
when and only when the parser is about to reduce by S' → S.

The Closure Operation

If I is a set of items for a grammar G, then closure(I) is the set of items con-
structed from I by the two rules:

1. Initially, every item in I is added to closure(I).

2. If A → α·Bβ is in closure(I) and B → γ is a production, then add the item
   B → ·γ to I, if it is not already there. We apply this rule until no more
   new items can be added to closure(I).

Intuitively, A → α·Bβ in closure(I) indicates that, at some point in the parsing
process, we think we might next see a substring derivable from Bβ as input.
If B → γ is a production, we also expect we might see a substring derivable
from γ at this point. For this reason we also include B → ·γ in closure(I).

Example 4.34. Consider the augmented expression grammar:

    E' → E
    E  → E + T | T
    T  → T * F | F                                     (4.19)
    F  → ( E ) | id

If I is the set of one item {[E' → ·E]}, then closure(I) contains the items

    E' → ·E
    E  → ·E + T
    E  → ·T
    T  → ·T * F
    T  → ·F
    F  → ·( E )
    F  → ·id                                           □

We divide the items of interest into two classes: kernel items, which include
the initial item S' → ·S and all items whose dots are not at the left end of the
right side, and nonkernel items, which have their dots at the left end. Each
set of items of interest is formed by taking the closure of a set of kernel
items; the items added in the closure can never be

kernel items, of course. Thus, we can represent the sets of items we are
really interested in with very little storage if we throw away all nonkernel

items, knowing that they could be regenerated by the closure process.
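The closure computation is easy to program. The following Python sketch is one
possible rendering (our own, not from the text): following the earlier observation,
an item is encoded as a pair (production number, dot position), and grammar and
nonterminals are hypothetical arguments describing the augmented grammar.

    # Sketch of closure(I) for LR(0) items.  grammar is a list of productions
    # (head, body), body a tuple of grammar symbols; nonterminals is a set.

    def closure(items, grammar, nonterminals):
        result = set(items)                     # rule (1): every item of I is kept
        changed = True
        while changed:
            changed = False
            for prod, dot in list(result):
                head, body = grammar[prod]
                if dot < len(body) and body[dot] in nonterminals:
                    b = body[dot]               # item has the form A -> alpha . B beta
                    for i, (h, _) in enumerate(grammar):
                        if h == b and (i, 0) not in result:
                            result.add((i, 0))  # rule (2): add B -> . gamma
                            changed = True
        return frozenset(result)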

The Goto Operation

The second useful function is goto(I, X) where I is a set of items and X is a
grammar symbol. goto(I, X) is defined to be the closure of the set of all items
[A → αX·β] such that [A → α·Xβ] is in I. Intuitively, if I is the set of items
that are valid for some viable prefix γ, then goto(I, X) is the set of items that
are valid for the viable prefix γX.

Example 4.35. If I is the set of two items {[E' → E·], [E → E·+T]}, then
goto(I, +) consists of

    E → E + ·T
    T → ·T * F
    T → ·F
    F → ·( E )
    F → ·id

We computed goto(I, +) by examining I for items with + immediately to the
right of the dot. E' → E· is not such an item, but E → E·+T is. We moved
the dot over the + to get {E → E+·T} and then took the closure of this set.
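In the same hypothetical representation, goto(I, X) can be sketched as a small
function that moves the dot over X in every applicable item and then takes the
closure of the result.

    # Sketch of goto(I, X) for LR(0) items, built on the closure sketch above.

    def goto(items, x, grammar, nonterminals):
        moved = set()
        for prod, dot in items:
            head, body = grammar[prod]
            if dot < len(body) and body[dot] == x:
                moved.add((prod, dot + 1))      # move the dot over X
        return closure(moved, grammar, nonterminals)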

The Sets-of-Items Construction

We are now ready to give the algorithm to construct C, the canonical collec-
tion of sets of LR(0) items for an augmented grammar G'; the algorithm is
shown in Fig. 4.34.

    procedure items(G');
    begin
        C := { closure({[S' → ·S]}) };
        repeat
            for each set of items I in C and each grammar symbol X
                    such that goto(I, X) is not empty and not in C do
                add goto(I, X) to C
        until no more sets of items can be added to C
    end

Fig. 4.34. The sets-of-items construction.
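A Python rendering of Fig. 4.34, again only a sketch built on the hypothetical
closure and goto functions above, might read:

    # Sketch of the sets-of-items construction.  grammar[0] is assumed to be
    # the added production S' -> S; symbols is the set of all grammar symbols.

    def canonical_lr0_collection(grammar, nonterminals, symbols):
        collection = [closure({(0, 0)}, grammar, nonterminals)]  # closure({[S' -> .S]})
        changed = True
        while changed:
            changed = False
            for items in list(collection):
                for x in symbols:
                    nxt = goto(items, x, grammar, nonterminals)
                    if nxt and nxt not in collection:            # a new, nonempty set
                        collection.append(nxt)
                        changed = True
        return collection

Applied to the augmented grammar (4.19), this would produce the twelve sets
I0 through I11 of Fig. 4.35, though possibly in a different order.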

Example 4.36. The canonical collection of sets of LR(0) items for grammar
(4.19) of Example 4.34 is shown in Fig. 4.35. The goto function for this set
of items is shown as the transition diagram of a deterministic finite automaton
D in Fig. 4.36. □

I0:  E' → ·E
     E  → ·E + T
     E  → ·T
     T  → ·T * F
     T  → ·F
     F  → ·( E )
     F  → ·id

I1:  E' → E·
     E  → E· + T

I2:  E  → T·
     T  → T· * F

I3:  T  → F·

I4:  F  → ( ·E )
     E  → ·E + T
     E  → ·T
     T  → ·T * F
     T  → ·F
     F  → ·( E )
     F  → ·id

I5:  F  → id·

I6:  E  → E + ·T
     T  → ·T * F
     T  → ·F
     F  → ·( E )
     F  → ·id

I7:  T  → T * ·F
     F  → ·( E )
     F  → ·id

I8:  F  → ( E· )
     E  → E· + T

I9:  E  → E + T·
     T  → T· * F

I10: T  → T * F·

I11: F  → ( E )·

Fig. 4.35. Canonical LR(0) collection for grammar (4.19).

Fig. 4.36. Transition diagram of the deterministic finite automaton D whose
states are the sets of items of Fig. 4.35 and whose transitions are given by the
goto function.

We say item A → β1·β2 is valid for a viable prefix αβ1 if there is a rightmost
derivation S' ⇒* αAw ⇒ αβ1β2w. Knowing which items are valid for the
viable prefix on the stack tells the parser whether to shift or to reduce.

Example 4.37. Consider grammar (4.19) of Example 4.34, whose sets of items
and goto function are exhibited in Fig. 4.35 and 4.36. Clearly, the string
E + T * is a viable prefix of (4.19). The automaton of Fig. 4.36 will be in
state I7 after having read E + T *. State I7 contains the items

     T → T * ·F
     F → ·( E )
     F → ·id

which are precisely the items valid for E + T *. To see this, consider the fol-
lowing three rightmost derivations

     E' ⇒ E ⇒ E + T ⇒ E + T * F
     E' ⇒ E ⇒ E + T ⇒ E + T * F ⇒ E + T * ( E )
     E' ⇒ E ⇒ E + T ⇒ E + T * F ⇒ E + T * id

The first derivation shows the validity of T → T * ·F, the second the validity
of F → ·( E ), and the third the validity of F → ·id for the viable prefix
E + T *. It can be shown that there are no other valid items for E + T *,
and we leave a proof to the interested reader.

SLR Parsing Tables


Now we shall show how to construct the SLR parsing action and goto func-
tions from the deterministic finite automaton that recognizes viable prefixes.
Our algorithm will not produce uniquely defined parsing action tables for all
grammars, but it does succeed on many grammars for programming
languages. Given a grammar G, we augment G to produce G', and from G'
we construct C, the canonical collection of sets of items for G'. We construct
action, the parsing action function, and goto, the goto function, from C using
the following algorithm. It requires us to know FOLLOW(A) for each nonter-
minal A of a grammar (see Section 4.4).

Algorithm 4.8. Constructing an SLR parsing table.

Input. An augmented grammar G'.

Output. The SLR parsing table functions action and goto for G'.

Method.

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(0) items for
   G'.

2. State i is constructed from Ii. The parsing actions for state i are deter-
   mined as follows:

   a) If [A → α·aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to
      "shift j." Here a must be a terminal.

   b) If [A → α·] is in Ii, then set action[i, a] to "reduce A → α" for all a

      in FOLLOW(A); here A may not be S'.

   c) If [S' → S·] is in Ii, then set action[i, $] to "accept."

   If any conflicting actions are generated by the above rules, we say the gram-
   mar is not SLR(1). The algorithm fails to produce a parser in this case.

3. The goto transitions for state i are constructed for all nonterminals A
   using the rule: If goto(Ii, A) = Ij, then goto[i, A] = j.

4. All entries not defined by rules (2) and (3) are made "error."

5. The initial state of the parser is the one constructed from the set of items
   containing [S' → ·S]. □
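The steps of Algorithm 4.8 translate almost directly into code. The sketch below
(our own illustration, reusing the hypothetical helpers introduced earlier and
assuming a precomputed FOLLOW mapping as in Section 4.4) builds the action
and goto tables and reports a conflict when the grammar is not SLR(1).

    # Sketch of Algorithm 4.8.  follow maps each nonterminal to its FOLLOW set.

    def slr_table(grammar, nonterminals, symbols, follow):
        states = canonical_lr0_collection(grammar, nonterminals, symbols)
        action, goto_table = {}, {}

        def set_action(i, a, act):                       # record an entry, detect conflicts
            if action.get((i, a), act) != act:
                raise ValueError('not SLR(1): conflict in state %d on %r' % (i, a))
            action[(i, a)] = act

        for i, items in enumerate(states):
            for prod, dot in items:
                head, body = grammar[prod]
                if dot < len(body):
                    x = body[dot]
                    j = states.index(goto(items, x, grammar, nonterminals))
                    if x in nonterminals:
                        goto_table[(i, x)] = j           # rule (3)
                    else:
                        set_action(i, x, ('shift', j))   # rule (2a)
                elif head == grammar[0][0]:              # completed S' -> S item
                    set_action(i, '$', ('accept',))      # rule (2c)
                else:
                    for a in follow[head]:
                        set_action(i, a, ('reduce', prod))   # rule (2b)
        return action, goto_table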

The parsing table consisting of the parsing action and goto functions deter-
mined by Algorithm 4.8 is called the SLR(1) table for G. An LR parser using
the SLR(1) table for G is called the SLR(1) parser for G, and a grammar hav-
ing an SLR(1) parsing table is said to be SLR(1). We usually omit the "(1)"
after the "SLR," since we shall not deal here with parsers having more than
one symbol of lookahead.

Example 4.38. Let us construct the SLR table for grammar (4.19). The
canonical collection of sets of LR(0) items for (4.19) was shown in Fig. 4.35.
First consider the set of items I0:

     E' → ·E
     E  → ·E + T
     E  → ·T
     T  → ·T * F
     T  → ·F
     F  → ·( E )
     F  → ·id

The item F → ·( E ) gives rise to the entry action[0, (] = shift 4, the item
F → ·id to the entry action[0, id] = shift 5. Other items in I0 yield no
actions. Now consider I1:

     E' → E·
     E  → E· + T

The first item yields action[1, $] = accept, the second yields action[1, +] =
shift 6. Next consider I2:

     E  → T·
     T  → T· * F

Since FOLLOW(E) = {$, +, )}, the first item makes action[2, $] =
action[2, +] = action[2, )] = reduce E → T. The second item makes
action[2, *] = shift 7. Continuing in this fashion we obtain the parsing action
and goto tables that were shown in Fig. 4.31. In that figure, the numbers of
productions in reduce actions are the same as the order in which they appear
productions in reduce actions are the same as the order in which they appear

in the original grammar (4.18). That is, E → E + T is number 1, E → T is
number 2, and so on. □

Example 4.39. Every SLR(1) grammar is unambiguous, but there are many
unambiguous grammars that are not SLR(1). Consider the grammar with pro-
ductions

     S → L = R
     S → R
     L → * R                                           (4.20)
     L → id
     R → L

We may think of L and R as standing for l-value and r-value, respectively, and
* as an operator indicating "contents of." The canonical collection of sets of
LR(0) items for grammar (4.20) is shown in Fig. 4.37.

     I0:  S' → ·S
          S  → ·L = R
          S  → ·R
          L  → ·* R
          L  → ·id
          R  → ·L

     I1:  S' → S·

     I2:  S  → L·= R
          R  → L·

     I3:  S  → R·

     I4:  L  → *·R
          R  → ·L
          L  → ·* R
          L  → ·id

     I5:  L  → id·

     I6:  S  → L =·R
          R  → ·L
          L  → ·* R
          L  → ·id

     I7:  L  → * R·

     I8:  R  → L·

     I9:  S  → L = R·

Fig. 4.37. Canonical LR(0) collection for grammar (4.20).

Consider the set of items I2. The first item in this set makes action[2, =] be
"shift 6." Since FOLLOW(R) contains = (to see why, consider the derivation
S ⇒ L = R ⇒ *R = R), the second item sets action[2, =] to "reduce R → L."
Thus entry action[2, =] is multiply defined, and state 2 has a shift/reduce
conflict on input symbol =.
Grammar (4.20) is not ambiguous. This shift/reduce conflict arises from
the fact that the SLR parser construction method is not powerful enough to

remember enough left context to decide what action the parser should take on
input = having seen a string reducible to L. The canonical and LALR
methods, to be discussed next, will succeed on a larger collection of gram-
mars, including grammar (4.20). It should be pointed out, however, that
there are unambiguous grammars for which every LR parser construction
method will produce a parsing action table with parsing action conflicts. For-
tunately, such grammars can generally be avoided in programming language
applications. □

Constructing Canonical LR Parsing Tables

We shall now present the most general technique for constructing an LR pars-

ing table from a grammar. Recall that in the SLR method, state i calls for
reduction by A → α if the set of items Ii contains item [A → α·] and a is in
FOLLOW(A). In some situations, however, when state i appears on top of
the stack, the viable prefix βα on the stack is such that βA cannot be followed
by a in a right-sentential form. Thus, the reduction by A → α would be
invalid on input a.

Example 4.40. Let us reconsider Example 4.39, where in state 2 we had item
R → L·, which could correspond to A → α· above, and a could be the = sign,
which is in FOLLOW(R). Thus, the SLR parser calls for reduction by R → L
in state 2 with = as the next input (the shift action is also called for because
of item S → L·= R in state 2). However, there is no right-sentential form of
the grammar in Example 4.39 that begins R = ··· . Thus state 2, which is
the state corresponding to viable prefix L only, should not really call for
reduction of that L to R. □

It is possible to carry more information in the state that will allow us to rule
out some of these invalid reductions by A → α. By splitting states when
necessary, we can arrange to have each state of an LR parser indicate exactly
which input symbols can follow a handle α for which there is a possible reduc-
tion to A.
The extra information is incorporated into the state by redefining items to
include a terminal symbol as a second component. The general form of an
item becomes [A → α·β, a], where A → αβ is a production and a is a terminal
or the right endmarker $. We call such an object an LR(1) item. The 1 refers
to the length of the second component, called the lookahead of the item.†
The lookahead has no effect in an item of the form [A → α·β, a], where β is not
ε, but an item of the form [A → α·, a] calls for a reduction by A → α only if

† Lookaheads that are strings of length greater than one are possible, of course, but we shall not
consider such lookaheads here.



the next input symbol is a. Thus, we are compelled to reduce by A → α only
on those input symbols a for which [A → α·, a] is an LR(1) item in the state
on top of the stack. The set of such a's will always be a subset of
FOLLOW(A), but it could be a proper subset, as in Example 4.40.
Formally, we say LR(1) item [A → α·β, a] is valid for a viable prefix γ if
there is a rightmost derivation S ⇒* δAw ⇒ δαβw, where

1. γ = δα, and
2. either a is the first symbol of w, or w is ε and a is $.

Example 4.41. Let us consider the grammar

     S → BB
     B → aB | b

There is a rightmost derivation S ⇒* aaBab ⇒ aaaBab. We see that item
[B → a·B, a] is valid for a viable prefix γ = aaa by letting δ = aa, A = B,
w = ab, α = a, and β = B in the above definition.
There is also a rightmost derivation S ⇒* BaB ⇒ BaaB. From this deriva-
tion we see that item [B → a·B, $] is valid for viable prefix Baa. □

The method for constructing the collection of sets of valid LR(1) items is

essentially the same as the way we built the canonical collection of sets of
LR(0) items. We only need to modify the two procedures closure and goto.
To appreciate the new definition of the closure operation, consider an item
of the form [A → α·Bβ, a] in the set of items valid for some viable prefix γ.
Then there is a rightmost derivation S ⇒* δAax ⇒ δαBβax, where γ = δα.
Suppose βax derives terminal string by. Then for each production of the form
B → η for some η, we have derivation S ⇒* γBby ⇒ γηby. Thus,
[B → ·η, b] is valid for γ. Note that b can be the first terminal derived from
β, or it is possible that β derives ε in the derivation βax ⇒* by, and b can
therefore be a. To summarize both possibilities we say that b can be any ter-
minal in FIRST(βax), where FIRST is the function from Section 4.4. Note
that x cannot contain the first terminal of by, so FIRST(βax) = FIRST(βa).
We now give the LR(1) sets of items construction.
Algorithm 4.9. Construction of the sets of LR(1) items.

Input. An augmented grammar G'.

Output. The sets of LR(1) items that are the set of items valid for one or
more viable prefixes of G'.

Method. The procedures closure and goto and the main routine items for con-
structing the sets of items are shown in Fig. 4.38.

Example 4.42. Consider the following augmented grammar.

     S' → S
     S  → CC                                           (4.21)
     C  → cC | d


    function closure(I);
    begin
        repeat
            for each item [A → α·Bβ, a] in I,
                    each production B → γ in G',
                    and each terminal b in FIRST(βa)
                    such that [B → ·γ, b] is not in I do
                add [B → ·γ, b] to I;
        until no more items can be added to I;
        return I
    end;

    function goto(I, X);
    begin
        let J be the set of items [A → αX·β, a] such that
            [A → α·Xβ, a] is in I;
        return closure(J)
    end;

    procedure items(G');
    begin
        C := { closure({[S' → ·S, $]}) };
        repeat
            for each set of items I in C and each grammar symbol X
                    such that goto(I, X) is not empty and not in C do
                add goto(I, X) to C
        until no more sets of items can be added to C
    end

Fig. 4.38. Sets of LR(1) items construction for grammar G'.
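The closure function of Fig. 4.38 can be sketched in Python as follows; an
LR(1) item is represented here as a triple (production number, dot position,
lookahead), and first_of is assumed to compute FIRST of a string of grammar
symbols as in Section 4.4 (both the representation and the names are our own
assumptions).

    # Sketch of the LR(1) closure of Fig. 4.38.

    def closure_lr1(items, grammar, nonterminals, first_of):
        result = set(items)
        changed = True
        while changed:
            changed = False
            for prod, dot, a in list(result):
                head, body = grammar[prod]
                if dot < len(body) and body[dot] in nonterminals:
                    b = body[dot]
                    beta_a = list(body[dot + 1:]) + [a]     # the string beta a
                    for i, (h, _) in enumerate(grammar):
                        if h != b:
                            continue
                        for t in first_of(beta_a):          # each terminal in FIRST(beta a)
                            if (i, 0, t) not in result:
                                result.add((i, 0, t))
                                changed = True
        return frozenset(result)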

We begin by computing the closure of {[S' → ·S, $]}. To close, we match the
item [S' → ·S, $] with the item [A → α·Bβ, a] in the procedure closure. That
is, A = S', α = ε, B = S, β = ε, and a = $. Function closure tells us to add
[B → ·γ, b] for each production B → γ and terminal b in FIRST(βa). In
terms of the present grammar, B → γ must be S → CC, and since β is ε and a
is $, b may only be $. Thus we add [S → ·CC, $].
We continue to compute the closure by adding all items [C → ·γ, b] for b in
FIRST(C$). That is, matching [S → ·CC, $] against [A → α·Bβ, a] we have
A = S, α = ε, B = C, β = C, and a = $. Since C does not derive the empty
string, FIRST(C$) = FIRST(C). Since FIRST(C) contains terminals c and d,
we add items [C → ·cC, c], [C → ·cC, d], [C → ·d, c], and [C → ·d, d].
None of the new items has a nonterminal immediately to the right of the dot,
so we have completed our first set of LR(1) items. The initial set of items is:

     I0:  S' → ·S, $
          S  → ·CC, $
          C  → ·cC, c/d
          C  → ·d, c/d

The brackets have been omitted for notational convenience, and we use the
notation [C → ·cC, c/d] as a shorthand for the two items [C → ·cC, c] and
[C → ·cC, d].
Now we compute goto(I0, X) for the various values of X. For X = S we
must close the item [S' → S·, $]. No additional closure is possible, since the
dot is at the right end. Thus we have the next set of items:

     I1:  S' → S·, $

For X = C we close [S → C·C, $]. We add the C-productions with second
component $ and then can add no more, yielding:

     I2:  S → C·C, $
          C → ·cC, $
          C → ·d, $

Next, let X = c. We must close {[C → c·C, c/d]}. We add the C-productions
with second component c/d, yielding:

     I3:  C → c·C, c/d
          C → ·cC, c/d
          C → ·d, c/d

Finally, let X = d, and we wind up with the set of items:

     I4:  C → d·, c/d

We have finished considering goto on I0. We get no new sets from I1, but
I2 has goto's on C, c, and d. On C we get:

     I5:  S → CC·, $

no closure being needed. On c we take the closure of {[C → c·C, $]}, to
obtain:

     I6:  C → c·C, $
          C → ·cC, $
          C → ·d, $

Note that I6 differs from I3 only in second components. We shall see that it
is common for several sets of LR(1) items for a grammar to have the same
first components and differ in their second components. When we construct
the collection of sets of LR(0) items for the same grammar, each set of LR(0)
items will coincide with the set of first components of one or more sets of
LR(1) items. We shall have more to say about this phenomenon when we dis-
cuss LALR parsing.
Continuing with the goto function for I2, goto(I2, d) is seen to be:


     I7:  C → d·, $

Turning now to I3, the goto's of I3 on c and d are I3 and I4, respectively,
and goto(I3, C) is:

     I8:  C → cC·, c/d

I4 and I5 have no goto's. The goto's of I6 on c and d are I6 and I7, respec-
tively, and goto(I6, C) is:

     I9:  C → cC·, $

The remaining sets of items yield no goto's, so we are done. Figure 4.39
shows the ten sets of items with their goto's. □
We now give the rules whereby the LR(1) parsing action and goto functions
are constructed from the sets of LR(1) items. The action and goto functions
are represented by a table as before. The only difference is in the values of
the entries.

Algorithm 4.10. Construction of the canonical LR parsing table.

Input. An augmented grammar G'.

Output. The canonical LR parsing table functions action and goto for G'.

Method.

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items for
   G'.

2. State i of the parser is constructed from Ii. The parsing actions for state i
   are determined as follows:

   a) If [A → α·aβ, b] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to
      "shift j." Here, a is required to be a terminal.

   b) If [A → α·, a] is in Ii, A ≠ S', then set action[i, a] to "reduce
      A → α."

   c) If [S' → S·, $] is in Ii, then set action[i, $] to "accept."

   If a conflict results from the above rules, the grammar is said not to be
   LR(1), and the algorithm is said to fail.

3. The goto transitions for state i are determined as follows: If
   goto(Ii, A) = Ij, then goto[i, A] = j.

4. All entries not defined by rules (2) and (3) are made "error."

5. The initial state of the parser is the one constructed from the set contain-
   ing item [S' → ·S, $]. □
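The only substantive difference from the SLR construction is rule (2b): a
canonical LR state reduces only on the lookahead carried by the item itself,
never on all of FOLLOW. A sketch of just that part, in the same hypothetical
style as before, makes the point; set_action is assumed to record an entry and
report conflicts as in the earlier SLR sketch.

    # Sketch of the reduce and accept entries of Algorithm 4.10.  Items are
    # (production, dot, lookahead) triples.

    def add_reduce_entries(i, items, grammar, set_action):
        for prod, dot, a in items:
            head, body = grammar[prod]
            if dot == len(body):                         # a completed item
                if head == grammar[0][0]:                # item [S' -> S., $]
                    set_action(i, '$', ('accept',))      # rule (2c)
                else:
                    set_action(i, a, ('reduce', prod))   # rule (2b): only on a
        # shift and goto entries are computed exactly as in the SLR construction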

The table formed from the parsing action and goto functions produced by
Algorithm 4.10 is called the canonical LR(1) parsing table. An LR parser
using this table is called a canonical LR(1) parser. If the parsing action

Fig. 4.39. The goto graph for grammar (4.21).

function has no multiply-defined entries, then the given grammar is called an
LR(1) grammar. As before, we omit the "(1)" if it is understood.

Example 4.43. The canonical parsing table for the grammar (4.21) is shown
in Fig. 4.40. Productions 1, 2, and 3 are S → CC, C → cC, and C → d.

Every SLR(1) grammar is an LR(1) grammar, but for an SLR(1) grammar
the canonical LR parser may have more states than the SLR parser for the

same grammar.

State |   c     d     $   |   S    C
  0   |  s3    s4         |   1    2
  1   |              acc  |
  2   |  s6    s7         |        5
  3   |  s3    s4         |        8
  4   |  r3    r3         |
  5   |              r1   |
  6   |  s6    s7         |        9
  7   |              r3   |
  8   |  r2    r2         |
  9   |              r2   |

Fig. 4.40. Canonical parsing table for grammar (4.21).

Constructing LALR Parsing Tables

We now introduce our last parser construction method, the LALR (lookahead-
LR) technique. It is often used in practice because the tables it produces are
considerably smaller than the canonical LR tables, yet most common syntactic
constructs of programming languages can be expressed conveniently by an
LALR grammar.

By way of introduction, consider again grammar (4.21), whose sets of LR(1)
items were shown in Fig. 4.39. Take a pair of similar-looking states, such as
I4 and I7. Each of these states has only items with first component C → d·.
In I4, the lookaheads are c or d; in I7, $ is the only lookahead. To see the
difference in the roles of I4 and I7 in the parser, note that the grammar
generates the language c*dc*d. When reading an input cc···cdcc···cd, the
parser shifts the first group of c's and their following d onto the stack, entering
state 4 after reading the d. The parser then calls for a reduction by C → d,
provided the next input symbol is c or d. The

requirement that c or d follow makes sense, since these are the symbols that
could begin strings in c*d. If $ follows the first d, we have an input like ccd,
which is not in the language, and state 4 correctly declares an error if $ is the
next input.
The parser enters state 7 after reading the second d. Then, the parser must
see $ on the input, or it started with a string not of the form c*dc*d. It thus
makes sense that state 7 should reduce by C → d on input $ and declare error
on inputs c or d.

Let us now replace I4 and I7 by I47, the union of I4 and I7, consisting of
the set of three items represented by [C → d·, c/d/$]. The goto's on d to I4
or I7 from I0, I2, I3, and I6 now enter I47. The action of state 47 is to
reduce on any input. The revised parser behaves essentially like the original,
although it might reduce d to C in circumstances where the original would
declare error, for example, on input like ccd or cdcdc. The error will eventu-
ally be caught; in fact, it will be caught before any more input symbols are
shifted.
More generally, we can look for sets of LR(1) items having the same core,
that is, set of first components, and we may merge these sets with common
cores into one set of items. For example, in Fig. 4.39, I4 and I7 form such a
pair, with core {C → d·}. Similarly, I3 and I6 form another pair, with core
{C → c·C, C → ·cC, C → ·d}. There is one more pair, I8 and I9, with core
{C → cC·}. Note that, in general, a core is a set of LR(0) items for the gram-
mar at hand, and that an LR(1) grammar may produce more than two sets of
items with the same core.
Since the core of goto(I, X) depends only on the core of I, the goto's of
merged sets can themselves be merged. Thus, there is no problem revising
the goto function as we merge sets of items. The action functions are modi-
fied to reflect the non-error actions of all sets of items in the merger.
Suppose we have an LR(1) grammar, that is, one whose sets of LR(1) items
produce no parsing action conflicts. If we replace all states having the same
core with their union, it is possible that the resulting union will have a con-
flict, but it is unlikely for the following reason: Suppose in the union there is a
conflict on lookahead a because there is an item [A → α·, a] calling for a
reduction by A → α, and there is another item [B → β·aγ, b] calling for a
shift. Then some set of items from which the union was formed has item
[A → α·, a], and since the cores of all these states are the same, it must have
an item [B → β·aγ, c] for some c. But then this state has the same
shift/reduce conflict on a, and the grammar was not LR(1) as we assumed.
Thus, the merging of states with common cores can never produce a
shift/reduce conflict that was not present in one of the original states, because
shift actions depend only on the core, not the lookahead.
It is possible, however, that a merger will produce a reduce/reduce conflict,

as the following example shows.

Example 4.44. Consider the grammar




     S' → S
     S  → aAd | bBd | aBe | bAe
     A  → c
     B  → c

which generates the four strings acd, ace, bcd, and bce. The reader can
check that the grammar is LR(1) by constructing the sets of items. Upon
doing so, we find the set of items {[A → c·, d], [B → c·, e]} valid for viable
prefix ac and {[A → c·, e], [B → c·, d]} valid for bc. Neither of these sets
generates a conflict, and their cores are the same. However, their union,
which is

     A → c·, d/e
     B → c·, d/e

generates a reduce/reduce conflict, since reductions by both A → c and B → c
are called for on inputs d and e. □

We are now prepared to give the first of two LALR table construction algo-
rithms. The general idea is to construct the sets of LR(1) items, and if no
conflicts arise, merge sets with common cores. We then construct the parsing
table from the collection of merged sets of items. The method we are about
to describe serves primarily as a definition of LALR( 1) grammars. Construct-
ing the entire collection of LR(1) sets of items requires too much space and
time to be useful in practice.

Algorithm 4.11. An easy, but space-consuming LALR table construction.

Input. An augmented grammar G'.

Output. The LALR parsing table functions action and goto for G'.

Method.

1. Construct C = {I0, I1, ..., In}, the collection of sets of LR(1) items.

2. For each core present among the set of LR(1) items, find all sets having
   that core, and replace these sets by their union.

3. Let C' = {J0, J1, ..., Jm} be the resulting sets of LR(1) items. The
   parsing actions for state i are constructed from Ji in the same manner as
   in Algorithm 4.10. If there is a parsing action conflict, the algorithm
   fails to produce a parser, and the grammar is said not to be LALR(1).

4. The goto table is constructed as follows. If J is the union of one or more
   sets of LR(1) items, that is, J = I1 ∪ I2 ∪ ··· ∪ Ik, then the cores of
   goto(I1, X), goto(I2, X), ..., goto(Ik, X) are the same, since I1, I2, ..., Ik
   all have the same core. Let K be the union of all sets of items having the
   same core as goto(I1, X). Then goto(J, X) = K.

The table produced by Algorithm 4.11 is called the LALR parsing table for
G. If there are no parsing action conflicts, then the given grammar is said to

be an LALR(1) grammar. The collection of sets of items constructed in step
(3) is called the LALR(1) collection.
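The merging step itself is straightforward to sketch: group the LR(1) sets by
core and take the union of the items in each group. As before, the
(production, dot, lookahead) representation is only an assumption of the sketch.

    # Sketch of steps (2) and (3) of Algorithm 4.11: merge sets with equal cores.

    def merge_by_core(lr1_collection):
        merged = {}                                   # core -> union of item sets
        for items in lr1_collection:
            core = frozenset((p, d) for p, d, _ in items)
            merged.setdefault(core, set()).update(items)
        return [frozenset(s) for s in merged.values()]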

Example 4.45. Again consider the grammar (4.21) whose goto graph was
shown in Fig. 4.39. As we mentioned, there are three pairs of sets of items
that can be merged. I3 and I6 are replaced by their union:

     I36:  C → c·C, c/d/$
           C → ·cC, c/d/$
           C → ·d, c/d/$

I4 and I7 are replaced by their union:

     I47:  C → d·, c/d/$

and I8 and I9 are replaced by their union:

     I89:  C → cC·, c/d/$

The LALR action and goto functions for the condensed sets of items are
shown in Fig. 4.41.

State |   c      d      $   |   S    C
  0   |  s36    s47         |   1    2
  1   |                acc  |
  2   |  s36    s47         |        5
 36   |  s36    s47         |        89
 47   |  r3     r3     r3   |
  5   |                r1   |
 89   |  r2     r2     r2   |

Fig. 4.41. LALR parsing table for grammar (4.21).

To see how the goto's are computed, consider goto(I36, C). In the original
collection of sets of LR(1) items, goto(I3, C) = I8, and I8 is now part of I89,
so we make goto(I36, C) be I89; we could have reached the same conclusion
from goto(I6, C) = I9, the other member of I89. When presented with a
string in the language c*dc*d, both the LR parser of Fig. 4.40 and the LALR
parser of Fig. 4.41 make the same sequence of shifts and reductions, although
the names of the states on the stack may differ: if the LR parser puts I3 or
I6 on the stack, the LALR

parser will put I36 on the stack. This relationship holds in general for an
LALR grammar. The LR and LALR parsers will mimic one another on
correct inputs.
However, when presented with erroneous input, the LALR parser may
proceed to do some reductions after the LR parser has declared an error,
although the LALR parser will never shift another symbol after the LR parser
declares an error. For example, on input ccd followed by $, the LR parser of
Fig. 4.40 will put

        0 c 3 c 3 d 4

on the stack, and in state 4 will discover an error, because $ is the next input
symbol and state 4 has action error on $. In contrast, the LALR parser of
Fig. 4.41 will make the corresponding moves, putting

        0 c 36 c 36 d 47

on the stack. But state 47 on input $ has action reduce C → d. The LALR
parser will thus change its stack to

        0 c 36 c 36 C 89

Now the action of state 89 on input $ is reduce C → cC. The stack becomes

        0 c 36 C 89

whereupon a similar reduction is called for, obtaining stack

        0 C 2

Finally, state 2 has action error on input $, so the error is now discovered.

Efficient Construction of LALR Parsing Tables

There are several modifications we can make to Algorithm 4.11 to avoid con-
structing the full collection of sets of LR(1) items in the process of creating an
LALR(1) parsing table. The first observation is that we can represent a set of
items I by its kernel, that is, by those items that are either the initial item
[S' → ·S, $], or that have the dot somewhere other than at the beginning of
the right side.
Second, we can compute the parsing actions generated by I from the kernel
alone. Any item calling for a reduction by A → α will be in the kernel unless
α = ε. Reduction by A → ε is called for on input a if and only if there is a
kernel item [B → γ·Cδ, b] such that C ⇒* Aη for some η, and a is in
FIRST(ηδb). The set of nonterminals A such that C ⇒* Aη can be precom-
puted for each nonterminal C.
The shift actions generated by I can be determined from the kernel of I as
follows. We shift on input a if there is a kernel item [B → γ·Cδ, b] where
C ⇒* ax in a derivation in which the last step does not use an ε-production.
The set of such a's can also be precomputed for each C.
Here is how the goto transitions for I can be computed from the kernel. If

[B → γ·Xδ, b] is in the kernel of I, then [B → γX·δ, b] is in the kernel of
goto(I, X). Item [A → X·β, a] is also in the kernel of goto(I, X) if there is
an item [B → γ·Cδ, b] in the kernel of I, and C ⇒* Aη for some η. If we
precompute for each pair of nonterminals C and A whether C ⇒* Aη for some
η, then computing sets of items from kernels only is just slightly less efficient
than doing so with closed sets of items.
To compute the LALR(1) sets of items for an augmented grammar G', we
start with the kernel S' → ·S of the initial set of items I0. Then, we compute
the kernels of the goto transitions from I0 as outlined above. We continue
computing the goto transitions for each new kernel generated until we have
the kernels of the entire collection of sets of LR(0) items.

Example 4.46. Let us again consider the augmented grammar

     S' → S
     S  → L = R | R
     L  → * R | id
     R  → L

The kernels of the sets of LR(0) items for this grammar are shown in Fig.
4.42. □

     I0:  S' → ·S

     I1:  S' → S·

     I2:  S  → L·= R
          R  → L·

     I3:  S  → R·

     I4:  L  → *·R

     I5:  L  → id·

     I6:  S  → L =·R

     I7:  L  → * R·

     I8:  R  → L·

     I9:  S  → L = R·

Fig. 4.42. Kernels of the sets of LR(0) items for grammar (4.20).

It remains to attach lookaheads to these kernel items. A lookahead can be
attached in one of two ways: it may be generated spontaneously, or it may
propagate from another item. Suppose [B → γ·Cδ, b] is an item in the kernel
of a set of items I, A → Xβ is a production, and C ⇒* Aη for some η. Then,
by the rules for closure, [A → ·Xβ, a] is in the closure of I for every terminal
a in FIRST(ηδb). If such an a is a terminal other than b, it is generated
spontaneously as a lookahead for the kernel item A → X·β in goto(I, X). If,
however, a can arise only as b itself, we cannot tell from I alone what the
lookahead will be; all we know is that whenever b is a lookahead for
B → γ·Cδ in I,


then [A → X·β, b] will also be in goto(I, X). We say, in this case, that look-
aheads propagate from B → γ·Cδ to A → X·β. A simple method to determine
when an LR(1) item in I generates a lookahead in goto(I, X) spontaneously,
and when lookaheads propagate, is contained in the next algorithm.

Algorithm 4.12. Determining lookaheads.

Input. The kernel K of a set of LR(0) items I and a grammar symbol X.

Output. The lookaheads spontaneously generated by items in I for kernel
items in goto(I, X) and the items in I from which lookaheads are propagated
to kernel items in goto(I, X).

Method. The algorithm is given in Fig. 4.43. It uses a dummy lookahead
symbol # to detect situations in which lookaheads propagate.

    for each item B → γ·δ in K do begin
        J' := closure({[B → γ·δ, #]});
        if [A → α·Xβ, a] is in J' where a is not # then
            lookahead a is generated spontaneously for item
                A → αX·β in goto(I, X);
        if [A → α·Xβ, #] is in J' then
            lookaheads propagate from B → γ·δ in I to
                A → αX·β in goto(I, X)
    end

Fig. 4.43. Discovering propagated and spontaneous lookaheads.
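In code, the procedure of Fig. 4.43 might be sketched as follows, reusing the
hypothetical closure_lr1 from the earlier sketch and the character '#' as the
dummy lookahead; all names are assumptions of the sketch.

    # Sketch of Algorithm 4.12.  kernel is a set of LR(0) kernel items
    # (production, dot) of I, and x is a grammar symbol.  The result gives the
    # spontaneous lookaheads for each kernel item of goto(I, X) and the
    # propagation links from items of I to items of goto(I, X).

    def determine_lookaheads(kernel, x, grammar, nonterminals, first_of):
        spontaneous, propagates = {}, []
        for b_item in kernel:
            j = closure_lr1({(b_item[0], b_item[1], '#')},
                            grammar, nonterminals, first_of)
            for prod, dot, a in j:
                head, body = grammar[prod]
                if dot < len(body) and body[dot] == x:
                    target = (prod, dot + 1)             # kernel item of goto(I, X)
                    if a != '#':
                        spontaneous.setdefault(target, set()).add(a)
                    else:
                        propagates.append((b_item, target))
        return spontaneous, propagates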

Now let us consider how we go about finding the lookaheads associated


with the items in the kernels of the sets of LR(0) items. First, we know that
$ is a lookahead for S' → ·S in the initial set of LR(0) items. Algorithm 4.12
gives us all the lookaheads generated spontaneously. After listing all those
lookaheads, we must allow them to propagate until no further propagation is

possible. There are many different approaches, all of which in some sense
keep track of "new" lookaheads that have propagated to an item but which
have not yet propagated out. The next algorithm describes one technique to
propagate lookaheads to all items.

Algorithm 4.13. Efficient computation of the kernels of the LALR(1) collec-
tion of sets of items.

Input. An augmented grammar G'.

Output. The kernels of the LALR(1) collection of sets of items for G'.

Method.

1. Using the method outlined above, construct the kernels of the sets of
   LR(0) items for G'.

2. Apply Algorithm 4.12 to the kernel of each set of LR(0) items and gram-
   mar symbol X to determine which lookaheads are spontaneously gen-
   erated for kernel items in goto(I, X), and from which items in I look-
   aheads are propagated to kernel items in goto(I, X).

3. Initialize a table that gives, for each kernel item in each set of items, the
   associated lookaheads. Initially, each item has associated with it only
   those lookaheads that we determined in (2) were generated spontane-
   ously.

4. Make repeated passes over the kernel items in all sets. When we visit an
   item i, we look up the kernel items to which i propagates its lookaheads,
   using information tabulated in (2). The current set of lookaheads for i is
   added to those already associated with each of the items to which i pro-
   pagates its lookaheads. We continue making passes over the kernel items
   until no more new lookaheads are propagated.
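Step (4) is a simple fixed-point computation. One way to sketch it, with
lookaheads keyed by kernel item and the propagation links produced as in the
previous sketch (both representations being assumptions of the sketch):

    # Sketch of the propagation passes of Algorithm 4.13, step (4).

    def propagate(lookaheads, links):
        changed = True
        while changed:
            changed = False
            for src, dst in links:
                new = lookaheads.get(src, set()) - lookaheads.get(dst, set())
                if new:
                    lookaheads.setdefault(dst, set()).update(new)
                    changed = True
        return lookaheads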

Example 4.47. Let us construct the kernels of the LALR(1) items for the
grammar in the previous example. The kernels of the LR(0) items were
shown in Fig. 4.42. When we apply Algorithm 4.12 to the kernel of set of
items I0, we compute closure({[S' → ·S, #]}), which is

     S' → ·S, #
     S  → ·L = R, #
     S  → ·R, #
     L  → ·* R, #/=
     L  → ·id, #/=
     R  → ·L, #

Two items in this closure cause lookaheads to be generated spontaneously.
Item [L → ·* R, =] causes lookahead = to be spontaneously generated for
kernel item L → *·R in I4, and item [L → ·id, =] causes = to be spontane-
ously generated for kernel item L → id· in I5.
The pattern of propagation of lookaheads among the kernel items deter-
mined in step (2) of Algorithm 4.13 is summarized in Fig. 4.44. For example,
the gotos of I0 on symbols S, L, R, *, and id are respectively I1, I2, I3, I4,
and I5. For I0 we computed only the closure of the lone kernel item
[S' → ·S, #]. Thus, S' → ·S propagates its lookahead to each kernel item in
I1 through I5.
In Fig. 4.45, we show steps (3) and (4) of Algorithm 4.13. The column
labeled INIT shows the spontaneously generated lookaheads for each kernel
item. On the first pass, the lookahead $ propagates from S' → ·S in I0 to the
six items listed in Fig. 4.44. The lookahead = propagates from L → *·R in I4
to items L → * R· in I7 and R → L· in I8. It also propagates to itself and to
L → id· in I5, but these lookaheads are already present. In the second and
third passes, the only new lookahead propagated is $, discovered for the suc-
cessors of I2 and I4 on pass 2 and for the successor of I6 on pass 3. No new
lookaheads are propagated on pass 4, so the final set of lookaheads is shown

Fig. 4.44. Propagation of lookaheads among kernel items.

in the final column of Fig. 4.45. □

Fig. 4.45. Computation of lookaheads.

Compaction of LR Parsing Tables

A typical programming language grammar gives rise to an LALR parsing
table with hundreds of states and many thousands of action entries, so it pays
to store the table compactly. One useful technique is to observe that many
rows of the action field are identical, and to represent each row by a list of
(input symbol, action) pairs, terminated by a default entry for the pseudo-
symbol any that covers all remaining inputs; identical lists can then be shared
among states.

Example 4.48. Consider the parsing table of Fig. 4.31. First, note that the
actions for states 0, 4, 6, and 7 agree. We can represent them all by the list

        id      s5
        (       s4
        any     error

State 1 has a similar list:

        +       s6
        $       acc
        any     error

In state 2, we can replace the error entries by r2, so reduction by production 2
will occur on any input but *. Thus the list for state 2 is:

        *       s7
        any     r2

State 3 has only error and r4 entries. We can replace the former by the latter,
so the list for state 3 consists of only the pair (any, r4). States 5, 10, and 11
can be treated similarly. The list for state 8 is:

        +       s6
        )       s11
        any     error

and that for state 9 is:

        *       s7
        any     r1                                     □

Example 4.49. The goto field of the table can also be encoded by lists, but
here it is more economical to keep one list for each nonterminal A; each entry
on the list pairs a state with the value of goto on A for that state, with the
symbol any again serving as a default. For the table of Fig. 4.31, the list for
F is

        7       10
        any     3

the list for T is

        6       9
        any     2

and the list for E is

        4       8
        any     1

If the reader totals up the number of entries in the lists created in this

example and the previous one, and then adds the pointers from states to
action lists and from nonterminals to next-state lists, he will not be impressed
with the space savings over the matrix implementation of Fig. 4.31. We
should not be misled by this small example, however. For practical gram-
mars, the space needed for the list representation is typically less than ten per-
cent of that needed for the matrix representation.
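To make the list representation concrete, here is a small Python sketch of the
lookup; the encoding of states 2 and 3 mirrors the lists worked out above, and
the names are, as usual, only assumptions of the sketch.

    # Sketch of a list-encoded action table.  Each state maps to a list of
    # (input symbol, action) pairs; the symbol 'any' is the default entry.

    ACTION_LISTS = {
        2: [('*', ('shift', 7)), ('any', ('reduce', 2))],
        3: [('any', ('reduce', 4))],
    }

    def lookup_action(state, a):
        for symbol, act in ACTION_LISTS.get(state, []):
            if symbol == a or symbol == 'any':
                return act
        return ('error',)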
We should also point out that the table-compression methods for finite auto-

mata that were discussed in Section 3.9 can also be used to represent LR pars-
ing tables. Application of these methods is discussed in the exercises.

4.8 USING AMBIGUOUS GRAMMARS


It is a theorem that every ambiguous grammar fails to be LR, and thus is not
in any of the classes of grammars discussed in the previous section. Certain
types of ambiguous grammars, however, are useful in the specification and
implementation of languages, as we shall see in this section. For language
constructs like expressions, an ambiguous grammar provides a shorter, more
natural specification than any equivalent unambiguous grammar. Another use
of ambiguous grammars is in isolating commonly occurring syntactic con-
structs for special case optimization. With an ambiguous grammar, we can
specify the special case constructs by carefully adding new productions to the
grammar.
We should emphasize that although the grammars we use are ambiguous, in

all cases we specify disambiguating rules that allow only one parse tree for
each sentence. In this way, the overall language specification still remains
unambiguous. We also stress that ambiguous constructs should be used spar-
ingly and in a strictly controlled fashion; otherwise, there can be no guarantee
as to what language is recognized by a parser.

Using Precedence and Associativity to Resolve Parsing Action Conflicts

Consider expressions in programming languages. The following grammar for


arithmetic expressions with operators + and *
     E → E + E | E * E | ( E ) | id                    (4.22)

is ambiguous because it does not specify the associativity or precedence of the


operators + and *. The unambiguous grammar
     E → E + T | T
     T → T * F | F                                     (4.23)
     F → ( E ) | id