INTRODUCTION TO COMPILER CONSTRUCTION

Thomas W. Parsons
Hofstra University

Computer Science Press
An imprint of W. H. Freeman and Company
New York

Trademark Information

Ada is a trademark of the Ada Joint Program Office, Department of Defense, United States Government. DEC, PDP, and VAX are trademarks of Digital Equipment Corporation. IBM, System/360, System/370, PL/I, and IBM PC are trademarks of International Business Machines, Inc. Intel is a trademark of the Intel Corporation. Macintosh is a trademark of Apple Computer, Inc. Microsoft, MS-DOS, and CodeView are trademarks of Microsoft, Inc. NeXT is a trademark of Next Corporation. PCYACC is a trademark of Abraxas Software, Inc. Qparser Plus is a trademark of QCad Systems. TeX is a trademark of the American Mathematical Society. Turbo Pascal, Turbo C, Turbo Debugger, and Turbo Pascal Integrated Environment are trademarks of Borland International, Inc. UCSD is a trademark of the Regents of the University of California. UNIX is a trademark of AT&T Bell Laboratories.

Composition, layout, and illustrations by the author in LaTeX.

Library of Congress Cataloging-in-Publication Data

Parsons, Thomas
An introduction to compiler construction
p. cm.
Includes bibliographical references.
ISBN 0-7167-821
1. Compilers (Computer programs)
QA76.76.C65P37
005.4'53-dc20

Copyright © 1992 by W. H. Freeman and Company

No part of this book may be reproduced by any mechanical, photographic, or electronic process, or in the form of a phonographic recording, nor may it be stored in a retrieval system, transmitted, or otherwise copied for public or private use, without written permission from the publisher.

Printed in the United States of America

Computer Science Press
An imprint of W. H. Freeman and Company
The book publishing arm of Scientific American
41 Madison Avenue, New York, NY 10010
20 Beaumont Street, Oxford OX1 2NQ, England

1 2 3 4 5 6 7 8 9 0  RRD  9 9 8 7 6 5 4 3 2

To Phil Panzeca

CONTENTS

Foreword
Preface

1 Introduction
1.1 Machine Language, Assembly Language, High-Level Languages
1.2 Terminology
1.3 Compilers and Interpreters
1.4 The Environment of the Compiler
1.5 Phases of a Compiler
1.6 Passes, Front End, Back End
1.7 System Support
1.8 Writing a Compiler
1.9 Retargetable Compilers
1.10 Summary

2 Lexical Analysis
2.1 Tokens and Lexemes
2.2 Buffering
2.3 Finite-State Automata
    2.3.1 State Diagrams and State Tables
    2.3.2 Formal Definition
    2.3.3 Acceptance
2.4 Nondeterministic Finite-State Automata
2.5 The Subset Construction
2.6 Regular Expressions
2.7 Regular Expressions and Finite-State Machines
2.8 The Pumping Lemma
2.9 Application to Lexical Analysis
    2.9.1 Recognizing Tokens
    2.9.2 A Stripped-Down Lexical Analyzer for Pascal
    2.9.3 Coding a Finite-State Machine
    2.9.4 The Lookahead Problem
2.10 Summary

3 Syntactic Analysis I
3.1 Grammars
    3.1.1 Syntax and Semantics
    3.1.2 Grammars: Formal Definition
    3.1.3 Parse Trees and Derivations
    3.1.4 Rightmost and Leftmost Derivations
    3.1.5 Ambiguous Grammars
    3.1.6 The Chomsky Hierarchy
    3.1.7 Some Context-Free and Non-Context-Free Languages
    3.1.8 More about the Chomsky Hierarchy
3.2 Top-Down Parsers
    3.2.1 Left Recursions
    3.2.2 [...]
3.3 Recursive-Descent Parsers
3.4 Predictive Parsers
    3.4.1 [...]
    3.4.2 [...]
    3.4.3 [...]
    3.4.4 [...]
3.5 Summary

4 Syntactic Analysis II
4.1 [...]
    4.1.1 Parsing with a Stack
    4.1.2 More about Handles
4.2 The Operator-Precedence Parser
    4.2.1 A Simple Operator-Precedence Parser
    4.2.2 Forming the Table
4.3 The LR Parser
    4.3.1 Construction of LR Parsing Tables
    4.3.2 Error Handling
    4.3.3 Conflicts
    4.3.4 Canonical LR Parsers
    4.3.5 Lookahead LR (LALR) Parsers
    4.3.6 Compiler-Compilers
4.4 Summary: Which Parser Should I Use?

5 Intermediate Code Generation
5.1 Semantic Actions and Syntax-Directed Translation
5.2 Intermediate Representations
    5.2.1 Syntax Trees
    5.2.2 Directed Acyclic Graphs
    5.2.3 Postfix Notation
    5.2.4 Three-Address Code
5.3 [...]
    5.3.1 [...]
    5.3.2 Postfix Notation
5.4 [...]
    5.4.1 Synthesized and Inherited Attributes
    5.4.2 Attributes in a Top-Down Parse
    5.4.3 Removal of Left Recursions
5.5 More about [...]
    5.5.1 [...]
5.8 Summary

6 Optimization
6.1 [...]
    6.1.1 Basic Blocks
    6.1.2 [...]
    6.1.3 [...]
    6.1.4 Global Optimization
6.2 DAGs Again
    6.2.1 Generating DAGs from 3AC
    6.2.2 Generating 3AC from DAGs
6.3 Data-Flow Analysis
    6.3.1 Flow Graphs
    6.3.2 [...]
    6.3.3 Solving Data-Flow Equations
6.4 Optimization Techniques Revisited
    6.4.1 Loop Optimizations
    6.4.2 Other Optimizations
6.5 Other Optimization Issues
6.6 Machine-Dependent Optimization
    6.6.1 Registers and Instructions
    6.6.2 Peephole Optimization
6.7 Summary: How Much Optimization?

7 Object Code Generation
7.1 Generating Machine Language from 3AC
    7.1.1 Blind Generation
    7.1.2 Special Considerations
7.2 [...]
    7.2.1 Liveness and Next Use
    7.2.2 [...]
    7.2.3 Assigning Registers
    7.2.4 Generating Code
    7.2.5 Instruction Sequencing
7.3 Special Architectural Features
7.4 Summary

8 Memory Use
8.1 The Symbol Table
    8.1.1 [...]
    8.1.2 [...]
    8.1.3 [...]
8.2 Run-Time [...]
    8.2.1 [...]
    8.2.2 [...]
    8.2.3 [...]

Appendix A  Minimizing the Number of States in a DFA
Appendix B  Lex and Yacc
B.1 Lex
    B.1.1 Token Definitions
    B.1.2 Definitions
    B.1.3 Tokens and Actions
    B.1.4 User-Written Code
    B.1.5 A Sample Lex Specification
B.2 Yacc
    B.2.1 Grammar
    B.2.2 Declarations and Definitions
    B.2.3 User-Written Code
    B.2.4 A Sample Yacc Specification

References

FOREWORD

The art of compiler construction is a fascinating subject. It integrates the mathematical foundations of formal languages, a wealth of well-established techniques for syntactic analysis, semantic analysis and optimization, and a tremendous amount of practical experience accumulated over the past four decades in designing and using high-level languages and their translators. For many undergraduates, however, learning this richly loaded craft is a challenge.

Dr. Parsons presents an undergraduate text on this fascinating art. It expounds, in a lucid style, the fundamentals and the practical considerations of compiler construction. The techniques covered are immediately applicable, and the coverage is tailored to the needs of an undergraduate course. The algorithms are presented in a readable Pascal-like notation, and the examples vividly bring out the essence of the algorithms.

[...] programmers and language designers. The ideas and techniques used in writing high-quality compilers are also useful in many aspects of software development and in an increasing variety of new applications. A study of the subject should be a valuable intellectual experience. Facilitating such a study at the undergraduate level is the purpose served by this readable text on the art of compiler construction.

Samuel Hsieh, Ph.D.
Vanderbilt University
Nashville, Tennessee

PREFACE

This book is meant to provide an introduction to compilers for the junior- or senior-level student: to show how a compiler works, how it is organized, what the terminology is, what the major problems are, and how they have been solved.
It is based on approximately eight years of classroom experience and is intended for computer science majors with only a sketchy knowledge of formal languages or automata theory. The material presented is probably too much for a one-term course, but the instructor will have an ample menu of topics from which to choose.

Objectives

For the graduate student and the professional, the classic compiler book is, and probably will be, Aho, Sethi, and Ullman [1986], commonly known as the Dragon Book. For all its comprehensiveness and detail, it is not the book with which to prepare students at the undergraduate level. This is an undergraduate book, and my goal is clarity rather than exhaustiveness. Readers will not know everything about compilers after studying this text, but they should be able to go on to the Dragon Book, Tremblay and Sorenson [1985], or Fischer and LeBlanc [1988] and profit from them.

Organization

Compilers are written as a set of separate parts, called phases. Each phase depends on the one before it, and it is natural to cover the subject phase by phase. This is what most compiler books do, and as you will see from the table of contents, this is the organization I have followed here. The phases make up the main body of the book; I have preceded them with a general introductory chapter and have followed them with a chapter on memory management. The phases interact, of course, and in many compilers the first three phases are carried out more or less concurrently; but for learning it is better to study each phase by itself and consider the interactions as they come up.

Prerequisites

The prerequisites for this book are a familiarity with some high-level language like Pascal or C, courses in assembly language and data structures, and enough discrete math to know the elements of set theory and some graph theory. I assume the reader knows what sets and subsets are, what the power set of a set is, and the common set identities. The reader should have had enough graph theory to know what a tree is, what nodes are, what a simple path is, and what a topological sort is. I assume no knowledge of automata theory or formal languages, however. The amount of formal language and automata theory needed for an introductory compiler course is minimal and is provided here. Chapter 2, indeed, contains a lengthy discussion of deterministic and nondeterministic finite automata, some of which might be omitted from an elementary course; however, I have included it because it provides indispensable background for the development of LR parsers in Chapter 4.

Approach

I have avoided the formal mathematical definition-theorem-proof approach. You will find theorems and proofs aplenty in the admirable books by Denning, Dennis, and Qualitz [1978], by Hopcroft and Ullman [1979], or by Lewis and Papadimitriou [1981]. Those books are written with different goals in mind. Furthermore, theorems and proofs are fine for rigor but not for quick comprehension; in an introductory text I have found it better to do without them. I assume that any reader who wants full rigor can go on to the Dragon Book next, so I have kept the treatment informal wherever feasible.

Much of the charm of this subject lies in the fact that it offers simple solutions to hard-looking problems, but these solutions rarely look simple the first time around. I have therefore worked through most of the algorithms step by step to bring out their simplicity, and the book abounds with worked examples of finite-state machine construction, parsing, code generation, and similar things. A graduate student should be able to derive these step-by-step examples unaided, but an undergraduate is entitled to some help.

Language

The programming language used throughout is Turbo Pascal.
For a professional, C would be a more realistic choice, but C is notoriously a "write-only" language, and Pascal is better where comprehensibility is of paramount importance. Pascal is also a simple enough language that it can quickly be picked up by the reader who is unfamiliar with it. I have used C only in the appendix on Lex and Yacc, where it is unavoidable. For assembly-language examples I use the languages of the IBM System/370 family and the Intel 80x86 family; these are the processors for which I have written the most assembly-language code myself and they are also the languages with which I believe the average reader is most likely to be familiar. I have commented all code freely to help readers unfamiliar with the languages (and to set a good example).

Acknowledgments

I am deeply indebted to many people who have helped and encouraged me in this project. First, to my long-suffering wife, who was a computer widow while the manuscript was in preparation; second, to Bill Gruener, Nola Hague, and all the other people at W. H. Freeman who have guided me through the work; and third, to Carol Loomis, my sharp-eyed and understanding copy editor. I must also thank the reviewers — Dr. Stephen V. Bechtolsheim, Purdue University; Dr. Robert Flynn, Polytechnic University; Dr. Joseph M. Lambert, Pennsylvania State University; Dr. Timothy McGuire, Texas A & M University; and Dr. Samuel Hsieh, Vanderbilt University, who also wrote the Foreword. They spotted many errors for me and their comments made this a much better book than it might have been otherwise. I'm particularly obligated to Prof. Philip J. Panzeca, who got me into the compiler business in the first place. Many thanks, Phil.

Brooklyn, New York
February 1992

CHAPTER 1

INTRODUCTION

A compiler is a program that translates a program written in a language like Pascal, C, PL/I, FORTRAN, or COBOL, into machine language. It is called a compiler for historical reasons; in the earliest days of automatic programming, before the first compilers as we know them existed, programs were put together from libraries of subroutines, the routines needed for carrying out specific tasks being collected and combined; when true translation became a reality, the name stuck. To begin with, we must sort out the various kinds of languages involved.

1.1 MACHINE LANGUAGE, ASSEMBLY LANGUAGE, HIGH-LEVEL LANGUAGES

Machine language is the native language of the computer on which the program is run; indeed, people sometimes call it "native code." It consists of bit strings which are interpreted by the mechanism inside the computer. A typical machine-language instruction in the IBM 370 family of computers looks like this:

0001100000110101

or, written in hexadecimal for greater conciseness and clarity,

1835

This instruction causes the computer to copy the contents of General Register 5 into General Register 3. You should be aware, by now, that this is literally the only language the computer can properly be said to "understand."

In the early days of computing, people programmed in binary, and wrote out the bit strings as shown above. The first and most primitive programming-language translators were assemblers; these permitted the programmer to write

LR 3,5

instead of the bit string shown above. This representation of a program is known as assembly language.
The assembler looked up the mnemonic LR in a table and found that the corresponding op-code was 00011000; it then found the binary representations of 3 and 5 and pieced the machine-language instruction together—that is, it assembled the instruction—from its component parts. In our example, the op-code concatenated with 0011 and 0101 gave the machine-language instruction shown above. A single line of assembly-language code normally corresponds to a single machine-language instruction.

Languages like Pascal, C, PL/I, FORTRAN, and COBOL are known, generally, as high-level languages; they have the property that a single statement, such as

x := y + z;

corresponds to more than one machine-language instruction. In particular, if x, y, and z are integers and we are compiling for the 370 or one of its congeners, then this statement corresponds to the sequence

L  3,Y
A  3,Z
ST 3,X

The main virtues of high-level languages are productivity and the ability to manage complexity. It has been estimated that the average programmer can produce 10 lines of debugged code in a working day. (The significant word is "debugged"; we can all write huge amounts of code in a much shorter time, but the additional time required for testing and debugging reduces the overall figure drastically.) It has also been found that this figure is essentially independent of the programming language used. Since a typical high-level language statement may correspond to perhaps 10 assembly-language instructions, it follows that we can be roughly 10 times as productive if we program in Pascal or C instead of assembly language.

Any program that has to deal with the real world will be complex. The real world teems with special cases, exceptions, and the general untidiness that distinguishes it from worlds like that of mathematics. Much of the evolution of programming languages has consisted of finding new ways of dealing with complexity that minimize its impact on the programmer. Such concepts as information hiding, encapsulation, and abstract data types are ways of coping with complexity. Many of these techniques can be implemented in assembly language, if you have the patience and the ingenuity, but they can be built into high-level languages in such a way that the implementation of the techniques is itself hidden from the user.

1.2 TERMINOLOGY

The high-level language that the compiler accepts as its input is called the source language, and the source-language program that is fed to the compiler is known as source code. The particular machine language that it generates is the object language, and the output of the compiler is object code. The object code is normally written to a file in external storage; this file is called an object file, or, sometimes, an object module. If you compile a program and then examine your directory, you will normally discover a file named something like MYPROG.OBJ; this file contains the object code.

I said that the object language is machine language, and this is almost always so; but compilers have been written whose object language is assembly language. Using those compilers, you must first compile the program to assembly language and then feed the assembly-language object module to an assembler before you get machine language. This is a bothersome step to have to take, and compilers of this sort are now rare.
Some compilers, however, will optionally generate an assembly-language version of the object code; programmers occasionally need to improve the performance of compiled code, and this is easiest to do if an assembly-language version is available.

The computer whose machine language the compiler generates is called the target machine. In the vast majority of cases, the target machine is the computer on which the program was compiled: if we compile a Pascal program for an IBM PC, we will run it, in the machine language of the PC, on the PC itself. This is so common that we take it for granted; but occasionally a compiler generates code for a machine that is different from the machine on which the compiler runs. Such a compiler is called a cross-compiler. For example, we might need a compiler which can be run on a VAX but which compiles to the machine language of an IBM PC. In this case, the compiled code cannot be run on the VAX; it has to be written out to some convenient medium, like a floppy disc, and put inside the PC to be run. It is not immediately clear why we would want such a thing as a cross-compiler; we will come back to this later and see a case where it can be very useful.

1.3 COMPILERS AND INTERPRETERS

We have already drawn the distinction between a compiler and an assembler: the assembler takes assembly language as its source language, instead of a higher-level language. We must also distinguish between a compiler and an interpreter. A compiler translates the high-level program we give it; an interpreter executes the program. The most familiar interpreter, to many students, is BASIC. A program in BASIC is normally not compiled. Instead, BASIC analyzes a program, line by line, and instead of translating the code, it actually uses the computer to carry out the operations specified. For example, if we write

20 x = y + z

in BASIC, the instruction will be analyzed using many of the procedures we will outline in describing compilers. But then comes the difference: instead of producing a translation into machine language, BASIC will proceed to add y and z on the spot and place the result into x.

The main advantage of interpreters is immediacy of response. You type RUN and the program is executed immediately, without any preliminary translation. A secondary advantage, less obvious in a language like BASIC, is flexibility. If you have an array of 10 elements and if, during the execution of the program, you suddenly find that you would like to make this the first row of a 10 x 10 matrix, there is nothing a compiled language can do for you. There is nothing BASIC can do for you, either, in this particular example, but APL, another interpreted language, has the ability to expand the array into a matrix during the execution of the program.

The main disadvantage of interpreted languages is in speed of execution. If your statement appears inside a loop, for example,

10 for i = 1 to 100
20   x(i) = y + z

then it must be analyzed afresh on every trip through the loop, whereas a compiled program runs directly from the translated code. The analysis is faster than a full translation, but it is not instantaneous, and repeating it on every trip around the loop makes the program slow. (For this reason, there are a few BASIC compilers around.)

1.4 THE ENVIRONMENT OF THE COMPILER

We must now see where the compiler fits into the overall process of writing and executing a program. The object file produced by the compiler is normally not ready to run. There are many reasons why this is so. For example, if your program contains a statement like

y := sqrt(x);

then that square root has to be computed. It is not practical for a compiler to have at hand all the various methods for computing things like square roots, logs, or other functions.
Instead, these functions are provided in a run-time library, a collection of object modules for computing these functions. (When you buy a compiler, the run-time library is normally provided as part of the package.) The run-time library supports, among many other things, functions like logs or square roots; character-string operations; input-output handling like opening, reading, writing, and closing files, or converting between internal binary numbers and their ASCII representation; and dynamic-memory operations like allocating and freeing blocks of memory as required by the Pascal new and dispose procedures.

Furthermore, many programming languages permit you to write external procedures and functions, which are separately compiled and kept in a user's library on disc.

The compiler translates our example into code for calling the square root function, but it doesn't provide the square root function. Hence we need another step, in which all the required run-time library services are identified and loaded into memory along with the object module for the program. It is further necessary to tell the program where these modules have been loaded. For example, the call to that square root function, in the IBM 370 family, would look like this:

L    15,=V(SQRT)   Get the address of the SQRT function
BALR 14,15         Jump to that address

In the literal pool there is a space reserved for the address of the external SQRT function; when this module is fetched from the library, the correct address is inserted. This process of making the connections between the program and the SQRT function by resolving such addresses is known as linking; it is performed by the linker. Altogether, we have the following picture:

Figure 1.1 [diagram: the source code is compiled into an object module, which the linker combines with the run-time library and the user's object modules to form an executable program]

The user, sitting at the terminal, must typically type

PASC myprog    [Compiles the program]
LINK myprog    [Links the program]
RUN  myprog    [Runs the program]

(Notice that the linker, in fetching and loading the necessary subroutines and functions from a library, has taken over the task that was the original job of the earliest compilers, the job that gave compilers their name.)

Turbo Pascal, and a few other systems, are a sort of exception to all this. In Turbo's Integrated Environment, the compiling and linking operations are present but concealed from the user. When a user requests compilation, the linking step takes place automatically, without the user's having to request it, and the resulting program is ready to run. In this environment, the executable code is placed in memory, and if the user wants that executable code to be stored on disc, it must be requested specifically.

The study of linkers or program loaders is beyond the scope of this book; this brief discussion is simply to show what the compiler is and is not expected to do and where the compiler fits into the overall picture.

1.5 PHASES OF A COMPILER

A compiler is a very complex program, and, like many complex programs, it is implemented as a number of separate parts that work together. These parts are known as phases; there are five of them:

Lexical analysis
Syntactic analysis
Intermediate code generation
Optimization
Object code generation

The easiest way to study the compiler is to take the phases in order, and I will devote the following chapters to them one by one. To learn about any of these phases, however, you have to know a little about all of them; so I will describe them all briefly here.

Lexical analysis (or lexical scanning) breaks the source code up into meaningful units.
The program, as it arrives at the input to the compiler, is like one enormous character string. You and I can recognize the meaningful elements in a statement like

for i := 1 to max do
   x[i] := 0;

and before the compiler can make a start at translating this, it must also identify the meaningful elements of this statement. We can get a better idea of what the compiler is up against by writing this statement in binary, using the ASCII representation of the characters:

09 66 6F 72 20 69 20 3A 3D 20 31 20 74 6F 20 6D 61 78 20 64 6F 0D 0A
09 20 20 20 78 5B 69 5D 20 3A 3D 20 30 3B 0D 0A

This representation puts us at the same disadvantage as the computer, in that the familiar letters, symbols, and spaces are gone. What the compiler sees is this sequence of bit strings, and the lexical analyzer's task is to break this statement up into meaningful units. These units are called tokens, and lexical analysis is also sometimes called tokenizing. (We will see a more precise definition of tokens in Chapter 2, but this will do for now.) In this statement, the meaningful units are

keywords:     for, to, do
identifiers:  i, max, x
constants:    1, 0
operators:    :=
punctuation:  ;
brackets:     [, ]

Tokenizing is relatively easy in modern languages, since the programmer is required to separate certain tokens with blanks or tab characters (white space), making it easier for the scanner to determine where one token ends and the next begins.

The lexical analyzer does a few other chores as well. It may remove any excess white space; it detects comments and passes over them; and in case-insensitive languages it converts all letters except those in character-string constants to a uniform case, so that the identifiers max, Max, and MAX will all be considered the same.

Syntactic analysis determines the structure of the program and of the individual statements. Taking our example above, it determines that this is a for loop, identifies i as the loop counter, and recognizes 1 and max as the limits of the loop. It determines that the body of the loop consists of a single statement and identifies the statement as an assignment. The parser also detects errors in statements; if our example had said x[i := 0;, it would have been the parser that noticed the missing bracket and flagged it as an error.

"Parsing" means displaying the structure of a sentence; it is a term in linguistics, and the construction of parsers draws heavily on techniques and concepts developed in linguistics over the last 30 years. It would not be too much to say that we have the theory of mathematical linguistics to thank for the fact that compilers are practical and not excessively difficult to write. Modern linguistics describes languages by means of generative grammars, grammars that show how legal sentences can be generated. Such sentences are represented by parse trees. Programming languages can also be described by grammars, and the design of a parser takes the grammar as a starting point. Parse trees are an important representation of statements in programming languages.

For example, suppose the user's program contains the statement

distance := rate*time;

The lexical analyzer isolates the identifiers, the operators, and the semicolon, and it passes these to the parser in tokenized form as follows:

id1 := id2 * id3 ;

where id1 is the token that corresponds to the identifier distance, id2 corresponds to rate, and id3 corresponds to time.
The parser analyzes this sequence of tokens, and the parse tree looks like this:

[Parse tree for id1 := id2 * id3: the statement S consists of an identifier, the := operator, and an expression E; the expression is the product of two subexpressions, which are identifiers]

The tree shows that the statement is syntactically correct; that it is a statement consisting of an identifier, an assignment operator, and an expression; and that the expression is the product of two subexpressions, which are identifiers.

Intermediate code generation creates an internal representation of the program that reflects the information uncovered by the parser. I said that every high-level language statement corresponds normally to several machine-language instructions. This means that at some point, the statement has to be broken down into small pieces corresponding to these instructions. This is normally done in intermediate code generation. The intermediate code is just that: code at a level intermediate between the high-level form and machine language. It is in a form in which the small pieces are clearly visible, but which is not yet machine language or even assembly language. This is because optimization is still to be done, and machine language is not generally convenient for optimization purposes.

There are several representations used for intermediate code; of these perhaps the most important and most widely used is three-address code. Instructions in three-address code take the form

result := thing operator thing

for example,

T := rate * time

Here the things are the variables rate and time, the operator is *, and the result goes in a temporary variable T. Instructions in three-address code are easy to generate under the control of the parser, and they are simple enough that they can be readily translated into machine language at object-code-generation time.

Optimization, or code enhancement, is the process of identifying and removing redundant operations from the intermediate code. Since compilers, when they were first introduced, had to compete against assembly-language programmers, optimization of the compiled program was important from the very start. For intermediate code to be generated efficiently, a certain amount of deadwood has to be tolerated. For example, a statement

x := a*y + z;

will probably generate the three-address instructions

T1 := a * y
T2 := T1 + z
x  := T2

where T1 and T2 are temporary variables created to hold intermediate results. The last two instructions can be combined into

x := T1 + z

eliminating one instruction and one temporary; this is the sort of deadwood tolerated in the intermediate code generation phase. It is the optimizer's job to remove it, and there are many other simplifications that an experienced assembly-language programmer would make in translating a high-level language.

Object code generation, often called simply code generation, translates the optimized intermediate program into the language of the target machine. For example, in machines in the IBM System/370 family, the three-address code in the last example translates fairly easily into

LE  4,A    A in floating-point register 4
ME  4,Y    Multiply by Y
AE  4,Z    Add Z
STE 4,X    Store in X

Where the other phases are language-dependent, the code-generation phase is machine-dependent. This is one of the most difficult parts of compiler writing, and it is the part that generally receives the lightest treatment in compiler texts, because it is so heavily dependent on the instruction set and architecture of the target machine that it is difficult to set forth any general rules. The questions to be considered in this phase are mostly associated with the order in which machine instructions are to be generated and how the machine's registers are to be used.
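Before leaving this overview, it may help to make the three-address representation concrete. Here is one possible way of holding such instructions in the Pascal used throughout this book; this is only an illustrative sketch, and the type and field names are invented for the occasion, not part of any particular compiler:

program ThreeAddrSketch;

type
  OpKind = (opAdd, opSub, opMul, opDiv, opAssign);

  { one three-address instruction: result := arg1 op arg2 }
  ThreeAddr = record
    op: OpKind;
    arg1, arg2: string;    { the "things": variable or temporary names }
    result: string         { where the result goes }
  end;

var
  instr: ThreeAddr;

begin
  { T := rate * time would be stored as: }
  instr.op := opMul;
  instr.arg1 := 'rate';
  instr.arg2 := 'time';
  instr.result := 'T'
end.

An intermediate program is then simply an array or list of such records, which the optimizer can examine and rewrite before object code is generated.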
1.6 PASSES, FRONT END, BACK END

The compiler makes one or more passes through the program. A pass consists of reading a version of the program from a file and writing a new version of it to an output file. A pass normally comprises more than one phase, but the number of passes, and the phases they cover, varies. Single-pass compilers tend to be fastest, but there are two reasons why multiple passes may be necessary.

First, certain questions raised early in the program may remain unanswered until the rest of the program has been read. (This is particularly true if there are references to internal procedures that appear later in the source code.) Once these questions have been answered, it will then be necessary to go back and reread the program in the light of these answers. Pascal is a special case; since everything in Pascal must be defined before it is used, a one-pass Pascal compiler is quite feasible. The variety of forward references permitted in Pascal is so limited that they are easily resolved in a single pass.

The second reason multiple passes may be necessary is that there may not be enough memory available to hold all the intermediate results obtained in the course of compilation. In that case the intermediate results are written to a file which is then read as necessary; if the entire program is tokenized in the first pass, for example, the token string can be written to a file and used as the input to the next pass. If memory is severely limited, there may not be room even for the compiler itself, in which case we may use overlays: parts of the compiler may be loaded into memory as needed, overwriting parts no longer required.

The first three phases—lexical analysis, syntactic analysis, and intermediate code generation—work very closely together. In some compilers, particularly those in which these phases are taken care of in a single pass, the parser is in the driver's seat. The lexical analyzer goes into operation when the parser requests a token, and a piece of intermediate code is generated when the parser requests it. These three phases, together with some of the optimization phase, are called the front end of the compiler. Their design and operation are heavily dependent on the programming language being compiled and have little or no concern with the target machine. The code generation phase, together with those aspects of optimization that are machine-dependent, form the back end of the compiler.

1.7 SYSTEM SUPPORT

In addition to the five phases of the compiler, there is a certain amount of supporting code to be supplied. This is associated mainly with the symbol table and with error handling.

The symbol table is the central repository of information about the names, or identifiers, created by the programmer. The name of every variable, every array, every record, and every procedure, is entered in the symbol table; for each identifier, this table contains its name and attributes and various other information. Every identifier found by the lexical analyzer is referred to the symbol table. In strongly-typed languages, where everything must be declared by the programmer before use, the lexical analyzer and the parser, working together, must make sure that every newly declared identifier is entered in the symbol table, and they must ensure that every identifier used subsequently has been declared.

The symbol table is normally maintained by a procedure known as the symbol-table handler; this is the interface between the rest of the compiler and the table.
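To give a rough idea of what the handler manages, here is one possible shape for the table and its interface, written as a Turbo Pascal unit. The names, the fields, and the naive linear search are all invented for illustration; the symbol table is treated in detail in Chapter 8:

unit SymTab;

interface

type
  SymKind = (skVariable, skArray, skRecordName, skProcedure);

  SymEntry = record
    name: string;      { the identifier as written }
    kind: SymKind;     { what sort of thing it names }
    size: integer      { bytes of storage required }
  end;

function Lookup(name: string): integer;  { index of the entry, or 0 if absent }
procedure Enter(e: SymEntry);            { record a newly declared identifier }

implementation

var
  table: array[1..500] of SymEntry;
  count: integer;

function Lookup(name: string): integer;
var
  i: integer;
begin
  Lookup := 0;
  for i := 1 to count do
    if table[i].name = name then Lookup := i
end;

procedure Enter(e: SymEntry);
begin
  count := count + 1;
  table[count] := e
end;

begin
  count := 0
end.

A real handler stores far more for each identifier (type, scope, storage address) and organizes the table for fast searching, but this is the essential shape of the interface.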
For example, we saw a piece of three-address code

T1 := a * y
T2 := T1 + z
x  := T2

The temporary variables T1 and T2 are not the programmer's variables; they are new variables that were found necessary in generating three-address code. One of the symbol-table handler's tasks is to create such temporaries for the intermediate-code generator and enter them in the symbol table.

Error handling deals with the errors the compiler finds in the code it is compiling. When an error is found, the compiler must tell the user about it in as much detail as possible, reporting where the error was found and what it appears to be. It is well if some kind of makeshift fix-up can be made that allows the compiler to continue through the program, but it is hard to say how safe this is; a simple error in the beginning of a program can lead the compiler into a misunderstanding which generates a flood of diagnostics as it misinterprets the balance of the program.

The way errors are handled depends in part on how the compiler is intended to be used. Some compilers are specially written for teaching purposes. It is expected that programs written by students will be run only until they are debugged; hence teaching compilers provide little or no optimization but detailed and occasionally very intelligent error handling, so that beginning programmers can get as full an explanation of their errors as possible. On the other hand, the extremely fast compilation provided in Turbo Pascal's Integrated Environment, combined with its built-in editor, leads to a quite different response to errors: upon finding the first error, the compiler quits immediately and returns control to the editor with the cursor pointing to the place in the program where the error was found. The user can then correct the error at once and re-compile.

1.8 WRITING A COMPILER

The first compiler was written in assembly language; there was no other alternative. The sequence of operations was as shown below:

Writing the compiler:

FORTRAN compiler source (in assembly language) --> Assembler --> compiler object module --> Linker --> executable FORTRAN compiler

Using the compiler:

user's source code (in FORTRAN) --> compiler --> user's object module --> Linker --> user's executable program

Once an adequate high-level language is available, however, there are more attractive options. If you have a compiler for C and wish to write a compiler for PL/I, you can write the PL/I compiler in C. In this case, the sequence is

Writing the compiler:

PL/I compiler source (in C) --> C compiler --> compiler object module --> Linker --> executable PL/I compiler

Using the compiler:

user's source code (in PL/I) --> PL/I compiler --> user's object module --> Linker --> user's executable program

This is the normal way of writing a compiler. For example, from internal evidence in its documentation, it appears that the first FORTRAN 77 compiler from Microsoft Corporation was written in Pascal. From similar internal evidence, it appears that, by the time its second FORTRAN compiler was released, Microsoft had had a change of policy and decided to write the second version in C. It should be clear that the choice of language concerned only Microsoft; to the user, a FORTRAN compiler is a FORTRAN compiler, and the language in which the compiler source was written is irrelevant.

In some cases, a compiler may be written in its own language. There are two ways of doing this. To give a specific example, suppose you require a compiler for C. You start off by writing a minimal compiler in assembly language. This compiler supports only those operations needed in writing a compiler (for example, there is no need to provide a floating-point data type).
Then, in a second step, you write a full compiler in C, using only those features in the language supported by the minimal compiler. You then compile the full compiler using the minimal compiler. The sequence is as follows:

Writing the compiler (first step):

minimal C compiler source (in assembly language) --> Assembler --> compiler object module --> Linker --> executable minimal C compiler

Writing the compiler (second step):

full C compiler source (in minimal C) --> minimal C compiler --> compiler object module --> Linker --> executable full C compiler

Using the compiler:

user's source code (in C) --> full C compiler --> user's object module --> Linker --> user's executable program

Now all subsequent versions of the compiler can be written in full C; each new version can be compiled with the previous one. This process is known as bootstrapping. It is as if you were installing a hoist: you climb a ladder to reach the beam where the hoist is to be installed; then you throw away the ladder and use the hoist instead. If your ultimate goal is an elevator, you use the hoist to install the elevator machinery and then you throw away the hoist. In our case, the ladder is the assembler, the hoist is the minimal C compiler, and the elevator is the full C compiler.

Second, if you have a satisfactory C compiler on the VAX and wish to write a C compiler for (say) the Macintosh, you can write the compiler in C and run it on the VAX. This is a case where a cross-compiler is useful. The sequence is as follows:

Writing the compiler (on VAX):

Mac C compiler source (in C) --> VAX C compiler --> Mac C compiler object module --> Linker --> Mac C compiler (executable on VAX)

The compiler we get from this step is the cross-compiler. To get a compiler we can actually use on the Macintosh, we feed the very same source code into the cross-compiler:

Mac C compiler source (in C) --> Mac C cross-compiler --> Mac C compiler object module --> Linker --> Mac C compiler (executable on Mac)

Using the compiler (on Macintosh):

user's source code (in C) --> Mac C compiler --> user's object module --> Linker --> user's executable program

In both of these examples, the second step, where you have the compiler compile itself, is the crucial one. This is the step in which the new compiler is made usable by compiling it with the preliminary version.

A useful tool for many of these approaches is software for generating parts of the compiler. These are sometimes called "compiler compilers." We will find, in particular, that lexical scanners and parsers are commonly driven off tables, and these tables can be quite large for any practical programming language. The tables are normally constructed with the help of such generators: Lex constructs tables for lexical scanners, generating from definitions of the tokens the tables needed for the scanner; Yacc ([Johnson, 1975]) accepts a grammar and constructs an LR parser for the language.

1.9 RETARGETABLE COMPILERS

In many cases, a compiler writer will want to adapt a compiler for use with a new target machine. A compiler that can be modified in this way is said to be retargetable. We have seen a way in which this can be done using a cross-compiler, but there are a couple of alternative approaches.

One approach is to take advantage of the distinction between the front end and the back end. If you have a working front end for a VAX C compiler, it will normally not be difficult to write a new back end for the Macintosh; this means you have a Macintosh C compiler with a minimal amount of extra work.
It is tempting to imagine having a shelf full of front and back ends—front ends for various languages and back ends for various machines; then upon receiving an order for a PL/I compiler for the NeXT computer, you pull a PL/I front end and a NeXT back end off the shelf and mate them. In practice, there are enough subtle differences among programming languages and their interactions with the computer that this is not as attractive a prospect as it sounds. Nevertheless, compiler writers sometimes go to a great deal of effort to make a compiler retargetable.

A second approach, used with great success by some of the first implementers of Pascal, was to write a compiler for an imaginary machine. The imaginary machine was a stack-based machine whose language was known as p-code. The entire compiler was written in p-code, and the compiler compiled Pascal to p-code. To install this system on a given machine, you merely wrote a p-code interpreter for the machine. The p-code language was designed to be easy to interpret, so this was not a difficult undertaking. The result, after installation of the p-code interpreter, was a stack-based virtual computer whose machine language appeared to be p-code. The use of p-code undoubtedly contributed to the initial rapid spread of Pascal. The UCSD p-system, an outgrowth of this effort which did much to bring Pascal to the world of personal computers, included not only the compiler but an entire operating system, all written in p-code.

1.10 SUMMARY

We have seen a quick overall view of what a compiler does, what goes into it, and how it is organized. Details will come in subsequent chapters. Probably the best way to learn about compilers is to write one. It is also very illuminating to study real compilers, however. The trouble is that not many vendors publish the internals of their products. Barron [1981] [...] Pascal compilers and includes a microscopically detailed description [...] programming problems [...]. You can also do some exploring of your own if you have a debugger that can relate to the source code. If you have Turbo C or Turbo Pascal, you can write short programs and use the Turbo Debugger to see how each statement was compiled. Microsoft's CodeView offers a similar facility. I recommend doing this.

There are a number of advanced texts on compilers. Since 1977 the standard reference work on compilers has been Aho and Ullman [1977], which was succeeded by Aho, Sethi, and Ullman [1986], both affectionately known as the Dragon Books from the pictures on their covers. People have taught themselves out of these books. There may be some things about compilers that aren't in them, but not many. Other recent books include Barrett et al. [1986], Fischer and LeBlanc [1988], and Tremblay and Sorenson [1985]. The references at the end of this book list these and some of the more important papers on the subject.

CHAPTER 2

LEXICAL ANALYSIS

The basic task of the lexical analyzer is to scan the source-code string and break it up into small, meaningful units, that is, into the tokens we talked about in Chapter 1. For example, given the statement

if distance >= rate*(end_time - start_time) then distance := maxdist;

the lexical analyzer must find the keywords if and then; the identifiers distance, rate, end_time, start_time, and maxdist; the operators *, -, and :=; the relational operator >=; the parentheses; and the semicolon. In this chapter we will see how these units are isolated and extracted from the source string.

The lexical analyzer may take care of a few other things as well, unless they are handled by a preprocessor:

Removal of comments.
Comments are marked with special symbols; the scanner must detect these symbols and pass over all of the source string until the end of the comment has been found.

Case conversion. In most programming languages, capitalization is ignored; in order to prevent names like arraysize and ArraySize from being treated as different identifiers, the scanner converts all letters to uppercase (or lowercase) internally. Within a character-string constant, of course, case conversion must be turned off.

Removal of white space. White space is any character code that results in empty space when the program is printed; it normally includes blanks, tab characters, carriage returns, and line feeds. In most modern programming languages, white space is needed to separate certain kinds of tokens; once white space has served this purpose, it must be passed over. FORTRAN was designed to require no white space at all; the statements

do 100 i = 1, 30, 2

and

do100i=1,30,2

are both legal, and most FORTRAN compilers convert the former to the latter in the course of the lexical scan.

Interpretation of compiler directives. Many programming languages provide ways of controlling the operation of the compiler from within the program. For example, in Turbo Pascal the strings {$ or (*$ mark a compiler directive; {$R- means that range checking is to be disabled, and {$R+ means that it is to be enabled. Many languages have include directives. In C,

#include <stdio.h>

means that the contents of a file named stdio.h are to be inserted into the source code right where the directive appears. The lexical analyzer must open this file and read from it until it is exhausted; then the analyzer must close the file and return to the main program file.

Communication with the symbol table. In strongly typed languages, the compiler must record the attributes of each identifier when the identifier is declared; this normally includes what its type is and what its size is, and every identifier must be declared before it is used. In such languages the scanner can just check to see whether a symbol-table entry exists for an identifier it has found; if none exists, the identifier is undeclared. If there is an entry, the scanner passes that entry to the parser.

Preparation of the output listing. Most compilers create a file containing an annotated version of the programmer's source code. The lexical analyzer normally creates this file. During the scan, it must keep count of the lines of the source code; it normally writes each numbered line to the listing file. The scanner must also insert error and warning messages as needed, and it must prepare the table of summary data that appears at the end of most source listings.

2.1 TOKENS AND LEXEMES

In our example,

if distance >= rate*(end_time - start_time) then distance := maxdist;

we have five identifiers. But for parsing purposes, all identifiers are (with a few exceptions) the same, and it is necessary only to tell the parser that the next thing in the input is an identifier. On the other hand, it will clearly be important, eventually, to be able to distinguish among them. Similarly, for parsing purposes, one relational operator is the same as any other, since the grammatical structure of the statement in our example would not change if the >= were changed to > or even <=. So all we need to tell the parser is that we have found a relational operator. But again, we will eventually have to distinguish among these various relational operators, since they are functionally different even though they are grammatically the same.
We handle this distinction as follows: The generic type, as passed to the parser, is called a token, and specific instances of the generic type are termed lexemes. We may say that a token is the name of a set of lexemes all of which have the same grammatical significance for the parser. Thus, in our example, we have five instances of the identifier token (abbreviated id). The token is id, and the lexemes are distance, rate, end_time, start_time, and maxdist. We have one instance of the relational operator token (usually abbreviated relop); the corresponding lexeme is >=.

In many cases, a token may have only one valid instantiation; then we use the same word or symbol for both the token and the lexeme. For example, we cannot simply lump if, then, and the other keywords together under a single token "keyword," because these keywords play different grammatical roles. In such cases, therefore, the token is if and the lexeme is if; the token is then and again the token is the lexeme, and so on. Similarly, for the assignment operator, the token is := and the lexeme is :=. Our example contains the following tokens and lexemes:

Token    Lexemes
id       distance, rate, end_time, start_time, maxdist
relop    >=
*        *
-        -
(        (
)        )
if       if
then     then
:=       :=
;        ;

In our example, what the parser receives from the lexical analyzer is

if id relop id * ( id - id ) then id := id ;

Thus the lexical analyzer must do two things: it must isolate tokens and also take note of the particular lexemes. It has an additional task: when an id is found, it must confer with the symbol-table handler. If the identifier is being declared, the semantic analyzer (which we will study in Chapter 5) will have a new entry created for the lexeme in the symbol table. If the identifier appears elsewhere, the lexical analyzer must have the symbol-table handler verify that there is an entry for the lexeme in the symbol table.

2.2 BUFFERING

In principle, the analyzer goes through the source string a character at a time; in practice, it must be able to access substrings of the source. Hence the source is normally read into a buffer area so the scanner can back up. We normally use a technique known as double buffering. The buffer area consists of two halves. Initially, source characters are read into both halves, as shown here:

if distance >= rate*(end_time - start_time) then dis

In finding tokens, the scanner works its way through the buffer. As soon as it has gotten well into the second half, the first half is replenished with new source read from the file, as shown here:

[diagram: the first half of the buffer now holds new source text, while the scanner is still working through the old text in the second half]

Notice that the scanner hasn't worked through all of the second half yet. When it reaches the end of the second half, it wraps around to the beginning of the buffer and continues with the new portion of the code. Thus in this example the identifier distance, cut off at the end of the buffer, is completed by wrapping around to the start. Once the scanner is well into the first half, the second half is replenished similarly. In this way, the scanner has access to a substantial amount of source code, and line breaks do not create any special problems. The line breaks themselves (normally the ASCII carriage-return and line-feed characters) are treated as white space.

The scanner needs a minimum of two subscripts to note places in the buffer. A subscript is needed to mark the beginning of a lexeme; we will call this the lexeme-start pointer. There is a current-character pointer which normally advances through the buffer. The end of a lexeme is normally detected when the current-character pointer runs past it. It has to work this way, since if the characters scanned are f, o, and r, it is not yet clear, when the r is read, whether this is the keyword for or the beginning of an identifier, possibly form.
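In code, the two pointers are simply indexes into the buffer. The following sketch shows the idea in Turbo Pascal; it uses a single buffer, invented names, and none of the wraparound or replenishing logic, so it illustrates only the bookkeeping, not a piece of a real scanner:

program BufferSketch;

const
  BufSize = 4096;

var
  buffer: array[1..BufSize] of char;   { source text read from the file }
  lexStart: integer;                   { marks the beginning of the current lexeme }
  curChar: integer;                    { advances through the buffer }

{ Called when curChar has just run past the end of a lexeme: }
{ copy the lexeme out and get ready to scan the next one.    }
function ExtractLexeme: string;
var
  s: string;
  i: integer;
begin
  s := '';
  for i := lexStart to curChar - 1 do
    s := s + buffer[i];
  lexStart := curChar;      { start afresh on the next lexeme }
  ExtractLexeme := s
end;

begin
  { a tiny demonstration: the buffer holds "rate*" }
  buffer[1] := 'r'; buffer[2] := 'a'; buffer[3] := 't';
  buffer[4] := 'e'; buffer[5] := '*';
  lexStart := 1;
  curChar := 5;               { has just run past the final e }
  writeln(ExtractLexeme)      { prints: rate }
end.

The wraparound of the real double buffer complicates the subscript arithmetic, but the idea is the same.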
The same problem arises with multi-character operators: if the character scanned is <, the scanner has no way of knowing whether this is going to be the lexeme < or the beginning of the lexeme <= or <>. It must examine at least one more character to find out.

For example, in our sample statement, while the identifier rate is being scanned, the lexeme-start pointer will be pointing to the initial r. The current-character pointer will run through the rest of the word. When it encounters the asterisk after the final e, it will recognize that it has passed the end of the lexeme; at this point, it will back up the current-character pointer (so that it will be able to reexamine the asterisk when looking for the next token) and extract the substring starting at the lexeme-start pointer and ending at the current-character pointer. It will then advance the lexeme-start pointer to the current-character position so it can start afresh on the next lexeme.

2.3 FINITE-STATE AUTOMATA

The compiler writer defines tokens in the language by means of regular expressions. These are a compact notation that indicates what characters may go into the lexemes belonging to that particular token and how they go together. We will see many regular expressions when we get to Section 2.6, but to satisfy your curiosity for the moment, here is one:

(ε | + | -)(dd*).(dd*)

This is the expression specifying what constitutes a legal floating-point (real) number in Pascal.

The lexical analyzer is best implemented as a finite-state machine or finite-state automaton (FSA). This is the basis of the sys[...]. In this chapter I must explain how finite-state automata and regular expressions are related (in more detail than the [...]) and bridge the gap between the expressions and the machines. FSAs are [...] an application of graph theory.

Informally, we may say that a FSA is a system of a finite set of states with rules for making transitions between states. When input signals are applied to the machine, it changes state in response to them. (Some authors describe the machine as reading the inputs from a tape. This is unnecessarily restrictive for our purposes, and we will simply envision the machine as somehow receiving the inputs. The lexical scanner reads its inputs from the buffer.)

The operation of the automaton is as follows: It starts in an initial state. As it receives inputs, it changes state; the new state depends only on the current state and the input it receives. The state of the machine reflects the recent history of the machine, specifically what kind of inputs it has received. Its response to a new input thus depends on what inputs it has received in the recent past.

2.3.1 State Diagrams and State Tables

To take an example far removed from compilers, consider a vending machine. As you walk up to the machine, it is in its starting state. I said that the state reflects the recent history of the machine; the starting state indicates that nothing has happened yet. As you put coins in the slot, the machine changes state. To keep things simple, suppose that a candy bar costs only 25 cents. The states of the machine will indicate the total amount of money deposited; thus if you insert a nickel, the machine will change to a state indicating that the total amount received so far is 5 cents. If you then insert another nickel, it will change to a state indicating that the total amount received is now 10 cents.
If you then press a button to select a candy bar, the machine will ignore this input, since the proper amount has not yet been paid. When the total amount reaches at least 25 cents, however, the machine will now respond if you select a candy bar. It will deliver the candy bar and return to its starting state. We can summarize this operation with the state diagram shown here:

    Figure 2.1  [state diagram of the vending machine: State 0 (starting state) through State 5, for 0 to 25 cents, plus an overpayment state 6; edges labeled n, d, q, and s]

You will recognize this as a directed graph; the nodes in the graph are the states, and the edges indicate transitions between states. Each edge is labeled with the input that enables the transition; here n stands for nickel, d for dime, q for quarter, and s for the select button. State 0 is identified as the starting state by a little arrow coming in out of nowhere. Each of the other states is labeled with the corresponding amount of money received. We may also represent the machine with the following state table, whose entries give the next state:

    Current |              Inputs
    state   | Nickel   Dime   Quarter   Select
       0    |    1       2       5         0
       1    |    2       3       6         1
       2    |    3       4       6         2
       3    |    4       5       6         3
       4    |    5       6       6         4
       5    |    6       6       6         0*
       6    |    6       6       6         0*

States 1 through 5 record the amount of money received so far; the * indicates that on this transition, the candy bar is delivered and any change due the user is sent to the coin return. State 6 indicates an overpayment; we assume that any coins deposited in these cases are moved to an overflow area for return as part of the change.

This example, besides presenting a FSA in familiar terms, should also demonstrate that FSAs are useful for many other purposes in addition to lexical analysis. In the case of lexical analysis, the inputs are the successive characters of the source code, and the state of the scanner reflects how much of the current token it has seen so far.

As an example, let us see how a FSA would detect an identifier. In Pascal, an identifier consists of an initial letter followed by any number of letters and digits. (Many implementations also allow certain special characters as well; we will ignore these.) We can represent this by the following state diagram:

    Figure 2.2  [State 0 --l--> State 1; State 1 loops on l and d; State 1 --#--> State 2]

State 0, the starting state, indicates that nothing has been received at that point. Like the vending machine, the lexical analyzer is waiting for something to happen. If a letter (indicated by l) is found in the source string, the scanner moves to State 1. State 1 indicates that the scanner is now in the midst of an identifier. Further letters and digits (indicated by d) leave the scanner in State 1, since the end of the identifier has not yet been detected. When some character other than a letter or a digit is encountered (indicated by #), the scanner then knows that it has passed the end of the identifier and moves to State 2. State 2 indicates that an identifier has been found and that the current-character pointer is one character past the end. We say that State 2 is an accepting state.
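This little machine transcribes directly into code. The sketch below is mine, not the book's: it hard-codes the three states of Figure 2.2 (plus a dead state for rejection) and reports how many characters the identifier occupied, anticipating the back-up behavior described in Section 2.2.

{ A direct transcription of the identifier recognizer of Figure 2.2. }
{ State 3 is a dead state of my own, entered when the first          }
{ character is not a letter. The end of the input string plays the   }
{ role of the # input.                                               }
function recognize_id(s: string; var len: integer): boolean;
var
  state, i: integer;
  c: char;
begin
  state := 0;
  i := 1;
  repeat
    if i <= length(s) then c := s[i]
    else c := ' ';                      { end of string acts like # }
    case state of
      0: if c in ['a'..'z', 'A'..'Z'] then state := 1
         else state := 3;               { reject: no initial letter }
      1: if c in ['a'..'z', 'A'..'Z', '0'..'9'] then state := 1
         else state := 2;               { ran past the end: accept }
    end;
    i := i + 1
  until (state = 2) or (state = 3);
  recognize_id := (state = 2);
  len := i - 2                          { characters consumed before the terminator }
end;

Applied to the buffer contents rate*(... above, it accepts with len = 4; the caller would then back up the current-character pointer so the asterisk is seen again.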
2.3.2 Formal Definition

We are now ready for a formal definition. A FSA M, as used for pattern recognition, consists of

    A finite set of input symbols Σ, the input alphabet
    A finite set of states Q
    A starting state q0 ∈ Q
    A set of accepting states F ⊆ Q
    A state-transition function N : (Q × Σ) → Q

We can write M = (Q, q0, F, N). Expressions of this kind are known as n-tuples. Σ is the set of characters that the automaton can read, and N defines how M will change state on each input. It is usually written as a table, as in the vending-machine example, but the functional form

    N(qi, x) = qj

is also convenient for certain purposes. This form says that when M is in State qi, an x input takes it to State qj. The accepting states are also called final states (hence the symbol F), but this is misleading, since M does not necessarily stay in them; I will continue to call them accepting states.

When we represent N(q, x) with a table, we have a row for every state and a column for every character in Σ. These, together with the table entries, tell us most of what we need to know about the machine. If we adopt a couple of simple conventions, the table will tell us everything. First, we will always put the starting state in the first row; second, we will underline the accepting states.

The directed-graph representation is as follows: There is a node for every state, and every node is labeled with the number of its corresponding state. Nodes corresponding to accepting states are normally identified with a double circle, and the starting state is marked by an edge coming in from nowhere, as it was in the vending-machine example. If N(qi, x) = qj, then there is an edge from Node qi to Node qj labeled with x. For example, consider the automaton M1, shown here:

    Figure 2.3  [state diagram of M1; State 1 is the starting state and State 3 is an accepting state]

    [M1's state table is not legible in this reproduction; its transitions can be read from the histories below.]

We will try applying some strings to M1 in a moment, but first, let us modify this example in a way that will be important to us later on. Let M1' be as shown here:

    Figure 2.4  [state diagram of M1': M1 with an additional, unreachable State 5]

    [M1''s state table is likewise not legible in this reproduction.]

As this example illustrates, it is entirely possible to have unreachable states in an automaton. These states may or may not correspond to something physical in the implementation of the automaton, but it should be clear that unreachable states may be removed from the automaton without changing its behavior in any way. It is true that there is a transition from State 5 to State 4 on an input b, but since the automaton can never enter State 5 in the first place, this transition is irrelevant. If we remove State 5, we'll never miss it; hence this machine is equivalent to M1.

2.3.3 Acceptance

We use FSAs for recognizing strings; the machine must recognize every character of the string as it is read. A character string over Σ is applied to the machine one character at a time. If, when the last character has been read, the machine is in an accepting state, the string is accepted; if the machine merely passes through an accepting state along the way but does not end in an accepting state, the string is not accepted. In terms of the directed-graph representation of the automaton, then, we can restate acceptance: A character string w is accepted by M iff¹ there is a path through the graph, from the starting state to any accepting state, whose edges spell out w.

¹ if and only if

If we apply a string w1 = abaabc to M1, it will pass through states 1, 2, 2, 4, 1, 2, and 3, as indicated by the following history:

    (1) --a--> (2) --b--> (2) --a--> (4) --a--> (1) --b--> (2) --c--> ((3))

    Figure 2.5(a)

This history displays the path from State 1 to State 3 whose edges spell out w1. Since M1 ended up in State 3, which is an accepting state, the machine accepts the string w1. If we apply the string w2 = cccaaa, then M1 will pass through states 1, 3, 2, 1, 2, 4, and 1. Again we can draw this as a history:

    (1) --c--> (3) --c--> (2) --c--> (1) --a--> (2) --a--> (4) --a--> (1)

    Figure 2.5(b)

This time M1 ended up in State 1; since this is not an accepting state, w2 is not accepted by M1.

Here is a second example.
The machine M2 has a state diagram and table over the input alphabet {0, 1}, with states A through D:

    Figure 2.6  [state diagram and table of M2; not legible in this reproduction]

The input string 011100 leaves M2 in State D:

    Figure 2.7(a)

Since State D is an accepting state, this string is accepted. The input 110011 leaves M2 in State C:

    Figure 2.7(b)

Since State C is not an accepting state, this string is not accepted.

A language is any set of strings; a language over an alphabet Σ is any set of strings made up only of characters from Σ. The language accepted by M is the set of all strings over Σ that are accepted by M. We write this as L(M). To see how this relates to compilers, let us say that we have one FSA for every token. (That's not quite right, but it will do for now.) Then the language accepted by the FSA is the set of all lexemes embraced by the token. Two automata Mi and Mj are equivalent iff L(Mi) = L(Mj), that is, if and only if they both accept the same language.

A FSA can easily be programmed if the state table is stored in memory as a two-dimensional array. That is, given

    table: array [1..nstates, 1..ninputs] of byte;

then a FSA can be implemented in software with a loop along the following lines,

    state := 1;
    for i := 1 to length(w) do begin
      col := char_to_col(w[i]);
      state := table[state, col]
    end;

where w is the input string and char_to_col is some function associating each character with a column in the state table.
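To see the loop at work end to end, here is a complete toy program built around it. The machine it drives is a stand-in of my own, not one of the book's: a two-state DFA accepting strings over {a, b} with an even number of a's, chosen only to keep the table small.

program fsademo;
{ Table-driven DFA simulation. State 1 is both the starting state }
{ and the only accepting state: an even number of a's seen so far. }
const
  nstates = 2;
  ninputs = 2;
  table: array[1..nstates, 1..ninputs] of byte =
    ((2, 1),     { State 1: a -> 2, b -> 1 }
     (1, 2));    { State 2: a -> 1, b -> 2 }

function char_to_col(c: char): integer;
begin
  if c = 'a' then char_to_col := 1 else char_to_col := 2
end;

function accepts(w: string): boolean;
var
  state, i: integer;
begin
  state := 1;
  for i := 1 to length(w) do
    state := table[state, char_to_col(w[i])];
  accepts := (state = 1)       { did we end in an accepting state? }
end;

begin
  writeln(accepts('abba'));    { TRUE:  two a's }
  writeln(accepts('ab'))       { FALSE: one a left over }
end.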
2.4 NONDETERMINISTIC FINITE-STATE AUTOMATA

So far, the behavior of our machines has been completely predictable; but there is another type of FSA in which the state transitions are not predictable. In these machines, the state-transition function is of the form

    N : Q × (Σ ∪ {ε}) → P(Q)

where P(Q) is the power set of the states. The lowercase Greek epsilon ε is used to represent the null string. (Some authors use a Greek lambda, either λ or Λ.) This means two things: (a) the automaton can change state without reading an input character (that is the significance of the ε in the domain), and (b) the transition may be to more than one state (that is the significance of the power set in the codomain). Since this makes the behavior of the automaton unpredictable, we call it a nondeterministic FSA. In order to keep these two types of machine distinct, I will henceforth refer to DFAs (deterministic FSAs) and NFAs (nondeterministic FSAs). In applying NFAs to lexical analysis, we will be mainly interested in machines having ε-transitions, but the subject is more easily approached if we consider both types.

(Some authors reserve the term nondeterministic for automata in which transitions can be to more than one state and consider automata with ε-transitions as a special case. For our purposes, it will be simpler to consider both types as NFAs.)

As an example of a NFA, consider the machine M3 with the following state table:

    Current |       Inputs
    state   |    a          b
       1    |  {1, 2}     {1}
       2    |  {2}        {1, 3}
       3    |  {2, 3}     {1, 3}

Note that the table entries are now sets of states, as the form of the next-state function says they will be. The table entry {1, 3} in the second row means that from State 2, on an input b, M3 can change either to State 1 or to State 3 (it can't do both). The state diagram for this machine is shown here:

    Figure 2.8  [state diagram of M3]

I said previously that a string w is accepted if there is a path in the state diagram, from the starting state to an accepting state, whose edges spell out w. Applying this definition to a NFA means that if the machine can end up in an accepting state, by being lucky enough to make the correct transitions each time, then the string is accepted. If it is also possible for the machine to end in a nonaccepting state, that is irrelevant. If there is any path to an accepting state, the fact that there may also be a path to a nonaccepting state does not matter.

All this will become clearer if we consider the history of our sample machine when given the string w1 = baabab. One possible history is shown here:

    (1) --b--> (1) --a--> (1) --a--> (1) --b--> (1) --a--> (1) --b--> (1)

    Figure 2.9(a)

In this case M3 ends up in State 1 and it appears that the string will not be recognized. Here is another possible history:

    (1) --b--> (1) --a--> (2) --a--> (2) --b--> (3) --a--> (3) --b--> (3)

    Figure 2.9(b)

Now M3 ends up in State 3 and again the string is not recognized. But to determine whether w1 can ever be recognized, we must consider all eventualities. To do this, we must draw a history that represents all possible transitions. Such a history is shown in the following figure:

    Figure 2.10  [history of M3 on w1 showing all possible transitions at every stage]

The arrow under each input symbol applies to all the transitions shown at that stage. Where an input can take the machine from one state to two or more new states, the history must follow up both alternatives. But since there are a finite number of states, the representation does not get out of hand.

The first stage of the diagram reflects the fact that on the initial b input, M3 can only stay in State 1. The second stage shows that from State 1, an a input can either leave M3 in State 1 or take it to State 2. The third stage of the history follows up both of these possibilities: from State 1, an a can take M3 either to State 1 or State 2, and from State 2 an a input can only leave M3 in State 2. Since we have two possible ways of entering State 2, the two paths converge on State 2 at this point. As the diagram continues, we find that at the end of the input string, M3 will always be in either State 1 or State 3. Neither of these is an accepting state, so now we know for sure that w1 will never be accepted.

The next figure shows a similar history for the input w2 = ababa. One possible path is 1, 1, 1, 1, 1, 1; this lands M3 in State 1, and if this were all we had to go on, it would appear that w2 was not recognized. But another possible path is 1, 2, 3, 3, 3, 2; here M3 ends in State 2, and we find that w2 is accepted after all. Note that by drawing a complete history like this and observing the set of states in which M3 can end up, we can see at a glance whether the string will be accepted. Indeed, the figure shows all possible paths from the starting state to an accepting state whose edges spell out the string; in this example, there are 240 of them.

    Figure 2.11  [history of M3 on w2 showing all possible transitions]

The key thing to be observed from these examples, however, is that in a NFA, transitions can be thought of as occurring between subsets of the states. Furthermore, while transitions between individual states are unpredictable, transitions between subsets are cast in concrete. If M3 is in the subset {1, 2}, then an a input will always leave it in the same subset, and a b input will always take it to the subset {1, 3}. Our definition of acceptance then means that if M3 ends in a subset that includes an accepting state, the string is accepted. This observation has an important consequence, which we will exploit shortly.
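The determinism of subset-to-subset transitions can be exploited directly in code: instead of guessing which transition the NFA makes, we can track the whole subset of states it might occupy. Here is a sketch for M3; the use of Pascal sets and all the names are mine.

program nfasim;
{ Simulate the NFA M3 by tracking the subset of states it could    }
{ occupy. Transitions between subsets are deterministic, as the    }
{ text observes, so an ordinary loop suffices.                     }
type
  stateset = set of 1..3;
const
  { nexta[q] and nextb[q] give N(q, a) and N(q, b) for M3 }
  nexta: array[1..3] of stateset = ([1, 2], [2], [2, 3]);
  nextb: array[1..3] of stateset = ([1], [1, 3], [1, 3]);
  accepting: stateset = [2];

function accepts(w: string): boolean;
var
  current, nxt: stateset;
  q, i: integer;
begin
  current := [1];                          { the starting state }
  for i := 1 to length(w) do begin
    nxt := [];
    for q := 1 to 3 do                     { union of the members' next states }
      if q in current then
        if w[i] = 'a' then nxt := nxt + nexta[q]
        else nxt := nxt + nextb[q];
    current := nxt
  end;
  accepts := (current * accepting <> [])   { any accepting state reachable? }
end;

begin
  writeln(accepts('baabab'));   { FALSE: the machine ends in {1, 3} }
  writeln(accepts('ababa'))     { TRUE:  State 2 is among the possibilities }
end.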
2.4.1 ε-Transitions

As an example of a machine with ε-transitions, consider M4, defined by the following state table:

    Current |          Inputs
    state   |    a       b       ε
       1    |  {1}     {1}     {2}
       2    |  {2}     {3}     { }
       3    |  {2}     {1}     {2}

The heading on this table may be misleading: strictly speaking, ε is not an input; it is the absence of an input. The ε-column lists the transitions that M4 is capable of making spontaneously, without any enabling input symbol. There are no ε-transitions out of State 2, and we indicate this with the empty-set symbol { }. Here is the state diagram for this machine:

    Figure 2.12  [state diagram of M4]

In drawing a history for such a machine, we will represent ε-transitions by vertical arrows; this reflects the fact that these state transitions occur without using up any input symbols. Here is such a history for the input w1 = aaba:

    Figure 2.13  [history of M4 on w1, with ε-transitions drawn as vertical arrows]

In the starting state, before any inputs have been received, M4 is capable of making a spontaneous transition (the ε-transition) to State 2, so we show this by a vertical arrow. If M4 has stayed in State 1, the first a input leaves it there; if it has made the ε-transition, that input leaves it in State 2. From State 1 it can again make an ε-transition to State 2, and this is shown at every stage. The rest of the history is constructed in the same way. At the end of the input, M4 will be in either State 1 or State 2; neither is an accepting state, so M4 does not accept w1.

Here is the history for M4 given the input w2 = abab. We see that M4 can end up in any of States 1, 2, or 3; since State 3 is an accepting state, it follows that w2 is accepted by M4.

    Figure 2.14  [history of M4 on w2]

The important thing to note about these histories is that, again, transitions between states are unpredictable (since we don't know whether M4 will make an ε-transition or not), but that transitions between subsets are still deterministic.

2.4.2 Equivalence

A consequence of this is that for any nondeterministic machine M, we can construct an equivalent deterministic machine M'. M' simulates the operation of M by having one state for every subset of M-states. If M makes a transition from a subset A to a subset B on some given input, then M' makes a transition from the state corresponding to subset A to the state corresponding to subset B. A consequence of this fact, in turn, is that NFAs cannot accept any wider variety of languages than DFAs can.

In that case, what is the good of them? This question acquires additional force since any computer in proper working order is essentially deterministic. There are two answers to the question. First, in the general theory of automata, which embraces a wider variety of machines than we have yet seen, nondeterminism can make a profound difference in the power of some types of machine. It must have been natural, under these circumstances, for researchers to consider what the effect of making a FSA nondeterministic would be; the negative finding, that it makes no difference at all, is itself of interest.

Second, our definitions of tokens are most compact and economical when written as regular expressions, but there is no good way of constructing state tables directly from regular expressions. On the other hand, it is relatively easy to construct from a regular expression an equivalent NFA, and from this NFA we can construct an equivalent DFA, which can then be used in our lexical analyzer. The language accepted by the DFA will comprise all the lexemes belonging to the token. Thus the process of generating state tables for the scanner consists of the following steps:

    Tokens --> Regular expressions --> NFA --> DFA

Because I wanted to start by explaining how a lexical analyzer is implemented, we started at the end of this chain; we are now in the process of working backward through the chain. We have illustrated DFAs in some detail, and also NFAs. We will next describe how to construct an equivalent DFA for a given NFA.
In Section 2.6 we will describe regular expressions, and later we will complete the chain by showing how to construct an NFA from a given regular expression. The overall process is laborious, but it doesn't require much in the way of brains. This means that we can delegate the design task to a computer, since computers are good at carrying out laborious tasks, even though (as you must know by now) they are not very bright.

2.5 THE SUBSET CONSTRUCTION

Constructing an equivalent DFA from a given NFA hinges on the fact that transitions between state sets are deterministic even though transitions between states may not be. I will show the basic idea first. This will make clear the rationale behind the construction, but it will not be practical, for reasons I will describe. We will then consider an alternative procedure, which is based on the basic idea and is practical, although more laborious.

The basic idea is this: For the time being, assume that there are no ε-transitions. We will be able to fit ε-transitions into the general scheme once we are familiar with it. Let M be the given NFA and let M' be the required DFA. For every subset of states in M, we will have a single state in M'. M' will simulate the behavior of M by making transitions between its states the way M makes transitions between its subsets. This leads to the following general method: Let M = (Q, q0, F, N) be the given NFA and let M' = (Q', q0', F', N') be the required equivalent DFA. Then

1. States of M' correspond to subsets of Q. We will label these states by listing the members of the corresponding subset in square brackets; thus the state corresponding to the subset {7, 9, 17} is labeled [7, 9, 17]. (The brackets help avoid confusion between states of M and states of M'.) Where the states are labeled by letters, a subset such as {B, E, J} gives the state [BEJ].

2. The initial state of M' is [q0].

3. Accepting states of M' are those named after subsets containing at least one accepting state of M.

4. If N({p, q, r}, x) = {s, t, u} in M, then in M', N'([pqr], x) = [stu].

It remains to explain how we find N({p, q, r}, x) when the state table for M gives N(·) only for single states. The rule is: The next-state function of a subset A is the union of all the next-state functions of its elements. For example, in our machine M3, the next-state function for the subset {1, 2} on an input b is N(1, b) ∪ N(2, b) = {1} ∪ {1, 3} = {1, 3}. The next-state function for the subset {1, 2, 3} on an input a is {1, 2} ∪ {2} ∪ {2, 3} = {1, 2, 3}. The next-state function for the empty set on any input whatever is always the empty set: nothing will come of nothing. In the new machine, accepting states will be those states named after subsets that include accepting states of M.

As an example of this, we will return to our first nondeterministic machine M3. We base the states of the equivalent deterministic machine M3' on the power set of the states in M3. For each state, the next-state function is obtained from the next-state function of the corresponding subset in M3. The state table for the given machine is repeated first for convenience; that for the equivalent DFA follows it:

    Current |       Inputs
    state   |    a          b
       1    |  {1, 2}     {1}
       2    |  {2}        {1, 3}
       3    |  {2, 3}     {1, 3}

    Current   |         Inputs
    state     |     a            b
    [1]       |  [1, 2]       [1]
    [2]       |  [2]          [1, 3]
    [3]       |  [2, 3]       [1, 3]
    [1, 2]    |  [1, 2]       [1, 3]
    [1, 3]    |  [1, 2, 3]    [1, 3]
    [2, 3]    |  [2, 3]       [1, 3]
    [1, 2, 3] |  [1, 2, 3]    [1, 3]
    [ ]       |  [ ]          [ ]

(Following our convention, the accepting states, whose subsets contain State 2, are [2], [1, 2], [2, 3], and [1, 2, 3].) This is how the corresponding state diagram looks:

    Figure 2.15  [state diagram of the power-set machine M3']

For example, in constructing the row for [1, 3], we proceed as follows: [1, 3] is named after the subset {1, 3}. On an input of a, the next-state function for State 1 was {1, 2} and the next-state function for State 3 was {2, 3}; the union of these subsets is {1, 2, 3}, and so the entry for M3' is the state [1, 2, 3].
On an input of b, the next-state function for State 1 was {1} and the next-state function for State 3 was {1, 3}; the union of these subsets is {1, 3}, and so the entry for M3' is the state [1, 3].

There are two reasons why this construction is impractical. First, as the number of states in M increases, the number of states in M' increases drastically. You will recall that for any set Q, the cardinality of P(Q) is 2^|Q|. This means that as the number of states in M increases linearly, the number of states in M' increases exponentially. If we have a NFA with 20 states, |P(Q)| is something over a million.

For the second reason why this basic method is impractical, look at the state diagram again. You will see that four of these states are unreachable: there is no path from the starting state to States [ ], [2], [3], or [2, 3]. But I said earlier that unreachable states can safely be omitted; in that case, there is no reason to include them in M3'. This leads to the practical method of constructing the deterministic machine: the trick is to create new states only as we need them.

The basis of a program for computing the equivalent DFA is as follows. (I'm still ignoring ε-transitions.) Let the NFA be M and the DFA M', as usual:

1. Initially, the list of states in M' is empty. Create the starting state [q0], named after {q0}.

2. While (there is an uncompleted row in the table for M') do:
   a. Let x = [s1, s2, s3, ..., sk] be the state for this row.
   b. For each input symbol a do:
      i.   Find N({s1, s2, ..., sk}, a) = some set we'll call T.
      ii.  Let y be the state named after T.
      iii. If y is not already in the list of states for M', add it to the list.
      iv.  Enter N'(x, a) = y in the table for M'.

3. Identify the accepting states.

We will use M3 once more as an example. We name the starting state [1] after {1} and create a row in the state table for that state in exactly the same way as before. From this row, we discover that we will need a state [1, 2]. We will create that state (i.e., add it to the current-state column) and fill in its row:

    Current |      Inputs
    state   |    a         b
    [1]     |  [1, 2]    [1]
    [1, 2]  |  [1, 2]    [1, 3]

We now need a new state [1, 3]. Again we add this state to M3' and fill in its row:

    Current |       Inputs
    state   |     a           b
    [1]     |  [1, 2]       [1]
    [1, 2]  |  [1, 2]       [1, 3]
    [1, 3]  |  [1, 2, 3]    [1, 3]

We must now add the state [1, 2, 3]:

    Current   |       Inputs
    state     |     a           b
    [1]       |  [1, 2]       [1]
    [1, 2]    |  [1, 2]       [1, 3]
    [1, 3]    |  [1, 2, 3]    [1, 3]
    [1, 2, 3] |  [1, 2, 3]    [1, 3]

Now we are done; filling in this last row created no new states. It remains only to identify the accepting states in M3', as before; these are the states whose subsets contain State 2, namely [1, 2] and [1, 2, 3]. The corresponding state diagram is:

    Figure 2.16  [state diagram of M3' built incrementally]

Notice that we needed only four states instead of the eight we had to provide when we did it the naive way.

To generate a DFA from an NFA with ε-transitions, we need make only one modification of this procedure. For the starting state in M', and for each new state, we must include all states that can be reached from the current state by means of ε-transitions. This should be apparent from the histories we saw before. Such a state set is known as the ε-closure of the state. (If you think of ε-transitions as a relation ρ on Q, such that qi ρ qj iff there is an ε-transition from qi to qj, then the ε-closure can be thought of as the reflexive, transitive closure of the relation; this explains the peculiar name.) Thus we need only two modifications to our rules for creating the equivalent DFA: Rule 1 becomes

    Create the starting state named after the ε-closure of M's starting state.
And Rule 2.b.i becomes

    Find the ε-closure of N({s1, s2, ..., sk}, a) = some set we'll call T.

We will use M4 as an example; we repeat its state table here for convenience:

    Current |          Inputs
    state   |    a       b       ε
       1    |  {1}     {1}     {2}
       2    |  {2}     {3}     { }
       3    |  {2}     {1}     {2}

Here the ε-closure of State 1 is the set {1, 2} and the ε-closure of State 3 is the set {2, 3}. There is no ε-transition for State 2, so the ε-closure of State 2 is merely {2}. In constructing the equivalent DFA, we must take the starting state from the ε-closure of State 1 in M4. Hence our state table begins,

    Current |   Inputs
    state   |   a      b
    [1, 2]  |

Note that there is no ε-column in this table. (If there were, then M4' wouldn't be deterministic!) Now we fill in the row. The next-state function for 1 in M4 on an input a is {1}, and that for 2 is {2}; because we must include the ε-closures of these states, the resulting subset is {1, 2}, and the table entry is [1, 2] again. For an input b, the next-state functions are {1} and {3}; when we add the ε-closures, the subset becomes {1, 2} ∪ {2, 3} = {1, 2, 3}:

    Current   |      Inputs
    state     |    a         b
    [1, 2]    |  [1, 2]    [1, 2, 3]
    [1, 2, 3] |

where we have added the new state [1, 2, 3] to the table. For the next row, the state transition from {1, 2, 3} on an input a is {1, 2} ∪ {2} ∪ {2} = {1, 2}. (The first subset is the ε-closure of State 1's next-state function, the second subset is that of State 2's next-state function, and the third is that of State 3's.) Doing the same for input b gives us the entry [1, 2, 3]. There are now no unfinished rows in the table, and we are done. We now have the final state table for M4', with the accepting state identified:

    Current   |      Inputs
    state     |    a         b
    [1, 2]    |  [1, 2]    [1, 2, 3]
    [1, 2, 3] |  [1, 2]    [1, 2, 3]

(The accepting state is [1, 2, 3], since its subset contains State 3.) Notice that we had three states in the original machine. The power set is 2³ = 8 subsets. But we never had to create all eight of them in M4'; we needed only two. This is typical.

For a second example, let M5 be defined by the state table,

    Current |          Inputs
    state   |    a       b       ε
       A    |  {A}     {C}     {B}
       B    |  {C}     {B}     { }
       C    |  {A}     {C}     {B}

In M5', the starting state corresponds to ε-closure of {A} = {A, B}. Hence the starting state is [AB]. For this state (with the ε-closures of the next states folded in as we go):

1. N({A, B}, a) = N(A, a) ∪ N(B, a) = {A, B} ∪ {B, C} = {A, B, C}.
   a. y = [ABC].
   b. Add y to the list of new states.
   c. N'([AB], a) = [ABC].

2. N({A, B}, b) = N(A, b) ∪ N(B, b) = {C, B} ∪ {B} = {B, C}. N'([AB], b) = [BC]; add [BC] to the list.

This gives the first row of our table:

    Current |     Inputs
    state   |    a       b
    [AB]    |  [ABC]   [BC]

Here is the second row:

1. N({A, B, C}, a) = {A, B} ∪ {B, C} ∪ {A, B} = {A, B, C}. N'([ABC], a) = [ABC]; [ABC] is already in the list.

2. N({A, B, C}, b) = {B, C} ∪ {B} ∪ {B, C} = {B, C}. N'([ABC], b) = [BC]; this is also already in the list.

The second row of our table is now added:

    Current |     Inputs
    state   |    a       b
    [AB]    |  [ABC]   [BC]
    [ABC]   |  [ABC]   [BC]

For the third row:

1. N({B, C}, a) = {A, B} ∪ {B, C} = {A, B, C}. N'([BC], a) = [ABC], already in the list.

2. N({B, C}, b) = {B} ∪ {B, C} = {B, C}. N'([BC], b) = [BC].

Our table is now complete:

    Current |     Inputs
    state   |    a       b
    [AB]    |  [ABC]   [BC]
    [ABC]   |  [ABC]   [BC]
    [BC]    |  [ABC]   [BC]

There are no more rows to be done; the accepting states in M5' are [ABC] and [BC].
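The whole procedure is mechanical, as promised. Here is a compact sketch of the worklist method applied to M4, with the ε-closure rule folded in. The data layout (Pascal sets over a small state range, parallel arrays for the two input symbols) and all the names are my own simplification.

program subsetdemo;
{ The worklist form of the subset construction, applied to M4 of     }
{ Section 2.4.1, with the e-closure modification for e-transitions.  }
const
  maxq = 3;                             { states of the NFA (M4) }
  maxd = 10;                            { room for DFA states }
type
  stateset = set of 1..maxq;
const
  { M4's transition table: on a, on b, and on epsilon }
  na: array[1..maxq] of stateset = ([1], [2], [2]);
  nb: array[1..maxq] of stateset = ([1], [3], [1]);
  ne: array[1..maxq] of stateset = ([2], [], [2]);
var
  dstate: array[1..maxd] of stateset;   { DFA states, named after subsets }
  da, db: array[1..maxd] of integer;    { the DFA's state table }
  ndfa, row, dummy: integer;
  t: stateset;

{ Close a subset under e-transitions. }
procedure eclosure(var s: stateset);
var
  q: integer;
  changed: boolean;
begin
  repeat
    changed := false;
    for q := 1 to maxq do
      if (q in s) and not (ne[q] <= s) then begin
        s := s + ne[q];
        changed := true
      end
  until not changed
end;

{ Find the DFA state named after subset s, creating it if it is new. }
function lookup(s: stateset): integer;
var
  i, found: integer;
begin
  found := 0;
  for i := 1 to ndfa do
    if dstate[i] = s then found := i;
  if found = 0 then begin
    ndfa := ndfa + 1;                   { Rule 2.b.iii: add it to the list }
    dstate[ndfa] := s;
    found := ndfa
  end;
  lookup := found
end;

{ The union of the next-state functions of the members of s, closed. }
procedure move(s: stateset; onb: boolean; var t: stateset);
var
  q: integer;
begin
  t := [];
  for q := 1 to maxq do
    if q in s then
      if onb then t := t + nb[q] else t := t + na[q];
  eclosure(t)                           { Rule 2.b.i, modified }
end;

begin
  ndfa := 0;
  t := [1];
  eclosure(t);                          { Rule 1: close the starting state }
  dummy := lookup(t);
  row := 0;
  while row < ndfa do begin             { while an uncompleted row remains }
    row := row + 1;
    move(dstate[row], false, t);  da[row] := lookup(t);
    move(dstate[row], true, t);   db[row] := lookup(t)
  end;
  writeln('The DFA has ', ndfa, ' states')   { prints 2, as in the text }
end.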
2.6 REGULAR EXPRESSIONS

Lexical analysis and syntactic analysis normally run off tables that control the operation of these phases of the compiler. It is generally easier and faster to use tables than to write customized code, and if changes must be made, tables can be modified more easily than code can. It should be clear, now, that the state table for a practical lexical analyzer will be large, even when unreachable states are held to a reasonable minimum, and the actual entries in such tables are laborious to compute and easy to get wrong. Hence the tables are best generated automatically. We will now discuss how this is done: first, how we represent a token for the table-generating program, and then how the program converts this representation to the corresponding state table.

Tokens are defined by means of regular expressions. I will defer the formal definition of a regular expression for a moment; informally, we may say that a regular expression over an alphabet Σ is a combination of characters from Σ and certain operators indicating concatenation, selection, or repetition. I will show a few examples of regular expressions and then will present the formal definition.

We usually represent concatenation by simply concatenating the expressions. Occasionally, for greater clarity, we use the raised dot normally associated with multiplication. Thus the concatenation of a and b is written ab or a·b.

Repetition of a character is represented by a superscript. A sequence of five b's is written b⁵ = bbbbb and a sequence of n b's is written bⁿ. In most cases, however, the number of repetitions must be left open. To indicate an indefinite number of repetitions, we use an asterisk; thus b* means zero or more b's strung together. This is known as the Kleene closure (named for Stephen C. Kleene and pronounced "kleenie"), and the asterisk is called the Kleene star. The Kleene closure calls for two comments. First, note that it allows for zero repetitions of the character. Zero repetitions of b means no b's at all; this is the null string ε. Second, where b⁵ specifies exactly one character string, b* specifies an infinite set of strings:

    b* = {ε, b, bb, bbb, bbbb, bbbbb, ...}.

The positive closure, indicated by a superscript plus +, represents one or more repetitions:

    b⁺ = {b, bb, bbb, bbbb, bbbbb, ...}.

We represent a choice between two strings with a vertical bar |. (Some authors use + for this operation, but doing so invites confusion with the positive-closure sign +.) For example, if either 123 or 456 is acceptable, we write 123 | 456.

This notation allows us to specify sets of strings in terms that define their structure; since regular expressions define languages, they give us a way of defining the set of lexemes belonging to a token. We are now ready for the formal definition: A regular expression (RE) over an alphabet Σ is defined as follows:

1. Any character in Σ is a RE.
2. ε is a RE.
3. If R and S are REs, then
   a. RS (or R·S) is a RE,
   b. R|S is a RE,
   c. R* is a RE (as is S*),
   d. R⁺ is a RE (as is S⁺).
4. Only expressions formed by these rules are regular.

You will recognize this as an inductive definition. It is useful not only in proving that a given expression is regular, but also in analyzing a particular RE, and it will be the key to specifying the FSA corresponding to a given RE.

Occasionally, it is useful to allow variables to be REs. For example, we would like to be able to use l to refer to any letter and d to refer to any digit, since this permits writing many token definitions as REs more economically. We will see an example of this shortly.

In evaluating regular expressions, * and ⁺ have the highest precedence and | has the lowest precedence; parentheses are used to override precedence in the usual way.

I must state exactly what the forms RS, R|S, and R* (or R⁺) mean. The regular expression R defines a language L(R), and similarly S defines a language L(S). In that case, RS (or R·S) defines a language each of whose strings is the concatenation of any string from L(R) with any string from L(S). We may write this as

    L(RS) = {vw | v ∈ L(R), w ∈ L(S)}.

This set is sometimes called the concatenation of the languages L(R) and L(S). R|S defines the union of the two languages.
Any string from either L(R) or L(S) will qualify. Here we may write simply

    L(R|S) = L(R) ∪ L(S).

To see the meaning of R*, the Kleene closure of R, start with the definition of R². This is just R·R. Every string in R·R is the concatenation of any string from L(R) with any string (maybe the same one, maybe another) from L(R). In that case, L(R*) is all strings formed from an indefinite number of concatenations of strings from L(R):

    L(R*) = {ε} ∪ L(R) ∪ L(R²) ∪ L(R³) ∪ ··· ,

although this formula should not be allowed to obscure the variety of the strings involved. The definition of L(R⁺) is similar, except that ε is not automatically a member.

Using this definition, we can analyze an expression such as (ab|ab*a)*. First, the individual letters a and b are REs by Rule 1, so ab is a RE by Rule 3a. By Rule 3c, b* is a RE, and from this it follows that ab*a is a RE, again by Rule 3a. In that case, the entire contents of the parentheses is a RE by Rule 3b. Finally, the Kleene closure of the parenthesized expression is also a RE. This may seem to belabor the obvious, but we will find a practical use for this kind of analysis when we get back to the construction of lexical analyzers.

Two REs R and S are equivalent if L(R) = L(S), that is, if they define the same language. Thus to prove the equivalence of R and S, we must proceed in the usual way for sets and show that L(R) ⊆ L(S) and that L(S) ⊆ L(R). Sometimes this can be done informally, arguing, "Anything R can give me I can get from S," and vice versa.

There are a few equivalences that are sometimes helpful in analyzing REs. If R, S, and T are any REs whatever, then

    R(ST) = (RS)T.          (Associativity)
    R|(S|T) = (R|S)|T.      (Associativity)
    R|S = S|R.              (Commutative law)
    R(S|T) = RS | RT.       (Distributive law)
    R⁺ = RR*.
    R⁺ | ε = R*.
    Rε = εR = R.            (Identity element)

But note that concatenation is not commutative: in general, RS ≠ SR.

REs can be used to describe only a limited variety of languages, but they are powerful enough to be used to define tokens. A token can be thought of as a language that comprises its lexemes; by writing the token as a RE, we define exactly what lexemes will be recognized as elements of that language. I should mention one limitation right off, however: many languages put length limitations on lexemes. For example, an identifier may be restricted to a maximum length of 31 characters. REs provide no means of enforcing such limitations; they must be handled by other means.

To familiarize you with REs, we will look at a number of examples. Some of these have immediate applications in compiler writing; others are more abstract.

1. a(b|c) represents ab or ac. Here we have the concatenation of a with a choice of b or c.

2. a*b represents b or ab or aab or aaab or aaaab or ··· .

3. The RE (a|b)* represents any combination of a's and b's. The way to think of the form (···)* is to imagine it as a sort of bucket into which we can reach as many times as we please. Whatever we pull out, we concatenate with what we have pulled out before. (This is what the star means.) In this case, every time we reach into the bucket we may take either an a or a b; we can obtain the string ab, for example, by pulling out an a the first time and a b the second. Clearly, since we are free each time to take either an a or a b, we can build up any combination of a's and b's. We can obtain the null string ε by choosing not to reach in at all.

4. Σ* must then mean any combination of characters drawn from the set Σ; here Σ represents the bucket from which we may pull out characters at will. Note that any RE over Σ represents a subset of Σ*.

5. The RE l(l|d)*, where l is any letter and d is any digit, defines a Pascal identifier. The initial l forces the string to begin with a letter, and the (l|d)* represents any combination of letters and digits.
6. The RE (ab)* represents the language {ε, ab, abab, ababab, ...}. Every time we reach inside the parentheses, we must pull out an ab sequence. (Since there is no | inside the parentheses this time, we have no choice about what to pull out.)

7. (a|b)(c|d) represents either a or b concatenated with either c or d. This gives us the strings ac, ad, bc, bd. Why doesn't it also give us ab? Because we must choose either a or b, and, having chosen one, we must concatenate it with either c or d.

8. The RE (a|b)(a|b) represents the set of all strings over {a, b} containing exactly two characters. We may pick either a or b from the first pair of parentheses and either a or b from the second pair. Note that because there is no * on either of these parenthesized expressions, we must pick something from each one exactly once: we may not refuse to pick, and we may not pick more than once.

9. The RE ((a|b)(a|b))* represents the set of all strings over {a, b} whose length is an even number. Every time we reach inside the outer pair of parentheses, we will pick up one of the strings defined in Example 8.

10. The RE 0*10* represents the set of all strings over {0, 1} containing exactly one 1.

11. The RE (ε | + | -)(0|d*).(0|d*), where d is any digit, represents the set of all legal floating-point constants in Pascal (except those using the E format). The dot is the decimal point, the first parenthesized expression allows an optional + or -, the second provides the integer part of the number and ensures that there will be at least a 0 left of the decimal point, and the last provides the fractional part and ensures that there will be at least a 0 after the decimal point. (Note that this definition allows some numbers that look silly, but they are legal.)

2.7 REGULAR EXPRESSIONS AND FINITE-STATE MACHINES

Our general goal is the automatic generation of state tables for a lexical analyzer. The path we outlined runs from REs to NFAs to DFAs. We are ready for the one remaining link in the chain: constructing a NFA from a given RE. This is done by Thompson's construction (after Ken Thompson [1968]).

Thompson's construction goes back to our original inductive definition of a RE, providing a machine or an interconnection of machines corresponding to every step of the definition. The construction further ensures that the resulting machine will have distinct starting and accepting states and will have only one accepting state. We know from the inductive definition of a RE that any RE is built up from simpler REs by the operations of union, concatenation, and Kleene closure. So all we need to do is show

1. How to build NFAs that accept ε and individual symbols in Σ;

2. How to build an NFA that recognizes the union, concatenation, and Kleene closure of REs for which we already have NFAs.

We do these as follows.

1. This machine recognizes ε:

    ---> ( ) --ε--> (( ))

    Figure 2.17(a)

(We could do as well with a machine consisting of a single state, but we require that each machine have distinct starting and accepting states.)

2. This machine will recognize a character a in Σ:

    ---> ( ) --a--> (( ))

    Figure 2.17(b)

For the rest of Thompson's construction, we will assume that we have machines MR and MS for recognizing L(R) and L(S), respectively, and we will show ways of interconnecting them in order to recognize the new REs specified by the definition.

3. To recognize R·S, connect the machines as shown here:

    Figure 2.17(c)  [MR and MS in cascade, with MR's accepting state and MS's starting state consolidated into a single state]

Here we have taken the accepting state of MR and the starting state of MS and consolidated them into one: we have broken into MR and MS and joined them at these two states.
It should be clear that this machine recognizes the concatenation of L(R) and L(S), for any string in L(R) will then take MR to the consolidated state, and any string in L(S) will take MS from the consolidated state to its accepting state.

4. To recognize R|S, connect the machines this way:

    Figure 2.17(d)  [MR and MS in parallel between a new starting state and a new accepting state, joined by ε-transitions]

We provide a new starting state that has ε-transitions to the starting states of MR and MS. Clearly a string in L(R) will be accepted since if there was a path from MR's original starting state to its accepting state, then there will be a path from the new starting state as well. Similarly, any string in L(S) will be accepted. Since we wish always to have a single accepting state, we complete the construction by downgrading the accepting states of MR and MS to ordinary states and connecting them by way of ε-transitions to a new accepting state.

5. In constructing a machine to recognize R*, it is easier to start with the positive closure. This machine is constructed from MR as shown here:

    Figure 2.17(e)  [MR with an ε feedback path from its accepting state back to its starting state]

We need only provide a feedback path from the accepting state to the starting state of MR. Clearly we can loop around this path any number of times; hence the modified machine can accept any number of concatenated strings from L(R). The only difference for R* is that we must allow for zero repetitions. We do this by providing another ε-transition that bypasses MR altogether and goes straight to the accepting state. To make this work properly in all cases, we must provide separate starting and accepting states, as shown here:

    Figure 2.17(f)  [MR with feedback, new starting and accepting states, and an ε bypass path]

To illustrate how Thompson's construction provides a machine for any given RE, we will take the expression (ab|ab*a)* as an example; the steps in the procedure parallel the steps by which we analyzed this RE in Section 2.6. We started out by noting that a and b are REs, and we have machines for recognizing a and b, shown above. To recognize ab, we connect these together by consolidating MR's accepting state and MS's starting state:

    Figure 2.18(a)  [the machine for ab]

To recognize b*, we start with a machine to recognize b alone and apply the feedback and bypass paths we saw before. If we cascade this with two machines for a, we have a structure that will recognize ab*a:

    Figure 2.18(b)  [the machine for ab*a]

We now connect the machines for ab and ab*a. This gives us the ability to recognize ab|ab*a:

    Figure 2.18(c)  [the machines for ab and ab*a in parallel]

Finally, to obtain the machine for the entire expression, we again apply feedback and bypass paths; this gives us the final machine shown here:

    Figure 2.18(d)  [the complete NFA for (ab|ab*a)*]

From this NFA, the subset construction will give us a DFA that will also recognize this expression, and that DFA can then be incorporated in a lexical analyzer (assuming that so unlikely a token should appear in the programming language). This completes our chain; we now know how to start with a RE and derive the equivalent DFA, using the NFA as an intermediate stage.

The program Lex, included in the UNIX system, generates a state table for a lexical scanner in essentially this way. The user specifies tokens by means of regular expressions; Lex finds the equivalent NFA and from that generates the state table for the DFA. (For information on using Lex, see Appendix B.)
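Thompson's construction is mechanical enough to code directly. The sketch below is mine, not the book's: it codes the core rules over an edge-list representation, and it implements concatenation with an ε splice rather than the consolidation of states shown in Figure 2.17(c) (the two are equivalent). A real generator such as Lex would add a parser for the RE itself.

program thompson;
{ Each fragment has exactly one starting and one accepting state,   }
{ as the construction requires. Epsilon edges carry the label EPS.  }
const
  EPS = #0;
  maxedge = 100;
type
  frag = record start, accept: integer end;
var
  nstates, nedges: integer;
  efrom, eto: array[1..maxedge] of integer;
  elabel: array[1..maxedge] of char;

function newstate: integer;
begin
  nstates := nstates + 1;
  newstate := nstates
end;

procedure edge(f, t: integer; c: char);
begin
  nedges := nedges + 1;
  efrom[nedges] := f; eto[nedges] := t; elabel[nedges] := c
end;

{ Rule 2: a machine recognizing the single character c. }
procedure atom(c: char; var f: frag);
begin
  f.start := newstate; f.accept := newstate;
  edge(f.start, f.accept, c)
end;

{ Concatenation (Rule 3a), via an epsilon splice. }
procedure cat(r, s: frag; var f: frag);
begin
  edge(r.accept, s.start, EPS);
  f.start := r.start; f.accept := s.accept
end;

{ Union (Rule 3b): new starting and accepting states. }
procedure alt(r, s: frag; var f: frag);
begin
  f.start := newstate; f.accept := newstate;
  edge(f.start, r.start, EPS); edge(f.start, s.start, EPS);
  edge(r.accept, f.accept, EPS); edge(s.accept, f.accept, EPS)
end;

{ Kleene closure (Rule 3c): feedback and bypass paths. }
procedure star(r: frag; var f: frag);
begin
  f.start := newstate; f.accept := newstate;
  edge(f.start, r.start, EPS); edge(r.accept, f.accept, EPS);
  edge(r.accept, r.start, EPS);     { feedback: more repetitions }
  edge(f.start, f.accept, EPS)      { bypass: zero repetitions }
end;

var
  a1, a2, a3, b1, b2, ab, bs, t, absa, u, m: frag;
begin
  nstates := 0; nedges := 0;
  { build (ab|ab*a)* from the inside out, as in Figure 2.18 }
  atom('a', a1); atom('b', b1); cat(a1, b1, ab);     { ab }
  atom('b', b2); star(b2, bs);
  atom('a', a2); atom('a', a3);
  cat(bs, a3, t); cat(a2, t, absa);                  { ab*a }
  alt(ab, absa, u);                                  { ab | ab*a }
  star(u, m);                                        { (ab|ab*a)* }
  writeln('NFA built with ', nstates, ' states and ', nedges, ' edges')
end.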
2.8 THE PUMPING LEMMA

I said earlier that there were limitations on the variety of languages representable by REs. The equivalence of REs and FSAs, together with the directed-graph representation of a FSA, is the key to an important result which we can use to show that certain languages are not regular.

Suppose a FSA has n states, and suppose that it will recognize a string w of length n. This means that there is a path from the starting state to an accepting state whose length is also n. This path cannot be simple, because a path of length n must pass through n + 1 states, and the machine has only n states. By the pigeonhole principle, it must pass through at least one state at least twice. Let this repeated state be qi. (If it passes through several states twice, let qi be any one of them.) Then we can divide w into three substrings, such that w = xyz, where

    x is the substring taking M from q0 to qi.
    y is the substring taking M from qi back to qi.
    z is the substring taking M from qi to the final state.

(Note that x or z may be null strings, but for qi to be repeated, y must have length > 0.) In that case, the machine must also be able to accept xyyz, since the repeated y simply takes M around the loop from qi to qi twice, and similarly it must be able to accept xyyyz. We can "pump" the substring y in this way to inflate the strings accepted by M to any length whatever. Indeed, M must be able to accept the language xy*z.

For example, consider a machine with seven states that accepts the string abdefca, as shown here:

    Figure 2.19  [a seven-state machine; the heavy path spells abdefca, and the substring def takes it around the loop 3-4-5]

The heavy lines show the path M followed in accepting the string; the light lines indicate transitions that don't concern us here. As the figure shows, M passes through State 3 twice; the substring y = def takes it around the loop 3-4-5. But if it can go around that loop once, it can go around it any number of times (including zero) and still accept the strings that take it over that route. Thus it will accept abdefdefca, abdefdefdefca, abdefdefdefdefca, ..., and so on, and even abca.

Clearly the same argument can be used for a string of length less than n if it happens to take M through the same state twice. But for any FSA there will always be a critical length k, such that any string in L(M) of length > k can be pumped, and this critical length will never be greater than the number of states in the machine. What we have proved is the Pumping Lemma for FSAs: If a finite-state machine M has n states, then there is a constant k ≤ n such that any string w accepted by M whose length exceeds k can be written as w = xyz, where y is not null and where every string of the form xyⁱz (i ≥ 0) is also accepted by M.
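Stated symbolically (a standard formulation; the notation here is mine, not necessarily the book's):

\[
|Q| = n \;\Rightarrow\; \exists\, k \le n \ \text{such that}\
\forall\, w \in L(M),\ |w| > k:\quad
w = xyz,\ \ |y| \ge 1,\ \ xy^{i}z \in L(M)\ \text{for all}\ i \ge 0 .
\]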
The chief use of the lemma is negative: it gives us a way of proving that certain languages cannot be accepted by any FSA. Consider, for example, strings that are symmetric about a central character c. If some FSA accepted exactly this language, then any sufficiently long string in the language could be pumped; but pumping repeats a substring on one side of the c without changing the other side, and the symmetry of the string is destroyed. Hence no FSA can accept this language, and no RE can describe it. Fortunately, programming languages do not require tokens of this sort; indeed, the structure of tokens is such that we need not worry about nonregular languages.

As a more realistic application of pumping, we note that the language Lp consisting of all legal sets of parentheses is not regular. (A set of parentheses is legal if it begins with an open parenthesis and if every open parenthesis is matched somewhere down the string by a close parenthesis. Detection of illegal parentheses must clearly be a function of any compiler.) Suppose it were regular; then let Mp be the machine that accepts Lp, and let this machine have n states. Consider the string w = (ᵏ)ᵏ, where k > n. This is the string containing k open parentheses followed by k close parentheses. It is clearly legal; hence this string must be in Lp.

Consider how Mp will accept (ᵏ)ᵏ. In going through the open parentheses, Mp has to respond to k inputs, and in so doing it must repeat some state. Hence we can pump the open parentheses. Similarly, in going through the close parentheses, it will also repeat some state, so we can also pump the close parentheses independently of the open parentheses. Hence if it can accept (ᵏ)ᵏ, it will also accept strings of the form uv*wx*y, where v consists of at least one open parenthesis and x consists of at least one close parenthesis. But no matter how u, v, w, x, and y are selected from w, it will be possible to pump v and x in such a way as to generate an illegal set.

We conclude that Lp is not regular. Parenthesis checking is a function of syntactic analysis, not lexical analysis. In Chapter 3 we will find ways of defining legal sets of parentheses, and the syntactic analyzers we study will always be capable of checking the nesting of parentheses.

2.9 APPLICATION TO LEXICAL ANALYSIS

We are now ready to put all of the foregoing material together. First, for a simple example that exercises everything we have seen, imagine a language in which the input alphabet Σ is {a, b, c} and there are exactly two tokens, X = aa*(b|c) and Y = (b|c)c*. Then the lexical scanner will be a DFA that accepts the language X|Y. We start by constructing the equivalent NFAs for X and Y. Thompson's construction gives us the machine shown here:

    Figure 2.20  [the NFA for X|Y: the machine for X on top, that for Y underneath, joined through states T and U]

The NFA for X is on top, that for Y is underneath, and the lines to and from T and U form the overall machine for recognizing X|Y. We now find the equivalent DFA, working right from the diagram. We will call the NFA M and the equivalent DFA M', as usual. We will write the next-state function for the NFA as N(·) and that for the DFA as N'(·).

The starting state is T and the ε-closure of T is {A, T, K, L, N}, so the starting state of M' is [AKLNT] (we will list the component states in alphabetical order, without commas). We add that to the list of states in M'. To find transitions out of [AKLNT], we note transitions out of {A, T, K, L, N}. There is only one edge labeled a coming out of this set, the one from A to B. Hence N({A, T, K, L, N}, a) = B. But the ε-closure of B is {B, C, E, F, H}; hence in M' the next state is [BCEFH] and N'([AKLNT], a) = [BCEFH]. This is a new state and must be added to the table for processing later.

There is also only one b-transition out of the set, the edge from L to M. The ε-closure of M is {M, P, Q, S, U}, so M' has the state [MPQSU] and N'([AKLNT], b) = [MPQSU]. We must also add [MPQSU] to the table. Similarly, N({A, T, K, L, N}, c) = O and the ε-closure is {O, P, Q, S, U}; therefore N'([AKLNT], c) = [OPQSU], another new state. So far, the state table for M' looks like this:

    Current  |              Inputs
    state    |    a          b          c
    [AKLNT]  | [BCEFH]   [MPQSU]   [OPQSU]
    [BCEFH]  |
    [MPQSU]  |     (still to be filled in)
    [OPQSU]  |

We must now do the second row of the table. To find the transitions out of [BCEFH]:

1. N({B, C, E, F, H}, a) = D and the ε-closure of D is {C, D, E, F, H}, so N'([BCEFH], a) = [CDEFH], another new state.

2. N({B, C, E, F, H}, b) = G and the ε-closure of G is {G, J, U}, so N'([BCEFH], b) = [GJU], a new state.

3. N({B, C, E, F, H}, c) = I and the ε-closure of I is {I, J, U}, so N'([BCEFH], c) = [IJU], also a new state.

Now M' looks like this:

    Current  |              Inputs
    state    |    a          b          c
    [AKLNT]  | [BCEFH]   [MPQSU]   [OPQSU]
    [BCEFH]  | [CDEFH]   [GJU]     [IJU]
    [MPQSU]  |
    [OPQSU]  |
    [CDEFH]  |

The third row gives the transitions out of [MPQSU]:

1. N({M, P, Q, S, U}, a) = { }. The state corresponding to { } will be named [ ], and it must be added to the table; N'([MPQSU], a) = [ ].

2. N({M, P, Q, S, U}, b) = { }, so N'([MPQSU], b) = [ ]. This state has already been added to the table.

3. N({M, P, Q, S, U}, c) = R and the ε-closure of R is {Q, R, S, U}, so N'([MPQSU], c) = [QRSU], a new state.
The state table for M' now looks like this:

    Current  |              Inputs
    state    |    a          b          c
    [AKLNT]  | [BCEFH]   [MPQSU]   [OPQSU]
    [BCEFH]  | [CDEFH]   [GJU]     [IJU]
    [MPQSU]  | [ ]       [ ]       [QRSU]
    [OPQSU]  |
    [CDEFH]  |
    [GJU]    |
    [IJU]    |
    [ ]      |
    [QRSU]   |

The fourth row gives the transitions out of [OPQSU]:

1. N({O, P, Q, S, U}, a) = { }, so N'([OPQSU], a) = [ ]; not a new state.

2. N({O, P, Q, S, U}, b) = { }, so N'([OPQSU], b) = [ ]; again not a new state.

3. N({O, P, Q, S, U}, c) = R and the ε-closure of R is {Q, R, S, U}, so N'([OPQSU], c) = [QRSU], not a new state.

The state table for M' is now this:

    Current  |              Inputs
    state    |    a          b          c
    [AKLNT]  | [BCEFH]   [MPQSU]   [OPQSU]
    [BCEFH]  | [CDEFH]   [GJU]     [IJU]
    [MPQSU]  | [ ]       [ ]       [QRSU]
    [OPQSU]  | [ ]       [ ]       [QRSU]
    [CDEFH]  |
    [GJU]    |
    [IJU]    |
    [ ]      |
    [QRSU]   |

Continuing this way, we end up with the following table (the accepting states, whose subsets contain the NFA's accepting state U, are marked here with an asterisk):

    Current   |              Inputs
    state     |    a          b          c
    [AKLNT]   | [BCEFH]   [MPQSU]   [OPQSU]
    [BCEFH]   | [CDEFH]   [GJU]     [IJU]
    [MPQSU]*  | [ ]       [ ]       [QRSU]
    [OPQSU]*  | [ ]       [ ]       [QRSU]
    [CDEFH]   | [CDEFH]   [GJU]     [IJU]
    [GJU]*    | [ ]       [ ]       [ ]
    [IJU]*    | [ ]       [ ]       [ ]
    [ ]       | [ ]       [ ]       [ ]
    [QRSU]*   | [ ]       [ ]       [QRSU]

This machine has some unneeded states. There are techniques for minimizing the number of states in a machine; they are described in Appendix A. If we follow those techniques, we obtain the simplified machine shown here:

    Figure 2.21  [the minimized DFA for X|Y]

2.9.1 Recognizing Tokens

In applying this concept to the recognition of tokens, we must make a number of modifications to the basic FSA. First, the scanner must ignore white space, except when white space marks the end of a token. We do this by providing a state transition from the starting state back to the starting state when the input is a white-space character (i.e., a blank, a tab, a carriage return, or a line feed).
Comments and character-string constants introduce their own problems. Com- ments are handled by recognizing the comment-start token and ignoring all further input until the comment-end token has been found. Many languages use multiple- character delimiters for comments. Pascal recognizes both {---} and (* --- *) as comment delimiters; PL/I, C, and Prolog use /* --- */. In many languages, character-string constants are delimited by single quotes. All material between the quotes must be treated as part of the string, case-conversion 54 2. Lexical Analysis must be disabled, and the delimiting quotes must be removed from the internal rep- resentation of the string. The fact that a single quote inside a string is customarily doubled, as in *Say it isn’’t so.’ means that when a single quote is found, the scanner must reserve its decision about whether the quote ends the string until it has read the following character. If the following character is another single quote, then the scanner must decide that it is still in the string and, in addition, must remove the repeated quote from the internal representation of the string. Thus the string given above is converted to Say it isn’t so. 2.9.2 A Stripped-Down Lexical Analyzer for Pascal As an example oO} Converted with hs used in Pascal. Here is a list of t ‘or Turbo Pascal: -vewice, 9 FDU Converter 2. The key trial version ». __http://www.stdutility.com *?const’, ’div’, ’do’, ’downto’, ’else’, ’end’, ?external’, ’file’, ’for’, ’forward’, ’function’, »goto’, ’if’, ’in’, ’inline’, ’label’, ’mod’, *nil’, ’not’, ’of’, ’or’, ’overlay’, ’packed’, »procedure’, ’program’, ’record’, ’repeat’, ’set’, *shl’, ’shr’, ’string’, ’then’, ’to’, ’type’, until’, ’var’, ’while’, ’with’, ’xor’ 3. The simple operators : ; . + - / * = i ¢ ) < > Deere tee 4. The compound operators 27, ore >= <= <> (ek ok) (dD Identifying keywords here is the least of our worries. Consider these problems: 2.9. Application to Lexical Analysis 55 1. Whenever you find one of you must reserve judgment about its meaning until you’ve read the fol- lowing character. (It could be a simple operator, or it might be the first character of a compound operator.) You reserve judgment by putting the analyzer into a special state for each of these characters, since the machine’s current state determines how the incoming character is inter- preted. 2. As we saw before, Pascal uses both { } and (* *) to mark comments. (The alternatives were intended to accommodate keypunches, many of which didn’t have { or }; the compound characters (. and .) were similarly intended as alternatives to [ and ].) But in Turbo Pascal you are permitted to nest ( * *) inside { } and vice versa; this is useful if you want tor ata nn it out. Hence you need| Converted with } comment, and anot. (so you know “west ST DU Converter 3. On top trial version irective (i.e., a messa; ncluding the contents} ttp://Www.stdutility.com = |rense check- ing). A simple scanner for this language, covering all these special cases, takes 34 states. This is entirely too many for us to discuss here. Let us choose a subset of Turbo to include 1. Identifiers and integer constants 2. The keywords, as above 3. All the simple operators listed above except ’ 4. The compound operators 1= >= <= <> (x *) Our state table looks like this: 56 2. Lexical Analysis Curr. 
Input Back- State| 1 d { } (¢€ * ) : = < > sp p up 1 LY ey 1) aye) a ae ap 2 Qe eo eo oo es oe 3 1 1 1 1 1 1 1 1 1 1 1 1 1 y 4 5 4 5 5 5 5 5 5 5 5 5 5 5 5 1 1 1 1 1 1 1 1 1 1 1 1 1 y 6 G2 210 67) 7) 0) 6. 0276) 676). 16 7.0. 0 tl il 1 1 1 1 1 1 1 1 1 1 1 1 n 8 207 20) 2 202 20). 208 2 29 | 20) 208 220) 207 207 20) «20 9 9 9 9 9 9 10 9 9 9 9 9 9 9 10 9 9 9 9 9 9 UW § § 9 Y 9 9g ll 1 1 1 1 1 1 1 1 1 1 1 1 1 n 12 20 20 20 20 20 20 20 20 13 20 20 20 = 20 13 1 1 1 1 1 1 1 1 1 1 1 1 1 n 14 20 20 20 20 20 20 20 20 15 20 16 20 20 15 1 1 1 1 1 1 1 1 1 1 1 1 1 n 16 | 1 5 i on 17 | 20 2 Converted with 0 20 18 1 1 1 n » |i $TDUConverter |: : 20 | 1 ; ; 1 1sy trial version The meanin; httn://www.stdutility.com VVAIUITIg Stave DMT Or Tr 7 2 In identifier 2 Found 3. End of identifier 13. Token := 4 In number 14 Found < 5 End of number 15 Token: <= 6 In {} comment 16 Token: <> 7 Endof{}comment 17 Found > 8 Found ( 18 Token: >= 9 In (* *) comment 19 General punctuation 10 Found * in (* *) 20 General punctuation Among the inputs, “I” stands for letter, “d” stands for digit, “sp” stands for space, and “p” stands for punctuation mark—that is, any character not specifically listed. (This avoids a lot of individual tests.) Notice the large number of identical entries in this table; in a really large table economical representations may be used. this wastes a lot of space and more The analyzer starts in State 1 and scans the input string, character by character, entering various states as it goes. It gets characters from the input string with the aid of a character pointer, doing something like 2.9. Application to Lexical Analysis 57 ch := textline[cp] ; where cp is the character pointer. When it hits a state that is an accepting state for some token, it takes the appropriate action for that token (typically, passing the token to the parser) and goes back to State 1. If the token is the kind that can only be recognized by running off the end, it must also back up the character pointer. This is indicated in the back-up column of the state table: accepting states that don’t require backing off the character pointer are marked with “n”; those that do, with “y.” Let’s try an example. The input string xX := ab; takes the analyzer through these states: 3 sp O—-O—® Figure 2.22(a) (The symbol “sy j nd at this point the identifier x Converted with acks up cp (the character pointe STDU Converter trial version hitp://www.stdutility.com (Note that keeping the scanner in State 1 on a blank input is a handy way to ignore embedded blanks.) State 12 can’t be an accepting state, because the scanner doesn’t know what kind of colon this is. The equals sign takes the scanner to State 13, which is the accepting state for :=. The scanner did not overshoot the end of the token this time, so it doesn’t need to back off cp. (State 13 has “n” in the back-up column.) It returns to State 1 and resumes: O=-O--O--©+-© Figure 2.22(c) and here the identifier ab is recognized. The scanner ran past the end of ab in recognizing it; the back-up column says “y.” So the scanner decrements cp, returns to State 1, and continues: Figure 2.22(d) 58 2. Lexical Analysis State 19 is a catchall state for punctuation; if we need to determine which punctuation mark this is (e.g., so we can assign a suitable code to it), we can do an operation like code = pos(token, ’()[]<>??+-*/#$7;’); (The function pos is peculiar to Turbo Pascal; it returns the location of token in the string that follows. Many other dialects of Pascal offer a similar function.) 
Note that an illegal punctuation mark, like !, would get a code of 0; this can be a convenient way to indicate errors. Notice that we have two catchall punctuation states: one of them (20) is for cases like (a, where we must run on to the a before knowing we have ( rather than (*; the other (19) is for cases like ;, where we do not run on to the next character. State 20 requires back-up; 19 doesn't.

Identifying keywords. The list of keywords is set up as an array of strings; whenever the scanner finds what looks like an identifier, it makes a quick binary search of the keyword array. If the search succeeds, we have a keyword; if not, the token defaults to an identifier. This is much simpler than writing a regular expression for every keyword. (Alternatively, the keywords can be made permanent entries in the symbol table; then keywords and identifiers are handled uniformly—the scanner looks them up in the symbol table like anything else.)

2.9.3 Coding a Finite-State Machine

A finite-state scanner can be coded in any conventional programming language. We approach it roughly as follows:

   Repeat
      Pick the next input character.
      Find the new state-table entry.
      If this is a final state for some token,
         isolate the token, pass it to the parser,
         and maybe decrement cp.
   Until input used up.

Handling accepting states can be done by a huge case statement with branches for all the final states. Here is a part of a working lexical scanner, written in Turbo Pascal:

   procedure LEX_SCAN (textline: strtype);
   const
      state: array[1..34, 1..16] of byte =        { State table }
         { (In the original program the contents of the state table }
         {  are inserted at this point.) }
      colstring: string[13] = '.''{}(*)$:=<> ';
         { For determining state table column }
      alph: string[27] = '_ABCDEFGHIJKLMNOPQRSTUVWXYZ';
         { For identifying alphabetics }
      digs: string[11] = '#0123456789';
         { For identifying digits }
   var
      sp,                      { Token-start pointer }
      cp,                      { Pointer to current character }
      class,                   { Token class }
      col: integer;            { Column of state table }
      cc: char;                { Current character }
      token: strtype;          { I.e., a character string }

   { ====> Note: lex_state, the current state, is a global or
     static integer variable.  The main program sets it to 1
     before the first call to LEX_SCAN. <==== }

   begin
      cp := 1;                            { Point to first character }
      sp := 1;
      while cp <= length(textline) do     { Main scanning loop }
         begin
         cc := textline[cp];              { Get current character }
         col := pos(cc, colstring) + 2;   { Find state-table column }
         if col < 3 then
            if pos(upcase(cc), alph) > 0
               then col := 1              { Special handling for }
               else                       {   alphabetics, for }
            if pos(cc, digs) > 0
               then col := 2              {   numerics, and for }
               else col := 16;            {   punctuation marks }
         { Find new state from table }
         lex_state := state[lex_state, col];
         case lex_state of
         { States 1 and 2 require no action }
         3: begin     { Keyword, identifier, or label }
            { Copy token from text line }
            token := copy(textline, sp, cp - sp);
            { Check for possible keyword }
            class := kwsearch(token);
            if class < 1 then class := 1;
            { Put into symbol table }
            enter(token, class);
            lex_state := 1;    { Return to starting state }
            cp := cp - 1;      { Back up cp so start of new }
                               {   token won't be missed }
            end;
         (* ... and so on, through all states which are final
            for some token or other, or which otherwise require
            some special action ... *)
         end; { case }
         cp := cp + 1;                       { Update character }
         if lex_state = 1 then sp := cp;     {   and start pointers }
         end; { while }
   end; { Lex_Scan }

Note that each call to Lex_Scan handles a separate input line. Since Pascal will not let you split a token over two lines, this does not hamper the operation of the program, but it does require that lex_state, the current state, be a global variable rather than a local one. Restricting the input of Lex_Scan to a single line also spares us the buffering scheme described above, which would be needed in a scanner for a real programming language.
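The scanner above calls kwsearch, which was described but not shown. Here, for concreteness, is one way it might be coded; the table name kwtable, the count nkw, and the class numbering (0 for "not a keyword," so that the caller's test if class < 1 makes the token an ordinary identifier) are illustrative assumptions rather than part of the original program. I also assume the token has been reduced to the same letter case as the table entries.

   const
      nkw = 44;                            { Number of keywords listed }
                                           {   in Section 2.9.2        }
   var
      kwtable: array[1..nkw] of strtype;   { The keyword list, stored }
                                           {   in alphabetical order  }

   function kwsearch (token: strtype): integer;
   { Binary search of the keyword table.  Returns a class code }
   { (here, 1 + the table index) if token is a keyword, or 0   }
   { if it is not.                                             }
   var
      lo, hi, mid: integer;
   begin
      kwsearch := 0;                { Assume failure }
      lo := 1;
      hi := nkw;
      while lo <= hi do
         begin
         mid := (lo + hi) div 2;
         if token = kwtable[mid] then
            begin
            kwsearch := mid + 1;    { Keyword classes start at 2,  }
            lo := hi + 1            {   leaving 1 for identifiers; }
            end                     {   this also ends the loop    }
         else if token < kwtable[mid] then
            hi := mid - 1
         else
            lo := mid + 1
         end
   end;

The binary search is what makes it cheap to check every identifier-like token against the keyword list; the only requirement is that the table be kept in alphabetical order.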
2.9.4 The Lookahead Problem

Some languages present ambiguities in their tokens. This is particularly true in languages like FORTRAN which do not require blanks. FORTRAN lexical analyzers routinely squeeze out blanks, so a statement like

   if (x + y .le. 64) x = 0

becomes

   if(x+y.le.64)x=0

(In FORTRAN, .le. means <=.)

The classical ambiguity problem in FORTRAN is a programmer's blunder. FORTRAN has do loops instead of for loops:

   do 100 x = 1, 10

means that the body of the loop is all statements down through the one labeled 100. Now, it takes only one slip of the finger to write

   do 100 x = 1. 10

in which case FORTRAN's lexical analyzer makes this

   do100x=1.10

which looks like an assignment statement. So when the scanner sees do100x, it doesn't know whether this is the beginning of a do loop or whether the user has an identifier do100x. This is true even if the programmer wrote the statement correctly; you can't just assume that nobody is going to use an identifier like that. (We don't write compilers just for smart programmers; we must write them for all kinds of programmers.) The problem is aggravated by the fact that in FORTRAN the programmer doesn't have to declare numerical variables; hence when the compiler fails to find do100x in the symbol table, that tells it nothing. As a result, FORTRAN must place a temporary marker $ after the do:

   do$100x=1.10     or     do$100x=1,10

It then continues looking until it has found a comma, a period, or the end of the statement. If it finds a comma, the statement is indeed a do loop, and the scanner passes the keyword do to the parser; otherwise it removes the marker and passes the identifier do100x to the parser.

2.10 SUMMARY

It may be a good idea at this point to review the process of building a lexical scanner. We start by writing regular expressions for the tokens to be recognized. From these REs, Thompson's construction will give us the NFA that recognizes those tokens. (I have not described the program that will carry out Thompson's construction; it requires the ability to parse REs, and we haven't looked at parsing yet.) From the NFA, the subset construction will give us the equivalent DFA, which can then be embedded as a table in the lexical analyzer.

The principal tools we have seen in this chapter are regular expressions and finite-state automata. I have concentrated on using FSAs for lexical analysis, naturally, but they have many other applications as well. The 1977 edition of Aho and Ullman has an illuminating section, "The Scanner Generator as Swiss Army Knife," which lists some of these. Most of these applications are pattern matchers of one sort or another. One of the fastest methods for pattern matching in character strings is based on FSAs; this is the Knuth-Morris-Pratt algorithm [Knuth, 1977].

The path from regular expressions to NFAs to DFAs is the basis of Lex, an automatic program for generating lexical analyzers. You prepare an input file defining the tokens of your language as regular expressions; Lex analyzes this file and outputs the code for a lexical analyzer. For a brief introduction to Lex, see Appendix B. More detailed information may be found in Lesk and Schmidt [1975] and especially in Mason and Brown [1990].

Even if you aren't contemplating writing a Lex workalike, it's important to understand the process of constructing an equivalent DFA from a given NFA. We will see in Chapter 4 that it lies at the heart of the procedure for constructing tables for the LR parser.
Again we will start with an NFA, this time one that reflects the logic of the parsing procedure, and again we will have to construct a DFA from it in order to make the procedure practical.

PROBLEMS

2.1. For each of the following finite-state machines, draw its state diagram, trace its operation on each of the strings shown, and indicate which of them are accepted.

(a) [state table]
    Strings: aaaaa, bbbbb, ababa, babab, abbbb

(b) [state table]

(c) [state table]

2.2. For each of the following NFAs, draw the state diagram, trace its operation on each of the strings shown, and indicate which of them are accepted.

(a) [transition table]
    Strings: aaab, baaab, ababab, baabb, aba

(b) [transition table]
    Strings: abaab, baabb, bbaba, aabb, bababa, ...

(c)        a        b        c        ε
    1    {1,2}     {2}     {1,3}     { }
    2     {3}     {2,3}     {1}      {4}
    3    {3,4}     {2}     {2,4}     {1}
    4     {1}      {2}      {3}      { }
    Strings: bacabb, cbaba, ...

2.3. For each of the NFAs in Problem 2.2, construct the equivalent DFA.

2.4. Show that (R|S)* = (R*S*)*.

2.5. Show that RS ≠ SR by giving a counterexample.

2.6. Show that (R|S)* ≠ (R*|S*) by giving a counterexample.

2.7. Given the RE R = (ab|b)*c, which of the following strings is (are) in L(R)?
    ababbc    c    babc    abab

2.8. Given the RE ab*c(a|b)c, which of the following strings is (are) in L(R)?
    acac    acbbbc    abcac    abcc

2.9. If Σ = {a, b}, write a regular expression whose language is all strings beginning and ending with b.

2.10. If Σ = {a, b}, write a regular expression whose language is all strings whose length is a multiple of 3.

2.11. If Σ = {0, 1}, write a regular expression whose language is all strings containing exactly three 1s.

2.12. If Σ = {0, 1}, write a regular expression whose language is all strings in which every 1 is ...

2.13. Construct an NFA for ... and find the equivalent DFA.

2.14. Construct an NFA for ... and find the equivalent DFA.

2.15. Use the Pumping Lemma to show that the language L = {0^n 1^n, n > 0} is not regular.

2.16. Test the language L = {(ab)^n, n > 0} with the Pumping Lemma and show that it is regular.

2.17. Using the Pascal state table, trace the lexical scan of the following strings:
    var x: integer;
    if (a>b) or (a<=c) then c:=b;

CHAPTER 3

SYNTACTIC ANALYSIS I

The syntactic analyzer, or parser, is the heart of the front end of the compiler. The parser's main task is analyzing the structure of the program and its component statements and checking these for errors. It frequently controls the lexical analyzer, which provides the tokens it works on, and it may also supervise the operation of other parts of the compiler.

Our principal tool in this chapter is the theory of formal languages, which lets us define languages rigorously. This theory was developed originally for the study of natural languages, i.e., languages like English, French, and so on, but computer scientists discovered that they could apply the theory by providing a grammar for the language to be compiled and using that grammar to construct the parser.

To be quite accurate, I should point out that most practical programming languages cannot be completely described by the kind of grammar we will use. In a programming language there are frequently restrictions that cannot be enforced by these grammars. For example, strongly typed languages require that every variable be declared before it is used. The context-free grammars, which are the type we will study, are not capable of enforcing this requirement. The grammars that can enforce requirements like this are too complex and unwieldy for use in compiler construction. (We'll come back to this later.)
In any case, exceptions like these are rare and can be handled simply by other means, and context-free grammars are so convenient to use and lend themselves so readily to efficient methods of parser construction that they are universally used in compiler construction.

3.1 GRAMMARS

I will provide a formal definition of a grammar presently, but for the time being, we will say that a grammar is a finite set of rules for generating an infinite set of sentences. (We can also write grammars that generate finite languages, and we will use many such examples, but they are of little practical interest.) In natural languages, the sentences are made up of words; in programming languages, they are made up of tokens. The rules impose some desired structure on the sentences. Any sentence that can be generated by a given grammar is by definition grammatical (in the terms of that grammar, anyway); any sentence that cannot be so generated is ungrammatical.

The grammars one traditionally learns in studying foreign languages are different from the ones we will consider. (I have to say "traditionally" here, because the theory of formal languages has had some influence on modern foreign-language teaching.) We use a type of grammar called a generative grammar; this type of grammar builds a sentence in a series of steps, starting with the abstract concept of a sentence and refining that concept until an actual sentence emerges as the result. Analyzing, or parsing, the sentence consists of reconstructing the way in which the sentence was formed. Much of the theory of generative grammars is based on the work of Noam Chomsky [1956, 1957, 1959], and we will draw on that work here.

Some examples will be useful in laying the groundwork for the formal definition. Consider the sentence, "The dog gnawed the bone." This sentence consists of a noun phrase, "the dog," and a verb phrase, "gnawed the bone." The verb phrase, in turn, can be divided into a verb, "gnawed," and another noun phrase, "the bone." Each noun phrase, finally, consists of an article, "the," followed by a noun, in one case "dog" and in the other "bone."

Figure 3.1 [parse tree: (sentence) branching into (noun phrase) and (verb phrase); the (noun phrase) into (article) "the" and (noun) "dog"; the (verb phrase) into (verb) "gnawed" and a (noun phrase) consisting of (article) "the" and (noun) "bone"]

This analysis can be represented by the tree shown here. Such a structure is called a parse tree. The root of the tree is our starting point, a sentence. A sentence can take many forms; in this case, it consists of a noun phrase and a verb phrase. These are the children of the node sentence, as shown in this partial tree:

Figure 3.2(a) [(sentence) with children (noun phrase) and (verb phrase)]

When a node has two children, as here, they are read off from left to right: noun phrase, verb phrase. In similar fashion, the specific form of the verb phrase is indicated by its children, the nodes verb and noun phrase, as here:

Figure 3.2(b) [the tree of Figure 3.2(a), with (verb phrase) given the children (verb) and (noun phrase)]

The rest of the tree is built up in the same fashion; the leaves of the tree are the final sentence. The tree lays bare the structure of the sentence. Sentences are not just random strings of words. A grammatical sentence is made up of parts, which are in turn made up of smaller parts, and so on. The tree shows these parts and how they are related so we can examine them at leisure.

If you start at the root of the tree and work downward, you are retracing the steps by which the sentence was generated. It is almost as if someone were thinking, "I guess I'll speak a sentence. Let's see—how will I do it this time? I guess I'll fall back on the old noun-phrase-verb-phrase structure.
Now, what should I use for the noun phrase?..." I don't really believe people go through such a process when they talk, even unconsciously. The point is that you can reconstruct the derivation of the sentence and lay bare its structure as if the speaker had thought of it that way.

If you start at the leaves of the tree and work upward, you are retracing the process by which a listener might understand what was being said or what was written. "Hmmm. ..it says, 'the.' That's usually the beginning of a noun phrase. Now we have, 'dog'; yep, that's a noun phrase, all right. Will they give me a verb next?..." and so forth. Again, whether people really analyze sentences that way is doubtful. (Part of the problem is that we learned to speak and understand as infants and no longer recall how we did it.) This is a usable model of comprehension, however, or at least of analysis, and, more to the point, we can write parsers for programming languages that proceed almost exactly along these lines.

A generative grammar is a formalization of this process. Returning to our model of speaking the sentence, we note that the starting point was the idea of a sentence. We can paraphrase our hypothetical talker's thoughts as follows:

1. A sentence can consist of a noun phrase and a verb phrase.
2. A noun phrase can consist of an article and a noun.
3. A verb phrase can consist of a verb and a noun phrase.
4. Possible nouns are "dog," "cat," "bone," ..., etc.
5. Possible articles are "the," "a," ..., etc.
6. Possible verbs are "contains," "gnawed," "saw," ..., etc.

These are, in fact, the rules of a generative grammar. They are not a complete set, but they are enough to generate a number of sentences.

For clarity, it is customary to mark structural elements like sentence, noun phrase, and verb phrase in some distinctive way, as we did on the tree. Thus we will write (sentence), (noun phrase), and so on. This lets us omit the quotation marks around actual words; it also avoids confusion when analyzing sentences like "The sentence contains a verb." Later, for brevity, we will abbreviate these structural elements with capital letters. We will also abbreviate "can consist of" (or "can be replaced by") with the symbol →. Using these conventions, we can say

   (sentence) → (noun phrase) (verb phrase)
   (noun phrase) → (article) (noun)
   (verb phrase) → (verb) (noun phrase)
   (noun) → dog, cat, bone, sentence, verb, ...
   (article) → the, a, an
   (verb) → contains, gnawed, saw, walks, ...

We can parallel this entire discussion in the domain of programming languages. Consider the expression a * b + c. This expression consists of a smaller expression a * b and a second expression c, joined by a +. The expression a * b, in turn, consists of the expression a and the expression b, joined by a *. We can construct a parse tree for this expression in the same way that we did for the English sentence:

Figure 3.3 [parse tree: (expression) branching into (expression) + (expression); the left (expression) into (expression) * (expression), whose leaves are a and b; the leaf of the right (expression) is c]

We can also produce a partial set of rules for forming expressions, as follows:

   (expression) → (expression) * (expression)
   (expression) → (expression) + (expression)
   (expression) → i

These rules are called productions. A production says that the item on its left can be rewritten as the string on its right; the first English rule above, for example, says that (sentence) can be rewritten as (noun phrase) (verb phrase). Productions are the heart of a generative grammar. The items that cannot be rewritten further, like dog and +, are called terminal symbols, or just terminals; the things in angle brackets are called nonterminal symbols, or nonterminals.

3.1.1 Syntax and Semantics

Before we go on, we must draw a couple of distinctions. We must differentiate between syntax and semantics, and it is well to contrast the theoretical powers of grammars with the practical limitations found in languages.
Syntax deals with the way a sentence is put together; semantics deals with what the sentence means. Grammars define proper sentence structure, but they have no bearing on meaning. It is possible to write grammatically correct sentences that are nonsense; for example,

   The blotter subtracted the sympathy.

It is also possible to write ungrammatical sentences that make sense, for example, the famous reason for climbing Mount Everest:

   Because it was there.

This sentence lacks a main verb. In fact, it isn't really a sentence at all, only a dependent clause; but it makes sense to the listener in context because the listener mentally supplies the missing part:

   [We climbed it] because it was there.

We can supply the missing part, but the grammar won't; hence by our definition this utterance is ungrammatical even though sensible.

It is much easier to determine whether a sentence, or a program, is grammatical than whether it makes sense. The compiler, during syntactic analysis, will automatically check for grammatical errors; it is generally left to the programmer to see, during debugging, whether the program makes sense. (Later on, we will consider semantic analysis in the compiler, but that will turn out to cover much less ground than debugging does.)

The practical limitations come about as follows: there are grammatical sentences that nobody can utter and grammatical sentences that are almost impossible to understand. There is a story about a child who was told to write a composition 100 words long and who submitted the following:

   Perspective is what you see when you look down the railroad tracks and they get smaller and smaller and smaller and smaller and smaller and smaller and smaller and ...

and so on, out to exactly 100 words. The result is unquestionably grammatical, but nobody would ever utter such a sentence, any more than anyone would utter a grammatical sentence that takes 100 years to say.

That grammatical sentences can be almost impossible to understand must be clear to anyone who has ever struggled with a legal document, but I have something different in mind. Robinson gives the following example:

   The house the cheese the rat the cat the dog chased caught ate lay in was built by Jack.

This grammatical sentence is virtually incomprehensible until you recognize the nursery jingle, "This is the house that Jack built..." and recall how it goes. The problems here are the large number of nested clauses and the fact that the subjects and verbs in all but the lowest level of nesting are separated by further levels of nesting. We could easily follow

   The house the cheese lay in was built by Jack.

and we might even be able to handle

   The house the cheese the rat ate lay in was built by Jack.

if we read it carefully. The problem is that the mental stack we use for handling nested clauses doesn't work very well when the depth of nesting exceeds approximately two. Practical limitations inside the computer, or in the compiler itself, impose similar limits. There may be limits on the number of dimensions an array may have, for example, or limits on the number of levels of nesting in a record or structure. We saw in Chapter 2 that regular expressions could not in themselves enforce a limit on the length of an identifier. Here, too, just as English grammatical rules permit the monstrosities shown above, programming-language grammars do not cover the consequences of practical limitations in their languages. There are also grammatical sentences in natural languages that do not lend themselves well to analysis with the simple grammars we will consider here; I will have more to say about this later.
3.1.2 Grammars: Formal Definition

We have now seen enough background to permit a formal definition of a grammar. A grammar G is a quadruple (T, N, S, R), where

   T is a finite set of terminal symbols,
   N is a finite set of nonterminal symbols (note that N and T are disjoint sets),
   S is a unique nonterminal called the start symbol, and
   R is a finite set of rules, or productions, of the form α → β, where α and β are strings of terminals and nonterminals.

Nonterminals are structural names like (sentence), (noun phrase), and the like; terminals are the actual words and symbols. In our dog-bone example, the start symbol was (sentence); in the expression example, it was (expression).

A production means that any occurrence of the string on the left-hand side can be replaced by the string on the right-hand side. At the moment, α will be a single nonterminal, although it does not have to be; we will come back to this later. When α is a single nonterminal, the grammar is context-free. These are the grammars of particular interest to us.

Nonterminals are also called syntactic categories or syntactic variables. S is sometimes called the starting symbol and sometimes the sentence symbol. Some writers require that S be distinct from N; this requirement makes it slightly easier to construct one kind of parser, but otherwise it makes no practical difference, and we will not impose this requirement.

Sentences in the language are generated by starting with S and applying productions until we are left with nothing but terminals. (You aren't done as long as there are any nonterminals left.) The set of all sentences that can be generated in this way from a given grammar G is the language generated by G, written L(G).

Before we go on, there are some notational details to cover. First, in talking about grammars generally, I will follow Chomsky's convention whereby nonterminals are represented by capital letters, terminals by lowercase letters early in the alphabet, strings of terminals by lowercase letters late in the alphabet, and mixtures of nonterminals and terminals by lowercase Greek letters. Thus, for example, we will speak of the nonterminals A, B, and C, the terminals a, b, and c, the strings w and x, and the sentential forms α, β, and γ. (I will define sentential forms shortly.) When dealing with specific programming-language examples, however, I will write the terminals in boldface characters. In particular, I will use a boldface i for an identifier.

Second, the list of productions can be used to define the entire grammar, in much the same way that the state table was used to define an entire finite-state machine, if we agree to list the productions from the sentence symbol first. Then the nonterminals can be identified by scanning the left-hand sides of the productions, and the terminals can be identified by a careful examination of the right-hand sides.

Finally, we must mention BNF notation. BNF stands for Backus-Naur Form, named for J. W. Backus (who played a leading role in the development of the first FORTRAN compiler) and P. Naur (who played a leading role in the design of Algol-60). In this notation, nonterminals are placed in angle brackets, as mentioned previously; the symbol ::= is used in place of the arrow; and when several productions share the same left-hand side, as for example,

   (expression) → (expression) * (expression)
   (expression) → (expression) + (expression)
   (expression) → i

we write

   (expression) ::= (expression) * (expression) | (expression) + (expression) | i

where the vertical line means "or." In what follows, I will make free use of the vertical line but will otherwise follow Chomsky's notation and use the arrow.
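To see the definition in action, here is a derivation of the sentence i + i * i using the three expression productions just given; each line is obtained from the one before it by replacing a single nonterminal with the right-hand side of one of the productions:

   (expression)
   (expression) + (expression)
   i + (expression)
   i + (expression) * (expression)
   i + i * (expression)
   i + i * i

The final line contains only terminals, so it is a sentence in L(G).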
You should be aware that some writers use the term "BNF grammars" as a synonym for "context-free grammars."

Extended BNF includes two constructs useful in defining practical programming-language grammars concisely. An indefinite number of repetitions is indicated by enclosing a sentential form in curly braces. For example, in the syntax for Turbo Pascal [Borland, 1985], the syntax for the var statement goes, in part (and slightly paraphrased),

   (variable-declaration-part) ::= var (variable-declaration) {; (variable-declaration)} ;

The curly braces function like the Kleene closure, in that they mean that their contents can be repeated zero or more times. The second construct uses square brackets to indicate an optional element. For example, an integer constant might be defined as

   (integer-constant) ::= [+ | -] (digit) {(digit)}

Here the brackets indicate that the sign before the number is optional. We will not use BNF in our discussion of parsing, but it is well to be familiar with the notation, because it is widely used in descriptions of programming languages.
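One more remark on the extended notation: the braces and brackets add convenience but no power, since anything written with them can be rewritten in plain BNF at the cost of an extra nonterminal. As an illustration, the variable-declaration rule above is equivalent to the following, where the nonterminal (declaration-list) is one I have introduced just for this expansion:

   (variable-declaration-part) ::= var (declaration-list) ;
   (declaration-list) ::= (variable-declaration)
      | (declaration-list) ; (variable-declaration)

The recursion in (declaration-list) produces one (variable-declaration) followed by zero or more occurrences of ; (variable-declaration), which is exactly what the curly braces said.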
