**Johan Jeuring, Doaitse Swierstra
**

November 7, 2010

Copyright c ( 2001–2009 by Johan Jeuring and Doaitse Swierstra.

Contents

Preface for the 2009 version v

Voorwoord vii

1. Goals 1

1.1. History . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2. Grammar analysis of context-free grammars . . . . . . . . . . . . . . . 3

1.3. Compositionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4. Abstraction mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2. Context-Free Grammars 7

2.1. Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.2. Grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.1. Notational conventions . . . . . . . . . . . . . . . . . . . . . . . 14

2.3. The language of a grammar . . . . . . . . . . . . . . . . . . . . . . . . 15

2.3.1. Examples of basic languages . . . . . . . . . . . . . . . . . . . . 17

2.4. Parse trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.5. Grammar transformations . . . . . . . . . . . . . . . . . . . . . . . . . 22

2.5.1. Removing duplicate productions . . . . . . . . . . . . . . . . . 23

2.5.2. Substituting right hand sides for nonterminals . . . . . . . . . 23

2.5.3. Removing unreachable productions . . . . . . . . . . . . . . . . 23

2.5.4. Left factoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

2.5.5. Removing left recursion . . . . . . . . . . . . . . . . . . . . . . 24

2.5.6. Associative separator . . . . . . . . . . . . . . . . . . . . . . . . 25

2.5.7. Introduction of priorities . . . . . . . . . . . . . . . . . . . . . . 26

2.5.8. Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.6. Concrete and abstract syntax . . . . . . . . . . . . . . . . . . . . . . . 28

2.7. Constructions on grammars . . . . . . . . . . . . . . . . . . . . . . . . 31

2.7.1. SL: an example . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

2.8. Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

2.9. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3. Parser combinators 41

3.1. The type of parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

3.2. Elementary parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3. Parser combinators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

i

Contents

3.3.1. Matching parentheses: an example . . . . . . . . . . . . . . . . 53

3.4. More parser combinators . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.4.1. Parser combinators for EBNF . . . . . . . . . . . . . . . . . . . 55

3.4.2. Separators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

3.5. Arithmetical expressions . . . . . . . . . . . . . . . . . . . . . . . . . . 60

3.6. Generalised expressions . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.7. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

4. Grammar and Parser design 67

4.1. Step 1: Example sentences for the language . . . . . . . . . . . . . . . 68

4.2. Step 2: A grammar for the language . . . . . . . . . . . . . . . . . . . 68

4.3. Step 3: Testing the grammar . . . . . . . . . . . . . . . . . . . . . . . 69

4.4. Step 4: Analysing the grammar . . . . . . . . . . . . . . . . . . . . . . 69

4.5. Step 5: Transforming the grammar . . . . . . . . . . . . . . . . . . . . 69

4.6. Step 6: Deciding on the types . . . . . . . . . . . . . . . . . . . . . . . 70

4.7. Step 7: Constructing the basic parser . . . . . . . . . . . . . . . . . . . 71

4.7.1. Basic parsers from strings . . . . . . . . . . . . . . . . . . . . . 72

4.7.2. A basic parser from tokens . . . . . . . . . . . . . . . . . . . . 72

4.8. Step 8: Adding semantic functions . . . . . . . . . . . . . . . . . . . . 74

4.9. Step 9: Did you get what you expected . . . . . . . . . . . . . . . . . 75

4.10. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5. Compositionality 77

5.1. Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.1.1. Built-in lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.1.2. User-deﬁned lists . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.1.3. Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5.2. Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.2.1. Binary trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.2.2. Trees for matching parentheses . . . . . . . . . . . . . . . . . . 83

5.2.3. Expression trees . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.2.4. General trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.2.5. Eﬃciency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

5.3. Algebraic semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5.4. Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

5.4.1. Evaluating expressions . . . . . . . . . . . . . . . . . . . . . . . 90

5.4.2. Adding variables . . . . . . . . . . . . . . . . . . . . . . . . . . 91

5.4.3. Adding deﬁnitions . . . . . . . . . . . . . . . . . . . . . . . . . 93

5.4.4. Compiling to a stack machine . . . . . . . . . . . . . . . . . . . 95

5.5. Block structured languages . . . . . . . . . . . . . . . . . . . . . . . . 97

5.5.1. Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5.5.2. Generating code . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.6. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

ii

Contents

6. Computing with parsers 105

6.1. Insert a semantic function in the parser . . . . . . . . . . . . . . . . . 105

6.2. Apply a fold to the abstract syntax . . . . . . . . . . . . . . . . . . . . 105

6.3. Deforestation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

6.4. Using a class instead of abstract syntax . . . . . . . . . . . . . . . . . 107

6.5. Passing an algebra to the parser . . . . . . . . . . . . . . . . . . . . . 109

7. Programming with higher-order folds 111

7.1. The rep min problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

7.1.1. A straightforward solution . . . . . . . . . . . . . . . . . . . . . 113

7.1.2. Lambda lifting . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

7.1.3. Tupling computations . . . . . . . . . . . . . . . . . . . . . . . 114

7.1.4. Merging tupled functions . . . . . . . . . . . . . . . . . . . . . 115

7.2. A small compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.2.1. The language . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.2.2. A stack machine . . . . . . . . . . . . . . . . . . . . . . . . . . 117

7.2.3. Compiling to the stackmachine . . . . . . . . . . . . . . . . . . 118

7.3. Attribute grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

8. Regular Languages 123

8.1. Finite-state automata . . . . . . . . . . . . . . . . . . . . . . . . . . . 123

8.1.1. Deterministic ﬁnite-state automata . . . . . . . . . . . . . . . . 124

8.1.2. Nondeterministic ﬁnite-state automata . . . . . . . . . . . . . . 126

8.1.3. Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

8.1.4. Constructing a DFA from an NFA . . . . . . . . . . . . . . . . 132

8.1.5. Partial evaluation of NFA’s . . . . . . . . . . . . . . . . . . . . 135

8.2. Regular grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

8.2.1. Equivalence of Regular grammars and Finite automata . . . . 139

8.3. Regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

8.4. Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

8.4.1. Proof of Theorem 8.7 . . . . . . . . . . . . . . . . . . . . . . . 146

8.4.2. Proof of Theorem 8.16 . . . . . . . . . . . . . . . . . . . . . . . 147

8.4.3. Proof of Theorem 8.17 . . . . . . . . . . . . . . . . . . . . . . . 148

8.5. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

9. Pumping Lemmas: the expressive power of languages 153

9.1. The Chomsky hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . 154

9.1.1. Type-0 grammars . . . . . . . . . . . . . . . . . . . . . . . . . . 154

9.1.2. Type-1 grammars . . . . . . . . . . . . . . . . . . . . . . . . . . 154

9.1.3. Type-2 grammars . . . . . . . . . . . . . . . . . . . . . . . . . . 154

9.1.4. Type-3 grammars . . . . . . . . . . . . . . . . . . . . . . . . . . 155

9.2. The pumping lemma for regular languages . . . . . . . . . . . . . . . . 155

9.3. The pumping lemma for context-free languages . . . . . . . . . . . . . 157

9.4. Proofs of pumping lemmas . . . . . . . . . . . . . . . . . . . . . . . . . 161

iii

Contents

9.4.1. Proof of the Regular Pumping Lemma, Theorem 9.1 . . . . . . 161

9.4.2. Proof of the Context-free Pumping Lemma, Theorem 9.2 . . . 162

9.5. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

10.LL Parsing 167

10.1. LL Parsing: Background . . . . . . . . . . . . . . . . . . . . . . . . . . 167

10.1.1. A stack machine for parsing . . . . . . . . . . . . . . . . . . . . 168

10.1.2. Some example derivations . . . . . . . . . . . . . . . . . . . . . 169

10.1.3. LL(1) grammars . . . . . . . . . . . . . . . . . . . . . . . . . . 173

10.2. LL Parsing: Implementation . . . . . . . . . . . . . . . . . . . . . . . . 176

10.2.1. Context-free grammars in Haskell . . . . . . . . . . . . . . . . . 176

10.2.2. Parse trees in Haskell . . . . . . . . . . . . . . . . . . . . . . . 178

10.2.3. LL(1) parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

10.2.4. Implementation of isLL(1) . . . . . . . . . . . . . . . . . . . . . 181

10.2.5. Implementation of lookahead . . . . . . . . . . . . . . . . . . . 181

10.2.6. Implementation of empty . . . . . . . . . . . . . . . . . . . . . 186

10.2.7. Implementation of ﬁrst and last . . . . . . . . . . . . . . . . . . 187

10.2.8. Implementation of follow . . . . . . . . . . . . . . . . . . . . . 190

11.LL versus LR parsing 195

11.1. LL(1) parser example . . . . . . . . . . . . . . . . . . . . . . . . . . . 196

11.1.1. An LL(1) checker . . . . . . . . . . . . . . . . . . . . . . . . . . 196

11.1.2. An LL(1) grammar . . . . . . . . . . . . . . . . . . . . . . . . . 197

11.1.3. Using the LL(1) parser . . . . . . . . . . . . . . . . . . . . . . . 198

11.2. LR parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199

11.2.1. A stack machine for SLR parsing . . . . . . . . . . . . . . . . . 200

11.3. LR parse example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

11.3.1. An LR checker . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

11.3.2. LR action selection . . . . . . . . . . . . . . . . . . . . . . . . . 202

11.3.3. LR optimizations and generalizations . . . . . . . . . . . . . . . 206

A. The Stack module 211

B. Answers to exercises 213

iv

Preface for the 2009 version

This is a work in progress. These lecture notes are in large parts identical with

the old lecture notes for “Grammars and Parsing” by Johan Jeuring and Doaitse

Swierstra.

The lecture notes have not only been used at Utrecht University, but also at the Open

University. Some modiﬁcations made by Manuela Witsiers have been reintegrated.

I have also started to make some modiﬁcations to style and content – mainly trying to

adapt the Haskell code contained in the lecture notes to currently established coding

guidelines. I also rearranged some of the material because I think they connect better

in the new order.

However, this work is not quite ﬁnished, and this means that some of the later

chapters look a bit diﬀerent from the earlier chapters. I apologize in advance for any

inconsistencies or mistakes I may have introduced.

Please feel free to point out any mistakes you ﬁnd in these lecture notes to me – any

other feedback is also welcome.

I hope to be able to update the online version of the lecture notes regularly.

Andres L¨oh

October 2009

v

Preface for the 2009 version

vi

Voorwoord

Het hiervolgende dictaat is gebaseerd op teksten uit vorige jaren, die onder andere

geschreven zijn in het kader van het project Kwaliteit en Studeerbaarheid.

Het dictaat is de afgelopen jaren verbeterd, maar we houden ons van harte aanbevolen

voor suggesties voor verdere verbetering, met name daar waar het het aangeven van

verbanden met andere vakken betreft.

Veel mensen hebben een bijgedrage geleverd aan de totstandkoming van dit dictaat

door een gedeelte te schrijven, of (een gedeelte van) het dictaat te becommentari¨eren.

Speciale vermelding verdienen Jeroen Fokker, Rik van Geldrop, en Luc Duponcheel,

die mee hebben geholpen door het schrijven van (een) hoofdstuk(ken) van het dic-

taat. Commentaar is onder andere geleverd door: Arthur Baars, Arnoud Berendsen,

Gijsbert Bol, Breght Boschker, Martin Bravenboer, Pieter Eendebak, Rijk-Jan van

Haaften, Graham Hutton, Daan Leijen, Andres L¨oh, Erik Meijer, en Vincent Oost-

indi¨e.

Tenslotte willen we van de gelegenheid gebruik maken enige studeeraanwijzingen te

geven:

• Het is onze eigen ervaring dat het uitleggen van de stof aan iemand anders vaak

pas duidelijk maakt welke onderdelen je zelf nog niet goed beheerst. Als je dus

van mening bent dat je een hoofdstuk goed begrijpt, probeer dan eens in eigen

woorden uiteen te zetten.

• Oefening baart kunst. Naarmate er meer aandacht wordt besteed aan de pre-

sentatie van de stof, en naarmate er meer voorbeelden gegeven worden, is het

verleidelijker om, na lezing van een hoofdstuk, de conclusie te trekken dat je

een en ander daadwerkelijk beheerst. “Begrijpen is echter niet hetzelfde als

“kennen”, “kennen” is iets anders dan “beheersen” en “beheersen” is weer iets

anders dan “er iets mee kunnen”. Maak dus de opgaven die in het dictaat

opgenomen zijn zelf, en doe dat niet door te kijken of je de oplossingen die

anderen gevonden hebben, begrijpt. Probeer voor jezelf bij te houden welk

stadium je bereikt hebt met betrekking tot alle genoemde leerdoelen. In het

ideale geval zou je in staat moeten zijn een mooi tentamen in elkaar te zetten

voor je mede-studenten!

• Zorg dat je up-to-date bent. In tegenstelling tot sommige andere vakken is het

bij dit vak gemakkelijk de vaste grond onder je voeten kwijt te raken. Het is

niet “elke week nieuwe kansen”. We hebben geprobeerd door de indeling van

vii

Voorwoord

de stof hier wel iets aan te doen, maar de totale opbouw laat hier niet heel

veel vrijheid toe. Als je een week gemist hebt is het vrijwel onmogelijk de

nieuwe stof van de week daarop te begrijpen. De tijd die je dan op college en

werkcollege doorbrengt is dan weinig eﬀectief, met als gevolg dat je vaak voor

het tentamen heel veel tijd (die er dan niet is) kwijt bent om in je uppie alles

te bestuderen.

• We maken gebruik van de taal Haskell om veel concepten en algoritmen te

presenteren. Als je nog moeilijkheden hebt met de taal Haskell aarzel dan niet

direct hier wat aan te doen, en zonodig hulp te vragen. Anders maak je jezelf

het leven heel moeilijk. Goed gereedschap is het halve werk, en Haskell is hier

ons gereedschap.

Veel sterkte, en hopelijk ook veel plezier,

Johan Jeuring en Doaitse Swierstra

viii

1. Goals

Introduction

Courses on Grammars, Parsing and Compilation of programming languages have

always been some of the core components of a computer science curriculum. The

reason for this is that from the very beginning of these curricula it has been one

of the few areas where the development of formal methods and the application of

formal techniques in actual program construction come together. For a long time the

construction of compilers has been one of the few areas where we had a methodology

available, where we had tools for generating parts of compilers out of formal descrip-

tions of the tasks to be performed, and where such program generators were indeed

generating programs which would have been impossible to create by hand. For many

practicing computer scientists the course on compiler construction still is one of the

highlights of their education.

One of the things which were not so clear however is where exactly this joy originated

from: the techniques taught deﬁnitely had a certain elegance, we could construct

programs someone else could not – thus giving us the feeling we had “the right

stuﬀ” –, and when completing the practical exercises, which invariably consisted of

constructing a compiler for some toy language, we had the usual satisﬁed feeling.

This feeling was augmented by the fact that we would not have had the foggiest idea

how to complete such a product a few months before, and now we knew “how to do

it”.

This situation has remained so for years, and it is only in the last years that we

have started to discover and make explicit the reasons why this area attracted so

much interest. Many of the techniques which were taught on a “this is how you solve

this kind of problems” basis, have been provided with a theoretical underpinning

which explains why the techniques work. As a beneﬁcial side-eﬀect we also gradually

learned to see where the discovered concept further played a rˆole, thus linking the

area with many other areas of computer science; and not only that, but also giving

us a means to explain such links, stress their importance, show correspondences and

transfer insights from one area of interest to the other.

Goals

The goals of these lecture notes can be split into primary goals, which are associated

with the speciﬁc subject studied, and secondary – but not less important – goals which

1

1. Goals

have to do with developing skills which one would expect every educated computer

scientist to have. The primary, somewhat more traditional, goals are:

• to understand how to describe structures (i.e., “formulas”) using grammars;

• to know how to parse, i.e., how to recognise (build) such structures in (from) a

sequence of symbols;

• to know how to analyse grammars to see whether or not speciﬁc properties

hold;

• to understand the concept of compositionality;

• to be able to apply these techniques in the construction of all kinds of programs;

• to familiarise oneself with the concept of computability.

The secondary, more far reaching, goals are:

• to develop the capability to abstract;

• to understand the concepts of abstract interpretation and partial evaluation;

• to understand the concept of domain speciﬁc languages;

• to show how proper formalisations can be used as a starting point for the

construction of useful tools;

• to improve the general programming skills;

• to show a wide variety of useful programming techniques;

• to show how to develop programs in a calculational style.

1.1. History

When at the end of the ﬁfties the use of computers became more and more wide-

spread, and their reliability had increased enough to justify applying them to a wide

range of problems, it was no longer the actual hardware which posed most of the

problems. Writing larger and larger programs by more and more people sparked the

development of the ﬁrst more or less machine-independent programming language

FORTRAN (FORmula TRANslator), which was soon to be followed by ALGOL-60

and COBOL.

For the developers of the FORTRAN language, of which John Backus was the prime

architect, the problem of how to describe the language was not a hot issue: much

more important problems were to be solved, such as, what should be in the language

and what not, how to construct a compiler for the language that would ﬁt into the

small memories which were available at that time (kilobytes instead of megabytes),

and how to generate machine code that would not be ridiculed by programmers who

had thus far written such code by hand. As a result the language was very much

implicitly deﬁned by what was accepted by the compiler and what not.

Soon after the development of FORTRAN an international working group started to

work on the design of a machine independent high-level programming language, to

2

1.2. Grammar analysis of context-free grammars

become known under the name ALGOL-60. As a remarkable side-eﬀect of this under-

taking, and probably caused by the need to exchange proposals in writing, not only

a language standard was produced, but also a notation for describing programming

languages was proposed by Naur and used to describe the language in the famous

Algol-60 report. Ever since it was introduced, this notation, which soon became to

be known as the Backus-Naur formalism (BNF), has been used as the primary tool

for describing the basic structure of programming languages.

It was not for long that computer scientists, and especially people writing compilers,

discovered that the formalism was not only useful to express what language should be

accepted by their compilers, but could also be used as a guideline for structuring their

compilers. Once this relationship between a piece of BNF and a compiler became

well understood, programs emerged which take such a piece of language description

as input, and produce a skeleton of the desired compiler. Such programs are now

known under the name parser generators.

Besides these very mundane goals, i.e., the construction of compilers, the BNF-

formalism also became soon a subject of study for the more theoretically oriented. It

appeared that the BNF-formalism actually was a member of a hierarchy of grammar

classes which had been formulated a number of years before by the linguist Noam

Chomsky in an attempt to capture the concept of a “language”. Questions arose

about the expressibility of BNF, i.e., which classes of languages can be expressed

by means of BNF and which not, and consequently how to express restrictions and

properties of languages for which the BNF-formalism is not powerful enough. In the

lectures we will see many examples of this.

1.2. Grammar analysis of context-free grammars

Nowadays the use of the word Backus-Naur is gradually diminishing, and, inspired

by the Chomsky hierarchy, we most often speak of context-free grammars. For the

construction of everyday compilers for everyday languages it appears that this class

is still a bit too large. If we use the full power of the context-free languages we

get compilers which in general are ineﬃcient, and probably not so good in handling

erroneous input. This latter fact may not be so important from a theoretical point of

view, but it is from a pragmatical point of view. Most invocations of compilers still

have as their primary goal to discover mistakes made when typing the program, and

not so much generating actual code. This aspect is even stronger present in strongly

typed languages, such as Java and Haskell, where the type checking performed by

the compilers is one of the main contributions to the increase in eﬃciency in the

programming process.

When constructing a recogniser for a language described by a context-free grammar

one often wants to check whether or not the grammar has speciﬁc desirable properties.

Unfortunately, for a human being it is not always easy, and quite often practically

3

1. Goals

impossible, to see whether or not a particular property holds. Furthermore, it may

be very expensive to check whether or not such a property holds. This has led to a

whole hierarchy of context-free grammars classes, some of which are more powerful,

some are easy to check by machine, and some are easily checked by a simple human

inspection. In this course we will see many examples of such classes. The general

observation is that the more precise the answer to a speciﬁc question one wants to

have, the more computational eﬀort is needed and the sooner this question cannot

be answered by a human being anymore.

1.3. Compositionality

As we will see the structure of many compilers follows directly from the grammar

that describes the language to be compiled. Once this phenomenon was recognised it

went under the name syntax directed compilation. Under closer scrutiny, and under

the inﬂuence of the more functional oriented style of programming, it was recognised

that actually compilers are a special form of homomorphisms, a concept thus far only

familiar to mathematicians and more theoretically oriented computer scientist that

study the description of the meaning of a programming language.

This should not come as a surprise since this recognition is a direct consequence of the

tendency that ever greater parts of compilers are more or less automatically generated

from a formal description of some aspect of a programming language; e.g. by making

use of a description of their outer appearance or by making use of a description of the

semantics (meaning) of a language. We will see many examples of such mappings.

As a side eﬀect you will acquire a special form of writing functional programs, which

makes it often surprisingly simple to solve at ﬁrst sight rather complicated program-

ming assignments. We will see that the concept of lazy evaluation plays an important

rˆole in making these eﬃcient and straightforward implementations possible.

1.4. Abstraction mechanisms

One of the main reasons for that what used to be an endeavour for a large team in

the past can now easily be done by a couple of ﬁrst year’s students in a matter of

days or weeks, is that over the last thirty years we have discovered the right kind of

abstractions to be used, and an eﬃcient way of partitioning a problem into smaller

components. Unfortunately there is no simple way to teach the techniques which

have led us thus far. The only way we see is to take a historians view and to compare

the old and the new situations.

Fortunately however there have also been some developments in programming lan-

guage design, of which we want to mention the developments in the area of functional

programming in particular. We claim that the combination of a modern, albeit quite

4

1.4. Abstraction mechanisms

elaborate, type system, combined with the concept of lazy evaluation, provides an

ideal platform to develop and practice ones abstraction skills. There does not exist

another readily executable formalism which may serve as an equally powerful tool.

We hope that by presenting many algorithms, and fragments thereof, in a modern

functional language, we can show the real power of abstraction, and even ﬁnd some

inspiration for further developments in language design: i.e., ﬁnd clues about how to

extend such languages to enable us to make common patterns, which thus far have

only been demonstrated by giving examples, explicit.

5

1. Goals

6

2. Context-Free Grammars

Introduction

We often want to recognise a particular structure hidden in a sequence of symbols.

For example, when reading this sentence, you automatically structure it by means of

your understanding of the English language. Of course, not any sequence of symbols

is an English sentence. So how do we characterise English sentences? This is an old

question, which was posed long before computers were widely used; in the area of

natural language research the question has often been posed what actually constitutes

a “language”. The simplest deﬁnition one can come up with is to say that the English

language equals the set of all grammatically correct English sentences, and that a

sentence consists of a sequence of English words. This terminology has been carried

over to computer science: the programming language Java can be seen as the set of

all correct Java programs, whereas a Java program can be seen as a sequence of Java

symbols, such as identiﬁers, reserved words, speciﬁc operators etc.

This chapter introduces the most important notions of this course: the concept of

a language and a grammar. A language is a, possibly inﬁnite, set of sentences and

sentences are sequences of symbols taken from a ﬁnite set (e. g., sequences of char-

acters, which are referred to as strings). Just as we say that the fact whether or not

a sentence belongs to the English language is determined by the English grammar

(remember that before we have used the phrase “grammatically correct”), we have a

grammatical formalism for describing artiﬁcial languages.

A diﬀerence with the grammars for natural languages is that this grammatical for-

malism is a completely formal one. This property may enable us to mathematically

prove that a sentence belongs to some language, and often such proofs can be con-

structed automatically by a computer in a process called parsing. Notice that this

is quite diﬀerent from the grammars for natural languages, where one may easily

disagree about whether something is correct English or not. This completely formal

approach however also comes with a disadvantage; the expressiveness of the class of

grammars we are going to describe in this chapter is rather limited, and there are

many languages one might want to describe but which cannot be described, given

the limitations of the formalism.

7

2. Context-Free Grammars

Goals

The main goal of this chapter is to introduce and show the relation between the

main concepts for describing the parsing problem: languages and sentences, and

grammars.

In particular, after you have studied this chapter you will:

• know the concepts of language and sentence;

• know how to describe languages by means of context-free grammars;

• know the diﬀerence between a terminal symbol and a nonterminal symbol;

• be able to read and interpret the BNF notation;

• understand the derivation process used in describing languages;

• understand the rˆole of parse trees;

• understand the relation between context-free grammars and datatypes;

• understand the EBNF formalism;

• understand the concepts of concrete and abstract syntax;

• be able to convert a grammar from EBNF-notation into BNF-notation by hand;

• be able to construct a simple context-free grammar in EBNF notation;

• be able to verify whether or not a simple grammar is ambiguous;

• be able to transform a grammar, for example for removing left recursion.

2.1. Languages

The goal of this section is to introduce the concepts of language and sentence.

In conventional texts about mathematics it is not uncommon to encounter a deﬁnition

of sequences that looks as follows:

Deﬁnition 2.1 (Sequence). Let X be a set. The set of sequences over X, called X

∗

,

sequence

is deﬁned as follows:

• ε is a sequence, called the empty sequence, and

• if z is a sequence and a is an element of X, then az is also a sequence.

The above deﬁnition is an instance of a very common deﬁnition pattern: it is a

deﬁnition by induction, i. e., the deﬁnition of the concept refers to the concept itself.

induction

It is implicitly understood that nothing that cannot be formed by repeated, but ﬁnite

application of one of the two given rules is a sequence over X.

Furthermore, the deﬁnition corresponds almost exactly to the deﬁnition of the type

[a] of lists with elements of type a in Haskell. The one diﬀerence is that Haskell lists

can be inﬁnite, whereas sequences are always ﬁnite.

In the following, we will introduce several concepts based on sequences. They can be

implemented easily in Haskell using lists.

8

2.1. Languages

Functions that operate on an inductively deﬁned structure such as sequences are

typically structurally recursive, i. e., such deﬁnitions often follow a recursion pattern

which is similar to the deﬁnition of the structure itself. (Recall the function foldr

from the course on Functional Programming.)

Note that the Haskell notation for lists is generally more precise than the mathemat-

ical notation for sequences. When talking about languages and grammars, we often

leave the distinction between single symbols and sequences implicit.

We use letters from the beginning of the alphabet to represent single symbols, and

letters from the end of the alphabet to represent sequences. We write a to denote

both the single symbol a or the sequence aε, depending on context. We typically use

ε only when we want to explicitly emphasize that we are talking about the empty

sequence.

Furthermore, we denote concatenation of sequences and symbols in the same way,

i. e., az should be understood as the symbol a followed by the sequence z , whereas

xy is the concatenation of sequences x and y.

In Haskell, all these distinctions are explicit. Elements are distinguished from lists

by their type; there is a clear diﬀerence between a and [a]. Concatenation of lists

is handled by the operator (++), whereas a single element can be added to the front

of a list using (:). Also, Haskell identiﬁers often have longer names, so ab in Haskell

is to be understood as a single identiﬁer with name ab, not as a combination of two

symbols a and b.

Now we move from individual sequences to ﬁnite or inﬁnite sets of sequences. We

start with some terminology:

Deﬁnition 2.2 (Alphabet, Language, Sentence).

• An alphabet is a ﬁnite set of symbols.

alphabet

• A language is a subset of T

∗

, for some alphabet T.

language

• A sentence (often also called word) is an element of a language.

sentence

Note that ‘word’ and ‘sentence’ in formal languages are used as synonyms.

Some examples of alphabets are:

• the conventional Roman alphabet: ¦a, b, c, . . . , z¦;

• the binary alphabet ¦0, 1¦;

• sets of reserved words ¦if, then, else¦;

• a set of characters l =¦a, b, c, d, e, i, k, l, m, n, o, p, r, s, t, u, w, x¦;

• a set of English words ¦course, practical, exercise, exam¦.

Examples of languages are:

• T

∗

, ∅ (the empty set), ¦ε¦ and T are languages over alphabet T;

9

2. Context-Free Grammars

• the set ¦course, practical, exercise, exam¦ is a language over the alphabet

l of characters and exam is a sentence in it.

The question that now arises is how to specify a language. Since a language is a set

we immediately see three diﬀerent approaches:

• enumerate all the elements of the set explicitly;

• characterise the elements of the set by means of a predicate;

• deﬁne which elements belong to the set by means of induction.

We have just seen some examples of the ﬁrst (the Roman alphabet) and third (the set

of sequences over an alphabet) approach. Examples of the second approach are:

• the even natural numbers ¦n [ n ∈ ¦0, 1, . . . , 9¦

∗

, n mod 2 = 0¦;

• the language PAL of palindromes, sequences which read the same forward as

backward, over the alphabet ¦a, b, c¦: ¦s [ s ∈ ¦a, b, c¦

∗

, s = s

R

¦, where s

R

denotes the reverse of sequence s.

One of the fundamental diﬀerences between the predicative and the inductive ap-

proach to deﬁning a language is that the latter approach is constructive, i. e., it

provides us with a way to enumerate all elements of a language. If we deﬁne a lan-

guage by means of a predicate we only have a means to decide whether or not an

element belongs to a language. A famous example of a language which is easily de-

ﬁned in a predicative way, but for which the membership test is very hard, is the set

of prime numbers.

Quite often we want to prove that a language L, which is deﬁned by means of an

inductive deﬁnition, has a speciﬁc property P. If this property is of the form P(L) =

∀x ∈ L . P(x), then we want to prove that L ⊆ P.

Since languages are sets the usual set operators such as union, intersection and dif-

ference can be used to construct new languages from existing ones. The complement

of a language L over alphabet T is deﬁned by L =¦x [ x ∈ T

∗

, x / ∈ L¦.

In addition to these set operators, there are more speciﬁc operators, which apply only

to sets of sequences. We will use these operators mainly in the chapter on regular

languages, Chapter 8. Note that ∪ denotes set union, so ¦1, 2¦∪¦1, 3¦=¦1, 2, 3¦.

Deﬁnition 2.3 (Language operations). Let L and M be languages over the same

alphabet T, then

L = T

∗

−L complement of L

L

R

=¦s

R

[ s ∈ L¦ reverse of L

LM =¦st [ s ∈ L, t ∈ M¦ concatenation of L and M

L

0

=¦ε¦ 0

th

power of L

L

n+1

= LL

n

n + 1

st

power of L

L

∗

=

i∈N

L

i

= L

0

∪ L

1

∪ L

2

∪ . . . star-closure of L

L

+

=

i∈N,i>0

L

i

= L

1

∪ L

2

∪ . . . positive closure of L

10

2.2. Grammars

The following equations follow immediately from the above deﬁnitions.

L

∗

=¦ε¦ ∪ LL

∗

L

+

= LL

∗

Exercise 2.1. Let L=¦ab, aa, baa¦, where a and b are the terminals. Which of the following

strings are in L

∗

: abaabaaabaa, aaaabaaaa, baaaaabaaaab, baaaaabaa?

Exercise 2.2. What are the elements of ∅

∗

?

Exercise 2.3. For any language, prove

1. ∅L = L∅ =∅

2. ¦ε¦L = L¦ε¦ = L

Exercise 2.4. In this section we deﬁned two “star” operators: one for arbitrary sets (Deﬁni-

tion 2.1) and one for languages (Deﬁnition 2.3). Is there a diﬀerence between these operators?

2.2. Grammars

The goal of this section is to introduce the concept of context-free grammars.

Working with sets might be fun, but it often is complicated to manipulate sets, and

to prove properties of sets. For these purposes we introduce syntactical deﬁnitions,

called grammars, of sets. This section will only discuss so-called context-free gram-

mars, a kind of grammars that are convenient for automatic processing, and that can

describe a large class of languages. But the class of languages that can be described

by context-free grammars is limited.

In the previous section we deﬁned PAL, the language of palindromes, by means of

a predicate. Although this deﬁnition deﬁnes the language we want, it is hard to

use in proofs and programs. An important observation is the fact that the set of

palindromes can be deﬁned inductively as follows.

Deﬁnition 2.4 (Palindromes by induction).

• The empty string, ε, is a palindrome;

• the strings consisting of just one character, a, b, and c, are palindromes;

• if P is a palindrome, then the strings obtained by prepending and appending

the same character, a, b, and c, to it are also palindromes, that is, the strings

aPa

bPb

cPc

are palindromes.

11

2. Context-Free Grammars

The ﬁrst two parts of the deﬁnition cover the basic cases. The last part of the

deﬁnition covers the inductive cases. All strings which belong to the language PAL

inductively deﬁned using the above deﬁnition read the same forwards and backwards.

Therefore this deﬁnition is said to be sound (every string in PAL is a palindrome).

sound

Conversely, if a string consisting of a’s, b’s, and c’s reads the same forwards and

backwards then it belongs to the language PAL. Therefore this deﬁnition is said to

be complete (every palindrome is in PAL).

complete

Finding an inductive deﬁnition for a language which is described by a predicate (like

the one for palindromes) is often a nontrivial task. Very often it is relatively easy

to ﬁnd a deﬁnition that is sound, but you also have to convince yourself that the

deﬁnition is complete. A typical method for proving soundness and completeness of

an inductive deﬁnition is mathematical induction.

Now that we have an inductive deﬁnition for palindromes, we can proceed by giving

a formal representation of this inductive deﬁnition.

Inductive deﬁnitions like the one above can be represented formally by making use

of deduction rules which look like:

a

1

, a

2

, . . . , a

n

¬ a or ¬ a

The ﬁrst kind of deduction rule has to be read as follows:

if a

1

, a

2

, . . . and a

n

are true,

then a is true.

The second kind of deduction rule, called an axiom, has to be read as follows:

a is true.

Using these deduction rules we can now write down the inductive deﬁnition for PAL

as follows:

¬ ε ∈ PAL

¬ a ∈ PAL

¬ b ∈ PAL

¬ c ∈ PAL

P ∈ PAL

¬ aPa ∈ PAL

P ∈ PAL

¬ bPb ∈ PAL

P ∈ PAL

¬ cPc ∈ PAL

Although the deﬁnition of PAL

**is completely formal, it is still laborious to write.
**

Since in computer science we use many deﬁnitions which follow such a pattern, we

introduce a shorthand for it, called a grammar. A grammar consists of production

grammar

rules. We can give a grammar of PAL

**by translating the deduction rules given above
**

into production rules. The rule with which the empty string is constructed is:

12

2.2. Grammars

P → ε

This rule corresponds to the axiom that states that the empty string ε is a palindrome.

A rule of the form s → α, where s is symbol and λα is a sequence of symbols, is called

a production rule, or production for short. A production rule can be considered as

production rule

a possible way to rewrite the symbol s. The symbol P to the left of the arrow is a

symbol which denotes palindromes. Such a symbol is an example of a nonterminal

symbol , or nonterminal for short. Nonterminal symbols are also called auxiliary

nonterminal

symbols: their only purpose is to denote structure, they are not part of the alphabet

of the language. Three other basic production rules are the rules for constructing

palindromes consisting of just one character. Each of the one element strings a, b,

and c is a palindrome, and gives rise to a production:

P → a

P → b

P → c

These production rules correspond to the axioms that state that the one element

strings a, b, and c are palindromes. If a string α is a palindrome, then we obtain a

new palindrome by prepending and appending an a, b, or c to it, that is, aαa, bαb,

and cαc are also palindromes. To obtain these palindromes we use the following

recursive productions:

P → aPa

P → bPb

P → cPc

These production rules correspond to the deduction rules that state that, if P is a

palindrome, then one can deduce that aPa, bPb and cPc are also palindromes. The

grammar we have presented so far consists of three components:

• the set of terminals ¦a, b, c¦;

terminal

• the set of nonterminals ¦P¦;

• and the set of productions (the seven productions that we have introduced so

far).

Note that the intersection of the set of terminals and the set of nonterminals is

empty. We complete the description of the grammar by adding a fourth component:

the nonterminal start symbol P. In this case we have only one choice for a start

start symbol

symbol, but a grammar may have many nonterminal symbols, and we always have

to select one to start with.

To summarize, we obtain the following grammar for PAL:

P → ε

P → a

P → b

13

2. Context-Free Grammars

P → c

P → aPa

P → bPb

P → cPc

The deﬁnition of the set of terminals, ¦a, b, c¦, and the set of nonterminals, ¦P¦, is

often implicit. Also the start-symbol is implicitly deﬁned here since there is only one

nonterminal.

We conclude this example with the formal deﬁnition of a context-free grammar.

Deﬁnition 2.5 (Context-Free Grammar). A context-free grammar G is a four-tuple

context-free

grammar

(T, N, R, S) where

• T is a ﬁnite set of terminal symbols;

• N is a ﬁnite set of nonterminal symbols (T and N are disjunct);

• R is a ﬁnite set of production rules. Each production has the form A → α,

where A is a nonterminal and α is a sequence of terminals and nonterminals;

• S is the start symbol, S ∈ N.

The adjective “context-free” in the above deﬁnition comes from the speciﬁc produc-

tion rules that are considered: exactly one nonterminal on the left hand side. Not

every language can be described via a context-free grammar. The standard example

here is ¦a

n

b

n

c

n

[ n ∈ N¦. We will encounter this example again later in these lecture

notes.

2.2.1. Notational conventions

In the deﬁnition of the grammar for PAL we have written every production on a

single line. Since this takes up a lot of space, and since the production rules form the

heart of every grammar, we introduce the following shorthand. Instead of writing

S → α

S → β

we combine the two productions for S in one line as using the symbol [:

S → α [ β

We may rewrite any number of rewrite rules for one nonterminal in this fashion, so

the grammar for PAL may also be written as follows:

P → ε [ a [ b [ c [ aPa [ bPb [ cPc

The notation we use for grammars is known as BNF – Backus Naur Form – after

BNF

Backus and Naur, who ﬁrst used this notation for deﬁning grammars.

14

2.3. The language of a grammar

Another notational convention concerns names of productions. Sometimes we want to

give names to production rules. The names will be written in front of the production.

So, for example,

Alpha rule: S → α

Beta rule: S → β

Finally, if we give a context-free grammar just by means of its productions, the start-

symbol is usually the nonterminal in the left hand side of the ﬁrst production, and

the start-symbol is usually called S.

Exercise 2.5. Give a context free grammar for the set of sentences over alphabet X where

1. X =¦a¦,

2. X =¦a, b¦.

Exercise 2.6. Give a context free grammar for the language

L =¦a

n

b

n

[ n ∈ N¦

Exercise 2.7. Give a grammar for palindromes over the alphabet ¦a, b¦.

Exercise 2.8. Give a grammar for the language

L =¦s s

R

[ s ∈ ¦a, b¦

∗

¦

This language is known as the language of mirror palindromes.

Exercise 2.9. A parity sequence is a sequence consisting of 0’s and 1’s that has an even

number of ones. Give a grammar for parity sequences.

Exercise 2.10. Give a grammar for the language

L =¦w [ w ∈ ¦a, b¦

∗

∧ #(a, w) = #(b, w)¦

where #(c, w) is the number of c-occurrences in w.

2.3. The language of a grammar

The goal of this section is to describe the relation between grammars and languages:

to show how to derive sentences of a language, given its grammar.

In the previous section, we have demonstrated in several examples how to construct

a grammar for a particular language. Now we consider the reverse question: how to

obtain a language from a given grammar? Before we can answer this question we

ﬁrst have to say what we can do with a grammar. The answer is simple: we can

derive sequences with it.

15

2. Context-Free Grammars

How do we construct a palindrome? A palindrome is a sequence of terminals, in our

case the characters a, b and c, that can be derived in zero or more direct derivation

steps from the start symbol P using the productions of the grammar for palindromes

given before.

For example, the sequence bacab can be derived using the grammar for palindromes

as follows:

P

⇒

bPb

⇒

baPab

⇒

bacab

Such a construction is called a derivation. In the ﬁrst step of this derivation pro-

derivation

duction P → bPb is used to rewrite P into bPb. In the second step production

P → aPa is used to rewrite bPb into baPab. Finally, in the last step production

P → c is used to rewrite baPab into bacab. Constructing a derivation can be seen

as a constructive proof that the string bacab is a palindrome.

We will now describe derivation steps more formally.

Deﬁnition 2.6 (Derivation). Suppose X → β is a production of a grammar, where

X is a nonterminal symbol and β is a sequence of (nonterminal or terminal) symbols.

Let αXγ be a sequence of (nonterminal or terminal) symbols. We say that αXγ

directly derives the sequence αβγ, which is obtained by replacing the left hand side

direct derivation

X of the production by the corresponding right hand side β. We write αXγ ⇒ αβγ

and we also say that αXγ rewrites to αβγ in one step. A sequence ϕ

n

is derived

derivation

from a sequence ϕ

0

, written ϕ

0

⇒

∗

ϕ

n

, if there exist sequences ϕ

0

, . . . , ϕ

n

such that

∀i , 0 i <n : ϕ

i

⇒ ϕ

i+1

If n = 0, this statement is trivially true, and it follows that we can derive each

sentence ϕ from itself in zero steps:

ϕ ⇒

∗

ϕ

A partial derivation is a derivation of a sequence that still contains nonterminals.

partial derivation

Finding a derivation ϕ

0

⇒

∗

ϕ

n

is, in general, a nontrivial task. A derivation is only

one branch of a whole search tree which contains many more branches. Each branch

represents a (successful or unsuccessful) direction in which a possible derivation may

proceed. Another important challenge is to arrange things in such a way that ﬁnding

a derivation can be done in an eﬃcient way.

16

2.3. The language of a grammar

From the example derivation above it follows that

P ⇒

∗

bacab

Because this derivation begins with the start symbol of the grammar and results in a

sequence consisting of terminals only (a terminal string), we say that the string bacab

belongs to the language generated by the grammar for palindromes. In general, we

deﬁne

language of a

grammar

Deﬁnition 2.7 (Language of a grammar). The language of a grammar G=(T, N, R, S),

usually denoted by L(G), is deﬁned as

L(G) =¦s [ S ⇒

∗

s, s ∈ T

∗

¦

The language L(G) is also called the language generated by the grammar G. We

sometimes also talk about the language of a nonterminal A, which is deﬁned by

L(A) =¦s [ A ⇒

∗

s, s ∈ T

∗

¦

Note that diﬀerent grammars may have the same language. For example, if we extend

the grammar for PAL with the production P → bacab, we obtain a grammar with

exactly the same language as PAL. Two grammars that generate the same language

are called equivalent. So for a particular grammar there exists a unique language,

equivalent

but the reverse is not true: given a language we can construct many grammars that

generate the language. To phrase it more mathematically: the mapping between a

grammar and its language is not a bijection.

Deﬁnition 2.8 (Context-free language). A context-free language is a language that

context-free

language

is generated by a context-free grammar.

All palindromes can be derived from the start symbol P. Thus, the language of

our grammar for palindromes is PAL, the set of all palindromes over the alphabet

¦a, b, c¦, and PAL is context-free.

2.3.1. Examples of basic languages

Digits occur in a several programming languages and other languages, and so do

letters. In this subsection we will deﬁne some grammars that specify some basic

languages such as digits and letters. These grammars will be used frequently in later

sections.

• The language of single digits is speciﬁed by a grammar with ten production

rules for the nonterminal Dig.

Dig → 0 [ 1 [ 2 [ 3 [ 4 [ 5 [ 6 [ 7 [ 8 [ 9

17

2. Context-Free Grammars

• We obtain sequences of digits by means of the following grammar:

Digs → ε [ Dig Digs

• Natural numbers are sequences of digits that start with a non-zero digit. So

in order to specify natural numbers, we ﬁrst deﬁne the language of non-zero

digits.

Dig-0 → 1 [ 2 [ 3 [ 4 [ 5 [ 6 [ 7 [ 8 [ 9

Now we can deﬁne the language of natural numbers as follows.

Nat → 0 [ Dig-0 Digs

• Integers are natural numbers preceded by a sign. If a natural number is not

preceded by a sign, it is supposed to be a positive number.

Sign → + [ -

Int → Sign Nat [ Nat

• The languages of small letters and capital letters are each speciﬁed by a gram-

mar with 26 productions:

SLetter → a [ b [ . . . [ z

CLetter → A [ B [ . . . [ Z

In the real deﬁnitions of these grammars we have to write each of the 26 letters,

of course. A letter is now either a small or a capital letter.

Letter → SLetter [ CLetter

• Variable names, function names, datatypes, etc., are all represented by identi-

ﬁers in programming languages. The following grammar for identiﬁers might

be used in a programming language:

Identiﬁer → Letter AlphaNums

AlphaNums → ε [ Letter AlphaNums [ Dig AlphaNums

An identiﬁer starts with a letter, and is followed by a sequence of alphanumeric

characters, i. e., letters and digits. We might want to allow more symbols, such

as for example underscores and dollar symbols, but then we have to adjust the

grammar, of course.

• Dutch zip codes consist of four digits, of which the ﬁrst digit is non-zero, fol-

lowed by two capital letters. So

ZipCode → Dig-0 Dig Dig Dig CLetter CLetter

18

2.4. Parse trees

Exercise 2.11. A terminal string that belongs to the language of a grammar is always

derived in one or more steps from the start symbol of the grammar. Why?

Exercise 2.12. What language is generated by the grammar with the single production rule

S → ε

Exercise 2.13. What language does the grammar with the following productions generate?

S → Aa

A → B

B → Aa

Exercise 2.14. Give a simple description of the language generated by the grammar with

productions

S → aA

A → bS

S → ε

Exercise 2.15. Is the language L deﬁned in Exercise 2.1 context free ?

2.4. Parse trees

The goal of this section is to introduce parse trees, and to show how parse trees relate

to derivations. Furthermore, this section deﬁnes (non)ambiguous grammars.

For any partial derivation, i. e., a derivation that contains nonterminals in its right

hand side, there may be several productions of the grammar that can be used to pro-

ceed the partial derivation with. As a consequence, there may be diﬀerent derivations

for the same sentence.

However, if only the order in which the derivation steps are chosen diﬀers between

two derivations, then the derivations are considered to be equivalent. If, however,

diﬀerent derivation steps have been chosen in two derivations, then these derivations

are considered to be diﬀerent.

Here is a simple example. Consider the grammar SequenceOfS with productions:

S → SS

S → s

Using this grammar, we can derive the sentence sss in at least the following two

ways (the nonterminal that is rewritten in each step appears underlined):

S ⇒ SS ⇒ SSS ⇒ SsS ⇒ ssS ⇒ sss

19

2. Context-Free Grammars

S ⇒ SS ⇒ sS ⇒ sSS ⇒ sSs ⇒ sss

These derivations are the same up to the order in which derivation steps are taken.

However, the following derivation does not use the same derivation steps:

S ⇒ SS ⇒ SSS ⇒ sSS ⇒ ssS ⇒ sss

In both derivation sequences above, the ﬁrst S was rewritten to s. In this derivation,

however, the ﬁrst S is rewritten to SS.

The set of all equivalent derivations can be represented by selecting a so-called canoni-

cal element. A good candidate for such a canonical element is the leftmost derivation.

leftmost derivation

In a leftmost derivation, the leftmost nonterminal is rewritten in each step. If there

exists a derivation of a sentence x using the productions of a grammar, then there

exists also a leftmost derivation of x. The last of the three derivation sequences for

the sentence sss given above is a leftmost derivation. The two equivalent derivation

sequences before, however, are both not leftmost. The leftmost derivation corre-

sponding to these two sequences above is

S ⇒ SS ⇒ sS ⇒ sSS ⇒ ssS ⇒ sss

There exists another convenient way for representing equivalent derivations: they all

correspond to the same parse tree (or derivation tree). A parse tree is a representation

parse tree

of a derivation which abstracts from the order in which derivation steps are chosen.

The internal nodes of a parse tree are labelled with a nonterminal N, and the children

of such a node are the parse trees for symbols of the right hand side of a production for

N. The parse tree of a terminal symbol is a leaf labelled with the terminal symbol.

The resulting parse tree of the ﬁrst two derivations of the sentence sss looks as

follows:

S

S

s

S

S

s

S

s

The third derivation of the sentence sss results in a diﬀerent parse tree:

S

S

S

s

S

s

S

s

20

2.4. Parse trees

As another example, all derivations of the string abba using the productions of the

grammar

P → ε

P → APA

P → BPB

A → a

B → b

are represented by the following derivation tree:

P

A

a

P

B

b

P

ε

B

b

A

a

A derivation tree can be seen as a structural interpretation of the derived sentence.

Note that there might be more than one structural interpretation of a sentence with

respect to a given grammar. Such grammars are called ambiguous.

Deﬁnition 2.9 (ambiguous grammar, unambiguous grammar). A grammar is un-

ambiguous if every sentence has a unique leftmost derivation, or, equivalently, if every

unambiguous

sentence has a unique derivation tree. Otherwise it is called ambiguous.

ambiguous

The grammar SequenceOfS for constructing sequences of s’s is an example of an

ambiguous grammar, since there exist two parse trees for the sentence sss.

It is in general undecidable whether or not an arbitrary context-free grammar is

ambiguous. This implies that is impossible to write a program that determines for

an arbitrary context-free grammar whether it is ambiguous or not.

It is usually rather diﬃcult to translate languages with ambiguous grammars. There-

fore, you will ﬁnd that most grammars of programming languages and other languages

that are used in processing information are unambiguous.

Grammars have proved very successful in the speciﬁcation of artiﬁcial languages (such

as programming languages). They have proved less successful in the speciﬁcation of

natural languages (such as English), partly because is extremely diﬃcult to construct

an unambiguous grammar that speciﬁes a nontrivial part of the language. Take for

example the sentence ‘They are ﬂying planes’. This sentence can be read in two

ways, with diﬀerent meanings: ‘They – are – ﬂying planes’, and ‘They – are ﬂying

– planes’. While ambiguity of natural languages may perhaps be considered as an

advantage for their users (e. g., politicians), it certainly is considered a disadvantage

for language translators, because it is usually impossible to maintain an ambiguous

meaning in a translation.

21

2. Context-Free Grammars

2.5. Grammar transformations

In this section, we will ﬁrst look at a number of properties of grammars. We then will

show how we can systematically transform grammars into grammars that describe

the same language and satisfy particular properties.

Here are some examples of properties that are worth considering for a particular

grammar:

• a grammar may be unambiguous, that is, every sentence of its language has a

unique parse tree;

• a grammar may have the property that only the start symbol can derive the

empty string; no other nonterminal can derive the empty string;

• a grammar may have the property that every production either has a single

terminal, or two nonterminals in its right hand side. Such a grammar is said

to be in Chomsky normal form.

So why are we interested in such properties? Some of these properties imply that it

is possible to build parse trees for sentences of the language of the grammar in only

one way. Some other properties imply that we can build these parse trees very fast.

Other properties are used to prove facts about grammars. Yet other properties are

used to eﬃciently compute certain information from parse trees of a grammar.

Properties are particularly interesting in combination with grammar transformations.

A grammar transformationis a procedure to obtain a grammar G

from a grammar

grammar

transformation

G such that L(G

) =L(G).

Suppose, for example, that we have a program that builds parse trees for sentences

of grammars in Chomsky normal form, and that we can prove that every grammar

can be transformed in a grammar in Chomsky normal form. Then we can use this

program for building parse trees for any grammar.

Since it is sometimes convenient to have a grammar that satisﬁes a particular property

for a language, we would like to be able to transform grammars into other grammars

that generate the same language, but that possibly satisfy diﬀerent properties. In

the following, we describe a number of grammar transformations:

• Removing duplicate productions.

• Substituting right hand sides for nonterminals.

• Removing unreachable productions.

• Left factoring.

• Removing left recursion.

• Associative separator.

• Introduction of priorities.

22

2.5. Grammar transformations

There are many more transformations than we describe here; we will only show a

small but useful set of grammar transformations. In the following, we will assume

that N is the set of nonterminals, T is the set of terminals, and that u, v, w, x, y and

z denote sequences of terminals and nonterminals, i. e., are elements of (N ∪ T)

∗

.

2.5.1. Removing duplicate productions

This grammar transformation is a transformation that can be applied to any grammar

of the correct form. If a grammar contains two occurrences of the same production

rule, one of these occurrences can be removed. For example,

A → u [ u [ v

can be transformed into

A → u [ v

2.5.2. Substituting right hand sides for nonterminals

If a nonterminal X occurs in a right hand side of a production, the production may

be replaced by just as many productions as there exist productions for X, in which

X has been replaced by its right hand sides. For example, we can substitute B in

the right hand side of the ﬁrst production in the following grammar:

A → uBv [ z

B → x [ w

The resulting grammar is:

A → uxv [ uwv [ z

B → x [ w

2.5.3. Removing unreachable productions

Consider the result of the transformation above:

A → uxv [ uwv [ z

B → x [ w

If A is the start symbol and B does not occur in any of the symbol sequences u, v,

w, x, z , then the second production can never occur in the derivation of a sentence

starting from A. In such a case, the unreachable production can be dropped:

A → uxv [ uwv [ z

23

2. Context-Free Grammars

2.5.4. Left factoring

Left factoring is a grammar transformation that is applicable when two productions

left factoring

for the same nonterminal start with the same sequence of (terminal and/or nonter-

minal) symbols. These two productions can then be replaced by a single production,

that ends with a new nonterminal, replacing the part of the sequence after the com-

mon start sequence. Two productions for the new nonterminal are added: one for

each of the two diﬀerent end sequences of the two productions. For example:

A → xy [ xz [ v

may be transformed into

A → xZ [ v

Z → y [ z

where Z is a new nonterminal. As we will see in Chapter 3, parsers can be constructed

systematically from a grammar, and left factoring the grammar before constructing

a parser can dramatically improve the performance of the resulting parser.

2.5.5. Removing left recursion

A production is called left-recursive if the right-hand side starts with the nonterminal

left recursion

of the left-hand side. For example, the production

A → Az

is left-recursive. A grammar is left-recursive if we can derive A => + Az for some

nonterminal A of the grammar (i.e., if we can derive Az from A in one or more

steps).

Left-recursive grammars are sometimes undesirable – we will, for instance, see in

Chapter 3 that a parser constructed systematically from a left-recursive grammar

may loop. Fortunately, left recursion can be removed by transforming the grammar.

The following transformation removes left-recursive productions.

To remove the left-recursive productions of a nonterminal A, we divide the produc-

tions for A in sets of left-recursive and non left-recursive productions. We can thus

factorize the productions for A as follows:

A → Ax

1

[ Ax

2

[ . . . [ Ax

n

A → y

1

[ y

2

[ . . . [ y

m

where none of the symbol sequences y

1

, . . . , y

m

starts with A. We now add a new

nonterminal Z, and replace the productions for A by:

24

2.5. Grammar transformations

A → y

1

[ y

1

Z [ . . . [ y

m

[ y

m

Z

Z → x

1

[ x

1

Z [ . . . [ x

n

[ x

n

Z

Note that this procedure only works for a grammar that is directly left-recursive, i. e.,

a grammar that contains a left-recursive production of the form A → Ax.

Grammars can also be indirectly left recursive. An example is

A → Bx

B → Ay

None of the two productions is left recursive, but we can still derive A ⇒

∗

Ayx.

Removing left recursion in an indirectly left-recursive grammar is also possible, but

a bit more complicated [1].

Here is an example of how to apply the procedure described above to a grammar that

is directly left recursive, namely the grammar SequenceOfS that we have introduced

in Section 2.4:

S → SS

S → s

The ﬁrst production is left-recursive. The second is not. We can thus directly apply

the procedure for left recursion removal, and obtain the following productions:

S → s [ sZ

Z → S [ SZ

2.5.6. Associative separator

The following grammar fragment generates a list of declarations, separated by a

semicolon ‘;’:

Decls → Decls ; Decls

Decls → Decl

The productions for Decl , which generates a single declaration, have been omitted.

This grammar is ambiguous, for the same reason as SequenceOfS is ambiguous. The

operator ; is an associative separator in the generated language, that is, it does not

matter how we group the declarations; given three declarations d

1

, d

2

, and d

3

, the

meaning of d

1

; (d

2

; d

3

) and (d

1

; d

2

) ; d

3

is the same. Therefore, we may use the

following unambiguous grammar for generating a language of declarations:

Decls → Decl ; Decls

Decls → Decl

25

2. Context-Free Grammars

Note that in this case, the transformed grammar is also no longer left-recursive.

An alternative unambiguous (but still left-recursive) grammar for the same language

is

Decls → Decls ; Decl

Decls → Decl

The grammar transformation just described should be handled with care: if the

separator is associative in the generated language, like the semicolon in this case,

applying the transformation is ﬁne. However, if the separator is not associative, then

removing the ambiguity in favour of a particular nesting is dangerous.

This grammar transformation is often useful for expressions which are separated by

associative operators, such as for example natural numbers and addition.

2.5.7. Introduction of priorities

Another form of ambiguity often arises in the part of a grammar for a programming

language which describes expressions. For example, the following grammar generates

arithmetic expressions:

E → E + E

E → E * E

E → ( E )

E → Digs

where Digs generates a list of digits as described in Section 2.3.1.

This grammar is ambiguous: for example, the sentence 2+4*6 has two parse trees: one

corresponding to (2+4)*6, and one corresponding to 2+(4*6). If we make the usual

assumption that * has higher priority than +, the latter expression is the intended

reading of the sentence 2+4*6. In order to obtain parse trees that respect these

priorities, we transform the grammar as follows:

E → T

E → E + T

T → F

T → T * F

F → ( E )

F → Digs

This grammar generates the same language as the previous grammar for expressions,

but it respects the priorities of the operators.

26

2.5. Grammar transformations

In practice, often more than two levels of priority are used. Then, instead of writing

a large number of nearly identically formed production rules, we can abbreviate the

grammar by using parameterised nonterminals. For 1 i <n, we get productions

E

i

→ E

i+1

E

i

→ E

i

Op

i

E

i+1

The nonterminal Op

i

is parameterised and generates operators of priority i . In

addition to the above productions, there should also be a production for expressions

of the highest priority, for example:

E

n

→ ( E

1

) [ Digs

2.5.8. Discussion

We have presented several examples of grammar transformations. A grammar trans-

formation transforms a grammar into another grammar that generates the same

language. For each of the above transformations we should therefore prove that the

generated language remains the same. Since the proofs are too complicated at this

point, they are omitted. Proofs can be found in any of the theoretical books on

language and parsing theory [11].

There exist many other grammar transformations, but the ones given in this section

suﬃce for now. Note that everywhere we use ‘left’ (left-recursion, left factoring), we

can replace it by ‘right’, and obtain a dual grammar transformation. We will discuss

a larger example of a grammar transformation after the following section.

Exercise 2.16. Consider the following ambiguous grammar with start symbol A:

A → AaA

A → b [ c

Transform the grammar by applying the rule for associative separators. Choose the trans-

formation such that the resulting grammar is also no longer left-recursive.

Exercise 2.17. The standard example of ambiguity in programming languages is the dan-

gling else. Let G be a grammar with terminal set ¦if, b, then, else, a¦ and the following

productions:

S → if b then S else S

S → if b then S

S → a

1. Give two parse trees for the sentence if b then if b then a else a.

2. Give an unambiguous grammar that generates the same language as G.

3. How does Java prevent this dangling else problem?

27

2. Context-Free Grammars

Exercise 2.18. A bit list is a nonempty list of bits separated by commas. A grammar for

bit lists is given by

L → B

L → L , L

B → 0 [ 1

Remove the left recursion from this grammar.

Exercise 2.19. Consider the follwing grammar with start symbol S:

S → AB

A → ε [ aaA

B → ε [ Bb

1. What language does this grammar generate?

2. Give an equivalent non left recursive grammar.

2.6. Concrete and abstract syntax

In this section, we establish a connection between context-free grammars and Haskell

datatypes. To this end, we introduce the notion of abstract syntax, and show how

to obtain an abstract syntax from a concrete syntax.

For each context-free grammar we can deﬁne a corresponding datatype in Haskell.

Values of these datatypes represent parse trees of the context-free grammar. As an

example we take the grammar SequenceOfS:

S → SS

S → s

First, we give each of the productions of this grammar a name:

Beside: S → SS

Single: S → s

Now we interpret the start symbol of the grammar S as a datatype, using the names

of the productions as constructors:

data S = Beside S S

[ Single

Note that the nonterminals on the right hand side of Beside reappear as arguments of

the constructor Beside. On the other hand, the terminal symbol s in the production

Single is omitted in the deﬁnition of the constructor Single.

One may be tempted to make the following deﬁnition instead:

28

2.6. Concrete and abstract syntax

data S

= Beside

S

S

[ Single

**Char — too general
**

However, this datatype is too general for the given grammar. An argument of type

Char can be instantiated to any single character, but we know that this character

always has to be s. Since there is no choice anyway, there is no extra value in storing

that s, and the ﬁrst datatype S serves the purpose of encoding the parse trees of the

grammar SequenceOfS just ﬁne.

For example, the parse tree that corresponds to the ﬁrst two derivations of the se-

quence sss is represented by the following value of the datatype S:

Beside Single (Beside Single Single)

The third derivation of the sentence sss produces the following parse tree:

Beside (Beside Single Single) Single

To emphasize that these representations contain suﬃcient information in order to re-

produce the original strings, we can write a function that performs this conversion:

sToString :: S → String

sToString (Beside l r) = sToString l ++ sToString r

sToString Single = "s"

Applying the function sToString to either Beside Single (Beside Single Single) or

Beside (Beside Single Single) Single yields the string "sss".

A concrete syntax of a language describes the appearance of the sentences of a lan-

concrete syntax

guage. So the concrete syntax of the language of nonterminal S is given by the

grammar SequenceOfS.

On the other hand, an abstract syntax of a language describes the shapes of parse trees

abstract syntax

of the language, without the need to refer to concrete terminal symbols. Parse trees

are therefore often also called abstract syntax trees. The datatype S is an example

of an abstract syntax for the language of SequenceOfS. The adjective ‘abstract’

indicates that values of the abstract syntax do not need to explicitly contain all

information about particular sentences, as long as that information is recoverable, as

for example by applying function sToString.

A function such as sToString is often called a semantic function. A semantic function

semantic function

is a function that is deﬁned on an abstract syntax of a language. Semantic functions

are used to give semantics (meaning) to values. In this example, the meaning of a

more abstract representation is expressed in terms of a concrete representation.

Using the removing left recursion grammar transformation, the grammar SequenceOfS

can be transformed into the grammar with the following productions:

29

2. Context-Free Grammars

S → sZ [ s

Z → SZ [ S

An abstract syntax of this grammar may be given by

data SA = ConsS Z [ SingleS

data Z = ConsZ SA Z [ SingleZ SA

For each nonterminal in the original grammar, we have introduced a corresponding

datatype. For each production for a particular nonterminal (expanding all the alter-

natives into separate productions), we have introduced a constructor and invented a

name for the constructor. The nonterminals on the right hand side of the production

rules appear as arguments of the constructors, but the terminals disappear, because

that information can be recovered by knowing which constructors have been used.

If we look at the two diﬀerent grammars for SequenceOfS, and the two abstract

syntaxes, we can conclude that the only important information about sequences of

s’s is how many occurrences of s there are. So the ultimate abstract syntax for

SequenceOfS is

data SS = Size Int

Using the abstract syntax SS, the sequence sss is represented by the parse tree

Size 3, and we can still recover the original string from a value SS by means of a

semantic function:

ssToString :: SS → String

ssToString (Size n) = replicate n ’s’

The SequenceOfS example shows that one may choose between many diﬀerent ab-

stract syntaxes for a given grammar. The choice of an abstract syntax over another

should therefore be determined by the demands of the application, i.e., by what we

ultimately want to compute.

Exercise 2.20. The following Haskell datatype represents a limited form of arithmetic ex-

pressions

data Expr = Add Expr Expr

[ Mul Expr Expr

[ Con Int

Give a grammar for a suitable concrete syntax corresponding to this datatype.

Exercise 2.21. Consider the grammar for palindromes that you have constructed in Exer-

cise 2.7. Give parse trees for the palindromes pal

1

="abaaba" and pal

2

="baaab". Deﬁne a

datatype Pal corresponding to the grammar and represent the parse trees for pal

1

and pal

2

as values of Pal .

Exercise 2.22. Consider your answers to Exercises 2.7 and 2.21 where we have given a

grammar for palindromes over the alphabet ¦a, b¦ and a Haskell datatype describing the

abstract syntax of such palindromes.

30

2.7. Constructions on grammars

1. Write a semantic function that transforms an abstract representation of a palindrome

into a concrete one. Test your function with the palindromes pal

1

and pal

2

from

Exercise 2.21.

2. Write a semantic function that counts the number of a’s occurring in a palindrome.

Test your function with the palindromes pal

1

and pal

2

from Exercise 2.21.

Exercise 2.23. Consider your answer to Exercise 2.8, which describes the concrete syntax

for mirror palindromes.

1. Deﬁne a datatype Mir that describes the abstract syntax corresponding to your gram-

mar. Give the two abstract mirror palindromes aMir

1

and aMir

2

that correspond to

the concrete mirror palindromes cMir

1

= "abaaba" and cMir

2

= "abbbba".

2. Write a semantic function that transforms an abstract representation of a mirror

palindrome into a concrete one. Test your function with the abstract mirror palin-

dromes aMir

1

and aMir

2

.

3. Write a function that transforms an abstract representation of a mirror palindrome

into the corresponding abstract representation of a palindrome (using the datatype

from Exercise 2.21). Test your function with the abstract mirror palindromes aMir

1

and aMir

2

.

Exercise 2.24. Consider your anwer to Exercise 2.9, which describes the concrete syntax

for parity sequences.

1. Deﬁne a datatype Parity describing the abstract syntax corresponding to your gram-

mar. Give the two abstract parity sequences aEven

1

and aEven

2

that correspond to

the concrete parity sequences cEven

1

= "00101" and cEven

2

= "01010".

2. Write a semantic function that transforms an abstract representation of a parity se-

quence into a concrete one. Test your function with the abstract parity sequences aEven

1

and aEven

2

.

Exercise 2.25. Consider your answer to Exercise 2.18, which describes the concrete syntax

for bit lists by means of a grammar that is not left-recursive.

1. Deﬁne a datatype BitList that describes the abstract syntax corresponding to your

grammar. Give the two abstract bit-lists aBitList

1

and aBitList

2

that correspond to

the concrete bit-lists cBitList

1

= "0,1,0" and cBitList

2

= "0,0,1".

2. Write a semantic function that transforms an abstract representation of a bit list into

a concrete one. Test your function with the abstract bit lists aBitList

1

and aBitList

2

.

3. Write a function that concatenates two abstract representations of a bit lists into a bit

list. Test your function with the abstract bit lists aBitList

1

and aBitList

2

.

2.7. Constructions on grammars

This section introduces some constructions on grammars that are useful when spec-

ifying larger grammars, for example for programming languages. Furthermore, it

gives an example of a larger grammar that is transformed in several steps.

The BNF notation, introduced in Section 2.2.1, was ﬁrst used in the early sixties

when the programming language ALGOL 60 was deﬁned and until now it is the

31

2. Context-Free Grammars

standard way of deﬁning the syntax of programming languages (see,for instance,

the Java Language Grammar). You may object that the Java grammar contains

more “syntactical sugar” than the grammars that we considered thus far (and to be

honest, this also holds for the ALGOL 60 grammar): one encounters nonterminals

with postﬁxes ‘?’, ‘+’ and ‘∗’.

This extended BNF notation, EBNF, is introduced in order to help abbreviate a

EBNF

number of standard constructions that usually occur quite often in the syntax of a

programming language:

• one or zero occurrences of nonterminal P, abbreviated P?,

• one or more occurrences of nonterminal P, abbreviated P

+

,

• and zero or more occurrences of nonterminal P, abbreviated P

∗

.

We could easily express these constructions by adding additional nonterminals, but

that decreases the readability of the grammar. The notation for the EBNF con-

structions is not entirely standardized. In some texts, you will for instance ﬁnd the

notation [P] instead of P?, and ¦P¦ for P

∗

. The same notation can be used for

languages, grammars, and sequences of terminal and nonterminal symbols instead of

just single nonterminals. In this section, we deﬁne the meaning of these constructs.

We introduced grammars as an alternative for the description of languages. Designing

a grammar for a speciﬁc language may not be a trivial task. One approach is to

decompose the language and to ﬁnd grammars for each of its constituent parts.

In Deﬁnition 2.3, we have deﬁned a number of operations on languages using op-

erations on sets. We now show that these operations can be expressed in terms of

operations on context-free grammars.

Theorem 2.10 (Language operations). Suppose we have grammars for the languages

L and M, say G

L

= (T, N

L

, R

L

, S

L

) and G

M

= (T, N

M

, R

M

, S

M

). We assume that

the nonterminal sets N

L

and N

M

are disjoint. Then

• the language L∪M is generated by the grammar (T, N, R, S) where S is a fresh

nonterminal, N = N

L

∪ N

M

∪ ¦S ¦ and R = R

L

∪ R

M

∪ ¦S → S

L

, S → S

M

¦;

• the language L M is generated by the grammar (T, N, R, S) where S is a fresh

nonterminal, N = N

L

∪ N

M

∪ ¦S ¦ and R = R

L

∪ R

M

∪ ¦S → S

L

S

M

¦;

• the language L

∗

is generated by the grammar (T, N, R, S) where S is a fresh

nonterminal, N = N

L

∪ ¦S ¦ and R = R

L

∪ ¦S → ε, S → S

L

S ¦;

• the language L

+

is generated by the grammar (T, N, R, S) where S is a fresh

nonterminal, N = N

L

∪ ¦S ¦ and R = R

L

∪ ¦S → S

L

, S → S

L

S ¦.

The theorem above establishes that the set-theoretic operations at the level of lan-

guages (i. e., sets of sentences) have a direct counterpart at the level of grammatical

descriptions. A straightforward question to ask is now: can we also deﬁne languages

32

2.7. Constructions on grammars

as the diﬀerence between two languages or as the intersection of two languages, and

translate these operations to operations on grammars? Unfortunately, the answer

is negative – there are no operations on grammars that correspond to the language

intersection and diﬀerence operators.

Two of the above constructions are important enough to actually deﬁne them as

grammar operations. Furthermore, we add a new grammar construction for an “op-

tional grammar”.

Deﬁnition 2.11 (Grammar operations). Let G = (T, N, R, S) be a context-free

grammar and let S

**be a fresh nonterminal. Then
**

G

∗

= (T, N ∪ ¦S

¦, R ∪ ¦S

→ ε, S

→ S S

¦, S

)

G

+

= (T, N ∪ ¦S

¦, R ∪ ¦S

→ S, S

→ S S

¦, S

)

G? = (T, N ∪ ¦S

¦, R ∪ ¦S

→ ε, S

→ S ¦, S

)

The deﬁnition of P?, P

+

, and P

∗

for a sequence of symbols P is very similar to the

deﬁnitions of the operations on grammars. For example, P

∗

denotes zero or more

concatenations of string P, so Dig

∗

denotes the language consisting of zero or more

digits.

Deﬁnition 2.12 (EBNF for sequences). Let P be a sequence of nonterminals and

terminals, then

L(P

∗

) =L(Z) with Z → ε [ PZ

L(P

+

) =L(Z) with Z → P [ PZ

L(P?) =L(Z) with Z → ε [ P

where Z is a new nonterminal in each deﬁnition.

Because the concatenation operator for sequences is associative, the operators

∗

and

+

can also be deﬁned symmetrically:

L(P

∗

) =L(Z) with Z → ε [ ZP

L(P

+

) =L(Z) with Z → P [ ZP

Many variations are possible on this theme:

L(P

∗

Q) =L(Z) with Z → Q [ PZ (2.1)

or also

L(P Q

∗

) =L(Z) with Z → P [ ZQ (2.2)

33

2. Context-Free Grammars

2.7.1. SL: an example

To illustrate EBNF and some of the grammar transformations given in the previous

section, we give a larger example. The following grammar generates expressions in a

very small programming language, called SL.

Expr → if Expr then Expr else Expr

Expr → Expr where Decls

Expr → AppExpr

AppExpr → AppExpr Atomic [ Atomic

Atomic → Var [ Number [ Bool [ ( Expr )

Decls → Decl

Decls → Decls ; Decls

Decl → Var = Expr

where the nonterminals Var, Number, and Bool generate variables, number expres-

sions, and boolean expressions, respectively. Note that the brackets around the Expr

in the production for Atomic, and the semicolon in between the Decls in the second

production for Decls are also terminal symbols. The following ‘program’ is a sentence

of this language:

if true then funny true else false where funny = 7

It is clear that this is not a very convenient language to write programs in.

The above grammar is ambiguous (why?), and we introduce priorities to resolve

some of the ambiguities. Application binds stronger than if, and both application

and if bind stronger then where. Using the “introduction of priorities” grammar

transformation, we obtain:

Expr → Expr

1

Expr → Expr

1

where Decls

Expr

1

→ Expr

2

Expr

1

→ if Expr

1

then Expr

1

else Expr

1

Expr

2

→ Atomic

Expr

2

→ Expr

2

Atomic

where Atomic and Decls have the same productions as before.

The nonterminal Expr

2

is left-recursive. Removing left recursion gives the following

productions for Expr

2

:

Expr

2

→ Atomic [ Atomic Expr

2

Expr

2

→ Atomic [ Atomic Expr

2

34

2.7. Constructions on grammars

Since the new nonterminal Expr

2

has exactly the same productions as Expr

2

, these

productions can be replaced by

Expr

2

→ Atomic [ Atomic Expr

2

So Expr

2

generates a nonempty sequence of atomics. Using the

+

-notation intro-

duced before, we can replace Expr

2

by Atomic

+

.

Another source of ambiguity are the productions for Decls. The nonterminal Decls

generates a nonempty list of declarations, and the separator ; is assumed to be asso-

ciative. Hence we can apply the “associative separator” transformation to obtain

Decls → Decl [ Decls ; Decl

or, according to (2.2),

Decls → Decl (; Decl )

∗

The last grammar transformation we apply is “left factoring”. This transformation

is applied to the productions for Expr, and yields

Expr → Expr

1

Expr

1

Expr

1

→ ε [ where Decls

Since nonterminal Expr

1

generates either nothing or a where clause, we can replace

Expr

1

by an optional where clause in the production for Expr:

Expr → Expr

1

(where Decls)?

After all these grammar transformations, we obtain the following grammar.

Expr → Expr

1

(where Decls)?

Expr

1

→ Atomic

+

Expr

1

→ if Expr

1

then Expr

1

else Expr

1

Atomic → Var [ Number [ Bool [ ( Expr )

Decls → Decl (; Decl )

∗

Exercise 2.26. Give the EBNF notation for each of the basic languages deﬁned in Sec-

tion 2.3.1.

Exercise 2.27. Let G be a grammar G. Give the language that is generated by G? (i.e.,

the ? operation applied to G).

Exercise 2.28. Let

L

1

=¦a

m

b

m

c

n

[ m, n ∈ N¦

L

2

=¦a

m

b

n

c

n

[ m, n ∈ N¦

1. Give grammars for L

1

and L

2

.

2. Is L

1

∩ L

2

context-free, i. e., can you give a context-free grammar for this language?

35

2. Context-Free Grammars

2.8. Parsing

This section formulates the parsing problem, and discusses some of the future topics

of the course.

Deﬁnition 2.13 (Parsing problem). Given the grammar G and a string s, the

parsing problem answers the question whether or not s ∈ L(G). If s ∈ L(G), the

parsing problem

answer to this question may be either a parse tree or a derivation.

This question may not be easy to answer given an arbitrary grammar. Until now we

have only seen simple grammars for which it is relatively easy to determine whether

or not a string is a sentence of the grammar. For more complicated grammars this

may be more diﬃcult. However, in the ﬁrst part of this course we will show how –

given a grammar with certain reasonable properties – we can easily construct parsers

by hand. At the same time we will show how the parsing process can quite often be

combined with the algorithm we actually want to perform on the recognized object

(the semantic function). The techniques we describe comprise a simple, although

surprisingly eﬃcient, introduction into the area of compiler construction.

A compiler for a programming language consists of several parts. Examples of such

parts are a scanner, a parser, a type checker, and a code generator. Usually, a parser

is preceded by a scanner (also called lexer), which splits an input sentence into a list

scanner

lexer

of so-called tokens. For example, given the sentence

if true then funny true else false where funny = 7

a scanner might return the following list of tokens:

["if", "true", "then", "funny", "true",

"else", "false", "where", "funny", "=", "7"]

So a token is a syntactical entity. A scanner usually performs the ﬁrst step towards

token

an abstract syntax: it throws away layout information such as spacing and newlines.

In this course we will concentrate on parsers, but some of the concepts of scanners

will sometimes be used.

In the second part of this course we will take a look at more complicated grammars,

which do not always conform to the restrictions just referred to. By analysing the

grammar we may nevertheless be able to generate parsers as well. Such generated

parsers will be in such a form that it will be clear that writing such parsers by hand

is far from attractive, and actually impossible for all practical cases.

One of the problems we have not referred to yet in this rather formal chapter is of

a more practical nature. Quite often the sentence presented to the parser will not

be a sentence of the language since mistakes were made when typing the sentence.

This raises another interesting question: What are the minimal changes that have

36

2.9. Exercises

to be made to the sentence in order to convert it into a sentence of the language?

It goes almost without saying that this is an important question to be answered in

practice; one would not be very happy with a compiler which, given an erroneous

input, would just reply that the “Input could not be recognised”. One of the most

important aspects here is to deﬁne metric for deciding about the minimality of a

change; humans usually make certain mistakes more often than others. A semicolon

can easily be forgotten, but the chance that an if symbol is missing is far from likely.

This is where grammar engineering starts to play a rˆole.

Summary

Starting from a simple example, the language of palindromes, we have introduced

the concept of a context-free grammar. Associated concepts, such as derivations and

parse trees were introduced.

2.9. Exercises

Exercise 2.29. Do there exist languages L such that (L

∗

) = (L)

∗

?

Exercise 2.30. Give a language L such that L = L

∗

.

Exercise 2.31. Under which circumstances is L

+

= L

∗

−¦ε¦?

Exercise 2.32. Let L be a language over alphabet ¦a, b, c¦ such that L=L

R

. Does L contain

only palindromes?

Exercise 2.33. Consider the grammar with productions

S → AA

A → AAA

A → a

A → bA

A → Ab

1. Which terminal strings can be produced by derivations of four or fewer steps?

2. Give at least two distinct derivations for the string babbab.

3. For any m, n, p 0, describe a derivation of the string b

m

ab

n

ab

p

.

Exercise 2.34. Consider the grammar with productions

S → aaB

A → bBb

A → ε

B → Aa

Show that the string aabbaabba cannot be derived from S.

37

2. Context-Free Grammars

Exercise 2.35. Give a grammar for the language

L =¦ωcω

R

[ ω ∈ ¦a, b¦

∗

¦

This language is known as the center-marked palindromes language. Give a derivation of the

sentence abcba.

Exercise 2.36. Describe the language generated by the grammar:

S → ε

S → A

A → aAb

A → ab

Can you ﬁnd another (preferably simpler) grammar for the same language?

Exercise 2.37. Describe the languages generated by the grammars.

S → ε

S → A

A → Aa

A → a

and

S → ε

S → A

A → AaA

A → a

Can you ﬁnd other (preferably simpler) grammars for the same languages?

Exercise 2.38. Show that the languages generated by the grammars G

1

, G

2

en G

3

are the

same.

G

1

: G

2

: G

3

:

S → ε S → ε S → ε

S → aS S → Sa S → a

S → SS

Exercise 2.39. Consider the following property of grammars:

1. the start symbol is the only nonterminal which may have an empty production (a

production of the form X → ε),

2. the start symbol does not occur in any alternative.

A grammar having this property is called non-contracting. The grammar A → aAb [ ε does

not have this property. Give a non-contracting grammar which describes the same language.

Exercise 2.40. Describe the language L of the grammar

A → AaA[ a

Give a grammar for L that has no left-recursive productions. Give a grammar for L that has

no right-recursive productions.

38

2.9. Exercises

Exercise 2.41. Describe the language L of the grammar

X → a [ Xb

Give a grammar for L that has no left-recursive productions. Give a grammar for L that has

no left-recursive productions and is non-contracting.

Exercise 2.42. Consider the language L of the grammar

S → T [ US

T → aSa [ Ua

U → S [ SUT

Give a grammar for L which uses only productions with two or less symbols on the right

hand side. Give a grammar for L which uses only two nonterminals.

Exercise 2.43. Give a grammar for the language of all sequences of 0’s and 1’s which start

with a 1 and contain exactly one 0.

Exercise 2.44. Give a grammar for the language consisting of all nonempty sequences of

brackets,

¦(, )¦

in which the brackets match. An example sentence of the language is ( ) ( ( ) ) ( ). Give a

derivation for this sentence.

Exercise 2.45. Give a grammar for the language consisting of all nonempty sequences of

two kinds of brackets,

¦(, ), [, ]¦

in which the brackets match. An example sentence in this language is [ ( ) ] ( ).

Exercise 2.46. This exercise shows an example (attributed to Noam Chomsky) of an am-

biguous English sentence. Consider the following grammar for a part of the English language:

Sentence → Subject Predicate .

Subject → they

Predicate → Verb NounPhrase

Predicate → AuxVerb Verb Noun

Verb → are

Verb → flying

AuxVerb → are

NounPhrase → Adjective Noun

Adjective → flying

Noun → planes

Give two diﬀerent leftmost derivations for the sentence

they are flying planes.

39

2. Context-Free Grammars

Exercise 2.47. Try to ﬁnd some ambiguous sentences in your own natural language. Here

are some ambiguous Dutch sentences seen in the newspapers:

Vliegen met hartafwijking niet gevaarlijk

Jessye Norman kan niet zingen

Alcohol is voor vrouwen schadelijker dan mannen

Exercise 2.48 (•). Is your grammar for Exercise 2.45 unambiguous? If not, ﬁnd one which

is unambiguous.

Exercise 2.49. This exercise deals with a grammar that uses unusual terminal and non-

terminal symbols. Assume that , ⊗, and ⊕ are nonterminals, and the other symbols are

terminals.

→ ´⊗

→ ⊗

⊗ → ⊗3⊕

⊗ → ⊕

⊕ → ♣

⊕ → ♠

Find a derivation for the sentence ♣3♣´♠.

Exercise 2.50 (•). Prove, using induction, that the grammar G for palindromes in Sec-

tion 2.2 does indeed generate the language of palindromes.

Exercise 2.51 (••, no answer provided). Prove that the language generated by the grammar

of Exercise 2.33 contains all strings over ¦a, b¦ where the number of a’s is even and greater

than zero.

Exercise 2.52 (no answer provided). Consider the natural numbers in unary notation where

only the symbol I is used; thus 4 is represented as IIII. Write an algorithm that, given a

string w of I’s, determines whether or not w is divisible by 7.

Exercise 2.53 (no answer provided). Consider the natural numbers in reverse binary no-

tation; thus 4 is represented as 001. Write an algorithm that, given a string w of zeros and

ones, determines whether or not w is divisible by 7.

Exercise 2.54 (no answer provided). Let w be a string consisting of a’s and b’s only. Write

an algorithm that determines whether or not the number of a’s in w equals the number of

b’s in w.

40

3. Parser combinators

Introduction

This chapter is an informal introduction to writing parsers in a lazy functional lan-

guage using ‘parser combinators’. Parsers can be written using a small set of basic

parsing functions, and a number of functions that combine parsers into more com-

plicated parsers. The functions that combine parsers are called parser combinators.

The basic parsing functions do not combine parsers, and are therefore not parser

combinators in this sense, but they are usually also called parser combinators.

Parser combinators are used to write parsers that are very similar to the grammar of

a language. Thus writing a parser amounts to translating a grammar to a functional

program, which is often a simple task.

Parser combinators are built by means of standard functional language constructs

like higher-order functions, lists, and datatypes. List comprehensions are used in a

few places, but they are not essential, and could easily be rephrased using the map,

ﬁlter and concat functions. Type classes are only used for overloading the equality

and arithmetic operators.

We will start by motivating the deﬁnition of the type of parser functions. Using that

type, we can build parsers for the language of (possibly ambiguous) grammars. Next,

we will introduce some elementary parsers that can be used for parsing the terminal

symbols of a language.

In Section 3.3 the ﬁrst parser combinators are introduced, which can be used for se-

quentially and alternatively combining parsers, and for calculating so-called semantic

functions during the parse. Semantic functions are used to give meaning to syntactic

structures. As an example, we construct a parser for strings of matching paren-

theses in Section 3.3.1. Diﬀerent semantic values are calculated for the matching

parentheses: a tree describing the structure, and an integer indicating the nesting

depth.

In Section 3.4 we introduce some new parser combinators. Not only do these make life

easier later, but their deﬁnitions are also nice examples of using parser combinators.

A real application is given in Section 3.5, where a parser for arithmetical expressions

is developed. Finally, the expression parser is generalised to expressions with an

arbitrary number of precedence levels. This is done without coding the priorities of

operators as integers, and we will avoid using indices and ellipses.

41

3. Parser combinators

It is not always possible to directly construct a parser from a context-free grammar

using parser combinators. If the grammar is left-recursive, it has to be transformed

into a non left-recursive grammar before we can construct a combinator parser. An-

other limitation of the parser combinator technique as described in this chapter is

that it is not trivial to write parsers for complex grammars that perform reasonably

eﬃcient. However, there do exist implementations of parser combinators that per-

form remarkably well, see [10, 12]. For example, there exist good parsers using parser

combinators for the Haskell language.

Most of the techniques introduced in this chapter have been described by Burge [4],

Wadler [13] and Hutton [7].

This chapter is a revised version of an article by Fokker [5].

Goals

This chapter introduces the ﬁrst programs for parsing in these lecture notes. Parsers

are composed from simple parsers by means of parser combinators. Hence, important

primary goals of this chapter are:

• to understand how to parse, i. e., how to recognise structure in a sequence of

symbols, by means of parser combinators;

• to be able to construct a parser when given a grammar;

• to understand the concept of semantic functions.

Two secondary goals of this chapter are:

• to develop the capability to abstract;

• to understand the concept of domain speciﬁc language.

Required prior knowledge

To understand this chapter, you should be able to formulate the parsing problem,

and you should understand the concept of context-free grammar. Furthermore, you

should be familiar with functional programming concepts such as type, class, and

higher-order functions.

42

3.1. The type of parsers

3.1. The type of parsers

The goals of this section are:

• develop a type Parser that is used to give the type of parsing functions;

• show how to obtain this type by means of several abstraction steps.

The parsing problem is (see Section 2.8): Given a grammar G and a string s, de-

termine whether or not s ∈ L(G). If s ∈ L(G), the answer to this question may

be either a parse tree or a derivation. For example, in Section 2.4 we have seen a

grammar for sequences of s’s:

S → SS [ s

A parse tree of an expression of this language is a value of the datatype S (or a value

of several variants of that type, see Section 2.6, depending on what you want to do

with the result of the parser), which is deﬁned by

data S = Beside S S [ Single

A parser for expressions could be implemented as a function of the following type:

type Parser = String → S — preliminary

For parsing substructures, a parser can call other parsers, or call itself recursively.

These calls do not only have to communicate their result, but also the part of the

input string that is left unprocessed. For example, when parsing the string sss, a

parser will ﬁrst build a parse tree Beside Single Single for ss, and only then build a

complete parse tree

Beside (Beside Single Single) Single

using the unprocessed part s of the input string. As this cannot be done using a global

variable, the unprocessed input string has to be part of the result of the parser. The

two results can be paired. A better deﬁnition for the type Parser is hence:

type Parser = String → (S, String) — still preliminary

Any parser of type Parser returns an S and a String. However, for diﬀerent grammars

we want to return diﬀerent parse trees: the type of tree that is returned depends on

the grammar for which we want to parse sentences. Therefore it is better to abstract

from the type S, and to turn the parser type into a polymorphic type. The type

Parser is parametrised with a type a, which represents the type of parse trees.

type Parser a = String → (a, String) — still preliminary

For example, a parser that returns a structure of type Oak (whatever that is) now

has type Parser Oak. A parser that parses sequences of s’s has type Parser S.

43

3. Parser combinators

We might also deﬁne a parser that does not return a value of type S, but instead

the number of s’s in the input sequence. This parser would have type Parser Int.

Another instance of a parser is a parse function that recognises a string of digits, and

returns the number represented by it as a parse ‘tree’. In this case the function is

also of type Parser Int. Finally, a recogniser that either accepts or rejects sentences

of a grammar returns a boolean value, and will have type Parser Bool .

Until now, we have assumed that every string can be parsed in exactly one way. In

general, this need not be the case: it may be that a single string can be parsed in

various ways, or that there is no way to parse a string. For example, the string "sss"

has the following two parse trees:

Beside (Beside Single Single) Single

Beside Single (Beside Single Single)

As another reﬁnement of the type Parser, instead of returning one parse tree (and

its associated rest string), we let a parser return a list of trees. Each element of the

result consists of a tree, paired with the rest string that was left unprocessed after

parsing. The type deﬁnition of Parser therefore becomes:

type Parser a = String → [(a, String)] — useful, but still suboptimal

If there is just one parse, the result of the parse function is a singleton list. If no

parse is possible, the result is an empty list. In case of an ambiguous grammar, the

result consists of all possible parses.

This method for parsing is called the list of successes method, described by Wadler [13].

list of successes

It can be used in situations where in other languages you would use so-called back-

tracking techniques. In the Bird and Wadler textbook it is used to solve combinatorial

problems like the eight queens problem [3]. If only one solution is required rather

than all possible solutions, you can take the head of the list of successes. Thanks

to lazy evaluation, not all elements of the list are computed if only the ﬁrst value is

lazy evaluation

needed, so there will be no loss of eﬃciency. Lazy evaluation provides a backtracking

approach to ﬁnding the ﬁrst solution.

Parsers with the type described so far operate on strings, that is lists of characters.

There is however no reason for not allowing parsing strings of elements other than

characters. You may imagine a situation in which a preprocessor prepares a list of

tokens (see Section 2.8), which is subsequently parsed. To cater for this situation we

reﬁne the parser type once more: we let the type of the elements of the input string

be an argument of the parser type. Calling the type of symbols s, and as before the

result type a, the type of parsers becomes

type Parser s a = [s] → [(a, [s])]

or, if you prefer meaningful identiﬁers over conciseness:

44

3.2. Elementary parsers

— The type of parsers

type Parser symbol result = [symbol ] → [(result, [symbol ])]

Listing 3.1: ParserType.hs

type Parser symbol result = [symbol ] → [(result, [symbol ])]

We will use this type deﬁnition in the rest of this chapter. This type is deﬁned in

Listing 3.1, the ﬁrst part of our parser library. The list of successes appears in result

type of a parser. Each element of this list is a possible parsing of (an initial part of)

the input. We will hardly use the full generality provided by the Parser type: the

type of the input s (or symbol ) will almost always be Char.

3.2. Elementary parsers

The goals of this section are:

• introduce some very simple parsers for parsing sentences of grammars with rules

of the form:

A → ε

A → a

A → x

where x is a sequence of terminals;

• show how one can construct useful functions from simple, trivially correct func-

tions by means of generalisation and partial parametrisation.

This section deﬁnes parsers that can only be used to parse ﬁxed sequences of terminal

symbols. For a grammar with a production that contains nonterminals in its right-

hand side we need techniques that will be introduced in the following section.

We will start with a very simple parse function that just recognises the terminal

symbol a. The type of the input string symbols is Char in this case, and as a parse

‘tree’ we also simply use a Char:

symbola :: Parser Char Char

symbola [] = []

symbola (x : xs) [ x = = ’a’ = [(’a’, xs)]

[ otherwise = []

45

3. Parser combinators

— Elementary parsers

symbol :: Eq s ⇒ s → Parser s s

symbol a [] = []

symbol a (x : xs) [ x = = a = [(x, xs)]

[ otherwise = []

satisfy :: (s → Bool ) → Parser s s

satisfy p [] = []

satisfy p (x : xs) [ p x = [(x, xs)]

[ otherwise = []

token :: Eq s ⇒ [s] → Parser s [s]

token k xs [ k = = take n xs = [(k, drop n xs)]

[ otherwise = []

where n = length k

failp :: Parser s a

failp xs = []

succeed :: a → Parser s a

succeed r xs = [(r, xs)]

— Applications of elementary parsers

digit :: Parser Char Char

digit = satisfy isDigit

Listing 3.2: ParserType.hs

The list of successes method immediately pays oﬀ, because now we can return an

empty list if no parsing is possible (because the input is empty, or does not start

with an a).

In the same fashion, we can write parsers that recognise other symbols. As always,

rather than deﬁning a lot of closely related functions, it is better to abstract from the

symbol to be recognised by making it an extra argument of the function. Further-

more, the function can operate on lists of characters, but also on lists of symbols of

other types, so that it can be used in other applications than character oriented ones.

The only prerequisite is that the symbols to be parsed can be tested for equality. In

Haskell, this is indicated by the Eq predicate in the type of the function.

Using these generalisations, we obtain the function symbol that is given in Listing 3.2.

The function symbol is a function that, given a symbol, returns a parser for that

symbol. A parser on its turn is a function too. This is why two arguments appear in

the deﬁnition of symbol .

46

3.2. Elementary parsers

We will now deﬁne some elementary parsers that can do the work traditionally taken

care of by lexical analysers (see Section 2.8). For example, a useful parser is one

that recognises a ﬁxed string of symbols, such as while or switch. We will call this

function token; it is deﬁned in Listing 3.2. As in the case of the symbol function we

have parametrised this function with the string to be recognised, eﬀectively making

it into a family of functions. Of course, this function is not conﬁned to strings of

characters. However, we do need an equality test on the type of values in the input

string; the type of token is:

token :: Eq s ⇒ [s] → Parser s [s]

The function token is a generalisation of the symbol function, in that it recognises

a list of symbols instead of a single symbol. Note that we cannot deﬁne symbol in

terms of token: the two functions have incompatible types.

Another generalisation of symbol is a function which may, depending on the in-

put, return diﬀerent parse results. Instead of specifying a speciﬁc symbol, we can

parametrise the function with a condition that the symbol should fulﬁll. Thus the

function satisfy has a function s → Bool as argument. Where symbol tests for equal-

ity to a speciﬁc value, the function satisfy tests for compliance with this predicate.

It is deﬁned in Listing 3.2. This generalised function is for example useful when we

want to parse digits (characters in between ’0’ and ’9’):

digit :: Parser Char Char

digit = satisfy isDigit

where the function isDigit is the standard predicate that tests whether or not a

character is a digit:

isDigit :: Char → Bool

isDigit x = ’0’ x ∧ x ’9’

In books on grammar theory an empty string is often called ‘ε’. In this tradition,

we will deﬁne a function epsilon that ‘parses’ the empty string. It does not consume

any input, and hence always returns an empty parse tree and unmodiﬁed input. A

zero-tuple can be used as a result value: () is the only value of the type () – both the

type and the value are pronounced unit.

unit type

epsilon :: Parser s ()

epsilon xs = [((), xs)]

A more useful variant is the function succeed, which also doesn’t consume input,

but always returns a given, ﬁxed value (or ‘parse tree’, if you can call the result of

processing zero symbols a parse tree). It is deﬁned in Listing 3.2.

Dual to the function succeed is the function failp, which fails to recognise any symbol

on the input string. As the result list of a parser is a ‘list of successes’, and in the

47

3. Parser combinators

case of failure there are no successes, the result list should be empty. Therefore the

function failp always returns the empty list of successes. It is deﬁned in Listing 3.2.

Note the diﬀerence with epsilon, which does have one element in its list of successes

(albeit an empty one).

Do not confuse failp with epsilon: there is an important diﬀerence between returning

one solution (which contains the unchanged input as ‘rest’ string) and not returning

a solution at all!

Exercise 3.1. Deﬁne a function capital :: Parser Char Char that parses capital letters.

Exercise 3.2. Since satisfy is a generalisation of symbol , the function symbol can be deﬁned

as an instance of satisfy. How can this be done?

Exercise 3.3. Deﬁne the function epsilon using succeed.

3.3. Parser combinators

Using the elementary parsers from the previous section, parsers can be constructed

for terminal symbols from a grammar. More interesting are parsers for nonterminal

symbols. It is convenient to construct these parsers by partially parametrising higher-

order functions.

The goals of this section are:

• show how parsers can be constructed directly from the productions of a gram-

mar. The kind of productions for which parsers will be constructed are

A → x [ y

A → x y

where x and y are sequences of nonterminal or terminal symbols;

• show how we can construct a small, powerful combinator language (a domain

speciﬁc language) for the purpose of parsing;

• understand and use the concept of semantic functions in parsers.

Let us have a look at the grammar for expressions again, see also Section 2.5:

E → T + E [ T

T → F * T [ F

F → Digs [ ( E )

where Digs is a nonterminal that generates the language of sequences of digits, see

Section 2.3.1. An expression can be parsed according to any of the two rules for E.

This implies that we want to have a way to say that a parser consists of several alter-

native parsers. Furthermore, the ﬁrst rule says that in order to parse an expression,

48

3.3. Parser combinators

we should ﬁrst parse a term, then a terminal symbol +, and then an expression. This

implies that we want to have a way to say that a parser consists of several parsers

that are applied sequentially.

So important operations on parsers are sequential and alternative composition: a

more complex construct can consist of a simple construct followed by another con-

struct (sequential composition), or by a choice between two constructs (alternative

composition). These operations correspond directly to their grammatical counter-

parts. We will develop two functions for this, which for notational convenience are

deﬁned as operators: <∗> for sequential composition, and <[> for alternative com-

position. The names of these operators are chosen so that they can be easily remem-

bered: <∗> ‘multiplies’ two constructs together, and <[> can be pronounced as ‘or’.

Be careful, though, not to confuse the <[>-operator with Haskell’s built-in construct

[, which is used to distinguish cases in a function deﬁnition or as a separator for the

constructors of a datatype.

Priorities of these operators are deﬁned so as to minimise parentheses in practical

situations:

inﬁxl 6 <∗>

inﬁxr 4 <[>

So <∗> has a higher priority – i. e., it binds stronger – than <[>.

Both operators take two parsers as argument, and return a parser as result. By

again combining the result with other parsers, you may construct even more involved

parsers.

In the deﬁnitions in Listing 3.3, the functions operate on parsers p and q. Apart

from the arguments p and q, the function operates on a string, which can be thought

of as the string that is parsed by the parser that is the result of combining p and

q.

We start with the deﬁnition of operator <∗>. For sequential composition, p must be

applied to the input ﬁrst. After that, q is applied to the rest string of the result. The

ﬁrst parser, p, returns a list of successes, each of which contains a value and a rest

string. The second parser, q, should be applied to the rest string, returning a second

value. Therefore we use a list comprehension, in which the second parser is applied

in all possible ways to the rest string of the ﬁrst parser:

(p <∗>q) xs = [ (combine r

1

r

2

, zs)

[ (r

1

, ys) ← p xs

, (r

2

, zs) ← q ys

]

The rest string of the parser for the sequential composition of p and q is whatever

the second parser q leaves behind as rest string.

49

3. Parser combinators

— Parser combinators

(<[>) :: Parser s a → Parser s a → Parser s a

(p <[>q) xs = p xs ++ q xs

(<∗>) :: Parser s (b → a) → Parser s b → Parser s a

(p <∗>q) xs = [ (f x, zs )

[ (f , ys) ← p xs

, ( x, zs ) ← q ys

]

(<$>) :: (a → b) → Parser s a → Parser s b

(f <$>p) xs = [ (f y, ys)

[ ( y, ys) ← p xs

]

— Applications of parser combinators

newdigit :: Parser Char Int

newdigit = f <$>digit

where f c = ord c −ord ’0’

Listing 3.3: ParserCombinators.hs

Now, how should the results of the two parsings be combined? We could, of course,

parametrise the whole thing with an operator that describes how to combine the

parts (as is done in the zipWith function). However, we have chosen for a diﬀerent

approach, which nicely exploits the ability of functional languages to manipulate

functions. The function combine should combine the results of the two parse trees

recognised by p and q. In the past, we have interpreted the word ‘tree’ liberally:

simple values, like characters, may also be used as a parse ‘tree’. We will now also

accept functions as parse trees. That is, the result type of a parser may be a function

type.

If the ﬁrst parser that is combined by <∗> would return a function of type b → a,

and the second parser a value of type b, a straightforward choice for the combine

function would be function application. That is exactly the approach taken in the

deﬁnition of <∗> in Listing 3.3. The ﬁrst parser returns a function, the second parser

a value, and the combined parser returns the value that is obtained by applying the

function to the value.

Apart from ‘sequential composition’ we need a parser combinator for representing

‘choice’. For this, we have the parser combinator operator <[>. Thanks to the list

of successes method, both p

1

and p

2

return lists of possible parsings. To obtain all

possible parsings when applying p

1

or p

2

, we only need to concatenate these two

lists.

50

3.3. Parser combinators

By combining parsers with parser combinators we can construct new parsers. The

most important parser combinators are <∗> and <[>. The parser combinator <,>

in exercise 3.12 is just a variation of <∗>.

Sometimes we are not quite satisﬁed with the result value of a parser. The parser

might work well in that it consumes symbols from the input adequately (leaving the

unused symbols as rest-string in the tuples in the list of successes), but the result

value might need some postprocessing. For example, a parser that recognises one digit

is deﬁned using the function satisfy: digit = satisfy isDigit. In some applications,

we may need a parser that recognises one digit character, but returns the result

as an integer, instead of a character. In a case like this, we can use a new parser

combinator: <$>. It takes a function and a parser as argument; the result is a parser

that recognises the same string as the original parser, but ‘postprocesses’ the result

using the function. We use the $ sign in the name of the combinator, because the

combinator resembles the operator that is used for normal function application in

Haskell: f $ x = f x. The deﬁnition of <$> is given in Listing 3.3. It is an inﬁx

operator:

inﬁxl 7 <$>

Using this postprocessing parser combinator, we can modify the parser digit that was

deﬁned above:

newdigit :: Parser Char Int

newdigit = f <$>digit

where f c = ord c −ord ’0’

The auxiliary function f determines the ordinal number of a digit character; using

the parser combinator <$> it is applied to the result part of the digit parser.

In practice, the <$> operator is used to build a certain value during parsing (in the

case of parsing a computer program this value may be the generated code, or a list

of all variables with their types, etc.). Put more generally: using <$> we can add

semantic functions to parsers.

A parser for the SequenceOfS grammar that returns the abstract syntax tree of the

input, i.e., a value of type S, see Section 2.6, is deﬁned as follows:

sequenceOfS :: Parser Char S

sequenceOfS = Beside <$>sequenceOfS <∗>sequenceOfS

<[>const Single <$>symbol ’s’

But if you try to run this function, you will get a stack overﬂow! If you apply

sequenceOfS to a string, the ﬁrst thing it does is to apply itself to the same string,

which loops. The problem stems from the fact that the underlying grammar is left-

recursive. For any left-recursive grammar, a systematically constructed parser using

parser combinators will exhibit the problem that it loops. However, in Section 2.5

51

3. Parser combinators

we have shown how to remove the left recursion in the SequenceOfS grammar. The

resulting grammar is used to obtain the following parser:

sequenceOfS

:: Parser Char SA

sequenceOfS

=

const ConsS <$>symbol ’s’ <∗>parseZ

<[>const SingleS <$>symbol ’s’

where parseZ = ConsZ <$>sequenceOfS

<∗>parseZ

<[>SingleZ <$>sequenceOfS

**This example is a direct translation of the grammar obtained by using the remov-
**

ing left recursion grammar transformation. There exists a much simpler parser for

parsing sequences of s’s.

Exercise 3.4. Prove for all f :: a → b that

f <$>succeed a = succeed (f a)

In the sequel we will often use this rule for constant functions f , i. e., f = λ → c for some

term c.

Exercise 3.5. Consider the parser (:) <$> symbol ’a’. Give its type and show its results

on inputs [] and x : xs.

Exercise 3.6. Consider the parser (:) <$> symbol ’a’ <∗> p. Give its type and show its

results on inputs [] and x : xs.

Exercise 3.7. Deﬁne a parser for Booleans.

Exercise 3.8 (no answer provivded). Deﬁne parsers for each of the basic languages deﬁned

in Section 2.3.1.

Exercise 3.9. Consider the grammar for palindromes that you have constructed in Exer-

cise 2.7.

1. Give the datatype Pal

2

that corresponds to this grammar.

2. Deﬁne a parser palin

2

that returns parse trees for palindromes. Test your function

with the palindromes cPal

1

= "abaaba" and cPal

2

= "baaab". Compare the results

with your answer to Exercise 2.21.

3. Deﬁne a parser palina that counts the number of a’s occurring in a palindrome.

Exercise 3.10. Consider the grammar for a part of the English language that is given in

Exercise 2.46.

1. Give the datatype English that corresponds to this grammar.

2. Deﬁne a parser english that returns parse trees for the English language. Test your

function with the sentence they are flying planes. Compare the result to your

answer of Exercise 2.46.

52

3.3. Parser combinators

Exercise 3.11. When deﬁning the priority of the <[> operator with the inﬁxr keyword,

we also speciﬁed that the operator associates to the right. Why is this a better choice than

association to the left?

Exercise 3.12. Deﬁne a parser combinator <,> that combines two parsers. The value

returned by the combined parser is a tuple containing the results of the two component

parsers. What is the type of this parser combinator?

Exercise 3.13. The term ‘parser combinator’ is in fact not an adequate description for <$>.

Can you think of a better word?

Exercise 3.14. Compare the type of <$> with the type of the standard function map. Can

you describe your observations in an easy-to-remember, catchy phrase?

Exercise 3.15. Deﬁne <∗> in terms of <,> and <$>. Deﬁne <,> in terms of <∗> and <$>.

Exercise 3.16. If you examine the deﬁnitions of <∗> and <$> in Listing 3.3, you can observe

that <$> is in a sense a special case of <∗>. Can you deﬁne <$> in terms of <∗>?

3.3.1. Matching parentheses: an example

Using parser combinators, it is often fairly straightforward to construct a parser for

a language for which you have a grammar. Consider, for example, the grammar that

you wrote in Exercise 2.44:

S → ( S ) S [ ε

This grammar can be directly translated to a parser, using the parser combinators

<∗> and <[>. We use <∗> when symbols are written next to each other, and <[>

when [ appears in a production (or when there is more than one production for a

nonterminal).

parens :: Parser Char ??? — ill-typed

parens = symbol ’(’ <∗>parens <∗>symbol ’)’ <∗>parens

<[>epsilon

However, this function is not correctly typed: the parsers in the ﬁrst alternative

cannot be composed using <∗>, as for example symbol ’(’ is not a parser returning

a function.

But we can postprocess the parser symbol ’(’ so that, instead of a character, this

parser does return a function. So, what function should we use? This depends on the

kind of value that we want as a result of the parser. A nice result would be a tree-

like description of the parentheses that are parsed. For this purpose we introduce

an abstract syntax, see Section 2.6, for the parentheses grammar. We obtain the

following Haskell datatype:

data Parentheses = Match Parentheses Parentheses

[ Empty

53

3. Parser combinators

data Parentheses = Match Parentheses Parentheses

[ Empty

deriving Show

open = symbol ’(’

close = symbol ’)’

parens :: Parser Char Parentheses

parens = (λ x y → Match x y)

<$>open <∗>parens <∗>close <∗>parens

<[>succeed Empty

nesting :: Parser Char Int

nesting = (λ x y → max (1 + x) y)

<$>open <∗>nesting <∗>close <∗>nesting

<[>succeed 0

Listing 3.4: ParseParentheses.hs

For example, the sentence ()() is represented by

Match Empty (Match Empty Empty)

Suppose we want to calculate the number of parentheses in a sentence. The number

of parentheses is calculated by the function nrofpars, which is deﬁned by induction

on the datatype Parentheses.

nrofpars :: Parentheses → Int

nrofpars (Match pl pr) = 2 + nrofpars pl + nrofpars pr

nrofpars Empty = 0

Using the datatype Parentheses, we can add ‘semantic functions’ to the parser. We

then obtain the deﬁnition of parens in Listing 3.4.

By varying the function used as a ﬁrst argument of <$> (the ‘semantic function’), we

can return other things than parse trees. As an example we construct a parser that

calculates the nesting depth of nested parentheses, see the function nesting deﬁned

in Listing 3.4.

A session in which nesting is used may look like this:

? nesting "()(())()"

[(2,[]), (2,"()"), (1,"(())()"), (0,"()(())()")]

? nesting "())"

[(1,")"), (0,"())")]

54

3.4. More parser combinators

As you can see, when there is a syntax error in the argument, there are no solutions

with empty rest string. It is fairly simple to test whether a given string belongs to

the language that is parsed by a given parser.

Exercise 3.17. What is the type of the function f =λ x y → Match x y which appears in

function parens in Listing 3.4? What is the type of the parser open? Using the type of <$>,

what is the type of f <$>open? Can f <$>open be used as a left hand side of <∗>parens?

What is the type of the result?

Exercise 3.18. What is a convenient way for <∗> to associate? Does it?

Exercise 3.19. Write a function test that determines whether or not a given string belongs

to the language parsed by a given parser.

3.4. More parser combinators

In principle you can build parsers for any context-free language using the combina-

tors <∗> and <[>, but in practice it is easier to have some more parser combinators

available. In traditional grammar formalisms, additional symbols are used to describe

for example optional or repeated constructions. Consider for example the BNF for-

malism, in which originally only sequential and alternative composition can be used

(denoted by juxtaposition and vertical bars, respectively), but which was later ex-

tended to EBNF to also allow for repetition, denoted by a star. The goal of this

section is to show how the set of parser combinators can be extended.

3.4.1. Parser combinators for EBNF

It is very easy to make new parser combinators for EBNF. As a ﬁrst example we

consider repetition. Given a parser p for a construction, many p constructs a parser

for zero or more occurrences of that construction:

many :: Parser s a → Parser s [a]

many p = (:) <$>p <∗>many p

<[>succeed []

So the EBNF expression P

∗

is implemented by many P. The function (:) is just the

cons-operator for lists: it takes a head element and a tail list and combines them.

The order in which the alternatives are given only inﬂuences the order in which

solutions are placed in the list of successes.

For example, the many combinator can be used in parsing a natural number:

natural :: Parser Char Int

natural = foldl f 0 <$>many newdigit

where f a b = a ∗ 10 + b

55

3. Parser combinators

— EBNF parser combinators

option :: Parser s a → a → Parser s a

option p d = p <[>succeed d

many :: Parser s a → Parser s [a]

many p = (:) <$>p <∗>many p <[>succeed []

many

1

:: Parser s a → Parser s [a]

many

1

p = (:) <$>p <∗>many p

pack :: Parser s a → Parser s b → Parser s c → Parser s b

pack p r q = (λ x → x) <$>p <∗>r <∗>q

listOf :: Parser s a → Parser s b → Parser s [a]

listOf p s = (:) <$>p <∗>many ((λ x → x) <$>s <∗>p)

— Auxiliary functions

ﬁrst :: Parser s b → Parser s b

ﬁrst p xs [ null r = []

[ otherwise = [head r]

where r = p xs

greedy, greedy

1

:: Parser s b → Parser s [b]

greedy = ﬁrst . many

greedy

1

= ﬁrst . many

1

Listing 3.5: EBNF.hs

56

3.4. More parser combinators

Deﬁned in this way, the natural parser also accepts empty input as a number. If

this is not desired, we had better use the many

1

parser combinator, which accepts

one or more occurrences of a construction, and corresponds to the EBNF expression

P

+

, see Section 2.7. It is deﬁned in Listing 3.5. Another combinator from EBNF

is the option combinator P?. It takes a parser as argument, and returns a parser

that recognises the same construct, but which also succeeds if that construct is not

present in the input string. The deﬁnition is given in Listing 3.5. It has an additional

argument: the value that should be used as result in case the construct is not present.

It is a kind of ‘default’ value.

By the use of the option and many functions, a large amount of backtracking possi-

bilities are introduced. This is not always advantageous. For example, if we deﬁne a

parser for identiﬁers by

identiﬁer = many

1

(satisfy isAlpha)

a single word may also be parsed as two identiﬁers. Caused by the order of the

alternatives in the deﬁnition of many (succeed [] appears as the second alternative),

the ‘greedy’ parsing, which accumulates as many letters as possible in the identiﬁer

is tried ﬁrst, but if parsing fails elsewhere in the sentence, also less greedy parsings

of the identiﬁer are tried – in vain. You will give a better deﬁnition of identiﬁer in

Exercise 3.27.

In situations where from the way the grammar is built we can predict that it is

hopeless to try non-greedy results of many, we can deﬁne a parser transformer ﬁrst,

that transforms a parser into a parser that only returns the ﬁrst possible parsing. It

does so by taking the ﬁrst element of the list of successes.

ﬁrst :: Parser a b → Parser a b

ﬁrst p xs [ null r = []

[ otherwise = [head r]

where r = p xs

Using this function, we can create a special ‘take all or nothing’ version of many:

greedy = ﬁrst . many

greedy

1

= ﬁrst . many

1

If we compose the ﬁrst function with the option parser combinator:

obligatory p d = ﬁrst (option p d)

we get a parser which must accept a construction if it is present, but which does not

fail if it is not present.

57

3. Parser combinators

3.4.2. Separators

The combinators many, many

1

and option are classical in compiler constructions

– there are notations for it in EBNF (

∗

,

+

and ?, respectively) –, but there is no

need to leave it at that. For example, in many languages constructions are frequently

enclosed between two meaningless symbols, most often some sort of parentheses. For

this case we design a parser combinator pack. Given a parser for an opening token,

a body, and a closing token, it constructs a parser for the enclosed body, as deﬁned

in Listing 3.5. Special cases of this combinator are:

parenthesised p = pack (symbol ’(’) p (symbol ’)’)

bracketed p = pack (symbol ’[’) p (symbol ’]’)

compound p = pack (token "begin") p (token "end")

Another frequently occurring construction is repetition of a certain construction,

where the elements are separated by some symbol. You may think of lists of ar-

guments (expressions separated by commas), or compound statements (statements

separated by semicolons). For the parse trees, the separators are of no importance.

The function listOf below generates a parser for a non-empty list, given a parser for

the items and a parser for the separators:

listOf :: Parser s a → Parser s b → Parser s [a]

listOf p s = (:) <$>p <∗>many ((λ x → x) <$>s <∗>p)

Useful instantiations are:

commaList, semicList :: Parser Char a → Parser Char [a]

commaList p = listOf p (symbol ’,’)

semicList p = listOf p (symbol ’;’)

A somewhat more complicated variant of the function listOf is the case where the

separators carry a meaning themselves. For example, in arithmetical expressions,

where the operators that separate the subexpressions have to be part of the parse

tree. For this case we will develop the functions chainr and chainl . These functions

expect that the parser for the separators returns a function (!); that function is used

by chain to combine parse trees for the items. In the case of chainr the operator is

applied right-to-left, in the case of chainl it is applied left-to-right. The functions

chainr and chainl are deﬁned in Listing 3.6 (remember that $ is function application:

f $ x = f x).

The deﬁnitions look quite complicated, but when you look at the underlying grammar

they are quite straightforward. Suppose we apply operator ⊕ (⊕ is an operator

variable, it denotes an arbitrary right-associative operator) from right to left, so

e

1

⊕e

2

⊕e

3

⊕e

4

=

58

3.4. More parser combinators

— Chain expression combinators

chainr :: Parser s a → Parser s (a → a → a) → Parser s a

chainr pe po = h <$>many (j <$>pe <∗>po) <∗>pe

where j x op = (x‘op‘)

h fs x = foldr ($) x fs

chainl :: Parser s a → Parser s (a → a → a) → Parser s a

chainl pe po = h <$>pe <∗>many (j <$>po <∗>pe)

where j op x = (‘op‘x)

h x fs = foldl (ﬂip ($)) x fs

Listing 3.6: Chains.hs

e

1

⊕(e

2

⊕(e

3

⊕e

4

))

=

((e

1

⊕) (e

2

⊕) (e

3

⊕)) e

4

It follows that we can parse such expressions by parsing many pairs of expressions

and operators, turning them into functions, and applying all those functions to the

last expression. This is done by function chainr, see Listing 3.6.

If operator ⊕ is applied from left to right, then

e

1

⊕e

2

⊕e

3

⊕e

4

=

((e

1

⊕e

2

) ⊕e

3

) ⊕e

4

=

((⊕e

4

) (⊕e

3

) (⊕e

2

)) e

1

So such an expression can be parsed by ﬁrst parsing a single expression (e

1

), and then

parsing many pairs of operators and expressions, turning them into functions, and

applying all those functions to the ﬁrst expression. This is done by function chainl ,

see Listing 3.6.

Functions chainl and chainr can be made more eﬃcient by avoiding the construction

of the intermediate list of functions. The resulting deﬁnitions can be found in the

article by Fokker [5].

Note that functions chainl and chainr are very similar, the only diﬀerence is that

everything is ‘turned around’: function j of chainr takes a value and an operator,

and returns the function obtained by ‘left’ applying the operator; function j of chainl

takes an operator and a value, and returns the function obtained by ‘right’ applying

the operator to the value. Such functions are sometimes called dual .

dual

59

3. Parser combinators

Exercise 3.20.

1. Deﬁne a parser that analyses a string and recognises a list of digits separated by a

space character. The result is a list of integers.

2. Deﬁne a parser sumParser that recognises digits separated by the character ’+’ and

returns the sum of these integers.

3. Both parsers return a list of solutions. What should be changed in order to get only

one solution?

Exercise 3.21. What is the value of

many (symbol ’a’) xs

for xs ∈ ¦[], [’a’], [’b’], [’a’, ’b’], [’a’, ’a’, ’b’] ¦?

Exercise 3.22. Consider the application of the parser many (symbol ’a’) to the string aaa.

In what order do the four possible parsings appear in the list of successes?

Exercise 3.23 (no answer provided). Using the parser combinators option, many and many

1

deﬁne parsers for each of the basic languages deﬁned in Section 2.3.1.

Exercise 3.24. As another variation on the theme ‘repetition’, deﬁne a parser combinator

psequence that transforms a list of parsers for some type into a parser returning a list of

elements of that type. What is the type of psequence? Also deﬁne a combinator choice that

iterates the operator <[>.

Exercise 3.25. As an application of psequence, deﬁne the function token that was discussed

in Section 3.2.

Exercise 3.26 (no answer provided). Carefully analyse the semantic functions in the deﬁ-

nition of chainl in Listing 3.6.

Exercise 3.27. In real programming languages, identiﬁers follow rather ﬂexible rules: the

ﬁrst symbol must be a letter, but the symbols that follow (if any) may be a letter, digit, or

underscore symbol. Deﬁne a more realistic parser identiﬁer.

3.5. Arithmetical expressions

The goal of this section is to use parser combinators in a concrete application. We

will develop a parser for arithmetical expressions, which have the following concrete

syntax:

E → E + E

[ E - E

[ E / E

[ ( E )

[ Digs

60

3.5. Arithmetical expressions

Besides these productions, we also have productions for identiﬁers and applications

of functions:

E → Identiﬁer

[ Identiﬁer ( Args )

Args → ε [ E (, E)

∗

The parse trees for this grammar are of type Expr:

data Expr = Con Int

[ Var String

[ Fun String [Expr]

[ Expr :+: Expr

[ Expr :−: Expr

[ Expr :∗: Expr

[ Expr :/: Expr

You can almost recognise the structure of the parser in this type deﬁnition. But in

order to account for the priorities of the operators, we will use a grammar with three

non-terminals ‘expression’, ‘term’ and ‘factor’: an expression is composed of terms

separated by + or −; a term is composed of factors separated by ∗ or /, and a factor

is a constant, variable, function call, or expression between parentheses.

This grammar appears as a parser in the functions in Listing 3.7.

The ﬁrst parser, fact, parses factors.

fact :: Parser Char Expr

fact = Con <$>integer

<[>Var <$>identiﬁer

<[>Fun <$>identiﬁer <∗>parenthesised (commaList expr)

<[>parenthesised expr

The ﬁrst alternative is an integer parser which is postprocessed by the ‘semantic

function’ Con. The second and third alternative are a variable or function call,

depending on the presence of an argument list. In absence of the latter, the function

Var is applied, in presence the function Fun. For the fourth alternative there is no

semantic function, because the meaning of an expression between parentheses is the

meaning of the expression.

For the deﬁnition of a term as a list of factors separated by multiplicative operators

we use the function chainl . Recall that chainl repeatedly recognises its ﬁrst argument

(fact), separated by its second argument (a ∗ or /). The parse trees for the individual

factors are joined by the constructor functions that appear before <$>. We use chainl

and not chainr because the operator ’/’ is considered to be left-associative.

The function expr is analogous to term, only with additive operators instead of

multiplicative operators, and with terms instead of factors.

61

3. Parser combinators

— Type deﬁnition for parse tree

data Expr = Con Int

[ Var String

[ Fun String [Expr]

[ Expr :+: Expr

[ Expr :−: Expr

[ Expr :∗: Expr

[ Expr :/: Expr

— Parser for expressions with two priorities

fact :: Parser Char Expr

fact = Con <$>integer

<[> Var <$>identiﬁer

<[> Fun <$>identiﬁer <∗>parenthesised (commaList expr)

<[> parenthesised expr

integer :: Parser Char Int

integer = (const negate <$>(symbol ’-’)) ‘option‘ id <∗>natural

term :: Parser Char Expr

term = chainl fact

( const (:∗:) <$>symbol ’*’

<[>const (:/:) <$>symbol ’/’

)

expr :: Parser Char Expr

expr = chainl term

( const (:+:) <$>symbol ’+’

<[>const (:−:) <$>symbol ’-’

)

Listing 3.7: ExpressionParser.hs

62

3.6. Generalised expressions

This example clearly shows the strength of parsing with parser combinators. There is

no need for a separate formalism for grammars; the production rules of the grammar

are combined with higher-order functions. Also, there is no need for a separate

parser generator (like ‘yacc’); the functions can be viewed both as description of the

grammar and as an executable parser.

Exercise 3.28. 1. Give the parse tree for the expressions "abc", "(abc)", "a*b+1",

"a*(b+1)", "-1-a", and "a(1,b)"

2. Why is the parse tree for the expression "a(1,b)" not the ﬁrst solution of the parser?

Modify the functions in Listing 3.7 in such way that it will be.

Exercise 3.29. A function with no arguments such as "f()" is not accepted by the parser.

Explain why and modify the parser in such way that it will be.

Exercise 3.30. Modify the functions in Listing 3.7, in such a way that + is parsed as a

right-associative operator, and - is parsed as a left-associative operator.

3.6. Generalised expressions

This section generalises the parser in the previous section with respect to priorities.

Arithmetical expressions in which operators have more than two levels of priority can

be parsed by writing more auxiliary functions between term and expr. The function

chainl is used in each deﬁnition, with as ﬁrst argument the function of one priority

level lower.

If there are nine levels of priority, we obtain nine copies of almost the same text. This

is not as it should be. Functions that resemble each other are an indication that we

should write a generalised function, where the diﬀerences are described using extra

arguments. Therefore, let us inspect the diﬀerences in the deﬁnitions of term and

expr again. These are:

• The operators and associated tree constructors that are used in the second

argument of chainl

• The parser that is used as ﬁrst argument of chainl

The generalised function will take these two diﬀerences as extra arguments: the ﬁrst

in the form of a list of pairs, the second in the form of a parse function:

type Op a = (Char, a → a → a)

gen :: [Op a] → Parser Char a → Parser Char a

gen ops p = chainl p (choice (map f ops))

where f (s, c) = const c <$>symbol s

If furthermore we deﬁne as shorthand:

63

3. Parser combinators

multis = [(’*’, (:∗:)), (’/’, (:/:))]

addis = [(’+’, (:+:)), (’-’, (:−:))]

then expr and term can be deﬁned as partial parametrisations of gen:

expr = gen addis term

term = gen multis fact

By expanding the deﬁnition of term in that of expr we obtain:

expr = addis ‘gen‘ (multis ‘gen‘ fact)

which an experienced functional programmer immediately recognises as an applica-

tion of foldr:

expr = foldr gen fact [addis, multis]

From this deﬁnition a generalisation to more levels of priority is simply a matter of

extending the list of operator-lists.

The very compact formulation of the parser for expressions with an arbitrary number

of priority levels is possible because the parser combinators can be used together with

the existing mechanisms for generalisation and partial parametrisation in Haskell.

Contrary to conventional approaches, the levels of priority need not be coded expli-

citly with integers. The only thing that matters is the relative position of an operator

in the list of ‘list with operators of the same priority’. Also, the insertion of new

priority levels is very easy. The deﬁnitions are summarised in Listing 3.8.

Summary

This chapter shows how to construct parsers from simple combinators. It shows

how a small parser combinator library can be a powerful tool in the construction of

parsers. Furthermore, this chapter gives a rather basic implementation of the parser

combinator library. More advanced implementations are discussed elsewhere.

3.7. Exercises

Exercise 3.31. How should the parser of Section 3.6 be adapted to also allow raising an

expression to the power of an expression?

Exercise 3.32. Prove the following laws

h <$>(f <$>p) = (h . f ) <$>p (3.1)

h <$>(p <[>q) = (h <$>p) <[>(h <$>q) (3.2)

h <$>(p <∗>q) = ((h.) <$>p) <∗>q (3.3)

64

3.7. Exercises

— Parser for expressions with aribitrary many priorities

type Op a = (Char, a → a → a)

fact

**:: Parser Char Expr
**

fact

= Con <$>integer

<[>Var <$>identiﬁer

<[>Fun <$>identiﬁer <∗>parenthesised (commaList expr

)

<[>parenthesised expr

**gen :: [Op a] → Parser Char a → Parser Char a
**

gen ops p = chainl p (choice (map f ops))

where f (s, c) = const c <$>symbol s

expr

**:: Parser Char Expr
**

expr

= foldr gen fact

[addis, multis]

multis = [(’*’, (:∗:)), (’/’, (:/:))]

addis = [(’+’, (:+:)), (’-’, (:−:))]

Listing 3.8: GExpressionParser.hs

Exercise 3.33. Consider your answer to Exercise 2.23. Deﬁne a combinator parser pMir

that transforms a concrete representation of a mirror-palindrome into an abstract one. Test

your function with the concrete mirror-palindromes cMir

1

and cMir

2

.

Exercise 3.34. Consider your answer to Exercise 2.25. Assuming the comma is an associa-

tive operator, we can give the following abstract syntax for bit-lists:

data BitList = SingleB Bit [ ConsB Bit BitList

Deﬁne a combinator parser pBitList that transforms a concrete representation of a bit-list

into an abstract one. Test your function with the concrete bit-lists cBitList

1

and cBitList

2

.

Exercise 3.35. Deﬁne a parser for ﬁxed-point numbers, that is numbers like 12.34 and

-123.456. Also integers are acceptable. Notice that the part following the decimal point

looks like an integer, but has a diﬀerent semantics!

Exercise 3.36. Deﬁne a parser for ﬂoating point numbers, which are ﬁxed point numbers

followed by an optional E and an (positive or negative, integer) exponent.

Exercise 3.37. Deﬁne a parser for Java assignments that consist of a variable, an = sign,

an expression and a semicolon.

Exercise 3.38 (no answer provided). Deﬁne a parser for (simpliﬁed) Java statements.

Exercise 3.39 (no answer provided). Outline the construction of a parser for Java programs.

65

3. Parser combinators

66

4. Grammar and Parser design

The previous chapters have introduced many concepts related to grammars and

parsers. The goal of this chapter is to review these concepts, and to show how

they are used in the design of grammars and parsers.

The design of a grammar and parser for a language consists of several steps: you

have to

1. give example sentences of the language for which you want to design a grammar

and a parser;

2. give a grammar for the language for which you want to have a parser;

3. test that the grammar can indeed describe the example sentences;

4. analyse this grammar to ﬁnd out whether or not it has some desirable proper-

ties;

5. possibly transform the grammar to obtain some of these desirable properties;

6. decide on the type of the parser: Parser a b, that is, decide on both the input

type a of the parser (which may be the result type of a scanner), and the result

type b of the parser.

7. construct a basic parser;

8. add semantic functions;

9. test that the parser can parse the example sentences you have given in the ﬁrst

step, and that the parser returns what you expect.

We will describe and exemplify each of these steps in detail in the rest of this sec-

tion.

As a running example we will construct a grammar and parser for travelling schemes

for day trips, of the following form:

Groningen 8:37 9:44 Zwolle 9:49 10:15 Utrecht 10:21 11:05 Den Haag

We might want to do several things with such a schema, for example:

1. compute the net travel time, i. e., the travel time minus the waiting time (2

hours and 17 minutes in the above example);

67

4. Grammar and Parser design

2. compute the total time one has to wait on the intermediate stations (11 min-

utes).

This chapter deﬁnes functions to perform these computations.

4.1. Step 1: Example sentences for the language

We have already given an example sentence above:

Groningen 8:37 9:44 Zwolle 9:49 10:15 Utrecht 10:21 11:05 Den Haag

Other example sentences are:

Utrecht Centraal 10:25 10:58 Amsterdam Centraal

Assen

4.2. Step 2: A grammar for the language

The starting point for designing a parser for your language is to deﬁne a grammar that

describes the language as precisely as possible. It is important to convince yourself

from the fact that the grammar you give really generates the desired language, since

the grammar will be the basis for grammar transformations, which might turn the

grammar into a set of incomprehensible productions.

For the language of travelling schemes, we can give several grammars. The following

grammar focuses on the fact that a trip consists of zero or more departures and

arrivals.

TS → TS Departure Arrival TS [ Station

Station → Identiﬁer

+

Departure → Time

Arrival → Time

Time → Nat : Nat

where Identiﬁer and Nat have been deﬁned in Section 2.3.1. So a travelling scheme is

a sequence of departure and arrival times, separated by stations. Note that a single

station is also a travelling scheme with this grammar.

Another grammar focuses on changing at a station:

TS → Station Departure (Arrival Station Departure)

∗

Arrival Station

[ Station

So each travelling scheme starts and ends at a station, and in between there is a list

of intermediate stations.

68

4.3. Step 3: Testing the grammar

4.3. Step 3: Testing the grammar

Both grammars we have given in step 2 describe the example sentences given in

step 1. The derivation of these sentences using these grammars is easy.

4.4. Step 4: Analysing the grammar

To parse sentences of a language eﬃciently, we want to have a unambiguous grammar

that is left-factored and not left recursive. Depending on the parser we want to obtain,

we might desire other properties of our grammar. So a ﬁrst step in designing a parser

is analysing the grammar, and determining which properties are (not) satisﬁed. We

have not yet developed tools for grammar analysis (we will do so in the chapter on

LL(1) parsing) but for some grammars it is easy to detect some properties.

The ﬁrst example grammar is left and right recursive: the ﬁrst production for TS

starts and ends with TS. Furthermore, the sequence Departure Arrival is an asso-

ciative separator in the generated language.

These properties may be used for transforming the grammar. Since we don’t mind

about right recursion, we will not make use of the fact that the grammar is right

recursive. The other properties will be used in grammar transformations in the

following subsection.

4.5. Step 5: Transforming the grammar

Since the sequence Departure Arrival is an associative separator in the generated

language, the productions for TS may be transformed into:

TS → Station [ Station Departure Arrival TS (4.1)

Thus we have removed the left recursion in the grammar. Both productions for

TS start with the nonterminal Station, so TS can be left factored. The resulting

productions are:

TS → Station Z

Z → ε [ Departure Arrival TS

We can also apply equivalence (2.1) to the two productions for TS from (4.1), and

obtain the following single production:

TS → (Station Departure Arrival )

∗

Station (4.2)

So which productions do we take for TS? This depends on what we want to do with

the parsed sentences. We will show several choices in the next section.

69

4. Grammar and Parser design

4.6. Step 6: Deciding on the types

We want to write a parser for travel schemes, that is, we want to write a function ts

of type

ts :: Parser ? ?

The question marks should be replaced by the input type and the result type, respec-

tively. For the input type we can choose between at least two possibilities: characters,

Char or tokens Token. The type of tokens can be chosen as follows:

data Token = Station Token Station [ Time Token Time

type Station = String

type Time = (Int, Int)

We will construct a parser for both input types in the next subsection. So ts has one

of the following two types.

ts :: Parser Char ?

ts :: Parser Token ?

For the result type we have many choices. If we just want to compute the total

travelling time, Int suﬃces for the result type. If we want to compute the total

travelling time, the total waiting time, and a nicely printed version of the travelling

scheme, we may do several things:

• deﬁne three parsers, with Int (total travelling time), Int (total waiting time),

and String (nicely printed version) as result type, respectively;

• deﬁne a single parser with the triple (Int, Int, String) as result type;

• deﬁne an abstract syntax for travelling schemes, say a datatype TS, and deﬁne

three functions on TS that compute the desired results.

The ﬁrst alternative parses the input three times, and is rather ineﬃcient compared

with the other alternatives. The second alternative is hard to extend if we want

to compute something extra, but in some cases it might be more eﬃcient than the

third alternative. The third alternative needs an abstract syntax. There are several

ways to deﬁne an abstract syntax for travelling schemes. The ﬁrst abstract syntax

corresponds to deﬁnition (4.1) of grammar TS.

data TS

1

= Single

1

Station

[ Cons

1

Station Time Time TS

1

where Station and Time are deﬁned above. A second abstract syntax corresponds

to the grammar for travelling schemes deﬁned in (4.2).

type TS

2

= ([(Station, Time, Time)], Station)

70

4.7. Step 7: Constructing the basic parser

So a travelling scheme is a tuple, the ﬁrst component of which is a list of triples

consisting of a departure station, a departure time, and an arrival time, and the

second component of which is the ﬁnal arrival station. A third abstract syntax

corresponds to the second grammar deﬁned in Section 4.2:

data TS

3

= Single

3

Station

[ Cons

3

(Station, Time, [(Time, Station, Time)], Time, Station)

Which abstract syntax should we take? Again, this depends on what we want to do

with the abstract syntax. Since TS

2

and TS

1

combine departure and arrival times

in a tuple, they are convenient to use when computing travelling times. TS

3

is useful

when we want to compute waiting times since it combines arrival and departure times

in one constructor. Often we want to exactly mimic the productions of the grammar

in the abstract syntax, so if we use grammar (4.1) for travelling schemes, we use TS

1

for the abstract syntax. Note that TS

1

is a datatype, whereas TS

2

is a type. TS

1

cannot be deﬁned as a type because of the two alternative productions for TS. TS

2

can be deﬁned as a datatype by adding a constructor. Types and datatypes each

have their advantages and disadvantages; the application determines which to use.

The result type of the parsing function ts may be one of types mentioned earlier (Int,

etc.), or one of TS

1

, TS

2

, TS

3

.

4.7. Step 7: Constructing the basic parser

Converting a grammar to a parser is a mechanical process that consists of a set

of simple replacement rules. Functional programming languages oﬀer some extra

ﬂexibility that we sometimes use, but usually writing a parser is a simple translation.

We use the following replacement rules.

grammar construct Haskell/parser construct

→ =

[ <[>

(space) <∗>

+

many

1

∗

many

? option

terminal x symbol x

begin of sequence of symbols undeﬁned<$>

Note that we start each sequence of symbols by undeﬁned<$>. The undeﬁned has

to be replaced by an appropriate semantic function in Step 6, but putting undeﬁned

here ensures type correctness of the parser. Of course, running the parser will result

in an error.

71

4. Grammar and Parser design

We construct a basic parser for each of the input types Char and Token.

4.7.1. Basic parsers from strings

Applying these rules to the grammar (4.2) for travelling schemes, we obtain the

following basic parser.

station :: Parser Char Station

station = undeﬁned <$>many

1

identiﬁer

time :: Parser Char Time

time = undeﬁned <$>natural <∗>symbol ’:’ <∗>natural

departure, arrival :: Parser Char Time

departure = undeﬁned <$>time

arrival = undeﬁned <$>time

tsstring :: Parser Char ?

tsstring = undeﬁned

<$>many ( undeﬁned

<$>spaces

<∗> station

<∗> spaces

<∗> departure

<∗> spaces

<∗> arrival

)

<∗> spaces

<∗> station

spaces :: Parser Char String

spaces = undeﬁned <$>many (symbol ’ ’)

The only thing left to do is to add the semantic glue to the functions. The semantic

glue also determines the type of the function tsstring, which is denoted by ? for the

moment. For the other basic parsers we have chosen some reasonable return types.

The semantic functions are deﬁned in the next and ﬁnal step.

4.7.2. A basic parser from tokens

To obtain a basic parser from tokens, we ﬁrst write a scanner that produces a list of

tokens from a string.

scanner :: String → [Token]

scanner = mkTokens . combine . words

combine :: [String] → [String]

72

4.7. Step 7: Constructing the basic parser

combine [] = []

combine [x] = [x]

combine (x : y : xs) =if isAlpha (head x) ∧ isAlpha (head y)

then combine ((x ++ " " ++ y) : xs)

else x : combine (y : xs)

mkToken :: String → Token

mkToken xs =if isDigit (head xs)

then Time Token (mkTime xs)

else Station Token xs

parse result :: [(a, b)] → a

parse result xs

[ null xs = error "parse_result: could not parse the input"

[ otherwise = fst (head xs)

mkTime :: String → Time

mkTime = parse result . time

This is a basic scanner with very basic error messages, but it suﬃces for now. The

composition of the scanner with the function tstoken

1

deﬁned below gives the ﬁnal

parser.

tstoken

1

:: Parser Token ?

tstoken

1

= undeﬁned

<$>many ( undeﬁned

<$>tstation

<∗> tdeparture

<∗> tarrival

)

<∗> tstation

tstation :: Parser Token Station

tstation (Station Token s : xs) = [(s, xs)]

tstation = []

tdeparture, tarrival :: Parser Token Time

tdeparture (Time Token (h, m) : xs) = [((h, m), xs)]

tdeparture = []

tarrival (Time Token (h, m) : xs) = [((h, m), xs)]

tarrival = []

where again the semantic functions remain to be deﬁned. Note that functions

tdeparture and tarrival are the same functions. Their presence reﬂects their pres-

ence in the grammar.

Another basic parser from tokens is based on the second grammar of Section 4.2.

tstoken

2

:: Parser Token ?

tstoken

2

= undeﬁned

73

4. Grammar and Parser design

<$>tstation

<∗> tdeparture

<∗> many ( undeﬁned

<$>tarrival

<∗> tstation

<∗> tdeparture

)

<∗> tarrival

<∗> tstation

<[>undeﬁned <$>tstation

4.8. Step 8: Adding semantic functions

Once we have the basic parsing functions, we need to add the semantic glue: the

functions that take the results of the elements in the right hand side of a production,

and convert them into the result of the left hand side. The basic rule is: Let the

types do the work!

First we add semantic functions to the basic parsing functions station, time, departure,

arrival , and spaces. Since function many

1

identiﬁer returns a list of strings, and we

want to obtain the concatenation of these strings for the station name, we can take

the concatenation function concat for undeﬁned in function station. To obtain a

value of type Time from an integer, a character, and an integer, we have to combine

the two integers in a tuple. So we take the following function

λx y → (x, y)

for undeﬁned in time. Now, since function time returns a value of type Time, we can

take the identity function for undeﬁned in departure and arrival , and then we replace

id <$>time by just time. Finally, the result of many is a string, so for undeﬁned in

spaces we can take the identity function too.

The ﬁrst semantic function for the basic parser tsstring deﬁned in Section 4.7.1

returns an abstract syntax tree of type TS

2

. So the ﬁrst undeﬁned in tsstring should

return a tuple of a list of things of the correct type (the ﬁrst component of the type

TS

2

) and a Station. Since many returns a list of things, we can construct such a

tuple by means of the function

λx y → (x, y)

provided many returns a value of the desired type: [(Station, Time, Time)]. Note

that this semantic function basically only throws away the value returned by the

spaces parser: we are not interested in the spaces between the components of our

74

4.9. Step 9: Did you get what you expected

travelling scheme. The many parser returns a value of the correct type if we replace

the second occurrence of undeﬁned in tsstring by the function

λ x y z → (x, y, z )

Again, the results of spaces are thrown away. This completes a parser for travelling

schemes. The next semantic functions we deﬁne compute the net travel time. To

compute the net travel time, we have to compute the travel time of each trip from

a station to a station, and to add the travel times of all of these trips. We obtain

the travel time of a single trip if we replace the second occurrence of undeﬁned in

tsstring by:

λ (xh, xm) (zh, zm) → (zh −xh) ∗ 60 + zm −xm

and Haskell’s prelude function sum sums these times, so for the ﬁrst occurrence of

undeﬁned we take:

λx → sum x

The ﬁnal set of semantic functions we deﬁne are used for computing the total waiting

time. Since the second grammar of Section 4.2 combines arrival times and departure

times, we use a parser based on this grammar: the basic parser tstoken

2

. We have

to give deﬁnitions of the three undeﬁned semantic functions. If a trip consists of a

single station, there is now waiting time, so the last occurrence of undeﬁned is the

function const 0. The second occurrence of function undeﬁned computes the waiting

time for one intermediate station:

λ(uh, um) (wh, wm) → (wh −uh) ∗ 60 + wm −um

Finally, the ﬁrst occurrence of undeﬁned sums the list of waiting time obtained by

means of the function that replaces the second occurrence of undeﬁned:

λ x → sum x

4.9. Step 9: Did you get what you expected

In the last step you test your parser(s) to see whether or not you have obtained what

you expected, and whether or not you have made errors in the above process.

Summary

This chapter describes the diﬀerent steps that have to be considered in the design of

a grammar and a language.

75

4. Grammar and Parser design

4.10. Exercises

Exercise 4.1. Write a parser ﬂoatLiteral for Java ﬂoat-literals. The EBNF grammar for

ﬂoat-literals is given by:

FloatLiteral → IntPart . FractPart? ExponentPart? FloatSuﬃx?

[ . FractPart ExponentPart? FloatSuﬃx?

[ IntPart ExponentPart FloatSuﬃx?

[ IntPart ExponentPart? FloatSuﬃx

IntPart → SignedInteger

FractPart → Digits

ExponentPart → ExponentIndicator SignedInteger

SignedInteger → Sign? Digits

Digits → Digits Digit [ Digit

ExponentIndicator → e [ E

Sign → + [ -

FloatSuﬃx → f [ F [ d [ D

To keep your parser simple, assume that all nonterminals, except for the nonterminal FloatLiteral ,

are represented by a String in the abstract syntax.

Exercise 4.2. Write an evaluator signedFloat for Java ﬂoat-literals (the ﬂoat-suﬃx may be

ignored).

Exercise 4.3. Up to the deﬁnition of the semantic functions, parsers constructed on a (ﬁxed)

abstract syntax have the same shape. Give this parsing scheme for Java ﬂoat literals.

76

5. Compositionality

Introduction

Many recursive functions follow a common pattern of recursion. These common

patterns of recursion can conveniently be captured by higher order functions. For

example: many recursive functions deﬁned on lists are instances of the higher order

function foldr. It is possible to deﬁne a function such as foldr for a whole range of

datatypes other than lists. Such functions are called compositional. Compositional

functions on datatypes are deﬁned in terms of algebras of semantic actions that

correspond to the constructors of the datatype. Compositional functions can typically

be used to deﬁne the semantics of programming languages constructs. Such semantics

is referred to as algebraic semantics. Algebraic semantics often uses algebras that

contain functions from tuples to tuples. Such functions can be seen as computations

that read values from a component of the domain and write values to a component of

the codomain. The former values are called inherited attributes and the latter values

are called synthesised attributes. Attributes can be both inherited and synthesised.

As explained in Section 2.6, there is an important relationship between grammars

and compositionality: with every grammar, which describes the concrete syntax of a

language, one can associate a (possibly mutually recursive) datatype, which describes

the abstract syntax of the language. Compositional functions on these datatypes are

called syntax driven.

Goals

After studying this chapter and making the exercises you will

• know how to generalise constructors of a datatype to an algebra;

• know how to write compositional functions, also known as folds, on (possibly

mutually recursive) datatypes;

• understand the advantages of using folds, and have seen that many problems

can be solved with a fold;

• know that a fold applied to the constructors algebra is the identity function;

• have seen the notions of fusion and deforestation;

• know how to write syntax driven code;

• understand the connection between datatypes, abstract syntax and concrete

syntax;

• understand the notions of synthesised and inherited attributes;

77

5. Compositionality

• be able to associate inherited and synthesised attributes with the diﬀerent alter-

natives of (possibly mutually recursive) datatypes (or the diﬀerent nonterminals

of grammars);

• be able to deﬁne algebraic semantics in terms of compositional (or syntax

driven) code that is deﬁned using algebras of computations which make use

of inherited and synthesised attributes.

Organisation

The chapter is organised as follows. Section 5.1 shows how to deﬁne compositional

recursive functions on built-in lists using a function which is similar to foldr and shows

how to do the same thing with user-deﬁned lists and streams. Section 5.2 shows how

to deﬁne compositional recursive functions on several kinds of trees. Section 5.3

deﬁnes algebraic semantics. Section 5.4 shows the usefulness of algebraic semantics

by presenting an expression evaluator, an expression interpreter which makes use of

a stack and expression compiler to a stack machine. They only diﬀer in the way

they handle basic expressions (variables and local deﬁnitions are handled in the same

way). All three examples use an algebra of computations which can read values from

and write values to an environment which binds names to values. In a second version

of the expression evaluator the use of inherited and synthesised attributes is made

more explicit by using tuples. Section 5.5 presents a relatively complex example of

the use of tuples in combination with compositionality. It deals with the problem of

variable scope when compiling block structured languages.

5.1. Lists

This section introduces compositional functions on the well known datatype of lists.

Compositional functions are deﬁned on the built-in datatype [a] for lists (Section

5.1.1), on a user-deﬁned datatype List a for lists (Section 5.1.2), and on streams or

inﬁnite lists (Section 5.1.3). We also show how to construct an algebra that directly

corresponds to a datatype.

5.1.1. Built-in lists

The datatype of lists is perhaps the most important example of a datatype. A list is

either the empty list [] or a nonempty list (x : xs) consisting of an element x at the

head of the list, followed by a tail xs which itself is again a list. Thus, the type [x]

is recursive. In Haskell the type [x] for lists is built-in. The informal deﬁnition of

above corresponds to the following (pseudo) datatype.

data [x] = x : [x] [ []

78

5.1. Lists

Many recursive functions on lists look very similar. Computing the sum of the el-

ements of a list of integers (sumL, where the L denotes that it is a sum function

deﬁned on lists) is very similar to computing the product of the elements of a list of

integers (prodL).

sumL, prodL :: [Int] → Int

sumL (x : xs) = x + sumL xs

sumL [] = 0

prodL (x : xs) = x ∗ prodL xs

prodL [] = 1

The function sumL replaces the list constructor (:) by (+) and the list constructor

[] by 0 and the function prodL replaces the list constructor (:) by (∗) and the list

constructor [] by 1. Note that we have replaced the constructor (:) (a constructor

with two arguments) by binary operators (+) and (∗) (i.e. functions with two argu-

ments) and the constructor [] (a constructor with zero variables) by constants 0 and

1 (i.e. ‘functions’ with zero variables). The similarity between the deﬁnitions of the

functions sumL and prodL can be captured by the following higher order recursive

function foldL, which is nothing else but an uncurried version of the well known func-

tion foldr. Don’t confuse foldL with Haskell’s prelude function foldl , which works

the other way around.

foldL :: (x → l → l , l ) → [x] → l

foldL (op, c) = fold

where

fold (x : xs) = op x (fold xs)

fold [] = c

The function foldL recursively replaces the constructor (:) by an operator op and

the constructor [] by a constant c. We can now use foldL to compute the sum and

product of the elements of a list of integers as follows.

? foldL ((+),0) [1,2,3,4]

10

? foldL ((*),1) [1,2,3,4]

24

The pair (op, c) is often referred to as a list-algebra. More precisely, a list-algebra

consists of a type l (the carrier of the algebra), a binary operator op of type x → l → l

and a constant c of type l . Note that a type (like Int) can be the carrier of a list-

algebra in more than one way (for example using ((+), 0) and ((∗), 1)). Here is

another example of how to turn Int into a list-algebra.

? foldL (\_ n -> n+1,0) [1,2,3,4]

4

79

5. Compositionality

This list-algebra ignores the value at the head of a list, and increments the result

obtained thus far with one. It corresponds to the function sizeL deﬁned by:

sizeL :: [x] → Int

sizeL ( : xs) = 1 + sizeL xs

sizeL [] = 0

Note that the type of sizeL is more general than the types of sumL and prodL. The

type of the elements of the list does not play a role.

5.1.2. User-deﬁned lists

In this subsection we present an example of a fold function deﬁned on another

datatype than built-in lists. To keep things simple we redo the list example for

user-deﬁned lists.

data List x = Cons x (List x) [ Nil

User-deﬁned lists are deﬁned in the same way as built-in ones. The constructors

(:) and [] are replaced by constructors Cons and Nil . Here are the types of the

constructors Cons and Nil .

? :t Cons

Cons :: a -> List a -> List a

? :t Nil

Nil :: List a

A algebra type ListAlgebra corresponding to the datatype List directly follows the

structure of that datatype.

type ListAlgebra x l = (x → l → l , l )

The left hand side of the type deﬁnition is obtained from the left hand side of the

datatype as follows: a postﬁx Algebra is added at the end of the name List and a type

variable l is added at the end of the whole left hand side of the type deﬁnition. The

right hand side of the type deﬁnition is obtained from the right hand side of the data

deﬁnition as follows: all List x valued constructors are replaced by l valued functions

which have the same number of arguments (if any) as the corresponding constructors.

The types of recursive constructor arguments (i.e. arguments of type List x) are

replaced by recursive function arguments (i.e. arguments of type l ). The types of the

other arguments are simply left unchanged. In a similar way, the deﬁnition of a fold

function can be generated automatically from the data deﬁnition.

foldList :: ListAlgebra x l → List x → l

foldList (cons, nil ) = fold

80

5.1. Lists

where

fold (Cons x xs) = cons x (fold xs)

fold Nil = nil

The constructors Cons and Nil in the left hand sides of the deﬁnition of the local

function fold are replaced by functions cons and nil in the right hand sides. The

function fold is applied recursively to all recursive constructor arguments of type

List x to return a value of type l as required by the functions of the algebra (in this

case Cons and cons have one such recursive argument). The other arguments are

left unchanged. Recursive functions on user-deﬁned lists which are deﬁned by means

of foldList are called compositional. Every algebra deﬁnes a unique compositional

function. Here are three examples of compositional functions. They correspond to

the examples of Section 5.1.1.

sumList, prodList :: List Int → Int

sumList = foldList ((+), 0)

prodList = foldList ((∗), 1)

sizeList :: List x → Int

sizeList = foldList (const (1+), 0)

It is worth mentioning one particular ListAlgebra: the trivial ListAlgebra that re-

places Cons by Cons and Nil by Nil . This algebra deﬁnes the identity function on

user-deﬁned lists.

idListAlgebra :: ListAlgebra x (List x)

idListAlgebra = (Cons, Nil )

idList :: List x → List x

idList = foldList idListAlgebra

? idList (Cons 1 (Cons 2 Nil))

Cons 1 (Cons 2 Nil)

5.1.3. Streams

In this section we consider streams (or inﬁnite lists).

streams

data Stream x = And x (Stream x)

Here is a standard example of a stream: the inﬁnite list of ﬁbonacci numbers.

ﬁbStream :: Stream Int

ﬁbStream = And 0 (And 1 (restOf ﬁbStream))

where

restOf (And x stream@(And y )) = And (x + y) (restOf stream)

81

5. Compositionality

The algebra type StreamAlgebra, and the fold function foldStream can be generated

automatically from the datatype Stream.

type StreamAlgebra x s = x → s → s

foldStream :: StreamAlgebra x s → Stream x → s

foldStream and = fold

where

fold (And x xs) = and x (fold xs)

Note that the algebra has only one component because Stream has only one construc-

tor. For the same reason the fold function is deﬁned using only one equation. Here is

an example of using a compositional function on user deﬁned streams. It computes

the ﬁrst element of a monotone stream that is greater or equal than a given value.

ﬁrstGreaterThan :: Ord x ⇒ x → Stream x → x

ﬁrstGreaterThan n = foldStream (λx y → if x n then x else y)

5.2. Trees

Now that we have seen how to generalise the foldr function on built-in lists to compo-

sitional functions on user-deﬁned lists and streams we proceed by explaining another

common class of datatypes: trees. We will treat four diﬀerent kinds of trees in the

subsections below:

• binary trees;

• trees for matching parentheses;

• expression trees;

• general trees.

Furthermore, we will brieﬂy mention the concepts of fusion and deforestation.

5.2.1. Binary trees

A binary tree is either a node where the tree splits into two subtrees or a leaf which

holds a value.

data BinTree x = Bin (BinTree x) (BinTree x) [ Leaf x

One can generate the corresponding algebra type BinTreeAlgebra and fold function

foldBinTree from the datatype automatically. Note that Bin has two recursive argu-

ments and that Leaf has one non-recursive argument.

type BinTreeAlgebra x t = (t → t → t, x → t)

82

5.2. Trees

foldBinTree :: BinTreeAlgebra x t → BinTree x → t

foldBinTree (bin, leaf ) = fold

where

fold (Bin l r) = bin (fold l ) (fold r)

fold (Leaf x) = leaf x

In the BinTreeAlgebra type, the bin part of the algebra has two arguments of type

t and the leaf part of the algebra has one argument of type x. Similarly, in the

foldBinTree function, the local fold function is applied recursively to both arguments

of bin and is not called on the argument of leaf . We can now deﬁne compositional

functions on binary trees much in the same way as we deﬁned them on lists. Here is

an example: the function sizeBinTree computes the size of a binary tree.

sizeBinTree :: BinTree x → Int

sizeBinTree = foldBinTree ((+), const 1)

? sizeBinTree (Bin (Bin (Leaf 3) (Leaf 7)) (Leaf 11))

3

If a tree consists of a leaf, then sizeBinTree ignores the value at the leaf and returns

1 as the size of the tree. If a tree consists of two subtrees, then sizeBinTree returns

the sum of the sizes of those subtrees as the size of the tree. Functions for computing

the sum and the product of the integers at the leafs of a binary tree can be deﬁned

in a similar way. It suﬃces to deﬁne appropriate semantic actions bin and leaf on a

type t (in this case Int) that correspond to the syntactic constructs Bin and Leaf of

the datatype BinTree.

5.2.2. Trees for matching parentheses

Section 3.3.1 deﬁnes the datatype Parentheses for matching parentheses.

data Parentheses = Match Parentheses Parentheses

[ Empty

For example, the sentence ()() of the concrete syntax for matching parentheses

is represented by the value Match Empty (Match Empty Empty) in the abstract

syntax Parentheses. Remember that the abstract syntax ignores the terminal bracket

symbols of the concrete syntax.

We can now deﬁne, in the same way as we did for lists and binary trees, an algebra

type ParenthesesAlgebra and a fold function foldParentheses, which can be used to

compute the depth (depthParentheses) and the width (widthParentheses) of match-

ing parenthesis in a compositional way. The depth of a string of matching parentheses

83

5. Compositionality

s is the largest number of unmatched parentheses that occurs in a substring of s. For

example, the depth of the string ((()))() is 3. The width of a string of matching

parentheses s is the the number of substrings that are matching parentheses them-

selves, which are not a substring of a surrounding string of matching parentheses.

For example, the width of the string ((()))() is 2. Compositional functions on

datatypes that describe the abstract syntax of a language are called syntax driven.

type ParenthesesAlgebra m = (m → m → m, m)

foldParentheses :: ParenthesesAlgebra m → Parentheses → m

foldParentheses (match, empty) = fold

where

fold (Match l r) = match (fold l ) (fold r)

fold Empty = empty

depthParenthesesAlgebra :: ParenthesesAlgebra Int

depthParenthesesAlgebra = (λx y → max (1 + x) y, 0)

widthParenthesesAlgebra :: ParenthesesAlgebra Int

widthParenthesesAlgebra = (λ y → 1 + y , 0)

depthParentheses, widthParentheses :: Parentheses → Int

depthParentheses = foldParentheses depthParenthesesAlgebra

widthParentheses = foldParentheses widthParenthesesAlgebra

parenthesesExample = Match (Match (Match Empty Empty) Empty)

(Match Empty

(Match (Match Empty Empty)

Empty

) )

? depthParentheses parenthesesExample

3

? widthParentheses parenthesesExample

3

Our example reveals that abstract syntax is not very well suited for interpretation

by human beings. What is the concrete representation of the matching parenthesis

example represented by parenthesesExample? It happens to be ((()))()(()). For-

tunately, we can easily write a program that computes the concrete representation

from the abstract one. We know exactly which terminals we have deleted when go-

ing from the concrete syntax to the abstract one. The algebra used by the function

a2cParentheses simply reinserts those terminals that we have deleted. Note that

a2cParentheses does not deal with layout such as blanks, indentation and newlines.

For a simple example layout does not really matter. For large examples layout is

very important: it can be used to let concrete representations look pretty.

84

5.2. Trees

a2cParenthesesAlgebra :: ParenthesesAlgebra String

a2cParenthesesAlgebra = (λxs ys → "(" ++ xs ++ ")" ++ ys, "")

a2cParentheses :: Parentheses → String

a2cParentheses = foldParentheses a2cParenthesesAlgebra

? a2cParentheses parenthesesExample

((()))()(())

This example illustrates that a computer can easily interpret abstract syntax (some-

thing human beings have diﬃculties with). Strangely enough, human beings can

easily interpret concrete syntax (something computers have diﬃculties with). What

we would really like is that computers can interpret concrete syntax as well. This

is the place where parsing enters the picture: computing an abstract representation

from a concrete one is precisely what parsers are used for.

Consider the functions parens and nesting of Section 3.3.1 again.

open = symbol ’(’

close = symbol ’)’

parens :: Parser Char Parentheses

parens = (λ x y → Match x y)

<$>open <∗>parens <∗>close <∗>parens

<[>succeed Empty

nesting :: Parser Char Int

nesting = (λ x y → max (1 + x) y)

<$>open <∗>nesting <∗>close <∗>nesting

<[>succeed 0

Function nesting could have been deﬁned by means of function parens and a fold:

nesting

**:: Parser Char Int
**

nesting

= depthParentheses <$>parens

(Remember that depthParentheses has been deﬁned as a fold.) Functions nesting

and nesting

**compute exactly the same result. The function nesting is the fusion of
**

the fold function with the parser parens from the function nesting

**. Using laws for
**

parsers and folds (not shown here) we can prove that the two functions are equal.

Note that function nesting

**ﬁrst builds a tree by means of function parens, and then
**

ﬂattens it by means of the fold. Function nesting never builds a tree, and is thus

preferable for reasons of eﬃciency. On the other hand: in function nesting

we reuse

the parser parens and the function depthParentheses, in function nesting we have

to write our own parser, and convince ourselves that it is correct. So for reasons of

‘programming eﬃciency’ function nesting

**is preferable. To obtain the best of both
**

85

5. Compositionality

worlds, we would like to write function nesting

**and have our compiler ﬁgure out that
**

it is better to use function nesting in computations. The automatic transformation

of function nesting

**into function nesting is called deforestation (trees are removed).
**

deforestation

Some (very few) compilers are clever enough to perform this transformation auto-

matically.

5.2.3. Expression trees

The matching parentheses grammar has only one nonterminal. Therefore its ab-

stract syntax is described by a single datatype. In this section we look again at the

expression grammar of Section 3.5:

E → T

E → E + T

T → F

T → T * F

F → ( E )

F → Digs

This grammar has three nonterminals, E, T, and F. Using the approach from

Section 2.6 we transform the nonterminals to datatypes:

data E = E

1

T [ E

2

E T

data T = T

1

F [ T

2

T F

data F = F

1

E [ F

2

Int

where we have translated Digs by the type Int. Note that this is a rather inconvenient

and clumsy abstract syntax for expressions; the following abstract syntax is more

convenient.

data Expr = Con Int [ Add Expr Expr [ Mul Expr Expr

However, to illustrate the concept of mutual recursive datatypes, we will study the

datatypes E, T, and F deﬁned above. Since E uses T, T uses F, and F uses E,

these three types are mutually recursive. The main datatype of the three datatypes

is the one corresponding to the start symbol E. Since the datatypes are mutually

recursive, the algebra type EAlgebra consists of three tuples of functions and three

carriers (the main carrier is, as always, the one corresponding to the main datatype

and is therefore the one corresponding to the start-symbol).

type EAlgebra e t f = ((t → e, e → t → e)

, (f → t, t → f → t)

, (e → f , Int → f )

)

86

5.2. Trees

The fold function foldE for E also folds over T and F, so it uses three mutually

recursive local functions.

foldE :: EAlgebra e t f → E → e

foldE ((e

1

, e

2

), (t

1

, t

2

), (f

1

, f

2

)) = fold

where

fold (E

1

t) = e

1

(foldT t)

fold (E

2

e t) = e

2

(fold e) (foldT t)

foldT (T

1

f ) = t

1

(foldF f )

foldT (T

2

t f ) = t

2

(foldT t) (foldF f )

foldF (F

1

e) = f

1

(fold e)

foldF (F

2

n) = f

2

n

We can now use foldE to write a syntax driven expression evaluator evalE. In the

algebra that is used in the foldE, all type variables e, f , and t are instantiated with

Int.

evalE :: E → Int

evalE = foldE ((id, (+)), (id, (∗)), (id, id))

exE = E

2

(E

1

(T

2

(T

1

(F

2

2)) (F

2

3))) (T

1

(F

2

1))

? evalE exE

7

Once again our example shows that abstract syntax cannot easily be interpreted by

human beings. Here is a function a2cE which does this job for us.

a2cE :: E → String

a2cE = foldE ((e

1

, e

2

), (t

1

, t

2

), (f

1

, f

2

))

where e

1

=λt → t

e

2

=λe t → e ++ "+" ++ t

t

1

=λf → f

t

2

=λt f → t ++ "*" ++ f

f

1

=λe → "(" ++ e ++ ")"

f

2

=λn → show n

? a2cE exE

"2*3+1"

87

5. Compositionality

5.2.4. General trees

A general tree consist of a node, holding a value, where the tree splits into a list of

subtrees. Notice that this list may be empty (in which case, of course, only the value

at the node is of interest). As usual, the type TreeAlgebra and the function foldTree

can be generated automatically from the data deﬁnition.

data Tree x = Node x [Tree x]

type TreeAlgebra x a = x → [a] → a

foldTree :: TreeAlgebra x a → Tree x → a

foldTree node = fold

where

fold (Node x gts) = node x (map fold gts)

Notice that the constructor Node has a list of recursive arguments. Therefore the

node function of the algebra has a corresponding list of recursive arguments. The local

fold function is recursively called on all elements of a list using the map function.

One can compute the sum of the values at the nodes of a general tree as follows:

sumTree :: Tree Int → Int

sumTree = foldTree (λx xs → x + sum xs)

Computing the product of the values at the nodes of a general tree and computing

the size of a general tree can be done in a similar way.

5.2.5. Eﬃciency

A fold takes a value of a datatype, and replaces its constructors by functions. If

the evaluation of each of these functions on their arguments takes constant time,

evaluation of the fold takes time linear in the number of constructors in its argument.

However, some functions require more than constant evaluation time. For example,

list concatenation is linear in its left argument, and it follows that if we deﬁne the

function reverse by

reverse :: [a] → [a]

reverse = foldL (λx xs → xs ++ [x], [])

then function reverse takes time quadratic in the length of its argument list. So,

folds are often eﬃcient functions, but if the functions in the algebra are not constant,

the fold is usually not linear. Often such a nonlinear fold can be transformed into

a more eﬃcient function. A technique that often can be used in such a case is the

accumulating parameter technique. For example, for the reverse function we have

reverse x = reverse

x []

88

5.3. Algebraic semantics

reverse

:: [a] → [a] → [a]

reverse

[] ys = ys

reverse

(x : xs) ys = reverse

xs (x : ys)

The evaluation of reverse

**xs takes time linear in the length of xs.
**

Exercise 5.1. Deﬁne an algebra type and a fold function for the following datatype.

data LNTree a b = Leaf a

[ Node (LNTree a b) b (LNTree a b)

Exercise 5.2. Deﬁne the following functions as folds on the datatype BinTree, see Sec-

tion 5.2.1.

1. height, which returns the height of a tree.

2. ﬂatten, which returns the list of leaf values in left-to-right order.

3. maxBinTree, which returns the maximal value at the leaves.

4. sp, which returns the length of a shortest path.

5. mapBinTree, which maps a function over the elements at the leaves.

Exercise 5.3. A path through a binary tree describes the route from the root of the tree to

some leaf. We choose to represent paths by sequences of Direction’s:

data Direction = L [ R

in such a way that taking the left subtree in an internal node will be encoded by L and

taking the right subtree will be encoded by R. Deﬁne a compositional function allPaths

which produces all paths of a given tree. Deﬁne this function ﬁrst using explicit recursion,

and then using a fold.

Exercise 5.4. This exercise deals with resistors. There are some basic resistors with a ﬁxed

(ﬂoating point) resistance and, given two resistors, they can be put in parallel (:[:) or in

sequence (:∗:).

1. Deﬁne the datatype Resist to represent resistors. Also, deﬁne the type ResistAlgebra

and the corresponding function foldResist.

2. Deﬁne a compositonal function result which determines the resistance of a resistors.

(Recall the rules

1

r

=

1

r

1

+

1

r

2

and r = r

1

+r

2

.)

5.3. Algebraic semantics

This section summarizes what we have seen before, and establishes terminology.

In the previous section, we have seen how to associate an algebra type and a fold

function with every datatype, or family of mutually recursive datatypes. The algebra

is a tuple (one component per datatype) of tuples (one component per constructor

89

5. Compositionality

of the datatype) of semantic actions. The algebra is parameterized over the result

type of the computation, also called the carrier of the algebra. If we work with a

carrier

family of mutually recursive datatypes, we need one carrier per datatype, and the

algebra is parameterized over all of them. We call the carrier corresponding to the

main datatype of the family the main carrier, and the others auxiliary carriers.

The fold function traverses a value and recursively replaces syntactic constructors

of the datatypes by the corresponding semantic actions of the algebra. Functions

which are deﬁned in terms of a fold function and an algebra are called compositional

compositional

functions.

There is one special algebra: the one whose components are the constructor functions

of the datatypes we operate on. This algebra, when applied via a fold, deﬁnes the

identity function, and is called the initial algebra.

initial algebra

The function resulting from applying an algebra to a fold function is sometimes also

called an algebra homomorphism, and the compositional function is said to deﬁne

algebra

homomorphism

algebraic semantics.

algebraic semantics

5.4. Expressions

The ﬁrst part of this section presents a basic expression evaluator. The evaluator

is extended with variables in the second part and with local deﬁnitions in the third

part.

5.4.1. Evaluating expressions

In this subsection we start with a more involved example: an expression evaluator.

We will use another datatype for expressions than the one introduced in Section 5.2.3:

here we will use a single, and hence non mutual recursive datatype for expressions.

We restrict ourselves to ﬂoat valued expressions on which addition, subtraction, mul-

tiplication and division are deﬁned. The datatype and the corresponding algebra

type and fold function are as follows:

inﬁxl 7 ‘Mul ‘

inﬁx 7 ‘Dvd‘

inﬁxl 6 ‘Add‘, ‘Min‘

data Expr = Expr ‘Add‘ Expr

[ Expr ‘Min‘ Expr

[ Expr ‘Mul ‘ Expr

[ Expr ‘Dvd‘ Expr

[ Num Float

type ExprAlgebra a = (a → a → a — add

90

5.4. Expressions

, a → a → a — min

, a → a → a — mul

, a → a → a — dvd

, Float → a) — num

foldExpr :: ExprAlgebra a → Expr → a

foldExpr (add, min, mul , dvd, num) = fold

where

fold (expr

1

‘Add‘ expr

2

) = fold expr

1

‘add‘ fold expr

2

fold (expr

1

‘Min‘ expr

2

) = fold expr

1

‘min‘ fold expr

2

fold (expr

1

‘Mul ‘ expr

2

) = fold expr

1

‘mul ‘ fold expr

2

fold (expr

1

‘Dvd‘ expr

2

) = fold expr

1

‘dvd‘ fold expr

2

fold (Num n) = num n

There is nothing special to notice about these deﬁnitions except, perhaps, the fact

that Expr does not have an extra parameter x like the list and tree examples. Com-

puting the result of an expression now simply consists of replacing the constructors

by appropriate functions.

resultExpr :: Expr → Float

resultExpr = foldExpr ((+), (−), (∗), (/), id)

5.4.2. Adding variables

Our next goal is to extend the evaluator of the previous subsection such that it

can handle variables as well. The values of variables are typically looked up in an

environment which binds the names of the variables to values. We implement an

environment as a list of name-value pairs. For our purposes names are strings and

values are ﬂoats. In the following programs we will use the following functions and

types:

type Env name value = [(name, value)]

(?) :: Eq name ⇒ Env name value → name → value

env ? x = head [v [ (y, v) ← env, x = = y]

type Name = String

type Value = Float

The datatype and the corresponding algebra type and fold function are now as follows.

Note that we use the same name (Expr) for the datatype, although it diﬀers from

the previous Expr datatype.

data Expr = Expr ‘Add‘ Expr

[ Expr ‘Min‘ Expr

[ Expr ‘Mul ‘ Expr

91

5. Compositionality

[ Expr ‘Dvd‘ Expr

[ Num Value

[ Var Name

type ExprAlgebra a = (a → a → a — add

, a → a → a — min

, a → a → a — mul

, a → a → a — dvd

, Value → a — num

, Name → a) — var

foldExpr :: ExprAlgebra a → Expr → a

foldExpr (add, min, mul , dvd, num, var) = fold

where

fold (expr

1

‘Add‘ expr

2

) = fold expr

1

‘add‘ fold expr

2

fold (expr

1

‘Min‘ expr

2

) = fold expr

1

‘min‘ fold expr

2

fold (expr

1

‘Mul ‘ expr

2

) = fold expr

1

‘mul ‘ fold expr

2

fold (expr

1

‘Dvd‘ expr

2

) = fold expr

1

‘dvd‘ fold expr

2

fold (Num n) = num n

fold (Var x) = var x

The datatype Expr now has an extra constructor: the unary constructor Var. Sim-

ilarly, the argument of foldExpr now has an extra component: the unary function

var which corresponds to the unary constructor Var. Computing the result of an

expression somehow needs to use an environment. Here is a ﬁrst, bad way of doing

this: one can use it as an argument of a function that computes an algebra (we will

explain why this is a bad choice in the next subsection; the basic idea is that we use

the environment as a global variable here).

resultExprBad :: Env Name Value → Expr → Value

resultExprBad env = foldExpr ((+), (−), (∗), (/), id, (env?))

? resultExprBad [("x",3)] (Var "x" ‘Mul‘ Num 2)

6

The good way of using an environment is the following: instead of working with a

computation which, given an environment, yields an algebra of values it is better

to turn the computation itself into an algebra. Thus we turn the environment in a

‘local’ variable.

([+[), ([−[), ([∗[), ([/[) :: (Env Name Value → Value) →

(Env Name Value → Value) →

(Env Name Value → Value)

f [+[ g =λenv → f env + g env

f [−[ g =λenv → f env −g env

92

5.4. Expressions

f [∗[ g =λenv → f env ∗ g env

f [/[ g =λenv → f env / g env

resultExprGood :: Expr → (Env Name Value → Value)

resultExprGood = foldExpr (([+[), ([−[), ([∗[), ([/[), const, ﬂip (?))

? resultExprGood (Var "x" ‘Mul‘ Num 2) [("x",3)]

6

The actions (+), (−), (∗) and (/) on values are now replaced by corresponding actions

([+[), ([−[), ([∗[) and ([/[) on computations. Computing the result of the sum of two

subexpressions within a given environment consists of computing the result of the

subexpressions within this environment and adding both results to yield a ﬁnal result.

Computing the result of a constant does not need the environment at all. Computing

the result of a variable consists of looking it up in the environment. Thus, the

algebraic semantics of an expression is a computation which yields a value.

In this case the computation is of the form env → val . The value type is an example

of a synthesised attribute. The value of an expression is synthesised from values of its

synthesised attribute

subexpressions. The environment type is an example of an inherited attribute. The

inherited attribute

environment which is used by the computation of a subexpression of an expression

is inherited from the computation of the expression. Since we are working with

abstract syntax we say that the synthesised and inherited attributes are attributes

of the datatype Expr. If Expr is one of the mutually recursive datatypes which are

generated from the nonterminals of a grammar, then we say that the synthesised and

inherited attributes are attributes of the nonterminal.

5.4.3. Adding deﬁnitions

Our next goal is to extend the evaluator of the previous subsection such that it can

handle deﬁnitions as well. A (local) deﬁnition is an expression of the form

Def name expr

1

expr

2

which should be interpreted as: let the value of name be equal to expr

1

in expres-

sion expr

2

. Variables are typically deﬁned by updating the environment with an

appropriate name-value pair.

The datatype (called Expr again) and the corresponding algebra type and eval func-

tion are now as follows:

data Expr = Expr ‘Add‘ Expr

[ Expr ‘Min‘ Expr

[ Expr ‘Mul ‘ Expr

93

5. Compositionality

[ Expr ‘Dvd‘ Expr

[ Num Value

[ Var Name

[ Def Name Expr Expr

type ExprAlgebra a = (a → a → a — add

, a → a → a — min

, a → a → a — mul

, a → a → a — dvd

, Value → a — num

, Name → a — var

, Name → a → a → a) — def

foldExpr :: ExprAlgebra a → Expr → a

foldExpr (add, min, mul , dvd, num, var, def ) = fold

where

fold (expr

1

‘Add‘ expr

2

) = fold expr

1

‘add‘ fold expr

2

fold (expr

1

‘Min‘ expr

2

) = fold expr

1

‘min‘ fold expr

2

fold (expr

1

‘Mul ‘ expr

2

) = fold expr

1

‘mul ‘ fold expr

2

fold (expr

1

‘Dvd‘ expr

2

) = fold expr

1

‘dvd‘ fold expr

2

fold (Num n) = num n

fold (Var x) = var x

fold (Def x value body) = def x (fold value) (fold body)

Expr now has an extra constructor: the ternary constructor Def , which can be used

to introduce a local variable. For example, the following expression can be used to

compute the number of seconds per year.

seconds = Def "days_per_year" (Num 365) (

Def "hours_per_day" (Num 24) (

Def "minutes_per_hour" (Num 60) (

Def "seconds_per_minute" (Num 60) (

Var "days_per_year" ‘Mul ‘

Var "hours_per_day" ‘Mul ‘

Var "minutes_per_hour" ‘Mul ‘

Var "seconds_per_minute" ))))

Similarly, the parameter of foldExpr now has an extra component: the ternary func-

tion def which corresponds to the ternary constructor Def . Notice that the last two

arguments are recursive ones. We can now explain why the ﬁrst use of environments

is inferior to the second one. Trying to extend the ﬁrst deﬁnition gives something

like:

resultExprBad :: Env Name Value → Expr → Value

resultExprBad env =

foldExpr ((+), (−), (∗), (/), id, (env?), error "def")

94

5.4. Expressions

The last component causes a problem: a body that contains a local deﬁnition has to

be evaluated in an updated environment. We cannot update the environment in this

setting: we can read the environment but afterwards it is not accessible any more

in the algebra (which consists of values). Extending the second deﬁnition causes

no problems: the environment is now accessible in the algebra (which consists of

computations). We can easily add a new action which updates the environment.

The computation corresponding to the body of an expression with a local deﬁnition

can now be evaluated in the updated environment.

f [+[ g =λenv → f env + g env

f [−[ g =λenv → f env −g env

f [∗[ g =λenv → f env ∗ g env

f [/[ g =λenv → f env / g env

x [=[ f =λg env → g ((x, f env) : env)

resultExprGood :: Expr → (Env Name Value → Value)

resultExprGood =

foldExpr (([+[), ([−[), ([∗[), ([/[), const, ﬂip (?), ([=[))

? resultExprGood seconds []

31536000

Note that by prepending a pair (x, y) to an environment (in the deﬁnition of the

operator [=[), we add the pair to the environment. By deﬁnition of (?), the binding

for x hides possible other bindings for x.

5.4.4. Compiling to a stack machine

In this section we compile expressions to instructions on a stack machine. We can

then use this stack machine to evaluate compiled expressions. This section is inspired

by an example in the textbook by Bird and Wadler [3].

Imagine a simple computer for evaluating arithmetic expressions. This computer has

a ‘stack’ and can execute ‘instructions’ which change the value of the stack. The

class of possible instructions is deﬁned by the following datatype.

data MachInstr v = Push v [ Apply (v → v → v)

type MachProgr v = [MachInstr v]

An instruction either pushes a value of type v on the stack, or it executes an operator

that takes the two top values of the stack, applies the operator, and pushes the result

back on the stack. A stack (a value of type Stack v for some value type v) is a list of

values, from which you can pop values, on which you can push values, and from which

95

5. Compositionality

you can take the top value. A module Stack for stacks can be found in Appendix A.

The eﬀect of executing an instruction of type MachInstr is deﬁned by

execute :: MachInstr v → Stack v → Stack v

execute (Push x) s = push x s

execute (Apply op) s =let a = top s

t = pop s

b = top t

u = pop t

in push (op a b) u

A sequence of instructions is executed by the function run deﬁned by

run :: MachProgr v → Stack v → Stack v

run [] s = s

run (x : xs) s = run xs (execute x s)

It follows that run can be deﬁned as a foldl .

An expression can be translated (or compiled) into a list of instructions by the func-

tion compile, deﬁned by:

compileExpr :: Expr →

Env Name (MachProgr Value) →

MachProgr Value

compileExpr = foldExpr (add, min, mul , dvd, num, var, def )

where

f ‘add‘ g =λenv → f env ++ g env ++ [Apply (+)]

f ‘min‘ g =λenv → f env ++ g env ++ [Apply (−)]

f ‘mul ‘ g =λenv → f env ++ g env ++ [Apply (∗)]

f ‘dvd‘ g =λenv → f env ++ g env ++ [Apply (/)]

num v =λenv → [Push v]

var x =λenv → env ? x

def x fd fb =λenv → fb ((x, fd env) : env)

Exercise 5.5. Deﬁne the following functions as folds on the datatype Expr that contains

deﬁnitions.

1. isSum, which determines whether or not an expression is a sum.

2. vars, which returns the list of variables that occur in the expression.

Exercise 5.6. This exercise deals with expressions without deﬁnitions. The function der is

deﬁned by

der (e

1

‘Add‘ e

2

) dx = der e

1

dx ‘Add‘ der e

2

dx

der (e

1

‘Min‘ e

2

) dx = der e

1

dx ‘Min‘ der e

2

dx

der (e

1

‘Mul ‘ e

2

) dx = (e

1

‘Mul ‘ der e

2

dx) ‘Add‘ (der e

1

dx ‘Mul ‘ e

2

)

96

5.5. Block structured languages

der (e

1

‘Dvd‘ e

2

) dx = ((e

2

‘Mul ‘ der e

1

dx) ‘Min‘ (e

1

‘Mul ‘ der e

2

dx))

‘Dvd‘ (e

2

‘Mul ‘ e

2

)

der (Num f ) dx = Num 0

der (Var s) dx =if s = = dx then Num 1 else Num 0

1. Give an informal description of the function der.

2. Why is the function der not compositional ?

3. Deﬁne a datatype Exp to represent expressions consisting of (ﬂoating point) constants,

variables, addition and substraction. Also, deﬁne the type ExpAlgebra and the corre-

sponding foldExp.

4. Deﬁne the function der on Exp and show that this function is compositional.

Exercise 5.7. Deﬁne the function replace, which given a binary tree and an element m

replaces the elements at the leaves by m as a fold on the datatype BinTree, see Section 5.2.1.

It is easy to write a function with the required functionality if you swap the arguments, but

then it is impossible to write replace as a fold. Note that the fold returns a function, which

when given m replaces all the leaves by m.

Exercise 5.8. Consider the datatype of paths introduced in Exercise 5.3. A path in a tree

leads to a unique leaf. Deﬁne a compositonal function path2Value which, given a tree and a

path in the tree, yields the element at the unique leaf.

5.5. Block structured languages

This section presents a more complex example of the use of tuples in combination

with compositionality. The example deals with the scope of variables in a block

structured language. A variable from a global scope is visible in a local scope only if

it is not hidden by a variable with the same name in the local scope.

5.5.1. Blocks

A block is a list of statements. A statement is a variable declaration, a variable

usage or a nested block. The concrete representation of an example block of our

block structured language looks as follows (dcl stands for declaration, use stands for

usage and x, y and z are variables).

use x ; dcl x ;

(use z ; use y ; dcl x ; dcl z ; use x) ;

dcl y ; use y

97

5. Compositionality

Statements are separated by semicolons. Nested blocks are surrounded by paren-

theses. The usage of z refers to the local declaration (the only declaration of z).

The usage of y refers to the global declaration (the only declaration of y). The lo-

cal usage of x refers to the local declaration and the global usage of x refers to the

global declaration. Note that it is allowed to use variables before they are declared.

Here are some mutually recursive (data)types, which describe the abstract syntax of

blocks, corresponding to the grammar that describes the concrete syntax of blocks

which is used above. We use meaningful names for data constructors and we use

built-in lists instead of user-deﬁned lists for the block algebra. As usual, the algebra

type BlockAlgebra, which consists of two tuples of functions, and the fold function

foldBlock, which uses two mutually recursive local functions, can be generated from

the two mutually recursive (data)types.

type Block = [Statement]

data Statement = Dcl Idf [ Use Idf [ Blk Block

type Idf = String

type BlockAlgebra b s = ((s → b → b, b)

, (Idf → s, Idf → s, b → s)

)

foldBlock :: BlockAlgebra b s → Block → b

foldBlock ((cons, empty), (dcl , use, blk)) = fold

where

fold (s : b) = cons (foldS s) (fold b)

fold [] = empty

foldS (Dcl x) = dcl x

foldS (Use x) = use x

foldS (Blk b) = blk (fold b)

5.5.2. Generating code

The goal of this section is to generate code from a block. The code consists of a

sequence of instructions. There are three types of instructions.

• Enter (l , c): enters the l ’th nested block in which c local variables are declared.

• Leave (l , c): leaves the l ’th nested block in which c local variables were declared.

• Access (l , c): accesses the c’th variable of the l ’th nested block.

The code generated for the above example looks as follows.

[ Enter (0, 2), Access (0, 0)

, Enter (1, 2), Access (1, 1), Access (0, 1), Access (1, 0), Leave (1, 2)

98

5.5. Block structured languages

, Access (0, 1), Leave (0, 2)

]

Note that we start numbering levels (l ) and counts (c) (which are sometimes called

displacements) from 0. The abstract syntax of the code to be generated is described

by the following datatype.

type Count = Int

type Level = Int

type Variable = (Level , Count)

type BlockInfo = (Level , Count)

data Instruction = Enter BlockInfo

[ Leave BlockInfo

[ Access Variable

type Code = [Instruction]

The function ab2ac, which generates abstract code (a value of type Code) from an

abstract block (a value of type Block), uses a compositional function block2Code.

For all syntactic constructs of Statement and Block we deﬁne appropriate semantic

actions on an algebra of computations. Here is a, somewhat simpliﬁed, description

of these semantic actions.

• Dcl : Every time we declare a local variable x we have to update the local

environment le of the block we are in by associating with x the current pair of

level and local-count (l , lc). Moreover we have to increment the local variable

count lc to lc + 1. Note that we do not generate any code for a declaration

statement. Instead we perform some computations which make it possible to

generate appropriate code for other statements.

dcl x (le, l , lc) = (le

, lc

)

where

le

= le ‘update‘ (x, (l , lc))

lc

= lc + 1

where function update is deﬁned in the AssociationList module.

• Use: Every time we use a local variable x we have to generate code cd

for it.

This code is of the form [Access (l , lc)]. The level–local-count pair (l , lc) of the

variable is looked up in the global environment e.

use x e = cd

where

cd

= [Access (l , c)]

(l , c) = e ? x

99

5. Compositionality

• Blk: Every time we enter a nested block we increment the global level l to l +1,

start with a fresh local variable count 0 and set the local environment of the

nested block we enter to the current global environment e. The computation for

the nested block results in a local variable count lcB and a local environment

leB. Furthermore we need to make sure that the global environment (the one

in which we look up variables) which is used by the computation for the nested

block is equal to leB. The code which is generated for the block is surrounded

by an appropriate [Enter lcB]-[Leave lcB] pair.

blk fB (e, l ) = cd

where

l

= l + 1

(leB, lcB, cdB) = fB (leB, l

, e, 0)

cd

= [Enter (l

, lcB)] ++ cdB ++ [Leave (l

, lcB)]

• []: No action need to be performed for an empty block.

• (:): For every nonempty block we perform the computation of the ﬁrst state-

ment of the block which, given a local environment le and local variable count

lc, results in a local environment leS and local variable count lcS. This

environment-count pair is then used by the computation of the rest of the

block to result in a local environment le

and local variable count lc

. The code

cd

**which is generated is the concatenation cdS ++cdB of the code cdS which is
**

generated for the ﬁrst statement and the code cdB which is generated for the

rest of the block.

cons fS fB (le, lc) = (le

, lc

, cd

)

where

(leS, lcS, cdS) = fS (le, lc)

(le

, lc

**, cdB) = fB (leS, lcS)
**

cd

= cdS ++ cdB

What does our actual computation type look like? For dcl we need three inherited

attributes: a global level, a local block environment and a local variable count. Two

of them: the local block environment and the local variable count are also synthesised

attributes. For use we need one inherited attribute: a global block environment, and

we compute one synthesised attribute: the generated code. For blk we need two

inherited attributes: a global block environment and a global level, and we com-

pute two synthesised attributes: the local variable count and the generated code.

Moreover there is one extra attribute: a local block environment which is both inher-

ited and synthesised. When processing the statements of a nested block we already

make use of the global block environment which we are synthesising (when looking

up variables). For cons we compute three synthesised attributes: the local block

environment, the local variable count and the generated code. Two of them, the

100

5.5. Block structured languages

local block environment and the local variable count are also needed as inherited

attributes. It is clear from the considerations above that the following types fulﬁll

our needs.

type BlockEnv = [(Idf , Variable)]

type GlobalEnv = (BlockEnv, Level )

type LocalEnv = (BlockEnv, Count)

The implementation of block2Code is now a straightforward translation of the actions

described above. Attributes which are not mentioned in those actions are added as

extra components which do not contribute to the functionality.

block2Code :: Block → GlobalEnv → LocalEnv → (LocalEnv, Code)

block2Code = foldBlock ((cons, empty), (dcl , use, blk))

where

cons fS fB (e, l ) (le, lc) = ((le

, lc

), cd

)

where

((leS, lcS), cdS) = fS (e, l ) (le, lc)

((le

, lc

**), cdB) = fB (e, l ) (leS, lcS)
**

cd

= cdS ++ cdB

empty (e, l ) (le, lc) = ((le, lc), [])

dcl x (e, l ) (le, lc) = ((le

, lc

), [])

where

le

= (x, (l , lc)) : le

lc

= lc + 1

use x (e, l ) (le, lc) = ((le, lc), cd

)

where

cd

= [Access (l , c)]

(l , c) = e ? x

blk fB (e, l ) (le, lc) = ((le, lc), cd

)

where

((leB, lcB), cdB) = fB (leB, l

) (e, 0)

l

= l + 1

cd

= [Enter (l

, lcB)] ++ cdB ++ [Leave (l

, lcB)]

The code generator starts with an empty local environment, a fresh level and a fresh

local variable count. The code is a synthesised attribute. The global environment

is an attribute which is both inherited and synthesised. When processing a block

we already use the global environment which we are synthesising (when looking up

variables).

ab2ac :: Block → Code

ab2ac b = [Enter (0, c)] ++ cd ++ [Leave (0, c)]

where

101

5. Compositionality

((e, c), cd) = block2Code b (e, 0) ([], 0)

aBlock

= [ Use "x", Dcl "x"

, Blk [Use "z", Use "y", Dcl "x", Dcl "z", Use "x"]

, Dcl "y", Use "y"]

? ab2ac aBlock

[Enter (0,2),Access (0,0)

,Enter (1,2),Access (1,1),Access (0,1),Access (1,0),Leave (1,2)

,Access (0,1),Leave (0,2)]

5.6. Exercises

Exercise 5.9. Consider your answer to Exercise 2.22, which gives an abstract syntax for

palindromes.

1. Deﬁne a type PalAlgebra that describes the type of the semantic actions that corre-

spond to the syntactic constructs of Pal .

2. Deﬁne the function foldPal , which describes how the semantics actions that correspond

to the syntactic constructs of Pal should be applied.

3. Deﬁne the functions a2cPal and aCountPal as foldPal ’s.

4. Deﬁne the parser pfoldPal which interprets its input in an arbitrary semantic PalAlgebra

without building the intermediate abstract syntax tree.

5. Describe the parsers pfoldPal m

1

and pfoldPal m

2

where m

1

and m

2

correspond to the

algebras of a2cPal and aCountPal respectively.

Exercise 5.10. Consider your answer to Exercise 2.23, which gives an abstract syntax for

mirror-palindromes.

1. Deﬁne the type MirAlgebra that describes the semantic actions that correspond to the

syntactic constructs of Mir.

2. Deﬁne the function foldMir, which describes how semantic actions that correspond to

the syntactic constructs of Mir should be applied.

3. Deﬁne the functions a2cMir and m2pMir as foldMir’s.

4. Deﬁne the parser pfoldMir, which interprets its input in an arbitrary semantic MirAlgebra

without building the intermediate abstract syntax tree.

5. Describe the parsers pfoldMir m

1

and pfoldMir m

2

where m

1

and m

2

correspond to

the algebras of a2cMir and m2pMir, respectively.

Exercise 5.11. Consider your answer to exercise 2.24, which gives an abstract syntax for

parity-sequences.

1. Deﬁne the type ParityAlgebra that describes the semantic actions that correspond to

the syntactic constructs of Parity.

2. Deﬁne the function foldParity, which describes how the semantic actions that corre-

spond to the syntactic constructs of Parity should be applied.

3. Deﬁne the function a2cParity as foldParity.

102

5.6. Exercises

Exercise 5.12. Consider your answer to Exercise 2.25, which gives an abstract syntax for

bit-lists.

1. Deﬁne the type BitListAlgebra that describes the semantic actions that correspond to

the syntactic constructs of BitList.

2. Deﬁne the function foldBitList, which describes how the semantic actions that corre-

spond to the syntactic constructs of BitList should be applied.

3. Deﬁne the function a2cBitList as a foldBitList.

4. Deﬁne the parser pfoldBitList, which interprets its input in an arbitrary semantic

BitListAlgebra without building the intermediate abstract syntax tree.

Exercise 5.13. The following grammar describes the concrete syntax of a simple block-

structured programming language

B → S R (block)

R → ; S R [ ε (rest)

S → D [ U [ N (statement)

D → x [ y (declaration)

U → X [ Y (usage)

N → ( B ) (nested block)

1. Deﬁne a datatype Block that describes the abstract syntax that corresponds to the

grammar. What is the abstract representation of x;(y;Y);X?

2. Deﬁne the type BlockAlgebra that describes the semantic actions that correspond to

the syntactic constructs of Block.

3. Deﬁne the function foldBlock, which describes how the semantic actions corresponding

to the syntactic constructs of Block should be applied.

4. Deﬁne the function a2cBlock, which converts an abstract block into a concrete one.

Write a2cBlock as a foldBlock

5. The function checkBlock tests whether or not each variable of a given abstract block

is declared before use (declared in the same or in a surrounding block).

103

5. Compositionality

104

6. Computing with parsers

Parsers produce results. For example, the parsers for travelling schemes given in

Chapter 4 return an abstract syntax, or an integer that represents the net travelling

time in minutes. The net travelling time is computed directly by inserting the cor-

rect semantic functions. Another way to compute the net travelling time is by ﬁrst

computing the abstract syntax, and then applying a function to the abstract syntax

that computes the net travelling time. This section shows several ways to compute

results using parsers:

• insert a semantic function in the parser;

• apply a fold to the abstract syntax;

• use a class instead of abstract syntax;

• pass an algebra to the parser.

6.1. Insert a semantic function in the parser

In Chapter 4 we have deﬁned two parsers: a parser that computes the abstract syntax

for a travelling schema, and a parser that computes the net travelling time. These

functions are obtained by inserting diﬀerent functions in the basic parser. If we want

to compute the total travelling time, we have to insert diﬀerent functions in the basic

parser. This approach works ﬁne for a small parser, but it has some disadvantages

when building a larger parser:

• semantics is intertwined with the parsing process;

• it is diﬃcult to locate all positions where semantic functions have to be inserted

in the parser.

6.2. Apply a fold to the abstract syntax

Instead of inserting operations in a basic parser, we can write a parser that parses

the input to an abstract syntax, and computes the desired result by applying a fold

to the abstract syntax.

An example of such an approach has been given in Section 5.2.2, where we deﬁned

two functions with the same functionality: nesting and nesting’; both compute

the maximum nesting depth in a string of parentheses. Function nesting is deﬁned

105

6. Computing with parsers

by inserting functions in the basic parser. Function nesting’ is deﬁned by applying

a fold to the abstract syntax. Each of these deﬁnitions has its own merits; we repeat

the main arguments below.

parens :: Parser Char Parentheses

parens = (\_ b _ d -> Match b d) <$>

open <*> parens <*> close <*> parens

<|> succeed Empty

nesting :: Parser Char Int

nesting = (\_ b _ d -> max (1+b) d) <$>

open <*> nesting <*> close <*> nesting

<|> succeed 0

nesting’ :: Parser Char Int

nesting’ = depthParentheses <$> parens

The ﬁrst deﬁnition (nesting) is more eﬃcient, because it does not build an inter-

mediate abstract syntax tree. On the other hand, it might be more diﬃcult to write

because we have to insert functions in the correct places in the basic parser. The ad-

vantage of the second deﬁnition (nesting’) is that we reuse both the parser parens,

which returns an abstract syntax tree, and the function depthParentheses (or the

function foldParentheses, which is used in the deﬁnition of depthParentheses),

which does recursion over an abstract syntax tree. The only thing we have to write

ourselves in the second deﬁnition is the depthParenthesesAlgebra. The disadvan-

tage of the second deﬁnition is that it builds an intermediate abstract syntax tree,

which is ‘ﬂattened’ by the fold. We want to avoid building the abstract syntax

tree altogether. To obtain the best of both worlds, we would like to write function

nesting’ and have our compiler ﬁgure out that it is better to use function nesting

in computations. The automatic transformation of function nesting’ into function

nesting is called deforestation (trees are removed). Some (very few) compilers are

clever enough to perform this transformation automatically.

6.3. Deforestation

Deforestation removes intermediate trees in computations. The previous section gives

an example of deforestation on the datatype Parentheses. This section sketches the

general idea.

Suppose we have a datatype AbstractTree

data AbstractTree = ...

106

6.4. Using a class instead of abstract syntax

From this datatype we construct an algebra and a fold, see Chapter 5.

type AbstractTreeAlgebra a = ...

foldAbstractTree :: AbstractTreeAlgebra a -> AbstractTree -> a

A parser for the datatype AbstractTree (which returns a value of AbstractTree)

has the following type:

parseAbstractTree :: Parser Symbol AbstractTree

where Symbol is some type of input symbols (for example Char). Suppose now that

we deﬁne a function p that parses an AbstractTree, and then computes some value

by folding with an algebra f over this tree:

p = foldAbstractTree f . parseAbstractTree

Then deforestation says that p is equal to the function parseAbstractTree in which

occurrences of the constructors of the datatype AbstractTree have been replaced

by the corresponding components of the algebra f. The following two sections each

describe a way to implement such a deforestated function.

6.4. Using a class instead of abstract syntax

Classes can be used to implement the deforestated or fused computation of a fold

with a parser. This gives a solution of the desired eﬃciency.

For example, for the language of parentheses, we deﬁne the following class:

class Parens a where

match :: a -> a -> a

empty :: a

Note that types of the functions in the class Parens correspond exactly to the two

types that occur in the type ParenthesesAlgebra. This class is used in a parser for

parentheses:

parens :: Parens a => Parser Char a

parens = (\_ b _ d -> match b d) <$>

open <*> parens <*> close <*> parens

<|> succeed empty

107

6. Computing with parsers

The algebra is implicit in this function: the only thing we know is that there exist

functions empty and match of the correct type; we know nothing about their imple-

mentation. To obtain a function parens that returns a value of type Parentheses

we create the following instance of the class Parens.

instance Parens Parentheses where

match = Match

empty = Empty

Now we can write:

?(parens :: Parser Char Parentheses) "()()"

[(Match Empty (Match Empty Empty), "")

,(Match Empty Empty, "()")

,(Empty, "()()")

]

Note that we have to supply the type of parens in this expression, otherwise Haskell

doesn’t know which instance of Parens to use. This is how we turn the implicit

‘class’ algebra into an explicit ‘instance’ algebra. Another instance of Parens can be

used to compute the nesting depth of parentheses:

instance Parens Int where

match b d = max (1+b) d

empty = 0

And now we can write:

?(parens :: Parser Char Int) "()()"

[(1, ""), (1, "()"), (0, "()()")]

So the answer depends on the type we want our function parens to have. This

also immediately shows a problem of this, otherwise elegant, approach: it does not

work if we want to compute two diﬀerent results of the same type, because Haskell

doesn’t allow you to deﬁne two (or more) instances with the same type. So once we

have deﬁned the instance Parens Int as above, we cannot use function parens to

compute, for example, the width (also an Int) of a string of parentheses.

108

6.5. Passing an algebra to the parser

6.5. Passing an algebra to the parser

The previous section shows how to implement a parser with an implicit algebra. Since

this approach fails when we want to deﬁne diﬀerent parsers with the same result type,

we make the algebras explicit. Thus we obtain the following deﬁnition of parens:

parens :: ParenthesesAlgebra a -> Parser Char a

parens (match,empty) = par where

par = (\_ b _ d -> match b d) <$>

open <*> par <*> close <*> par

<|> succeed empty

Note that it is now easy to deﬁne diﬀerent parsers with the same result type:

nesting, breadth :: Parser Char Int

nesting = parens (\b d -> max (1+b) d,0)

breadth = parens (\b d -> d+1,0)

109

6. Computing with parsers

110

7. Programming with

higher-order folds

Introduction

In the previous chapters we have seen that algebras play an important role when

describing the meaning of a recognised structure (a parse tree). For each recursive

datatype T we have a function foldT, and for each constructor of the datatype we

have a corresponding function as a component in the algebra. Chapter 5 introduces

a language in which local declarations are permitted. Evaluating expressions in this

language can be done by choosing an appropriate algebra. The domain of that algebra

is a higher order (data)type (a (data)type that contains functions). Unfortunately,

the resulting code comes as a surprise to many. In this chapter we will illustrate a

related formalism, which will make it easier to construct such involved algebras. This

related formalism is the attribute grammar formalism. We will not formally deﬁne

attribute grammars, but instead illustrate the formalism with some examples, and

give an informal deﬁnition.

We start with developing a somewhat unconventional way of looking at functional

programs, and especially those programs that use functions that recursively descend

over datatypes a lot. In our case one may think about these datatypes as abstract

syntax trees. When computing a property of such a recursive object (for example, a

program) we deﬁne two sets of functions: one set that describes how to recursively

visit the nodes of the tree, and one set of functions (an algebra) that describes what

to compute at each node when visited.

One of the most important steps in this process is deciding what the carrier type

of the algebras is going to be. Once this step has been taken, these types are a

guideline for further design steps. We will see that such carrier types may be functions

themselves, and that deciding on the type of such functions may not always be simple.

In this chapter we will present a view on recursive computations that will enable us

to “design” the carrier type in an incremental way. We will do so by constructing

algebras out of other algebras. In this way we deﬁne the meaning of a language in a

semantically compositional way.

We will start with the rep min example, which looks a bit artiﬁcial, and deals with

a non-interesting, highly speciﬁc problem. However, it has been chosen for its sim-

plicity, and to not distract our attention to speciﬁc, programming language related,

111

7. Programming with higher-order folds

data Tree = Leaf Int

| Bin Tree Tree deriving Show

type TreeAlgebra a = (Int -> a, a -> a -> a)

foldTree :: TreeAlgebra a -> Tree -> a

foldTree alg@(leaf, _ ) (Leaf i) = leaf i

foldTree alg@(_ , bin) (Bin l r) = bin (foldTree alg l)

(foldTree alg r)

Listing 7.1: rm.start.hs

semantic issues. The second example of this chapter demonstrates the techniques on

a larger example: a small compiler for part of a programming language.

Goals

In this chapter you will learn:

• how to write ‘circular’ functional programs, or ‘higher-order folds’;

• how to combine algebras;

• (informally) the concept of an attribute grammar.

7.1. The rep min problem

One of the famous examples in which the power of lazy evaluation is demonstrated is

the so-called rep min problem [2]. Many have wondered how this program achieves

its goal, since at ﬁrst sight it seems that it is impossible to compute anything with

this program. We will use this problem, and a sequence of diﬀerent solutions, to

build up an understanding of a whole class of such programs.

In Listing 7.1 we present the datatype Tree, together with its associated algebra.

The carrier type of an algebra is the type that describes the objects of the algebra.

We represent it by a type parameter of the algebra type:

type TreeAlgebra a = (Int -> a, a -> a -> a)

The associated evaluation function foldTree systematically replaces the constructors

Leaf and Bin by corresponding operations from the algebra alg that is passed as an

argument.

112

7.1. The rep min problem

minAlg :: TreeAlgebra Int

minAlg = (id, min :: Int->Int->Int)

rep_min :: Tree -> Tree

rep_min t = foldTree repAlg t

where m = foldTree minAlg t

repAlg = (const (Leaf m), Bin)

Listing 7.2: rm.sol1.hs

We now want to construct a function rep_min :: Tree -> Tree that returns a Tree

with the same “shape” as its argument Tree, but with the values in its leaves replaced

by the minimal value occurring in the original tree. For example,

?rep_min (Bin (Bin (Leaf 1) (Leaf 7)) (Leaf 11))

Bin (Bin (Leaf 1) (Leaf 1)) (Leaf 1)

7.1.1. A straightforward solution

A straightforward solution to the rep min problem consists of a function in which

foldTree is used twice: once for computing the minimal value of the leaf values,

and once for constructing the resulting Tree. The function rep_min that solves the

problem in this way is given in Listing 7.2. Notice that the variable m is a global

variable of the repAlg-algebra, that is used in the tree constructing call of foldTree.

One of the disadvantages of this solution is that in the course of the computation the

pattern matching associated with the inspection of the tree nodes is performed twice

for each node in the tree.

Although this solution as such is no problem, we will try to construct a solution that

calls foldTree only once.

7.1.2. Lambda lifting

We want to obtain a program for the rep min problem in which pattern matching is

used only once. Program Listing 7.3 is an intermediate step towards this goal. In

this program the global variable m has been removed and the second call of foldTree

does not construct a Tree anymore, but instead a function constructing a tree of type

Int -> Tree, which takes the computed minimal value as an argument. Notice how

we have emphasized the fact that a function is returned through some superﬂuous

notation: the ﬁrst lambda in the function deﬁnitions constituting the algebra repAlg

is required by the signature of the algebra, the second lambda, which could have been

113

7. Programming with higher-order folds

repAlg = ( \_ -> \m -> Leaf m

,\lfun rfun -> \m -> let lt = lfun m

rt = rfun m

in Bin lt rt

)

rep_min’ t = (foldTree repAlg t) (foldTree minAlg t)

Listing 7.3: rm.sol2.hs

infix 9 ‘tuple‘

tuple :: TreeAlgebra a -> TreeAlgebra b -> TreeAlgebra (a,b)

(leaf1, bin1) ‘tuple‘ (leaf2, bin2) = (\i -> (leaf1 i, leaf2 i)

,\l r -> (bin1 (fst l) (fst r)

,bin2 (snd l) (snd r)

)

)

min_repAlg :: TreeAlgebra (Int, Int -> Tree)

min_repAlg = (minAlg ‘tuple‘ repAlg)

rep_min’’ t = r m

where (m, r) = foldTree min_repAlg t

Listing 7.4: rm.sol3.hs

omitted, is there because the carrier set of the algebra contains functions of type Int

-> Tree. This process is done routinely by functional compilers and is known as

lambda-lifting.

7.1.3. Tupling computations

We are now ready to formulate a solution in which foldTree is called only once. Note

that in the last solution the two calls of foldTree don’t interfere with each other.

As a consequence we may perform both the computation of the tree constructing

function and the minimal value in one go, by tupling the results of the computations.

The solution is given in Listing 7.4. First a function tuple is deﬁned. This function

takes two TreeAlgebras as arguments and constructs a third one, which has as its

carrier tuples of the carriers of the original algebras.

114

7.1. The rep min problem

7.1.4. Merging tupled functions

In the next step we transform the type of the carrier set in the previous example,

(Int, Int->Tree), into an equivalent type Int -> (Int, Tree). This transforma-

tion is not essential here, but we use it to demonstrate that if we compute a cartesian

product of functions, we may transform that type into a new type in which we com-

pute only one function, which takes as its arguments the cartesian product of all

the arguments of the functions in the tuple, and returns as its result the cartesian

product of the result types. In our example the computation of the minimal value

may be seen as a function of type ()->Int. As a consequence the argument of the

new type is ((), Int), which is isomorphic to just Int, and the result type becomes

(Int, Tree).

We want to mention here too that the reverse is in general not true; given a function

of type (a, b) -> (c, d), it is in general not possible to split this function into two

functions of type a -> c and b -> d, which together achieve the same eﬀect. The

new version is given in Listing 7.5.

Notice how we have, in an attempt to make the diﬀerent rˆoles of the parameters

explicit, again introduced extra lambdas in the deﬁnition of the functions of the

algebra. The parameters after the second lambda are there because we construct

values in a higher order carrier set. The parameters after the ﬁrst lambda are there

because we deal with a TreeAlgebra. A curious step taken here is that part of

the result, in our case the value m, is passed back as an argument to the result of

(foldTree mergedAlg t). Lazy evaluation makes this work.

That such programs were possible came originally as a great surprise to many func-

tional programmers, especially to those who used to program in LISP or ML, lan-

guages that require arguments to be evaluated completely before a call is evaluated

(so-called strict evaluation in contrast to lazy evaluation). Because of this surprising

behaviour this class of programs became known as circular programs. Notice however

that there is nothing circular in this program. Each value is deﬁned in terms of other

values, and no value is deﬁned in terms of itself (as in ones=1:ones).

Finally, Listing 7.6 shows the version of this program in which the function foldTree

has been unfolded. Thus we obtain the original solution as given in Bird [2].

Concluding, we have systematically transformed a program that inspects each node

twice into an equivalent program that inspects each node only once. The resulting

solution passes back part of the result of a call as an argument to that same call.

Lazy evaluation makes this possible.

Exercise 7.1. The deepest front problem is the problem of ﬁnding the so-called front of a

tree. The front of a tree is the list of all nodes that are at the deepest level. As in the

rep min problem, the trees involved are elements of the datatype Tree, see Listing 7.1. A

straightforward solution is to compute the height of the tree and passing the result of this

function to a function frontAtLevel :: Tree -> Int -> [Int].

115

7. Programming with higher-order folds

mergedAlg :: TreeAlgebra (Int -> (Int,Tree))

mergedAlg = (\i -> \m -> (i, Leaf m)

,\lfun rfun -> \m -> let (lm,lt) = lfun m

(rm,rt) = rfun m

in (lm ‘min‘ rm

, Bin lt rt

)

)

rep_min’’’ t = r

where (m, r) = (foldTree mergedAlg t) m

Listing 7.5: rm.sol4.hs

rep_min’’’’ t = r

where (m, r) = tree t m

tree (Leaf i) = \m -> (i, Leaf m)

tree (Bin l r) = \m -> let (lm, lt) = tree l m

(rm, rt) = tree r m

in (lm ‘min‘ rm, Bin lt rt)

Listing 7.6: rm.sol5.hs

116

7.2. A small compiler

1. Deﬁne the functions height and frontAtLevel

2. Give the four diﬀerent solutions as deﬁned in the rep min problem.

Exercise 7.2. Redo the previous exercise for the highest front problem.

7.2. A small compiler

This section constructs a small compiler for (a part of) a small language. The com-

piler compiles this code into code for a hypothetical stack machine.

7.2.1. The language

The language we consider in this section has integers, booleans, function application,

and an if-then-else expression. A language with just these constructs is useless, and

you will extend the language in the exercises with some other constructs, which make

the language a bit more interesting. We take the following context-free grammar for

the concrete syntax of the language.

Expr0 → if Expr1 then Expr1 else Expr1 [ Expr1

Expr1 → Expr2 Expr2

∗

Expr2 → Int [ Bool

where Int generates integers, and Bool booleans. An abstract syntax for our language

is given in Listing 7.7. Note that we use a single datatype for the abstract syntax

instead of three datatypes (one for each nonterminal); this simpliﬁes the code a bit.

The Listing 7.7 also contains a deﬁnition of a fold and an algebra type for the abstract

syntax.

A parser for expressions is given in Listing 7.8.

7.2.2. A stack machine

In section 5.4.4 we have deﬁned a stack machine with which simple arithmetic ex-

pressions can be evaluated. Here we deﬁne a stack machine that has some more

instructions. The language of the previous section wil be compiled into code for this

stack machine in the following section.

The stack machine we will use has the following instructions:

• it can load an integer;

• it can load a boolean;

• given an argument and a function on the stack, it can call the function on the

argument;

117

7. Programming with higher-order folds

data ExprAS = If ExprAS ExprAS ExprAS

| Apply ExprAS ExprAS

| ConInt Int

| ConBool Bool deriving Show

type ExprASAlgebra a = (a -> a -> a -> a

,a -> a -> a

,Int -> a

,Bool -> a

)

foldExprAS :: ExprASAlgebra a -> ExprAS -> a

foldExprAS (iff,apply,conint,conbool) = fold

where fold (If ce te ee) = iff (fold ce) (fold te) (fold ee)

fold (Apply fe ae) = apply (fold fe) (fold ae)

fold (ConInt i) = conint i

fold (ConBool b) = conbool b

Listing 7.7: ExprAbstractSyntax.hs

• it can set a label in the code;

• given a boolean on the stack, it can jump to a label provided the boolean is

false;

• it can jump to a label (unconditionally).

The datatype for instructions is given in Listing 7.9.

7.2.3. Compiling to the stackmachine

How do we compile the diﬀerent expressions to stack machine code? We want to

deﬁne a function compile of type

compile :: ExprAS -> [InstructionSM]

• A ConInt i is compiled to a LoadInt i.

compile (ConInt i) = [LoadInt i]

• A ConBool b is compiled to a LoadBool b.

compile (ConBool b) = [LoadBool b]

118

7.2. A small compiler

sptoken :: String -> Parser Char String

sptoken s = (\_ b _ -> b) <$>

many (symbol ’ ’) <*> token s <*> many1 (symbol ’ ’)

boolean = const True <$> token "True" <|> const False <$> token "False"

parseExpr :: Parser Char ExprAS

parseExpr = expr0

where expr0 = (\a b c d e f -> If b d f) <$>

sptoken "if"

<*> parseExpr

<*> sptoken "then"

<*> parseExpr

<*> sptoken "else"

<*> parseExpr

<|> expr1

expr1 = chainl expr2 (const Apply <$> many1 (symbol ’ ’))

<|> expr2

expr2 = ConBool <$> boolean

<|> ConInt <$> natural

Listing 7.8: ExprParser.hs

data InstructionSM = LoadInt Int

| LoadBool Bool

| Call

| SetLabel Label

| BrFalse Label

| BrAlways Label

type Label = Int

Listing 7.9: InstructionSM.hs

119

7. Programming with higher-order folds

• An application Apply f x is compiled by ﬁrst compiling the argument x, then

the ‘function’ f (at the moment it is impossible to deﬁne functions in our

language, hence the quotes around ‘function’), and ﬁnally putting a Call on

top of the stack.

compile (Apply f x) = compile x ++ compile f ++ [Call]

• An if-then-else expression If ce te ee is compiled by ﬁrst compiling the con-

ditional expression ce. Then we jump to a label (which will be set before the

code of the else expression ee later) if the resulting boolean is false. Then we

compile the then expression te. After the then expression we always jump to

the end of the code of the if-then-else expression, for which we need another la-

bel. Then we set the label for the else expression, we compile the else expression

ee, and, ﬁnally, we set the label for the end of the if-then-else expression.

compile (If ce te ee) = compile ce

++ [BrFalse ?lab1]

++ compile te

++ [BrAlways ?lab2]

++ [SetLabel ?lab1]

++ compile ee

++ [SetLabel ?lab2]

Note that we use labels here, but where do these labels come from?

From the above description we see that we also need labels when compiling an ex-

pression. We add a label argument (an integer, used for the ﬁrst label in the compiled

code) to function compile, and we want function compile to return the ﬁrst unused

label. We change the type of function compile as follows:

compile :: ExprAS -> Label -> ([InstructionSM],Label)

type Label = Int

The four cases in the deﬁnition of compile have to take care of the labels. We obtain

the following deﬁnition of compile:

compile (ConInt i) = \l -> ([LoadInt i],l)

compile (ConBool b) = \l -> ([LoadBool b],l)

compile (Apply f x) = \l -> let (xc,l’) = compile x l

(fc,l’’) = compile f l’

in (xc ++ fc ++ [Call],l’’)

compile (If ce te ee) = \l -> let (cc,l’) = compile ce (l+2)

(tc,l’’) = compile te l’

(ec,l’’’) = compile ee l’’

120

7.2. A small compiler

compile = foldExprAS compileAlgebra

compileAlgebra :: ExprASAlgebra (Label -> ([InstructionSM],Label))

compileAlgebra = (\cce cte cee -> \l ->

let (cc,l’) = cce (l+2)

(tc,l’’) = cte l’

(ec,l’’’) = cee l’’

in ( cc

++ [BrFalse l]

++ tc

++ [BrAlways (l+1)]

++ [SetLabel l]

++ ec

++ [SetLabel (l+1)]

,l’’’

)

,\cf cx -> \l -> let (xc,l’) = cx l

(fc,l’’) = cf l’

in (xc ++ fc ++ [Call],l’’)

,\i -> \l -> ([LoadInt i],l)

,\b -> \l -> ([LoadBool b],l)

)

Listing 7.10: CompileExpr.hs

in ( cc

++ [BrFalse l]

++ tc

++ [BrAlways (l+1)]

++ [SetLabel l]

++ ec

++ [SetLabel (l+1)]

,l’’’

)

Function compile is a fold, the carrier type of its algebra is a function of type Label

-> ([InstructionSM],Label). The deﬁnition of function compile as a fold is given

in Listing 7.10.

Exercise 7.3 (no answer provided). Extend the code generation example by adding variables

to the datatype Expr.

Exercise 7.4 (no answer provided). Extend the code generation example by adding deﬁni-

tions to the datatype Expr too.

121

7. Programming with higher-order folds

7.3. Attribute grammars

In Section 7.1 we have written a program that solves the rep min problem. This

program computes the minimum of a tree, and it computes the tree in which all the

leaf values are replaced by the minimum value. The minimum is computed bottom-

up: it is synthesized from its children. The minimum value is then passed on to the

functions that build the tree with the minimum value in its leaves. These functions

receive the minimum value from their parent tree node: they inherit the minimum

value from their parent.

We can see the rep min computation as a computation on a value of type Tree,

on which two attributes are deﬁned: the minimum and result tree attributes. The

minimum is computed bottom-up, and is then passed down to the result tree, and is

therefore a synthesized and inherited attribute. The result tree is computed bottom-

up, and is hence a synthesized attribute.

The formalism in which it is possible to specify such attributes and computations

on datatypes or grammars is called attribute grammars, and was originally proposed

by Donald Knuth in [9]. Attribute grammars provide a solution for the systematic

description of the phases of the compiler that come after scanning and parsing. Al-

though they look diﬀerent from what we have encountered thus far and are probably

a little easier to write, they can straightforwardly be mapped onto a functional pro-

gram. The programs you have seen in this chapter could also have been obtained

by means of such a mapping from an attribute grammar speciﬁcation. Traditionally

such attribute grammars are used as the input of a compiler generator. Just as we

have seen how by introducing a suitable set of parsing combinators one may avoid

the use of a special parser generator and even gain a lot of ﬂexibility in extending

the grammatical formalism by introducing more complicated combinators, we have

shown how one can do without a special purpose attribute grammar processing sys-

tem. But, just as the concept of a context free grammar was useful in understanding

the fundamentals of parser combinators, understanding attribute grammars will help

signiﬁcantly in describing the semantic part of the recognition and compilation pro-

cess. This chapter does not further introduce attribute grammars, but they will

appear again in the course in implementing programming languages.

122

8. Regular Languages

Introduction

The ﬁrst phase of a compiler takes an input program, and splits the input into a

list of terminal symbols: keywords, identiﬁers, numbers, punctuation, etc. Regular

expressions are used for the description of the terminal symbols. A regular grammar

is a particular kind of context-free grammar that can be used to describe regular

expressions. Finite-state automata can be used to recognise sentences of regular

grammars. This chapter discusses all of these concepts, and is organised as follows.

Section 8.1 introduces ﬁnite-state automata. Finite-state automata appear in two

versions, nondeterministic and deterministic ones. Section 8.1.4 shows that a non-

deterministic ﬁnite-state automaton can be transformed into a deterministic ﬁnite-

state automaton, so you don’t have to worry about whether or not your automaton is

deterministic. Section 8.2 introduces regular grammars (context-free grammars of a

particular form), and regular languages. Furthermore, it shows their equivalence with

ﬁnite-state automata. Section 8.3 introduces regular expressions as ﬁnite descriptions

of regular languages and shows that such expressions are another, equivalent, tool

for regular languages. Finally, Section 8.4 gives some of the proofs of the results of

the previous sections.

Goals

After you have studied this chapter you will know that

• regular languages are a subset of context-free languages;

• it is not always possible to give a regular grammar for a context-free grammar;

• regular grammars, ﬁnite-state automata and regular expressions are three dif-

ferent tools for regular languages;

• regular grammars, ﬁnite-state automata and regular expressions have the same

expressive power;

• ﬁnite-state automata appear in two, equally expressive, versions: deterministic

and nondeterministic.

8.1. Finite-state automata

The classical approach to recognising sentences from a regular language uses ﬁnite-

state automata. A ﬁnite-state automaton can be viewed as a simple form of digital

123

8. Regular Languages

computer with only a ﬁnite number of states, no temporary storage, an input ﬁle but

only the possibility to read it, and a control unit which records the state transitions.

A rather limited medium, but a useful tool in many practical subproblems. A ﬁnite-

state automaton can easily be implemented by a function that takes time linear in

the length of its input, and constant space. This implies that problems that can be

solved by means of a ﬁnite-state automaton, can be implemented by means of very

eﬃcient programs.

8.1.1. Deterministic ﬁnite-state automata

Finite-state automata come in two ﬂavours: deterministic and nondeterministic. We

start with a description of deterministic ﬁnite-state automata, the simplest form of

automata.

Deﬁnition 8.1 (Deterministic ﬁnite-state automaton, DFA). A deterministic ﬁnite-

state automaton (DFA) is a 5-tuple (X, Q, d, S, F) where

deterministic

ﬁnite-state

automaton

• X is the input alphabet,

• Q is a ﬁnite set of states,

• d :: Q → X → Q is the state transition function,

• S ∈ Q is the start state,

• F ⊆ Q is the set of accepting states.

As an example, consider the DFA M

0

= (X, Q, d, S, F) with

X = ¦a, b, c¦

Q = ¦S, A, B, C¦

F = ¦C¦

where state transition function d is deﬁned by

d S a = C

d S b = A

d S c = S

d A a = B

d B c = C

For human beings, a ﬁnite-state automaton is more comprehensible in a graphical

representation. The following representation is customary: states are depicted as

the nodes in a graph; accepting states get a double circle; start states are explicitly

mentioned or indicated otherwise. The transition function is represented by the

edges: whenever d Q

i

x is a state Q

j

, then there is an arrow labelled x from Q

i

to Q

j

.

The input alphabet is implicit in the labels. For automaton M

0

above, the pictorial

representation is:

124

8.1. Finite-state automata

S

?>=< 89:;

ED GF

c

@A

a

b

C

?>=< 89:; /.-, ()*+

A

?>=< 89:;

a

B

?>=< 89:;

c

**Note that d is a partial function: for example d B a is not deﬁned. We can make d
**

into a total function by introducing a new ‘sink’ state, the result state of all undeﬁned

transitions. For example, in the above automaton we can introduce a sink state D

with d D x = D for all terminals x, and d E x = D for all states E and terminals

x for which d E x is undeﬁned. The sink state and the transitions from/to it are

almost always omitted.

The action of a DFA on an input string is described as follows: given a sequence w

of input symbols, w can be ‘processed’ symbol by symbol (from left to right) and —

depending on the speciﬁc input symbol — the DFA (initially in the start state) moves

to the state as determined by its state transition function. If no move is possible, the

automaton blocks. When the complete input has been processed and the DFA is in

one of its accepting states, then we say that w is accepted by the automaton.

accept

To illustrate the action of an DFA, we will show that the sentence bac is accepted

by M

0

. We do so by recording the successive conﬁgurations, i.e. the pairs of current

state and remaining input values.

(S, bac)

→

(A, ac)

→

(B, c)

→

(C, )

Because of the deterministic behaviour of a DFA the deﬁnition of acceptance by

a DFA is relatively easy. Informally, a sequence w ∈ X

∗

is accepted by a DFA

(X, Q, d, S, F), if it is possible, when starting the DFA in S, to end in an accepting

state after processing w. This operational description of acceptance is formalised in

the predicate dfa accept. The predicate will be derived in a top-down fashion, i.e. we

formulate the predicate in terms of (“smaller”) subcomponents and afterwards we

give solutions to the subcomponents.

125

8. Regular Languages

Suppose dfa is a function that reﬂects the behaviour of the DFA, i.e. a function which

given a transition function, a start state and a string, returns the unique state that is

reached after processing the string starting from the start state. Then the predicate

dfa accept is deﬁned by:

dfa accept :: X

∗

→ (Q → X → Q, Q, ¦Q¦) → Bool

dfa accept w (d, S, F) = (dfa d S w) ∈ F

It remains to construct a deﬁnition of function dfa that takes a transition function,

a start state, and a list of input symbols, and reﬂects the behaviour of a DFA. The

deﬁnition of dfa is straightforward

dfa :: (Q → X → Q) → Q → X

∗

→ Q

dfa d q = q

dfa d q (ax) = dfa d (d q a) x

Note that both the type and the deﬁnition of dfa match the pattern of the function

foldl , and it follows that the function dfa is actually identical to foldl .

dfa d q = foldl d q.

Deﬁnition 8.2 (Acceptance by a DFA). The sequence w ∈ X

∗

is accepted by DFA

(X, Q, d, S, F) if

dfa accept w (d, S, F)

where

dfa accept w (d, qs, fs) = dfa d qs w ∈ fs

dfa d qs = foldl d qs

Using the predicate dfa accept, the language of a DFA is deﬁned as follows.

Deﬁnition 8.3 (Language of a DFA). For DFA M = (X, Q, d, S, F), the language

of M, Ldfa(M), is deﬁned by

Ldfa(M) = ¦w ∈ X

∗

[ dfa accept w (d, S, F)¦

8.1.2. Nondeterministic ﬁnite-state automata

This subsection introduces nondeterministic ﬁnite-state automata and deﬁnes their

semantics, i.e. the language of a nondeterministic ﬁnite-state automaton.

The transition function of a DFA returns a state, which implies that for all terminal

symbols x and for all states t there can only be one edge starting in t labelled with

x. Sometimes it is convenient to have two or more edges labelled with the same

terminal symbol from a state. In these cases one can use a nondeterministic ﬁnite-

state automaton. Nondeterministic ﬁnite state automata are deﬁned as follows.

126

8.1. Finite-state automata

Deﬁnition 8.4 (Nondeterministic ﬁnite-state automaton, NFA). A nondeterministic

ﬁnite-state automaton (NFA) is a 5-tuple (X, Q, d, Q

0

, F), where

nondeterministic

ﬁnite-state

automaton

• X is the input alphabet,

• Q is a ﬁnite set of states,

• d :: Q → X → ¦Q¦ is the state transition function,

• Q

0

⊆ Q is the set of start states,

• F ⊆ Q is the set of accepting states.

An NFA diﬀers from a DFA in that there may be more than one start state and that

there may be more than one possible move for each state and input symbol. Here is

an example of an NFA:

S

?>=< 89:;

ED GF

c

@A

a

b

C

?>=< 89:; /.-, ()*+

A

?>=< 89:;

a

BC

ED

a

B

?>=< 89:;

c

**Note that this NFA is very similar to the DFA in the previous section: the only
**

diﬀerence is that there are two outgoing arrows labelled with a from state A. Thus

the DFA becomes an NFA.

Formally, this NFA is deﬁned as M

1

= (X, Q, d, Q

0

, F) with

X = ¦a, b, c¦

Q = ¦S, A, B, C¦

Q

0

= ¦S¦

F = ¦C¦

where state transition function d is deﬁned by

d S a = ¦C¦

d S b = ¦A¦

d S c = ¦S¦

d A a = ¦S, B¦

d B c = ¦C¦

Again d is a partial function, which can be made total by adding d D x = ¦ ¦ for all

states D and all terminal symbols x for which d D x is undeﬁned.

127

8. Regular Languages

Since an NFA can make an arbitrary (nondeterministic) choice for one of its possible

moves, we have to be careful in deﬁning what it means that a sequence is accepted

by an NFA. Informally, sequence w ∈ X

∗

is accepted by NFA (X, Q, d, Q

0

, F), if it

is possible, when starting the NFA in a state from Q

0

, to end in an accepting state

after processing w. This operational description of acceptance is formalised in the

predicate nfa accept.

Assume that we have a function, say nfa, which reﬂects the behaviour of the NFA.

That is a function which given a transition function, a set of start states and a string,

returns all possible states that can be reached after processing the string starting in

some start state. Then the predicate nfa accept can be expressed as

nfa accept :: X

∗

→ (Q → X → ¦Q¦, ¦Q¦, ¦Q¦) → Bool

nfa accept w (d, Q

0

, F) = nfa d Q

0

w ∩ F = ∅

Now it remains to ﬁnd a function nfa d qs of type X

∗

→ ¦Q¦ that reﬂects the

behaviour of the NFA. For lists of length 1 such a function, called deltas, is deﬁned

by

deltas :: (Q → X → ¦Q¦) → ¦Q¦ → X → ¦Q¦

deltas d qs a = ¦r [ q ∈ qs, r ∈ d q a¦

The behaviour of the NFA on X-sequences of arbitrary length follows from this “one

step” behaviour:

nfa :: (Q → X → ¦Q¦) → ¦Q¦ → X

∗

→ ¦Q¦

nfa d qs = qs

nfa d qs (ax) = nfa d (deltas d qs a) x

Again, it follows that nfa can be written as a foldl.

nfa d qs = foldl (deltas d) qs

This concludes the deﬁnition of predicate nfa accept. In summary we have derived

Deﬁnition 8.5 (Acceptance by an NFA). The sequence w ∈ X

∗

is accepted by NFA

(X, Q, d, Q

0

, F) if

nfa accept w (d, Q

0

, F)

where

nfa accept w (d, qs, fs) = nfa d qs w ∩ fs = ∅

nfa d qs = foldl (deltas d) qs

deltas d qs a = ¦r [ q ∈ qs, r ∈ d q a¦

Using the nfa accept-predicate, the language of an NFA is deﬁned by

128

8.1. Finite-state automata

Deﬁnition 8.6 (Language of an NFA). For NFA M = (X, Q, d, Q

0

, F), the language

of M, Lnfa(M), is deﬁned by

Lnfa(M) = ¦w ∈ X

∗

[ nfa accept w (d, Q

0

, F)¦

Note that it is computationally expensive to determine whether or not a list is an

element of the language of a nondeterministic ﬁnite-state automaton. This is due to

the fact that all possible transitions have to be tried in order to determine whether or

not the automaton can end in an accepting state after reading the input. Determining

whether or not a list is an element of the language of a deterministic ﬁnite-state

automaton can be done in time linear in the length of the input list, so from a

computational view, deterministic ﬁnite-state automata are preferable. Fortunately,

for each nondeterministic ﬁnite-state automaton there exists a deterministic ﬁnite-

state automaton that accepts the same language. We will show how to construct a

DFA from an NFA in subsection 8.1.4.

8.1.3. Implementation

This section describes how to implement ﬁnite state machines. We start with imple-

menting DFA’s. Given a DFA M = (X, Q, d, S, F), we deﬁne two datatypes:

data StateM = ... deriving Eq

data SymbolM = ...

where the states of M (the elements of the set Q) are listed as constructors of StateM,

and the symbols of M (the elements of the set X) are listed as constructors of SymbolM.

Furthermore, we deﬁne three values (one of which a function):

start :: StateM

delta :: SymbolM -> StateM -> StateM

finals :: [StateM]

Note that the ﬁrst two arguments of delta have changed places: this has been done in

order to be able to apply ‘partial evaluation’ later. The extended transition function

dfa and the accept function dfaAccept are now deﬁned by:

dfa :: [SymbolM] -> StateM

dfa = foldl (flip delta) start

dfaAccept :: [SymbolM] -> Bool

dfaAccept xs = elem (dfa xs) finals

129

8. Regular Languages

Given a list of symbols [x1,x2,...,xn], the computation of dfa [x1,x2,...,xn]

uses the following intermediate states:

start, delta x1 start, delta x2 (delta x1 start),...

This list of states is determined uniquely by the input [x1,x2,...,xn] and the start

state.

Since we want to use the same function names for diﬀerent automata, we introduce

the following class:

class Eq a => DFA a b where

start :: a

delta :: b -> a -> a

finals :: [a]

dfa :: [b] -> a

dfa = foldl (flip delta) start

dfaAccept :: [b] -> Bool

dfaAccept xs = elem (dfa xs) finals

Note that the functions dfa and dfaAccept are deﬁned once and for all for all in-

stances of the class DFA.

As an example, we give the implementation of the example DFA (called MEX here)

given in the previous subsection.

data StateMEX = A | B | C | S deriving Eq

data SymbolMEX = SA | SB | SC

So the state A is represented by A, and the symbol a is represented by SA, and similar

for the other states and symbols. The automaton is made an instance of class DFA

as follows:

instance DFA StateMEX SymbolMEX where

start = S

delta x S = case x of SA -> C

SB -> A

SC -> S

delta SA A = B

delta SC B = C

finals = [C]

130

8.1. Finite-state automata

We can improve the performance of the automaton (function) dfa by means of partial

evaluation. The main idea of partial evaluation is to replace computations that

partial evaluation

are performed often at run-time by a single computation that is performed only

once at compile-time. A very simple example is the replacement of the expression

if True then f1 else f2 by the expression f1. Partial evaluation applies to ﬁnite

automatons in the following way.

dfa [x1,x2,...,xn]

=

foldl (flip delta) start [x1,x2,...,xn]

=

foldl (flip delta) (delta x1 start) [x2,...,xn]

=

case x1 of

SA -> foldl (flip delta) (delta SA start) [x2,...,xn]

SB -> foldl (flip delta) (delta SB start) [x2,...,xn]

SC -> foldl (flip delta) (delta SC start) [x2,...,xn]

All these equalities are simple transformation steps for functional programs. Note

that the ﬁrst argument of foldl is always flip delta, and the second argument is

one of the four states S, A, B, or C (the result of delta). Since there are only a ﬁnite

number of states (four, to be precise), we can deﬁne a transition function for each

state:

dfaS, dfaA, dfaB, dfaC :: [Symbol] -> State

Each of these functions is a case expression over the possible input symbols.

dfaS [] = S

dfaS (x:xs) = case x of SA -> dfaC xs

SB -> dfaA xs

SC -> dfaS xs

dfaA [] = A

dfaA (x:xs) = case x of SA -> dfaB xs

dfaB [] = B

dfaB (x:xs) = case x of SC -> dfaC xs

dfaC [] = C

With this deﬁnition of the ﬁnite automaton, the number of steps required for com-

puting the value of dfaS xs for some list of symbols xs is reduced considerably.

131

8. Regular Languages

The implementation of NFA’s is similar to the implementation of DFA’s. The only

diﬀerence is that the transition and accept functions have to take care of sets (lists)

of states now. We will use the following class, in which we use some names that

also appear in the class DFA. This is a problem if the two classes appear in the same

module.

class Eq a => NFA a b where

start :: [a]

delta :: b -> a -> [a]

finals :: [a]

nfa :: [b] -> [a]

nfa = foldl (flip deltas) start

deltas :: b -> [a] -> [a]

deltas a = union . map (delta a)

nfaAccept :: [b] -> Bool

nfaAccept xs = intersect (nfa xs) finals /= []

Here, functions union and intersect are implemented as follows:

union :: Eq a => [[a]] -> [a]

union = nub . concat

nub :: Eq a => [a] -> [a]

nub = foldr (\x xs -> x:filter (/=x) xs) []

intersect :: Eq a => [a] -> [a] -> [a]

intersect xs ys = intersect’ (nub xs)

where intersect’ =

foldr (\x xs -> if x ‘elem‘ ys then x:xs else xs) []

8.1.4. Constructing a DFA from an NFA

Is it possible to express more languages by means of nondeterministic ﬁnite-state

automata than by deterministic ﬁnite-state automata? For each nondeterministic

automaton it is possible to give a deterministic ﬁnite-state automaton such that

both automata accept the same language, so the answer to the above question is no.

Before we give the formal proof of this claim, we illustrate the construction of a DFA

for an NFA in an example.

132

8.1. Finite-state automata

Consider the nondeterministic ﬁnite-state automaton corresponding with the example

grammar of the previous subsection.

S

?>=< 89:;

ED GF

c

@A

a

b

C

?>=< 89:; /.-, ()*+

A

?>=< 89:;

a

BC

ED

a

B

?>=< 89:;

c

**The nondeterminicity of this automaton appears in state A: two outgoing arcs of A
**

are labelled with an a. Suppose we add a new state D, with an arc from A to D

labelled a, and we remove the arcs labelled a from A to S and from A to B. Since D

is a merge of the states S and B we have to merge the outgoing arcs from S and B

into outgoing arcs of D. We obtain the following automaton.

S

?>=< 89:;

ED GF

c

@A

a

b

C

?>=< 89:; /.-, ()*+

A

?>=< 89:;

a

D

?>=< 89:;

BC @A

b

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

a,c

B

?>=< 89:;

c

**We omit the proof that the language of the latter automaton is equal to the language
**

of the former one. Although there is just one outgoing arc from A labelled with a,

this automaton is still nondeterministic: there are two outgoing arcs labelled with

c from D. We apply the same procedure as above. Add a new state E and an arc

labelled c from D to E, and remove the two outgoing arcs labelled c from D. Since

E is a merge of the states C and S we have to merge the outgoing arcs from C and

S into outgoing arcs of E. We obtain the following automaton.

S

?>=< 89:;

ED GF

c

@A

a

b

C

?>=< 89:; /.-, ()*+

A

?>=< 89:;

a

D

?>=< 89:;

BC @A

b

a

{

{

{

{

{

{

{

{

{

{

{

{

{

{

{

{

{

{

c

E

?>=< 89:; /.-, ()*+

a

c

_u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

u

ED BC

b

@A

B

?>=< 89:;

c

133

8. Regular Languages

Again, we do not prove that the language of this automaton is equal to the language

of the previous automaton, provided we add the state E to the set of accepting

states, which until now consisted just of state C. State E is added to the set of

accepting states because it is the merge of a set of states among which at least one

belongs to the set of accepting states. Note that in this automaton for each state, all

outgoing arcs are labelled diﬀerently, i.e. this automaton is deterministic. The DFA

constructed from the NFA above is the 5-tuple (X, Q, d, S, F) with

X = ¦a, b, c¦

Q = ¦S, A, B, C, D, E¦

F = ¦C, E¦

where transition function d is deﬁned by

d S a = C

d S b = A

d S c = S

d A a = D

d B c = C

d D a = C

d D b = A

d D c = E

d E a = C

d E b = A

d E c = S

This construction is called the ‘subset construction’. In general, the construction

works as follows. Suppose M = (X, Q, d, Q

0

, F) is a nondeterministic ﬁnite-state au-

tomaton. Then the ﬁnite-state automaton M

= (X

, Q

, d

, Q

0

, F

), the components

of which are deﬁned below, accepts the same language.

X

= X

Q

= subs Q

where subs returns all subsets of a set. subs Q is also called the powerset of Q. For

example,

subs :: ¦X¦ → ¦¦X¦¦

subs ¦A, B¦ = ¦¦ ¦, ¦A¦, ¦A, B¦, ¦B¦¦

For the other components of M

we deﬁne

d

q a = ¦t [ t ∈ d r a, r ∈ q¦

Q

0

= ¦Q

0

¦

F

= ¦p [ p ∩ F = ∅, p ∈ Q

¦

134

8.1. Finite-state automata

The proof of the following theorem is given in Section 8.4.

Theorem 8.7 (DFA for NFA). For every nondeterministic ﬁnite-state automaton M

there exists a ﬁnite-state automaton M

such that

Lnfa(M) = Ldfa(M

)

Theorem 8.7 enables us to freely switch between NFA’s and DFA’s. Equipped with

this knowledge we continue the exploration of regular languages in the following

section. But ﬁrst we show that the transformation from an NFA to a DFA is an

instance of partial evaluation.

8.1.5. Partial evaluation of NFA’s

Given a nondeterministic ﬁnite state automaton we can obtain a deterministic ﬁnite

state automaton not just by means of the above construction, but also by means of

partial evaluation.

Just as for function dfa, we can calculate as follows with function nfa.

nfa [x1,x2,...,xn]

=

foldl (flip deltas) start [x1,x2,...,xn]

=

foldl (flip deltas) (deltas x1 start) [x2,...,xn]

=

case x1 of

SA -> foldl (flip deltas) (deltas SA start) [x2,...,xn]

SB -> foldl (flip deltas) (deltas SB start) [x2,...,xn]

SC -> foldl (flip deltas) (deltas SC start) [x2,...,xn]

Note that the ﬁrst argument of foldl is always flip deltas, and the second argu-

ment is one of the six sets of states [S], [A], [B], [C], [B,S], [C,S] (the possible

results of deltas). Since there are only a ﬁnite number of results of deltas (six, to

be precise), we can deﬁne a transition function for each state:

nfaS, nfaA, nfaB, nfaC, nfaBS, nfaCS :: [Symbol] -> [State]

For example,

nfaA [] = A

nfaA (x:xs) = case x of

SA -> nfaBS xs

_ -> error "no transition possible"

135

8. Regular Languages

Each of these functions is a case expression over the possible input symbols. By

partially evaluating the function nfa we have obtained a function that is the imple-

mentation of the deterministic ﬁnite state automaton corresponding to the nondeter-

ministic ﬁnite state automaton.

8.2. Regular grammars

This section deﬁnes regular grammars, a special kind of context-free grammars. Sub-

section 8.2.1 gives the correspondence between nondeterministic ﬁnite-state automata

and regular grammars.

Deﬁnition 8.8 (Regular Grammar). A regular grammar G is a context free grammar

regular grammar

(T, N, R, S) in which all production rules in R are of one of the following two forms:

A → xB

A → x

with x ∈ T

∗

and A, B ∈ N. So in every rule there is at most one nonterminal, and

if there is a nonterminal present, it occurs at the end.

The regular grammars as deﬁned here are sometimes called right-regular grammars.

There is a symmetric deﬁnition for left-regular grammars.

Deﬁnition 8.9 (Regular Language). A regular language is a language that is gen-

regular language

erated by a regular grammar.

From the deﬁnition it is clear that each regular language is context-free. The question

is now: Is each context-free language regular? The answer is: No. There are context-

free languages that are not regular; an example of such a language is ¦a

n

b

n

[ n ∈ N¦.

To understand this, you have to know how to prove that a language is not regular.

Because of its subtlety, we postpone this kind of proofs until Chapter 9. Here it

suﬃces to know that regular languages form a proper subset of context-free languages

and that we will proﬁt from their speciality in the recognition process.

A ﬁrst similarity between regular languages and context-free languages is that both

are closed under union, concatenation and Kleene-star.

Theorem 8.10. Let L and M be regular languages, then

L ∪ M is regular

LM is regular

L

∗

is regular

136

8.2. Regular grammars

Proof. Let G

L

= (T, N

L

, R

L

, S

L

) and G

M

= (T, N

M

, R

M

, S

M

) be regular grammars

for L and M respectively, then

• for regular grammars, the well-known union construction for context-free gram-

mars is a regular grammar again;

• we obtain a regular grammar for LM if we replace, in G

L

, each production of

the form T → x and T → by T → xS

M

and T → S

M

, respectively;

• since L

∗

= ¦¦ ∪ LL

∗

, it follows from the above that there exists a regular

grammar for L

∗

.

In addition to these closure properties, regular languages are closed under intersection

and complement too. See the exercises. This is remarkable because context-free

languages are not closed under these operations. Recall the language L = L

1

∩ L

2

where L

1

= ¦a

n

b

n

c

m

[ n, m ∈ N¦ and L

2

= ¦a

n

b

m

c

m

[ n, m ∈ N¦.

As for context-free languages, there may exist more than one regular grammar for

a given regular language and these regular grammars may be transformed into each

other.

We conclude this section with a grammar transformation:

Theorem 8.11. For each regular grammar G there exists a regular grammar G

with

start-symbol S

such that

L(G) = L(G

)

and such that G

has no productions of the form U → V and W → , with W = S

.

In other words: every regular grammar can be transformed to a form where every

production has a nonempty terminal string in its right hand side (with a possible

exception for S → ).

The proof of this transformation is omitted, we only brieﬂy describe the construction

of such a regular grammar, and illustrate the construction with an example.

Given a regular grammar G, a regular grammar with the same language but without

productions of the form U → V and W → for all U, V , and all W = S is obtained

as follows. First, consider all pairs Y , Z of nonterminals of G such that Y

∗

⇒ Z.

Add productions Y → z to the grammar, with Z → z a production of the original

grammar, and z not a single nonterminal. Remove all productions U → V from

G. Finally, remove all productions of the form W → for W = S, and for each

production U → xW add the production U → x. The following example illustrates

this construction.

137

8. Regular Languages

Consider the following regular grammar G.

S → aA

S → bB

S → A

S → C

A → bB

A → S

A →

B → bB

B →

C → c

The grammar G

**of the desired form is constructed in 3 steps.
**

Step 1

Let G

equal G.

Step 2

Consider all pairs of nonterminals Y and Z. If Y

∗

⇒Z, add the productions Y → z to

G

**, with Z → z a production of the original grammar, and z not a single nonterminal.
**

Furthermore, remove all productions of the form U → V from G

. In the example

we remove the productions S → A, S → C, A → S, and we add the productions

S → bB and S → since S

∗

⇒A, and the production S → c since S

∗

⇒C, and the

productions A → aA and A → bB since A

∗

⇒ S, and the production A → c since

A

∗

⇒C. We obtain the grammar with the following productions.

S → aA

S → bB

S → c

S →

A → bB

A → aA

A → c

A →

B → bB

B →

C → c

This grammar generates the same language as G, and has no productions of the form

U → V . It remains to remove productions of the form W → for W = S.

138

8.2. Regular grammars

Step 3

Remove all productions of the form W → for W = S, and for each production

U → xW add the production U → x. Applying this transformation to the above

grammar gives the following grammar.

S → aA

S → bB

S → a

S → b

S → c

S →

A → bB

A → aA

A → a

A → b

A → c

B → bB

B → b

C → c

Each production in this grammar is of one of the desired forms: U → x or U → xV ,

and the language of the grammar G

**we thus obtain is equal to the language of
**

grammar G.

8.2.1. Equivalence of Regular grammars and Finite automata

In the previous section, we introduced ﬁnite-state automata. Here we show that

regular grammars and nondeterministic ﬁnite-state automata are two sides of one

coin.

We will prove the equivalence using theorem 8.7. The equivalence consists of two

parts, formulated in the theorems 8.12 and 8.13 below. The basis for both theorems

is the direct correspondence between a production A → xB and a transition

A

?>=< 89:;

x

B

?>=< 89:;

Theorem 8.12 (Regular grammar for NFA). For each NFA M there exists a regular

grammar G such that

Lnfa(M) = L(G)

139

8. Regular Languages

Proof. We will just sketch the construction, the formal proof can be found in the

literature. Let (X, Q, d, S, F) be a DFA for NFA M. Construct the grammar G =

(X, Q, R, S) where

• the terminals of the grammar are the input alphabet of the automaton;

• the nonterminals of the grammar are the states of the automaton;

• the start state of the grammar is the start state of the automaton;

• the productions of the grammar correspond to the automaton transitions:

a rule A → xB for each transition

A

?>=< 89:;

x

B

?>=< 89:;

a rule A → for each terminal state A.

In formulae:

R = ¦A → xB [ A, B ∈ Q, x ∈ X, d A x = B¦

∪ ¦A → [ A ∈ F¦

Theorem 8.13 (NFA for regular grammar). For each regular grammar G there exists

a nondeterministic ﬁnite-state automaton M such that

L(G) = Lnfa(M)

Proof. Again, we will just sketch the construction, the formal proof can be found in

the literature. The construction consists of two steps: ﬁrst we give a direct translation

of a regular grammar to an automaton and then we transform the automaton into a

suitable shape.

From a grammar to an automaton. Let G = (T, N, R, S) be a regular grammar

without productions of the form U → V and W → for W = S.

Construct NFA M = (X, Q, d, ¦S¦, F) where

• The input alphabet of the automaton are the nonempty terminal strings (!)

that occur in the rules of the grammar:

X = ¦x ∈ T

+

[ A, B ∈ N, A → xB ∈ R¦

∪ ¦x ∈ T

+

[ A ∈ N, A → x ∈ R¦

• The states of the automaton are the nonterminals of the grammar extended

with a new state N

f

.

Q = N ∪ ¦N

f

¦

140

8.2. Regular grammars

• The transitions of the automaton correspond to the grammar productions:

for each rule A → xB we get a transition

A

?>=< 89:;

x

B

?>=< 89:;

for each rule A → x with nonempty x, we get a transition

A

?>=< 89:;

x

N

f

GFED @ABC

In formulae: For all A ∈ N and x ∈ X:

d A x = ¦B [ B ∈ N, A → xB ∈ R¦

∪ ¦N

f

[ A → x ∈ R¦

• The ﬁnal states of the automaton are N

f

and possibly S, if S → is a grammar

production.

F = ¦N

f

¦ ∪ ¦S [ S → ∈ R¦

Lnfa(M) = L(G), because of the direct correspondence between derivation steps in

G and transitions in M.

Transforming the automaton to a suitable shape. There is a minor ﬂaw in the au-

tomaton given above: the grammar and the automaton have diﬀerent alphabets.

This shortcoming will be remedied by an automaton transformation which yields an

equivalent automaton with transitions labelled by elements of T (instead of T

∗

). The

transformation is relatively easy and is depicted in the diagram below. In order to

eliminate transition d q x = q

where x = x

1

x

2

. . . x

k+1

with k > 0 and x

i

∈ T for

all i, add new (nonﬁnal) states p

1

, . . . , p

k

to the existing ones and new transitions

d q x

1

= p

1

, d p

1

x

2

= p

2

, . . . , d p

k

x

k+1

= q

.

b

q

x

1

x

2

...x

k+1

b

q

b

q

x

1

b

p

1

x

2

b

p

2

b

p

k

x

k+1

b

q

- - - p p p p p p p p p p p p p p p p p p -

It is intuitively clear that the resulting automaton is equivalent to the original one.

Carry out this transformation for each M-transition d q x = q

with [x[ > 1 in order

to get an automaton for G with the same input alphabet T.

141

8. Regular Languages

8.3. Regular expressions

Regular expressions are a classical and convenient way to describe, for example, the

structure of terminal words. This section deﬁnes regular expressions, deﬁnes the

language of a regular expression, and shows that regular expressions and regular

grammars are equally expressive formalisms. We do not discuss implementations of

(datatypes and functions for matching) regular expressions; implementations can be

found in the literature [8, 6].

Deﬁnition 8.14 (RE

T

, regular expressions over alphabet T). The set RE

T

of regular

expressions over alphabet T is inductively deﬁned as follows: for regular expressions

regular expression

R, S

∅ ∈ RE

T

∈ RE

T

a ∈ RE

T

R +S ∈ RE

T

RS ∈ RE

T

R∗ ∈ RE

T

(R) ∈ RE

T

where a ∈ T. The operator + is associative, commutative, and idempotent; the

concatenation operator, written as juxtaposition (so x concatenated with y is denoted

by xy), is associative, and is the unit of it. In formulae this reads, for all regular

expressions R, S, and V ,

R + (S +U) = (R +S) +U

R +S = S +R

R +R = R

R(S U) = (RS) U

R = R (= R)

Furthermore, the star operator, ∗, binds stronger than concatenation, and concate-

nation binds stronger than +. Examples of regular expressions are:

(bc)∗ +∅

+ b(∗)

The language (i.e. the “semantics”) of a regular expression over T is a set of T-

sequences compositionally deﬁned on the structure of regular expressions. As fol-

lows.

142

8.3. Regular expressions

Deﬁnition 8.15 (Language of a regular expression). Function Lre :: RE

T

→ ¦T

∗

¦

returns the language of a regular expression. It is deﬁned inductively by:

Lre(∅) = ∅

Lre() = ¦¦

Lre(b) = ¦b¦

Lre(x +y) = Lre(x) ∪ Lre(y)

Lre(xy) = Lre(x) Lre(y)

Lre(x∗) = (Lre (x))

∗

Since ∪ is associative, commutative, and idempotent, set concatenation is associative

with ¦¦ as its unit, and function Lre is well deﬁned. Note that the language Lreb∗

is the set consisting of zero, one or more concatenations of b, i.e., Lre(b∗) = (¦b¦)

∗

.

As an example of a language of a regular expression, we compute the language of the

regular expression ( + bc)d.

Lre(( + bc)d)

=

(Lre( + bc)) (Lre(d))

=

(Lre() ∪ Lre(bc))¦d¦

=

(¦¦ ∪ (Lre(b))(Lre(c)))¦d¦

=

¦, bc¦ ¦d¦

=

¦d, bcd¦

Regular expressions are used to describe the tokens of a language. For example, the

list

if p then e1 else e2

contains six tokens, three of which are identiﬁers. An identiﬁer is an element in the

language of the regular expression

letter(letter + digit)∗

where

letter = a + b +. . . + z +

A + B +. . . + Z

digit = 0 + 1 +. . . + 9

143

8. Regular Languages

see subsection 2.3.1.

In the beginning of this section we claimed that regular expressions and regular

grammars are equivalent formalisms. We will prove this claim later, but ﬁrst we

illustrate the construction of a regular grammar out of a regular expressions in an

example. Consider the following regular expression.

R = a∗ + + (a + b)∗

We aim at a regular grammar G such that Lre(R) = L(G) and again we take a

top-down approach.

Suppose that nonterminal A generates the language Lre(a∗), nonterminal B generates

the language Lre(), and nonterminal C generates the language Lre((a + b)∗). Sup-

pose furthermore that the productions for A, B, and C satisfy the conditions imposed

upon regular grammars. Then we obtain a regular grammar G with L(G) = Lre(R)

by deﬁning

S → A

S → B

S → C

where S is the start-symbol of G. It remains to construct productions for nontermi-

nals A, B, and C.

• The nonterminal A with productions

A → aA

A →

generates the language Lre(a∗).

• Since Lre() = ¦¦, the nonterminal B with production

B →

generates the language ¦¦.

• Nonterminal C with productions

C → aC

C → bC

C →

generates the language Lre((a + b)∗).

For a speciﬁc example it is not diﬃcult to construct a regular grammar for a regular

expression. We now give the general result.

144

8.4. Proofs

Theorem 8.16 (Regular Grammar for Regular Expression). For each regular ex-

pression R there exists a regular grammar G such that

Lre(R) = L(G)

The proof of this theorem is given in Section 8.4.

To obtain a regular expression that generates the same language as a given regular

grammar we go via an automaton. Given a regular grammar G, we can use the

theorems from the previous sections to obtain a DFA D such that

L(G) = Ldfa(D)

So if we can obtain a regular expression for a DFA D, we have found a regular ex-

pression for a regular grammar. To obtain a regular expression for a DFA D, we

interpret each state of D as a regular expression deﬁned as the sum of the concate-

nation of outgoing terminal symbols with the resulting state. For our example DFA

we obtain:

S = aC +bA+cS

A = aB

B = cC

C =

It is easy to merge these four regular expressions into a single regular expression,

partially because this is a simple example. Merging the regular expressions obtained

from a DFA that may loop is more complicated, as we will brieﬂy explain in the proof

of the following theorem. In general, we have:

Theorem 8.17 (Regular Expression for Regular Grammar). For each regular gram-

mar G there exists a regular expression R such that

L(G) = Lre(R)

The proof of this theorem is given in Section 8.4.

8.4. Proofs

This section contains the proofs of some of the theorems given in this chapter.

145

8. Regular Languages

8.4.1. Proof of Theorem 8.7

Suppose M = (X, Q, d, Q

0

, F) is a nondeterministic ﬁnite-state automaton. Deﬁne

the ﬁnite-state automaton M

= (X

, Q

, d

, Q

0

, F

) as follows.

X

= X

Q

= subs Q

where subs returns the powerset of a set. For example,

subs ¦A, B¦ = ¦¦ ¦, ¦A¦, ¦A, B¦, ¦B¦¦

For the other components of M

we deﬁne

d

q a = ¦t [ t ∈ d r a, r ∈ q¦

Q

0

= Q

0

F

= ¦p [ p ∩ F = ∅, p ∈ Q

¦

We have

Lnfa(M)

= deﬁnition of Lnfa

¦w [ w ∈ X

∗

, nfa accept w (d, Q

0

, F)¦

= deﬁnition of nfa accept

¦w [ w ∈ X

∗

, (nfa d Q

0

w ∩ F) = ∅¦

= deﬁnition of F

¦w [ w ∈ X

∗

, nfa d Q

0

w ∈ F

¦

= assume nfa d Q

0

w = dfa d

Q

0

w

¦w [ w ∈ X

∗

, dfa d

Q

0

w ∈ F

¦

= deﬁnition of dfa accept

¦w [ w ∈ X

∗

, dfa accept w (d

, Q

0

, F

)¦

= deﬁnition of Ldfa

Ldfa(M

)

It follows that Lnfa(M) = Ldfa(M

) provided

nfa d Q

0

w = dfa d

Q

0

w (8.1)

We prove this equation as follows.

146

8.4. Proofs

nfa d Q

0

w

= deﬁnition of nfa

foldl (deltas d) Q

0

w

= Q

0

= Q

0

; assume deltas d = d

foldl d

Q

0

w

= deﬁnition of dfa

dfa d

Q

0

w

So if

deltas d = d

**equation (8.1) holds. This equality follows from the following calculation.
**

d

q a

= deﬁnition of d

¦t [ r ∈ q, t ∈ d r a¦

= deﬁnition of deltas

deltas d q a

8.4.2. Proof of Theorem 8.16

The proof of this theorem is by induction to the structure of regular expressions. For

the three base cases we argue as follows.

The regular grammar without productions generates the language Lre(∅). The regu-

lar grammar with the production S → generates the language Lre(). The regular

grammar with production S → b generates the language Lre(b).

For the other three cases the induction hypothesis is that there exist a regular gram-

mar with start-symbol S

1

that generates the language Lre(x), and a regular grammar

with start-symbol S

2

that generates the language Lre(y).

We obtain a regular grammar with start-symbol S that generates the language

Lre(x +y) by deﬁning

S → S

1

S → S

2

We obtain a regular grammar with start-symbol S that generates the language

Lre(xy) by replacing, in the regular grammar that generates the language Lre(x),

147

8. Regular Languages

each production of the form T → a and T → by T → aS

2

and T → S

2

, respec-

tively.

We obtain a regular grammar with start-symbol S that generates the language

Lre(x∗) by replacing, in the regular grammar that generates the language Lre(x),

each production of the form T → a and T → by T → aS and T → S, and by

adding the productions S → S

1

and S → , where S

1

is the start-symbol of the

regular grammar that generates the language Lre(x).

8.4.3. Proof of Theorem 8.17

In sections 8.1.4 and 8.2.1 we have shown that there exists a DFA D = (X, Q, d, S, F)

such that

L(G) = Ldfa(D)

So, if we can show that there exists a regular expression R such that

Ldfa(D) = Lre(R)

then the theorem follows.

Let D = (X, Q, d, S, F) be a DFA such that L(G) = Ldfa(D). We deﬁne a regular

expression R such that

Lre(R) = Ldfa(D)

For each state q ∈ Q we deﬁne a regular expression ¯ q, and we let R be

¯

S. We obtain

the deﬁnition of ¯ q by combining all pairs c and C such that d q c = C.

¯ q = if q / ∈ F

then foldl (+) ∅ [cC [ d q c = C]

else + foldl (+) ∅ [cC [ d q c = C]

This gives a set of possibly mutually recursive equations, which we have to solve. In

solving these equations we use the fact that concatenation distributes over the sum

operator:

z(x +y) = zx +zy

and that recursion can be removed by means of the star operator ∗:

A = xA+z (where A / ∈ z) ≡ A = x∗z

The algorithm for solving such a set of equations is omitted.

We prove Ldfa(D) = Lre(

¯

S).

148

8.4. Proofs

Ldfa(D)

= deﬁnition of Ldfa

¦w [ w ∈ X

∗

, dfa accept w (d, S, F)¦

= deﬁnition of dfa accept

¦w [ w ∈ X

∗

, (dfa d S w) ∈ F¦

= deﬁnition of dfa

¦w [ w ∈ X

∗

, (foldl d S w) ∈ F¦

= assumption

¦w [ w ∈ Lre(

¯

S)¦

= equality for set-comprehensions

Lre(

¯

S)

It remains to prove the assumption in the above calculation: for w ∈ X

∗

,

(foldl d S w) ∈ F ≡ w ∈ Lre(

¯

S)

We prove a generalisation of this equation, namely, for arbitrary q,

(foldl d q w) ∈ F ≡ w ∈ Lre(¯ q)

This equation is proved by induction to the length of w. For the base case w = we

calculate as follows.

(foldl d q ) ∈ F

≡ deﬁnition of foldl

q ∈ F

≡ deﬁnition of ¯ q, E abbreviates the fold expression

¯ q = +E

≡ deﬁnition of Lre, deﬁnition of ¯ q

∈ Lre(¯ q)

The induction hypothesis is that for all lists w with [w[ n we have (foldl d q w) ∈

F ≡ w ∈ Lre(¯ q). Suppose ax is a list of length n+1.

(foldl d q (ax)) ∈ F

≡ deﬁnition of foldl

(foldl d (d q a) x) ∈ F

≡ induction hypothesis

x ∈ Lre(

¯

d q a)

≡ deﬁnition of ¯ q, D is deterministic

ax ∈ Lre(¯ q)

149

8. Regular Languages

Summary

This chapter discusses methods for recognising sentences from regular languages, and

introduces several concepts related to describing and recognising regular languages.

Regular languages are used for describing simple languages like ‘the language of

identiﬁers’ and ‘the language of keywords’ and regular expressions are convenient for

the description of regular languages. The straightforward translation of a regular

expression into a recogniser for the language of that regular expression results in a

recogniser that is often very ineﬃcient. By means of (non)deterministic ﬁnite-state

automata we construct a recogniser that requires time linear in the length of the

input list for recognising an input list.

8.5. Exercises

Exercise 8.1. Given a regular grammar G for language L, construct a regular grammar for

L

∗

.

Exercise 8.2. Transform the grammar with the following productions to a grammar without

productions of the form U → V and W → with W = S.

S → aA

S → A

A → aS

A → B

B → C

B →

C → cC

C → a

Exercise 8.3. Suppose that the state transition function d in the deﬁnition of a nondeter-

ministic ﬁnite-state automaton has the following type

d :: ¦Q¦ → X → ¦Q¦

Function d takes a set of states V and an element a, and returns the set of states that are

reachable from V with an arc labelled a. Deﬁne a function ndfsa of type

(¦Q¦ → X → ¦Q¦) → ¦Q¦ → X

∗

→ ¦Q¦

which given a function d, a set of start states, and an input list, returns the set of states in

which the nondeterministic ﬁnite-state automaton can end after reading the input list.

Exercise 8.4. Prove the converse of Theorem 8.7: show that for every deterministic ﬁnite-

state automaton M there exists a nondeterministic ﬁnite-state automaton M

such that

Lda(M) = Lna(M

)

Exercise 8.5. Regular languages are closed under complementation. Prove this claim. Hint:

construct a ﬁnite automaton for L out of an automaton for regular language L.

150

8.5. Exercises

Exercise 8.6. Regular languages are closed under intersection.

1. Prove this claim using the result from the previous exercise.

2. A direct proof form this claim is the following:

Let M

1

= (X, Q

1

, d

1

, S

1

, F

1

) and M

2

= (X, Q

2

, d

2

, S

2

, F

2

) be DFA’s for the regular

languages L

1

and L

2

respectively. Deﬁne the (product) automaton M = (X, Q

1

Q

2

, d, (S

1

, S

2

), F

1

F

2

) by d (q

1

, q

2

) x = (d

1

q

1

x, d

2

q

2

x) Now prove that Ldfa(M) =

Ldfa(M

1

) ∩ Ldfa(M

2

)

Exercise 8.7. Deﬁne nondeterministic ﬁnite-state automata that accept languages equal to

the languages of the following regular grammars.

1.

S → (A

S →

S → )A

A → )

A → (

2.

S → 0A

S → 0B

S → 1A

A → 1

A → 0

B → 0

B →

Exercise 8.8. Describe the language of the following regular expressions.

1. + b(∗)

2. (bc)∗ +∅

3. a(b∗) + c∗

Exercise 8.9. Prove that for arbitrary regular expressions R, S, and T the following equiv-

alences hold.

Lre(R(S +T)) = Lre(RS +RT)

Lre((R +S)T) = Lre(RT +ST)

Exercise 8.10. Give regular expressions S and R such that

Lre(RS) = Lre(SR)

Lre(RS) = Lre(SR)

Exercise 8.11. Give regular expressions V and W, with Lre(V ) = Lre(W), such that for

all regular expressions R and S with S = ∅

Lre(R(S +V )) = Lre(R(S +W))

V and W may be expressed in terms of R and S.

Exercise 8.12. Give a regular expression for the language that consists of all lists of zeros

and ones such that the segment 01 occurs nowhere in a list. Examples of sentences of this

language are 1110, and 000.

151

8. Regular Languages

Exercise 8.13. Give regular grammars that generate the language of the following regular

expressions.

1. ((a + bb)∗ + c)∗

2. a∗ + b∗ + ab

Exercise 8.14. Give regular expressions of which the language equals the language of the

following regular grammars.

1.

S → bA

S → aC

S →

A → bA

A →

B → aC

B → bB

B →

C → bB

C → b

2.

S → 0S

S → 1T

S →

T → 0T

T → 1S

Exercise 8.15. Construct for each of the following regular expressions a nondeterministic

ﬁnite-state automaton that accepts the sentences of the language. Transform the nondeter-

ministic ﬁnite-state automata into deterministic ﬁnite-state automata.

1. a∗ + b∗ + (ab)

2. (1 + (12) + 0)∗(30)∗

Exercise 8.16. Deﬁne regular expressions for the languages of the following deterministic

ﬁnite-state automata.

1. Start state is S.

S

?>=< 89:; '&%$ !"#

a

ED GF

b

@A

A

?>=< 89:;

b

ED GF

a

@A

B

?>=< 89:; '&%$ !"#

ED GF

a,b

@A

2. Start state is S.

S

?>=< 89:; '&%$ !"#

a

ED GF

b

@A

A

?>=< 89:;

b

BC @A

b

B

?>=< 89:;

BC @A

b

152

9. Pumping Lemmas:

the expressive power of languages

Introduction

In these lecture notes we have presented several ways to show that a language is

regular or context-free, but until now we did not give any means to show the non-

regularity or noncontext-freeness of a language. In this chapter we ﬁll this gap by

introducing the so-called Pumping Lemmas. For example, the pumping lemma for

regular languages says

IF language L is regular,

THEN it has the following property P: each suﬃciently long sentence w ∈ L has

a substring that can be repeated any number of times, every time yielding

another word of L

In applications, pumping lemmas are used in the contrapositive way. In the regular

case this means that one may conclude that L is not regular, if P does not hold.

Although the ideas behind pumping lemmas are very simple, a precise formulation

is not. As a consequence, it takes some eﬀort to get familiar with applying pumping

lemmas. Regular grammars and context-free grammars are part of the Chomsky

hierarchy, which consists of four diﬀerent kinds of grammars and their corresponding

languages. Pumping lemmas are used to show that the expressive power of the

diﬀerent elements of the Chomsky hierarchy is diﬀerent.

Goals

After you have studied this chapter you will be able to

• prove that a language is not regular;

• prove that a language is not context-free;

• identify languages and grammars as regular, context-free or none of these;

• give examples of languages that are not regular, and/or not context-free;

• explain the Chomsky hierarchy.

153

9. Pumping Lemmas: the expressive power of languages

9.1. The Chomsky hierarchy

In the preceding chapters we have seen contex-free grammars and regular grammars.

You may now wonder: is it possible to express any language with these grammars?

And: is it possible to obtain any context-free language from a regular grammar? The

answer to these questions is no. The Chomsky hierarchy explains why the answer

is no. The Chomsky hierarchy consists of four elements, each of which is explained

below.

9.1.1. Type-0 grammars

The most powerful grammars are the type-0 grammars, in which a production has

the form φ → ψ, where φ ∈ V

+

, ψ ∈ V

∗

, where V is the set of symbols of the

grammar. So the left-hand side of a production may consist of a list of nonterminal

and terminal symbols, instead of a single nonterminal as in context-free grammars.

Type-0 grammars have the same expressive power as Turing machines, and the lan-

guages described by these grammars are the recursive enumerable languages. This

expressive power comes at a cost though: it is very diﬃcult to parse sentences from

type-0 grammars.

9.1.2. Type-1 grammars

We can slightly restrict the form of the productions to obtain type-1 grammars. In

a type-1, or context-sensitive grammar, each production has the form φAψ → φδψ,

where φ, ψ ∈ V

∗

, δ ∈ V

+

. So a production describes how to rewrite a nonterminal

A, in the context of lists of symbols φ and ψ. A language generated by a context-

sensitive grammar is called a context-sensitive language. Although context-sensitive

grammars are less expressive than type-0 grammars, parsing is still very diﬃcult for

context-sensitive grammars.

9.1.3. Type-2 grammars

The type-2 grammars are the context-free grammars which you have seen a lot in the

preceding chapters. As the name says, in a context-free grammar you can rewrite

nonterminals without looking at the context in which they appear. Actually, it is

impossible to look at the context when rewriting symbols. Context-free grammars

are less expressive than context-sensitive grammars. This statement can be proved

using the pumping lemma for context-free languages.

However, it is much easier to parse sentences from context-free languages. In fact, a

sentence of length n can be parsed in time at most O(n

3

) (or even a bit less than this)

for any sentence of a context-free language. And if we put some more restrictions

154

9.2. The pumping lemma for regular languages

on context-free grammars (for example LL(1)), we obtain linear-time algorithms for

parsing sentences of such grammars.

9.1.4. Type-3 grammars

The type-3 grammars in the Chomsky hierarchy are the regular grammars. Any sen-

tence from a regular language can be processed by means of a ﬁnite-state automaton,

which takes linear time and constant space in the size of its input. The set of regular

languages is strictly smaller than the set of context-free languages, a fact we will

prove below by means of the pumping lemma for regular languages.

9.2. The pumping lemma for regular languages

In this section we give the pumping lemma for regular languages. The lemma gives

a property that is satisﬁed by all regular languages. The property is a statement

of the form: in sentences longer than a certain length a substring can be identiﬁed

that can be duplicated while retaining a sentence. The idea behind this property

is simple: regular languages are accepted by ﬁnite automata. Given a DFA for a

regular language, a sentence of the language describes a path from the start state to

some ﬁnite state. When the length of such a sentence exceeds the number of states,

then at least one state is visited twice; consequently the path contains a cycle that

can be repeated as often as desired. The proof of the following lemma is given in

Section 9.4.

Theorem 9.1 (Regular Pumping Lemma). Let L be a regular language. Then

there exists n ∈ N :

for all x, y, z : xyz ∈ L and [y[ n :

there exist u, v, w : y = uvw and [v[ > 0 :

for all i ∈ N : xuv

i

wz ∈ L

Note that [y[ denotes the length of the string y. Also remember that ‘for all x ∈ X : ’

is true if X = ∅, and ‘there exists x ∈ X : ’ is false if X = ∅.

155

9. Pumping Lemmas: the expressive power of languages

For example, consider the following automaton.

S

?>=< 89:;

a

A

?>=< 89:;

b

B

c

?>=< 89:;

C

?>=< 89:;

a

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

c

d

D

?>=< 89:; /.-, ()*+

This automaton accepts: abcabcd, abcabcabcd, and, in general, a(bca)

∗

bcd. The

statement of the pumping lemma amounts to the following. Take for n the number

of states in the automaton (5). Let x, y, z be such that xyz ∈ L, and [y[ n.

Then we know that in order to accept y, the above automaton has to pass at least

twice through state A. The part that is accepted in between the two moments the

automaton passes through state A can be pumped up to create sentences that contain

an arbitrary number of copies of the string v = bca.

This pumping lemma is useful in showing that a language does not belong to the

family of regular languages. Its application is typical of pumping lemmas in general;

they are used negatively to show that a given language does not belong to some

family.

Theorem 9.1 enables us to prove that a language L is not regular by showing that

for all n ∈ N :

there exist x, y, z : xyz ∈ L and [y[ n :

for all u, v, w : y = uvw and [v[ > 0 :

there exists i ∈ N : xuv

i

wz ∈ L

In all applications of the pumping lemma in this chapter, this is the formulation we

will use.

Note that if n = 0, we can choose y = , and since there is no v with [v[ > 0 such

that y = uvw, the statement above holds for all such v (namely none!).

As an example, we will prove that language L = ¦a

m

b

m

[ m 0¦ is not regular.

Let n ∈ N.

Take s = a

n

b

n

with x = , y = a

n

, and z = b

n

.

Let u, v, w be such that y = uvw with v = , that is, u = a

p

, v = a

q

and w = a

r

with

p +q +r = n and q > 0.

Take i = 2, then

156

9.3. The pumping lemma for context-free languages

xuv

2

wz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

p+2q+r

b

n

∈ L

⇐ p +q +r = n

n +q = n

⇐ arithmetic

q > 0

⇐ q > 0

true

Note that the language L = ¦a

m

b

m

[ m 0¦ is context-free, and together with the

fact that each regular grammar is also a context-free grammar it follows immedi-

ately that the set of regular languages is strictly smaller than the set of context-free

languages.

Note that here we use the pumping lemma (and not the proof of the pumping lemma)

to prove that a language is not regular. This kind of proof can be viewed as a kind of

game: ‘for all’ is about an arbitrary element which can be chosen by the opponent;

‘there exists’ is about a particular element which you may choose. Choosing the right

elements helps you ‘win’ the game, where winning means proving that a language is

not regular.

Exercise 9.1. Prove that the following language is not regular

¦a

k

2

[ k 0¦

Exercise 9.2. Show that the following language is not regular.

¦x [ x ∈ ¦a, b¦

∗

∧ nr a x < nr b x¦

where nr a x is the number of occurrences of a in x.

Exercise 9.3. Prove that the following language is not regular

¦a

k

b

m

[ k m 2k¦

Exercise 9.4. Show that the following language is not regular.

¦a

k

b

l

a

m

[ k > 5 ∧ l > 3 ∧ m l¦

9.3. The pumping lemma for context-free languages

The Pumping Lemma for context-free languages gives a property that is satisﬁed by

all context-free languages. This property is a statement of the form: in sentences ex-

ceeding a certain length, two sublists of bounded length can be identiﬁed that can be

157

9. Pumping Lemmas: the expressive power of languages

duplicated while retaining a sentence. The idea behind this property is the following.

Context-free languages are described by context-free grammars. For each sentence

in the language there exists a derivation tree. When sentences have a derivation tree

that is higher than the number of nonterminals, then at least one nonterminal will

occur twice in a node; consequently a subtree can be inserted as often as desired.

As an example of an application of the Pumping Lemma, consider the context-free

grammar with the following productions.

S → aAb

A → cBd

A → e

B → fAg

The following parse tree represents the derivation of the sentence acfegdb.

S

c

c

c

c

c

c

c

c

a

A

c

c

c

c

c

c

c

c

b

c

B

c

c

c

c

c

c

c

c

d

f A

g

e

If we replace the subtree rooted by the lower occurrence of nonterminal A by the

158

9.3. The pumping lemma for context-free languages

subtree rooted by the upper occurrence of A, we obtain the following parse tree.

S

c

c

c

c

c

c

c

c

a

A

c

c

c

c

c

c

c

c

b

c

B

c

c

c

c

c

c

c

c

d

f A

c

c

c

c

c

c

c

c

g

c

B

c

c

c

c

c

c

c

c

d

f A

g

e

This parse tree represents the derivation of the sentence acfcfegdgdb. Thus we ‘pump’

the derivation of sentence acfegdb to the derivation of sentence acfcfegdgdb. Repeat-

ing this step once more, we obtain a parse tree for the sentence

acfcfcfegdgdgdb

We can repeatedly apply this process to obtain derivation trees for all sentences of

the form

a(cf)

i

e(gd)

i

b

for i 0. The case i = 0 is obtained if we replace in the parse tree for the sentence

acfegdb the subtree rooted by the upper occurrence of nonterminal A by the subtree

rooted by the lower occurrence of A:

S

Ð

Ð

Ð

Ð

Ð

Ð

Ð

Ð

b

b

b

b

b

b

b

a

A b

e

This is a derivation tree for the sentence aeb. This step can be viewed as a negative

pumping step.

The proof of the following lemma is given in Section 9.4.

159

9. Pumping Lemmas: the expressive power of languages

Theorem 9.2 (Context-free Pumping Lemma). Let L be a context-free language.

Then

there exist c, d : c, d ∈ N :

for all z : z ∈ L and [z[ c :

there exist u, v, w, x, y : z = uvwxy and [vx[ > 0 and [vwx[ d :

for all i ∈ N : uv

i

wx

i

y ∈ L

The Pumping Lemma is a tool with which we prove that a given language is not

context-free. The proof obligation is to show that the property shared by all context-

free languages does not hold for the language under consideration.

Theorem 9.2 enables us to prove that a language L is not context-free by showing

that

for all c, d : c, d ∈ N :

there exists z : z ∈ L and [z[ c :

for all u, v, w, x, y : z = uvwxy and [vx[ > 0 and [vwx[ d :

there exists i ∈ N : uv

i

wx

i

y ∈ L

As an example, we will prove that the language T deﬁned by

T = ¦a

n

b

n

c

n

[ n > 0¦

is not context-free.

Proof. Let c, d ∈ N.

Take z = a

r

b

r

c

r

with r = max(c, d).

Let u, v, w, x, y be such that z = uvwxy, [vx[ > 0 and [vwx[ d

Note that our choice for r guarantees that substring vwx has one of the following

shapes:

• vwx consists of just a’s, or just b’s, or just c’s.

• vwx contains both a’s and b’s, or both b’s and c’s.

So vwx does not contain a’s, b’s, and c’s.

Take i = 0, then

• If vwx consists of just a’s, or just b’s, or just c’s, then it is impossible to write

the string uwy as a

s

b

s

c

s

for some s, since only the number of terminals of one

kind is decreased.

160

9.4. Proofs of pumping lemmas

• If vwx contains both a’s and b’s, or both b’s and c’s it lies somewhere on the

border between a’s and b’s, or on the border between b’s and c’s. Then the

string uwy can be written as

uwy = a

s

b

t

c

r

uwy = a

r

b

p

c

q

for some s, t, p, q, respectively. At least one of s and t or of p and q is less than

r. Again this list is not an element of T.

Exercise 9.5. Why does vwx not contain a’s, b’s, and c’s?

Exercise 9.6. Prove that the following language is not context-free

¦a

k

2

[ k 0¦

Exercise 9.7. Prove that the following language is not context-free

¦a

i

[ i is a prime number ¦

Exercise 9.8. Prove that the following language is not context-free

¦ww [ w ∈ ¦a, b¦

∗

¦

9.4. Proofs of pumping lemmas

This section gives the proof of the pumping lemmas.

9.4.1. Proof of the Regular Pumping Lemma, Theorem 9.1

Since L is a regular language, there exists a deterministic ﬁnite-state automaton D

such that L = Ldfa D.

Take for n the number of states of D.

Let s be an element of L with sublist y such that [y[ n, say s = xyz .

Consider the sequence of states D passes through while processing y. Since [y[ n,

this sequence has more than n entries, hence at least one state, say state A, occurs

twice.

Take u, v, w as follows

• u is the initial part of y processed until the ﬁrst occurrence of A,

• v is the (nonempty) part of y processed from the ﬁrst to the second occurrence

of A,

• w is the remaining part of y

161

9. Pumping Lemmas: the expressive power of languages

Note that D could have skipped processing v, and hence would have accepted xuwz .

Furthermore, D can repeat the processing in between the ﬁrst occurrence of A and

the second occurrence of A as often as desired, and hence it accepts xuv

i

wz for all

i ∈ N. Formally, a simple proof by induction shows that (∀i : i 0 : xuv

i

wz ∈ L).

9.4.2. Proof of the Context-free Pumping Lemma, Theorem 9.2

Let G = (T, N, R, S) be a context-free grammar such that L = L(G). Let m be the

length of the longest right-hand side of any production, and let k be the number of

nonterminals of G.

Take c = m

k

. In Lemma 9.3 below we prove that if z is a list with [z[ > c, then in

all derivation trees for z there exists a path of length at least k+1.

Let z ∈ L such that [z[ > c. Since grammar G has k nonterminals, there is at least

one nonterminal that occurs more than once in a path of length k+1 (which contains

k+2 symbols, of which at most one is a terminal, and all others are nonterminals)

of a derivation tree for z. Consider the nonterminal A that satisﬁes the following

requirements.

• A occurs at least twice in the path of length k+1 of a derivation tree for z.

Call the list corresponding to the derivation tree rooted at the lower A w, and call

the list corresponding to the derivation tree rooted at the upper A (which contains

the list w) vwx.

• A is chosen such that at most one of v and x equals .

• Finally, we suppose that below the upper occurrence of A no other nonterminal

that satisﬁes the same requirements occurs, that is, the two A’s are the lowest

pair of nonterminals satisfying the above requirements.

First we show that a nonterminal A satisfying the above requirements exists. We

prove this statement by contradiction. Suppose that for all nonterminals A that

occur twice on a path of length at least k+1 both v and x, the border lists of the list

vwx corresponding to the tree rooted at the upper occurrence of A, are both equal to

. Then we can replace the tree rooted at the upper A by the tree rooted at the lower

A without changing the list corresponding to the tree. Thus we can replace all paths

of length at least k+1 by a path of length at most k. But this contradicts Lemma

9.3 below, and we have a contradiction. It follows that a nonterminal A satisfying

the above requirements exists.

There exists a derivation tree for z in which the path from the upper A to the leaf

has length at most k+1, since either below A no nonterminal occurs twice, or there is

one or more nonterminal B that occurs twice, but the border lists v

and x

from the

list v

w

x

**corresponding to the tree rooted at the upper occurrence of nonterminal
**

B are empty. Since the lists v

and x

**are empty, we can replace the tree rooted at
**

162

9.4. Proofs of pumping lemmas

the upper occurrence of B by the tree rooted at the lower occurrence of B without

changing the list corresponding to the derivation tree. Since we can do this for all

nonterminals B that occur twice below the upper occurrence of A, there exists a

derivation tree for z in which the path from the upper A to the leaf has length at

most k+1. It follows from Lemma 9.3 below that the length of vwx is at most m

k+1

,

so we deﬁne d = m

k+1

.

Suppose z = uvwxy, that is, the list corresponding to the subtree to the left (right)

of the upper occurrence of A is u (y). This situation is depicted as follows.

S

r

r

r

r

r

r

r

r

r

r

r

r

r

Ò

Ò

Ò

Ò

Ò

Ò

Ò

Ò

`

`

`

`

`

`

`

`

v

v

v

v

v

v

v

v

v

v

v

v

v

u

A

r

r

r

r

r

r

r

r

r

r

r

r

r

Ò

Ò

Ò

Ò

Ò

Ò

Ò

Ò

`

`

`

`

`

`

`

`

v

v

v

v

v

v

v

v

v

v

v

v

v

y

v

A

c

c

c

c

c

c

c x

w

We prove by induction that (∀i : i 0 : uv

i

wx

i

y ∈ L). In this proof we apply the

tree substitution process described in the example before the lemma. For i = 0 we

have to show that the list uwy is a sentence in L. The list uwy is obtained if the

tree rooted at the upper A is replaced by the tree rooted at the lower A. Suppose

that for all i n we have uv

i

wx

i

y ∈ L. The list uv

i+1

wx

i+1

y is obtained if the tree

rooted at the lower A in the derivation tree for uv

i

wx

i

y ∈ L is replaced by the tree

rooted above it A. This proves the induction step.

The proof of the Pumping Lemma above frequently refers to the following lemma.

Theorem 9.3. Let G be a context-free grammar, and suppose that the longest right-

hand side of any production has length m. Let t be a derivation tree for a sentence

z ∈ L(G). If height t j, then [z[ m

j

.

Proof. We prove a slightly stronger result: if t is a derivation tree for a list z, but

the root of t is not necessarily the start-symbol, and height t j, then [z[ m

j

. We

prove this statement by induction on j.

For the base case, suppose j = 1. Then tree t corresponds to a single production in

G, and since the longest right-hand side of any production has length m, we have

that [z[ m = m

j

.

For the induction step, assume that for all derivation trees t of height at most j

we have that [z[ m

j

, where z is the list corresponding to t. Suppose we have

a tree t of height j+1. Let A be the root of t, and suppose the top of the tree

163

9. Pumping Lemmas: the expressive power of languages

corresponds to the production A → v in G. For all trees s rooted at the symbols of

v we have height s j, so the induction hypothesis applies to these trees, and the

lists corresponding to the trees rooted at the symbols of v all have length at most

m

j

. Since A → v is a production of G, and since the longest right-hand side of any

production has length m, the list corresponding to the tree rooted at A has length

at most mm

j

= m

j+1

, which proves the induction step.

Summary

This section introduces pumping lemmas. Pumping lemmas are used to prove that

languages are not regular or not context-free.

9.5. Exercises

Exercise 9.9. Show that the following language is not regular.

¦x [ x ∈ ¦0, 1¦

∗

∧ nr 1 x = nr 0 x¦

where function nr takes an element a and a list x, and returns the number of occurrences of

a in x.

Exercise 9.10. Consider the following language:

¦a

i

b

j

[ 0 i j¦

1. Is this language context-free? If it is, give a context-free grammar and prove that this

grammar generates the language. If it is not, why not?

2. Is this language regular? If it is, give a regular grammar and prove that this grammar

generates the language. If it is not, why not?

Exercise 9.11. Consider the following language:

¦wcw [ w ∈ ¦a, b¦

∗

¦

1. Is this language context-free? If it is, give a context-free grammar and prove that this

grammar generates the language. If it is not, why not?

2. Is this language regular? If it is, give a regular grammar and prove that this grammar

generates the language. If it is not, why not?

Exercise 9.12. Consider the grammar G with the following productions.

S → ¦A¦

S →

A → S

A → AA

A → ¦¦

1. Is this grammar

164

9.5. Exercises

• Context-free?

• Regular?

Why?

2. Give the language of G without referring to G itself. Prove that your description is

correct.

3. Is the language of G

• Context-free?

• Regular?

Why?

Exercise 9.13. Consider the grammar G with the following productions.

S →

S → 0

S → 1

S → S0

1. Is this grammar

• Context-free?

• Regular?

Why?

2. Give the language of G without referring to G itself. Prove that your description is

correct.

3. Is the language of G

• Context-free?

• Regular?

Why?

165

9. Pumping Lemmas: the expressive power of languages

166

10. LL Parsing

Introduction

This chapter introduces LL(1) parsing. LL(1) parsing is an eﬃcient (linear in the

length of the input string) method for parsing that can be used for all LL(1) gram-

mars. A grammar is LL(1) if at each step in a derivation the next symbol in the input

uniquely determines the production that should be applied. In order to determine

whether or not a grammar is LL(1), we introduce several kinds of grammar analyses,

such as determining whether or not a nonterminal can derive the empty string, and

determining the set of symbols that can appear as the ﬁrst symbol in a derivation

from a nonterminal.

Goals

After studying this chapter you will

• know the deﬁnition of LL(1) grammars;

• know how to parse a sentence from an LL(1) grammar;

• be able to apply diﬀerent kinds of grammar analyses in order to determine

whether or not a grammar is LL(1).

This chapter is organised as follows. Section 10.1 describes the background of LL(1)

parsing, and Section 10.2 describes an implementation in Haskell of LL(1) parsing

and the diﬀerent kinds of grammar analyses needed for checking whether or not a

grammar is LL(1).

10.1. LL Parsing: Background

In the previous chapters we have shown how to construct parsers for sentences of

context-free languages using combinator parsers. Since these parsers may backtrack,

the resulting parsers are sometimes a bit slow. There are several ways in which we

can put extra restrictions on context-free grammars such that we can parse sentences

of the corresponding languages eﬃciently. This chapter discusses one such restriction:

LL(1). Other restrictions, not discussed in these lecture notes are LR(1), LALR(1),

SLR(1), etc.

167

10. LL Parsing

10.1.1. A stack machine for parsing

This section presents a stack machine for parsing sentences of context-free grammars.

We will use this machine in the following subsections to illustrate why we need

grammar analysis.

The stack machine we use in this section diﬀers from the stack machines introduced

in Sections 5.4.4 and 7.2. A stack machine for a grammar G has a stack and an input,

and performs one of the following two actions.

1. Expand: If the top stack symbol is a nonterminal, it is popped from the stack

and a right-hand side from a production of G for the nonterminal is pushed onto

the stack. The production is chosen nondeterministically.

2. Match: If the top stack symbol is a terminal, then it is popped from the stack

and compared with the next symbol of the input sequence. If they are equal,

then this terminal symbol is ‘read’. If the stack symbol and the next input

symbol do not match, the machine signals an error, and the input sentence

cannot be accepted.

These actions are performed until the stack is empty. A stack machine for G accepts

an input if it can terminate with an empty input when starting with the start-symbol

from G on the stack.

For example, let grammar G be the grammar with the productions:

S → aS [ cS [ b

The stack machine for G accepts the input string aab because it can perform the

following actions (the ﬁrst component of the state (before the |in the picture below) is

the symbol stack and the second component of the state (after the |) is the unmatched

(remaining part of the) input string):

stack input

S aab

aS aab

S ab

aS ab

S b

b b

and end with an empty input. However, if the machine had chosen the production

S → cS in its ﬁrst step, it would have been stuck. So not all possible sequences

of actions from the state (S,aab) lead to an empty input. If there is at least one

sequence of actions that ends with an empty input on an input string, the input

168

10.1. LL Parsing: Background

string is accepted. In this sense, the stack machine is similar to a nondeterministic

ﬁnite-state automaton.

10.1.2. Some example derivations

This section gives three examples in which the stack machine for parsing is applied.

It turns out that, for all three examples, the nondeterministic stack machine can act

in a deterministic way by looking ahead one (or two) symbols of the sequence of input

symbols. Each of the examples exempliﬁes why diﬀerent kinds of grammar analyses

are useful in parsing.

The ﬁrst example

Our ﬁrst example is gramm1. The set of terminal symbols of gramm1 is ¦a, b, c¦, the

set of nonterminal symbols is ¦S, A, B, C¦, the start symbol is S; and the productions

are

S → cA [ b

A → cBC [ bSA [ a

B → cc [ Cb

C → aS [ ba

We want to know whether or not the string ccccba is a sentence of the language of

this grammar. The stack machine produces, amongst others, the following sequence,

corresponding with a leftmost derivation of ccccba.

stack input

S ccccba

cA ccccba

A cccba

cBC cccba

BC ccba

ccC ccba

cC cba

C ba

ba ba

a a

Starting with S the machine chooses between two productions:

S → cA [ b

169

10. LL Parsing

but, since the ﬁrst symbol of the string ccccba to be recognised is c, the only ap-

plicable production is the ﬁrst one. After expanding S a match-action removes the

leading c from ccccba and cA. So now we have to derive the string cccba from A.

The machine chooses between three productions:

A → cBC [ bSA [ a

and again, since the next symbol of the remaining string cccba to be recognised is

c, the only applicable production is the ﬁrst one. After expanding A a match-action

removes the leading c from cccba and cBC. So now we have to derive the string

ccba from BC. The top stack symbol of BC is B. The machine chooses between

two productions:

B → cc [ Cb

The next symbol of the remaining string ccba to be recognised is, once again, c. The

ﬁrst production is applicable, but the second production may be applicable as well.

To decide whether it also applies we have to determine the symbols that can appear

as the ﬁrst element of a string derived from B starting with the second production.

The ﬁrst symbol of the alternative Cb is the nonterminal C. From the productions

C → aS [ ba

it is immediately clear that a string derived from C starts with either an a or a b.

The set ¦a, b¦ is called the ﬁrst set of the nonterminal C. Since the next symbol

in the remaining string to be recognised is a c, the second production cannot be

applied. After expanding B and performing two match-actions it remains to derive

the string ba from C. The machine chooses between two productions C → aS and

C → ba. Clearly, only the second one applies, and, after two match-actions, leads to

success.

From the above derivation we conclude the following.

Deriving the sentence ccccba using gramm1 is a deterministic computa-

tion: at each step of the derivation there is only one applicable alternative

for the nonterminal on top of the stack.

Determinicity is obtained by looking at the set of ﬁrsts of the nonterminals.

The second example

A second example is the grammar gramm2 whose productions are

S → abA [ aa

A → bb [ bS

Now we want to know whether or not the string abbb is a sentence of the language of

this grammar. The stack machine produces, amongst others, the following sequence

170

10.1. LL Parsing: Background

stack input

S abbb

abA abbb

bA bbb

A bb

bb bb

b b

Starting with S the machine chooses between two productions:

S → abA [ aa

since both alternatives start with an a, it is not suﬃcient to look at the ﬁrst symbol a

of the string to be recognised. The problem is that the lookahead sets (the lookahead

set of a production N → α is the set of terminal symbols that can appear as the ﬁrst

symbol of a string that can be derived from N starting with the production N → α,

the deﬁnition is given in the following subsection) of the two productions for S both

contain a. However, if we look at the ﬁrst two symbols ab, then we ﬁnd that the

only applicable production is the ﬁrst one. After expanding and matching it remains

to derive the string bb from A. Again, looking ahead one symbol in the input string

does not give suﬃcient information for choosing one of the two productions

A → bb [ bS

for A. If we look at the ﬁrst two symbols bb of the input string, then we ﬁnd that the

ﬁrst production applies (and, after matching, leads to success). Each string derived

from A starting with the second production starts with a b and, since it is not possible

to derive a string starting with another b from S, the second production does not

apply.

From the above derivation we conclude the following.

Deriving the string abbb using gramm2 is a deterministic computation: at

each step of the derivation there is only one applicable alternative for the

nonterminal on the top of the stack.

Again, determinicity is obtained by analysing the set of ﬁrsts (of strings of length

2) of the nonterminals. Alternatively, we can left-factor the grammar to obtain a

grammar in which all productions for a nonterminal start with a diﬀerent terminal

symbol.

171

10. LL Parsing

The third example

A third example is grammar gramm3 with the following productions:

S → AaS [ B

A → cS [

B → b

Now we want to know whether or not the string acbab is an element of the language

of this grammar. The stack machine produces the following sequence

stack input

S acbab

AaS acbab

aS acbab

Starting with S the machine chooses between two productions:

S → AaS [ B

since each nonempty string derived from A starts with a c, and each nonempty string

derived from B starts with a b, there does not seem to be a candidate production

to start a leftmost derivation of acabb with. However, since A can also derive the

empty string, we can apply the ﬁrst production, and then apply the empty string for

A, producing aS which, as required, starts with an a. We do not explain the rest of

the leftmost derivation since it does not use any empty strings any more.

Nonterminal symbols that can derive the empty sequence will play a central role in

the grammar analysis problems which we will consider in Section 10.2.

From the above derivation we conclude the following.

Deriving the string acbab using gramm3 is a deterministic computation:

at each step of the derivation there is only one applicable alternative for

the nonterminal on the top of the stack.

Determinicity is obtained by analysing whether or not nonterminals can derive the

empty string, and which terminal symbols can follow upon a nonterminal in a deriva-

tion.

172

10.1. LL Parsing: Background

10.1.3. LL(1) grammars

The examples in the previous subsection show that the derivations of the example

sentences are deterministic, provided we can look ahead one or two symbols in the

input. An obvious question now is: for which grammars are all derivations determin-

istic? Of course, as the second example shows, the answer to this question depends

on the number of symbols we are allowed to look ahead. In the rest of this chapter we

assume that we may look 1 symbol ahead. A grammar for which all derivations are

deterministic with 1 symbol lookahead is called LL(1): Leftmost with a Lookahead

of 1. Since all derivations of sentences of LL(1) grammars are deterministic, LL(1)

is a desirable property of grammars.

To formalise this deﬁnition, we deﬁne lookAhead sets.

Deﬁnition 10.1 (lookAhead set). The lookahead set of a production N → α is

the set of terminal symbols that can appear as the ﬁrst symbol of a string that can

be derived from Nδ (where Nδ appears as a tail substring in a derivation from the

start-symbol) starting with the production N → α. So

lookAhead (N → α) = ¦x [ S

∗

⇒ γNδ ⇒ γαδ

∗

⇒ γxβ¦

For example, for the productions of gramm1 we have

lookAhead (S → cA) = ¦c¦

lookAhead (S → b) = ¦b¦

lookAhead (A → cBC) = ¦c¦

lookAhead (A → bSA) = ¦b¦

lookAhead (A → a) = ¦a¦

lookAhead (B → cc) = ¦c¦

lookAhead (B → Cb) = ¦a, b¦

lookAhead (C → aS) = ¦a¦

lookAhead (C → ba) = ¦b¦

We use lookAhead sets in the deﬁnition of LL(1) grammar.

Deﬁnition 10.2 (LL(1) grammar). A grammar G is LL(1) if all pairs of productions

of the same nonterminal have disjoint lookahead sets, that is: for all productions

N → α, N → β of G:

lookAhead (N → α) ∩ lookAhead (N → β) = ∅

Since all lookAhead sets for productions of the same nonterminal of gramm1 are

disjoint, gramm1 is an LL(1) grammar. For gramm2 we have:

lookAhead (S → abA) = ¦a¦

lookAhead (S → aa) = ¦a¦

lookAhead (A → bb) = ¦b¦

lookAhead (A → bS) = ¦b¦

173

10. LL Parsing

Here, the lookAhead sets for both nonterminals S and A are not disjoint, and it

follows that gramm2 is not LL(1). gramm2 is an LL(2) grammar, where an LL(k)

grammar for k 2 is deﬁned similarly to an LL(1) grammar: instead of one symbol

lookahead we have k symbols lookahead.

How do we determine whether or not a grammar is LL(1)? Clearly, to answer this

question we need to know the lookahead sets of the productions of the grammar. The

lookAhead set of a production N → α, where α starts with a terminal symbol x, is

simply x. But what if α starts with a nonterminal P, that is α = Pβ, for some β?

Then we have to determine the set of terminal symbols with which strings derived

from P can start. But if P can derive the empty string, we also have to determine

the set of terminal symbols with which a string derived from β can start. As you see,

in order to determine the lookAhead sets of productions, we are interested in

• whether or not a nonterminal can derive the empty string (empty);

• which terminal symbols can appear as the ﬁrst symbol in a string derived from

a nonterminal (firsts);

• and which terminal symbols can follow upon a nonterminal in a derivation

(follow).

In each of the following deﬁnitions we assume that a grammar G is given.

Deﬁnition 10.3 (Empty). Function empty takes a nonterminal N, and determines

whether or not the empty string can be derived from the nonterminal:

empty N = N

∗

⇒

For example, for gramm3 we have:

empty S = False

empty A = True

empty B = False

Deﬁnition 10.4 (First). The set of ﬁrsts of a nonterminal N is the set of terminal

symbols that can appear as the ﬁrst symbol of a string that can be derived from N:

firsts N = ¦x [ N

∗

⇒ xβ¦

For example, for gramm3 we have:

firsts S = ¦a, b, c¦

firsts A = ¦c¦

firsts B = ¦b¦

174

10.1. LL Parsing: Background

We could have given more restricted deﬁnitions of empty and firsts, by only looking

at derivations from the start-symbol, for example,

empty N = S

∗

⇒ αNβ

∗

⇒ αβ

but the simpler deﬁnition above suﬃces for our purposes.

Deﬁnition 10.5 (Follow). The follow set of a nonterminal N is the set of terminal

symbols that can follow on N in a derivation starting with the start-symbol S from

the grammar G:

follow N = ¦x [ S

∗

⇒ αNxβ¦

For example, for gramm3 we have:

follow S = ¦a¦

follow A = ¦a¦

follow B = ¦a¦

In the following section we will give programs with which lookahead, empty, firsts,

and follow are computed.

Exercise 10.1. Give the results of the function empty for the grammars gramm1 and gramm2.

Exercise 10.2. Give the results of the function firsts for the grammars gramm1 and gramm2.

Exercise 10.3. Give the results of the function follow for the grammars gramm1 and gramm2.

Exercise 10.4. Give the results of the function lookahead for grammar gramm3. Is gramm3

an LL(1) grammar ?

Exercise 10.5. Grammar gramm2 is not LL(1), but it can be transformed into an LL(1)

grammar by left factoring. Give this equivalent grammar gramm2’ and give the results of

the functions empty, first, follow and lookAhead on this grammar. Is gramm2’ an LL(1)

grammar?

Exercise 10.6. A non-leftrecursive grammar for Bit-Lists is given by the following gram-

mar (see your answer to Exercise 2.18):

L → BR

R → [ ,BR

B → 0 [ 1

Give the results of functions empty, firsts, follow and lookAhead on this grammar. Is this

grammar LL(1)?

175

10. LL Parsing

10.2. LL Parsing: Implementation

Until now we have written parsers with parser combinators. Parser combinators use

backtracking, and this is sometimes a cause of ineﬃciency. If a grammar is LL(1) we

do not need backtracking anymore: parsing is deterministic. We can use this fact by

either adjusting the parser combinators so that they don’t use backtracking anymore,

or by writing a special purpose LL(1) parsing program. We present the latter in this

section.

This section describes the implementation of a program that parses sentences of LL(1)

grammars. The program works for arbitrary context-free LL(1) grammars, so we ﬁrst

describe how to represent context-free grammars in Haskell. Another consequence of

the fact that the program parses sentences of arbitrary context-free LL(1) grammars is

that we need a generic representation of parse trees in Haskell. The second subsection

deﬁnes a datatype for parse trees in Haskell. The third subsection presents the

program that parses sentences of LL(1) grammars. This program assumes that the

input grammar is LL(1), so in the fourth subsection we give a function that determines

whether or not a grammar is LL(1). Both this and the LL(1) parsing function

use a function that determines the lookahead of a production. This function is

presented in the ﬁfth subsection. The last subsections of this section deﬁne functions

for determining the empty, ﬁrst, and follow symbols of a nonterminal.

10.2.1. Context-free grammars in Haskell

A context-free grammar may be represented by a pair: its start symbol, and its

productions. How do we represent terminal and nonterminal symbols? There are at

least two possibilities.

• The rigorous approach uses a datatype Symbol:

data Symbol a b = N a | T b

The advantage of this approach is that nonterminals and terminals are strictly

separated, the disadvantage is the notational overhead of constructors that

has to be carried around. However, a rigorous implementation of context-free

grammars should keep terminals and nonterminals apart, so this is the preferred

implementation. But in this section we will use the following implementation:

• class Eq s => Symbol s where

isT :: s -> Bool

isN :: s -> Bool

isT = not . isN

where isN and isT determine whether or not a symbol is a nonterminal or a

terminal, respectively. This notation is compact, but terminals and nontermi-

nals are no longer strictly separated, and symbols that are used as nonterminals

176

10.2. LL Parsing: Implementation

cannot be used as terminals anymore. For example, the type of characters is

made an instance of this class by deﬁning:

instance Symbol Char where

isN c = ’A’ <= c && c <= ’Z’

that is, capitals are nonterminals, and, by deﬁnition, all other characters are

terminals.

A context-free grammar is a value of the type CFG:

type CFG s = (s,[(s,[s])])

where the list in the second component associates nonterminals to right-hand sides.

So an element of this list is a production. For example, the grammar with produc-

tions

S → AaS [ B [ CB

A → SC [

B → A [ b

C → D

D → d

is represented as:

exGrammar :: CFG Char

exGrammar =

(’S’, [(’S’,"AaS"),(’S’,"B"),(’S’,"CB")

,(’A’,"SC"),(’A’,"")

,(’B’,"A"),(’B’,"b")

,(’C’,"D")

,(’D’,"d")

]

)

On this type we deﬁne some functions for extracting the productions, nonterminals,

terminals, etc. from a grammar.

start :: CFG s -> s

start = fst

prods :: CFG s -> [(s,[s])]

prods = snd

terminals :: (Symbol s, Ord s) => CFG s -> [s]

terminals = unions . map (filter isT . snd) . snd

177

10. LL Parsing

where unions :: Ord s => [[s]] -> [s] returns the union of the ‘sets’ of symbols

in the lists.

nonterminals :: (Symbol s, Ord s) => CFG s -> [s]

nonterminals = nub . map fst . snd

Here, nub :: Ord s => [s] -> [s] removes duplicates from a list.

symbols :: (Symbol s, Ord s) => CFG s -> [s]

symbols grammar =

union (terminals grammar) (nonterminals grammar)

nt2prods :: Eq s => CFG s -> s -> [(s,[s])]

nt2prods grammar s =

filter (\(nt,rhs) -> nt==s) (prods grammar)

where function union returns the set union of two lists (removing duplicates). For

example, we have

?start exGrammar

’S’

? terminals exGrammar

"abd"

10.2.2. Parse trees in Haskell

A parse tree is a tree, in which each internal node is labelled with a nonterminal,

and has a list of children (corresponding with a right-hand side of a production of

the nonterminal). It follows that parse trees can be represented as rose trees with

symbols, where the datatype of rose trees is deﬁned by:

data Rose a = Node a [Rose a] | Nil

The constructor Nil has been added to simplify error handling when parsing: when

a sentence cannot be parsed, the ‘parse tree’ Nil is returned. Strictly speaking it

should not occur in the datatype of rose trees.

178

10.2. LL Parsing: Implementation

10.2.3. LL(1) parsing

This section deﬁnes a function ll1 that takes a grammar and a terminal string as

input, and returns one tuple: a parse tree, and the rest of the inputstring that has

not been parsed. So ll1 takes a grammar, and returns a parser with Rose s as

its result type. As mentioned in the beginning of this section, our parser doesn’t

need backtracking anymore, since parsing with an LL(1) grammar is deterministic.

Therefore, the parser type is adjusted as follows:

type Parser b a = [b] -> (a,[b])

Using this parser type, the type of the function ll1 is:

ll1 :: (Symbol s, Ord s) => CFG s -> Parser s (Rose s)

Function ll1 is deﬁned in terms of two functions. Function isll1 :: CFG s ->

Bool is a function that checks whether or not a grammar is LL(1). And function

gll1 (for generalised LL(1)) produces a list of rose trees for a list of symbols. ll1 is

obtained from gll1 by giving gll1 the singleton list containing the start symbol of

the grammar as argument.

ll1 grammar input =

if isll1 grammar

then let ([rose], rest) = gll1 grammar [start grammar] input

in (rose, rest)

else error "ll1: grammar not LL(1)"

So now we have to implement functions isll1 and gll1. Function isll1 is imple-

mented in the following subsection. Function gll1 also uses two functions. Function

grammar2ll1table takes a grammar and returns the LL(1) table: the association list

that associates productions with their lookahead sets. And function choose takes a

terminal symbol, and chooses a production based on the LL(1) table.

gll1 :: (Symbol s, Ord s) => CFG s -> [s] -> Parser s [Rose s]

gll1 grammar =

let ll1table = grammar2ll1table grammar

-- The LL(1) table.

nt2prods nt = filter (\((n,l),r) -> n==nt) ll1table

-- nt2prods returns the productions for nonterminal

-- nt from the LL(1) table

selectprod nt t = choose t (nt2prods nt)

-- selectprod selects the production for nt from the

179

10. LL Parsing

-- LL(1) table that should be taken when t is the next

-- symbol in the input.

in \stack input ->

case stack of

[] -> ([], input)

(s:ss) ->

if isT s

then -- match

let (rts,rest) = gll1 grammar ss (tail input)

in if s == head input

then (Node s []: rts, rest)

-- The parse tree is a leaf (a node with

-- no children).

else ([Nil], input)

-- The input cannot be parsed

else -- expand

let t = head input

(rts,zs) = gll1 grammar (selectprod s t) input

-- Try to parse according to the production

-- obtained from the LL(1) table from s.

(rrs,vs) = gll1 grammar ss zs

-- Parse the rest of the symbols on the

-- stack.

in ((Node s rts): rrs, vs)

Functions grammar2ll1table and choose, which are used in the above function gll1,

are deﬁned as follows. These functions use function lookaheadp, which returns the

lookahead set of a production and is deﬁned in one of the following subsections.

grammar2ll1table :: (Symbol s, Ord s) => CFG s -> [((s,[s]),[s])]

grammar2ll1table grammar =

map (\x -> (x,lookaheadp grammar x)) (prods grammar)

choose :: Eq a => a -> [((b,c),[a])] -> c

choose t l =

let [((s,rhs), ys)] = filter (\((x,p),q) -> t ‘elem‘ q) l

in rhs

Note that function choose assumes that there is exactly one element in the association

list in which the element t occurs.

180

10.2. LL Parsing: Implementation

10.2.4. Implementation of isLL(1)

Function isll1 checks whether or not a context-free grammar is LL(1). It does this

by computing for each nonterminal of its argument grammar the set of lookahead

sets (one set for each production of the nonterminal), and checking that all of these

sets are disjoint. It is deﬁned in terms of a function lookaheadn, which computes the

lookahead sets of a nonterminal, and a function disjoint, which determines whether

or not all sets in a list of sets are disjoint. All sets in a list of sets are disjoint if

the length of the concatenation of these sets equals the length of the union of these

sets.

isll1 :: (Symbol s, Ord s) => CFG s -> Bool

isll1 grammar =

and (map (disjoint . lookaheadn grammar) (nonterminals grammar))

disjoint :: Ord s => [[s]] -> Bool

disjoint xss = length (concat xss) == length (unions xss)

Function lookaheadn computes the lookahead sets of a nonterminal by computing

all productions of a nonterminal, and computing the lookahead set of each of these

productions by means of function lookaheadp.

lookaheadn :: (Symbol s, Ord s) => CFG s -> s -> [[s]]

lookaheadn grammar =

map (lookaheadp grammar) . nt2prods grammar

10.2.5. Implementation of lookahead

Function lookaheadp takes a grammar and a production, and returns the lookahead

set of the production. It is deﬁned in terms of four functions. Each of the ﬁrst three

functions will be deﬁned in a separate subsection below, the fourth function is deﬁned

in this subsection.

• isEmpty :: (Ord s,Symbol s) => CFG s -> s -> Bool

Function isEmpty takes a grammar and a nonterminal and determines whether

or not the empty string can be derived from the nonterminal in the grammar.

(This function was called empty in Deﬁnition 10.3.)

• firsts :: (Ord s, Symbol s) => CFG s -> [(s,[s])]

Function firsts takes a grammar and computes the set of ﬁrsts of each symbol

(the set of ﬁrsts of a terminal is the terminal itself).

181

10. LL Parsing

• follow :: (Ord s, Symbol s) => CFG s -> [(s,[s])]

Function follow takes a grammar and computes the follow set of each nonter-

minal (so it associates a list of symbols with each nonterminal).

• lookSet :: Ord s =>

(s -> Bool) -> -- isEmpty

(s -> [s]) -> -- firsts?

(s -> [s]) -> -- follow?

(s, [s]) -> -- production

[s] -- lookahead set

Note that we use the operator ?, see Section 5.4.2, on the firsts and follow

association lists. Function lookSet takes a predicate, two functions that given

a nonterminal return the ﬁrst and follow set, respectively, and a production, and

returns the lookahead set of the production. Function lookSet is introduced

after the deﬁnition of function lookaheadp.

Now we deﬁne:

lookaheadp :: (Symbol s, Ord s) => CFG s -> (s,[s]) -> [s]

lookaheadp grammar =

lookSet (isEmpty grammar) ((firsts grammar)?) ((follow grammar)?)

We will exemplify the deﬁnition of function lookSet with the grammar exGrammar,

with the following productions:

S → AaS [ B [ CB

A → SC [

B → A [ b

C → D

D → d

Consider the production S → AaS. The lookahead set of the production contains

the set of symbols which can appear as the ﬁrst terminal symbol of a sequence of

symbols derived from A. But, since the nonterminal symbol A can derive the empty

string, the lookahead set also contains the symbol a.

Consider the production A → SC. The lookahead set of the production contains

the set of symbols which can appear as the ﬁrst terminal symbol of a sequence of

symbols derived from S. But, since the nonterminal symbol S can derive the empty

string, the lookahead set also contains the set of symbols which can appear as the

ﬁrst terminal symbol of a sequence of symbols derived from C.

Finally, consider the production B → A. The lookahead set of the production con-

tains the set of symbols which can appear as the ﬁrst terminal symbol of a sequence

182

10.2. LL Parsing: Implementation

of symbols derived from A. But, since the nonterminal symbol A can derive the

empty string, the lookahead set also contains the set of terminal symbols which can

follow the nonterminal symbol B in some derivation.

The examples show that it is useful to have functions firsts and follow in which,

for every nonterminal symbol n, we can look up the terminal symbols which can

appear as the ﬁrst terminal symbol of a sequence of symbols in some derivation from

n and the set of terminal symbols which can follow the nonterminal symbol n in a

sequence of symbols occurring in some derivation respectively. It turns out that the

deﬁnition of function follow also makes use of a function lasts which is similar to

the function firsts, but which deals with last nonterminal symbols rather than ﬁrst

terminal ones.

The examples also illustrate a control structure which will be used very often in

the following algorithms: we will fold over right-hand sides. While doing so we

compute sets of symbols for all the symbols of the right-hand side which we encounter

and collect them into a ﬁnal set of symbols. Whenever such a list for a symbol is

computed, there are always two possibilities:

• either we continue folding and return the result of taking the union of the set

obtained from the current element and the set obtained by recursively folding

over the rest of the right-hand side

• or we stop folding and immediately return the set obtained from the current

element.

We continue if the current symbol is a nonterminal which can derive the empty se-

quence and we stop if the current symbol is either a terminal symbol or a nonterminal

symbol which cannot derive the empty sequence. The following function makes this

statement more precise.

foldrRhs :: Ord s =>

(s -> Bool) ->

(s -> [s]) ->

[s] ->

[s] ->

[s]

foldrRhs p f start = foldr op start

where op x xs = f x ‘union‘ if p x then xs else []

The function foldrRhs is, of course, most naturally deﬁned in terms of the function

foldr. This function is somewhere in between a general purpose and an application

speciﬁc function (we could easily have made it more general though). In the exercises

we give an alternative characterisation of foldRhs. We will also need a function

scanrRhs which is like foldrRhs but accumulates intermediate results in a list. The

function scanrRhs is most naturally deﬁned in terms of the function scanr.

183

10. LL Parsing

scanrRhs :: Ord s =>

(s -> Bool) ->

(s -> [s]) ->

[s] ->

[s] ->

[[s]]

scanrRhs p f start = scanr op start

where op x xs = f x ‘union‘ if p x then xs else []

Finally, we will also need a function scanlRhs which does the same job as scanrRhs

but in the opposite direction. The easiest way to deﬁne scanlRhs is in terms of

scanrRhs and reverse.

scanlRhs p f start = reverse . scanrRhs p f start . reverse

We now return to the function lookSet.

lookSet :: Ord s =>

(s -> Bool) ->

(s -> [s]) ->

(s -> [s]) ->

(s,[s]) ->

[s]

lookSet p f g (nt,rhs) = foldrRhs p f (g nt) rhs

The function lookSet makes use of foldrRhs to fold over a right-hand side. As

stated above, the function foldrRhs continues processing a right-hand side only if

it encounters a nonterminal symbol for which p (so isEmpty in the lookSet in-

stance lookaheadp) holds. Thus, the set g nt (follow?nt in the lookSet instance

lookaheadp) is only important for those right-hand sides for nt that consist of non-

terminals that can all derive the empty sequence. We can now (assuming that the

deﬁnitions of the auxiliary functions are given) use the function lookaheadp instance

of lookSet to compute the lookahead sets of all productions.

look nt rhs = lookaheadp exGrammar (nt,rhs)

? look ’S’ "AaS"

dba

? look ’S’ "B"

dba

? look ’S’ "CB"

d

? look ’A’ "SC"

184

10.2. LL Parsing: Implementation

dba

? look ’A’ ""

ad

? look ’B’ "A"

dba

? look ’B’ "b"

b

? look ’C’ "D"

d

? look ’D’ "d"

d

It is clear from this result that exGrammar is not an LL(1)-grammar. Let us have

a closer look at how these lookahead sets are obtained. We will have to use the

functions firsts and follow and the predicate isEmpty for computing intermediate

results. The corresponding subsections explain how to compute these intermediate

results.

For the lookahead set of the production A → AaS we fold over the right-hand side

AaS. Folding stops at ’a’ and we obtain

firsts? ’A’ ‘union‘ firsts? ’a’

==

"dba" ‘union‘ "a"

==

"dba"

For the lookahead set of the production A → SC we fold over the right-hand side

SC. Folding stops at C since it cannot derive the empty sequence, and we obtain

firsts? ’S’ ‘union‘ firsts? ’C’

==

"dba" ‘union‘ "d"

==

"dba"

Finally, for the lookahead set of the production B → A we fold over the right-hand

side A In this case we fold over the complete (one element) list and and we obtain

firsts? ’A’ ‘union‘ follow? ’B’

==

"dba" ‘union‘ "d"

==

"dba"

185

10. LL Parsing

The other lookahead sets are computed in a similar way.

10.2.6. Implementation of empty

Many functions deﬁned in this chapter make use of a predicate isEmpty, which

tests whether or not the empty sequence can be derived from a nonterminal. This

subsection deﬁnes this function. Consider the grammar exGrammar. We are now only

interested in deriving sequences which contain only nonterminal symbols (since it is

impossible to derive the empty string if a terminal occurs). Therefore we only have

to consider the productions in which no terminal symbols appear in the right-hand

sides.

S → B [ CB

A → SC [

B → A

C → D

One can immediately see from those productions that the nonterminal A derives the

empty string in one step. To know whether there are any nonterminals which derive

the empty string in more than one step we eliminate the productions for A and we

eliminate all occurrences of A in the right hand sides of the remaining productions

S → B [ CB

B →

C → D

One can now conclude that the nonterminal B derives the empty string in two steps.

Doing the same with B as we did with A gives us the following productions

S → [ C

C → D

One can now conclude that the nonterminal S derives the empty string in three steps.

Doing the same with S as we did with A and B gives us the following productions

C → D

At this stage we can conclude that there are no more new nonterminals which derive

the empty string.

We now give the Haskell implementation of the algorithm described above. The al-

gorithm is iterative: it does the same steps over and over again until some desired

condition is met. For this purpose we use function fixedPoint, which takes a func-

tion and a set, and repeatedly applies the function to the set, until the set does not

change anymore.

186

10.2. LL Parsing: Implementation

fixedPoint :: Ord a => ([a] -> [a]) -> [a] -> [a]

fixedPoint f xs | xs == nexts = xs

| otherwise = fixedPoint f nexts

where nexts = f xs

fixedPoint f is sometimes called the ﬁxed-point of f. Function isEmpty determines

whether or not a nonterminal can derive the empty string. A nonterminal can derive

the empty string if it is a member of the emptySet of a grammar.

isEmpty :: (Symbol s, Ord s) => CFG s -> s -> Bool

isEmpty grammar = (‘elem‘ emptySet grammar)

The emptySet of a grammar is obtained by the iterative process described in the

example above. We start with the empty set of nonterminals, and at each step n of

the computation of the emptySet as a fixedPoint, we add the nonterminals that

can derive the empty string in n steps. Function emptyStepf adds a nonterminal if

there exists a production for the nonterminal of which all elements can derive the

empty string.

emptySet :: (Symbol s, Ord s) => CFG s -> [s]

emptySet grammar = fixedPoint (emptyStepf grammar) []

emptyStepf :: (Symbol s, Ord s) => CFG s -> [s] -> [s]

emptyStepf grammar set =

nub (map fst (filter (\(nt,rhs) -> all (‘elem‘ set) rhs)

(prods grammar)

) )

10.2.7. Implementation of ﬁrst and last

Function firsts takes a grammar, and returns for each symbol of the grammar (so

also the terminal symbols) the set of terminal symbols with which a sentence derived

from that symbol can start. The ﬁrst set of a terminal symbol is the terminal symbol

itself.

The set of ﬁrsts each symbol consists of that symbol itself, plus the (ﬁrst) symbols

that can be derived from that symbol in one or more steps. So the set of ﬁrsts can

be computed by an iterative process, just as the function isEmpty.

Consider the grammar exGrammar again. We start the iteration with

[(’S’,"S"),(’A’,"A"),(’B’,"B"),(’C’,"C"),(’D’,"D")

,(’a’,"a"),(’b’,"b"),(’d’,"d")

]

187

10. LL Parsing

Using the productions of the grammar we can derive in one step the following lists

of ﬁrst symbols.

[(’S’,"ABC"),(’A’,"S"),(’B’,"Ab"),(’C’,"D"),(’D’,"d")]

and the union of these two lists is

[(’S’,"SABC"),(’A’,"AS"),(’B’,"BAb"),(’C’,"CD"),(’D’,"Dd")

,(’a’,"a"),(’b’,"b"),(’d’,"d")]

In two steps we can derive

[(’S’,"SAbD"),(’A’,"ABC"),(’B’,"S"),(’C’,"d"),(’D’,"")]

and again we have to take the union of this list with the previous result. We repeat

this process until the list doesn’t change anymore. For exGrammar this happens

when:

[(’S’,"SABCDabd")

,(’A’,"SABCDabd")

,(’B’,"SABCDabd")

,(’C’,"CDd")

,(’D’,"Dd")

,(’a’,"a")

,(’b’,"b")

,(’d’,"d")

]

Function firsts is deﬁned as the fixedPoint of a step function that iterates the

ﬁrst computation one more step. The fixedPoint starts with the list that contains

all symbols paired with themselves.

firsts :: (Symbol s, Ord s) => CFG s -> [(s,[s])]

firsts grammar =

fixedPoint (firstStepf grammar) (startSingle grammar)

startSingle :: (Ord s, Symbol s) => CFG s -> [(s,[s])]

startSingle grammar = map (\x -> (x,[x])) (symbols grammar)

The step function takes the old approximation and performs one more iteration step.

At each of these iteration steps we have to add the start list with which the iteration

started again.

188

10.2. LL Parsing: Implementation

firstStepf :: (Ord s, Symbol s) =>

CFG s -> [(s,[s])] -> [(s,[s])]

firstStepf grammar approx = (startSingle grammar)

‘combine‘ (compose (first1 grammar) approx)

combine :: Ord s => [(s,[s])] -> [(s,[s])] -> [(s,[s])]

combine xs = foldr insert xs

where insert (a,bs) [] = [(a,bs)]

insert (a,bs) ((c,ds):rest)

| a == c = (a, union bs ds) : rest

| otherwise = (c,ds) : (insert (a,bs) rest)

compose :: Ord a => [(a,[a])] -> [(a,[a])] -> [(a,[a])]

compose r1 r2 = [(a, unions (map (r2?) bs)) | (a,bs) <- r1]

Finally, function first1 computes the direct ﬁrst symbols of all productions, taking

into account that some nonterminals can derive the empty string, and combines the

results for the diﬀerent nonterminals.

first1 :: (Symbol s, Ord s) => CFG s -> [(s,[s])]

first1 grammar =

map (\(nt,fs) -> (nt,unions fs))

(group (map (\(nt,rhs) -> (nt,foldrRhs (isEmpty grammar)

single

[]

rhs

) )

(prods grammar)

) )

where group groups elements with the same ﬁrst element together

group :: Eq a => [(a,b)] -> [(a,[b])]

group = foldr insertPair []

insertPair :: Eq a => (a,b) -> [(a,[b])] -> [(a,[b])]

insertPair (a,b) [] = [(a,[b])]

insertPair (a,b) ((c,ds):rest) =

if a==c then (c,(b:ds)):rest else (c,ds):(insertPair (a,b) rest)

function single takes an element and returns the set with the element, and unions

returns the union of a set of sets.

189

10. LL Parsing

Function lasts is deﬁned using function firsts. Suppose we reverse the right-hand

sides of all productions of a grammar. Then the set of ﬁrsts of this reversed grammar

is the set of lasts of the original grammar. This idea is implemented in the following

functions.

reverseGrammar :: Symbol s => CFG s -> CFG s

reverseGrammar =

\(s,al) -> (s,map (\(nt,rhs) -> (nt,reverse rhs)) al)

lasts :: (Symbol s, Ord s) => CFG s -> [(s,[s])]

lasts = firsts . reverseGrammar

10.2.8. Implementation of follow

The ﬁnal function we have to implement is the function follow, which takes a gram-

mar, and returns an association list in which nonterminals are associated to symbols

that can follow upon the nonterminal in a derivation. A nonterminal n is associ-

ated to a list containing terminal t in follow if n and t follow each other in some

sequence of symbols occurring in some leftmost derivation. We can compute pairs

of such adjacent symbols by splitting up the right-hand sides with length at least 2

and, using lasts and firsts, compute the symbols which appear at the end resp.

at the beginning of strings which can be derived from the left resp. right part of the

split alternative. Our grammar exGrammar has three alternatives with length at least

2: "AaS", "CB" and "SC". Function follow uses the functions firsts and lasts

and the predicate isEmpty for intermediate results. The previous subsections explain

how to compute these functions.

Let’s see what happens with the alternative "AaS". The lists of all nonterminal

symbols that can appear at the end of sequences of symbols derived from "A" and

"Aa" are "ADC" and "" respectively. The lists of all terminal symbols which can

appear at the beginning of sequences of symbols derived from "aS" and "S" are "a"

and "dba" respectively. Zipping together those lists shows that an ’A’, a ’D’ and

a ’C’ can be be followed by an ’a’. Splitting the alternative "CB" in the middle

produces sets of ﬁrsts and sets of lasts "CD" and "dba". Splitting the alternative

"SC" in the middle produces sets of ﬁrsts and sets of lasts "SDCAB" and "d". From

the ﬁrst pair we can see that a ’C’ and a ’D’ can be followed by a ’d’, a ’b’ and

an ’a’ From the second pair we see that an ’S’, a ’D’, a ’C’, an ’A’, and a ’B’

can be followed by a ’d’. Combining all these results gives:

[(’S’,"d"),(’A’,"ad"),(’B’,"d"),(’C’,"adb"),(’D’,"adb")]

The function follow uses the functions scanrAlt and scanlAlt. The lists produced

by these functions are exactly the ones we need: using the function zip from the

190

10.2. LL Parsing: Implementation

standard prelude we can combine the lists. For example: for the alternative "AaS"

the functions scanlAlt and scanrAlt produce the following lists:

[[], "ADC", "DC", "SDCAB"]

["dba", "a", "dba", []]

Only the two middle elements of both lists are important (they correspond to the

nontrivial splittings of "AaS"). Thus, we only have to consider alternatives of length

at least 2. We start the computation of follow with assigning the empty follow set

to each symbol:

follow :: (Symbol s, Ord s) => CFG s -> [(s,[s])]

follow grammar = combine (followNE grammar) (startEmpty grammar)

startEmpty grammar = map (\x -> (x,[])) (symbols grammar)

The real work is done by functions followNE and function splitProds. Function

followNE passes the right arguments on to function splitProds, and removes all

nonterminals from the set of ﬁrsts, and all terminals from the set of lasts. Function

splitProds splits the productions of length at least 2, and pairs the last nonterminals

with the ﬁrst terminals.

followNE :: (Symbol s, Ord s) => CFG s -> [(s,[s])]

followNE grammar = splitProds

(prods grammar)

(isEmpty grammar)

(isTfirsts grammar)

(isNlasts grammar)

where isTfirsts = map (\(x,xs) -> (x,filter isT xs)) . firsts

isNlasts = map (\(x,xs) -> (x,filter isN xs)) . lasts

splitProds :: (Symbol s, Ord s) =>

[(s,[s])] -> -- productions

(s -> Bool) -> -- isEmpty

[(s,[s])] -> -- terminal firsts

[(s,[s])] -> -- nonterminal lasts

[(s,[s])]

splitProds prods p fset lset =

map (\(nt,rhs) -> (nt,nub rhs)) (group pairs)

where pairs = [(l, f)

| rhs <- map snd prods

, length rhs >= 2

, (fs, ls) <- zip (rightscan rhs) (leftscan rhs)

191

10. LL Parsing

, l <- ls

, f <- fs

]

leftscan = scanlRhs p (lset?) []

rightscan = scanrRhs p (fset?) []

Exercise 10.7. Give the Rose tree representation of the parse tree corresponding to the

derivation of the sentence ccccba using grammar gramm1.

Exercise 10.8. Give the Rose tree representation of the parse tree corresponding to the

derivation of the sentence abbb using grammar gramm2’ deﬁned in the exercises of the previous

section.

Exercise 10.9. Give the Rose tree representation of the parse tree corresponding to the

derivation of the sentence acbab using grammar gramm3.

Exercise 10.10. In this exercise we will take a closer look at the functions foldrRhs and

scanrRhs which are the essential ingredients of the implementation of the grammar analysis

algorithms. From the deﬁnitions it is clear that grammar analysis is easily expressed via

a calculus for (ﬁnite) sets. A calculus for ﬁnite sets is implicit in the programs for LL(1)

parsing. Since the code in this module is obscured by several implementation details we will

derive the functions foldrRhs and scanrRhs in a stepwise fashion. In this derivation we will

use the following:

A (ﬁnite) set is implemented by a list with no duplicates. In order to construct a set, the

following operations may be used:

[] :: [a] the empty set of a-elements

union :: [a] → [a] → [a] the union of two sets

unions :: [[a]] → [a] the generalised union

single :: a → [a] the singleton function

These operations satisfy the well-known laws for set operations.

1. Deﬁne a function list2Set :: [a] → [a] which returns the set of elements occurring

in the argument.

2. Deﬁne list2Set as a foldr.

3. Deﬁne a function pref p :: [a] → [a] which given a list xs returns the set of

elements corresponding to the longest preﬁx of xs all of whose elements satisfy p.

4. Deﬁne a function prefplus p :: [a] → [a] which given a list xs returns the set of

elements in the longest preﬁx of xs all of whose elements satisfy p together with the

ﬁrst element of xs that does not satisfy p (if this element exists at all).

5. Deﬁne prefplus p as a foldr.

6. Show that prefplus p = foldrRhs p single [].

7. It can be shown that

foldrRhs p f [] = unions . map f . prefplus p

for all set-valued functions f. Give an informal description of the functions foldrRhs

p f [] and foldrRhs p f start.

192

10.2. LL Parsing: Implementation

8. The standard function scanr is deﬁned by

scanr f q0 = map (foldr f q0) . tails

where tails is a function which takes a list xs and returns a list with all tailsegments

(postﬁxes) of xs in decreasing length. The function scanrRhs is deﬁned is a similar

way

scanrRhs p f start = map (foldrRhs p f start) . tails

Give an informal description of the function scanrRhs.

Exercise 10.11. The computation of the functions empty and firsts is not restricted to

nonterminals only. For terminal symbols s these functions are deﬁned by

empty s = False

firsts s = ¦s¦

Using the deﬁnitions in the previous exercise, compute the following.

1. For the example grammar gramm1 and two of its productions A → bSA and B → Cb.

a) foldrRhs empty firsts [] bSA

b) foldrRhs empty firsts [] Cb

2. For the example grammar gramm3 and its production S → AaS

a) foldrRhs empty firsts [] AaS

b) scanrRhs empty firsts [] AaS

193

10. LL Parsing

194

11. LL versus LR parsing

The parsers that were constructed using parser combinators in Chapter 3 are non-

deterministic. They can recognize sentences described by ambiguous grammars by

returning a list of solutions, rather than just one solution; each solution contains a

parse tree and the part of the input that was left unprocessed. Nondeterministic

parsers are of the following type:

type Parser a b = [a] -> [ (b, [a]) ]

where a denotes the alphabet type, and b denotes the parse tree type.

In chapter 10 we turned to deterministic parsers. Here, parsers have only one result,

consisting of a parse tree of type b and the remaining part of the input string:

type Parser a b = [a] -> (b, [a])

Ambiguous grammars are not allowed anymore. Also, in the case of parsing input

containing syntax errors, we cannot return an empty list of successes anymore; intead

there should be some mechanism of raising an error.

There are various algorithms for deterministic parsing. They impose some additional

constraints on the form of the grammar: not every context free grammar is allowable

by these algorithms. The parsing algorithms can be modelled by making use of a stack

machine. There are two fundamentally diﬀerent deterministic parsing algorithms:

• LL parsing, also known as top-down parsing

• LR parsing, also known as bottom-up parsing

The ﬁrst ‘L’ in these acronyms stands for ‘Left-to-right’, that is, input in processed in

the order it is read. So these algorithms are both suitable for reading an input stream,

e.g. from a ﬁle. The second ‘L’ in ‘LL-parsing’ stands for ‘Leftmost derivation’, as

the parsing mimics doing a leftmost derivation of a sentence (see Section 2.4). The

‘R’ in ‘LR-parsing’ stands for ‘Rightmost derivation’ of the sentences. The parsing

algorithms are normally referred to as LL(k) or (LR(k), where k is the number of

unread symbols that the parser is allowed to ‘look ahead’. In most practical cases, k

is taken to be 1.

In Chapter 10 we studied the LL(1) parsing algorithm extensively, including the

so-called LL(1)-property to which grammars must abide. In this chapter we start

with an example application of the LL(1) algorithm. Next, we turn to the LR(1)

algorithm.

195

11. LL versus LR parsing

11.1. LL(1) parser example

The LL(1) parsing algorithm was implemented in Section 10.2.3. Function ll1 de-

ﬁned there takes a grammar and an input string, and returns a single result consisting

of a parse tree and rest string. The function was implemented by calling a generalized

function named gll1; generalized in the sense that the function takes an additional

list of symbols that need to be recognized. That generalization is then called with a

singleton list containing only the root nonterminal.

11.1.1. An LL(1) checker

Here, we deﬁne a slightly simpliﬁed version of the algorithm: it doesn’t return a parse

tree, so it need not be concerned with building it; it merely checks whether or not

the input string is a sentence. Hence the result of the function is simply a boolean

value:

check :: String -> Bool

check input = run [’S’] input

As in chapter 10, the function is implemented by calling a generalized function run

with a singleton containing just the root nonterminal. Now the function run takes,

in addition to the input string, a list of symbols (which is initially called with the

above-mentioned singleton). That list is used as some kind of stack, as elements are

prepended at its front, and removed at the front in other occasions. Therefore we

refer to the whole operation as a stack machine, and that’s why the function is named

run: it runs the stack machine. Function run is deﬁned as follows:

run :: Stack -> String -> Bool

run [] [] = True

run [] (x:xs) = False

run (a:as) input | isT a = not(null input)

&& a==hd input

&& run as (tl input)

| isN a = run (rs++as) input

where rs = select a (hd input)

So, when called with an empty stack and an empty string, the function succeeds. If

the stack is empty, but the input is not, it fails, as there is junk input present. If

the stack is nonempty, case distinction is done on a, the top of the stack. If it is a

terminal, the input should begin with it, and the machine is called recursively for the

rest of the input. In the case of a nonterminal, we push rs on the stack, and leave

the input unchanged in the recursive call. This is a simple tail recursive function,

and could imperatively be implemented in a single loop that runs while the stack is

non-empty.

The new symbols rs that are pushed on the stack is the right hand side of a production

rule for nonterminal a. It is selected from the available alternatives by function

196

11.1. LL(1) parser example

select. For making the choice, the ﬁrst input symbol is passed to select as well.

This is where you can see that the algorithm is LL(1): it is allowed to look ahead

one symbol.

This is how the alternative is selected:

select a x = (snd . hd . filter ok . prods) gram

where ok p@(n,_) = n==a && x ‘elem‘ lahP gram p

So, from all productions of the grammar returned by prods, the ok ones are taken, of

which there should be only one (this is ensured by the LL(1)-property); that single

production is retrieved by hd, and of it only the right hand side is needed, hence

the call to snd. Now a production is ok, if the nonterminal n is a, the one we are

looking for, and moreover the ﬁrst input symbol x is member of the lookahead set of

this production.

Determining the lookahead sets of all productions of a grammar is the tricky part

of the LL(1) parsing algorithm. It is described in Sections 10.2.5 to 10.2.8. Though

the general formulation of the algorithm is quite complex, its application in a simple

case is rather intuitive. So let’s study an example grammar: arithmetic expressions

with operators of two levels of precedence.

11.1.2. An LL(1) grammar

The idea for the grammar for arithmetical expressions was given in Section 2.5.7: we

need auxiliary notions of ‘Term’ and ‘Factor’ (actually, we need as much additional

notions as there are levels of precedence). The most straightforward deﬁnition of the

grammar is shown in the left column below.

Unfortunately, this grammar doesn’t abide to the LL(1)-property. The reason for

this is that the grammar contains rules with common preﬁxes on the right hand side

(for E and T). A way out is the application of a grammar transformation known

as left factoring, as described in Section 2.5.4. The result is a grammar where there

is only one rule for E and T, and the non-common part of the right hand sides is

described by additional nonterminals, P and M. The resulting grammar is shown in

the right column below. For being able to treat end-of-input as if it were a character,

we also add an additional rule, which says that the input consists of an expression

followed by end-of-input (designated as ‘#’ here).

197

11. LL versus LR parsing

E → T

E → T + E

T → F

T → F * T

F → N

F → ( E )

N → 1

N → 2

N → 3

S → E #

E → TP

P →

P → + E

T → FM

M →

M → * T

F → N

F → ( E )

N → 1

N → 2

N → 3

For determining the lookahead sets of each nonterminal, we also need to analyze

whether a nonterminal can produce the empty string, and which terminals can be the

ﬁrst symbol of any string produced by each nonterminal. These properties are named

empty and ﬁrst respecively. All properties for the example grammar are summarized

in the table below. Note that empty and ﬁrst are properties of a nonterminal, whereas

lookahead is a property of a single production rule.

production empty ﬁrst lookahead

S → E # no ( 1 2 3 ﬁrst(E) ( 1 2 3

E → TP no ( 1 2 3 ﬁrst(T) ( 1 2 3

P → yes + follow(P) ) #

P → + E immediate +

T → FM no ( 1 2 3 ﬁrst(F) ( 1 2 3

M → yes * follow(M) + ) #

M → * T immediate *

F → N no ( 1 2 3 ﬁrst(N) 1 2 3

F → ( E ) immediate (

N → 1 no 1 2 3 immediate 1

N → 2 immediate 2

N → 3 immediate 3

11.1.3. Using the LL(1) parser

A sentence of the language described by the example grammar is 1+2*3. Because of

the high precedence of * relative to +, the parse tree should reﬂect that the 2 and 3

should be multiplied, rather than the 1+2 and 3. Indeed, the parse tree does so:

198

11.2. LR parsing

Now when we do a step-by-step analysis of how the parse tree is constructed by the

stack machine, we notice that the parse tree is traversed in a depth-ﬁrst fashion,

where the left subtrees are analysed ﬁrst. Each node corresponds to the application

of a production rule. The order in which the production rules are applied is a pre-

order traversal of the tree. The tree is constructed top-down: ﬁrst the root is visited,

an then each time the ﬁrst remaining nonterminal is expanded. From the table in

Figure 11.1 it is clear that the contents of the stack describes what is still to be

expected on the input. Initially, of course, this is the root nonterminal S. Whenever

a terminal is on top of the stack, the corresponing symbol is read from the input.

11.2. LR parsing

Another algorithm for doing deterministic parsing using a stack machine, is LR(1)-

parsing. Actually, what is described in this section is known as Simple LR parsing

or SLR(1)-parsing. It is still rather complicated, though.

A nice property of LR-parsing is that it is in many ways exactly the opposite, or

dual, of LL-parsing. Some of these ways are:

• LL does a leftmost derivation, LR does a rightmost derivation

199

11. LL versus LR parsing

• LL starts with only the root nonterminal on the stack, LR ends with only the

root nonterminal on the stack

• LL ends when the stack is empty, LR starts with an empty stack

• LL uses the stack for designating what is still to be expected, LR uses the stack

for designating what has already been seen

• LL builds the parse tree top down, LR builds the parse tree bottom up

• LL continuously pops a nonterminal oﬀ the stack, and pushes a corresponding

right hand side; LR tries to recognize a right hand side on the stack, pops it,

and pushes the corresponding nonterminal

• LL thus expands nonterminals, while LR reduces them

• LL reads terminal when it pops one oﬀ the stack, LR reads terminals while it

pushes them on the stack

• LL uses grammar rules in an order which corresponds to pre-order traversal of

the parse tree, LR does a post-order traversal.

11.2.1. A stack machine for SLR parsing

As in Section 10.1.1 a stack machine is used to parse sentences. A stack machine for

a grammar G has a stack and an input. When the parsing starts the stack is empty.

The stack machine performs one of the following two actions.

1. Shift: Move the ﬁrst symbol of the remaining input to the top of the stack.

2. Reduce: Choose a production rule N → α; pop the sequence α from the top

of the stack; push N onto the stack.

These actions are performed until the stack only contains the start symbol. A stack

machine for G accepts an input if it can terminate with only the start symbol on

the stack when the whole input has been processed. Let us take the example of

Section 10.1.1.

stack input

aab

a ab

aa b

baa

Saa

Sa

S

Note that with our notation the top of the stack is at the left side. To decide which

production rule is to be used for the reduction, the symbols on the stack must be

read in reverse order.

200

11.3. LR parse example

The stack machine for G accepts the input string aab because it terminates with the

start symbol on the stack and the input is completely processed. The stack machine

performs three shift actions and then three reduce actions.

The SLR parser only performs a reduction with a grammar rule N → α if the next

symbol on the input is in the follow set of N.

Exercise 11.1. Give for gramm1 of Section 10.1.2 all the states of the stack machine for a

derivation of the input ccccba.

Exercise 11.2. Give for gramm2 of Section 10.1.2 all the states of the stack machine for a

derivation of the input abbb.

Exercise 11.3. Give for gramm3 of Section 10.1.2 all the states of the stack machine for a

derivation of the input acbab.

11.3. LR parse example

11.3.1. An LR checker

for a start, the main function for LR parsing calls a generalized stack function with

an empty stack:

check’ :: String -> Bool

check’ input = run’ [] input

The stack machine terminates when it ﬁnds the stack containing just the root non-

terminal. Otherwise it either pushes the ﬁrst input symbol on the stack (‘Shift’), or

it drops some symbols oﬀ the stack (which should be the right hand side of a rule)

and pushes the corresponding nonterminal (‘Reduce’).

run’ :: Stack -> String -> Bool

run’ [’S’] [] = True

run’ [’S’] (x:xs) = False

run’ stack (x:xs) = case action of

Shift -> run’ (x: stack) xs

Reduce a n -> run’ (a:drop n stack) (x:xs)

Error -> False

where action = select’ stack x

In the case of LL-parsing the hard part was selecting the right rule to expand; here

we have the hard decision of whether to reduce according to a (and which?) rule, or

to shift the next symbol. This is done by the select’ function, which is allowed to

inspect the ﬁrst input symbol x and the entire stack: after all, in needs to ﬁnd the

right hand side of a rule on the stack.

In the right column in Figure 11.1 the LR derivation of sentence 1+2*3 is shown.

Compare closely to the left column, which shows the LL derivation, and note the

duality of the processes.

201

11. LL versus LR parsing

LL derivation

rule read ↓stack remaining

S 1+2*3

S → E E 1+2*3

E → TP TP 1+2*3

T → FM FMP 1+2*3

F → N NMP 1+2*3

N → 1 1MP 1+2*3

read 1 MP +2*3

M → 1 P +2*3

P → +E 1 +E +2*3

read 1+ E 2*3

E → TP 1+ TP 2*3

T → FM 1+ FMP 2*3

F → N 1+ NMP 2*3

N → 2 1+ 2MP 2*3

read 1+2 MP *3

M → *T 1+2 *TP *3

read 1+2* TP 3

T → FM 1+2* FMP 3

F → N 1+2* NMP 3

N → 3 1+2* 3MP 3

read 1+2*3 MP

M → 1+2*3 P

P → 1+2*3

LR derivation

rule read stack↓ remaining

1+2*3

shift 1 1 +2*3

N → 1 1 N +2*3

F → N 1 F +2*3

M → 1 FM +2*3

T → FM 1 T +2*3

shift 1+ T+ 2*3

shift 1+2 T+2 *3

N → 2 1+2 T+N *3

F → N 1+2 T+F *3

shift 1+2* T+F* 3

shift 1+2*3 T+F*3

N → 3 1+2*3 T+F*N

F → N 1+2*3 T+F*F

M → 1+2*3 T+F*FM

T → FM 1+2*3 T+F*T

M → *T 1+2*3 T+FM

T → FM 1+2*3 T+T

P → 1+2*3 T+TP

E → TP 1+2*3 T+E

P → +E 1+2*3 TP

E → TP 1+2*3 E

S → E 1+2*3 S

Figure 11.1.: LL and LR derivations of 1+2*3

11.3.2. LR action selection

The choice whether to shift or to reduce is made by function select’. It is deﬁned

as follows:

select’ as x

| null items = Error

| null redItems = Shift

| otherwise = Reduce a (length rs)

where items = dfa as

redItems = filter red items

(a,rs,_) = hd redItems

In the selection process a set (list) of so-called items plays a role. If the set is empty,

there is an error. If the set contains at least one item, we ﬁlter the red, or reducible

items from it. There should be only one, if the grammar has the LR-property. (Or

rather: it is the LR-property that there is only one element in this situation). The

reducible item is the production rule that can be reduced by the stack machine.

Now what are these items? An item is deﬁned to be a production rule, augmented

202

11.3. LR parse example

with a ‘cursor position’ somewhere in the right hand side of the rule. So for the rule

F → (E), there are four possible items: F → (E), F → ( E), F → (E ) and

F → (E), where denotes the cursor.

for the rule F → 2 we have two items: one having the cursor in front of the single

symbol, and one with the cursor after it: F → 2 and F → 2 . For epsilon-rules

there is a single item, where the cursor is at position 0 in the right hand side.

In the Haskell representation, the cursor is an integer which is added as a third

element of the tuple, which already contains nonterminal and right hand side of the

rule.

The items thus constructed are taken to be the states of a NFA (Nondeterminis-

tic Finite-state Automaton), as described in Section 8.1.2. We have the following

transition relations in this NFA:

• The cursor can be ‘stepped’ to the next position. This transition is labeled

with the symbol that is hopped over

• If the cursor is on front of a nonterminal, we can jump to an item describing

the application of that nonterminal, where the cursor is at the beginning. This

relation is an epsilon-transition of the NFA.

As an example, let’s consider an simpliﬁcation of the arithmetic expression gram-

mar:

E → T

E → T + E

T → N

T → N * T

T → ( E )

it skips the ‘factor’ notion as compared to the grammar earlier in this chapter, so it

misses some well-formed expressions, but it serves only as an example for creating

the states here. There are 18 states in this machine, as depicted here:

203

11. LL versus LR parsing

Exercise 11.4 (no answer provided). How can you predict the number of states from in-

specting the grammar?

Exercise 11.5 (no answer provided). In the picture, the transition arrow labels are not

shown. Add them.

Exercise 11.6 (no answer provided). What makes this FA nondeterministic?

As was shown in Section 8.1.4, another automaton can be constructed that is de-

terministic (a DFA). That construction involves deﬁning states which are sets of the

original states. So in this case, the states of the DFA are sets of items (where items

are rules with a cursor position). In the worst case we would have 2

18

states for the

DFA-version of our example NFA, but it turns out that we are in luck: there are only

11 states in the DFA. Its states are rather complicated to depict, as they are sets of

items, but it can be done:

204

11.3. LR parse example

Exercise 11.7 (no answer provided). Check that this FA is indeed deterministic.

Given this DFA, let’s return to our function that selects whether to shift or to re-

duce:

select’ as x

| null items = Error

| null redItems = Shift

| otherwise = Reduce a (length rs)

where items = dfa as

redItems = filter red items

(a,rs,_) = hd redItems

It runs the DFA, using the contents of the stack (as) as input. From the state where

the DFA ends, which is by construction a set of items, the ‘red’ ones are ﬁltered out.

An item is ‘red’ (that is: reducible) if the cursor is at the end of the rule.

Exercise 11.8 (no answer provided). Which of the states in the picture of the DFA contain

red items? How many?

This is not the only condition that makes an item reducible; the second condition

is that the lookahead input symbol is in the follow set of the nonterminal that is

reduced to. Function follow was also needed in the LL analysis in Section 10.2.8.

Both conditions are in the formal deﬁnition:

... where red (a,r,c) = c==length r && x ‘elem‘ follow a

Exercise 11.9 (no answer provided). How does this condition reﬂect that ‘the cursor is at

the end’ ?

205

11. LL versus LR parsing

11.3.3. LR optimizations and generalizations

The stack machine run’, by its nature, pushes and pops the stack continuously, and

does a recursive call afterwards. In each call, for making the shift/reduce decision,

the DFA is run on the (new) stack. In practice, it is considered a waste of time

to do the full DFA transitions each time, as most of the stack remains the same

after some popping and pushing at the top. Therefore, as an optimization, at each

stack position, the corresponding DFA state is also stored. The states of the DFA

can easily be numbered, so this amounts to just storing extra integers on the stack,

tupled with the symbols that used to be on the stack. (An artiﬁcial bottom element

should be placed on the stack initially, containing a dummy symbol and the number

of the initial DFA state).

By analysing the grammar, two tables can be precomputed:

• Shift, that decides what is the new state when pushing a terminal symbol on

the stack. This basically is the transition relation on the DFA.

• Action, that decides what action to take from a given state seeing a given input

symbol.

Both tables can be implemented as a two-dimensional table of integers, of dimensions

the number of states (typically, 100s to 1000s) times the number of symbols (typically,

under 100).

As said in the introduction, the algorithm described here is a mere Simple LR parsing,

or SLR(1). Its simplicity is in the reducibility test, which says:

... where red (a,r,c) = c==length r && x ‘elem‘ follow a

The follow set is a rough approximation of what might follow a given nonterminal.

But this set is not dependent of the context in which the nonterminal is used; maybe,

in some contexts, the set is smaller. So, the SLR red function, may designate items

as reducible, where it actually should not. For some grammars this might lead to

a decision to reduce, where it should have done a shift, with a failing parser as a

consequence.

An improvement, leading to a wider class of grammars that are allowable, would be

to make the follow set context dependent. This means that it should vary for each

item instead of for each nonterminal. It leads to a dramatic increase of the number

of states in the DFA. And the full power of LR parsing is rarely needed.

A compromise position is taken by the so-called LALR parsers, or Look Ahead LR

parsing. (A rather silly name, as all parsers look ahead. . . ). In LALR parsers,

the follow sets are context dependent, but when states in the DFA diﬀer only with

respect to the follow-part of their set-members (and not with respect to the item-

part of them), the states are merged. Grammars that do not give rise to shift/reduce

conﬂicts in this situation are said to be LALR-grammars. It is not really a natural

206

11.3. LR parse example

notion; rather, it is a performance hack that gets most of LR power while keeping

the size of the goto- and action-tables reasonable.

A widely used parser generator tool named yacc (for ‘yet another compiler compiler’)

is based on an LALR engine. It comes with Unix, and was originally created for

implementing the ﬁrst C compilers. A commonly used clone of yacc is named Bison.

Yacc is designed for doing the context-free aspects of analysing a language. The

micro structure of identiﬁers, numbers, keywords, comments etc. is not handled by

this grammar. Instead, it is described by regular expressions, which are analysed by a

accompanying tool to yacc named lex (for ‘lexical scanner’). Lex is a preprocessor to

yacc, that subdivides the input character stream into a stream of meaningful tokens,

such as numbers, identiﬁers, operators, keywords etc.

Bibliographical notes

The example DFA and NFA for LR parsing, and part of the description of the algo-

rithm were taken from course notes by Alex Aiken and George Necula, which can be

found at www.cs.wright.edu/~tkprasad/courses/cs780

207

11. LL versus LR parsing

208

Bibliography

[1] A.V. Aho, Sethi R., and J.D. Ullman. Compilers — Principles, Techniques and

Tools. Addison-Wesley, 1986.

[2] R.S. Bird. Using circular programs to eliminate multiple traversals of data. Acta

Informatica, 21:239–250, 1984.

[3] R.S. Bird and P. Wadler. Introduction to Functional Programming. Prentice

Hall, 1988.

[4] W.H. Burge. Parsing. In Recursive Programming Techniques. Addison-Wesley,

1975.

[5] J. Fokker. Functional parsers. In J. Jeuring and E. Meijer, editors, Advanced

Functional Programming, volume 925 of Lecture Notes in Computer Science.

Springer-Verlag, 1995.

[6] R. Harper. Proof-directed debugging. Journal of Functional Programming, 1999.

To appear.

[7] G. Hutton. Higher-order functions for parsing. Journal of Functional Program-

ming, 2(3):323 – 343, 1992.

[8] B.W. Kernighan and R. Pike. Regular expressions — languages, algorithms, and

software. Dr. Dobb’s Journal, April:19 – 22, 1999.

[9] D.E. Knuth. Semantics of context-free languages. Math. Syst. Theory, 2(2):127–

145, 1968.

[10] Niklas R¨ojemo. Garbage collection and memory eﬃciency in lazy functional

languages. PhD thesis, Chalmers University of Technology, 1995.

[11] S. Sippu and E. Soisalon-Soininen. Parsing Theory, Vol. 1: Languages and

Parsing, volume 15 of EATCS Monographs on THeoretical Computer Science.

Springer-Verlag, 1988.

[12] S.D. Swierstra and P.R. Azero Alcocer. Fast, error correcting parser combinators:

a short tutorial. In SOFSEM’99, 1999.

[13] P. Wadler. How to replace failure by a list of successes: a method for excep-

tion handling, backtracking, and pattern matching in lazy functional languages.

In J.P. Jouannaud, editor, Functional Programming Languages and Computer

Architecture, pages 113 – 128. Springer, 1985. LNCS 201.

209

Bibliography

210

A. The Stack module

module Stack

( Stack

, emptyStack

, isEmptyStack

, push

, pushList

, pop

, popList

, top

, split

, mystack

)

where

data Stack x = MkS [x] deriving (Show,Eq)

emptyStack :: Stack x

emptyStack = MkS []

isEmptyStack :: Stack x -> Bool

isEmptyStack (MkS xs) = null xs

push :: x -> Stack x -> Stack x

push x (MkS xs) = MkS (x:xs)

pushList :: [x] -> Stack x -> Stack x

pushList xs (MkS ys) = MkS (xs ++ ys)

pop :: Stack x -> Stack x

pop (MkS xs) = if isEmptyStack (MkS xs)

then error "pop on emptyStack"

else MkS (tail xs)

popIf :: Eq x => x -> Stack x -> Stack x

211

A. The Stack module

popIf x stack = if top stack == x

then pop stack

else error "argument and top of stack don’t match"

popList :: Eq x => [x] -> Stack x -> Stack x

popList xs stack = foldr popIf stack (reverse xs)

top :: Stack x -> x

top (MkS xs) = if isEmptyStack (MkS xs)

then error "top on emptyStack"

else head xs

split :: Int -> Stack x -> ([x], Stack x)

split 0 stack = ([], stack)

split n (MkS []) = error "attempt to split the emptystack"

split (n+1) (MkS (x:xs)) = (x:ys, stack’)

where

(ys, stack’) = split n (MkS xs)

mystack = MkS [1,2,3,4,5,6,7]

212

B. Answers to exercises

2.1 Three of the fours strings are elements of L

∗

: abaabaaabaa, aaaabaaaa,

baaaaabaa.

2.2 ¦ε¦.

2.3

∅L

= ¦ Deﬁnition of concatenation of languages ¦

¦st [ s ∈ ∅, t ∈ L¦

= ¦ s ∈ ∅ ¦

∅

The other equalities can be proved in a similar fashion.

2.4 The star operator on sets injects the elements of a set in a list; the star operator

on languages concatenates the sentences of the language. The former star operator

preserves more structure.

2.5 Section 2.1 contains an inductive deﬁnition of the set of sequences over an arbi-

trary set X. Syntactical deﬁnitions for such sets follow immediately from this.

1. A grammar for X =¦a¦ is given by

S → ε

S → aS

2. A grammar for X =¦a, b¦ is given by

S → ε

S → X S

X → a [ b

2.6 A context free grammar for L is given by

S → ε

S → aSb

213

B. Answers to exercises

2.7 Analogous to the construction of the grammar for PAL.

P → ε

[ a

[ b

[ aPa

[ bPb

2.8 Analogous to the construction of PAL.

M → ε

[ aMa

[ bMb

2.9 First establish an inductive deﬁnition for parity sequences. An example of a

grammar that can be derived from the inductive deﬁnition is:

P → ε [ 1P1 [ 0P [ P0

There are many other solutions.

2.10 Again, establish an inductive deﬁnition for L. An example of a grammar that

can be derived from the inductive deﬁnition is:

S → ε [ aSb [ bSa [ SS

Again, there are many other solutions.

2.11 A sentence is a sentential form consisting only of terminals which can be de-

rived in zero or more derivation steps from the start symbol (to be more precise:

the sentential form consisting only of the start symbol). The start symbol is a non-

terminal. The nonterminals of a grammar do not belong to the alphabet (the set

of terminals) of the language we describe using the grammar. Therefore the start

symbol cannot be a sentence of the language. As a consequence we have to perform

at least one derivation step from the start symbol before we end up with a sentence

of the language.

2.12 The language consisting of the empty string only, i.e., ¦ε¦.

2.13 This grammar generates the empty language, i.e., ∅. In general, grammars such

as this one that have no production rules without nonterminals on the right hand

side, cannot produce any sentences with only terminal symbols. Each derivation will

always contain nonterminals, so no sentences can be derived.

2.14 The sentences in this language consist of zero or more concatenations of ab, i.e.,

the language is the set ¦ab¦

∗

.

214

2.15 Yes. Each ﬁnite language is context free. A context free grammar can be

obtained by taking one nonterminal and adding a production rule for each sentence

in the language. For the language in Exercise 2.1, this prodedure yields

S → ab

S → aa

S → baa

2.16 To bring the grammar into the form where we can directly apply the rule for

associative separators, we introduce a new nonterminal:

A → AaA

A → B

B → b [ c

Now we can remove the ambiguity:

A → BaA

A → B

B → b [ c

It is now (optionally) possible to undo the auxiliary step of introducing the additional

nonterminal by applying the rules for substituting right hand sides for nonterminal

and removing unreachable productions. We then obtain:

A → baA[ caA

A → b [ c

2.17

1. Here are two parse trees for the sentence if b then if b then a else a:

S

if b then S

if b then S

a

else S

a

S

if b then S

if b then S

a

else S

a

215

B. Answers to exercises

2. The rule we apply is: match else with the closest previous unmatched else.

This means we prefer the second of the two parse trees above. The disam-

biguating rule is incorporated directly into the grammar:

S → MatchedS [ UnmatchedS

MatchedS → if b then MatchedS else MatchedS

[ a

UnmatchedS → if b then S

[ if b then MatchedS else UnmatchedS

3. An else clause is always matched with the closest previous unmatched if.

2.18 An equivalent grammar for bit lists is

L → B Z [ B

Z → , L Z [ , L

B → 0 [ 1

2.19

1. The grammar generates the language ¦a

2n

b

m

[ m, n ∈ N¦.

2. An equivalent non left recursive grammar is

S → AB

A → ε [ aaA

B → ε [ bB

2.20 Of course, we can choose how we want to represent the diﬀerent operators in

concrete syntax. Choosing standard symbols, this is one possibility:

Expr → Expr + Expr

[ Expr * Expr

[ Int

where Int is a nonterminal that produces integers. Note that the grammar given

above is ambiguous. We could also give an unambiguous version, for example by

introducing operator priorities.

2.21 Recall the grammar for palindromes from Exercise 2.7, now with names for the

productions:

Empty: P → ε

A: P → a

B: P → b

A

2

: P → aPa

B

2

: P → bPb

216

We construct the datatype Pal by interpreting the nonterminal as datatype and the

names of the productions as names of the constructors:

data Pal = Empty [ A[ B [ A

2

P [ B

2

P

Note that once again we keep the nonterminals on the right hand sides as arguments

to the constructors, but omit all the terminal symbols.

The strings abaaba and baaab can be derived as follows:

P ⇒ aPa ⇒ abPba ⇒ abaPaba ⇒ abaaba

P ⇒ bPb ⇒ baPba ⇒ baaab

The parse trees corresponding to the derivations are the following:

P

a

P

b P

a

P

ε

a

b

a

and

P

b P

a

P

a

a

b

Consequently, the desired Haskell deﬁnitions are

pal

1

= A

2

(B

2

(A

2

Empty))

pal

2

= B

2

(A

2

A)

2.22

1.

printPal :: Pal → String

printPal Empty = ""

printPal A = "a"

printPal B = "b"

printPal (A

2

p) = "a" ++ printPal p ++ "a"

printPal (B

2

p) = "b" ++ printPal p ++ "b"

Note how the function follows the structure of the datatype Pal closely, and

calls itself recursively wherever a recursive value of Pal occurs in the datatype.

Such a pattern is typical for semantic functions.

217

B. Answers to exercises

2.

aCountPal :: Pal → Int

aCountPal Empty = 0

aCountPal A = 1

aCountPal B = 0

aCountPal (A

2

p) = aCountPal p + 2

aCountPal (B

2

p) = aCountPal p

2.23 Recall the grammar from Exercise 2.8, this time with names for the produc-

tions:

MEmpty: M → ε

MA: M → aMa

MB: M → bMb

1. By systematically transforming the grammar, we obtain the following datatype:

data Mir = MEmpty [ MA Mir [ MB Mir

The concrete mirror palindromes cMir

1

and cMir

2

correspond to the following

terms of type Mir:

aMir

1

= MA (MB (MA Empty))

aMir

2

= MA (MB (MB Empty))

2.

printMir :: Mir → String

printMir MEmpty = ""

printMir (MA m) = "a" ++ printMir m ++ "a"

printMir (MB m) = "b" ++ printMir m ++ "b"

3.

mirToPal :: Mir → Pal

mirToPal MEmpty = Empty

mirToPal (MA m) = A

2

(mirToPal m)

mirToPal (MB m) = B

2

(mirToPal m)

2.24 Recall the grammar from Exercise 2.9, this time with names for the produc-

tions:

Stop: P → ε

POne: P → 1P1

PZeroL: P → 0P

PZeroR: P → P0

218

1.

data Parity = Stop [ POne Parity [ PZeroL Parity [ PZeroR Parity

aEven

1

= PZeroL (PZeroL (POne (PZeroL Stop)))

aEven

2

= PZeroL (PZeroR (POne (PZeroL Stop)))

Note that the grammar is ambiguous, and other representations for cEven

1

and

cEven

2

are possible, for instance:

aEven

1

= PZeroL (PZeroL (POne (PZeroR Stop)))

aEven

2

= PZeroR (PZeroL (POne (PZeroL Stop)))

2.

printParity :: Parity → String

printParity Stop = ""

printParity (POne p) = "1" ++ printParity p ++ "1"

printParity (PZeroL p) = "0" ++ printParity p

printParity (PZeroR p) = printParity p ++ "0"

2.25 A grammar for bit lists that is not left-recursive is the following:

L → B Z [ B

Z → , L Z [ , L

B → 0 [ 1

1.

data BitList = ConsBit Bit Z [ SingleBit Bit

data Z = ConsBitList BitList Z [ SingleBitList BitList

data Bit = Bit

0

[ Bit

1

aBitList

1

= ConsBit Bit

0

(ConsBitList (SingleBit Bit

1

)

(SingleBitList (SingleBit Bit

0

)))

aBitList

2

= ConsBit Bit

0

(ConsBitList (SingleBit Bit

0

)

(SingleBitList (SingleBit Bit

1

)))

2.

printBitList :: BitList → String

printBitList (ConsBit b z ) = printBit b ++ printZ z

printBitList (SingleBit b) = printBit b

printZ :: Z → String

printZ (ConsBitList bs z ) = "," ++ printBitList bs ++ printZ z

printZ (SingleBitList bs) = "," ++ printBitList bs

219

B. Answers to exercises

printBit :: Bit → String

printBit Bit

0

= "0"

printBit Bit

1

= "1"

When multiple datatypes are involved, semantic functions typically still follow

the structure of the datatypes closely. We get one function per datatype, and

the functions call each other recursively where appropriate – we say they are

mutually recursive.

3. We can still make the concatenation function structurally recursive in the ﬁrst

of the two bit lists. We never have to match on the second bit list:

concatBitList :: BitList → BitList → BitList

concatBitList (ConsBit b z ) cs = ConsBit b (concatZ z cs)

concatBitList (SingleBit b) cs = ConsBit b (SingleBitList cs)

concatZ :: Z → BitList → Z

concatZ (ConsBitList bs z ) cs = ConsBitList bs (concatZ z xs)

concatZ (SingleBitList bs) cs = ConsBitList bs (SingleBitList cs)

2.26 We only give the EBNF notation for the productions that change.

Digs → Dig

∗

Int → Sign? Nat

AlphaNums → AlphaNum

∗

AlphaNum → Letter [ Dig

2.27 L(G?) =L(G) ∪ ¦ε¦

2.28

1. L

1

is generated by:

S → ZC

Z → aZb [ ε

C → c

∗

and L

2

is generated by:

S → AZ

A → a

∗

Z → bZc [ ε

2. We have that

L

1

∩ L

2

=¦a

n

b

n

c

n

[ n ∈ N¦

However, this language is not context-free, i. e., there is no context-free grammar

that generates this language. We will see in Chapter 9 how to prove such a

statement.

220

2.29 No. Furthermore, for any language L, since ε ∈ L

∗

, we have ε / ∈ (L

∗

). On the

other hand, ε ∈ (L)

∗

. Thus (L

∗

) and (L)

∗

cannot be equal.

2.30 For example L =¦x

n

[ n ∈ N¦. For any language L, it holds that

L

∗

= (L

∗

)

∗

so, given any language L, the language L

∗

fulﬁlls the desired property.

2.31 This is only the case when ε / ∈ L.

2.32 No. the language L =¦aab, baa¦ also satisﬁes L = L

R

.

2.33

1. The shortest derivation is three steps long and yields the sentence aa. The

sentences baa, aba, and aab can all be derived in four steps.

2. Several derivations are possible for the string babbab. Two of them are

S ⇒ AA ⇒ bAA ⇒ bAbA ⇒ babA ⇒ babbA ⇒ babbAb ⇒ babbab

S ⇒ AA ⇒ AAb ⇒ bAAb ⇒ bAbAb ⇒ bAbbAb ⇒ babbAb ⇒ babbab

3. A leftmost derivation is:

S ⇒ AA ⇒

∗

b

m

AA ⇒

∗

b

m

Ab

n

A ⇒ b

m

ab

n

A ⇒

∗

b

m

ab

n

Ab

p

⇒ b

m

ab

n

ab

p

2.34 The grammar is equivalent to the grammar

S → aaB

B → bBba

B → a

This grammar generates the string aaa and the strings aab

m

a(ba)

m

for m ∈ N,

m 1. The string aabbaabba does not appear in this language.

2.35 The language L is generated by:

S → aSa [ bSb [ c

The derivation is:

S ⇒ aSa ⇒ abSba ⇒ abcba

2.36 The language generated by the grammar is

¦a

n

b

n

[ n ∈ N¦

221

B. Answers to exercises

The same language is also generated by the grammar

S → aAb [ ε

2.37 The ﬁrst language is

¦a

n

[ n ∈ N¦

This language is also generated by the grammar

S → aS [ ε

The second language is

¦ε¦ ∪ ¦a

2n+1

[ n ∈ N¦

This language is also generated by the grammar

S → A[ ε

A → a [ aAa

or using EBNF notation

S → (a(aa)

∗

)?

2.38 All three grammars generate the language

¦a

n

[ n ∈ N¦

2.39

S → A[ ε

A → aAb [ ab

2.40 The language is

L =¦a

2n+1

[ n ∈ N¦

A grammar for L without left-recursive productions is

A → aaA[ a

And a grammar without right-recursive prodcutions is

A → Aaa [ a

2.41 The language is

222

¦ab

n

[ n ∈ N¦

A grammar for L without left-recursive productions is

X → aY

Y → bY [ ε

A grammar for L without left-recursive productions that is also non-contracting is

X → aY [ a

Y → bY [ b

2.42 A grammar that uses only productions with two or less symbols on the right

hand side:

S → T [ US

T → Xa [ Ua

X → aS

U → S [ YT

Y → SU

The sentential forms aS and SU have been abstracted to nonterminals X and Y.

A grammar for the same language with only two nonterminals:

S → aSa [ Ua [ US

U → S [ SUaSa [ SUUa

The nonterminal T has been substituted for its alternatives aSa and Ua.

2.43

S → 1O

O → 1O [ 0N

N → 1

∗

2.44 The language is generated by the grammar:

S → (A) [ SS

A → S [ ε

A derivation for ( ) ( ( ) ) ( ) is:

S ⇒ SS ⇒ SSS ⇒ (A)SS ⇒ ()SS ⇒ ()(A)S ⇒ ()(S)S ⇒ ()((A))S

⇒ ()(())S ⇒ ()(())(A) ⇒ ()(())()

2.45 The language is generated by the grammar

223

B. Answers to exercises

S → (A) [ [A] [ SS

A → S [ ε

A derivation for [ ( ) ] ( ) is:

S ⇒ SS ⇒ [A]S ⇒ [S]S ⇒ [(A)]S ⇒ [()]S ⇒ [()](A) ⇒ [()]()

2.46 First leftmost derivation:

Sentence

⇒ Subject Predicate

⇒ they Predicate

⇒ they Verb NounPhrase

⇒ they are NounPhrase

⇒ they are Adjective Noun

⇒ they are flying Noun

⇒ they are flying planes

Second leftmost derivation:

Sentence

⇒ Subject Predicate

⇒ they Predicate

⇒ they AuxVerb Verb Noun

⇒ they are Verb Noun

⇒ they are flying Noun

⇒ they are flying planes

2.48 Here is an unambiguous grammar for the language from Exercise 2.45:

S → (E)E [ [E] E

E → ε [ S

2.49 Here is a leftmost derivation for ♣3♣´♠.

⇒ ´⊗ ⇒ ⊗´⊗ ⇒ ⊗3⊕´⊗ ⇒ ⊕3⊕´⊗ ⇒ ♣3⊕´⊗

⇒ ♣3♣´⊗ ⇒ ♣3♣´⊕ ⇒ ♣3♣´♠

Notice that the grammar of this exercise is the same, up to renaming, as the gram-

mar

E → E + T [ T

T → T * F [ F

F → 0 [ 1

224

2.50 The palindrome ε with length 0 can be generated by de grammar with the

derivation P ⇒ ε. The palindromes a, b, and b are the palindromes with length 1.

They can be derivated with P ⇒ a, P ⇒ b, P ⇒ c, respectively.

Suppose that the palindrome s with a length of 2 or more. Then s can be written as

ata or btb or ctc where the length of t is strictly smaller, and t also is a palindrome.

Thus, by induction hypothesis, there is a derivation P ⇒

∗

t. But then, there is also

a derivation for s, for example P ⇒

∗

t ⇒ ata in the ﬁrst situation – the other two

cases are analogous.

We have now proved that any palindrome can be generated by the grammar – we

still have to prove that anything generated by the grammar is a palindrome, but this

is easy to see by induction over the length of derivations. Certainly ε, a, b, and c

are palindromes. And if s is a palindrome that can be derived, so are asa, bsb, and

csc.

3.1 Either we use the predeﬁned predicate isUpper in module Data.Char,

capital = satisfy isUpper

or we make use of the ordering deﬁned characters,

capital = satisfy (λs → (’A’ s) ∧ (s ’Z’))

3.2 A symbol equal to a satisﬁes the predicate (= = a):

symbol a = satisfy (= = a)

3.3 The function epsilon is a special case of succeed:

epsilon :: Parser s ()

epsilon = succeed ()

3.4 Let xs :: [s]. Then

(f <$>succeed a) xs

= ¦ deﬁnition of <$> ¦

[(f x, ys) [ (x, ys) ← succeed a xs]

= ¦ deﬁnition of succeed ¦

[(f a, xs)]

= ¦ deﬁnition of succeed ¦

succeed (f a) xs

3.5 The type and results of (:) <$>symbol ’a’ are (note that you cannot write this

as a deﬁnition in Haskell):

225

B. Answers to exercises

((:) <$>symbol ’a’) :: Parser Char (String → String)

((:) <$>symbol ’a’) [] = []

((:) <$>symbol ’a’) (x : xs) [ x = = ’a’ = [((x:), xs)]

[ otherwise = []

3.6 The type and results of (:) <$>symbol ’a’ <∗>p are:

((:) <$>symbol ’a’ <∗>p) :: Parser Char String

((:) <$>symbol ’a’ <∗>p) [] = []

((:) <$>symbol ’a’ <∗>p) (x : xs)

[ x = = ’a’ = [(’a’ : x, ys) [ (x, ys) ← p xs]

[ otherwise = []

3.7

pBool :: Parser Char Bool

pBool = const True <$>token "True"

<[>const False <$>token "False"

3.9

1.

data Pal

2

= Nil [ Leafa [ Leafb [ Twoa Pal

2

[ Twob Pal

2

2.

palin

2

:: Parser Char Pal

2

palin

2

= (` y → Twoa y) <$>

symbol ’a’ <∗>palin

2

<∗>symbol ’a’

<[>(` y → Twob y) <$>

symbol ’b’ <∗>palin

2

<∗>symbol ’b’

<[>const Leafa <$>symbol ’a’

<[>const Leafb <$>symbol ’b’

<[>succeed Nil

3.

palina :: Parser Char Int

palina = (λ y → y + 2) <$>

symbol ’a’ <∗>palina <∗>symbol ’a’

<[>(λ y → y) <$>

symbol ’b’ <∗>palina <∗>symbol ’b’

<[>const 1 <$>symbol ’a’

<[>const 0 <$>symbol ’b’

<[>succeed 0

226

3.10

1.

data English = E

1

Subject Pred

data Subject = E

2

String

data Pred = E

3

Verb NounP [ E

4

AuxV Verb Noun

data Verb = E

5

String [ E

6

String

data AuxV = E

7

String

data NounP = E

8

Adj Noun

data Adj = E

9

String

data Noun = E

10

String

2.

english :: Parser Char English

english = E

1

<$>subject <∗>pred

subject = E

2

<$>token "they"

pred = E

3

<$>verb <∗>nounp

<[>E

4

<$>auxv <∗>verb <∗>noun

verb = E

5

<$>token "are"

<[>E

6

<$>token "flying"

auxv = E

7

<$>token "are"

nounp = E

8

<$>adj <∗>noun

adj = E

9

<$>token "flying"

noun = E

10

<$>token "planes"

3.11 As <[> uses ++, it is more eﬃciently evaluated if right-associative.

3.12 The function is the same as <∗>, but instead of applying the result of the ﬁrst

parser to that of the second, it pairs them together:

(<,>) :: Parser s a → Parser s b → Parser s (a, b)

(p <,>q) xs = [ ((x, y), zs)

[ (x, ys) ← p xs

, (y, zs) ← q ys

]

3.13 ‘Parser transformator’, or ‘parser modiﬁer’ or ‘parser postprocessor’, etcetera.

3.14 The transformator <$> does to the result part of parsers what map does to

the elements of a list.

3.15 The parser combinators <∗> and <,> can be deﬁned in terms of each other:

p <∗>q = uncurry ($) <$>(p <,>q)

p <,>q = (, ) <$>p <∗>q

227

B. Answers to exercises

3.16 Yes. You can combine the parser parameter of <$> with a parser that consumes

no input and always yields the function parameter of <$>:

f <$>p = succeed f <∗>p

3.17

f ::

Char → Parentheses → Char → Parentheses → Parentheses

open ::

Parser Char Char

f <$>open ::

Parser Char (Parentheses → Char → Parentheses → Parentheses)

parens ::

Parser Char Parentheses

(f <$>open) <∗>parens ::

Parser Char (Char → Parentheses → Parentheses)

3.18 To the left. Yes.

3.19 The function has to check whether applying the parser p to input s returns at

least one result with an empty rest sequence:

test p s = not (null (ﬁlter (null . snd) (p s)))

3.20

1.

listofdigits :: Parser Char [Int]

listofdigits = listOf newdigit (symbol ’ ’)

?> listofdigits "1 2 3"

[([1,2,3],""),([1,2]," 3"),([1]," 2 3")]

?> listofdigits "1 2 a"

[([1,2]," a"),([1]," 2 a")]

2. In order to use the parser chainr we ﬁrst deﬁne a parser plusParser that recog-

nises the character ’+’ and returns the function (+).

plusParser :: Num a ⇒ Parser Char (a → a → a)

plusParser [] = []

plusParser (x : xs) [ x = = ’+’ = [((+), xs)]

[ otherwise = []

228

The deﬁniton of the parser sumParser is:

sumParser :: Parser Char Int

sumParser = chainr newdigit plusParser

?> sumParser "1+2+3"

[(6,""),(3,"+3"),(1,"+2+3")]

?> sumParser "1+2+a"

[(3,"+a"),(1,"+2+a")]

?> sumParser "1"

[(1,"")]

Note that the parser also recognises a single integer.

3. The parser many should be replaced by the parser greedy in de deﬁnition of

listOf .

3.21 We introduce the abbreviation

listOfa = (:) <$>symbol ’a’

and use the results of Exercises 3.5 and 3.6.

xs = []:

many (symbol ’a’) []

= ¦ deﬁnition of many and listOfa ¦

(listOfa <∗>many (symbol ’a’) <[>succeed []) []

= ¦ deﬁnition of <[> ¦

(listOfa <∗>many (symbol ’a’)) [] ++ succeed [] []

= ¦ Exercise 3.6, deﬁnition of succeed ¦

[] ++ [([], [])]

= ¦ deﬁnition of ++ ¦

[([], [])]

xs = [’a’]:

many (symbol ’a’) [’a’]

= ¦ deﬁnition of many and listOfa ¦

(listOfa <∗>many (symbol ’a’) <[>succeed []) [’a’]

= ¦ deﬁnition of <[> ¦

(listOfa <∗>many (symbol ’a’)) [’a’] ++ succeed [] [’a’]

= ¦ Exercise 3.6, previous calculation ¦

229

B. Answers to exercises

[([’a’], []), ([], [’a’])]

xs = [’b’]:

many (symbol ’a’) [’b’]

= ¦ as before ¦

(listOfa <∗>many (symbol ’a’)) [’b’] ++ succeed [] [’b’]

= ¦ Exercise 3.6, previous calculation ¦

[([], [’b’])]

xs = [’a’, ’b’]:

many (symbol ’a’) [’a’, ’b’]

= ¦ as before ¦

(listOfa <∗>many (symbol ’a’)) [’a’, ’b’] ++ succeed [] [’a’, ’b’]

= ¦ Exercise 3.6, previous calculation ¦

[([’a’], [’b’]), ([], [’a’, ’b’])]

xs = [’a’, ’a’, ’b’]:

many (symbol ’a’) [’a’, ’a’, ’b’]

= ¦ as before ¦

(listOfa <∗>many (symbol ’a’)) [’a’, ’a’, ’b’] ++ succeed [] [’a’, ’a’, ’b’]

= ¦ Exercise 3.6, previous calculation ¦

[([’a’, ’a’], [’b’]), ([’a’], [’a’, ’b’]), ([], [’a’, ’a’, ’b’])]

3.22 The empty alternative is presented last, because the <[> combinator uses list

concatenation for concatenating lists of successes. This also holds for the recursive

calls; thus the ‘greedy’ parsing of all three a’s is presented ﬁrst, then two a’s with a

singleton rest string, then one a, and ﬁnally the empty result with the original input

as rest string.

3.24

— Combinators for repetition

psequence :: [Parser s a] → Parser s [a]

psequence [] = succeed []

psequence (p : ps) = (:) <$>p <∗>psequence ps

psequence

:: [Parser s a] → Parser s [a]

psequence

= foldr f (succeed [])

where f p q = (:) <$>p <∗>q

230

choice :: [Parser s a] → Parser s a

choice = foldr (<[>) failp

?> (psequence [digit, satisfy isUpper]) "1A"

[("1A","")]

?> (psequence [digit, satisfy isUpper]) "1Ab"

[("1A","b")]

?> (psequence [digit, satisfy isUpper]) "1ab"

[]

?> (choice [digit, satisfy isUpper]) "1ab"

[(’1’,"ab")]

?> (choice [digit, satisfy isUpper]) "Ab"

[(’A’,"b")]

?> (choice [digit, satisfy isUpper]) "ab"

[]

3.25

token :: Eq s ⇒ [s] → Parser s [s]

token = psequence . map symbol

3.27

identiﬁer :: Parser Char String

identiﬁer = (:) <$>satisfy isAlpha <∗>greedy (satisfy isAlphaNum)

3.28

1. As Haskell terms:

"abc": Var "abc"

"(abc)": Var "abc"

"a*b+1": Var "a" :∗: Var "b" :+: Con 1

"a*(b+1)": Var "a" :∗: (Var "b" :+: Con 1)

"-1-a": Con (−1) :−: Var "a"

"a(1,b)": Fun "a" [Con 1, Var "b"]

2. The parser fact ﬁrst tries to parse an integer, then a variable, then a function

application and ﬁnally a parenthesised expression. A function application is a

variable followed by an argument list. When the parser encounters a function

application, a variable will ﬁrst be recognised. This ﬁrst solution will however

231

B. Answers to exercises

not lead to a parse tree for the complete expression because the list of arguments

that comes after the variable cannot be parsed.

If we swap the second and the third line in the deﬁnition of the parser fact, the

parse tree for a function application will be the ﬁrst solution of the parser:

fact :: Parser Char Expr

fact = Con <$>integer

<[>Fun <$>identiﬁer <∗>parenthesised (commaList expr)

<[>Var <$>identiﬁer

<[>parenthesised expr

?> expr "a(1,b)"

[(Fun "a" [Con 1,Var "b"],""),(Var "a","(1,b)")]

3.29 A function with no arguments is not accepted by the parser:

?> expr "f()"

[(Var "f","()")]

The parser parenthesised (commaList expr) that is used in the parser fact does not

accept an empty list of arguments because commaList does not. To accept an empty

list we modify the parser fact as follows:

fact :: Parser Char Expr

fact = Con <$>integer

<[>Fun <$>identiﬁer

<∗> parenthesised (commaList expr <[>succeed [])

<[>Var <$>identiﬁer

<[>parenthesised expr

?> expr "f()"

[(Fun "f" [],""),(Var "f","()")]

3.30

expr = chainr (chainl term (const (:−:) <$>symbol ’-’))

(const (:+:) <$>symbol ’+’)

3.31 The datatype Expr is extended as follows to allow raising an expression to the

power of an expression:

data Expr = Con Int

[ Var String

232

[ Fun String [Expr]

[ Expr :+: Expr

[ Expr :−: Expr

[ Expr :∗: Expr

[ Expr :/: Expr

[ Expr : ˆ : Expr

deriving Show

Now the parser expr

**of Listing 3.8 can be extended with a new level of priorities:
**

powis = [(’^’, (: ˆ :))]

expr

**:: Parser Char Expr
**

expr

= foldr gen fact

**[addis, multis, powis]
**

Note that because of the use of chainl all the operators listed in addis, multis and

powis are treated as left-associative.

3.32 The proofs can be given by using laws for list comprehension, but here we prefer

to exploit the following equation

(f <$>p) xs = map (f ∗∗∗ id) (p xs) (B.1)

where (∗∗∗) is deﬁned by

(∗∗∗) :: (a → c) → (b → d) → (a, b) → (c, d)

(f ∗∗∗ g) (a, b) = (f a, g b)

It has the following property:

(f ∗∗∗ g) . (h ∗∗∗ k) = (f . h) ∗∗∗ (g . k) (B.2)

Furthermore, we will use the following laws about map in our proof: map distributes

over composition, concatenation, and the function concat:

map f . map g = map (f . g) (B.3)

map f (x ++ y) = map f x ++ map f y (B.4)

map f . concat = concat . map (map f ) (B.5)

1.

(h <$>(f <$>p)) xs

= ¦ (B.1) ¦

map (h ∗∗∗ id) ((f <$>p) xs)

= ¦ (B.1) ¦

map (h ∗∗∗ id) (map (f ∗∗∗ id) (p xs))

233

B. Answers to exercises

= ¦ (B.3) ¦

map ((h ∗∗∗ id) . (f ∗∗∗ id)) (p xs)

= ¦ (B.2) ¦

map ((h . f ) ∗∗∗ id) (p xs)

= ¦ (B.1) ¦

((h . f ) <$>p) xs

2.

(h <$>(p <[>q)) xs

= ¦ (B.1) ¦

map (h ∗∗∗ id) ((p <[>q) xs)

= ¦ deﬁnition of <[> ¦

map (h ∗∗∗ id) (p xs ++ q xs)

= ¦ (B.4) ¦

map (h ∗∗∗ id) (p xs) ++ map (h ∗∗∗ id) (q xs)

= ¦ (B.1) ¦

(h <$>p) xs ++ (h <$>q) xs

= ¦ deﬁnition of <[> ¦

((h <$>p) <[>(h <$>q)) xs

3. First note that (p <∗>q) xs can be written as

(p <∗>q) xs = concat (map (mc q) (p xs)) (B.6)

where

mc q (f , ys) = map (f ∗∗∗ id) (q ys)

Now we calculate

(((h.) <$>p) <∗>q) xs

= ¦ (B.6) ¦

concat (map (mc q) (((h.) <$>p) xs))

= ¦ (B.1) ¦

concat (map (mc q) (map ((h.) ∗∗∗ id) (p xs)))

= ¦ (B.3) ¦

concat (map (map (h ∗∗∗ id) (map (mc q) (p xs))))

= ¦ (B.7), see below ¦

concat (map ((map (h ∗∗∗ id)) . mc q) (p xs))

234

= ¦ (B.3) ¦

concat (map (map (h ∗∗∗ id)) (map (mc q) (p xs)))

= ¦ (B.5) ¦

map (h ∗∗∗ id) (concat (map (mc q) (p xs)))

= ¦ (B.6) ¦

map (h ∗∗∗ id) ((p <∗>q) xs)

= ¦ (B.1) ¦

(h <$>(p <∗>q)) xs

It remains to prove the claim

mc q . ((h.) ∗∗∗ id) = map (h ∗∗∗ id) . mc q (B.7)

This claim is also proved by calculation:

((map (h ∗∗∗ id)) . mc q) (f , ys)

= ¦ deﬁnition of . ¦

map (h ∗∗∗ id) (mc q (f , ys))

= ¦ deﬁnition of mc q ¦

map (h ∗∗∗ id) (map (f ∗∗∗ id) (q ys))

= ¦ map and ∗∗∗ distribute over composition ¦

map ((h . f ) ∗∗∗ id) (q y)

= ¦ deﬁnition of mc q ¦

mc q (h . f , ys)

= ¦ deﬁnition of ∗∗∗ ¦

(mc q . ((h.) ∗∗∗ id)) (f , ys)

3.33

pMir :: Parser Char Mir

pMir = (λ m → MB m) <$>symbol ’b’ <∗>pMir <∗>symbol ’b’

<[>(λ m → MA m) <$>symbol ’a’ <∗>pMir <∗>symbol ’a’

<[>succeed MEmpty

3.34

pBitList :: Parser Char BitList

pBitList = SingleB <$>pBit

<[>(λb bs → ConsB b bs) <$>pBit <∗>symbol ’,’ <∗>pBitList

pBit = const Bit

0

<$>symbol ’0’

<[>const Bit

1

<$>symbol ’1’

235

B. Answers to exercises

3.35

— Parser for ﬂoating point numbers

ﬁxed :: Parser Char Float

ﬁxed = (+) <$>(fromIntegral <$>greedy integer)

<∗> (((λ y → y) <$>symbol ’.’ <∗>fractpart) ‘option‘ 0.0)

fractpart :: Parser Char Float

fractpart = foldr f 0.0 <$>greedy newdigit

where f d n = (n + fromIntegral d) / 10.0

3.36

ﬂoat :: Parser Char Float

ﬂoat = f <$>ﬁxed

<∗> (((λ y → y) <$>symbol ’E’ <∗>integer) ‘option‘ 0)

where f m e = m ∗ power e

power e [ e <0 = 1.0 / power (−e)

[ otherwise = fromIntegral (10e)

3.37 Parse trees for Java assignments are of type:

data JavaAssign = JAssign String Expr

deriving Show

The parser is deﬁned as follows:

assign :: Parser Char JavaAssign

assign = JAssign

<$>identiﬁer

<∗> ((λ y → y) <$>symbol ’=’ <∗>expr <∗>symbol ’;’)

?> assign "x1=(a+1)*2;"

[(JAssign "x1" (Var "a" :+: Con 1 :*: Con 2),"")]

?> assign "x=a+1"

[]

Note that the second example is not recognised as an assignment because the string

does not end with a semicolon.

4.1

data FloatLiteral = FL

1

IntPart FractPart ExponentPart FloatSuﬃx

[ FL

2

FractPart ExponentPart FloatSuﬃx

[ FL

3

IntPart ExponentPart FloatSuﬃx

236

[ FL

4

IntPart ExponentPart FloatSuﬃx

deriving Show

type ExponentPart = String

type ExponentIndicator = String

type SignedInteger = String

type IntPart = String

type FractPart = String

type FloatSuﬃx = String

digit = satisfy isDigit

digits = many

1

digit

ﬂoatLiteral = (λa b c d e → FL

1

a c d e)

<$>intPart <∗>period <∗>optfract <∗>optexp <∗>optﬂoat

<[> (λa b c d → FL

2

b c d)

<$>period <∗>fractPart <∗>optexp <∗>optﬂoat

<[> (λa b c → FL

3

a b c)

<$>intPart <∗>exponentPart <∗>optﬂoat

<[> (λa b c → FL

4

a b c)

<$>intPart <∗>optexp <∗>ﬂoatSuﬃx

intPart = signedInteger

fractPart = digits

exponentPart = (++) <$>exponentIndicator <∗>signedInteger

signedInteger = (++) <$>option sign "" <∗>digits

exponentIndicator = token "e" <[>token "E"

sign = token "+" <[>token "-"

ﬂoatSuﬃx = token "f" <[>token "F"

<[>token "d" <[>token "D"

period = token "."

optexp = option exponentPart ""

optfract = option fractPart ""

optﬂoat = option ﬂoatSuﬃx ""

4.2 The data and type deﬁnitions are the same as before, only the parsers return

another (semantic) result.

digit = f <$> satisfy isDigit

where f c = ord c - ord ’0’

digits = foldl f 0 <$> many1 digit

where f a b = 10*a + b

floatLiteral = (\a b c d e -> (fromIntegral a + c) * power d) <$>

intPart <*> period <*> optfract <*> optexp <*> optfloat

<|> (\a b c d -> b * power c) <$>

period <*> fractPart <*> optexp <*> optfloat

237

B. Answers to exercises

<|> (\a b c -> (fromIntegral a) * power b) <$>

intPart <*> exponentPart <*> optfloat

<|> (\a b c -> (fromIntegral a) * power b) <$>

intPart <*> optexp <*> floatSuffix

intPart = signedInteger

fractPart = foldr f 0.0 <$> many1 digit

where f a b = (fromIntegral a + b)/10

exponentPart = (\x y -> y) <$> exponentIndicator <*> signedInteger

signedInteger = (\ x y -> x y) <$> option sign id <*> digits

exponentIndicator = symbol ’e’ <|> symbol ’E’

sign = const id <$> symbol ’+’

<|> const negate <$> symbol ’-’

floatSuffix = symbol ’f’ <|> symbol ’F’<|> symbol ’d’ <|> symbol ’D’

period = symbol ’.’

optexp = option exponentPart 0

optfract = option fractPart 0.0

optfloat = option floatSuffix ’ ’

power e | e < 0 = 1 / power (-e)

| otherwise = fromIntegral (10^e)

4.3 The parsing scheme for Java ﬂoats is

digit = f <$> satisfy isDigit

where f c = .....

digits = f <$> many1 digit

where f ds = ..

floatLiteral = f1 <$>

intPart <*> period <*> optfract <*> optexp <*> optfloat

<|> f2 <$>

period <*> fractPart <*> optexp <*> optfloat

<|> f3 <$>

intPart <*> exponentPart <*> optfloat

<|> f4 <$>

intPart <*> optexp <*> floatSuffix

where

f1 a b c d e = .....

f2 a b c d = .....

f3 a b c = .....

f4 a b c = .....

intPart = signedInteger

fractPart = f <$> many1 digit

where f ds = ....

exponentPart = f <$> exponentIndicator <*> signedInteger

where f x y = .....

238

signedInteger = f <$> option sign ?? <*> digits

where f x y = .....

exponentIndicator = f1 <$> symbol ’e’ <|> f2 <$> symbol ’E’

where

f1 c = ....

f2 c = ..

sign = f1 <$> symbol ’+’ <|> f2 <$> symbol ’-’

where

f1 h = .....

f2 h = .....

floatSuffix = f1 <$> symbol ’f’

<|> f2 <$> symbol ’F’

<|> f3 <$> symbol ’d’

<|> f4 <$> symbol ’D’

where

f1 c = .....

f2 c = .....

f3 c = .....

f4 c = .....

period = symbol ’.’

optexp = option exponentPart ??

optfract = option fractPart ??

optfloat = option floatSuffix ??

5.1

type LNTreeAlgebra a b x = (a → x, x → b → x → x)

foldLNTree :: LNTreeAlgebra a b x → LNTree a b → x

foldLNTree (leaf , node) = fold

where

fold (Leaf a) = leaf a

fold (Node l m r) = node (fold l ) m (fold r)

5.2

1. Deﬁnition of height by case analysis:

height (Leaf x) = 0

height (Bin lt rt) = 1 + (height lt ‘max‘ height rt)

Deﬁnition as a fold:

height :: BinTree x → Int

height = foldBinTree heightAlgebra

heightAlgebra = (λu v → 1 + (u ‘max‘ v), const 0)

239

B. Answers to exercises

2. Deﬁnition of ﬂatten by case analysis:

ﬂatten (Leaf x) = [x]

ﬂatten (Bin lt rt) = ﬂatten lt ++ ﬂatten rt

Deﬁnition as a fold:

ﬂatten :: BinTree x → [x]

ﬂatten = foldBinTree ((++), λx → [x])

3. Deﬁnition of maxBinTree by case analysis:

maxBinTree (Leaf x) = x

maxBinTree (Bin lt rt) = maxBinTree lt ‘max‘ maxBinTree rt

Deﬁnition as a fold:

maxBinTree :: Ord x ⇒ BinTree x → x

maxBinTree = foldBinTree (max, id)

4. Deﬁnition of sp by case analysis:

sp (Leaf x) = 0

sp (Bin lt rt) = 1 + (sp lt) ‘min‘ (sp rt)

Deﬁnition as a fold:

sp :: BinTree x → Int

sp = foldBinTree spAlgebra

spAlgebra = (λu v → 1 + u ‘min‘ v, const 0)

5. Deﬁnition of mapBinTree by case analysis:

mapBinTree f (Leaf x) = Leaf (f x)

mapBinTree f (Bin lt rt) = Bin (mapBinTree f lt) (mapBinTree f rt)

Deﬁnition as a fold:

mapBinTree :: (a → b) → BinTree a → BinTree b

mapBinTree f = foldBinTree (Bin, Leaf . f )

5.3 Using explicit recursion:

allPaths (Leaf x) = [[]]

allPaths (Bin lt rt) = map (L:) (allPaths lt)

++ map (R:) (allPaths rt)

240

As a fold:

allPaths :: BinTree a → [[Direction]]

allPaths = foldBinTree psAlgebra

psAlgebra :: BinTreeAlgebra a [[Direction]]

psAlgebra = (λu v → map (L:) u ++ map (R:) v, const [[]])

5.4

1.

data Resist = Resist :[: Resist

[ Resist :∗: Resist

[ BasicR Float

deriving Show

type ResistAlgebra a = (a → a → a, a → a → a, Float → a)

foldResist :: ResistAlgebra a → Resist → a

foldResist (par, seq, basic) = fold

where

fold (r

1

:[: r

2

) = par (fold r

1

) (fold r

2

)

fold (r

1

:∗: r

2

) = seq (fold r

1

) (fold r

2

)

fold (BasicR f ) = basic f

2.

result :: Resist → Float

result = foldResist resultAlgebra

resultAlgebra :: ResistAlgebra Float

resultAlgebra = (λu v → (u ∗ v) / (u + v), (+), id)

5.5

1. isSum = foldExpr ((&&)

,\x y -> False

,\x y -> False

,\x y -> False

,const True

,const True

,\x -> (&&)

)

2. vars = foldExpr ((++)

,(++)

,(++)

241

B. Answers to exercises

,(++)

,const []

,\x -> [x]

,\x y z -> x : (y ++ z)

)

5.6

1. der computes a symbolic diﬀerentiation of an expression.

2. The function der is not a compositional function on Expr because the righthand

sides of the ‘Mul ‘ and ‘Dvd‘ expressions do not only use der e

1

dx and der e

2

dx,

but also e

1

and e

2

themselves.

3.

data Exp = Exp ‘Plus‘ Exp

[ Exp ‘Sub‘ Exp

[ Con Float

[ Idf String

deriving Show

type ExpAlgebra a = (a → a → a

, a → a → a

, Float → a

, String → a

)

foldExp :: ExpAlgebra a → Exp → a

foldExp (plus, sub, con, idf ) = fold

where

fold (e

1

‘Plus‘ e

2

) = plus (fold e

1

) (fold e

2

)

fold (e

1

‘Sub‘ e

2

) = sub (fold e

1

) (fold e

2

)

fold (Con n) = con n

fold (Idf s) = idf s

4. Using explicit resursion:

der :: Exp → String → Exp

der (e

1

‘Plus‘ e

2

) dx = der e

1

dx ‘Plus‘ der e

2

dx

der (e

1

‘Sub‘ e

2

) dx = der e

1

dx ‘Sub‘ der e

2

dx

der (Con f ) dx = Con 0

der (Idf s) dx =if s = = dx then Con 1 else Con 0

Using a fold:

der = foldExp derAlgebra

derAlgebra :: ExpAlgebra (String → Exp)

242

derAlgebra = (λf g → λs → f s ‘Plus‘ g s

, λf g → λs → f s ‘Sub‘ g s

, λn → λs → Con 0

, λs → λt → if s = = t then Con 1 else Con 0

)

5.7 Using explicit recursion:

replace (Leaf x) y = Leaf y

replace (Bin lt rt) y = Bin (replace lt y) (replace rt y)

As a fold:

replace :: BinTree a → a → BinTree a

replace = foldBinTree repAlgebra

repAlgebra = (λf g → λy → Bin (f y) (g y), λx y → Leaf y)

5.8 Using explicit recursion:

path2Value (Leaf x) =

λbs → if null bs then x else error "no roothpath"

path2Value (Bin lt rt) =

λbs → case bs of

[] → error "no roothpath"

(L : rs) → path2Value lt rs

(R : rs) → path2Value rt rs

Using a fold:

path2Value :: BinTree a → [Direction] → a

path2Value = foldBinTree pvAlgebra

pvAlgebra :: BinTreeAlgebra a ([Direction] → a)

pvAlgebra = (λﬂ fr → λbs → case bs of

[] → error "no roothpath"

(L : rs) → ﬂ rs

(R : rs) → fr rs

, λx → λbs → if null bs then x

else error "no roothpath")

5.9

1.

type PalAlgebra p = (p, p, p, p → p, p → p)

243

B. Answers to exercises

2.

foldPal :: PalAlgebra p → Pal → p

foldPal (pal

1

, pal

2

, pal

3

, pal

4

, pal

5

) = fPal

where

fPal Pal

1

= pal

1

fPal Pal

2

= pal

2

fPal Pal

3

= pal

3

fPal (Pal

4

p) = pal

4

(fPal p)

fPal (Pal

5

p) = pal

5

(fPal p)

3.

a2cPal = foldPal (""

, "a"

, "b"

, λp → "a" ++ p ++ "a"

, λp → "b" ++ p ++ "b"

)

aCountPal = foldPal (0, 1, 0, λp → p + 2, λp → p)

4.

pfoldPal :: PalAlgebra p → Parser Char p

pfoldPal (pal

1

, pal

2

, pal

3

, pal

4

, pal

5

) = pPal

where

pPal = const pal

1

<$>ε

<[>const pal

2

<$>syma

<[>const pal

3

<$>symb

<[>(` p → pal

4

p) <$>syma <∗>pPal <∗>syma

<[>(` p → pal

5

p) <$>symb <∗>pPal <∗>symb

syma = symbol ’a’

symb = symbol ’b’

5. The parser pfoldPal m

1

returns the concrete representation of a palindrome.

The parser pfoldPal m

2

returns the number of a’s occurring in a palindrome.

5.10

1. type MirAlgebra m = (m,m->m,m->m)

2. foldMir :: MirAlgebra m -> Mir -> m

foldMir (mir1,mir2,mir3) = fMir where

fMir Mir1 = mir1

fMir (Mir2 m) = mir2 (fMir m)

fMir (Mir3 m) = mir3 (fMir m)

3. a2cMir = foldMir ("",\m->"a"++m++"a",\m->"b"++m++"b")

m2pMir = foldMir (Pal1,Pal4,Pal5)

244

4. pfoldMir :: MirAlgebra m -> Parser Char m

pfoldMir (mir1, mir2, mir3) = pMir where

pMir = const mir1 <$> epsilon

<|> (\_ m _ -> mir2 m) <$> syma <*> pMir <*> syma

<|> (\_ m _ -> mir3 m) <$> symb <*> pMir <*> symb

syma = symbol ’a’

symb = symbol ’b’

5. The parser pfoldMir m

1

returns the concrete representation of a palindrome.

The parser pfoldMir m

2

returns the abstract representation of a palindrome.

5.11

1. type ParityAlgebra p = (p,p->p,p->p,p->p)

2. foldParity :: ParityAlgebra p -> Parity -> p

foldParity (empty,parity1,parityL0,parityR0) = fParity where

fParity Empty = empty

fParity (Parity1 p) = parity1 (fParity p)

fParity (ParityL0 p) = parityL0 (fParity p)

fParity (ParityR0 p) = parityR0 (fParity p)

3. a2cParity = foldParity (""

,\x->"1"++x++"1"

,\x->"0"++x

,\x->x++"0"

)

5.12

1. type BitListAlgebra bl z b = ((b->z->bl,b->bl),(bl->z->z,bl->z),(b,b))

2. foldBitList :: BitListAlgebra bl z b -> BitList -> bl

foldBitList ((consb,singleb),(consbl,singlebl),(bit0,bit1)) = fBitList

where

fBitList (ConsB b z) = consb (fBit b) (fZ z)

fBitList (SingleB b) = singleb (fBit b)

fZ (ConsBL bl z) = consbl (fBitList bl) (fZ z)

fZ (SingleBL bl) = singlebl (fBitList bl)

fBit Bit0 = bit0

fBit Bit1 = bit1

3. a2cBitList = foldBitList (((++),id)

,((++),id)

,("0","1")

)

4. pfoldBitList :: BitListAlgebra bl z b -> Parser Char bl

pfoldBitList ((consb,singleb),(consbl,singlebl),(bit0,bit1)) = pBitList

where

pBitList = consb <$> pBit <*> pZ

<|> singleb <$> pBit

245

B. Answers to exercises

pZ = (\_ bl z -> consbl bl z) <$> symbol ’,’

<*> pBitList

<*> pZ

<|> singlebl <$> pBitList

pBit = const bit0 <$> symbol ’0’

<|> const bit1 <$> symbol ’1’

5.13

1.

data Block = B

1

Stat Rest

data Rest = R

1

Stat Rest [ Nix

data Stat = S

1

Decl [ S

2

Use [ S

3

Nest

data Decl = Dx [ Dy

data Use = UX [ UY

data Nest = N

1

Block

The abstract representation of x;(y;Y);X is

B

1

(S

1

Dx) (R

1

stat

2

rest

2

)

where

stat

2

= S

3

(N

1

block)

block = B

1

(S

1

Dy) (R

1

(S

2

UY) Nix)

rest

2

= R

1

(S

2

UX) Nix

2.

type BlockAlgebra b r s d u n =

(s → r → b

, (s → r → r, r)

, (d → s, u → s, n → s)

, (d, d)

, (u, u)

, b → n

)

3.

foldBlock :: BlockAlgebra b r s d u n → Block → b

foldBlock (b1, (r

1

, nix), (s1, s2, s3), (dx, dy), (ux, uy), n1) =

foldB

where

foldB (B

1

stat rest) = b1 (foldS stat) (foldR rest)

foldR (R

1

stat rest) = r

1

(foldS stat) (foldR rest)

246

foldR Nix = nix

foldS (S

1

decl ) = s1 (foldD decl )

foldS (S

2

use) = s2 (foldU use)

foldS (S

3

nest) = s3 (foldN nest)

foldD Dx = dx

foldD Dy = dy

foldU UX = ux

foldU UY = uy

foldN (N

1

block) = n1 (foldB block)

4.

a2cBlock :: Block → String

a2cBlock = foldBlock a2cBlockAlgebra

a2cBlockAlgebra :: BlockAlgebra String String String

String String String

a2cBlockAlgebra =

(b1, (r

1

, nix), (s1, s2, s3), (dx, dy), (ux, uy), n1)

where

b1 u v = u ++ v

r

1

u v = ";" ++ u ++ v

nix = ""

s1 u = u

s2 u = u

s3 u = u

dx = "x"

dy = "y"

ux = "X"

uy = "Y"

n1 u = "(" ++ u ++ ")"

7.1 The type TreeAlgebra and the function foldTree are deﬁned in the rep-min

problem.

1. height = foldTree heightAlgebra

heightAlgebra = (const 0, \l r -> (l ‘max‘ r) + 1)

--frontAtLevel :: Tree -> Int -> [Int]

--frontAtLevel (Leaf i) h = if h == 0 then [i] else []

--frontAtLevel (Bin l r) h =

-- if h > 0 then frontAtLevel l (h-1) ++ frontAtLevel r (h-1)

-- else []

frontAtLevel = foldTree frontAtLevelAlgebra

frontAtLevelAlgebra =

( \i h -> if h == 0 then [i] else []

247

B. Answers to exercises

, \f g h -> if h > 0 then f (h-1) ++ g (h-1) else []

)

2. a) Straightforward Solution: impossible to give a solution like rm.sol1.

heightAlgebra = (const 0, \l r -> (l ‘max‘ r) + 1)

front t = foldTree frontAtLevelAlgebra t h

where h = foldTree heightAlgebra t

b) Lambda Lifting

front’ t = foldTree

frontAtLevelAlgebra

t

(foldTree heightAlgebra t)

c) Tupling Computations

htupfr :: TreeAlgebra (Int, Int -> [Int])

htupfr

-- = (heightAlgebra ‘tuple‘ frontAtLevelAlgebra)

= ( \i -> ( 0

, \h -> if h == 0 then [i] else []

)

, \(lh,f) (rh,g) -> ( (lh ‘max‘ rh) + 1

, \h -> if h > 0

then f (h-1) ++ g (h-1)

else []

)

)

front’’ t = fr h

where (h, fr) = foldTree htupfr t

d) Merging Tupled Functions

It is helpful to note that the merged algebra is constructed such that

foldTree mergedAlgebra t i = (height t, frontAtLevel t i)

Therefore, the deﬁnitions corresponding to the fourth solution are

mergedAlgebra :: TreeAlgebra (Int -> (Int, [Int]))

mergedAlgebra =

(\i -> \h -> ( 0

, if h == 0 then [i] else []

)

, \lfun rfun -> \h -> let (lh, xs) = lfun (h-1)

(rh, ys) = rfun (h-1)

in

( (lh ‘max‘ rh) + 1

248

, if h > 0 then xs ++ ys else []

)

)

front’’’ t = fr

where (h, fr) = foldTree mergedAlgebra t h

7.2 The highest front problem is the problem of ﬁnding the ﬁrst non-empty list

of nodes which are at the lowest level. The solutions are similar to these of the

deepest front problem.

1. --lowest :: Tree -> Int

--lowest (Leaf i) = 0

--lowest (Bin l r) = ((lowest l) ‘min‘ (lowest r)) + 1

lowest = foldTree lowAlgebra

lowAlgebra = (const 0, \ l r -> (l ‘min‘ r) + 1)

2. a) Straightforward Solution: impossible to give a solution like rm.sol1.

lfront t = foldTree frontAtLevelAlgebra t l

where l = foldTree lowAlgebra t

b) Lambda Lifting

lfront’ t = foldTree

frontAtLevelAlgebra

t

(foldTree lowAlgebra t)

c) Tupling Computations

ltupfr :: TreeAlgebra (Int, Int -> [Int])

ltupfr =

( \i -> ( 0

, (\l -> if l == 0 then [i] else [])

)

, \ (lh,f) (rh,g) -> ( (lh ‘min‘ rh) + 1

, \l -> if l > 0

then f (l-1) ++ g (l-1)

else []

)

)

lfront’’ t = fr l

where (l, fr) = foldTree ltupfr t

d) Merging Tupled Functions

It is helpful to note that the merged algebra is constructed such that

foldTree mergedAlgebra t i = (lowest t, frontAtLevel t i)

249

B. Answers to exercises

Therefore, the deﬁnitions corresponding to the fourth solution are

lmergedAlgebra :: TreeAlgebra (Int -> (Int, [Int]))

lmergedAlgebra =

( \i -> \l -> ( 0

, if l == 0 then [i] else []

)

, \lfun rfun -> \l ->

let (ll, lres) = lfun (l-1)

(rl, rres) = rfun (l-1)

in

( (ll ‘min‘ rl) + 1

, if l > 0 then lres ++ rres else []

)

)

lfront’’’ t = fr

where (l, fr) = foldTree lmergedAlgebra t l

8.1 Let G = (T, N, R, S). Deﬁne the regular grammar G

= (T, N ∪ ¦S

¦, R

, S

)

as follows. S

is a new nonterminal symbol. For the productions R

, divide the

productions of R in two sets: the productions R1 of which the right-hand side consists

just of terminal symbols, and the productions R2 of which the right-hand side ends

with a nonterminal. Deﬁne a set R3 of productions by adding the nonterminal S

at the right end of each production in R1. Deﬁne R

= R ∪ R3 ∪ S

→ S [ . The

grammar G

**thus deﬁned is regular, and generates the language L
**

∗

.

8.2 In step 2, add the productions

S → aS

S →

S → cC

S → a

A →

A → cC

A → a

B → cC

B → a

and remove the productions

S → A

A → B

B → C

In step 3, add the production S → a, and remove the productions A → , B → .

250

8.3

ndfsa d qs [ ] = qs

ndfsa d qs (x : xs) = ndfsa d (d qs x) xs

8.4 Let M = (X, Q, d, S, F) be a deterministic ﬁnite-state automaton. Then the

nondeterministic ﬁnite-state automaton M

deﬁned by M

= (X, Q, d

, ¦S¦, F), where

d

**q x = ¦d q x¦ accepts the same language.
**

8.5 Let M = (X, Q, d, S, F) be a deterministic ﬁnite-state automaton that accepts

language L. Then the deterministic ﬁnite-state automaton M

= (X, Q, d, S, Q−F),

where Q − F is the set of states Q from which all states in F have been removed,

accepts the language L. Here we assume that d q a is deﬁned for all q and a.

8.6

1. L

1

∩ L

2

= (L

1

∪ L

2

), so it follows that regular languages are closed under

intersection if they are closed under complementation and union. Regular lan-

guages are closed under union, see Theorem 8.10, and they are closed under

complementation, see Exercise 8.5.

2. Ldfa M

=

¦w ∈ X

∗

[ dfa accept w (d, (S

1

, S

2

), (F

1

F

2

))¦

=

¦w ∈ X

∗

[ dfa d (S

1

, S

2

) w ∈ F

1

F

2

¦

= This requires a proof by induction

¦w ∈ X

∗

[ (dfa d

1

S

1

w, dfa d

2

S

2

w) ∈ F

1

F

2

¦

=

¦w ∈ X

∗

[ dfa d

1

S

1

w ∈ F

1

∧ dfa d

2

S

2

w ∈ F

2

¦

=

¦w ∈ X

∗

[ dfa d

1

S

1

w ∈ F

1

¦ ∩ ¦w ∈ X

∗

[ dfa d

2

S

2

w ∈ F

2

¦

=

Ldfa M

1

∩ Ldfa M

2

8.7

1.

S

?>=< 89:; /.-, ()*+

(

@A BC

)

A

?>=< 89:;

)

@A BC

(

B

?>=< 89:; /.-, ()*+

251

B. Answers to exercises

2.

S

?>=< 89:;

0

GF ED

1

0

A

?>=< 89:;

0

1

C

?>=< 89:; /.-, ()*+

B

?>=< 89:; /.-, ()*+

0

D

?>=< 89:; /.-, ()*+

8.8 Use the deﬁnition of Lre to calculate the languages:

1. ¦, b¦

2. (bc

∗

)

3. ¦a¦b

∗

∪ c

∗

8.9

1. Lre(R(S +T))

=

(Lre(R))(Lre(S +T))

=

(Lre(R))(Lre(S) ∪ Lre(T))

=

(Lre(R))(Lre(S)) ∪ (Lre(R))(Lre(T))

=

Lre(RS +RT)

2. Similar to the above calculation.

8.10 Take R = a, S = a, and R = a, S = b.

8.11 If both V and W are subsets of S, then Lre(R(S+V )) = Lre(R(S+W)). Since

S = ∅, V = S and W = ∅ satisfy the requirement. Another solution is

V = S ∩ R

W = S ∩ R

Since S = ∅, at least one of S∩R and S∩R is not empty, and it follows that V = W.

There exist other solutions than these.

8.12 The string 01 may not occur in a string in the language of the regular expression.

So when a 0 appears somewhere, only 0’s can follow. Take 1∗0∗.

8.13

252

1. If we can give a regular grammar for (a + bb)∗ + c, we can use the procedure

constructed in Exercise 8.1 to obtain a regular grammar for the complete regular

expression. The following regular grammar generates (a + bb)∗ + c:

S

0

→ S

1

[ c

provided S

1

generates (a + bb)∗. Again, we use Exercise 8.1 and a regular

grammar for a + bb to obtain a regular grammar for (a + bb)∗. The language

of the regular expression a + bb is generated by the regular grammar

S

2

→ a [ bb

2. The language of the regular expression a∗ is generated by the regular grammar

S

0

→ [ aS

0

The language of the regular expression b∗ is generated by the regular grammar

S

1

→ [ bS

1

The language of the regular expression ab is generated by the regular grammar

S

2

→ ab

It follows that the language of the regular expression a∗ +b∗ +ab is generated

by the regular grammar

S

3

→ S

0

[ S

1

[ S

2

8.14 First construct a nondeterministic ﬁnite-state automaton for these grammars,

and use the automata to construct the following regular expressions.

1. a(b + b(b + ab)∗(ab +)) + bb∗ +

2. (0 + 10∗1)∗

8.16

1. b ∗ +b ∗ (aa ∗ b(a + b)∗)

2. (b + a(bb) ∗ b)∗

9.1 We follow the proof pattern for non-regular languages given in the text.

Let n ∈ N.

Take s = a

n

2

with x = , y = a

n

, and z = a

n

2

−n

.

Let u, v, w be such that y = uvw with v = , that is, u = a

p

, v = a

q

and w = a

r

with

p +q +r = n and q > 0.

Take i = 2, then

253

B. Answers to exercises

xuv

2

wz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

p+2q+r

a

n

2

−n

∈ L

⇐ p +q +r = n and q > 0

n

2

+q is not a square

⇐

n

2

< n

2

+q < (n + 1)

2

⇐

q < 2n + 1

9.2 Using the proof pattern.

Let n ∈ N.

Take s = a

n

b

n+1

with x = a

n

, y = b

n

, and z = b.

Let u, v, w be such that y = uvw with v = , that is, u = b

p

, v = b

q

and w = b

r

with

p +q +r = n and q > 0.

Take i = 0, then

xuwz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

n

b

p+r+1

∈ L

⇐

p +q +r p +r + 1

⇐

q > 0

9.3 Using the proof pattern.

Let n ∈ N.

Take s = a

n

b

2n

with x = a

n

, y = b

n

, and z = b

n

.

Let u, v, w be such that y = uvw with v = , that is, u = b

p

, v = b

q

and w = b

r

with

p +q +r = n and q > 0.

Take i = 2, then

xuv

2

wz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

n

b

2n+q

∈ L

⇐

2n +q > 2n

⇐

q > 0

254

9.4 Using the proof pattern.

Let n ∈ N.

Take s = a

6

b

n+4

a

n

with x = a

6

b

n+4

, y = a

n

, and z = .

Let u, v, w be such that y = uvw with v = , that is, u = a

p

, v = a

q

and w = a

r

with

p +q +r = n and q > 0.

Take i = 6, then

xuv

6

wz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

6

b

n+4

a

n+5q

∈ L

⇐

n + 5q > n + 4

⇐

q > 0

9.5 The length of a substring with a’s, b’s, and c’s is at least r +2, and [vwx[ d

r.

9.6 We follow the proof pattern for proving that a language is not context-free given

in the text.

Let c, d ∈ N.

Take z = a

k

2

with k = max(c, d).

Let u, v, w, x, y be such that z = uvwxy, [vx[ > 0 and [vwx[ d

That is u = a

p

, v = a

q

, w = a

r

, x = a

s

and y = a

t

with p + q + r + s + t = k

2

,

q +r +s d and q +s > 0.

Take i = 2, then

uv

2

wx

2

y ∈ L

⇐ defn. u, v, w, x, y, calculus

a

k

2

+q+s

∈ L

⇐ q +s > 0

k

2

+q +s is not a square

⇐

k

2

< k

2

+q +s < (k + 1)

2

⇐

q +s < 2k + 1

⇐ defn. k

q +s d

9.7 Using the proof pattern.

Let c, d ∈ N.

255

B. Answers to exercises

Take z = a

k

with k is prime and k > max(c, d).

Let u, v, w, x, y be such that z = uvwxy, [vx[ > 0 and [vwx[ d

That is u = a

p

, v = a

q

, w = a

r

, x = a

s

and y = a

t

with p + q + r + s + t = k,

q +r +s d and q +s > 0.

Take i = k + 1, then

uv

k+1

wx

k+1

y ∈ L

⇐ defn. u, v, w, x, y, calculus

a

k+kq+ks

∈ L

⇐

k(1 +q +s) is not a prime

⇐

q +s > 0

9.8 Using the proof pattern.

Let c, d ∈ N.

Take z = a

k

b

k

a

k

b

k

with k = max(c, d).

Let u, v, w, x, y be such that z = uvwxy, [vx[ > 0 and [vwx[ d

Note that our choice for k guarantees that substring vwx has one of the following

shapes:

• vwx consists of just a’s, or just b’s.

• vwx contains both a’s and b’s.

Take i = 0, then

• If vwx consists of just a’s, or just b’s, then it is impossible to write the string

uwy as ww for some string w, since only the number of terminals of one kind

is decreased.

• If vwx contains both a’s and b’s, it lies somewhere on the border between a’s

and b’s, or on the border between b’s and a’s. Then the string uwy can be

written as

uwy = a

s

b

t

a

p

b

q

for some s, t, p, q, respectively. At least one of s ,t, p and q is less than k,

while two of them are equal to k. Again this sentence is not an element of the

language.

9.9 Using the proof pattern.

Let n ∈ N.

Take s = 1

n

0

n

with x = 1

n

, y = 0

n

, and z = .

Let u, v, w be such that y = uvw with v = , that is, u = b

p

, v = b

q

and w = b

r

with

p +q +r = n and q > 0.

Take i = 2, then

256

xuv

2

w ∈ L

⇐ defn. x, u, v, w, z, calculus

1

n

0

p+2q+r

∈ L

⇐ p +q +r = n

1

n

0

n+q

∈ L

⇐ q > 0

true

9.10

1. The language ¦a

i

b

j

[ 0 i j¦ is context-free. The language can be generated

by the following context-free grammar:

S → aSbA

S →

A → bA

A →

This grammar generates the strings a

m

b

m

b

n

for m, n ∈ N. These are exactly

the strings of the given language.

2. The language is not regular. We prove this using the proof pattern.

Let n ∈ N.

Take s = a

n

b

n+1

with x = , y = a

n

, and z = b

n+1

.

Let u, v, w be such that y = uvw with v = , that is, u = a

p

, v = a

q

and w = a

r

with p +q +r = n and q > 0.

Take i = 3, then

xuv

3

wz ∈ L

⇐ defn. x, u, v, w, z, calculus

a

p+3q+r

b

n+1

∈ L

⇐ p +q +r = n

n + 2q > n + 1

⇐ q > 0

true

9.11

1. The language ¦wcw [ w ∈ ¦a, b¦

∗

¦ is not context-free. We prove this using the

proof pattern.

Let c, d ∈ N.

257

B. Answers to exercises

Take z = a

k

b

k

ca

k

b

k

with k = max(c, d).

Let u, v, w, x, y be such that z = uvwxy, [vx[ > 0 and [vwx[ d

Note that our choice for k guarantees that [vwx[ k. The substring vwx has

one of the following shapes:

• vwx does not contain c.

• vwx contains c.

Take i = 0, then

• If vwx does not contain c it is a substring at the left-hand side or at the

right-hand side of c. Then it is impossible to write the string uwy as scs

for some string s, since only the number of terminals on one side of c is

decreased.

• If vwx contains c then the string vwx can be written as

vwx = b

s

ca

t

for some s, t respectively. For uwy there are two possibilities.

The string uwy does not contain c so it is not an element of the language.

If the string uwy does contain c then it has either fewer b’s at the left-hand

side than at the right-hand side or fewer a’s at the right-hand side than

at the left-hand side, or both. So it is impossible to write the string uwy

as scs for some string s.

2. The language is not regular because it is not context-free. The set of regular

languages is a subset of the set of context-free languages.

9.12

1. The grammar G is context-free because there is only one nonterminal at the

left hand side of the production rules and the right hand side is a sequence of

terminals and nonterminals. The grammar is not regular because there is a

production rule with two nonterminals at the right hand side.

2. L(G) = ¦(st)

∗

[ s ∈ ¦

∗

∧ t ∈ ¦

∗

∧ [s[ = [t[ ¦

L(G) is all sequences of nested braces.

3. The language is context-free because it is generated by a context-free grammar.

The language is not regular. We use the proof pattern and choose

s = ¦

n

¦

n

. The proof is exactly the same as the proof for non-regularity of

L = ¦a

m

b

m

[ m 0¦ in Section 9.2.

9.13

1. The grammar is context-free. The grammar is not right-regular but left-regular

because the nonterminal appears at the left hand side in the last production

rule.

258

2. L(G) = 0

∗

∪ 10

∗

Note that the language is also generated by the following right-regular grammar:

S →

S → 1A

S → 0A

A →

A → 0

A → 0A

3. The language is context-free and regular because it is generated by a regular

grammar.

10.1 For both grammars we have:

empty = const False

10.2 For grammar1 we have:

firsts S = {b,c}

firsts A = {a,b,c}

firsts B = {a,b,c}

firsts C = {a,b}

For grammar2:

firsts S = {a}

firsts A = {b}

10.3 For gramm1 we have:

follow S = {a,b,c}

follow A = {a,b,c}

follow B = {a,b}

follow C = {a,b,c}

For gramm2:

follow = const {}

10.4 For the productions of gramm3 we have:

259

B. Answers to exercises

lookAhead (S → AaS) = ¦a,c ¦

lookAhead (S → B) = ¦b¦

lookAhead (A → cS) = ¦c¦

lookAhead (A → []) = ¦a¦

lookAhead (B → b) = ¦b¦

Since all lookAhead sets for productions of the same nonterminal are disjoint, gramm3

is an LL(1) grammar.

10.5 After left factoring gramm2 we obtain gramm2’ with productions

S → aC

C → bA [ a

A → bD

D → b [ S

For this transformed grammar gramm2’ we have:

The empty function

empty = const False

The first function

firsts S = {a}

firsts C = {a,b}

firsts A = {b}

firsts D = {a,b}

The follow function

follow = const {}

The lookAhead function

lookAhead (S → aC) = ¦a¦

lookAhead (C → bA) = ¦b¦

lookAhead (C → a) = ¦a¦

lookAhead (A → bD) = ¦b¦

lookAhead (D → b) = ¦b¦

lookAhead (D → S) = ¦a¦

Clearly, gramm2’ is an LL(1) grammar.

10.6 For the empty function we have:

260

empty R = True

empty _ = False

For the firsts function we have

firsts L = {0,1}

firsts R = {,}

firsts B = {0,1}

The follow function is

follow B = {,}

follow _ = {}

The lookAhead function is

lookAhead (L → B R) = ¦0, 1¦

lookAhead (R → ) = ¦¦

lookAhead (R → , B R) = ¦,¦

lookAhead (B → 0) = ¦0¦

lookAhead (B → 1) = ¦1¦

Since all lookAhead sets for productions of the same nonterminal are disjoint, the

grammar is LL(1).

10.7

Node S [ Node c []

, Node A [ Node c []

, Node B [ Node c [], Node c [] ]

, Node C [ Node b [], Node a [] ]

]

]

10.8

Node S [ Node a []

, Node C [ Node b []

, Node A [ Node b [], Node D [ Node b [] ] ]

]

]

10.9

261

B. Answers to exercises

Node S [ Node A []

, Node a []

, Node S [ Node A [ Node c [], Node S [ Node B [ Node b []]]]

, Node a []

, Node S [ Node B [ Node b [] ]]

]

]

10.10

1. list2Set :: Ord s => [s] -> [s]

list2Set = unions . map single

2. list2Set :: Ord s => [s] -> [s]

list2Set = foldr op []

where

op x xs = single x ‘union‘ xs

3. pref :: Ord s => (s -> Bool) -> [s] -> [s]

pref p = list2Set . takeWhile p

or

pref p [] = []

pref p (x:xs) = if p x then single x ‘union‘ pref p xs else []

or

pref p = foldr op []

where

op x xs = if p x then single x ‘union‘ xs else []

4. prefplus p [] = []

prefplus p (x:xs) = if p x then single x ‘union‘ prefplus p xs

else single x

5. prefplus p = foldr op []

where

op x us = if p x then single x ‘union‘ us

else single x

6. prefplus p

=

foldr op []

where

op x us = if p x then single x ‘union‘ us else single x

=

foldr op []

262

where

op x us = single x ‘union‘ rest

where

rest = if p x then us else []

=

foldrRhs p single []

7. The function foldrRhs p f [] takes a list xs and returns the set of all f-images

of the elements of the preﬁx of xs all of whose elements satisfy p together with

the f-image of the ﬁrst element of xs that does not satisfy p (if this element

exists).

The function foldrRhs p f start takes a list xs and returns the set start

‘union‘ foldrRhs p f [] xs.

8.

10.11

1. For gramm1

a) foldrRhs empty first [] bSA

=

unions (map first (prefplus empty bSA))

= empty b = False

unions (map first [b])

=

unions [¦b¦]

=

¦b¦

b) foldrRhs empty first [] Cb

=

unions (map first (prefplus empty Cb))

= empty C = False

unions (map first [C])

=

unions [¦a, b ¦]

=

¦ a, b ¦

2. For gramm3

a) foldrRhs empty first [] AaS

=

unions (map first (prefplus empty AaS))

263

B. Answers to exercises

= empty A = True

unions (map first ([A] ‘union‘ prefplus empty aS))

empty a = False

unions (map first ([A] ‘union‘ [a]))

=

unions (map first ([A, a]))

=

unions [first A, first a]

=

unions [¦c¦, ¦a¦]

=

¦a, c¦

b) scanrRhs empty first [] AaS

=

map (foldrRhs empty first []) (tails AaS)

=

map (foldrRhs empty first []) [AaS, aS, S, []]

=

[ foldrRhs empty first [] AaS

, foldrRhs empty first [] aS

, foldrRhs empty first [] S

, foldrRhs empty first [] []

]

= calculus

[ ¦a, c¦, ¦a¦, ¦a, b, c ¦, [] ]

11.1 Recall, for gramm1 we have:

follow S = {a,b,c}

follow A = {a,b,c}

follow B = {a,b}

follow C = {a,b,c}

The production rule B → cc cannot be used for a reduction when the input begins

with the symbol c.

264

stack input

ccccba

c cccba

cc ccba

ccc cba

cccc ba

Bcc ba

bBcc a

abBcc

CBcc

Ac

S

11.2 Recall, for gramm2 we have:

follow S = {}

follow A = {}

No reduction can be made until the input is empty.

stack input

abbb

a bbb

ba bb

bba b

bbba

Aba

S

11.3 Recall, for gramm3 we have:

follow S = {a}

follow A = {a}

follow B = {a}

At the beginning a reduction is applied using the production rule A → ε.

265

B. Answers to exercises

stack input

acbab

A acbab

aA cbab

caA bab

bcaA ab

BcaA ab

ScaA ab

AaA ab

aAaA b

baAaA

SaAaA

SaA

S

266

Copyright c 2001–2009 by Johan Jeuring and Doaitse Swierstra.

Contents

Preface for the 2009 version Voorwoord 1. Goals 1.1. History . . . . . . . . . . . . . . 1.2. Grammar analysis of context-free 1.3. Compositionality . . . . . . . . . 1.4. Abstraction mechanisms . . . . . v vii 1 2 3 4 4 7 8 11 14 15 17 19 22 23 23 23 24 24 25 26 27 28 31 34 36 37

. . . . . . grammars . . . . . . . . . . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

. . . .

2. Context-Free Grammars 2.1. Languages . . . . . . . . . . . . . . . . . . . . . . . . . 2.2. Grammars . . . . . . . . . . . . . . . . . . . . . . . . . 2.2.1. Notational conventions . . . . . . . . . . . . . . 2.3. The language of a grammar . . . . . . . . . . . . . . . 2.3.1. Examples of basic languages . . . . . . . . . . . 2.4. Parse trees . . . . . . . . . . . . . . . . . . . . . . . . 2.5. Grammar transformations . . . . . . . . . . . . . . . . 2.5.1. Removing duplicate productions . . . . . . . . 2.5.2. Substituting right hand sides for nonterminals 2.5.3. Removing unreachable productions . . . . . . . 2.5.4. Left factoring . . . . . . . . . . . . . . . . . . . 2.5.5. Removing left recursion . . . . . . . . . . . . . 2.5.6. Associative separator . . . . . . . . . . . . . . . 2.5.7. Introduction of priorities . . . . . . . . . . . . . 2.5.8. Discussion . . . . . . . . . . . . . . . . . . . . . 2.6. Concrete and abstract syntax . . . . . . . . . . . . . . 2.7. Constructions on grammars . . . . . . . . . . . . . . . 2.7.1. SL: an example . . . . . . . . . . . . . . . . . . 2.8. Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . 2.9. Exercises . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

3. Parser combinators 41 3.1. The type of parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.2. Elementary parsers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.3. Parser combinators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

i

Contents

3.4.

3.5. 3.6. 3.7.

3.3.1. Matching parentheses: an example More parser combinators . . . . . . . . . . 3.4.1. Parser combinators for EBNF . . . 3.4.2. Separators . . . . . . . . . . . . . . Arithmetical expressions . . . . . . . . . . Generalised expressions . . . . . . . . . . Exercises . . . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

53 55 55 58 60 63 64 67 68 68 69 69 69 70 71 72 72 74 75 76 77 78 78 80 81 82 82 83 86 88 88 89 90 90 91 93 95 97 97 98 102

4. Grammar and Parser design 4.1. Step 1: Example sentences for the language 4.2. Step 2: A grammar for the language . . . . 4.3. Step 3: Testing the grammar . . . . . . . . 4.4. Step 4: Analysing the grammar . . . . . . . 4.5. Step 5: Transforming the grammar . . . . . 4.6. Step 6: Deciding on the types . . . . . . . . 4.7. Step 7: Constructing the basic parser . . . . 4.7.1. Basic parsers from strings . . . . . . 4.7.2. A basic parser from tokens . . . . . 4.8. Step 8: Adding semantic functions . . . . . 4.9. Step 9: Did you get what you expected . . 4.10. Exercises . . . . . . . . . . . . . . . . . . . 5. Compositionality 5.1. Lists . . . . . . . . . . . . . . . . . . . 5.1.1. Built-in lists . . . . . . . . . . . 5.1.2. User-deﬁned lists . . . . . . . . 5.1.3. Streams . . . . . . . . . . . . . 5.2. Trees . . . . . . . . . . . . . . . . . . . 5.2.1. Binary trees . . . . . . . . . . . 5.2.2. Trees for matching parentheses 5.2.3. Expression trees . . . . . . . . 5.2.4. General trees . . . . . . . . . . 5.2.5. Eﬃciency . . . . . . . . . . . . 5.3. Algebraic semantics . . . . . . . . . . 5.4. Expressions . . . . . . . . . . . . . . . 5.4.1. Evaluating expressions . . . . . 5.4.2. Adding variables . . . . . . . . 5.4.3. Adding deﬁnitions . . . . . . . 5.4.4. Compiling to a stack machine . 5.5. Block structured languages . . . . . . 5.5.1. Blocks . . . . . . . . . . . . . . 5.5.2. Generating code . . . . . . . . 5.6. Exercises . . . . . . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . . . . . . . .

ii

Contents

6. Computing with parsers 6.1. Insert a semantic function in the parser 6.2. Apply a fold to the abstract syntax . . . 6.3. Deforestation . . . . . . . . . . . . . . . 6.4. Using a class instead of abstract syntax 6.5. Passing an algebra to the parser . . . . 7. Programming with higher-order folds 7.1. The rep min problem . . . . . . . . . . 7.1.1. A straightforward solution . . . 7.1.2. Lambda lifting . . . . . . . . . 7.1.3. Tupling computations . . . . . 7.1.4. Merging tupled functions . . . 7.2. A small compiler . . . . . . . . . . . . 7.2.1. The language . . . . . . . . . . 7.2.2. A stack machine . . . . . . . . 7.2.3. Compiling to the stackmachine 7.3. Attribute grammars . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

105 . 105 . 105 . 106 . 107 . 109 111 112 113 113 114 115 117 117 117 118 122

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

. . . . . . . . . .

8. Regular Languages 8.1. Finite-state automata . . . . . . . . . . . . . . . . . . . . . . . 8.1.1. Deterministic ﬁnite-state automata . . . . . . . . . . . . 8.1.2. Nondeterministic ﬁnite-state automata . . . . . . . . . . 8.1.3. Implementation . . . . . . . . . . . . . . . . . . . . . . . 8.1.4. Constructing a DFA from an NFA . . . . . . . . . . . . 8.1.5. Partial evaluation of NFA’s . . . . . . . . . . . . . . . . 8.2. Regular grammars . . . . . . . . . . . . . . . . . . . . . . . . . 8.2.1. Equivalence of Regular grammars and Finite automata 8.3. Regular expressions . . . . . . . . . . . . . . . . . . . . . . . . . 8.4. Proofs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8.4.1. Proof of Theorem 8.7 . . . . . . . . . . . . . . . . . . . 8.4.2. Proof of Theorem 8.16 . . . . . . . . . . . . . . . . . . . 8.4.3. Proof of Theorem 8.17 . . . . . . . . . . . . . . . . . . . 8.5. Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9. Pumping Lemmas: the expressive power of languages 9.1. The Chomsky hierarchy . . . . . . . . . . . . . . . . . . 9.1.1. Type-0 grammars . . . . . . . . . . . . . . . . . . 9.1.2. Type-1 grammars . . . . . . . . . . . . . . . . . . 9.1.3. Type-2 grammars . . . . . . . . . . . . . . . . . . 9.1.4. Type-3 grammars . . . . . . . . . . . . . . . . . . 9.2. The pumping lemma for regular languages . . . . . . . . 9.3. The pumping lemma for context-free languages . . . . . 9.4. Proofs of pumping lemmas . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

. . . . . . . . . . . . . .

123 . 123 . 124 . 126 . 129 . 132 . 135 . 136 . 139 . 142 . 145 . 146 . 147 . 148 . 150 153 154 154 154 154 155 155 157 161

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

iii

. . . . . . . . .7. 206 211 213 . . . . . . . . . .3. . . . LL Parsing: Background . 11. Implementation of ﬁrst and last . . . . . . Implementation of isLL(1) . . .3.Contents 9. 200 . . . . . . . . . . .4. . 11. . . . . . . . 10. . . An LL(1) grammar .1. . . .1. . 167 . . . . . . . . .2. . . . . . . . 179 . . . . . . Implementation of empty . . . 10. . . . . . . .1. . . . . . . . . . . . . . . 11. . . . . 11. . . . . . . . . . . . .2. . .2. . . . . . . . . . . . . LR parse example . 176 . . . . 162 9. . .1. . 11. . . .1. . . . . 10. . . . .2 . . . . . Answers to exercises . 201 . .1. .2. . . .1. . . . . . LR action selection . . .2. . . . . . . . . . . . . . . .2. . . . 10. . . . . . . . . . . . .3. . 169 . . . .1. . .1 . . . . . . . . . . . .4. . . . 161 9. . . . . . . . . . . . . . . . 11. 10. .8. . . . . . . . .1. The Stack module B. . .2. . . . 11. .5. . . 199 . . . . . . . .6. 198 . . . 11.2. . LL(1) grammars . Proof of the Regular Pumping Lemma. . .1. 10. . . . . .1.2. . . . . . 10. . .3. 196 . . . . . . . 176 . .5. . . . . . . . Context-free grammars in Haskell . . . 168 . LL(1) parsing . .2. . . . 201 . . . . . . . . . . . . . . . . . . . . .3. LL Parsing: Implementation . Exercises . . . . LL(1) parser example . 11. . . . . . . . . . . . . . . . . . . . 181 .1. . . . . . . . . . . .LL versus LR parsing 11. . 187 . . . . . . . . 164 10. . . . .3. . . . . . . . . . . . . . . . Some example derivations . . . . . . . . . . . An LR checker .LL Parsing 10. . . LR parsing . . . .2. Implementation of follow . . . . . . . . . .3. . . . . . . . . . . A stack machine for parsing .1.2. . . . 181 . . . Theorem 9. . .4. . . .2. . . . . . . . . Proof of the Context-free Pumping Lemma. . . . . 178 . . 10.2. . 196 . . . . . LR optimizations and generalizations . . . . . . . . . . . . . . . 186 . .2. . . . . . . . . . . . . 10. . . . . 10. . . . . 202 . . . . 190 195 . . . . . A stack machine for SLR parsing . iv . 10. . . . . . 167 . . Parse trees in Haskell . 11. . . . . . . . . . . . An LL(1) checker . Implementation of lookahead . . . . . . 10. Theorem 9.2. 197 . . . A. . . . . . . . . . Using the LL(1) parser . . .3. . . . . . . . . . . 173 . . .1. . . . . . . . . .

These lecture notes are in large parts identical with the old lecture notes for “Grammars and Parsing” by Johan Jeuring and Doaitse Swierstra. and this means that some of the later chapters look a bit diﬀerent from the earlier chapters. Some modiﬁcations made by Manuela Witsiers have been reintegrated. I have also started to make some modiﬁcations to style and content – mainly trying to adapt the Haskell code contained in the lecture notes to currently established coding guidelines.Preface for the 2009 version This is a work in progress. I hope to be able to update the online version of the lecture notes regularly. but also at the Open University. this work is not quite ﬁnished. The lecture notes have not only been used at Utrecht University. Andres L¨h o October 2009 v . I also rearranged some of the material because I think they connect better in the new order. Please feel free to point out any mistakes you ﬁnd in these lecture notes to me – any other feedback is also welcome. I apologize in advance for any inconsistencies or mistakes I may have introduced. However.

Preface for the 2009 version vi .

en naarmate er meer voorbeelden gegeven worden. • Oefening baart kunst. Breght Boschker. en doe dat niet door te kijken of je de oplossingen die anderen gevonden hebben. Erik Meijer. “Begrijpen is echter niet hetzelfde als “kennen”. die onder andere geschreven zijn in het kader van het project Kwaliteit en Studeerbaarheid. is het verleidelijker om. In het ideale geval zou je in staat moeten zijn een mooi tentamen in elkaar te zetten voor je mede-studenten! • Zorg dat je up-to-date bent. Het dictaat is de afgelopen jaren verbeterd. “kennen” is iets anders dan “beheersen” en “beheersen” is weer iets anders dan “er iets mee kunnen”. Rik van Geldrop. Pieter Eendebak. Veel mensen hebben een bijgedrage geleverd aan de totstandkoming van dit dictaat door een gedeelte te schrijven.Voorwoord Het hiervolgende dictaat is gebaseerd op teksten uit vorige jaren. Martin Bravenboer. In tegenstelling tot sommige andere vakken is het bij dit vak gemakkelijk de vaste grond onder je voeten kwijt te raken. probeer dan eens in eigen woorden uiteen te zetten. die mee hebben geholpen door het schrijven van (een) hoofdstuk(ken) van het dictaat. of (een gedeelte van) het dictaat te becommentari¨ren. Gijsbert Bol. de conclusie te trekken dat je een en ander daadwerkelijk beheerst. Het is niet “elke week nieuwe kansen”. begrijpt. Graham Hutton. Daan Leijen. Als je dus van mening bent dat je een hoofdstuk goed begrijpt. e Speciale vermelding verdienen Jeroen Fokker. maar we houden ons van harte aanbevolen voor suggesties voor verdere verbetering. Arnoud Berendsen. We hebben geprobeerd door de indeling van vii . en Vincent Oosto indi¨. e Tenslotte willen we van de gelegenheid gebruik maken enige studeeraanwijzingen te geven: • Het is onze eigen ervaring dat het uitleggen van de stof aan iemand anders vaak pas duidelijk maakt welke onderdelen je zelf nog niet goed beheerst. Rijk-Jan van Haaften. Commentaar is onder andere geleverd door: Arthur Baars. en Luc Duponcheel. Andres L¨h. Probeer voor jezelf bij te houden welk stadium je bereikt hebt met betrekking tot alle genoemde leerdoelen. na lezing van een hoofdstuk. met name daar waar het het aangeven van verbanden met andere vakken betreft. Maak dus de opgaven die in het dictaat opgenomen zijn zelf. Naarmate er meer aandacht wordt besteed aan de presentatie van de stof.

Als je een week gemist hebt is het vrijwel onmogelijk de nieuwe stof van de week daarop te begrijpen. De tijd die je dan op college en werkcollege doorbrengt is dan weinig eﬀectief. en hopelijk ook veel plezier. • We maken gebruik van de taal Haskell om veel concepten en algoritmen te presenteren. Goed gereedschap is het halve werk. Anders maak je jezelf het leven heel moeilijk. Johan Jeuring en Doaitse Swierstra viii . en Haskell is hier ons gereedschap. met als gevolg dat je vaak voor het tentamen heel veel tijd (die er dan niet is) kwijt bent om in je uppie alles te bestuderen. maar de totale opbouw laat hier niet heel veel vrijheid toe. Veel sterkte. Als je nog moeilijkheden hebt met de taal Haskell aarzel dan niet direct hier wat aan te doen. en zonodig hulp te vragen.Voorwoord de stof hier wel iets aan te doen.

and when completing the practical exercises. This situation has remained so for years. we could construct programs someone else could not – thus giving us the feeling we had “the right stuﬀ” –. show correspondences and transfer insights from one area of interest to the other. thus linking the o area with many other areas of computer science. For many practicing computer scientists the course on compiler construction still is one of the highlights of their education. where we had tools for generating parts of compilers out of formal descriptions of the tasks to be performed. Goals The goals of these lecture notes can be split into primary goals. and it is only in the last years that we have started to discover and make explicit the reasons why this area attracted so much interest. and not only that. we had the usual satisﬁed feeling. and secondary – but not less important – goals which 1 . The reason for this is that from the very beginning of these curricula it has been one of the few areas where the development of formal methods and the application of formal techniques in actual program construction come together. For a long time the construction of compilers has been one of the few areas where we had a methodology available. One of the things which were not so clear however is where exactly this joy originated from: the techniques taught deﬁnitely had a certain elegance. and where such program generators were indeed generating programs which would have been impossible to create by hand. and now we knew “how to do it”. Goals Introduction Courses on Grammars. which invariably consisted of constructing a compiler for some toy language. This feeling was augmented by the fact that we would not have had the foggiest idea how to complete such a product a few months before. but also giving us a means to explain such links.1. have been provided with a theoretical underpinning which explains why the techniques work. As a beneﬁcial side-eﬀect we also gradually learned to see where the discovered concept further played a rˆle. Parsing and Compilation of programming languages have always been some of the core components of a computer science curriculum. Many of the techniques which were taught on a “this is how you solve this kind of problems” basis. which are associated with the speciﬁc subject studied. stress their importance.

• to understand the concept of compositionality. which was soon to be followed by ALGOL-60 and COBOL. the problem of how to describe the language was not a hot issue: much more important problems were to be solved. to understand the concept of domain speciﬁc languages. • to familiarise oneself with the concept of computability. of which John Backus was the prime architect. goals are: to develop the capability to abstract. The secondary. how to construct a compiler for the language that would ﬁt into the small memories which were available at that time (kilobytes instead of megabytes). more far reaching. Writing larger and larger programs by more and more people sparked the development of the ﬁrst more or less machine-independent programming language FORTRAN (FORmula TRANslator). As a result the language was very much implicitly deﬁned by what was accepted by the compiler and what not. History When at the end of the ﬁfties the use of computers became more and more widespread. • to be able to apply these techniques in the construction of all kinds of programs.1. to 2 . The primary. and their reliability had increased enough to justify applying them to a wide range of problems. For the developers of the FORTRAN language. how to recognise (build) such structures in (from) a sequence of symbols. such as. • to show how to develop programs in a calculational style. and how to generate machine code that would not be ridiculed by programmers who had thus far written such code by hand. Goals have to do with developing skills which one would expect every educated computer scientist to have. to show how proper formalisations can be used as a starting point for the construction of useful tools.1.. Soon after the development of FORTRAN an international working group started to work on the design of a machine independent high-level programming language. what should be in the language and what not. • to know how to parse. i. • to know how to analyse grammars to see whether or not speciﬁc properties hold. • to improve the general programming skills. • • • • 1..e. goals are: • to understand how to describe structures (i. • to show a wide variety of useful programming techniques.e. “formulas”) using grammars. it was no longer the actual hardware which posed most of the problems. somewhat more traditional. to understand the concepts of abstract interpretation and partial evaluation.

programs emerged which take such a piece of language description as input. such as Java and Haskell. i. where the type checking performed by the compilers is one of the main contributions to the increase in eﬃciency in the programming process. and not so much generating actual code. This latter fact may not be so important from a theoretical point of view. this notation. It appeared that the BNF-formalism actually was a member of a hierarchy of grammar classes which had been formulated a number of years before by the linguist Noam Chomsky in an attempt to capture the concept of a “language”. which soon became to be known as the Backus-Naur formalism (BNF).2. Once this relationship between a piece of BNF and a compiler became well understood. and especially people writing compilers. not only a language standard was produced. inspired by the Chomsky hierarchy. As a remarkable side-eﬀect of this undertaking. and probably not so good in handling erroneous input. 1.e. Unfortunately. for a human being it is not always easy. Ever since it was introduced. It was not for long that computer scientists. we most often speak of context-free grammars. and probably caused by the need to exchange proposals in writing. Such programs are now known under the name parser generators. but it is from a pragmatical point of view. and. discovered that the formalism was not only useful to express what language should be accepted by their compilers. but could also be used as a guideline for structuring their compilers. which classes of languages can be expressed by means of BNF and which not. the construction of compilers.. When constructing a recogniser for a language described by a context-free grammar one often wants to check whether or not the grammar has speciﬁc desirable properties. has been used as the primary tool for describing the basic structure of programming languages. Questions arose about the expressibility of BNF. This aspect is even stronger present in strongly typed languages. Grammar analysis of context-free grammars Nowadays the use of the word Backus-Naur is gradually diminishing. and produce a skeleton of the desired compiler. the BNFformalism also became soon a subject of study for the more theoretically oriented. and consequently how to express restrictions and properties of languages for which the BNF-formalism is not powerful enough. For the construction of everyday compilers for everyday languages it appears that this class is still a bit too large. If we use the full power of the context-free languages we get compilers which in general are ineﬃcient. Besides these very mundane goals.e.. Grammar analysis of context-free grammars become known under the name ALGOL-60. but also a notation for describing programming languages was proposed by Naur and used to describe the language in the famous Algol-60 report.2. and quite often practically 3 . i. In the lectures we will see many examples of this.1. Most invocations of compilers still have as their primary goal to discover mistakes made when typing the program.

Goals impossible. and some are easily checked by a simple human inspection. In this course we will see many examples of such classes.4. to see whether or not a particular property holds. some are easy to check by machine. As a side eﬀect you will acquire a special form of writing functional programs. Once this phenomenon was recognised it went under the name syntax directed compilation. it was recognised that actually compilers are a special form of homomorphisms. the more computational eﬀort is needed and the sooner this question cannot be answered by a human being anymore. The only way we see is to take a historians view and to compare the old and the new situations. and under the inﬂuence of the more functional oriented style of programming. Fortunately however there have also been some developments in programming language design. The general observation is that the more precise the answer to a speciﬁc question one wants to have.1. We will see that the concept of lazy evaluation plays an important rˆle in making these eﬃcient and straightforward implementations possible. a concept thus far only familiar to mathematicians and more theoretically oriented computer scientist that study the description of the meaning of a programming language. and an eﬃcient way of partitioning a problem into smaller components. This has led to a whole hierarchy of context-free grammars classes. Unfortunately there is no simple way to teach the techniques which have led us thus far. o 1. This should not come as a surprise since this recognition is a direct consequence of the tendency that ever greater parts of compilers are more or less automatically generated from a formal description of some aspect of a programming language. Furthermore. albeit quite 4 . Under closer scrutiny. 1.3. which makes it often surprisingly simple to solve at ﬁrst sight rather complicated programming assignments. of which we want to mention the developments in the area of functional programming in particular. is that over the last thirty years we have discovered the right kind of abstractions to be used. We claim that the combination of a modern. e. We will see many examples of such mappings. some of which are more powerful. Abstraction mechanisms One of the main reasons for that what used to be an endeavour for a large team in the past can now easily be done by a couple of ﬁrst year’s students in a matter of days or weeks. it may be very expensive to check whether or not such a property holds. Compositionality As we will see the structure of many compilers follows directly from the grammar that describes the language to be compiled. by making use of a description of their outer appearance or by making use of a description of the semantics (meaning) of a language.g.

4.e. and fragments thereof.1. combined with the concept of lazy evaluation. in a modern functional language. 5 . and even ﬁnd some inspiration for further developments in language design: i. Abstraction mechanisms elaborate. ﬁnd clues about how to extend such languages to enable us to make common patterns. explicit. type system. provides an ideal platform to develop and practice ones abstraction skills. we can show the real power of abstraction. There does not exist another readily executable formalism which may serve as an equally powerful tool.. which thus far have only been demonstrated by giving examples. We hope that by presenting many algorithms.

Goals 6 .1.

we have a grammatical formalism for describing artiﬁcial languages. g. when reading this sentence. Context-Free Grammars Introduction We often want to recognise a particular structure hidden in a sequence of symbols. whereas a Java program can be seen as a sequence of Java symbols. in the area of natural language research the question has often been posed what actually constitutes a “language”. speciﬁc operators etc. set of sentences and sentences are sequences of symbols taken from a ﬁnite set (e.. sequences of characters.2. 7 . given the limitations of the formalism. A diﬀerence with the grammars for natural languages is that this grammatical formalism is a completely formal one. This chapter introduces the most important notions of this course: the concept of a language and a grammar. where one may easily disagree about whether something is correct English or not. Notice that this is quite diﬀerent from the grammars for natural languages. possibly inﬁnite. not any sequence of symbols is an English sentence. and that a sentence consists of a sequence of English words. This completely formal approach however also comes with a disadvantage. Just as we say that the fact whether or not a sentence belongs to the English language is determined by the English grammar (remember that before we have used the phrase “grammatically correct”). A language is a. So how do we characterise English sentences? This is an old question. This property may enable us to mathematically prove that a sentence belongs to some language. reserved words. This terminology has been carried over to computer science: the programming language Java can be seen as the set of all correct Java programs. you automatically structure it by means of your understanding of the English language. the expressiveness of the class of grammars we are going to describe in this chapter is rather limited. which are referred to as strings). Of course. which was posed long before computers were widely used. The simplest deﬁnition one can come up with is to say that the English language equals the set of all grammatically correct English sentences. such as identiﬁers. and often such proofs can be constructed automatically by a computer in a process called parsing. For example. and there are many languages one might want to describe but which cannot be described.

know how to describe languages by means of context-free grammars. understand the rˆle of parse trees. The set of sequences over X . then az is also a sequence. understand the derivation process used in describing languages. Languages The goal of this section is to introduce the concepts of language and sentence.2. 2.1. whereas sequences are always ﬁnite. called the empty sequence. It is implicitly understood that nothing that cannot be formed by repeated. In conventional texts about mathematics it is not uncommon to encounter a deﬁnition of sequences that looks as follows: sequence Deﬁnition 2. The above deﬁnition is an instance of a very common deﬁnition pattern: it is a deﬁnition by induction. induction 8 . be able to verify whether or not a simple grammar is ambiguous. be able to read and interpret the BNF notation. i. In particular. be able to convert a grammar from EBNF-notation into BNF-notation by hand. the deﬁnition of the concept refers to the concept itself. Context-Free Grammars Goals The main goal of this chapter is to introduce and show the relation between the main concepts for describing the parsing problem: languages and sentences.1 (Sequence). the deﬁnition corresponds almost exactly to the deﬁnition of the type [a] of lists with elements of type a in Haskell. for example for removing left recursion. called X ∗ . o understand the relation between context-free grammars and datatypes. be able to transform a grammar. but ﬁnite application of one of the two given rules is a sequence over X . In the following. understand the EBNF formalism. and grammars. after you have studied this chapter you will: • • • • • • • • • • • • • know the concepts of language and sentence. Let X be a set. and • if z is a sequence and a is an element of X . understand the concepts of concrete and abstract syntax . be able to construct a simple context-free grammar in EBNF notation. know the diﬀerence between a terminal symbol and a nonterminal symbol. The one diﬀerence is that Haskell lists can be inﬁnite. They can be implemented easily in Haskell using lists.. we will introduce several concepts based on sequences. e. Furthermore. is deﬁned as follows: • ε is a sequence.

e.) Note that the Haskell notation for lists is generally more precise than the mathematical notation for sequences. (Recall the function foldr from the course on Functional Programming. a set of characters l = {a. t. i. there is a clear diﬀerence between a and [a]. and letters from the end of the alphabet to represent sequences.. e. s. Language. i. p. c. Concatenation of lists is handled by the operator (+ whereas a single element can be added to the front +). n. {ε} and T are languages over alphabet T . c. ∅ (the empty set).. We start with some terminology: Deﬁnition 2. We typically use ε only when we want to explicitly emphasize that we are talking about the empty sequence. of a list using (:). e. Some examples of alphabets are: • • • • • the conventional Roman alphabet: {a. x}. az should be understood as the symbol a followed by the sequence z . m. In Haskell.1.2 (Alphabet. z}. Sentence). exercise. l. r. so ab in Haskell is to be understood as a single identiﬁer with name ab. Elements are distinguished from lists by their type. such deﬁnitions often follow a recursion pattern which is similar to the deﬁnition of the structure itself. . . then. b. a set of English words {course. • An alphabet is a ﬁnite set of symbols. for some alphabet T . else}. sets of reserved words {if. Now we move from individual sequences to ﬁnite or inﬁnite sets of sequences. Note that ‘word’ and ‘sentence’ in formal languages are used as synonyms. u. all these distinctions are explicit. d. When talking about languages and grammars. whereas xy is the concatenation of sequences x and y. • A language is a subset of T ∗ . 1}. We use letters from the beginning of the alphabet to represent single symbols. exam}. Furthermore. Haskell identiﬁers often have longer names. not as a combination of two symbols a and b. . Languages Functions that operate on an inductively deﬁned structure such as sequences are typically structurally recursive. k. the binary alphabet {0. depending on context. Also. o. alphabet language sentence Examples of languages are: • T ∗ . i. . • A sentence (often also called word ) is an element of a language. w. b. We write a to denote both the single symbol a or the sequence aε.2. practical. we often leave the distinction between single symbols and sequences implicit. we denote concatenation of sequences and symbols in the same way. 9 .

/ In addition to these set operators. s = s R }. A famous example of a language which is easily deﬁned in a predicative way. . it provides us with a way to enumerate all elements of a language. Let L and M be languages over the same alphabet T . so {1. . . 2. c}: {s | s ∈ {a. . has a speciﬁc property P . Since a language is a set we immediately see three diﬀerent approaches: • enumerate all the elements of the set explicitly. i. there are more speciﬁc operators. Chapter 8. which apply only to sets of sequences. where s R denotes the reverse of sequence s. P (x ). exercise. Context-Free Grammars • the set {course. exam} is a language over the alphabet l of characters and exam is a sentence in it. Deﬁnition 2. n mod 2 = 0}. Since languages are sets the usual set operators such as union. Note that ∪ denotes set union. 1. If this property is of the form P (L) = ∀x ∈ L . . One of the fundamental diﬀerences between the predicative and the inductive approach to deﬁning a language is that the latter approach is constructive. The question that now arises is how to specify a language. e. x ∈ L}. 3} = {1.3 (Language operations). 9}∗ . b. sequences which read the same forward as backward. practical. t ∈ M } L0 = {ε} Ln+1 = LLn L∗ = i∈N Li = L0 ∪ L1 ∪ L2 ∪ . b. is the set of prime numbers. complement of L reverse of L concatenation of L and M 0th power of L n + 1st power of L star-closure of L positive closure of L 10 .. • the language PAL of palindromes. • deﬁne which elements belong to the set by means of induction. We have just seen some examples of the ﬁrst (the Roman alphabet) and third (the set of sequences over an alphabet) approach. but for which the membership test is very hard. then L = T∗ − L LR = {s R | s ∈ L} LM = {st | s ∈ L. over the alphabet {a. .i>0 Li = L1 ∪ L2 ∪ . intersection and difference can be used to construct new languages from existing ones. . If we deﬁne a language by means of a predicate we only have a means to decide whether or not an element belongs to a language. The complement of a language L over alphabet T is deﬁned by L = {x | x ∈ T ∗ . • characterise the elements of the set by means of a predicate. We will use these operators mainly in the chapter on regular languages. . which is deﬁned by means of an inductive deﬁnition. then we want to prove that L ⊆ P . L+ = i∈N. 3}. Quite often we want to prove that a language L. c}∗ .2. 2} ∪ {1. Examples of the second approach are: • the even natural numbers {n | n ∈ {0.

and c. a. • the strings consisting of just one character. Which of the following strings are in L∗ : abaabaaabaa. For any language. aa.2. and that can describe a large class of languages. Grammars The following equations follow immediately from the above deﬁnitions. a kind of grammars that are convenient for automatic processing. and to prove properties of sets. ∅L = L∅ = ∅ 2. that is. it is hard to use in proofs and programs. where a and b are the terminals. • The empty string. This section will only discuss so-called context-free grammars.2. Let L={ab. b. called grammars. In the previous section we deﬁned PAL. But the class of languages that can be described by context-free grammars is limited.3).4.4 (Palindromes by induction). baa}. of sets. • if P is a palindrome. and c. What are the elements of ∅∗ ? Exercise 2.1. Deﬁnition 2. by means of a predicate. prove 1. are palindromes. 11 . Working with sets might be fun.2. b. baaaaabaa? Exercise 2.2.1) and one for languages (Deﬁnition 2. but it often is complicated to manipulate sets. aaaabaaaa. a. ε. Although this deﬁnition deﬁnes the language we want. to it are also palindromes. Is there a diﬀerence between these operators? 2. baaaaabaaaab. the strings aP a bP b cP c are palindromes. An important observation is the fact that the set of palindromes can be deﬁned inductively as follows. the language of palindromes. Grammars The goal of this section is to introduce the concept of context-free grammars. L∗ = {ε} ∪ LL∗ L+ = LL∗ Exercise 2. In this section we deﬁned two “star” operators: one for arbitrary sets (Deﬁnition 2. is a palindrome. For these purposes we introduce syntactical deﬁnitions.3. {ε}L = L{ε} = L Exercise 2. then the strings obtained by prepending and appending the same character.

called an axiom. Inductive deﬁnitions like the one above can be represented formally by making use of deduction rules which look like: a1 . and an are true. . then a is true. . Very often it is relatively easy to ﬁnd a deﬁnition that is sound. called a grammar . if a string consisting of a’s. Using these deduction rules we can now write down the inductive deﬁnition for PAL as follows: ε ∈ PAL a ∈ PAL b ∈ PAL c ∈ PAL aP a ∈ PAL bP b ∈ PAL cP c ∈ PAL P ∈ PAL P ∈ PAL P ∈ PAL grammar Although the deﬁnition of PAL is completely formal. The second kind of deduction rule. has to be read as follows: a is true. We can give a grammar of PAL by translating the deduction rules given above into production rules. we introduce a shorthand for it. The last part of the deﬁnition covers the inductive cases. Since in computer science we use many deﬁnitions which follow such a pattern.2. The rule with which the empty string is constructed is: 12 . . b’s. it is still laborious to write. . Conversely. and c’s reads the same forwards and backwards then it belongs to the language PAL. a2 . All strings which belong to the language PAL inductively deﬁned using the above deﬁnition read the same forwards and backwards. A grammar consists of production rules. Therefore this deﬁnition is said to be complete (every palindrome is in PAL). we can proceed by giving a formal representation of this inductive deﬁnition. Context-Free Grammars sound complete The ﬁrst two parts of the deﬁnition cover the basic cases. an a or a The ﬁrst kind of deduction rule has to be read as follows: if a1 . A typical method for proving soundness and completeness of an inductive deﬁnition is mathematical induction. . but you also have to convince yourself that the deﬁnition is complete. a2 . Therefore this deﬁnition is said to be sound (every string in PAL is a palindrome). . Finding an inductive deﬁnition for a language which is described by a predicate (like the one for palindromes) is often a nontrivial task. Now that we have an inductive deﬁnition for palindromes. .

and c are palindromes. • the set of nonterminals {P }. or nonterminal for short. where s is symbol and λα is a sequence of symbols. and gives rise to a production: P →a P →b P →c These production rules correspond to the axioms that state that the one element strings a. then we obtain a new palindrome by prepending and appending an a. and c is a palindrome. we obtain the following grammar for PAL: P →ε P →a P →b terminal production rule nonterminal start symbol 13 . that is. then one can deduce that aP a. We complete the description of the grammar by adding a fourth component: the nonterminal start symbol P . The grammar we have presented so far consists of three components: • the set of terminals {a. or c to it. b. The symbol P to the left of the arrow is a symbol which denotes palindromes. aαa. c}. b. Nonterminal symbols are also called auxiliary symbols: their only purpose is to denote structure. and cαc are also palindromes. Note that the intersection of the set of terminals and the set of nonterminals is empty. they are not part of the alphabet of the language.2. b. If a string α is a palindrome. if P is a palindrome. To summarize. but a grammar may have many nonterminal symbols. bP b and cP c are also palindromes. A production rule can be considered as a possible way to rewrite the symbol s. and we always have to select one to start with. To obtain these palindromes we use the following recursive productions: P → aP a P → bP b P → cP c These production rules correspond to the deduction rules that state that. Each of the one element strings a. or production for short. • and the set of productions (the seven productions that we have introduced so far). A rule of the form s → α. In this case we have only one choice for a start symbol. Such a symbol is an example of a nonterminal symbol . b. Grammars P →ε This rule corresponds to the axiom that states that the empty string ε is a palindrome. is called a production rule.2. bαb. Three other basic production rules are the rules for constructing palindromes consisting of just one character.

2. {P }. Notational conventions In the deﬁnition of the grammar for PAL we have written every production on a single line. The standard example here is {an bn cn | n ∈ N}. 14 . N . Not every language can be described via a context-free grammar. Each production has the form A → α. S ∈ N . The adjective “context-free” in the above deﬁnition comes from the speciﬁc production rules that are considered: exactly one nonterminal on the left hand side. and the set of nonterminals. A context-free grammar G is a four-tuple (T . c}. R. b. Also the start-symbol is implicitly deﬁned here since there is only one nonterminal. where A is a nonterminal and α is a sequence of terminals and nonterminals. Since this takes up a lot of space. We conclude this example with the formal deﬁnition of a context-free grammar.1. we introduce the following shorthand. • S is the start symbol. context-free grammar Deﬁnition 2.5 (Context-Free Grammar). and since the production rules form the heart of every grammar. so the grammar for PAL may also be written as follows: P → ε | a | b | c | aP a | bP b | cP c BNF The notation we use for grammars is known as BNF – Backus Naur Form – after Backus and Naur. {a. who ﬁrst used this notation for deﬁning grammars.2. • N is a ﬁnite set of nonterminal symbols (T and N are disjunct). 2. • R is a ﬁnite set of production rules. S ) where • T is a ﬁnite set of terminal symbols. Context-Free Grammars P P P P →c → aP a → bP b → cP c The deﬁnition of the set of terminals. We will encounter this example again later in these lecture notes. is often implicit. Instead of writing S →α S →β we combine the two productions for S in one line as using the symbol |: S →α|β We may rewrite any number of rewrite rules for one nonterminal in this fashion.

Give a context free grammar for the set of sentences over alphabet X where 1. So.6.5.2. Give a context free grammar for the language L = {a n b n | n ∈ N} Exercise 2. A parity sequence is a sequence consisting of 0’s and 1’s that has an even number of ones. The language of a grammar The goal of this section is to describe the relation between grammars and languages: to show how to derive sentences of a language. b}. Exercise 2. Exercise 2. Now we consider the reverse question: how to obtain a language from a given grammar? Before we can answer this question we ﬁrst have to say what we can do with a grammar. X = {a. The language of a grammar Another notational convention concerns names of productions. Give a grammar for the language L = {s s R | s ∈ {a. 15 . w )} where #(c. the startsymbol is usually the nonterminal in the left hand side of the ﬁrst production. we have demonstrated in several examples how to construct a grammar for a particular language. The names will be written in front of the production.3. Sometimes we want to give names to production rules.7. Give a grammar for parity sequences.10. b} ∧ #(a. 2. ∗ ∗ 2. and the start-symbol is usually called S . b}. b} } This language is known as the language of mirror palindromes. given its grammar. Exercise 2. X = {a}. Give a grammar for palindromes over the alphabet {a. Exercise 2. for example.8. The answer is simple: we can derive sequences with it.3. w ) is the number of c-occurrences in w . In the previous section. Give a grammar for the language L = {w | w ∈ {a. w ) = #(b.9. if we give a context-free grammar just by means of its productions. Exercise 2. Alpha rule: S → α Beta rule: S → β Finally.

written ϕ0 ⇒∗ ϕn . in the last step production P → c is used to rewrite baP ab into bacab. . . We will now describe derivation steps more formally. A derivation is only one branch of a whole search tree which contains many more branches. which is obtained by replacing the left hand side X of the production by the corresponding right hand side β. in our case the characters a.6 (Derivation). if there exist sequences ϕ0 . For example. ϕn such that ∀i . In the ﬁrst step of this derivation production P → bP b is used to rewrite P into bP b. 0 i <n : ϕi ⇒ ϕi+1 direct derivation derivation If n = 0. Finding a derivation ϕ0 ⇒∗ ϕn is. Another important challenge is to arrange things in such a way that ﬁnding a derivation can be done in an eﬃcient way. Constructing a derivation can be seen as a constructive proof that the string bacab is a palindrome. . In the second step production P → aP a is used to rewrite bP b into baP ab. 16 . Finally. where X is a nonterminal symbol and β is a sequence of (nonterminal or terminal) symbols. We write αX γ ⇒ αβγ and we also say that αX γ rewrites to αβγ in one step. in general. this statement is trivially true. We say that αX γ directly derives the sequence αβγ. and it follows that we can derive each sentence ϕ from itself in zero steps: ϕ ⇒∗ ϕ partial derivation A partial derivation is a derivation of a sequence that still contains nonterminals. a nontrivial task. Context-Free Grammars How do we construct a palindrome? A palindrome is a sequence of terminals. Suppose X → β is a production of a grammar. . the sequence bacab can be derived using the grammar for palindromes as follows: P ⇒ bP b ⇒ baP ab ⇒ bacab derivation Such a construction is called a derivation. that can be derived in zero or more direct derivation steps from the start symbol P using the productions of the grammar for palindromes given before. A sequence ϕn is derived from a sequence ϕ0 . b and c. Let αX γ be a sequence of (nonterminal or terminal) symbols.2. Each branch represents a (successful or unsuccessful) direction in which a possible derivation may proceed. Deﬁnition 2.

the set of all palindromes over the alphabet {a. In this subsection we will deﬁne some grammars that specify some basic languages such as digits and letters. To phrase it more mathematically: the mapping between a grammar and its language is not a bijection. is deﬁned as L(G) = {s | S ⇒∗ s. Two grammars that generate the same language are called equivalent. grammar equivalent context-free language 2. For example. we say that the string bacab belongs to the language generated by the grammar for palindromes. we obtain a grammar with exactly the same language as PAL.3.8 (Context-free language). The language of a grammar From the example derivation above it follows that P ⇒∗ bacab Because this derivation begins with the start symbol of the grammar and results in a sequence consisting of terminals only (a terminal string). if we extend the grammar for PAL with the production P → bacab. and PAL is context-free.1. Thus. S ). A context-free language is a language that is generated by a context-free grammar. s ∈ T ∗ } The language L(G) is also called the language generated by the grammar G. All palindromes can be derived from the start symbol P .7 (Language of a grammar).2. the language of our grammar for palindromes is PAL. We sometimes also talk about the language of a nonterminal A. Examples of basic languages Digits occur in a several programming languages and other languages. c}. N . R. These grammars will be used frequently in later sections. The language of a grammar G=(T . we deﬁne usually denoted by L(G).3. b. language of a Deﬁnition 2. and so do letters. s ∈ T ∗ } Note that diﬀerent grammars may have the same language. but the reverse is not true: given a language we can construct many grammars that generate the language. Deﬁnition 2. So for a particular grammar there exists a unique language. • The language of single digits is speciﬁed by a grammar with ten production rules for the nonterminal Dig. In general. Dig → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 17 . which is deﬁned by L(A) = {s | A ⇒∗ s.

| z CLetter → A | B | . So in order to specify natural numbers. it is supposed to be a positive number. The following grammar for identiﬁers might be used in a programming language: Identiﬁer → Letter AlphaNums AlphaNums → ε | Letter AlphaNums | Dig AlphaNums An identiﬁer starts with a letter. of course. Context-Free Grammars • We obtain sequences of digits by means of the following grammar: Digs → ε | Dig Digs • Natural numbers are sequences of digits that start with a non-zero digit. . A letter is now either a small or a capital letter. datatypes. Dig-0 → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Now we can deﬁne the language of natural numbers as follows. . i. If a natural number is not preceded by a sign. Letter → SLetter | CLetter • Variable names. we ﬁrst deﬁne the language of non-zero digits. | Z In the real deﬁnitions of these grammars we have to write each of the 26 letters.2.. e. Nat → 0 | Dig-0 Digs • Integers are natural numbers preceded by a sign. but then we have to adjust the grammar. • Dutch zip codes consist of four digits. etc. followed by two capital letters. So ZipCode → Dig-0 Dig Dig Dig CLetter CLetter 18 . of course. are all represented by identiﬁers in programming languages. We might want to allow more symbols. and is followed by a sequence of alphanumeric characters. such as for example underscores and dollar symbols. of which the ﬁrst digit is non-zero. .. . function names. Sign → + | Int → Sign Nat | Nat • The languages of small letters and capital letters are each speciﬁed by a grammar with 26 productions: SLetter → a | b | . letters and digits.

then these derivations are considered to be diﬀerent. then the derivations are considered to be equivalent. a derivation that contains nonterminals in its right hand side. What language does the grammar with the following productions generate? S → Aa A→B B → Aa Exercise 2. i.14.1 context free ? 2. Parse trees Exercise 2. there may be diﬀerent derivations for the same sentence. A terminal string that belongs to the language of a grammar is always derived in one or more steps from the start symbol of the grammar. and to show how parse trees relate to derivations. if only the order in which the derivation steps are chosen diﬀers between two derivations. diﬀerent derivation steps have been chosen in two derivations.. Give a simple description of the language generated by the grammar with productions S → aA A → bS S →ε Exercise 2. this section deﬁnes (non)ambiguous grammars. Consider the grammar SequenceOfS with productions: S → SS S →s Using this grammar. Here is a simple example.13. Furthermore. however. Why? Exercise 2. However.15.4. e.2.12. there may be several productions of the grammar that can be used to proceed the partial derivation with. What language is generated by the grammar with the single production rule S →ε Exercise 2. Parse trees The goal of this section is to introduce parse trees. Is the language L deﬁned in Exercise 2.11. As a consequence. If. For any partial derivation.4. we can derive the sentence sss in at least the following two ways (the nonterminal that is rewritten in each step appears underlined): S ⇒ S S ⇒ S S S ⇒ S sS ⇒ ssS ⇒ sss 19 .

2. The internal nodes of a parse tree are labelled with a nonterminal N . The resulting parse tree of the ﬁrst two derivations of the sentence sss looks as follows: S S s S s S S s The third derivation of the sentence sss results in a diﬀerent parse tree: S S S s S s S s 20 . The parse tree of a terminal symbol is a leaf labelled with the terminal symbol. however. A good candidate for such a canonical element is the leftmost derivation. However. and the children of such a node are the parse trees for symbols of the right hand side of a production for N . the following derivation does not use the same derivation steps: S ⇒ S S ⇒ S SS ⇒ sS S ⇒ ssS ⇒ sss In both derivation sequences above. A parse tree is a representation of a derivation which abstracts from the order in which derivation steps are chosen. Context-Free Grammars S ⇒ S S ⇒ sS ⇒ sS S ⇒ sS s ⇒ sss These derivations are the same up to the order in which derivation steps are taken. In this derivation. In a leftmost derivation. The leftmost derivation corresponding to these two sequences above is S ⇒ S S ⇒ sS ⇒ sS S ⇒ ssS ⇒ sss leftmost derivation parse tree There exists another convenient way for representing equivalent derivations: they all correspond to the same parse tree (or derivation tree). the ﬁrst S was rewritten to s. then there exists also a leftmost derivation of x . the leftmost nonterminal is rewritten in each step. If there exists a derivation of a sentence x using the productions of a grammar. the ﬁrst S is rewritten to SS . however. are both not leftmost. The set of all equivalent derivations can be represented by selecting a so-called canonical element. The last of the three derivation sequences for the sentence sss given above is a leftmost derivation. The two equivalent derivation sequences before.

9 (ambiguous grammar. with diﬀerent meanings: ‘They – are – ﬂying planes’. It is usually rather diﬃcult to translate languages with ambiguous grammars.. it certainly is considered a disadvantage for language translators. Therefore. Otherwise it is called ambiguous. A grammar is unambiguous if every sentence has a unique leftmost derivation. all derivations of the string abba using the productions of the grammar P P P A B →ε → APA → BPB →a →b are represented by the following derivation tree: P A a B b P P ε B b A a A derivation tree can be seen as a structural interpretation of the derived sentence. equivalently.2. It is in general undecidable whether or not an arbitrary context-free grammar is ambiguous. Grammars have proved very successful in the speciﬁcation of artiﬁcial languages (such as programming languages). The grammar SequenceOfS for constructing sequences of s’s is an example of an ambiguous grammar. g. and ‘They – are ﬂying – planes’. This implies that is impossible to write a program that determines for an arbitrary context-free grammar whether it is ambiguous or not. They have proved less successful in the speciﬁcation of natural languages (such as English). unambiguous grammar). Note that there might be more than one structural interpretation of a sentence with respect to a given grammar.4. Deﬁnition 2. since there exist two parse trees for the sentence sss. Such grammars are called ambiguous. politicians). While ambiguity of natural languages may perhaps be considered as an advantage for their users (e. Take for example the sentence ‘They are ﬂying planes’. if every sentence has a unique derivation tree. or. This sentence can be read in two ways. because it is usually impossible to maintain an ambiguous meaning in a translation. Parse trees As another example. partly because is extremely diﬃcult to construct an unambiguous grammar that speciﬁes a nontrivial part of the language. you will ﬁnd that most grammars of programming languages and other languages that are used in processing information are unambiguous. unambiguous ambiguous 21 .

Properties are particularly interesting in combination with grammar transformations. Introduction of priorities. Context-Free Grammars 2. we describe a number of grammar transformations: • • • • • • • Removing duplicate productions. Other properties are used to prove facts about grammars. Grammar transformations In this section. Removing unreachable productions. Then we can use this program for building parse trees for any grammar. Removing left recursion. Substituting right hand sides for nonterminals. Yet other properties are used to eﬃciently compute certain information from parse trees of a grammar. that we have a program that builds parse trees for sentences of grammars in Chomsky normal form. Suppose. that is.2. Left factoring. we will ﬁrst look at a number of properties of grammars. Such a grammar is said to be in Chomsky normal form. • a grammar may have the property that every production either has a single terminal. but that possibly satisfy diﬀerent properties. for example. every sentence of its language has a unique parse tree. Some other properties imply that we can build these parse trees very fast. A grammar transformationis a procedure to obtain a grammar G from a grammar G such that L(G ) = L(G). We then will show how we can systematically transform grammars into grammars that describe the same language and satisfy particular properties. In the following. So why are we interested in such properties? Some of these properties imply that it is possible to build parse trees for sentences of the language of the grammar in only one way. or two nonterminals in its right hand side. Since it is sometimes convenient to have a grammar that satisﬁes a particular property for a language. no other nonterminal can derive the empty string. we would like to be able to transform grammars into other grammars that generate the same language. Here are some examples of properties that are worth considering for a particular grammar: • a grammar may be unambiguous. Associative separator. grammar transformation 22 . • a grammar may have the property that only the start symbol can derive the empty string. and that we can prove that every grammar can be transformed in a grammar in Chomsky normal form.5.

y and z denote sequences of terminals and nonterminals. x .2. are elements of (N ∪ T )∗ . e. For example. Removing duplicate productions This grammar transformation is a transformation that can be applied to any grammar of the correct form. w . v . w . Removing unreachable productions Consider the result of the transformation above: A → uxv | uwv | z B →x |w If A is the start symbol and B does not occur in any of the symbol sequences u.5. we will only show a small but useful set of grammar transformations. x . Grammar transformations There are many more transformations than we describe here. In such a case. T is the set of terminals.5. we can substitute B in the right hand side of the ﬁrst production in the following grammar: A → uBv | z B →x |w The resulting grammar is: A → uxv | uwv | z B →x |w 2. we will assume that N is the set of nonterminals. in which X has been replaced by its right hand sides. If a grammar contains two occurrences of the same production rule. one of these occurrences can be removed.5.3. Substituting right hand sides for nonterminals If a nonterminal X occurs in a right hand side of a production. For example. z . v . In the following. the production may be replaced by just as many productions as there exist productions for X .1. and that u.2. the unreachable production can be dropped: A → uxv | uwv | z 23 .5. then the second production can never occur in the derivation of a sentence starting from A. 2.. A→u |u |v can be transformed into A→u |v 2. i.

. The following transformation removes left-recursive productions. replacing the part of the sequence after the common start sequence.5. | ym where none of the symbol sequences y1 . that ends with a new nonterminal. . We can thus factorize the productions for A as follows: A → Ax1 | Ax2 | . . As we will see in Chapter 3. . Left-recursive grammars are sometimes undesirable – we will.e. and replace the productions for A by: 24 . Two productions for the new nonterminal are added: one for each of the two diﬀerent end sequences of the two productions.2. To remove the left-recursive productions of a nonterminal A. These two productions can then be replaced by a single production. . and left factoring the grammar before constructing a parser can dramatically improve the performance of the resulting parser. Context-Free Grammars 2. left recursion can be removed by transforming the grammar. For example: A → xy | xz | v may be transformed into A → xZ | v Z →y |z where Z is a new nonterminal. parsers can be constructed systematically from a grammar. . . | Axn A → y1 | y2 | . if we can derive Az from A in one or more steps). we divide the productions for A in sets of left-recursive and non left-recursive productions. . the production A → Az is left-recursive. see in Chapter 3 that a parser constructed systematically from a left-recursive grammar may loop.5. for instance. For example. We now add a new nonterminal Z .5. Fortunately. Removing left recursion left recursion A production is called left-recursive if the right-hand side starts with the nonterminal of the left-hand side. Left factoring left factoring Left factoring is a grammar transformation that is applicable when two productions for the same nonterminal start with the same sequence of (terminal and/or nonterminal) symbols. ym starts with A.4.. A grammar is left-recursive if we can derive A => + Az for some nonterminal A of the grammar (i. 2.

The second is not.6.5. we may use the following unambiguous grammar for generating a language of declarations: Decls → Decl . which generates a single declaration.4: S → SS S →s The ﬁrst production is left-recursive. An example is A → Bx B → Ay None of the two productions is left recursive.5. namely the grammar SequenceOfS that we have introduced in Section 2. Removing left recursion in an indirectly left-recursive grammar is also possible. the meaning of d1 . Grammars can also be indirectly left recursive. e. | xn | xn Z Note that this procedure only works for a grammar that is directly left-recursive. and d3 . i. is an associative separator in the generated language. have been omitted. but we can still derive A ⇒∗ Ayx .’: Decls → Decls .. separated by a semicolon ‘. Therefore. d2 ) . d2 . it does not matter how we group the declarations. a grammar that contains a left-recursive production of the form A → Ax . Grammar transformations A → y1 | y1 Z | .2. Here is an example of how to apply the procedure described above to a grammar that is directly left recursive. but a bit more complicated [1]. . Decls Decls → Decl The productions for Decl . and obtain the following productions: S → s | sZ Z → S | SZ 2. d3 ) and (d1 . | ym | ym Z Z → x1 | x1 Z | . . for the same reason as SequenceOfS is ambiguous. d3 is the same. given three declarations d1 . that is. . (d2 . Associative separator The following grammar fragment generates a list of declarations. Decls Decls → Decl 25 . This grammar is ambiguous. . We can thus directly apply the procedure for left recursion removal. The operator .

Introduction of priorities Another form of ambiguity often arises in the part of a grammar for a programming language which describes expressions. the transformed grammar is also no longer left-recursive.2. This grammar transformation is often useful for expressions which are separated by associative operators.7. the following grammar generates arithmetic expressions: E E E E →E +E →E *E →(E ) → Digs where Digs generates a list of digits as described in Section 2. An alternative unambiguous (but still left-recursive) grammar for the same language is Decls → Decls . In order to obtain parse trees that respect these priorities. and one corresponding to 2+(4*6). we transform the grammar as follows: E →T E →E +T T →F T →T *F F →(E ) F → Digs This grammar generates the same language as the previous grammar for expressions. such as for example natural numbers and addition.5. like the semicolon in this case. 2. However. Context-Free Grammars Note that in this case. For example. then removing the ambiguity in favour of a particular nesting is dangerous. applying the transformation is ﬁne. the sentence 2+4*6 has two parse trees: one corresponding to (2+4)*6. This grammar is ambiguous: for example. If we make the usual assumption that * has higher priority than +. Decl Decls → Decl The grammar transformation just described should be handled with care: if the separator is associative in the generated language.3. the latter expression is the intended reading of the sentence 2+4*6. 26 . if the separator is not associative.1. but it respects the priorities of the operators.

In addition to the above productions.5. Then. there should also be a production for expressions of the highest priority. Since the proofs are too complicated at this point. Grammar transformations In practice. we can abbreviate the grammar by using parameterised nonterminals. Let G be a grammar with terminal set {if. left factoring). Proofs can be found in any of the theoretical books on language and parsing theory [11]. Exercise 2. they are omitted. else. but the ones given in this section suﬃce for now.5. often more than two levels of priority are used. For 1 i < n. How does Java prevent this dangling else problem? 27 . Discussion We have presented several examples of grammar transformations. we can replace it by ‘right’. The standard example of ambiguity in programming languages is the dangling else. A grammar transformation transforms a grammar into another grammar that generates the same language. Give two parse trees for the sentence if b then if b then a else a. then. Choose the transformation such that the resulting grammar is also no longer left-recursive.16. For each of the above transformations we should therefore prove that the generated language remains the same. b. we get productions Ei → Ei+1 Ei → Ei Op i Ei+1 The nonterminal Op i is parameterised and generates operators of priority i . 2. a} and the following productions: S → if b then S else S S → if b then S S →a 1. There exist many other grammar transformations.17. and obtain a dual grammar transformation. 3.8. instead of writing a large number of nearly identically formed production rules. Exercise 2. Consider the following ambiguous grammar with start symbol A: A → AaA A→b|c Transform the grammar by applying the rule for associative separators. Note that everywhere we use ‘left’ (left-recursion. Give an unambiguous grammar that generates the same language as G. for example: En → ( E1 ) | Digs 2. We will discuss a larger example of a grammar transformation after the following section.2.

Consider the follwing grammar with start symbol S : S → AB A → ε | aaA B → ε | Bb 1. As an example we take the grammar SequenceOfS: S → SS S →s First.L B →0|1 Remove the left recursion from this grammar. we introduce the notion of abstract syntax. Give an equivalent non left recursive grammar. using the names of the productions as constructors: data S = Beside S S | Single Note that the nonterminals on the right hand side of Beside reappear as arguments of the constructor Beside. For each context-free grammar we can deﬁne a corresponding datatype in Haskell. A bit list is a nonempty list of bits separated by commas. and show how to obtain an abstract syntax from a concrete syntax. To this end. Concrete and abstract syntax In this section.18. What language does this grammar generate? 2. the terminal symbol s in the production Single is omitted in the deﬁnition of the constructor Single.19. On the other hand. Values of these datatypes represent parse trees of the context-free grammar. we establish a connection between context-free grammars and Haskell datatypes. Context-Free Grammars Exercise 2. One may be tempted to make the following deﬁnition instead: 28 .2. A grammar for bit lists is given by L →B L →L. 2. Exercise 2.6. we give each of the productions of this grammar a name: Beside: S → SS Single: S → s Now we interpret the start symbol of the grammar S as a datatype.

2. Concrete and abstract syntax data S = Beside S S | Single Char — too general However. For example. On the other hand. The datatype S is an example of an abstract syntax for the language of SequenceOfS. this datatype is too general for the given grammar. but we know that this character always has to be s. A concrete syntax of a language describes the appearance of the sentences of a language. The adjective ‘abstract’ indicates that values of the abstract syntax do not need to explicitly contain all information about particular sentences. Since there is no choice anyway. and the ﬁrst datatype S serves the purpose of encoding the parse trees of the grammar SequenceOfS just ﬁne. the meaning of a more abstract representation is expressed in terms of a concrete representation. So the concrete syntax of the language of nonterminal S is given by the grammar SequenceOfS. A function such as sToString is often called a semantic function. as for example by applying function sToString. as long as that information is recoverable. Semantic functions are used to give semantics (meaning) to values. Using the removing left recursion grammar transformation. Parse trees are therefore often also called abstract syntax trees. the parse tree that corresponds to the ﬁrst two derivations of the sequence sss is represented by the following value of the datatype S : Beside Single (Beside Single Single) The third derivation of the sentence sss produces the following parse tree: Beside (Beside Single Single) Single To emphasize that these representations contain suﬃcient information in order to reproduce the original strings. there is no extra value in storing that s. the grammar SequenceOfS can be transformed into the grammar with the following productions: concrete syntax abstract syntax semantic function 29 . we can write a function that performs this conversion: sToString :: S → String sToString (Beside l r ) = sToString l + sToString r + sToString Single = "s" Applying the function sToString to either Beside Single (Beside Single Single) or Beside (Beside Single Single) Single yields the string "sss". In this example. A semantic function is a function that is deﬁned on an abstract syntax of a language. An argument of type Char can be instantiated to any single character. without the need to refer to concrete terminal symbols. an abstract syntax of a language describes the shapes of parse trees of the language.6.

we have introduced a corresponding datatype.7. the sequence sss is represented by the parse tree Size 3. The nonterminals on the right hand side of the production rules appear as arguments of the constructors. Exercise 2. b} and a Haskell datatype describing the abstract syntax of such palindromes. Context-Free Grammars S → sZ | s Z → SZ | S An abstract syntax of this grammar may be given by data SA = ConsS Z | SingleS data Z = ConsZ SA Z | SingleZ SA For each nonterminal in the original grammar.22. we have introduced a constructor and invented a name for the constructor.20. Give parse trees for the palindromes pal 1 = "abaaba" and pal 2 = "baaab". we can conclude that the only important information about sequences of s’s is how many occurrences of s there are. So the ultimate abstract syntax for SequenceOfS is data SS = Size Int Using the abstract syntax SS .2.21. 30 .e. but the terminals disappear. Consider the grammar for palindromes that you have constructed in Exercise 2. The following Haskell datatype represents a limited form of arithmetic expressions data Expr = Add Expr Expr | Mul Expr Expr | Con Int Give a grammar for a suitable concrete syntax corresponding to this datatype. The choice of an abstract syntax over another should therefore be determined by the demands of the application. by what we ultimately want to compute.. because that information can be recovered by knowing which constructors have been used. If we look at the two diﬀerent grammars for SequenceOfS. and we can still recover the original string from a value SS by means of a semantic function: ssToString :: SS → String ssToString (Size n) = replicate n ’s’ The SequenceOfS example shows that one may choose between many diﬀerent abstract syntaxes for a given grammar. Exercise 2.7 and 2. i.21 where we have given a grammar for palindromes over the alphabet {a. Consider your answers to Exercises 2. Exercise 2. Deﬁne a datatype Pal corresponding to the grammar and represent the parse trees for pal 1 and pal 2 as values of Pal . and the two abstract syntaxes. For each production for a particular nonterminal (expanding all the alternatives into separate productions).

Consider your answer to Exercise 2. Exercise 2. Test your function with the abstract bit lists aBitList 1 and aBitList 2 .1. 2.1". Write a function that transforms an abstract representation of a mirror palindrome into the corresponding abstract representation of a palindrome (using the datatype from Exercise 2. Write a semantic function that transforms an abstract representation of a mirror palindrome into a concrete one. Give the two abstract parity sequences aEven 1 and aEven 2 that correspond to the concrete parity sequences cEven 1 = "00101" and cEven 2 = "01010". Write a semantic function that counts the number of a’s occurring in a palindrome. for example for programming languages. introduced in Section 2.7. Test your function with the abstract bit lists aBitList 1 and aBitList 2 . Deﬁne a datatype BitList that describes the abstract syntax corresponding to your grammar. which describes the concrete syntax for parity sequences.0. which describes the concrete syntax for mirror palindromes. Write a semantic function that transforms an abstract representation of a bit list into a concrete one. Write a function that concatenates two abstract representations of a bit lists into a bit list.21. was ﬁrst used in the early sixties when the programming language ALGOL 60 was deﬁned and until now it is the 31 . Consider your answer to Exercise 2.24. 1. Test your function with the abstract mirror palindromes aMir 1 and aMir 2 .2.1.0" and cBitList 2 = "0.9. Exercise 2. Exercise 2.21).21. 3.25. which describes the concrete syntax for bit lists by means of a grammar that is not left-recursive. Test your function with the palindromes pal 1 and pal 2 from Exercise 2. 2.2. Give the two abstract mirror palindromes aMir 1 and aMir 2 that correspond to the concrete mirror palindromes cMir 1 = "abaaba" and cMir 2 = "abbbba". The BNF notation. 3.8. 2. Consider your anwer to Exercise 2. Deﬁne a datatype Parity describing the abstract syntax corresponding to your grammar.23. Give the two abstract bit-lists aBitList 1 and aBitList 2 that correspond to the concrete bit-lists cBitList 1 = "0. Test your function with the abstract parity sequences aEven 1 and aEven 2 .7. Write a semantic function that transforms an abstract representation of a palindrome into a concrete one.18. 1. Write a semantic function that transforms an abstract representation of a parity sequence into a concrete one. Test your function with the abstract mirror palindromes aMir 1 and aMir 2 . Furthermore. 2. Constructions on grammars This section introduces some constructions on grammars that are useful when specifying larger grammars. Constructions on grammars 1. Deﬁne a datatype Mir that describes the abstract syntax corresponding to your grammar. Test your function with the palindromes pal 1 and pal 2 from Exercise 2. it gives an example of a larger grammar that is transformed in several steps. 2. 1.

‘+’ and ‘∗’. EBNF .3. N = NL ∪ {S } and R = RL ∪ {S → SL . • the language L+ is generated by the grammar (T . N = NL ∪ NM ∪ {S } and R = RL ∪ RM ∪ {S → SL SM }. abbreviated P + . The same notation can be used for languages. N . NL . • one or more occurrences of nonterminal P .for instance.. SM ). R. N = NL ∪ NM ∪ {S } and R = RL ∪ RM ∪ {S → SL . RL . e. N . N . say GL = (T . and {P } for P ∗ . we have deﬁned a number of operations on languages using operations on sets. R. In some texts. Context-Free Grammars standard way of deﬁning the syntax of programming languages (see. You may object that the Java grammar contains more “syntactical sugar” than the grammars that we considered thus far (and to be honest. • the language L M is generated by the grammar (T . and sequences of terminal and nonterminal symbols instead of just single nonterminals. We could easily express these constructions by adding additional nonterminals. R. Theorem 2. • and zero or more occurrences of nonterminal P . S ) where S is a fresh nonterminal. abbreviated P ?. grammars. but that decreases the readability of the grammar. sets of sentences) have a direct counterpart at the level of grammatical descriptions. N = NL ∪ {S } and R = RL ∪ {S → ε. R. One approach is to decompose the language and to ﬁnd grammars for each of its constituent parts. Designing a grammar for a speciﬁc language may not be a trivial task. In Deﬁnition 2. Suppose we have grammars for the languages L and M . In this section. A straightforward question to ask is now: can we also deﬁne languages 32 . Then • the language L ∪ M is generated by the grammar (T . The theorem above establishes that the set-theoretic operations at the level of languages (i. S ) where S is a fresh nonterminal. • the language L∗ is generated by the grammar (T . EBNF This extended BNF notation. The notation for the EBNF constructions is not entirely standardized.10 (Language operations). We introduced grammars as an alternative for the description of languages. the Java Language Grammar). N . RM . We now show that these operations can be expressed in terms of operations on context-free grammars. this also holds for the ALGOL 60 grammar): one encounters nonterminals with postﬁxes ‘?’. S → SM }. NM . you will for instance ﬁnd the notation [P ] instead of P ?. S → SL S }. is introduced in order to help abbreviate a number of standard constructions that usually occur quite often in the syntax of a programming language: • one or zero occurrences of nonterminal P . we deﬁne the meaning of these constructs. S → SL S }. We assume that the nonterminal sets NL and NM are disjoint. abbreviated P ∗ .2. S ) where S is a fresh nonterminal. SL ) and GM = (T . S ) where S is a fresh nonterminal.

the answer is negative – there are no operations on grammars that correspond to the language intersection and diﬀerence operators. Deﬁnition 2.1) 33 . Constructions on grammars as the diﬀerence between two languages or as the intersection of two languages. P ∗ denotes zero or more concatenations of string P . we add a new grammar construction for an “optional grammar”. Two of the above constructions are important enough to actually deﬁne them as grammar operations. and translate these operations to operations on grammars? Unfortunately.7. S → S S }. R ∪ {S → S . then L(P ∗ ) L(P + ) L(P ?) = L(Z ) with = L(Z ) with = L(Z ) with Z → ε | PZ Z → P | PZ Z →ε |P where Z is a new nonterminal in each deﬁnition. N ∪ {S }. S → S S }. For example. S ) G? = (T . S ) The deﬁnition of P ?.12 (EBNF for sequences). the operators ·∗ and ·+ can also be deﬁned symmetrically: L(P ∗ ) L(P + ) = L(Z ) with = L(Z ) with Z → ε | ZP Z → P | ZP Many variations are possible on this theme: L(P ∗ Q) = L(Z ) with or also L(P Q ∗ ) = L(Z ) with Z → P | ZQ (2. Let G = (T . R ∪ {S → ε. Let P be a sequence of nonterminals and terminals. N ∪ {S }.2. Then G ∗ = (T . Furthermore. Because the concatenation operator for sequences is associative.2) Z → Q | PZ (2. N ∪ {S }. Deﬁnition 2. and P ∗ for a sequence of symbols P is very similar to the deﬁnitions of the operations on grammars. so Dig ∗ denotes the language consisting of zero or more digits. R ∪ {S → ε. S → S }.11 (Grammar operations). P + . R. S ) be a context-free grammar and let S be a fresh nonterminal. S ) G + = (T . N .

we obtain: Expr → Expr 1 Expr → Expr 1 where Decls Expr 1 → Expr 2 Expr 1 → if Expr 1 then Expr 1 else Expr 1 Expr 2 → Atomic Expr 2 → Expr 2 Atomic where Atomic and Decls have the same productions as before. and the semicolon in between the Decls in the second production for Decls are also terminal symbols. Application binds stronger than if.7. and Bool generate variables. The following ‘program’ is a sentence of this language: if true then funny true else false where funny = 7 It is clear that this is not a very convenient language to write programs in. Context-Free Grammars 2. Number . SL: an example To illustrate EBNF and some of the grammar transformations given in the previous section. The nonterminal Expr 2 is left-recursive.2. The following grammar generates expressions in a very small programming language. we give a larger example. Removing left recursion gives the following productions for Expr 2 : Expr 2 → Atomic | Atomic Expr 2 Expr 2 → Atomic | Atomic Expr 2 34 . and we introduce priorities to resolve some of the ambiguities. and both application and if bind stronger then where. Using the “introduction of priorities” grammar transformation. respectively. called SL. and boolean expressions. Expr Expr Expr Atomic Decls Decls Decl → if Expr then Expr else Expr → Expr where Decls → AppExpr → Var | Number | Bool | ( Expr ) → Decl → Decls . number expressions. Decls → Var = Expr AppExpr → AppExpr Atomic | Atomic where the nonterminals Var . The above grammar is ambiguous (why?).1. Note that the brackets around the Expr in the production for Atomic.

according to (2. the ·? operation applied to G). Exercise 2. is assumed to be associative. can you give a context-free grammar for this language? 35 .27. we can replace Expr 2 by Atomic + . Give grammars for L1 and L2 . and yields Expr → Expr 1 Expr 1 Expr 1 → ε | where Decls Since nonterminal Expr 1 generates either nothing or a where clause. Constructions on grammars Since the new nonterminal Expr 2 has exactly the same productions as Expr 2 . n ∈ N} L2 = {a m b n c n | m. Exercise 2. This transformation is applied to the productions for Expr .3.. i. and the separator . we can replace Expr 1 by an optional where clause in the production for Expr : Expr → Expr 1 (where Decls)? After all these grammar transformations. these productions can be replaced by Expr 2 → Atomic | Atomic Expr 2 So Expr 2 generates a nonempty sequence of atomics. The nonterminal Decls generates a nonempty list of declarations..e. Hence we can apply the “associative separator” transformation to obtain Decls → Decl | Decls . Give the EBNF notation for each of the basic languages deﬁned in Section 2. Decl )∗ Exercise 2. Expr Expr 1 Expr 1 Atomic Decls → Expr 1 (where Decls)? → Atomic + → if Expr 1 then Expr 1 else Expr 1 → Var | Number | Bool | ( Expr ) → Decl (. Decl )∗ The last grammar transformation we apply is “left factoring”.2. Is L1 ∩ L2 context-free. Decls → Decl (.1.2). Let L1 = {a m b m c n | m.7. we obtain the following grammar. Another source of ambiguity are the productions for Decls. Give the language that is generated by G? (i. e.28. Using the ·+ -notation introduced before. Let G be a grammar G. n ∈ N} 1. Decl or.26. 2.

and a code generator. In the second part of this course we will take a look at more complicated grammars. "7"] token parsing problem scanner lexer So a token is a syntactical entity. Until now we have only seen simple grammars for which it is relatively easy to determine whether or not a string is a sentence of the grammar. which do not always conform to the restrictions just referred to. a parser is preceded by a scanner (also called lexer ). Usually. One of the problems we have not referred to yet in this rather formal chapter is of a more practical nature. "where". introduction into the area of compiler construction. However. and actually impossible for all practical cases. the parsing problem answers the question whether or not s ∈ L(G). This raises another interesting question: What are the minimal changes that have 36 . If s ∈ L(G). For more complicated grammars this may be more diﬃcult. "false". and discusses some of the future topics of the course. For example. This question may not be easy to answer given an arbitrary grammar.2. "true". Given the grammar G and a string s. given the sentence if true then funny true else false where funny = 7 a scanner might return the following list of tokens: ["if". A scanner usually performs the ﬁrst step towards an abstract syntax: it throws away layout information such as spacing and newlines. Deﬁnition 2. By analysing the grammar we may nevertheless be able to generate parsers as well. Quite often the sentence presented to the parser will not be a sentence of the language since mistakes were made when typing the sentence. a type checker. "funny". but some of the concepts of scanners will sometimes be used. "true". a parser. which splits an input sentence into a list of so-called tokens. Such generated parsers will be in such a form that it will be clear that writing such parsers by hand is far from attractive. Context-Free Grammars 2. in the ﬁrst part of this course we will show how – given a grammar with certain reasonable properties – we can easily construct parsers by hand. At the same time we will show how the parsing process can quite often be combined with the algorithm we actually want to perform on the recognized object (the semantic function). A compiler for a programming language consists of several parts.8. Parsing This section formulates the parsing problem. although surprisingly eﬃcient. "funny". "else". In this course we will concentrate on parsers. "then". The techniques we describe comprise a simple.13 (Parsing problem). the answer to this question may be either a parse tree or a derivation. "=". Examples of such parts are a scanner.

31. describe a derivation of the string bm abn abp . ∗ Exercise 2. the language of palindromes. Consider the grammar with productions S → aaB A → bB b A→ε B → Aa Show that the string aabbaabba cannot be derived from S . but the chance that an if symbol is missing is far from likely.9. Exercises Exercise 2. given an erroneous input. Give at least two distinct derivations for the string babbab. n. 2. Consider the grammar with productions S → AA A → AAA A→a A → bA A → Ab 1. 3. Which terminal strings can be produced by derivations of four or fewer steps? 2. This is where grammar engineering starts to play a rˆle. One of the most important aspects here is to deﬁne metric for deciding about the minimality of a change. Under which circumstances is L+ = L∗ − {ε}? Exercise 2.2. A semicolon can easily be forgotten. Does L contain only palindromes? Exercise 2. Associated concepts. p 0. b. we have introduced the concept of a context-free grammar.32. such as derivations and parse trees were introduced.34. Exercise 2.33. humans usually make certain mistakes more often than others.30. Exercises to be made to the sentence in order to convert it into a sentence of the language? It goes almost without saying that this is an important question to be answered in practice. o Summary Starting from a simple example.29. c} such that L=LR . For any m.9. one would not be very happy with a compiler which. Do there exist languages L such that (L∗ ) = (L) ? Exercise 2. Let L be a language over alphabet {a. 37 . would just reply that the “Input could not be recognised”. Give a language L such that L = L∗ .

39. G1 : S →ε S → aS G2 : S →ε S → Sa G3 : S →ε S →a S → SS ∗ Exercise 2. Give a non-contracting grammar which describes the same language. A grammar having this property is called non-contracting. Give a grammar for the language L = {ωcω R | ω ∈ {a.35. S →ε S →A A → Aa A→a and S →ε S →A A → AaA A→a Can you ﬁnd other (preferably simpler) grammars for the same languages? Exercise 2. Give a grammar for L that has no right-recursive productions. Describe the language L of the grammar A → AaA | a Give a grammar for L that has no left-recursive productions. Exercise 2. the start symbol is the only nonterminal which may have an empty production (a production of the form X → ε).38.36. Give a derivation of the sentence abcba. Describe the language generated by the grammar: S →ε S →A A → aAb A → ab Can you ﬁnd another (preferably simpler) grammar for the same language? Exercise 2. The grammar A → aAb | ε does not have this property. 2. Show that the languages generated by the grammars G1 . 38 .40.37.2. G2 en G3 are the same. Context-Free Grammars Exercise 2. b} } This language is known as the center-marked palindromes language. Exercise 2. Describe the languages generated by the grammars. Consider the following property of grammars: 1. the start symbol does not occur in any alternative.

Exercises Exercise 2. → they → Verb NounPhrase → AuxVerb Verb Noun → are → flying → are → Adjective Noun → flying → planes Give two diﬀerent leftmost derivations for the sentence they are flying planes. This exercise shows an example (attributed to Noam Chomsky) of an ambiguous English sentence.9. Give a grammar for L which uses only two nonterminals. [. )} in which the brackets match. Consider the following grammar for a part of the English language: Sentence Subject Predicate Predicate Verb Verb AuxVerb NounPhrase Adjective Noun → Subject Predicate . Exercise 2. {(. An example sentence in this language is [ ( ) ] ( ). Exercise 2.2. {(. Exercise 2.46.44. Give a derivation for this sentence. ). An example sentence of the language is ( ) ( ( ) ) ( ). Exercise 2.42. Consider the language L of the grammar S → T | US T → aS a | U a U → S | SUT Give a grammar for L which uses only productions with two or less symbols on the right hand side.41. Give a grammar for the language of all sequences of 0’s and 1’s which start with a 1 and contain exactly one 0. Exercise 2. Give a grammar for L that has no left-recursive productions and is non-contracting.43.45. Give a grammar for the language consisting of all nonempty sequences of two kinds of brackets. Describe the language L of the grammar X → a | Xb Give a grammar for L that has no left-recursive productions. Give a grammar for the language consisting of all nonempty sequences of brackets. 39 . ]} in which the brackets match.

Prove. and ⊕ are nonterminals. → ⊗ →⊗ ⊗→⊗3⊕ ⊗→⊕ ⊕→♣ ⊕→♠ Find a derivation for the sentence ♣ 3 ♣ ♠. Exercise 2. 40 . thus 4 is represented as 001. Assume that . Try to ﬁnd some ambiguous sentences in your own natural language. Consider the natural numbers in reverse binary notation. Prove that the language generated by the grammar of Exercise 2. determines whether or not w is divisible by 7.51 (••. given a string w of I’s.52 (no answer provided).2. given a string w of zeros and ones. thus 4 is represented as IIII.49. Write an algorithm that determines whether or not the number of a’s in w equals the number of b’s in w . This exercise deals with a grammar that uses unusual terminal and nonterminal symbols. ﬁnd one which is unambiguous.48 (•).45 unambiguous? If not. b} where the number of a’s is even and greater than zero.33 contains all strings over {a. Here are some ambiguous Dutch sentences seen in the newspapers: Vliegen met hartafwijking niet gevaarlijk Jessye Norman kan niet zingen Alcohol is voor vrouwen schadelijker dan mannen Exercise 2. Exercise 2. Let w be a string consisting of a’s and b’s only. determines whether or not w is divisible by 7.47. Exercise 2.53 (no answer provided). ⊗.50 (•). Exercise 2.54 (no answer provided). and the other symbols are terminals.2 does indeed generate the language of palindromes. Write an algorithm that. using induction. that the grammar G for palindromes in Section 2. Consider the natural numbers in unary notation where only the symbol I is used. Write an algorithm that. Exercise 2. Context-Free Grammars Exercise 2. no answer provided). Is your grammar for Exercise 2. Exercise 2.

and datatypes. and we will avoid using indices and ellipses. which is often a simple task. Parser combinators are built by means of standard functional language constructs like higher-order functions. we can build parsers for the language of (possibly ambiguous) grammars. The basic parsing functions do not combine parsers. and are therefore not parser combinators in this sense. Parser combinators are used to write parsers that are very similar to the grammar of a language. but they are not essential. The functions that combine parsers are called parser combinators. which can be used for sequentially and alternatively combining parsers. lists. Parsers can be written using a small set of basic parsing functions. and a number of functions that combine parsers into more complicated parsers. As an example. ﬁlter and concat functions. and for calculating so-called semantic functions during the parse. This is done without coding the priorities of operators as integers. 41 .3 the ﬁrst parser combinators are introduced.5.3. Using that type. Finally. In Section 3. and could easily be rephrased using the map. Thus writing a parser amounts to translating a grammar to a functional program. Not only do these make life easier later. List comprehensions are used in a few places.3.1. we will introduce some elementary parsers that can be used for parsing the terminal symbols of a language. but they are usually also called parser combinators. and an integer indicating the nesting depth. Diﬀerent semantic values are calculated for the matching parentheses: a tree describing the structure. Next. A real application is given in Section 3. We will start by motivating the deﬁnition of the type of parser functions. but their deﬁnitions are also nice examples of using parser combinators. Type classes are only used for overloading the equality and arithmetic operators. we construct a parser for strings of matching parentheses in Section 3. Semantic functions are used to give meaning to syntactic structures. where a parser for arithmetical expressions is developed. the expression parser is generalised to expressions with an arbitrary number of precedence levels. In Section 3.4 we introduce some new parser combinators. Parser combinators Introduction This chapter is an informal introduction to writing parsers in a lazy functional language using ‘parser combinators’.

Furthermore. Two secondary goals of this chapter are: • to develop the capability to abstract. important primary goals of this chapter are: • to understand how to parse. you should be familiar with functional programming concepts such as type. by means of parser combinators. Another limitation of the parser combinator technique as described in this chapter is that it is not trivial to write parsers for complex grammars that perform reasonably eﬃcient. there do exist implementations of parser combinators that perform remarkably well. and you should understand the concept of context-free grammar. how to recognise structure in a sequence of symbols. and higher-order functions. • to understand the concept of domain speciﬁc language. you should be able to formulate the parsing problem. • to understand the concept of semantic functions. i. 42 . • to be able to construct a parser when given a grammar. However. Wadler [13] and Hutton [7]. Hence. If the grammar is left-recursive. e. For example. it has to be transformed into a non left-recursive grammar before we can construct a combinator parser. This chapter is a revised version of an article by Fokker [5]. class. Parsers are composed from simple parsers by means of parser combinators. Parser combinators It is not always possible to directly construct a parser from a context-free grammar using parser combinators.. 12].3. see [10. there exist good parsers using parser combinators for the Haskell language. Required prior knowledge To understand this chapter. Goals This chapter introduces the ﬁrst programs for parsing in these lecture notes. Most of the techniques introduced in this chapter have been described by Burge [4].

which is deﬁned by data S = Beside S S | Single A parser for expressions could be implemented as a function of the following type: type Parser = String → S — preliminary For parsing substructures. which represents the type of parse trees. Therefore it is better to abstract from the type S . The two results can be paired. when parsing the string sss.6. and only then build a complete parse tree Beside (Beside Single Single) Single using the unprocessed part s of the input string. 43 . String) — still preliminary For example. The parsing problem is (see Section 2. The type of parsers The goals of this section are: • develop a type Parser that is used to give the type of parsing functions.1. in Section 2. see Section 2. but also the part of the input string that is left unprocessed.8): Given a grammar G and a string s. The type Parser is parametrised with a type a. a parser that returns a structure of type Oak (whatever that is) now has type Parser Oak . A parser that parses sequences of s’s has type Parser S . String) — still preliminary Any parser of type Parser returns an S and a String. • show how to obtain this type by means of several abstraction steps. a parser can call other parsers. As this cannot be done using a global variable.1. If s ∈ L(G).3. However. For example.4 we have seen a grammar for sequences of s’s: S → SS | s A parse tree of an expression of this language is a value of the datatype S (or a value of several variants of that type. depending on what you want to do with the result of the parser). a parser will ﬁrst build a parse tree Beside Single Single for ss. or call itself recursively. and to turn the parser type into a polymorphic type. type Parser a = String → (a. For example. the unprocessed input string has to be part of the result of the parser. the answer to this question may be either a parse tree or a derivation. determine whether or not s ∈ L(G). The type of parsers 3. These calls do not only have to communicate their result. for diﬀerent grammars we want to return diﬀerent parse trees: the type of tree that is returned depends on the grammar for which we want to parse sentences. A better deﬁnition for the type Parser is hence: type Parser = String → (S .

described by Wadler [13]. Thanks to lazy evaluation. For example. Lazy evaluation provides a backtracking approach to ﬁnding the ﬁrst solution. you can take the head of the list of successes. Parser combinators We might also deﬁne a parser that does not return a value of type S . the result consists of all possible parses. which is subsequently parsed. In case of an ambiguous grammar. To cater for this situation we reﬁne the parser type once more: we let the type of the elements of the input string be an argument of the parser type. In this case the function is also of type Parser Int. the string "sss" has the following two parse trees: Beside (Beside Single Single) Single Beside Single (Beside Single Single) As another reﬁnement of the type Parser . the type of parsers becomes type Parser s a = [s] → [(a. a recogniser that either accepts or rejects sentences of a grammar returns a boolean value. and will have type Parser Bool . the result is an empty list. instead of returning one parse tree (and its associated rest string). Until now. but instead the number of s’s in the input sequence. the result of the parse function is a singleton list. If only one solution is required rather than all possible solutions. list of successes lazy evaluation This method for parsing is called the list of successes method. so there will be no loss of eﬃciency. This parser would have type Parser Int. It can be used in situations where in other languages you would use so-called backtracking techniques. In general. Calling the type of symbols s. and as before the result type a. if you prefer meaningful identiﬁers over conciseness: 44 . or that there is no way to parse a string. paired with the rest string that was left unprocessed after parsing. You may imagine a situation in which a preprocessor prepares a list of tokens (see Section 2.8). There is however no reason for not allowing parsing strings of elements other than characters. and returns the number represented by it as a parse ‘tree’. that is lists of characters. In the Bird and Wadler textbook it is used to solve combinatorial problems like the eight queens problem [3]. String)] — useful. Another instance of a parser is a parse function that recognises a string of digits.3. we let a parser return a list of trees. The type deﬁnition of Parser therefore becomes: type Parser a = String → [(a. not all elements of the list are computed if only the ﬁrst value is needed. we have assumed that every string can be parsed in exactly one way. but still suboptimal If there is just one parse. If no parse is possible. Each element of the result consists of a tree. this need not be the case: it may be that a single string can be parsed in various ways. [s])] or. Parsers with the type described so far operate on strings. Finally.

trivially correct functions by means of generalisation and partial parametrisation. and as a parse ‘tree’ we also simply use a Char : symbola :: Parser Char Char symbola [] = [] symbola (x : xs) | x = = ’a’ = [(’a’. 3. Elementary parsers — The type of parsers type Parser symbol result = [symbol ] → [(result. We will hardly use the full generality provided by the Parser type: the type of the input s (or symbol ) will almost always be Char .1. Each element of this list is a possible parsing of (an initial part of) the input. xs)] | otherwise = [] 45 . The list of successes appears in result type of a parser. The type of the input string symbols is Char in this case. This type is deﬁned in Listing 3. This section deﬁnes parsers that can only be used to parse ﬁxed sequences of terminal symbols. • show how one can construct useful functions from simple.2. the ﬁrst part of our parser library.3. [symbol ])] Listing 3. For a grammar with a production that contains nonterminals in its righthand side we need techniques that will be introduced in the following section.hs type Parser symbol result = [symbol ] → [(result. [symbol ])] We will use this type deﬁnition in the rest of this chapter. Elementary parsers The goals of this section are: • introduce some very simple parsers for parsing sentences of grammars with rules of the form: A→ε A→a A→x where x is a sequence of terminals.1: ParserType.2. We will start with a very simple parse function that just recognises the terminal symbol a.

so that it can be used in other applications than character oriented ones.hs The list of successes method immediately pays oﬀ. this is indicated by the Eq predicate in the type of the function. The function symbol is a function that. but also on lists of symbols of other types. the function can operate on lists of characters. it is better to abstract from the symbol to be recognised by making it an extra argument of the function. This is why two arguments appear in the deﬁnition of symbol . Using these generalisations. Furthermore. rather than deﬁning a lot of closely related functions. given a symbol. Parser combinators — Elementary parsers symbol :: Eq s ⇒ s → Parser s s symbol a [] = [] symbol a (x : xs) | x = = a = [(x .2: ParserType. 46 . drop n xs)] | otherwise = [] where n = length k failp :: Parser s a failp xs = [] succeed :: a → Parser s a succeed r xs = [(r .2. we can write parsers that recognise other symbols. we obtain the function symbol that is given in Listing 3. xs)] | otherwise = [] satisfy :: (s → Bool ) → Parser s s satisfy p [] = [] satisfy p (x : xs) | p x = [(x . The only prerequisite is that the symbols to be parsed can be tested for equality. As always. because now we can return an empty list if no parsing is possible (because the input is empty. In Haskell. xs)] | otherwise = [] token :: Eq s ⇒ [s] → Parser s [s] token k xs | k = = take n xs = [(k .3. A parser on its turn is a function too. In the same fashion. returns a parser for that symbol. xs)] — Applications of elementary parsers digit :: Parser Char Char digit = satisfy isDigit Listing 3. or does not start with an a).

but always returns a given. we will deﬁne a function epsilon that ‘parses’ the empty string. We will call this function token. such as while or switch. In this tradition. and in the unit type 47 . Of course. the type of token is: token :: Eq s ⇒ [s] → Parser s [s] The function token is a generalisation of the symbol function. which fails to recognise any symbol on the input string. in that it recognises a list of symbols instead of a single symbol. A zero-tuple can be used as a result value: () is the only value of the type () – both the type and the value are pronounced unit. epsilon :: Parser s () epsilon xs = [((). Dual to the function succeed is the function failp. xs)] A more useful variant is the function succeed . ﬁxed value (or ‘parse tree’.2. For example. Where symbol tests for equality to a speciﬁc value.8).2. we do need an equality test on the type of values in the input string. It is deﬁned in Listing 3. It is deﬁned in Listing 3. and hence always returns an empty parse tree and unmodiﬁed input. it is deﬁned in Listing 3. if you can call the result of processing zero symbols a parse tree).3. Another generalisation of symbol is a function which may. this function is not conﬁned to strings of characters. It does not consume any input. Instead of specifying a speciﬁc symbol. a useful parser is one that recognises a ﬁxed string of symbols. Elementary parsers We will now deﬁne some elementary parsers that can do the work traditionally taken care of by lexical analysers (see Section 2. eﬀectively making it into a family of functions. depending on the input. the function satisfy tests for compliance with this predicate. return diﬀerent parse results. As the result list of a parser is a ‘list of successes’. we can parametrise the function with a condition that the symbol should fulﬁll. As in the case of the symbol function we have parametrised this function with the string to be recognised. Note that we cannot deﬁne symbol in terms of token: the two functions have incompatible types. However. Thus the function satisfy has a function s → Bool as argument.2. which also doesn’t consume input. This generalised function is for example useful when we want to parse digits (characters in between ’0’ and ’9’): digit :: Parser Char Char digit = satisfy isDigit where the function isDigit is the standard predicate that tests whether or not a character is a digit: isDigit :: Char → Bool isDigit x = ’0’ x ∧ x ’9’ In books on grammar theory an empty string is often called ‘ε’.2.

The goals of this section are: • show how parsers can be constructed directly from the productions of a grammar.1. How can this be done? Exercise 3.5: E →T +E |T T →F *T |F F → Digs | ( E ) where Digs is a nonterminal that generates the language of sequences of digits. • show how we can construct a small. the result list should be empty. Parser combinators Using the elementary parsers from the previous section.3. The kind of productions for which parsers will be constructed are A→x |y A→x y where x and y are sequences of nonterminal or terminal symbols. see also Section 2.1.2. • understand and use the concept of semantic functions in parsers. Furthermore. Deﬁne a function capital :: Parser Char Char that parses capital letters. Therefore the function failp always returns the empty list of successes. Note the diﬀerence with epsilon. see Section 2. which does have one element in its list of successes (albeit an empty one). Let us have a look at the grammar for expressions again. Parser combinators case of failure there are no successes. powerful combinator language (a domain speciﬁc language) for the purpose of parsing. Deﬁne the function epsilon using succeed . This implies that we want to have a way to say that a parser consists of several alternative parsers. It is convenient to construct these parsers by partially parametrising higherorder functions. It is deﬁned in Listing 3.3. Do not confuse failp with epsilon: there is an important diﬀerence between returning one solution (which contains the unchanged input as ‘rest’ string) and not returning a solution at all! Exercise 3. Since satisfy is a generalisation of symbol . 48 . the function symbol can be deﬁned as an instance of satisfy. Exercise 3. More interesting are parsers for nonterminal symbols. 3.2. the ﬁrst rule says that in order to parse an expression.3. An expression can be parsed according to any of the two rules for E.3. parsers can be constructed for terminal symbols from a grammar.

the function operates on a string. then a terminal symbol +. By again combining the result with other parsers. Both operators take two parsers as argument. though. zs) ← q ys ] The rest string of the parser for the sequential composition of p and q is whatever the second parser q leaves behind as rest string. and return a parser as result. Apart from the arguments p and q. q is applied to the rest string of the result. in which the second parser is applied in all possible ways to the rest string of the ﬁrst parser: (p <∗> q) xs = [(combine r1 r2 . In the deﬁnitions in Listing 3. it binds stronger – than <|>. This implies that we want to have a way to say that a parser consists of several parsers that are applied sequentially. and then an expression. So important operations on parsers are sequential and alternative composition: a more complex construct can consist of a simple construct followed by another construct (sequential composition). We will develop two functions for this. The second parser. For sequential composition. These operations correspond directly to their grammatical counterparts.3. Be careful. ys) ← p xs . and <|> can be pronounced as ‘or’. p must be applied to the input ﬁrst. returns a list of successes. zs) |(r1 . you may construct even more involved parsers.3. Therefore we use a list comprehension.3. which for notational convenience are deﬁned as operators: <∗> for sequential composition. and <|> for alternative composition. Parser combinators we should ﬁrst parse a term. q. We start with the deﬁnition of operator <∗>. Priorities of these operators are deﬁned so as to minimise parentheses in practical situations: inﬁxl 6 <∗> inﬁxr 4 <|> So <∗> has a higher priority – i. 49 . (r2 .. which can be thought of as the string that is parsed by the parser that is the result of combining p and q. which is used to distinguish cases in a function deﬁnition or as a separator for the constructors of a datatype. After that. e. p. not to confuse the <|>-operator with Haskell’s built-in construct |. The ﬁrst parser. The names of these operators are chosen so that they can be easily remembered: <∗> ‘multiplies’ two constructs together. should be applied to the rest string. or by a choice between two constructs (alternative composition). returning a second value. the functions operate on parsers p and q. each of which contains a value and a rest string.

parametrise the whole thing with an operator that describes how to combine the parts (as is done in the zipWith function). Apart from ‘sequential composition’ we need a parser combinator for representing ‘choice’. of course. a straightforward choice for the combine function would be function application. 50 . That is exactly the approach taken in the deﬁnition of <∗> in Listing 3. We will now also accept functions as parse trees.3. However. zs ) | (f . the result type of a parser may be a function type. we have interpreted the word ‘tree’ liberally: simple values. how should the results of the two parsings be combined? We could. ys) ← p xs . we only need to concatenate these two lists. and the combined parser returns the value that is obtained by applying the function to the value. That is.hs Now. we have chosen for a diﬀerent approach. which nicely exploits the ability of functional languages to manipulate functions. ( x . may also be used as a parse ‘tree’. both p1 and p2 return lists of possible parsings. For this. In the past. the second parser a value. The ﬁrst parser returns a function. ys) | ( y. If the ﬁrst parser that is combined by <∗> would return a function of type b → a. Thanks to the list of successes method.3: ParserCombinators. like characters. To obtain all possible parsings when applying p1 or p2 .3. we have the parser combinator operator <|>. ys) ← p xs ] — Applications of parser combinators newdigit :: Parser Char Int newdigit = f <$> digit where f c = ord c − ord ’0’ Listing 3. The function combine should combine the results of the two parse trees recognised by p and q. Parser combinators — Parser combinators (<|>) :: Parser s a → Parser s a → Parser s a (p <|> q) xs = p xs + q xs + (<∗>) :: Parser s (b → a) → Parser s b → Parser s a (p <∗> q) xs = [ (f x . zs ) ← q ys ] (<$>) :: (a → b) → Parser s a → Parser s b (f <$> p) xs = [ (f y. and the second parser a value of type b.

For any left-recursive grammar. we can modify the parser digit that was deﬁned above: newdigit :: Parser Char Int newdigit = f <$> digit where f c = ord c − ord ’0’ The auxiliary function f determines the ordinal number of a digit character. or a list of all variables with their types. In practice. we may need a parser that recognises one digit character. Sometimes we are not quite satisﬁed with the result value of a parser.). but the result value might need some postprocessing. Put more generally: using <$> we can add semantic functions to parsers.6. but ‘postprocesses’ the result using the function. in Section 2. It is an inﬁx operator: inﬁxl 7 <$> Using this postprocessing parser combinator. The parser might work well in that it consumes symbols from the input adequately (leaving the unused symbols as rest-string in the tuples in the list of successes).> in exercise 3.12 is just a variation of <∗>. the result is a parser that recognises the same string as the original parser. is deﬁned as follows: sequenceOfS :: Parser Char S sequenceOfS = Beside <$> sequenceOfS <∗> sequenceOfS <|> const Single <$> symbol ’s’ But if you try to run this function. The parser combinator <. a systematically constructed parser using parser combinators will exhibit the problem that it loops. We use the $ sign in the name of the combinator. the ﬁrst thing it does is to apply itself to the same string. The deﬁnition of <$> is given in Listing 3. but returns the result as an integer.3.3. the <$> operator is used to build a certain value during parsing (in the case of parsing a computer program this value may be the generated code. For example. In some applications. we can use a new parser combinator: <$>.. However. which loops. i. see Section 2. It takes a function and a parser as argument. using the parser combinator <$> it is applied to the result part of the digit parser. The problem stems from the fact that the underlying grammar is leftrecursive.e. a parser that recognises one digit is deﬁned using the function satisfy: digit = satisfy isDigit. Parser combinators By combining parsers with parser combinators we can construct new parsers. you will get a stack overﬂow! If you apply sequenceOfS to a string.3. A parser for the SequenceOfS grammar that returns the abstract syntax tree of the input. because the combinator resembles the operator that is used for normal function application in Haskell: f $ x = f x . a value of type S . instead of a character.5 51 . In a case like this. etc. The most important parser combinators are <∗> and <|>.

10. Exercise 3. Deﬁne a parser english that returns parse trees for the English language. 52 .21. Parser combinators we have shown how to remove the left recursion in the SequenceOfS grammar. e. Deﬁne a parser for Booleans. Exercise 3. Test your function with the sentence they are flying planes. Give its type and show its results on inputs [] and x : xs. 1.46. 2. Deﬁne a parser palina that counts the number of a’s occurring in a palindrome..6.7.3. f = λ → c for some term c.7. Give its type and show its results on inputs [] and x : xs. Consider the grammar for a part of the English language that is given in Exercise 2. Consider the parser (:) <$> symbol ’a’ <∗> p. Compare the result to your answer of Exercise 2.5. i. Exercise 3. The resulting grammar is used to obtain the following parser: sequenceOfS :: Parser Char SA sequenceOfS = const ConsS <$> symbol ’s’ <∗> parseZ <|> const SingleS <$> symbol ’s’ where parseZ = ConsZ <$> sequenceOfS <∗> parseZ <|> SingleZ <$> sequenceOfS This example is a direct translation of the grammar obtained by using the removing left recursion grammar transformation. Exercise 3. Exercise 3. Consider the grammar for palindromes that you have constructed in Exercise 2. 3. Exercise 3. 1.1.3.8 (no answer provivded).46.9. 2. Compare the results with your answer to Exercise 2. Deﬁne parsers for each of the basic languages deﬁned in Section 2. Deﬁne a parser palin 2 that returns parse trees for palindromes. Give the datatype English that corresponds to this grammar. Test your function with the palindromes cPal 1 = "abaaba" and cPal 2 = "baaab". Prove for all f :: a → b that f <$> succeed a = succeed (f a) In the sequel we will often use this rule for constant functions f . There exists a much simpler parser for parsing sequences of s’s.4. Exercise 3. Consider the parser (:) <$> symbol ’a’. Give the datatype Pal 2 that corresponds to this grammar.

3. So. A nice result would be a treelike description of the parentheses that are parsed. using the parser combinators <∗> and <|>. what function should we use? This depends on the kind of value that we want as a result of the parser.3.12.6.14. the grammar that you wrote in Exercise 2.11. this function is not correctly typed: the parsers in the ﬁrst alternative cannot be composed using <∗>.3. it is often fairly straightforward to construct a parser for a language for which you have a grammar. Consider. The value returned by the combined parser is a tuple containing the results of the two component parsers.44: S →(S )S |ε This grammar can be directly translated to a parser. But we can postprocess the parser symbol ’(’ so that.> and <$>. for the parentheses grammar. catchy phrase? Exercise 3. Deﬁne <∗> in terms of <. for example. Deﬁne <. instead of a character. Can you deﬁne <$> in terms of <∗>? 3. we also speciﬁed that the operator associates to the right.13. We obtain the following Haskell datatype: data Parentheses = Match Parentheses Parentheses | Empty 53 .3. Can you think of a better word? Exercise 3. What is the type of this parser combinator? Exercise 3.> that combines two parsers. parens :: Parser Char ??? — ill-typed parens = symbol ’(’ <∗> parens <∗> symbol ’)’ <∗> parens <|> epsilon However. Parser combinators Exercise 3. Exercise 3. Can you describe your observations in an easy-to-remember. Matching parentheses: an example Using parser combinators. this parser does return a function.15. If you examine the deﬁnitions of <∗> and <$> in Listing 3. Deﬁne a parser combinator <. as for example symbol ’(’ is not a parser returning a function. We use <∗> when symbols are written next to each other. you can observe that <$> is in a sense a special case of <∗>. The term ‘parser combinator’ is in fact not an adequate description for <$>. For this purpose we introduce an abstract syntax.16. Compare the type of <$> with the type of the standard function map. see Section 2.> in terms of <∗> and <$>. and <|> when | appears in a production (or when there is more than one production for a nonterminal). When deﬁning the priority of the <|> operator with the inﬁxr keyword.1. Why is this a better choice than association to the left? Exercise 3.

4. we can return other things than parse trees. see the function nesting deﬁned in Listing 3."()(())()")] ? nesting "())" [(1.[]). (1."()"). which is deﬁned by induction on the datatype Parentheses.")"). nrofpars :: Parentheses → Int nrofpars (Match pl pr ) = 2 + nrofpars pl + nrofpars pr nrofpars Empty =0 Using the datatype Parentheses. (0."(())()"). (0.hs For example. A session in which nesting is used may look like this: ? nesting "()(())()" [(2. As an example we construct a parser that calculates the nesting depth of nested parentheses. we can add ‘semantic functions’ to the parser. Parser combinators data Parentheses = Match Parentheses Parentheses | Empty deriving Show open = symbol ’(’ close = symbol ’)’ parens :: Parser Char Parentheses parens = (λ x y → Match x y) <$> open <∗> parens <∗> close <∗> parens <|> succeed Empty nesting :: Parser Char Int nesting = (λ x y → max (1 + x ) y) <$> open <∗> nesting <∗> close <∗> nesting <|> succeed 0 Listing 3.4: ParseParentheses. the sentence ()() is represented by Match Empty (Match Empty Empty) Suppose we want to calculate the number of parentheses in a sentence. The number of parentheses is calculated by the function nrofpars.4. We then obtain the deﬁnition of parens in Listing 3. (2."())")] 54 .3. By varying the function used as a ﬁrst argument of <$> (the ‘semantic function’).

there are no solutions with empty rest string.3. Given a parser p for a construction. what is the type of f <$> open? Can f <$> open be used as a left hand side of <∗>parens? What is the type of the result? Exercise 3. As a ﬁrst example we consider repetition. The goal of this section is to show how the set of parser combinators can be extended. What is a convenient way for <∗> to associate? Does it? Exercise 3. the many combinator can be used in parsing a natural number: natural :: Parser Char Int natural = foldl f 0 <$> many newdigit where f a b = a ∗ 10 + b 55 . What is the type of the function f = λ x y → Match x y which appears in function parens in Listing 3.19. In traditional grammar formalisms. denoted by a star. More parser combinators As you can see. additional symbols are used to describe for example optional or repeated constructions.17. Write a function test that determines whether or not a given string belongs to the language parsed by a given parser. For example. Parser combinators for EBNF It is very easy to make new parser combinators for EBNF. The function (:) is just the cons-operator for lists: it takes a head element and a tail list and combines them.4. many p constructs a parser for zero or more occurrences of that construction: many :: Parser s a → Parser s [a] many p = (:) <$> p <∗> many p <|> succeed [] So the EBNF expression P ∗ is implemented by many P . The order in which the alternatives are given only inﬂuences the order in which solutions are placed in the list of successes. It is fairly simple to test whether a given string belongs to the language that is parsed by a given parser.4? What is the type of the parser open? Using the type of <$>.4. but in practice it is easier to have some more parser combinators available. respectively). More parser combinators In principle you can build parsers for any context-free language using the combinators <∗> and <|>. Exercise 3. Consider for example the BNF formalism.1. 3. but which was later extended to EBNF to also allow for repetition. when there is a syntax error in the argument.18. 3.4. in which originally only sequential and alternative composition can be used (denoted by juxtaposition and vertical bars.

many greedy 1 = ﬁrst . greedy 1 :: Parser s b → Parser s [b] greedy = ﬁrst .hs 56 .3. Parser combinators — EBNF parser combinators option :: Parser s a → a → Parser s a option p d = p <|> succeed d many :: Parser s a → Parser s [a] many p = (:) <$> p <∗> many p <|> succeed [] many 1 :: Parser s a → Parser s [a] many 1 p = (:) <$> p <∗> many p pack :: Parser s a → Parser s b → Parser s c → Parser s b pack p r q = (λ x → x ) <$> p <∗> r <∗> q listOf :: Parser s a → Parser s b → Parser s [a] listOf p s = (:) <$> p <∗> many ((λ x → x ) <$> s <∗> p) — Auxiliary functions ﬁrst :: Parser s b → Parser s b ﬁrst p xs | null r = [] | otherwise = [head r ] where r = p xs greedy. many 1 Listing 3.5: EBNF.

many greedy 1 = ﬁrst . Another combinator from EBNF is the option combinator P ?.3. Caused by the order of the alternatives in the deﬁnition of many (succeed [] appears as the second alternative). More parser combinators Deﬁned in this way. a large amount of backtracking possibilities are introduced. You will give a better deﬁnition of identiﬁer in Exercise 3.5. and returns a parser that recognises the same construct. 57 .4. which accepts one or more occurrences of a construction. but which also succeeds if that construct is not present in the input string. see Section 2. It is a kind of ‘default’ value. that transforms a parser into a parser that only returns the ﬁrst possible parsing. but which does not fail if it is not present.7. we can deﬁne a parser transformer ﬁrst. many 1 If we compose the ﬁrst function with the option parser combinator: obligatory p d = ﬁrst (option p d ) we get a parser which must accept a construction if it is present. we had better use the many 1 parser combinator. By the use of the option and many functions. also less greedy parsings of the identiﬁer are tried – in vain. The deﬁnition is given in Listing 3. we can create a special ‘take all or nothing’ version of many: greedy = ﬁrst . but if parsing fails elsewhere in the sentence. If this is not desired. ﬁrst :: Parser a b → Parser a b ﬁrst p xs | null r = [] | otherwise = [head r ] where r = p xs Using this function. It does so by taking the ﬁrst element of the list of successes. and corresponds to the EBNF expression P + . For example.27. In situations where from the way the grammar is built we can predict that it is hopeless to try non-greedy results of many. It is deﬁned in Listing 3. the natural parser also accepts empty input as a number. the ‘greedy’ parsing.5. which accumulates as many letters as possible in the identiﬁer is tried ﬁrst. if we deﬁne a parser for identiﬁers by identiﬁer = many 1 (satisfy isAlpha) a single word may also be parsed as two identiﬁers. It takes a parser as argument. This is not always advantageous. It has an additional argument: the value that should be used as result in case the construct is not present.

given a parser for the items and a parser for the separators: listOf :: Parser s a → Parser s b → Parser s [a] listOf p s = (:) <$> p <∗> many ((λ x → x ) <$> s <∗> p) Useful instantiations are: commaList.’) semicList p = listOf p (symbol ’. The deﬁnitions look quite complicated. The function listOf below generates a parser for a non-empty list. ·+ and ·?. Parser combinators 3. in many languages constructions are frequently enclosed between two meaningless symbols.’) A somewhat more complicated variant of the function listOf is the case where the separators carry a meaning themselves. it constructs a parser for the enclosed body.4. many 1 and option are classical in compiler constructions – there are notations for it in EBNF (·∗ . Suppose we apply operator ⊕ (⊕ is an operator variable.2. Given a parser for an opening token. For this case we will develop the functions chainr and chainl . but when you look at the underlying grammar they are quite straightforward. In the case of chainr the operator is applied right-to-left. but there is no need to leave it at that. or compound statements (statements separated by semicolons).6 (remember that $ is function application: f $ x = f x ). in the case of chainl it is applied left-to-right. it denotes an arbitrary right-associative operator) from right to left. and a closing token. These functions expect that the parser for the separators returns a function (!). so e1 ⊕ e2 ⊕ e3 ⊕ e4 = 58 . a body. Separators The combinators many. the separators are of no importance. respectively) –. Special cases of this combinator are: parenthesised p = pack (symbol ’(’) p (symbol ’)’) bracketed p = pack (symbol ’[’) p (symbol ’]’) compound p = pack (token "begin") p (token "end") Another frequently occurring construction is repetition of a certain construction. where the operators that separate the subexpressions have to be part of the parse tree. most often some sort of parentheses. that function is used by chain to combine parse trees for the items. For this case we design a parser combinator pack . The functions chainr and chainl are deﬁned in Listing 3. You may think of lists of arguments (expressions separated by commas).3. For example. For the parse trees. semicList :: Parser Char a → Parser Char [a] commaList p = listOf p (symbol ’. where the elements are separated by some symbol. as deﬁned in Listing 3. in arithmetical expressions. For example.5.

3. the only diﬀerence is that everything is ‘turned around’: function j of chainr takes a value and an operator. see Listing 3. This is done by function chainr . turning them into functions. The resulting deﬁnitions can be found in the article by Fokker [5].6.6. and returns the function obtained by ‘left’ applying the operator. see Listing 3. function j of chainl takes an operator and a value. Functions chainl and chainr can be made more eﬃcient by avoiding the construction of the intermediate list of functions.hs e1 ⊕ (e2 ⊕ (e3 ⊕ e4 )) = ((e1 ⊕) · (e2 ⊕) · (e3 ⊕)) e4 It follows that we can parse such expressions by parsing many pairs of expressions and operators. and applying all those functions to the ﬁrst expression. More parser combinators — Chain expression combinators chainr :: Parser s a → Parser s (a → a → a) → Parser s a chainr pe po = h <$> many (j <$> pe <∗> po) <∗> pe where j x op = (x ‘op‘) h fs x = foldr ($) x fs chainl :: Parser s a → Parser s (a → a → a) → Parser s a chainl pe po = h <$> pe <∗> many (j <$> po <∗> pe) where j op x = (‘op‘x ) h x fs = foldl (ﬂip ($)) x fs Listing 3. Note that functions chainl and chainr are very similar. then e1 ⊕ e2 ⊕ e3 ⊕ e4 = ((e1 ⊕ e2 ) ⊕ e3 ) ⊕ e4 = ((⊕e4 ) · (⊕e3 ) · (⊕e2 )) e1 So such an expression can be parsed by ﬁrst parsing a single expression (e1 ). turning them into functions.4. dual 59 . and returns the function obtained by ‘right’ applying the operator to the value. and applying all those functions to the last expression. This is done by function chainl .6: Chains. If operator ⊕ is applied from left to right. Such functions are sometimes called dual . and then parsing many pairs of operators and expressions.

which have the following concrete syntax: E →E +E | E -E | E /E | (E ) | Digs 60 . ’a’. Both parsers return a list of solutions. The result is a list of integers. [’a’.27. Exercise 3. 2. deﬁne a parser combinator psequence that transforms a list of parsers for some type into a parser returning a list of elements of that type. In what order do the four possible parsings appear in the list of successes? Exercise 3.3. or underscore symbol. We will develop a parser for arithmetical expressions.5. As another variation on the theme ‘repetition’. Carefully analyse the semantic functions in the deﬁnition of chainl in Listing 3. [’b’].3.20. Deﬁne a more realistic parser identiﬁer .26 (no answer provided).2.24. many and many 1 deﬁne parsers for each of the basic languages deﬁned in Section 2. Exercise 3.25.6. Exercise 3. In real programming languages.1. Parser combinators Exercise 3. digit. Deﬁne a parser sumParser that recognises digits separated by the character ’+’ and returns the sum of these integers. 1. 3. deﬁne the function token that was discussed in Section 3.21. What is the value of many (symbol ’a’) xs for xs ∈ {[]. What is the type of psequence? Also deﬁne a combinator choice that iterates the operator <|>. but the symbols that follow (if any) may be a letter. What should be changed in order to get only one solution? Exercise 3. Arithmetical expressions The goal of this section is to use parser combinators in a concrete application. ’b’].23 (no answer provided). As an application of psequence. [’a’]. Exercise 3. identiﬁers follow rather ﬂexible rules: the ﬁrst symbol must be a letter. [’a’. ’b’]}? Exercise 3. Deﬁne a parser that analyses a string and recognises a list of digits separated by a space character. 3. Consider the application of the parser many (symbol ’a’) to the string aaa.22. Using the parser combinators option.

fact :: Parser Char Expr fact = Con <$> integer <|> Var <$> identiﬁer <|> Fun <$> identiﬁer <∗> parenthesised (commaList expr ) <|> parenthesised expr The ﬁrst alternative is an integer parser which is postprocessed by the ‘semantic function’ Con. We use chainl and not chainr because the operator ’/’ is considered to be left-associative. In absence of the latter. Arithmetical expressions Besides these productions. we also have productions for identiﬁers and applications of functions: E → Identiﬁer | Identiﬁer ( Args ) Args → ε | E (. we will use a grammar with three non-terminals ‘expression’. For the fourth alternative there is no semantic function. Recall that chainl repeatedly recognises its ﬁrst argument (fact). ‘term’ and ‘factor’: an expression is composed of terms separated by + or −. fact.3. depending on the presence of an argument list. E )∗ The parse trees for this grammar are of type Expr : data Expr = Con Int | Var String | Fun String [Expr ] | Expr :+: Expr | Expr :−: Expr | Expr :∗: Expr | Expr :/: Expr You can almost recognise the structure of the parser in this type deﬁnition. This grammar appears as a parser in the functions in Listing 3. and with terms instead of factors. in presence the function Fun. a term is composed of factors separated by ∗ or /.5. because the meaning of an expression between parentheses is the meaning of the expression. parses factors. For the deﬁnition of a term as a list of factors separated by multiplicative operators we use the function chainl . only with additive operators instead of multiplicative operators. The ﬁrst parser. The function expr is analogous to term. 61 . variable. But in order to account for the priorities of the operators.7. or expression between parentheses. the function Var is applied. function call. The second and third alternative are a variable or function call. The parse trees for the individual factors are joined by the constructor functions that appear before <$>. and a factor is a constant. separated by its second argument (a ∗ or /).

7: ExpressionParser.3. Parser combinators — Type deﬁnition for parse tree data Expr = Con Int | Var String | Fun String [Expr ] | Expr :+: Expr | Expr :−: Expr | Expr :∗: Expr | Expr :/: Expr — Parser for expressions with two priorities fact :: Parser Char Expr fact = Con <$> integer <|> Var <$> identiﬁer <|> Fun <$> identiﬁer <∗> parenthesised (commaList expr ) <|> parenthesised expr integer :: Parser Char Int integer = (const negate <$> (symbol ’-’)) ‘option‘ id <∗> natural term :: Parser Char Expr term = chainl fact ( const (:∗:) <$> symbol ’*’ <|> const (:/:) <$> symbol ’/’ ) expr :: Parser Char Expr expr = chainl term ( const (:+:) <$> symbol ’+’ <|> const (:−:) <$> symbol ’-’ ) Listing 3.hs 62 .

The function chainl is used in each deﬁnition. the production rules of the grammar are combined with higher-order functions. 3. a → a → a) gen :: [Op a] → Parser Char a → Parser Char a gen ops p = chainl p (choice (map f ops)) where f (s. the functions can be viewed both as description of the grammar and as an executable parser. Give the parse tree for the expressions "abc".b)" not the ﬁrst solution of the parser? Modify the functions in Listing 3.29. there is no need for a separate parser generator (like ‘yacc’). Exercise 3. Also.6. Why is the parse tree for the expression "a(1. "a*b+1". where the diﬀerences are described using extra arguments. Generalised expressions This section generalises the parser in the previous section with respect to priorities.b)" 2. If there are nine levels of priority.30. and "a(1. c) = const c <$> symbol s If furthermore we deﬁne as shorthand: 63 . Therefore. we obtain nine copies of almost the same text. 1. Arithmetical expressions in which operators have more than two levels of priority can be parsed by writing more auxiliary functions between term and expr .is parsed as a left-associative operator. Exercise 3. Generalised expressions This example clearly shows the strength of parsing with parser combinators. "-1-a". There is no need for a separate formalism for grammars. Exercise 3. Functions that resemble each other are an indication that we should write a generalised function.3. "(abc)". This is not as it should be. "a*(b+1)". Modify the functions in Listing 3. and . Explain why and modify the parser in such way that it will be. with as ﬁrst argument the function of one priority level lower.7 in such way that it will be. A function with no arguments such as "f()" is not accepted by the parser. let us inspect the diﬀerences in the deﬁnitions of term and expr again. the second in the form of a parse function: type Op a = (Char .6.7.28. These are: • The operators and associated tree constructors that are used in the second argument of chainl • The parser that is used as ﬁrst argument of chainl The generalised function will take these two diﬀerences as extra arguments: the ﬁrst in the form of a list of pairs. in such a way that + is parsed as a right-associative operator.

Summary This chapter shows how to construct parsers from simple combinators. More advanced implementations are discussed elsewhere. Contrary to conventional approaches. Exercises Exercise 3. 3.6 be adapted to also allow raising an expression to the power of an expression? Exercise 3. (:/:))] addis = [(’+’.3. Also. The deﬁnitions are summarised in Listing 3. Prove the following laws h <$> (f <$> p) = (h . It shows how a small parser combinator library can be a powerful tool in the construction of parsers.32. How should the parser of Section 3.3) 64 . The very compact formulation of the parser for expressions with an arbitrary number of priority levels is possible because the parser combinators can be used together with the existing mechanisms for generalisation and partial parametrisation in Haskell. multis] From this deﬁnition a generalisation to more levels of priority is simply a matter of extending the list of operator-lists.) <$> p) <∗> q (3.31. the levels of priority need not be coded explicitly with integers. (’/’. this chapter gives a rather basic implementation of the parser combinator library. the insertion of new priority levels is very easy.2) (3.8. Furthermore.7. The only thing that matters is the relative position of an operator in the list of ‘list with operators of the same priority’. f ) <$> p h <$> (p <|> q) = (h <$> p) <|> (h <$> q) h <$> (p <∗> q) = ((h.1) (3. (:∗:)). (:−:))] then expr and term can be deﬁned as partial parametrisations of gen: expr = gen addis term term = gen multis fact By expanding the deﬁnition of term in that of expr we obtain: expr = addis ‘gen‘ (multis ‘gen‘ fact) which an experienced functional programmer immediately recognises as an application of foldr : expr = foldr gen fact [addis. (’-’. (:+:)). Parser combinators multis = [(’*’.

a → a → a) fact :: Parser Char Expr fact = Con <$> integer <|> Var <$> identiﬁer <|> Fun <$> identiﬁer <∗> parenthesised (commaList expr ) <|> parenthesised expr gen :: [Op a] → Parser Char a → Parser Char a gen ops p = chainl p (choice (map f ops)) where f (s.34. integer) exponent. Deﬁne a combinator parser pMir that transforms a concrete representation of a mirror-palindrome into an abstract one.456. (:∗:)). (’-’. Exercise 3. that is numbers like 12. Deﬁne a parser for Java assignments that consist of a variable. Outline the construction of a parser for Java programs. an expression and a semicolon.37. Deﬁne a parser for (simpliﬁed) Java statements. an = sign. Also integers are acceptable. Exercise 3. Notice that the part following the decimal point looks like an integer. (’/’.3.33.35. c) = const c <$> symbol s expr :: Parser Char Expr expr = foldr gen fact [addis.38 (no answer provided).25. multis] multis = [(’*’.23. which are ﬁxed point numbers followed by an optional E and an (positive or negative. Exercise 3. 65 .8: GExpressionParser.39 (no answer provided). Deﬁne a parser for ﬂoating point numbers. (:−:))] Listing 3. Deﬁne a parser for ﬁxed-point numbers. (:/:))] addis = [(’+’. Exercise 3. (:+:)). Assuming the comma is an associative operator. Test your function with the concrete bit-lists cBitList 1 and cBitList 2 . Exercises — Parser for expressions with aribitrary many priorities type Op a = (Char .34 and -123. Consider your answer to Exercise 2.7. Test your function with the concrete mirror-palindromes cMir 1 and cMir 2 . we can give the following abstract syntax for bit-lists: data BitList = SingleB Bit | ConsB Bit BitList Deﬁne a combinator parser pBitList that transforms a concrete representation of a bit-list into an abstract one. but has a diﬀerent semantics! Exercise 3.hs Exercise 3.36. Exercise 3. Consider your answer to Exercise 2.

3. Parser combinators 66 .

The design of a grammar and parser for a language consists of several steps: you have to 1. add semantic functions. test that the grammar can indeed describe the example sentences. 9. the travel time minus the waiting time (2 hours and 17 minutes in the above example). The goal of this chapter is to review these concepts. construct a basic parser. e. for example: 1. 7. and to show how they are used in the design of grammars and parsers. give a grammar for the language for which you want to have a parser. Grammar and Parser design The previous chapters have introduced many concepts related to grammars and parsers. and that the parser returns what you expect.. i.4. 6. 3. decide on the type of the parser: Parser a b. possibly transform the grammar to obtain some of these desirable properties. decide on both the input type a of the parser (which may be the result type of a scanner). test that the parser can parse the example sentences you have given in the ﬁrst step. 5. compute the net travel time. 4. 67 . 8. As a running example we will construct a grammar and parser for travelling schemes for day trips. and the result type b of the parser. give example sentences of the language for which you want to design a grammar and a parser. analyse this grammar to ﬁnd out whether or not it has some desirable properties. of the following form: Groningen 8:37 9:44 Zwolle 9:49 10:15 Utrecht 10:21 11:05 Den Haag We might want to do several things with such a schema. 2. that is. We will describe and exemplify each of these steps in detail in the rest of this section.

It is important to convince yourself from the fact that the grammar you give really generates the desired language. Step 1: Example sentences for the language We have already given an example sentence above: Groningen 8:37 9:44 Zwolle 9:49 10:15 Utrecht 10:21 11:05 Den Haag Other example sentences are: Utrecht Centraal 10:25 10:58 Amsterdam Centraal Assen 4. 4. Another grammar focuses on changing at a station: TS → Station Departure (Arrival Station Departure)∗ Arrival Station | Station So each travelling scheme starts and ends at a station. compute the total time one has to wait on the intermediate stations (11 minutes). This chapter deﬁnes functions to perform these computations.2.1. we can give several grammars. which might turn the grammar into a set of incomprehensible productions.4. 68 . since the grammar will be the basis for grammar transformations. Grammar and Parser design 2. TS Station Departure Arrival Time → TS Departure Arrival TS | Station → Identiﬁer + → Time → Time → Nat : Nat where Identiﬁer and Nat have been deﬁned in Section 2. and in between there is a list of intermediate stations.1. For the language of travelling schemes. Step 2: A grammar for the language The starting point for designing a parser for your language is to deﬁne a grammar that describes the language as precisely as possible. Note that a single station is also a travelling scheme with this grammar. So a travelling scheme is a sequence of departure and arrival times. separated by stations. The following grammar focuses on the fact that a trip consists of zero or more departures and arrivals.3.

Step 5: Transforming the grammar Since the sequence Departure Arrival is an associative separator in the generated language. and determining which properties are (not) satisﬁed. we might desire other properties of our grammar.4. The other properties will be used in grammar transformations in the following subsection. Step 3: Testing the grammar 4. we will not make use of the fact that the grammar is right recursive. The derivation of these sentences using these grammars is easy. Both productions for TS start with the nonterminal Station.1) to the two productions for TS from (4. We have not yet developed tools for grammar analysis (we will do so in the chapter on LL(1) parsing) but for some grammars it is easy to detect some properties.1) Thus we have removed the left recursion in the grammar. We will show several choices in the next section. so TS can be left factored.4. the sequence Departure Arrival is an associative separator in the generated language. Step 4: Analysing the grammar To parse sentences of a language eﬃciently.3.1). 4.3. Step 3: Testing the grammar Both grammars we have given in step 2 describe the example sentences given in step 1. Depending on the parser we want to obtain. we want to have a unambiguous grammar that is left-factored and not left recursive.2) So which productions do we take for TS ? This depends on what we want to do with the parsed sentences. These properties may be used for transforming the grammar. The ﬁrst example grammar is left and right recursive: the ﬁrst production for TS starts and ends with TS . 69 . Furthermore. the productions for TS may be transformed into: TS → Station | Station Departure Arrival TS (4. The resulting productions are: TS → Station Z Z → ε | Departure Arrival TS We can also apply equivalence (2. 4. Since we don’t mind about right recursion.5. and obtain the following single production: TS → (Station Departure Arrival )∗ Station (4. So a ﬁrst step in designing a parser is analysing the grammar.

If we want to compute the total travelling time. we may do several things: • deﬁne three parsers. A second abstract syntax corresponds to the grammar for travelling schemes deﬁned in (4. and deﬁne three functions on TS that compute the desired results. Station) 70 . the total waiting time. ts :: Parser Char ? ts :: Parser Token ? For the result type we have many choices. • deﬁne an abstract syntax for travelling schemes. Int (total waiting time). Time. Step 6: Deciding on the types We want to write a parser for travel schemes. The type of tokens can be chosen as follows: data Token = Station Token Station | Time Token Time type Station = String type Time = (Int. There are several ways to deﬁne an abstract syntax for travelling schemes. Grammar and Parser design 4. The ﬁrst abstract syntax corresponds to deﬁnition (4. we want to write a function ts of type ts :: Parser ? ? The question marks should be replaced by the input type and the result type. Int) We will construct a parser for both input types in the next subsection. and is rather ineﬃcient compared with the other alternatives.2).4. respectively. The second alternative is hard to extend if we want to compute something extra. Int suﬃces for the result type. that is. and a nicely printed version of the travelling scheme.6. • deﬁne a single parser with the triple (Int.1) of grammar TS . type TS 2 = ([(Station. respectively. Int. but in some cases it might be more eﬃcient than the third alternative. The ﬁrst alternative parses the input three times. with Int (total travelling time). The third alternative needs an abstract syntax. If we just want to compute the total travelling time. and String (nicely printed version) as result type. For the input type we can choose between at least two possibilities: characters. Time)]. Char or tokens Token. data TS 1 = Single 1 Station | Cons 1 Station Time Time TS 1 where Station and Time are deﬁned above. So ts has one of the following two types. say a datatype TS . String) as result type.

4. TS 3 is useful when we want to compute waiting times since it combines arrival and departure times in one constructor. The result type of the parsing function ts may be one of types mentioned earlier (Int. Station) Which abstract syntax should we take? Again. but putting undeﬁned here ensures type correctness of the parser. a departure time.2: data TS 3 = Single 3 Station | Cons 3 (Station. TS 1 cannot be deﬁned as a type because of the two alternative productions for TS . 71 . and the second component of which is the ﬁnal arrival station. Time)]. Functional programming languages oﬀer some extra ﬂexibility that we sometimes use. Types and datatypes each have their advantages and disadvantages. Since TS 2 and TS 1 combine departure and arrival times in a tuple.7. etc. We use the following replacement rules. [(Time. whereas TS 2 is a type. running the parser will result in an error. A third abstract syntax corresponds to the second grammar deﬁned in Section 4. Station. Note that TS 1 is a datatype. grammar construct → | (space) ·+ ·∗ ·? terminal x begin of sequence of symbols Haskell/parser construct = <|> <∗> many 1 many option symbol x undeﬁned <$> Note that we start each sequence of symbols by undeﬁned <$>. Step 7: Constructing the basic parser Converting a grammar to a parser is a mechanical process that consists of a set of simple replacement rules. The undeﬁned has to be replaced by an appropriate semantic function in Step 6. the ﬁrst component of which is a list of triples consisting of a departure station.1) for travelling schemes. Of course. the application determines which to use. so if we use grammar (4. Often we want to exactly mimic the productions of the grammar in the abstract syntax. Time. 4. Time. TS 2 .). and an arrival time. Step 7: Constructing the basic parser So a travelling scheme is a tuple. TS 3 .7. we use TS 1 for the abstract syntax. this depends on what we want to do with the abstract syntax. they are convenient to use when computing travelling times. TS 2 can be deﬁned as a datatype by adding a constructor. or one of TS 1 . but usually writing a parser is a simple translation.

4. A basic parser from tokens To obtain a basic parser from tokens. Grammar and Parser design We construct a basic parser for each of the input types Char and Token. combine . station :: Parser Char Station station = undeﬁned <$> many 1 identiﬁer time :: Parser Char Time time = undeﬁned <$> natural <∗> symbol ’:’ <∗> natural departure. scanner :: String → [Token] scanner = mkTokens .2. words combine :: [String] → [String] 72 . we ﬁrst write a scanner that produces a list of tokens from a string. we obtain the following basic parser. The semantic functions are deﬁned in the next and ﬁnal step. For the other basic parsers we have chosen some reasonable return types.7. The semantic glue also determines the type of the function tsstring. 4. which is denoted by ? for the moment. Basic parsers from strings Applying these rules to the grammar (4.1.4. arrival :: Parser Char Time departure = undeﬁned <$> time arrival = undeﬁned <$> time tsstring :: Parser Char ? tsstring = undeﬁned <$> many ( undeﬁned <$> spaces <∗> station <∗> spaces <∗> departure <∗> spaces <∗> arrival ) <∗> spaces <∗> station spaces :: Parser Char String spaces = undeﬁned <$> many (symbol ’ ’) The only thing left to do is to add the semantic glue to the functions.2) for travelling schemes.7.

time This is a basic scanner with very basic error messages. xs)] tdeparture = [] tarrival (Time Token (h. m). tstoken 1 :: Parser Token ? tstoken 1 = undeﬁned <$> many ( undeﬁned <$> tstation <∗> tdeparture <∗> tarrival ) <∗> tstation tstation :: Parser Token Station tstation (Station Token s : xs) = [(s.2. m) : xs) = [((h. xs)] tarrival = [] where again the semantic functions remain to be deﬁned. Their presence reﬂects their presence in the grammar. tarrival :: Parser Token Time tdeparture (Time Token (h.4. m). xs)] = [] tstation tdeparture. Step 7: Constructing the basic parser combine [] = [] combine [x ] = [x ] combine (x : y : xs) = if isAlpha (head x ) ∧ isAlpha (head y) then combine ((x + " " + y) : xs) + + else x : combine (y : xs) mkToken :: String → Token mkToken xs = if isDigit (head xs) then Time Token (mkTime xs) else Station Token xs parse result :: [(a. Note that functions tdeparture and tarrival are the same functions. The composition of the scanner with the function tstoken 1 deﬁned below gives the ﬁnal parser.7. m) : xs) = [((h. tstoken 2 :: Parser Token ? tstoken 2 = undeﬁned 73 . b)] → a parse result xs | null xs = error "parse_result: could not parse the input" | otherwise = fst (head xs) mkTime :: String → Time mkTime = parse result . Another basic parser from tokens is based on the second grammar of Section 4. but it suﬃces for now.

we have to combine the two integers in a tuple. we need to add the semantic glue: the functions that take the results of the elements in the right hand side of a production. and then we replace id <$> time by just time. y) provided many returns a value of the desired type: [(Station. we can take the concatenation function concat for undeﬁned in function station. Since function many 1 identiﬁer returns a list of strings. since function time returns a value of type Time. and spaces. arrival . time.4. Time. the result of many is a string. we can take the identity function for undeﬁned in departure and arrival . Finally. departure. The ﬁrst semantic function for the basic parser tsstring deﬁned in Section 4. Now. and convert them into the result of the left hand side. a character.8.7. Note that this semantic function basically only throws away the value returned by the spaces parser: we are not interested in the spaces between the components of our 74 .1 returns an abstract syntax tree of type TS 2 . To obtain a value of type Time from an integer. Step 8: Adding semantic functions Once we have the basic parsing functions. So the ﬁrst undeﬁned in tsstring should return a tuple of a list of things of the correct type (the ﬁrst component of the type TS 2 ) and a Station. So we take the following function λx y → (x . The basic rule is: Let the types do the work! First we add semantic functions to the basic parsing functions station. Grammar and Parser design <$> tstation <∗> tdeparture <∗> many ( undeﬁned <$> tarrival <∗> tstation <∗> tdeparture ) <∗> tarrival <∗> tstation <|> undeﬁned <$> tstation 4. we can construct such a tuple by means of the function λx y → (x . Time)]. and we want to obtain the concatenation of these strings for the station name. so for undeﬁned in spaces we can take the identity function too. Since many returns a list of things. y) for undeﬁned in time. and an integer.

The second occurrence of function undeﬁned computes the waiting time for one intermediate station: λ(uh. The next semantic functions we deﬁne compute the net travel time. z ) Again. and to add the travel times of all of these trips.9. zm) → (zh − xh) ∗ 60 + zm − xm and Haskell’s prelude function sum sums these times. We have to give deﬁnitions of the three undeﬁned semantic functions. the results of spaces are thrown away. If a trip consists of a single station. um) (wh. wm) → (wh − uh) ∗ 60 + wm − um Finally. We obtain the travel time of a single trip if we replace the second occurrence of undeﬁned in tsstring by: λ (xh. This completes a parser for travelling schemes. so the last occurrence of undeﬁned is the function const 0.9. the ﬁrst occurrence of undeﬁned sums the list of waiting time obtained by means of the function that replaces the second occurrence of undeﬁned : λ x → sum x 4. 75 . we use a parser based on this grammar: the basic parser tstoken 2 . To compute the net travel time. so for the ﬁrst occurrence of undeﬁned we take: λx → sum x The ﬁnal set of semantic functions we deﬁne are used for computing the total waiting time.4. there is now waiting time. The many parser returns a value of the correct type if we replace the second occurrence of undeﬁned in tsstring by the function λ x y z → (x . we have to compute the travel time of each trip from a station to a station. Since the second grammar of Section 4. and whether or not you have made errors in the above process. xm) (zh. Summary This chapter describes the diﬀerent steps that have to be considered in the design of a grammar and a language. y. Step 9: Did you get what you expected In the last step you test your parser(s) to see whether or not you have obtained what you expected.2 combines arrival times and departure times. Step 9: Did you get what you expected travelling scheme.

10. Exercise 4. except for the nonterminal FloatLiteral . 76 . FractPart? ExponentPart? FloatSuﬃx ? | . Give this parsing scheme for Java ﬂoat literals.4. Exercise 4. parsers constructed on a (ﬁxed) abstract syntax have the same shape. Write a parser ﬂoatLiteral for Java ﬂoat-literals. assume that all nonterminals.2. Write an evaluator signedFloat for Java ﬂoat-literals (the ﬂoat-suﬃx may be ignored).3. Up to the deﬁnition of the semantic functions. are represented by a String in the abstract syntax. Exercises Exercise 4. Grammar and Parser design 4.1. The EBNF grammar for ﬂoat-literals is given by: FloatLiteral → IntPart . FractPart ExponentPart? FloatSuﬃx ? | IntPart ExponentPart FloatSuﬃx ? | IntPart ExponentPart? FloatSuﬃx → SignedInteger → Digits → ExponentIndicator SignedInteger → Sign? Digits → Digits Digit | Digit →+|→f|F|d|D IntPart FractPart ExponentPart SignedInteger Digits Sign FloatSuﬃx ExponentIndicator → e | E To keep your parser simple.

Compositional functions on datatypes are deﬁned in terms of algebras of semantic actions that correspond to the constructors of the datatype. These common patterns of recursion can conveniently be captured by higher order functions. • understand the connection between datatypes. • understand the notions of synthesised and inherited attributes. Such functions can be seen as computations that read values from a component of the domain and write values to a component of the codomain. • know how to write compositional functions. on (possibly mutually recursive) datatypes. • understand the advantages of using folds. The former values are called inherited attributes and the latter values are called synthesised attributes. Goals After studying this chapter and making the exercises you will • know how to generalise constructors of a datatype to an algebra. Compositionality Introduction Many recursive functions follow a common pattern of recursion. For example: many recursive functions deﬁned on lists are instances of the higher order function foldr . Attributes can be both inherited and synthesised. also known as folds. Such functions are called compositional. one can associate a (possibly mutually recursive) datatype. • have seen the notions of fusion and deforestation. Such semantics is referred to as algebraic semantics.6. Algebraic semantics often uses algebras that contain functions from tuples to tuples. and have seen that many problems can be solved with a fold. • know how to write syntax driven code. As explained in Section 2.5. abstract syntax and concrete syntax. which describes the abstract syntax of the language. • know that a fold applied to the constructors algebra is the identity function. Compositional functions on these datatypes are called syntax driven. which describes the concrete syntax of a language. It is possible to deﬁne a function such as foldr for a whole range of datatypes other than lists. there is an important relationship between grammars and compositionality: with every grammar. Compositional functions can typically be used to deﬁne the semantics of programming languages constructs. 77 .

Compositional functions are deﬁned on the built-in datatype [a] for lists (Section 5.1 shows how to deﬁne compositional recursive functions on built-in lists using a function which is similar to foldr and shows how to do the same thing with user-deﬁned lists and streams. Organisation The chapter is organised as follows. Lists This section introduces compositional functions on the well known datatype of lists.5 presents a relatively complex example of the use of tuples in combination with compositionality.1.1.2).1.5. the type [x ] is recursive.1. Built-in lists The datatype of lists is perhaps the most important example of a datatype.2 shows how to deﬁne compositional recursive functions on several kinds of trees. Thus. Compositionality • be able to associate inherited and synthesised attributes with the diﬀerent alternatives of (possibly mutually recursive) datatypes (or the diﬀerent nonterminals of grammars). data [x ] = x : [x ] | [] 78 . They only diﬀer in the way they handle basic expressions (variables and local deﬁnitions are handled in the same way). an expression interpreter which makes use of a stack and expression compiler to a stack machine.1). followed by a tail xs which itself is again a list.3).1. and on streams or inﬁnite lists (Section 5.1. Section 5. • be able to deﬁne algebraic semantics in terms of compositional (or syntax driven) code that is deﬁned using algebras of computations which make use of inherited and synthesised attributes. 5. All three examples use an algebra of computations which can read values from and write values to an environment which binds names to values. Section 5. The informal deﬁnition of above corresponds to the following (pseudo) datatype. A list is either the empty list [] or a nonempty list (x : xs) consisting of an element x at the head of the list. In Haskell the type [x ] for lists is built-in. 5.4 shows the usefulness of algebraic semantics by presenting an expression evaluator. Section 5. Section 5.3 deﬁnes algebraic semantics. on a user-deﬁned datatype List a for lists (Section 5. We also show how to construct an algebra that directly corresponds to a datatype. In a second version of the expression evaluator the use of inherited and synthesised attributes is made more explicit by using tuples. Section 5. It deals with the problem of variable scope when compiling block structured languages.

e.3. which works the other way around.4] 24 The pair (op. 1)). The similarity between the deﬁnitions of the functions sumL and prodL can be captured by the following higher order recursive function foldL. Don’t confuse foldL with Haskell’s prelude function foldl . which is nothing else but an uncurried version of the well known function foldr .1. sumL. a list-algebra consists of a type l (the carrier of the algebra). Note that a type (like Int) can be the carrier of a listalgebra in more than one way (for example using ((+). We can now use foldL to compute the sum and product of the elements of a list of integers as follows. where the L denotes that it is a sum function deﬁned on lists) is very similar to computing the product of the elements of a list of integers (prodL). 0) and ((∗). Here is another example of how to turn Int into a list-algebra. Note that we have replaced the constructor (:) (a constructor with two arguments) by binary operators (+) and (∗) (i. foldL :: (x → l → l . ‘functions’ with zero variables). More precisely.2.0) [1.5.1) [1.0) [1. Computing the sum of the elements of a list of integers (sumL.2.e. a binary operator op of type x → l → l and a constant c of type l . ? foldL ((+).3. l ) → [x ] → l foldL (op.2.3. c) = fold where fold (x : xs) = op x (fold xs) fold [] =c The function foldL recursively replaces the constructor (:) by an operator op and the constructor [] by a constant c.4] 10 ? foldL ((*).4] 4 79 . c) is often referred to as a list-algebra. ? foldL (\_ n -> n+1. Lists Many recursive functions on lists look very similar. functions with two arguments) and the constructor [] (a constructor with zero variables) by constants 0 and 1 (i. prodL :: [Int] → Int sumL (x : xs) = x + sumL xs sumL [] =0 prodL (x : xs) = x ∗ prodL xs prodL [] =1 The function sumL replaces the list constructor (:) by (+) and the list constructor [] by 0 and the function prodL replaces the list constructor (:) by (∗) and the list constructor [] by 1.

arguments of type List x ) are replaced by recursive function arguments (i. ? :t Cons Cons :: a -> List a -> List a ? :t Nil Nil :: List a A algebra type ListAlgebra corresponding to the datatype List directly follows the structure of that datatype. The constructors (:) and [] are replaced by constructors Cons and Nil . arguments of type l ). The types of the other arguments are simply left unchanged. 5. and increments the result obtained thus far with one. User-deﬁned lists In this subsection we present an example of a fold function deﬁned on another datatype than built-in lists.1.e.5.2. The types of recursive constructor arguments (i. nil ) = fold 80 . type ListAlgebra x l = (x → l → l .e. the deﬁnition of a fold function can be generated automatically from the data deﬁnition. data List x = Cons x (List x ) | Nil User-deﬁned lists are deﬁned in the same way as built-in ones. To keep things simple we redo the list example for user-deﬁned lists. l ) The left hand side of the type deﬁnition is obtained from the left hand side of the datatype as follows: a postﬁx Algebra is added at the end of the name List and a type variable l is added at the end of the whole left hand side of the type deﬁnition. The type of the elements of the list does not play a role. In a similar way. Here are the types of the constructors Cons and Nil . The right hand side of the type deﬁnition is obtained from the right hand side of the data deﬁnition as follows: all List x valued constructors are replaced by l valued functions which have the same number of arguments (if any) as the corresponding constructors. Compositionality This list-algebra ignores the value at the head of a list. It corresponds to the function sizeL deﬁned by: sizeL :: [x ] → Int sizeL ( : xs) = 1 + sizeL xs sizeL [] =0 Note that the type of sizeL is more general than the types of sumL and prodL. foldList :: ListAlgebra x l → List x → l foldList (cons.

Recursive functions on user-deﬁned lists which are deﬁned by means of foldList are called compositional. They correspond to the examples of Section 5. data Stream x = And x (Stream x ) Here is a standard example of a stream: the inﬁnite list of ﬁbonacci numbers.1. Nil ) idList :: List x → List x idList = foldList idListAlgebra ? idList (Cons 1 (Cons 2 Nil)) Cons 1 (Cons 2 Nil) 5.5. ﬁbStream :: Stream Int ﬁbStream = And 0 (And 1 (restOf ﬁbStream)) where restOf (And x stream@(And y )) = And (x + y) (restOf stream) streams 81 . The function fold is applied recursively to all recursive constructor arguments of type List x to return a value of type l as required by the functions of the algebra (in this case Cons and cons have one such recursive argument). Lists where fold (Cons x xs) = cons x (fold xs) fold Nil = nil The constructors Cons and Nil in the left hand sides of the deﬁnition of the local function fold are replaced by functions cons and nil in the right hand sides. prodList :: List Int → Int sumList = foldList ((+). Streams In this section we consider streams (or inﬁnite lists). 0) prodList = foldList ((∗). sumList. Every algebra deﬁnes a unique compositional function.1.1. The other arguments are left unchanged.3. Here are three examples of compositional functions. idListAlgebra :: ListAlgebra x (List x ) idListAlgebra = (Cons.1. 1) sizeList :: List x → Int sizeList = foldList (const (1+). This algebra deﬁnes the identity function on user-deﬁned lists. 0) It is worth mentioning one particular ListAlgebra: the trivial ListAlgebra that replaces Cons by Cons and Nil by Nil .

Trees Now that we have seen how to generalise the foldr function on built-in lists to compositional functions on user-deﬁned lists and streams we proceed by explaining another common class of datatypes: trees. Binary trees A binary tree is either a node where the tree splits into two subtrees or a leaf which holds a value. and the fold function foldStream can be generated automatically from the datatype Stream. We will treat four diﬀerent kinds of trees in the subsections below: • • • • binary trees. Note that Bin has two recursive arguments and that Leaf has one non-recursive argument.2. trees for matching parentheses. For the same reason the fold function is deﬁned using only one equation.5. Here is an example of using a compositional function on user deﬁned streams. we will brieﬂy mention the concepts of fusion and deforestation. x → t) 82 . type BinTreeAlgebra x t = (t → t → t.1. Compositionality The algebra type StreamAlgebra. expression trees. 5. It computes the ﬁrst element of a monotone stream that is greater or equal than a given value. Furthermore. type StreamAlgebra x s = x → s → s foldStream :: StreamAlgebra x s → Stream x → s foldStream and = fold where fold (And x xs) = and x (fold xs) Note that the algebra has only one component because Stream has only one constructor.2. data BinTree x = Bin (BinTree x ) (BinTree x ) | Leaf x One can generate the corresponding algebra type BinTreeAlgebra and fold function foldBinTree from the datatype automatically. ﬁrstGreaterThan :: Ord x ⇒ x → Stream x → x ﬁrstGreaterThan n = foldStream (λx y → if x n then x else y) 5. general trees.

then sizeBinTree ignores the value at the leaf and returns 1 as the size of the tree.1 deﬁnes the datatype Parentheses for matching parentheses. in the same way as we did for lists and binary trees. data Parentheses = Match Parentheses Parentheses | Empty For example. then sizeBinTree returns the sum of the sizes of those subtrees as the size of the tree. sizeBinTree :: BinTree x → Int sizeBinTree = foldBinTree ((+). Similarly. Remember that the abstract syntax ignores the terminal bracket symbols of the concrete syntax. We can now deﬁne compositional functions on binary trees much in the same way as we deﬁned them on lists. which can be used to compute the depth (depthParentheses) and the width (widthParentheses) of matching parenthesis in a compositional way. The depth of a string of matching parentheses 83 . leaf ) = fold where fold (Bin l r ) = bin (fold l ) (fold r ) fold (Leaf x ) = leaf x In the BinTreeAlgebra type. an algebra type ParenthesesAlgebra and a fold function foldParentheses. in the foldBinTree function.3. If a tree consists of two subtrees. We can now deﬁne.2. Trees for matching parentheses Section 3. the local fold function is applied recursively to both arguments of bin and is not called on the argument of leaf . the sentence ()() of the concrete syntax for matching parentheses is represented by the value Match Empty (Match Empty Empty) in the abstract syntax Parentheses. Here is an example: the function sizeBinTree computes the size of a binary tree. Trees foldBinTree :: BinTreeAlgebra x t → BinTree x → t foldBinTree (bin. Functions for computing the sum and the product of the integers at the leafs of a binary tree can be deﬁned in a similar way.2.5. 5. const 1) ? sizeBinTree (Bin (Bin (Leaf 3) (Leaf 7)) (Leaf 11)) 3 If a tree consists of a leaf.2. It suﬃces to deﬁne appropriate semantic actions bin and leaf on a type t (in this case Int) that correspond to the syntactic constructs Bin and Leaf of the datatype BinTree. the bin part of the algebra has two arguments of type t and the leaf part of the algebra has one argument of type x .

0) depthParentheses. the width of the string ((()))() is 2. the depth of the string ((()))() is 3. The algebra used by the function a2cParentheses simply reinserts those terminals that we have deleted. empty) = fold where fold (Match l r ) = match (fold l ) (fold r ) fold Empty = empty depthParenthesesAlgebra :: ParenthesesAlgebra Int depthParenthesesAlgebra = (λx y → max (1 + x ) y. type ParenthesesAlgebra m = (m → m → m. The width of a string of matching parentheses s is the the number of substrings that are matching parentheses themselves. Compositional functions on datatypes that describe the abstract syntax of a language are called syntax driven. Note that a2cParentheses does not deal with layout such as blanks. 0) widthParenthesesAlgebra :: ParenthesesAlgebra Int widthParenthesesAlgebra = (λ y → 1 + y . m) foldParentheses :: ParenthesesAlgebra m → Parentheses → m foldParentheses (match. For large examples layout is very important: it can be used to let concrete representations look pretty. For a simple example layout does not really matter. indentation and newlines.5. For example. Fortunately. 84 . we can easily write a program that computes the concrete representation from the abstract one. Compositionality s is the largest number of unmatched parentheses that occurs in a substring of s. What is the concrete representation of the matching parenthesis example represented by parenthesesExample? It happens to be ((()))()(()). widthParentheses :: Parentheses → Int depthParentheses = foldParentheses depthParenthesesAlgebra widthParentheses = foldParentheses widthParenthesesAlgebra parenthesesExample = Match (Match (Match Empty Empty) Empty) (Match Empty (Match (Match Empty Empty) Empty ) ) ? depthParentheses parenthesesExample 3 ? widthParentheses parenthesesExample 3 Our example reveals that abstract syntax is not very well suited for interpretation by human beings. which are not a substring of a surrounding string of matching parentheses. For example. We know exactly which terminals we have deleted when going from the concrete syntax to the abstract one.

2. Note that function nesting ﬁrst builds a tree by means of function parens. Function nesting never builds a tree. and then ﬂattens it by means of the fold. "") + + + a2cParentheses :: Parentheses → String a2cParentheses = foldParentheses a2cParenthesesAlgebra ? a2cParentheses parenthesesExample ((()))()(()) This example illustrates that a computer can easily interpret abstract syntax (something human beings have diﬃculties with).1 again. open = symbol ’(’ close = symbol ’)’ parens :: Parser Char Parentheses parens = (λ x y → Match x y) <$> open <∗> parens <∗> close <∗> parens <|> succeed Empty nesting :: Parser Char Int nesting = (λ x y → max (1 + x ) y) <$> open <∗> nesting <∗> close <∗> nesting <|> succeed 0 Function nesting could have been deﬁned by means of function parens and a fold: nesting :: Parser Char Int nesting = depthParentheses <$> parens (Remember that depthParentheses has been deﬁned as a fold. So for reasons of ‘programming eﬃciency’ function nesting is preferable. and is thus preferable for reasons of eﬃciency.) Functions nesting and nesting compute exactly the same result. in function nesting we have to write our own parser. The function nesting is the fusion of the fold function with the parser parens from the function nesting . This is the place where parsing enters the picture: computing an abstract representation from a concrete one is precisely what parsers are used for. Trees a2cParenthesesAlgebra :: ParenthesesAlgebra String a2cParenthesesAlgebra = (λxs ys → "(" + xs + ")" + ys. What we would really like is that computers can interpret concrete syntax as well. Consider the functions parens and nesting of Section 3. Strangely enough. On the other hand: in function nesting we reuse the parser parens and the function depthParentheses.3. human beings can easily interpret concrete syntax (something computers have diﬃculties with). Using laws for parsers and folds (not shown here) we can prove that the two functions are equal. To obtain the best of both 85 .5. and convince ourselves that it is correct.

and F uses E . T . Expression trees The matching parentheses grammar has only one nonterminal. 5. e → t → e) . and F . Compositionality deforestation worlds. we will study the datatypes E . (e → f . the following abstract syntax is more convenient.6 we transform the nonterminals to datatypes: data E = E1 T | E2 E T data T = T1 F | T2 T F data F = F1 E | F2 Int where we have translated Digs by the type Int. Therefore its abstract syntax is described by a single datatype. T . The automatic transformation of function nesting into function nesting is called deforestation (trees are removed). as always. Int → f ) ) 86 . T uses F . data Expr = Con Int | Add Expr Expr | Mul Expr Expr However. these three types are mutually recursive. and F deﬁned above. Note that this is a rather inconvenient and clumsy abstract syntax for expressions. the one corresponding to the main datatype and is therefore the one corresponding to the start-symbol). Some (very few) compilers are clever enough to perform this transformation automatically. type EAlgebra e t f = ((t → e.5: E E T T F F →T →E +T →F →T *F →(E ) → Digs This grammar has three nonterminals. E . we would like to write function nesting and have our compiler ﬁgure out that it is better to use function nesting in computations. Since E uses T . t → f → t) .3. The main datatype of the three datatypes is the one corresponding to the start symbol E . (f → t. Using the approach from Section 2.5.2. In this section we look again at the expression grammar of Section 3. Since the datatypes are mutually recursive. the algebra type EAlgebra consists of three tuples of functions and three carriers (the main carrier is. to illustrate the concept of mutual recursive datatypes.

evalE :: E → Int evalE = foldE ((id . foldE :: EAlgebra e t f → E → e foldE ((e1 . (f1 . (t1 .2. (f1 . In the algebra that is used in the foldE . id )) exE = E2 (E1 (T2 (T1 (F2 2)) (F2 3))) (T1 (F2 1)) ? evalE exE 7 Once again our example shows that abstract syntax cannot easily be interpreted by human beings. e2 ). (+)). and t are instantiated with Int. Here is a function a2cE which does this job for us.5. t2 ). Trees The fold function foldE for E also folds over T and F . all type variables e. so it uses three mutually recursive local functions. t2 ). e2 ). a2cE :: E → String a2cE = foldE ((e1 . (id . f . (t1 . f2 )) = fold where fold (E1 t) = e1 (foldT t) fold (E2 e t) = e2 (fold e) (foldT t) foldT (T1 f ) = t1 (foldF f ) foldT (T2 t f ) = t2 (foldT t) (foldF f ) foldF (F1 e) = f1 (fold e) foldF (F2 n) = f2 n We can now use foldE to write a syntax driven expression evaluator evalE . (id . (∗)). f2 )) where e1 = λt → t e2 = λe t → e + "+" + t + + t1 = λf → f t2 = λt f → t + "*" + f + + f1 = λe → "(" + e + ")" + + f2 = λn → show n ? a2cE exE "2*3+1" 87 .

Notice that this list may be empty (in which case. General trees A general tree consist of a node. As usual. Therefore the node function of the algebra has a corresponding list of recursive arguments. the fold is usually not linear. For example. Eﬃciency A fold takes a value of a datatype. If the evaluation of each of these functions on their arguments takes constant time. but if the functions in the algebra are not constant. A technique that often can be used in such a case is the accumulating parameter technique. For example.4.2.5.5. and replaces its constructors by functions. So. folds are often eﬃcient functions.2. However. Often such a nonlinear fold can be transformed into a more eﬃcient function. []) + then function reverse takes time quadratic in the length of its argument list. Compositionality 5. of course. and it follows that if we deﬁne the function reverse by reverse :: [a] → [a] reverse = foldL (λx xs → xs + [x ]. 5. One can compute the sum of the values at the nodes of a general tree as follows: sumTree :: Tree Int → Int sumTree = foldTree (λx xs → x + sum xs) Computing the product of the values at the nodes of a general tree and computing the size of a general tree can be done in a similar way. for the reverse function we have reverse x = reverse x [] 88 . holding a value. The local fold function is recursively called on all elements of a list using the map function. list concatenation is linear in its left argument. the type TreeAlgebra and the function foldTree can be generated automatically from the data deﬁnition. only the value at the node is of interest). where the tree splits into a list of subtrees. some functions require more than constant evaluation time. evaluation of the fold takes time linear in the number of constructors in its argument. data Tree x = Node x [Tree x ] type TreeAlgebra x a = x → [a] → a foldTree :: TreeAlgebra x a → Tree x → a foldTree node = fold where fold (Node x gts) = node x (map fold gts) Notice that the constructor Node has a list of recursive arguments.

which maps a function over the elements at the leaves. (Recall the rules 1 = r1 + r1 and r = r1 + r2 . Deﬁne the datatype Resist to represent resistors. data LNTree a b = Leaf a | Node (LNTree a b) b (LNTree a b) Exercise 5.3.1.1. A path through a binary tree describes the route from the root of the tree to some leaf. 1.4. and then using a fold. 3. we have seen how to associate an algebra type and a fold function with every datatype. Deﬁne an algebra type and a fold function for the following datatype. they can be put in parallel (:|:) or in sequence (:∗:). sp. Exercise 5. There are some basic resistors with a ﬁxed (ﬂoating point) resistance and. In the previous section. Deﬁne this function ﬁrst using explicit recursion. given two resistors. Exercise 5. Exercise 5. height.) r 1 2 5. 2. see Section 5.2. 2. 5. Deﬁne a compositional function allPaths which produces all paths of a given tree. which returns the height of a tree.3. which returns the maximal value at the leaves. Deﬁne a compositonal function result which determines the resistance of a resistors. 4. We choose to represent paths by sequences of Direction’s: data Direction = L | R in such a way that taking the left subtree in an internal node will be encoded by L and taking the right subtree will be encoded by R. mapBinTree.2. maxBinTree. deﬁne the type ResistAlgebra and the corresponding function foldResist. Deﬁne the following functions as folds on the datatype BinTree. and establishes terminology. which returns the length of a shortest path. This exercise deals with resistors. ﬂatten. Also.3. or family of mutually recursive datatypes. which returns the list of leaf values in left-to-right order. Algebraic semantics reverse :: [a] → [a] → [a] reverse [] ys = ys reverse (x : xs) ys = reverse xs (x : ys) The evaluation of reverse xs takes time linear in the length of xs. 1. Algebraic semantics This section summarizes what we have seen before. The algebra is a tuple (one component per datatype) of tuples (one component per constructor 89 .5.

3: here we will use a single. compositional initial algebra algebra homomorphism algebraic semantics 5.2. multiplication and division are deﬁned. when applied via a fold. also called the carrier of the algebra.1. The function resulting from applying an algebra to a fold function is sometimes also called an algebra homomorphism. If we work with a family of mutually recursive datatypes. subtraction. Compositionality carrier of the datatype) of semantic actions. Expressions The ﬁrst part of this section presents a basic expression evaluator. and the others auxiliary carriers. This algebra. We will use another datatype for expressions than the one introduced in Section 5. There is one special algebra: the one whose components are the constructor functions of the datatypes we operate on. and hence non mutual recursive datatype for expressions. 5. deﬁnes the identity function.4. We restrict ourselves to ﬂoat valued expressions on which addition. Evaluating expressions In this subsection we start with a more involved example: an expression evaluator.4. We call the carrier corresponding to the main datatype of the family the main carrier. The evaluator is extended with variables in the second part and with local deﬁnitions in the third part. The fold function traverses a value and recursively replaces syntactic constructors of the datatypes by the corresponding semantic actions of the algebra.5. and is called the initial algebra. The algebra is parameterized over the result type of the computation. and the algebra is parameterized over all of them. Functions which are deﬁned in terms of a fold function and an algebra are called compositional functions. and the compositional function is said to deﬁne algebraic semantics. The datatype and the corresponding algebra type and fold function are as follows: inﬁxl 7 ‘Mul ‘ inﬁx 7 ‘Dvd ‘ inﬁxl 6 ‘Add ‘. ‘Min‘ data Expr = Expr | Expr | Expr | Expr | Num ‘Add ‘ Expr ‘Min‘ Expr ‘Mul ‘ Expr ‘Dvd ‘ Expr Float — add type ExprAlgebra a = (a → a → a 90 . we need one carrier per datatype.

Float → a) — — — — min mul dvd num foldExpr :: ExprAlgebra a → Expr → a foldExpr (add .2. Note that we use the same name (Expr ) for the datatype. (−). min. Computing the result of an expression now simply consists of replacing the constructors by appropriate functions. the fact that Expr does not have an extra parameter x like the list and tree examples. v ) ← env . mul .5. id ) 5. data Expr = Expr ‘Add ‘ Expr | Expr ‘Min‘ Expr | Expr ‘Mul ‘ Expr 91 . dvd . The values of variables are typically looked up in an environment which binds the names of the variables to values.a → a → a . We implement an environment as a list of name-value pairs. (∗). num) = fold where fold (expr 1 ‘Add ‘ expr 2 ) = fold expr 1 ‘add ‘ fold fold (expr 1 ‘Min‘ expr 2 ) = fold expr 1 ‘min‘ fold fold (expr 1 ‘Mul ‘ expr 2 ) = fold expr 1 ‘mul ‘ fold fold (expr 1 ‘Dvd ‘ expr 2 ) = fold expr 1 ‘dvd ‘ fold fold (Num n) = num n expr 2 expr 2 expr 2 expr 2 There is nothing special to notice about these deﬁnitions except.4. Adding variables Our next goal is to extend the evaluator of the previous subsection such that it can handle variables as well.4.a → a → a . value)] (?) :: Eq name ⇒ Env name value → name → value env ? x = head [v | (y. (/). although it diﬀers from the previous Expr datatype.a → a → a . For our purposes names are strings and values are ﬂoats. resultExpr :: Expr → Float resultExpr = foldExpr ((+). In the following programs we will use the following functions and types: type Env name value = [(name. Expressions . x = = y] type Name = String type Value = Float The datatype and the corresponding algebra type and fold function are now as follows. perhaps.

a → a → a . id .3)] (Var "x" ‘Mul‘ Num 2) 6 The good way of using an environment is the following: instead of working with a computation which. Thus we turn the environment in a ‘local’ variable. Name → a) — — — — — — add min mul dvd num var foldExpr :: ExprAlgebra a → Expr → a foldExpr (add . (env ?)) ? resultExprBad [("x".a → a → a .a → a → a . the argument of foldExpr now has an extra component: the unary function var which corresponds to the unary constructor Var . (/). mul . Similarly. resultExprBad :: Env Name Value → Expr → Value resultExprBad env = foldExpr ((+). yields an algebra of values it is better to turn the computation itself into an algebra. (|∗|). Here is a ﬁrst. the basic idea is that we use the environment as a global variable here). given an environment. num. var ) = fold where fold (expr 1 ‘Add ‘ expr 2 ) = fold expr 1 ‘add ‘ fold fold (expr 1 ‘Min‘ expr 2 ) = fold expr 1 ‘min‘ fold fold (expr 1 ‘Mul ‘ expr 2 ) = fold expr 1 ‘mul ‘ fold fold (expr 1 ‘Dvd ‘ expr 2 ) = fold expr 1 ‘dvd ‘ fold fold (Num n) = num n fold (Var x ) = var x expr 2 expr 2 expr 2 expr 2 The datatype Expr now has an extra constructor: the unary constructor Var . (|/|) :: (Env Name Value → Value) → (Env Name Value → Value) → (Env Name Value → Value) f |+| g = λenv → f env + g env f |−| g = λenv → f env − g env 92 . (∗). Computing the result of an expression somehow needs to use an environment. (|+|). (−). min. (|−|). bad way of doing this: one can use it as an argument of a function that computes an algebra (we will explain why this is a bad choice in the next subsection. Compositionality | Expr ‘Dvd ‘ Expr | Num Value | Var Name type ExprAlgebra a = (a → a → a . Value → a . dvd .5.

In this case the computation is of the form env → val .4.4. The environment which is used by the computation of a subexpression of an expression is inherited from the computation of the expression. ﬂip (?)) ? resultExprGood (Var "x" ‘Mul‘ Num 2) [("x". synthesised attribute inherited attribute 5. Since we are working with abstract syntax we say that the synthesised and inherited attributes are attributes of the datatype Expr . The datatype (called Expr again) and the corresponding algebra type and eval function are now as follows: data Expr = Expr ‘Add ‘ Expr | Expr ‘Min‘ Expr | Expr ‘Mul ‘ Expr 93 . (|/|). Computing the result of a variable consists of looking it up in the environment. (|−|). (|∗|) and (|/|) on computations. The value type is an example of a synthesised attribute. A (local) deﬁnition is an expression of the form Def name expr 1 expr 2 which should be interpreted as: let the value of name be equal to expr 1 in expression expr 2 .5. (∗) and (/) on values are now replaced by corresponding actions (|+|).3)] 6 The actions (+). If Expr is one of the mutually recursive datatypes which are generated from the nonterminals of a grammar. (|−|). const. (−). Adding deﬁnitions Our next goal is to extend the evaluator of the previous subsection such that it can handle deﬁnitions as well. Expressions f |∗| g = λenv → f env ∗ g env f |/| g = λenv → f env / g env resultExprGood :: Expr → (Env Name Value → Value) resultExprGood = foldExpr ((|+|). Variables are typically deﬁned by updating the environment with an appropriate name-value pair. Computing the result of a constant does not need the environment at all. the algebraic semantics of an expression is a computation which yields a value.3. Computing the result of the sum of two subexpressions within a given environment consists of computing the result of the subexpressions within this environment and adding both results to yield a ﬁnal result. The value of an expression is synthesised from values of its subexpressions. (|∗|). The environment type is an example of an inherited attribute. Thus. then we say that the synthesised and inherited attributes are attributes of the nonterminal.

Notice that the last two arguments are recursive ones. min. which can be used to introduce a local variable. dvd .5. (/). (−).a → a → a . (env ?). We can now explain why the ﬁrst use of environments is inferior to the second one. Trying to extend the ﬁrst deﬁnition gives something like: resultExprBad :: Env Name Value → Expr → Value resultExprBad env = foldExpr ((+). For example. the parameter of foldExpr now has an extra component: the ternary function def which corresponds to the ternary constructor Def . mul . Name → a .a → a → a . Compositionality | | | | Expr ‘Dvd ‘ Expr Num Value Var Name Def Name Expr Expr — — — — — — — add min mul dvd num var def type ExprAlgebra a = (a → a → a . def ) = fold where fold (expr 1 ‘Add ‘ expr 2 ) = fold expr 1 ‘add ‘ fold expr 2 fold (expr 1 ‘Min‘ expr 2 ) = fold expr 1 ‘min‘ fold expr 2 fold (expr 1 ‘Mul ‘ expr 2 ) = fold expr 1 ‘mul ‘ fold expr 2 fold (expr 1 ‘Dvd ‘ expr 2 ) = fold expr 1 ‘dvd ‘ fold expr 2 fold (Num n) = num n fold (Var x ) = var x fold (Def x value body) = def x (fold value) (fold body) Expr now has an extra constructor: the ternary constructor Def . seconds = Def "days_per_year" (Num Def "hours_per_day" (Num Def "minutes_per_hour" (Num Def "seconds_per_minute" (Num Var "days_per_year" ‘Mul ‘ Var "hours_per_day" ‘Mul ‘ Var "minutes_per_hour" ‘Mul ‘ Var "seconds_per_minute" 365) ( 24) ( 60) ( 60) ( )))) Similarly. var . id . (∗). num. the following expression can be used to compute the number of seconds per year. Value → a . Name → a → a → a) foldExpr :: ExprAlgebra a → Expr → a foldExpr (add . error "def") 94 .a → a → a .

A stack (a value of type Stack v for some value type v ) is a list of values. We can then use this stack machine to evaluate compiled expressions. Extending the second deﬁnition causes no problems: the environment is now accessible in the algebra (which consists of computations). the binding for x hides possible other bindings for x .5. y) to an environment (in the deﬁnition of the operator |=|). (|−|). The computation corresponding to the body of an expression with a local deﬁnition can now be evaluated in the updated environment. and pushes the result back on the stack.4. (|∗|).4. (|=|)) ? resultExprGood seconds [] 31536000 Note that by prepending a pair (x . We cannot update the environment in this setting: we can read the environment but afterwards it is not accessible any more in the algebra (which consists of values). f f f f x |+| g |−| g |∗| g |/| g |=| f = λenv = λenv = λenv = λenv = λg env →f →f →f →f →g env + g env env − g env env ∗ g env env / g env ((x . Expressions The last component causes a problem: a body that contains a local deﬁnition has to be evaluated in an updated environment. applies the operator. from which you can pop values. (|/|).4. on which you can push values. Imagine a simple computer for evaluating arithmetic expressions. data MachInstr v = Push v | Apply (v → v → v ) type MachProgr v = [MachInstr v ] An instruction either pushes a value of type v on the stack. and from which 95 . We can easily add a new action which updates the environment. const. This section is inspired by an example in the textbook by Bird and Wadler [3]. or it executes an operator that takes the two top values of the stack. This computer has a ‘stack’ and can execute ‘instructions’ which change the value of the stack. Compiling to a stack machine In this section we compile expressions to instructions on a stack machine. we add the pair to the environment. The class of possible instructions is deﬁned by the following datatype. f env ) : env ) resultExprGood :: Expr → (Env Name Value → Value) resultExprGood = foldExpr ((|+|). ﬂip (?). 5. By deﬁnition of (?).

Exercise 5. dvd . The function der is deﬁned by der (e1 ‘Add ‘ e2 ) dx = der e1 dx ‘Add ‘ der e2 dx der (e1 ‘Min‘ e2 ) dx = der e1 dx ‘Min‘ der e2 dx der (e1 ‘Mul ‘ e2 ) dx = (e1 ‘Mul ‘ der e2 dx ) ‘Add ‘ (der e1 dx ‘Mul ‘ e2 ) 96 .6.5. which determines whether or not an expression is a sum. Deﬁne the following functions as folds on the datatype Expr that contains deﬁnitions. min. Compositionality you can take the top value.5. A module Stack for stacks can be found in Appendix A. num. deﬁned by: compileExpr :: Expr → Env Name (MachProgr Value) → MachProgr Value compileExpr = foldExpr (add . var . An expression can be translated (or compiled) into a list of instructions by the function compile. This exercise deals with expressions without deﬁnitions. fd env ) : env ) Exercise 5. isSum. 1. which returns the list of variables that occur in the expression. 2. vars. The eﬀect of executing an instruction of type MachInstr is deﬁned by execute :: MachInstr v → Stack v → Stack v execute (Push x ) s = push x s execute (Apply op) s = let a = top s t = pop s b = top t u = pop t in push (op a b) u A sequence of instructions is executed by the function run deﬁned by run :: MachProgr v → Stack v → Stack v run [] s =s run (x : xs) s = run xs (execute x s) It follows that run can be deﬁned as a foldl . mul . def ) where f ‘add ‘ g = λenv → f env + g env + [Apply (+)] + + f ‘min‘ g = λenv → f env + g env + [Apply (−)] + + f ‘mul ‘ g = λenv → f env + g env + [Apply (∗)] + + f ‘dvd ‘ g = λenv → f env + g env + [Apply (/)] + + num v = λenv → [Push v ] var x = λenv → env ? x def x fd fb = λenv → fb ((x .

yields the element at the unique leaf. but then it is impossible to write replace as a fold. Exercise 5. (use z . y and z are variables). addition and substraction. a variable usage or a nested block.7. A variable from a global scope is visible in a local scope only if it is not hidden by a variable with the same name in the local scope. dcl x . use stands for usage and x.5. Consider the datatype of paths introduced in Exercise 5. use y . use y 97 . Give an informal description of the function der . The concrete representation of an example block of our block structured language looks as follows (dcl stands for declaration. given a tree and a path in the tree. dcl x . Why is the function der not compositional ? 3. Also. Deﬁne the function replace. Block structured languages der (e1 ‘Dvd ‘ e2 ) dx = ((e2 ‘Mul ‘ der e1 dx ) ‘Min‘ (e1 ‘Mul ‘ der e2 dx )) ‘Dvd ‘ (e2 ‘Mul ‘ e2 ) der (Num f ) dx = Num 0 der (Var s) dx = if s = = dx then Num 1 else Num 0 1. use x .5. 5. It is easy to write a function with the required functionality if you swap the arguments.1.2. Deﬁne a datatype Exp to represent expressions consisting of (ﬂoating point) constants.5. 4. see Section 5. The example deals with the scope of variables in a block structured language. deﬁne the type ExpAlgebra and the corresponding foldExp. 2. Deﬁne the function der on Exp and show that this function is compositional. which when given m replaces all the leaves by m. dcl y . use x) .5. which given a binary tree and an element m replaces the elements at the leaves by m as a fold on the datatype BinTree. Exercise 5. Block structured languages This section presents a more complex example of the use of tuples in combination with compositionality. Note that the fold returns a function. A statement is a variable declaration. variables. dcl z . A path in a tree leads to a unique leaf.3. Deﬁne a compositonal function path2Value which.8. Blocks A block is a list of statements. 5.1.

The usage of z refers to the local declaration (the only declaration of z). 0) . and the fold function foldBlock . which describe the abstract syntax of blocks. Access (0. Enter (1. c): accesses the c’th variable of the l ’th nested block. Idf → s.5. empty). c): leaves the l ’th nested block in which c local variables were declared. There are three types of instructions. The code generated for the above example looks as follows. Access (1. Nested blocks are surrounded by parentheses. corresponding to the grammar that describes the concrete syntax of blocks which is used above. (Idf → s. Leave (1. 1). (dcl . Generating code The goal of this section is to generate code from a block. use. As usual. type Block = [Statement] data Statement = Dcl Idf | Use Idf | Blk Block type Idf = String type BlockAlgebra b s = ((s → b → b. The local usage of x refers to the local declaration and the global usage of x refers to the global declaration. 2). Access (1. 0). 2). which consists of two tuples of functions. The usage of y refers to the global declaration (the only declaration of y). which uses two mutually recursive local functions. [Enter (0. Here are some mutually recursive (data)types.5. blk )) = fold where fold (s : b) = cons (foldS s) (fold b) fold [] = empty foldS (Dcl x ) = dcl x foldS (Use x ) = use x foldS (Blk b) = blk (fold b) 5. 1). • Enter (l . Access (0. We use meaningful names for data constructors and we use built-in lists instead of user-deﬁned lists for the block algebra. Compositionality Statements are separated by semicolons. b) . 2) 98 . • Access (l . the algebra type BlockAlgebra. Note that it is allowed to use variables before they are declared.2. c): enters the l ’th nested block in which c local variables are declared. • Leave (l . The code consists of a sequence of instructions. b → s) ) foldBlock :: BlockAlgebra b s → Block → b foldBlock ((cons. can be generated from the two mutually recursive (data)types.

Count) type BlockInfo = (Level . Count) data Instruction = Enter BlockInfo | Leave BlockInfo | Access Variable type Code = [Instruction] The function ab2ac.5. lc)]. lc) of the variable is looked up in the global environment e. For all syntactic constructs of Statement and Block we deﬁne appropriate semantic actions on an algebra of computations. c) = e ? x 99 . c)] (l . description of these semantic actions. The level–local-count pair (l . This code is of the form [Access (l . (l . use x e = cd where cd = [Access (l . Note that we do not generate any code for a declaration statement. lc)) lc = lc + 1 where function update is deﬁned in the AssociationList module. Access (0. Moreover we have to increment the local variable count lc to lc + 1. The abstract syntax of the code to be generated is described by the following datatype. Instead we perform some computations which make it possible to generate appropriate code for other statements. which generates abstract code (a value of type Code) from an abstract block (a value of type Block ).5. somewhat simpliﬁed. dcl x (le. • Use: Every time we use a local variable x we have to generate code cd for it. l . • Dcl : Every time we declare a local variable x we have to update the local environment le of the block we are in by associating with x the current pair of level and local-count (l . Here is a. type Count = Int type Level = Int type Variable = (Level . uses a compositional function block2Code. lc) = (le . 1). lc ) where le = le ‘update‘ (x . 2) ] Note that we start numbering levels (l ) and counts (c) (which are sometimes called displacements) from 0. Block structured languages . Leave (0. lc).

The code which is generated for the block is surrounded by an appropriate [Enter lcB ]-[Leave lcB ] pair. a local block environment and a local variable count. For cons we compute three synthesised attributes: the local block environment. lc . e. 0) cd = [Enter (l . The code cd which is generated is the concatenation cdS + cdB of the code cdS which is + generated for the ﬁrst statement and the code cdB which is generated for the rest of the block. cdB ) = fB (leB . lc) = (le . and we compute two synthesised attributes: the local variable count and the generated code. The computation for the nested block results in a local variable count lcB and a local environment leB . Compositionality • Blk : Every time we enter a nested block we increment the global level l to l + 1. given a local environment le and local variable count lc. cons fS fB (le. l ) = cd where l =l +1 (leB . lcB . lc) (le . start with a fresh local variable count 0 and set the local environment of the nested block we enter to the current global environment e. For use we need one inherited attribute: a global block environment. lcB )] + cdB + [Leave (l . Two of them.5. When processing the statements of a nested block we already make use of the global block environment which we are synthesising (when looking up variables). and we compute one synthesised attribute: the generated code. the local variable count and the generated code. blk fB (e. For blk we need two inherited attributes: a global block environment and a global level. cdB ) = fB (leS . lcB )] + + • []: No action need to be performed for an empty block. lcS ) cd = cdS + cdB + What does our actual computation type look like? For dcl we need three inherited attributes: a global level. the 100 . l . lcS . results in a local environment leS and local variable count lcS . lc . Two of them: the local block environment and the local variable count are also synthesised attributes. This environment-count pair is then used by the computation of the rest of the block to result in a local environment le and local variable count lc . cd ) where (leS . • (:): For every nonempty block we perform the computation of the ﬁrst statement of the block which. Furthermore we need to make sure that the global environment (the one in which we look up variables) which is used by the computation for the nested block is equal to leB . cdS ) = fS (le. Moreover there is one extra attribute: a local block environment which is both inherited and synthesised.

lc) = ((le. lcS ). Level ) type LocalEnv = (BlockEnv . When processing a block we already use the global environment which we are synthesising (when looking up variables). lc ). cdB ) = fB (leB . l ) (leS . c)] (l . cd ) where cd = [Access (l . l ) (le. c)] + cd + [Leave (0. ab2ac :: Block → Code ab2ac b = [Enter (0. cd ) where ((leS . block2Code :: Block → GlobalEnv → LocalEnv → (LocalEnv .5. It is clear from the considerations above that the following types fulﬁll our needs. []) dcl x (e. type BlockEnv = [(Idf . lc) = ((le. lc ). The global environment is an attribute which is both inherited and synthesised. cd ) where ((leB . lc) ((le . lc) = ((le. use. []) where le = (x . lc ). l ) (le. a fresh level and a fresh local variable count. Block structured languages local block environment and the local variable count are also needed as inherited attributes. lc) = ((le . lcS ) cd = cdS + cdB + empty (e. The code is a synthesised attribute. (dcl . lc). blk )) where cons fS fB (e. l ) (e. l ) (le. Attributes which are not mentioned in those actions are added as extra components which do not contribute to the functionality. l ) (le. l ) (le. lc). lc) = ((le . cdB ) = fB (e. lcB ). lcB )] + cdB + [Leave (l . c)] + + where 101 . lc)) : le lc = lc + 1 use x (e. Count) The implementation of block2Code is now a straightforward translation of the actions described above. 0) l =l +1 cd = [Enter (l . lcB )] + + The code generator starts with an empty local environment.5. (l . lc). Code) block2Code = foldBlock ((cons. l ) (le. empty). Variable)] type GlobalEnv = (BlockEnv . cdS ) = fS (e. c) = e ? x blk fB (e.

Deﬁne the function a2cParity as foldParity.Access (1. Consider your answer to Exercise 2. Deﬁne the function foldMir . 2. 3. Exercises Exercise 5.5. Compositionality ((e. 2. which gives an abstract syntax for mirror-palindromes.Enter (1.22. which describes how the semantic actions that correspond to the syntactic constructs of Parity should be applied. 4. Dcl "z". 2. which gives an abstract syntax for palindromes.9. 4.Access (0. Deﬁne the function foldPal . which describes how the semantics actions that correspond to the syntactic constructs of Pal should be applied.2).Leave (1. Deﬁne a type PalAlgebra that describes the type of the semantic actions that correspond to the syntactic constructs of Pal . Consider your answer to exercise 2.23. Use "y". Exercise 5. Deﬁne the function foldParity.1). 5. 0) aBlock = [ Use "x". Dcl "x" . which gives an abstract syntax for parity-sequences. which describes how semantic actions that correspond to the syntactic constructs of Mir should be applied. 1. Describe the parsers pfoldPal m1 and pfoldPal m2 where m1 and m2 correspond to the algebras of a2cPal and aCountPal respectively.2)] 5.2) . which interprets its input in an arbitrary semantic MirAlgebra without building the intermediate abstract syntax tree. Exercise 5. Deﬁne the functions a2cPal and aCountPal as foldPal ’s.Access (0. Consider your answer to Exercise 2. 3. Deﬁne the functions a2cMir and m2pMir as foldMir ’s. Describe the parsers pfoldMir m1 and pfoldMir m2 where m1 and m2 correspond to the algebras of a2cMir and m2pMir . Deﬁne the type MirAlgebra that describes the semantic actions that correspond to the syntactic constructs of Mir . 5. Deﬁne the parser pfoldPal which interprets its input in an arbitrary semantic PalAlgebra without building the intermediate abstract syntax tree.0).Access (0. c). Deﬁne the parser pfoldMir . Dcl "y". respectively.10. 0) ([].Leave (0.1).1). 3. 1. Blk [Use "z". cd ) = block2Code b (e.2).0) . 102 . Use "x"] . Deﬁne the type ParityAlgebra that describes the semantic actions that correspond to the syntactic constructs of Parity. Use "y"] ? ab2ac aBlock [Enter (0.24.6.Access (1.11. 1. Dcl "x".

Deﬁne the function foldBlock . Exercises Exercise 5. Deﬁne the function foldBitList. Exercise 5. Deﬁne the type BlockAlgebra that describes the semantic actions that correspond to the syntactic constructs of Block .5. 1.13. What is the abstract representation of x. 2. 3.(y. The function checkBlock tests whether or not each variable of a given abstract block is declared before use (declared in the same or in a surrounding block).S R|ε →D |U |N →x|y →X|Y →(B ) (block ) (rest) (statement) (declaration) (usage) (nested block ) 1. 4.12. 4. Deﬁne the type BitListAlgebra that describes the semantic actions that correspond to the syntactic constructs of BitList. Deﬁne a datatype Block that describes the abstract syntax that corresponds to the grammar. Consider your answer to Exercise 2. Deﬁne the function a2cBlock . which converts an abstract block into a concrete one. which interprets its input in an arbitrary semantic BitListAlgebra without building the intermediate abstract syntax tree.Y). 3. Write a2cBlock as a foldBlock 5. which gives an abstract syntax for bit-lists. which describes how the semantic actions corresponding to the syntactic constructs of Block should be applied. The following grammar describes the concrete syntax of a simple blockstructured programming language B R S D U N →S R →.X? 2. which describes how the semantic actions that correspond to the syntactic constructs of BitList should be applied. 103 . Deﬁne the function a2cBitList as a foldBitList.25.6. Deﬁne the parser pfoldBitList.

5. Compositionality 104 .

apply a fold to the abstract syntax. For example. Apply a fold to the abstract syntax Instead of inserting operations in a basic parser. both compute the maximum nesting depth in a string of parentheses.2. The net travelling time is computed directly by inserting the correct semantic functions. If we want to compute the total travelling time. use a class instead of abstract syntax. Function nesting is deﬁned 105 . and a parser that computes the net travelling time.2. and computes the desired result by applying a fold to the abstract syntax. Computing with parsers Parsers produce results.6. where we deﬁned two functions with the same functionality: nesting and nesting’. Insert a semantic function in the parser In Chapter 4 we have deﬁned two parsers: a parser that computes the abstract syntax for a travelling schema. we have to insert diﬀerent functions in the basic parser.1. pass an algebra to the parser. or an integer that represents the net travelling time in minutes. 6. the parsers for travelling schemes given in Chapter 4 return an abstract syntax. we can write a parser that parses the input to an abstract syntax.2. • it is diﬃcult to locate all positions where semantic functions have to be inserted in the parser. but it has some disadvantages when building a larger parser: • semantics is intertwined with the parsing process. These functions are obtained by inserting diﬀerent functions in the basic parser. This section shows several ways to compute results using parsers: • • • • insert a semantic function in the parser. 6. This approach works ﬁne for a small parser. Another way to compute the net travelling time is by ﬁrst computing the abstract syntax. An example of such an approach has been given in Section 5. and then applying a function to the abstract syntax that computes the net travelling time.

The only thing we have to write ourselves in the second deﬁnition is the depthParenthesesAlgebra. we repeat the main arguments below. 106 . we would like to write function nesting’ and have our compiler ﬁgure out that it is better to use function nesting in computations. On the other hand. We want to avoid building the abstract syntax tree altogether. To obtain the best of both worlds.3.6. Suppose we have a datatype AbstractTree data AbstractTree = . Some (very few) compilers are clever enough to perform this transformation automatically. parens parens :: Parser Char Parentheses = (\_ b _ d -> Match b d) <$> open <*> parens <*> close <*> parens <|> succeed Empty :: Parser Char Int = (\_ b _ d -> max (1+b) d) <$> open <*> nesting <*> close <*> nesting <|> succeed 0 :: Parser Char Int = depthParentheses <$> parens nesting nesting nesting’ nesting’ The ﬁrst deﬁnition (nesting) is more eﬃcient. and the function depthParentheses (or the function foldParentheses. This section sketches the general idea. The advantage of the second deﬁnition (nesting’) is that we reuse both the parser parens. which returns an abstract syntax tree. it might be more diﬃcult to write because we have to insert functions in the correct places in the basic parser. 6. which does recursion over an abstract syntax tree.. Each of these deﬁnitions has its own merits. Function nesting’ is deﬁned by applying a fold to the abstract syntax. The disadvantage of the second deﬁnition is that it builds an intermediate abstract syntax tree. which is ‘ﬂattened’ by the fold.. The previous section gives an example of deforestation on the datatype Parentheses. which is used in the deﬁnition of depthParentheses). Deforestation Deforestation removes intermediate trees in computations. because it does not build an intermediate abstract syntax tree. Computing with parsers by inserting functions in the basic parser. The automatic transformation of function nesting’ into function nesting is called deforestation (trees are removed).

and then computes some value by folding with an algebra f over this tree: p = foldAbstractTree f . The following two sections each describe a way to implement such a deforestated function. Suppose now that we deﬁne a function p that parses an AbstractTree. see Chapter 5.. Using a class instead of abstract syntax From this datatype we construct an algebra and a fold.4. we deﬁne the following class: class Parens a where match :: a -> a -> a empty :: a Note that types of the functions in the class Parens correspond exactly to the two types that occur in the type ParenthesesAlgebra. 6. Using a class instead of abstract syntax Classes can be used to implement the deforestated or fused computation of a fold with a parser. parseAbstractTree Then deforestation says that p is equal to the function parseAbstractTree in which occurrences of the constructors of the datatype AbstractTree have been replaced by the corresponding components of the algebra f. This class is used in a parser for parentheses: parens parens :: Parens a => Parser Char a = (\_ b _ d -> match b d) <$> open <*> parens <*> close <*> parens <|> succeed empty 107 . foldAbstractTree :: AbstractTreeAlgebra a -> AbstractTree -> a A parser for the datatype AbstractTree (which returns a value of AbstractTree) has the following type: parseAbstractTree :: Parser Symbol AbstractTree where Symbol is some type of input symbols (for example Char). For example. This gives a solution of the desired eﬃciency.4. type AbstractTreeAlgebra a = . for the language of parentheses.6..

Another instance of Parens can be used to compute the nesting depth of parentheses: instance Parens Int where match b d = max (1+b) d empty = 0 And now we can write: ?(parens :: Parser Char Int) "()()" [(1.6. "()") . "()()")] So the answer depends on the type we want our function parens to have. "") . we know nothing about their implementation. instance Parens Parentheses match = Match empty = Empty where Now we can write: ?(parens :: Parser Char Parentheses) "()()" [(Match Empty (Match Empty Empty).(Match Empty Empty. This also immediately shows a problem of this. ""). "()"). otherwise elegant. we cannot use function parens to compute. the width (also an Int) of a string of parentheses. Computing with parsers The algebra is implicit in this function: the only thing we know is that there exist functions empty and match of the correct type. "()()") ] Note that we have to supply the type of parens in this expression. otherwise Haskell doesn’t know which instance of Parens to use. So once we have deﬁned the instance Parens Int as above. To obtain a function parens that returns a value of type Parentheses we create the following instance of the class Parens. for example. (0. (1. 108 . This is how we turn the implicit ‘class’ algebra into an explicit ‘instance’ algebra. approach: it does not work if we want to compute two diﬀerent results of the same type. because Haskell doesn’t allow you to deﬁne two (or more) instances with the same type.(Empty.

0) 109 .5.5. Passing an algebra to the parser The previous section shows how to implement a parser with an implicit algebra.empty) = par where par = (\_ b _ d -> match b d) <$> open <*> par <*> close <*> par <|> succeed empty Note that it is now easy to deﬁne diﬀerent parsers with the same result type: nesting.0) breadth = parens (\b d -> d+1. Passing an algebra to the parser 6. we make the algebras explicit. breadth :: Parser Char Int nesting = parens (\b d -> max (1+b) d. Since this approach fails when we want to deﬁne diﬀerent parsers with the same result type.6. Thus we obtain the following deﬁnition of parens: parens :: ParenthesesAlgebra a -> Parser Char a parens (match.

6. Computing with parsers 110 .

We will see that such carrier types may be functions themselves. Unfortunately. We will start with the rep min example. This related formalism is the attribute grammar formalism. a program) we deﬁne two sets of functions: one set that describes how to recursively visit the nodes of the tree.7. which will make it easier to construct such involved algebras. and especially those programs that use functions that recursively descend over datatypes a lot. The domain of that algebra is a higher order (data)type (a (data)type that contains functions). When computing a property of such a recursive object (for example. We start with developing a somewhat unconventional way of looking at functional programs. For each recursive datatype T we have a function foldT. it has been chosen for its simplicity. and give an informal deﬁnition. and deals with a non-interesting. Once this step has been taken. In our case one may think about these datatypes as abstract syntax trees. We will not formally deﬁne attribute grammars. However. In this way we deﬁne the meaning of a language in a semantically compositional way. these types are a guideline for further design steps. the resulting code comes as a surprise to many. Evaluating expressions in this language can be done by choosing an appropriate algebra. We will do so by constructing algebras out of other algebras. and to not distract our attention to speciﬁc. Programming with higher-order folds Introduction In the previous chapters we have seen that algebras play an important role when describing the meaning of a recognised structure (a parse tree). which looks a bit artiﬁcial. In this chapter we will illustrate a related formalism. but instead illustrate the formalism with some examples. programming language related. and one set of functions (an algebra) that describes what to compute at each node when visited. In this chapter we will present a view on recursive computations that will enable us to “design” the carrier type in an incremental way. and for each constructor of the datatype we have a corresponding function as a component in the algebra. Chapter 5 introduces a language in which local declarations are permitted. One of the most important steps in this process is deciding what the carrier type of the algebras is going to be. 111 . and that deciding on the type of such functions may not always be simple. highly speciﬁc problem.

• how to combine algebras. and a sequence of diﬀerent solutions. bin) (Bin l r) = bin (foldTree alg l) (foldTree alg r) Listing 7.1: rm. to build up an understanding of a whole class of such programs. _ ) (Leaf i) = leaf i foldTree alg@(_ . The carrier type of an algebra is the type that describes the objects of the algebra. 112 .start. a -> a -> a) The associated evaluation function foldTree systematically replaces the constructors Leaf and Bin by corresponding operations from the algebra alg that is passed as an argument. • (informally) the concept of an attribute grammar. together with its associated algebra.hs semantic issues. In Listing 7. The rep min problem One of the famous examples in which the power of lazy evaluation is demonstrated is the so-called rep min problem [2]. a -> a -> a) foldTree :: TreeAlgebra a -> Tree -> a foldTree alg@(leaf. We represent it by a type parameter of the algebra type: type TreeAlgebra a = (Int -> a. 7.7. The second example of this chapter demonstrates the techniques on a larger example: a small compiler for part of a programming language. We will use this problem. or ‘higher-order folds’. Programming with higher-order folds data Tree = | Leaf Int Bin Tree Tree = deriving Show type TreeAlgebra a (Int -> a. Many have wondered how this program achieves its goal.1. Goals In this chapter you will learn: • how to write ‘circular’ functional programs.1 we present the datatype Tree. since at ﬁrst sight it seems that it is impossible to compute anything with this program.

we will try to construct a solution that calls foldTree only once. Notice how we have emphasized the fact that a function is returned through some superﬂuous notation: the ﬁrst lambda in the function deﬁnitions constituting the algebra repAlg is required by the signature of the algebra. Program Listing 7. ?rep_min (Bin (Bin (Leaf 1) (Leaf 7)) (Leaf 11)) Bin (Bin (Leaf 1) (Leaf 1)) (Leaf 1) 7. which takes the computed minimal value as an argument.hs We now want to construct a function rep_min :: Tree -> Tree that returns a Tree with the same “shape” as its argument Tree.1. but instead a function constructing a tree of type Int -> Tree. In this program the global variable m has been removed and the second call of foldTree does not construct a Tree anymore. For example.7. min :: Int->Int->Int) rep_min :: Tree -> Tree rep_min t = foldTree repAlg t where m = foldTree minAlg t repAlg = (const (Leaf m). A straightforward solution A straightforward solution to the rep min problem consists of a function in which foldTree is used twice: once for computing the minimal value of the leaf values. The function rep_min that solves the problem in this way is given in Listing 7.3 is an intermediate step towards this goal.2: rm. the second lambda. One of the disadvantages of this solution is that in the course of the computation the pattern matching associated with the inspection of the tree nodes is performed twice for each node in the tree.1. Although this solution as such is no problem. Bin) Listing 7. which could have been 113 . that is used in the tree constructing call of foldTree. Notice that the variable m is a global variable of the repAlg-algebra. The rep min problem minAlg minAlg :: TreeAlgebra Int = (id. 7.1. Lambda lifting We want to obtain a program for the rep min problem in which pattern matching is used only once. but with the values in its leaves replaced by the minimal value occurring in the original tree.1.sol1. and once for constructing the resulting Tree.2.2.

First a function tuple is deﬁned.7. This process is done routinely by functional compilers and is known as lambda-lifting.b) (leaf1. 114 .bin2 (snd l) (snd r) ) ) min_repAlg :: TreeAlgebra (Int. Tupling computations We are now ready to formulate a solution in which foldTree is called only once.hs omitted. 7. which has as its carrier tuples of the carriers of the original algebras. As a consequence we may perform both the computation of the tree constructing function and the minimal value in one go.hs infix 9 ‘tuple‘ tuple :: TreeAlgebra a -> TreeAlgebra b -> TreeAlgebra (a. bin2) = (\i -> (leaf1 i. Programming with higher-order folds repAlg = ( \_ -> \m -> Leaf m .4.3. This function takes two TreeAlgebras as arguments and constructs a third one.1. is there because the carrier set of the algebra contains functions of type Int -> Tree.\l r -> (bin1 (fst l) (fst r) . r) = foldTree min_repAlg t Listing 7.4: rm.3: rm. leaf2 i) . Int -> Tree) min_repAlg = (minAlg ‘tuple‘ repAlg) rep_min’’ t = r m where (m. Note that in the last solution the two calls of foldTree don’t interfere with each other.sol2. by tupling the results of the computations.sol3.\lfun rfun -> \m -> let lt = lfun m rt = rfun m in Bin lt rt ) rep_min’ t = (foldTree repAlg t) (foldTree minAlg t) Listing 7. bin1) ‘tuple‘ (leaf2. The solution is given in Listing 7.

which together achieve the same eﬀect.7. languages that require arguments to be evaluated completely before a call is evaluated (so-called strict evaluation in contrast to lazy evaluation). in an attempt to make the diﬀerent rˆles of the parameters o explicit. see Listing 7. The resulting solution passes back part of the result of a call as an argument to that same call. The new version is given in Listing 7. We want to mention here too that the reverse is in general not true.4. The parameters after the ﬁrst lambda are there because we deal with a TreeAlgebra. again introduced extra lambdas in the deﬁnition of the functions of the algebra. it is in general not possible to split this function into two functions of type a -> c and b -> d. and the result type becomes (Int. Lazy evaluation makes this possible. Tree).1. into an equivalent type Int -> (Int. The deepest front problem is the problem of ﬁnding the so-called front of a tree. Notice however that there is nothing circular in this program. 115 . Exercise 7. d). given a function of type (a. which is isomorphic to just Int. Notice how we have. Tree). we may transform that type into a new type in which we compute only one function. Thus we obtain the original solution as given in Bird [2]. Int->Tree). especially to those who used to program in LISP or ML. b) -> (c. is passed back as an argument to the result of (foldTree mergedAlg t). in our case the value m. and no value is deﬁned in terms of itself (as in ones=1:ones). Each value is deﬁned in terms of other values. and returns as its result the cartesian product of the result types. but we use it to demonstrate that if we compute a cartesian product of functions. Finally. which takes as its arguments the cartesian product of all the arguments of the functions in the tuple. A curious step taken here is that part of the result. In our example the computation of the minimal value may be seen as a function of type ()->Int. Merging tupled functions In the next step we transform the type of the carrier set in the previous example. Int). The rep min problem 7. the trees involved are elements of the datatype Tree. we have systematically transformed a program that inspects each node twice into an equivalent program that inspects each node only once. Listing 7. Lazy evaluation makes this work. This transformation is not essential here. (Int. Concluding. As in the rep min problem.1.6 shows the version of this program in which the function foldTree has been unfolded.1. The front of a tree is the list of all nodes that are at the deepest level.1.5. That such programs were possible came originally as a great surprise to many functional programmers. A straightforward solution is to compute the height of the tree and passing the result of this function to a function frontAtLevel :: Tree -> Int -> [Int]. Because of this surprising behaviour this class of programs became known as circular programs. The parameters after the second lambda are there because we construct values in a higher order carrier set. As a consequence the argument of the new type is (().

sol5.\lfun rfun -> \m -> let (lm.sol4.hs rep_min’’’’ t = r where (m. Leaf m) .lt) = lfun m (rm. r) = tree t m tree (Leaf i) = \m -> (i. Bin lt rt ) ) rep_min’’’ t = r where (m. lt) = tree l m (rm.hs 116 . Leaf m) tree (Bin l r) = \m -> let (lm. Bin lt rt) Listing 7.5: rm.rt) = rfun m in (lm ‘min‘ rm .7.Tree)) mergedAlg = (\i -> \m -> (i. Programming with higher-order folds mergedAlg :: TreeAlgebra (Int -> (Int.6: rm. r) = (foldTree mergedAlg t) m Listing 7. rt) = tree r m in (lm ‘min‘ rm.

4. Redo the previous exercise for the highest front problem. Here we deﬁne a stack machine that has some more instructions.7. 7. and you will extend the language in the exercises with some other constructs.2. A parser for expressions is given in Listing 7. Give the four diﬀerent solutions as deﬁned in the rep min problem. booleans.4 we have deﬁned a stack machine with which simple arithmetic expressions can be evaluated. A language with just these constructs is useless. and an if-then-else expression. Expr0 → if Expr1 then Expr1 else Expr1 | Expr1 Expr1 → Expr2 Expr2∗ Expr2 → Int | Bool where Int generates integers. A stack machine In section 5. which make the language a bit more interesting.2.8.7 also contains a deﬁnition of a fold and an algebra type for the abstract syntax. The compiler compiles this code into code for a hypothetical stack machine. The language The language we consider in this section has integers. An abstract syntax for our language is given in Listing 7. • it can load a boolean. The stack machine we will use has the following instructions: • it can load an integer. Deﬁne the functions height and frontAtLevel 2.2. The Listing 7. 7.7. A small compiler 1. 117 .1.2. The language of the previous section wil be compiled into code for this stack machine in the following section. function application.2. We take the following context-free grammar for the concrete syntax of the language. A small compiler This section constructs a small compiler for (a part of) a small language. and Bool booleans. • given an argument and a function on the stack. 7.2. Note that we use a single datatype for the abstract syntax instead of three datatypes (one for each nonterminal). it can call the function on the argument. Exercise 7. this simpliﬁes the code a bit.

compile (ConInt i) = [LoadInt i] • A ConBool b is compiled to a LoadBool b. The datatype for instructions is given in Listing 7. • it can jump to a label (unconditionally).Bool -> a ) foldExprAS :: ExprASAlgebra a -> ExprAS -> a foldExprAS (iff.a -> a -> a .hs • it can set a label in the code. 7. • given a boolean on the stack. Compiling to the stackmachine How do we compile the diﬀerent expressions to stack machine code? We want to deﬁne a function compile of type compile :: ExprAS -> [InstructionSM] • A ConInt i is compiled to a LoadInt i.conint. it can jump to a label provided the boolean is false. compile (ConBool b) = [LoadBool b] 118 .apply.7: ExprAbstractSyntax.3.9.7.conbool) = fold where fold (If ce te ee) = iff (fold ce) (fold te) (fold ee) fold (Apply fe ae) = apply (fold fe) (fold ae) fold (ConInt i) = conint i fold (ConBool b) = conbool b Listing 7. Programming with higher-order folds data ExprAS = | | | If ExprAS ExprAS ExprAS Apply ExprAS ExprAS ConInt Int ConBool Bool deriving Show type ExprASAlgebra a = (a -> a -> a -> a .2.Int -> a .

A small compiler sptoken :: String -> Parser Char String sptoken s = (\_ b _ -> b) <$> many (symbol ’ ’) <*> token s <*> many1 (symbol ’ ’) boolean = const True <$> token "True" <|> const False <$> token "False" parseExpr :: Parser Char ExprAS parseExpr = expr0 where expr0 = (\a b c d e f -> If b d f) <$> sptoken "if" <*> parseExpr <*> sptoken "then" <*> parseExpr <*> sptoken "else" <*> parseExpr <|> expr1 expr1 = chainl expr2 (const Apply <$> many1 (symbol ’ ’)) <|> expr2 expr2 = ConBool <$> boolean <|> ConInt <$> natural Listing 7.hs data InstructionSM = | | | | | LoadInt Int LoadBool Bool Call SetLabel Label BrFalse Label BrAlways Label type Label = Int Listing 7.2.hs 119 .7.8: ExprParser.9: InstructionSM.

We add a label argument (an integer. then the ‘function’ f (at the moment it is impossible to deﬁne functions in our language. compile (Apply f x) = compile x ++ compile f ++ [Call] • An if-then-else expression If ce te ee is compiled by ﬁrst compiling the conditional expression ce.l) \l -> ([LoadBool b]. but where do these labels come from? From the above description we see that we also need labels when compiling an expression. for which we need another label. Programming with higher-order folds • An application Apply f x is compiled by ﬁrst compiling the argument x. Then we set the label for the else expression. After the then expression we always jump to the end of the code of the if-then-else expression.l’’) = compile te l’ (ec. and we want function compile to return the ﬁrst unused label.l) \l -> let (xc. ﬁnally. used for the ﬁrst label in the compiled code) to function compile. We change the type of function compile as follows: compile :: ExprAS -> Label -> ([InstructionSM]. Then we jump to a label (which will be set before the code of the else expression ee later) if the resulting boolean is false.l’’) = compile f l’ in (xc ++ fc ++ [Call]. we compile the else expression ee. we set the label for the end of the if-then-else expression. and.l’) = compile ce (l+2) (tc. Then we compile the then expression te. and ﬁnally putting a Call on top of the stack.7.l’’) \l -> let (cc.l’’’) = compile ee l’’ compile (If ce te ee) = 120 . compile (If ce te ee) = ++ ++ ++ ++ ++ ++ compile ce [BrFalse ?lab1] compile te [BrAlways ?lab2] [SetLabel ?lab1] compile ee [SetLabel ?lab2] Note that we use labels here.Label) = Int type Label The four cases in the deﬁnition of compile have to take care of the labels. hence the quotes around ‘function’). We obtain the following deﬁnition of compile: compile (ConInt i) compile (ConBool b) compile (Apply f x) = = = \l -> ([LoadInt i].l’) = compile x l (fc.

\cf cx -> \l -> let (xc. Exercise 7.2.l’’’) = cee l’’ in ( cc ++ [BrFalse l] ++ tc ++ [BrAlways (l+1)] ++ [SetLabel l] ++ ec ++ [SetLabel (l+1)] .10: CompileExpr.l’’) = cf l’ in (xc ++ fc ++ [Call].7. 121 .Label)) compileAlgebra = (\cce cte cee -> \l -> let (cc.l’’) .\i -> \l -> ([LoadInt i].l’) = cce (l+2) (tc. A small compiler compile = foldExprAS compileAlgebra compileAlgebra :: ExprASAlgebra (Label -> ([InstructionSM].l) ) Listing 7. Extend the code generation example by adding variables to the datatype Expr.l’’) = cte l’ (ec.l) .\b -> \l -> ([LoadBool b].hs in ( cc ++ [BrFalse l] ++ tc ++ [BrAlways (l+1)] ++ [SetLabel l] ++ ec ++ [SetLabel (l+1)] . Extend the code generation example by adding deﬁnitions to the datatype Expr too. The deﬁnition of function compile as a fold is given in Listing 7. Exercise 7.4 (no answer provided).l’’’ ) .l’) = cx l (fc.3 (no answer provided).10. the carrier type of its algebra is a function of type Label -> ([InstructionSM].l’’’ ) Function compile is a fold.Label).

Programming with higher-order folds 7.7. This program computes the minimum of a tree. Traditionally such attribute grammars are used as the input of a compiler generator. Attribute grammars provide a solution for the systematic description of the phases of the compiler that come after scanning and parsing. But. The minimum value is then passed on to the functions that build the tree with the minimum value in its leaves.3. This chapter does not further introduce attribute grammars. but they will appear again in the course in implementing programming languages. The result tree is computed bottomup. The minimum is computed bottomup: it is synthesized from its children. and is then passed down to the result tree. and is therefore a synthesized and inherited attribute. they can straightforwardly be mapped onto a functional program. The minimum is computed bottom-up. These functions receive the minimum value from their parent tree node: they inherit the minimum value from their parent. Attribute grammars In Section 7. we have shown how one can do without a special purpose attribute grammar processing system. and it computes the tree in which all the leaf values are replaced by the minimum value. Just as we have seen how by introducing a suitable set of parsing combinators one may avoid the use of a special parser generator and even gain a lot of ﬂexibility in extending the grammatical formalism by introducing more complicated combinators.1 we have written a program that solves the rep min problem. just as the concept of a context free grammar was useful in understanding the fundamentals of parser combinators. and is hence a synthesized attribute. understanding attribute grammars will help signiﬁcantly in describing the semantic part of the recognition and compilation process. on which two attributes are deﬁned: the minimum and result tree attributes. 122 . The programs you have seen in this chapter could also have been obtained by means of such a mapping from an attribute grammar speciﬁcation. We can see the rep min computation as a computation on a value of type Tree. The formalism in which it is possible to specify such attributes and computations on datatypes or grammars is called attribute grammars. Although they look diﬀerent from what we have encountered thus far and are probably a little easier to write. and was originally proposed by Donald Knuth in [9].

• regular grammars. Finite-state automata The classical approach to recognising sentences from a regular language uses ﬁnitestate automata. and regular languages. nondeterministic and deterministic ones. A ﬁnite-state automaton can be viewed as a simple form of digital 123 . equally expressive. A regular grammar is a particular kind of context-free grammar that can be used to describe regular expressions. 8. it shows their equivalence with ﬁnite-state automata.1 introduces ﬁnite-state automata. Section 8. Section 8.3 introduces regular expressions as ﬁnite descriptions of regular languages and shows that such expressions are another. ﬁnite-state automata and regular expressions have the same expressive power. numbers. Regular Languages Introduction The ﬁrst phase of a compiler takes an input program. ﬁnite-state automata and regular expressions are three different tools for regular languages. • it is not always possible to give a regular grammar for a context-free grammar. • ﬁnite-state automata appear in two. tool for regular languages. Section 8.4 shows that a nondeterministic ﬁnite-state automaton can be transformed into a deterministic ﬁnitestate automaton. Finite-state automata can be used to recognise sentences of regular grammars. Finally. Furthermore.1.2 introduces regular grammars (context-free grammars of a particular form). • regular grammars. Section 8. equivalent. Regular expressions are used for the description of the terminal symbols. and is organised as follows. Goals After you have studied this chapter you will know that • regular languages are a subset of context-free languages.8. Section 8.1. and splits the input into a list of terminal symbols: keywords. This chapter discusses all of these concepts. etc. versions: deterministic and nondeterministic. Finite-state automata appear in two versions.4 gives some of the proofs of the results of the previous sections. identiﬁers. punctuation. so you don’t have to worry about whether or not your automaton is deterministic.

We start with a description of deterministic ﬁnite-state automata. and a control unit which records the state transitions. S. accepting states get a double circle. start states are explicitly mentioned or indicated otherwise. Q. consider the DFA M0 = (X. F ) where • • • • • X is the input alphabet. b. the simplest form of automata. S. but a useful tool in many practical subproblems. B. 8. and constant space. then there is an arrow labelled x from Qi to Qj .1. The transition function is represented by the edges: whenever d Qi x is a state Qj . F ⊆ Q is the set of accepting states. Q. Deterministic ﬁnite-state automata Finite-state automata come in two ﬂavours: deterministic and nondeterministic. an input ﬁle but only the possibility to read it. Q is a ﬁnite set of states. can be implemented by means of very eﬃcient programs. d.1 (Deterministic ﬁnite-state automaton. DFA). A. A ﬁnitestate automaton can easily be implemented by a function that takes time linear in the length of its input. S ∈ Q is the start state. C} F = {C} where state transition function d is deﬁned by dSa = C dSb = A dSc = S dAa = B dBc = C For human beings. For automaton M0 above. F ) with X = {a. d :: Q → X → Q is the state transition function. The following representation is customary: states are depicted as the nodes in a graph.1. The input alphabet is implicit in the labels. Regular Languages computer with only a ﬁnite number of states. A deterministic ﬁnitestate automaton (DFA) is a 5-tuple (X. d. no temporary storage. Deﬁnition 8. a ﬁnite-state automaton is more comprehensible in a graphical representation.8. c} Q = {S. deterministic ﬁnite-state automaton As an example. A rather limited medium. This implies that problems that can be solved by means of a ﬁnite-state automaton. the pictorial representation is: 124 .

/. the result state of all undeﬁned transitions. The sink state and the transitions from/to it are almost always omitted. ()*+ / ?>=< C O c Note that d is a partial function: for example d B a is not deﬁned. When the complete input has been processed and the DFA is in one of its accepting states. the pairs of current state and remaining input values. 125 . The predicate will be derived in a top-down fashion. the automaton blocks. A 89:. Q. d.e. when starting the DFA in S. We can make d into a total function by introducing a new ‘sink’ state. / ?>=< B accept Because of the deterministic behaviour of a DFA the deﬁnition of acceptance by a DFA is relatively easy. ) ?>=< 89:.e. bac) → (A.1. ac) → (B. S. a sequence w ∈ X ∗ is accepted by a DFA (X. Finite-state automata @A GFED 89:. For example. / ?>=< S c a b a 89:. i. and d E x = D for all states E and terminals x for which d E x is undeﬁned. This operational description of acceptance is formalised in the predicate dfa accept. w can be ‘processed’ symbol by symbol (from left to right) and — depending on the speciﬁc input symbol — the DFA (initially in the start state) moves to the state as determined by its state transition function. in the above automaton we can introduce a sink state D with d D x = D for all terminals x. F ).8. if it is possible. c) → (C. we formulate the predicate in terms of (“smaller”) subcomponents and afterwards we give solutions to the subcomponents. We do so by recording the successive conﬁgurations. i. The action of a DFA on an input string is described as follows: given a sequence w of input symbols. to end in an accepting state after processing w. (S. To illustrate the action of an DFA. we will show that the sentence bac is accepted by M0 .-. If no move is possible. Informally. then we say that w is accepted by the automaton.

the language of M . S.2. For DFA M = (X. The sequence w ∈ X ∗ is accepted by DFA (X.8. Nondeterministic ﬁnite state automata are deﬁned as follows.e.e. a start state. Sometimes it is convenient to have two or more edges labelled with the same terminal symbol from a state. and it follows that the function dfa is actually identical to foldl . Ldfa(M ). {Q}) → Bool dfa accept w (d. d. a function which given a transition function. qs. i. the language of a nondeterministic ﬁnite-state automaton. is deﬁned by Ldfa(M ) = {w ∈ X ∗ | dfa accept w (d. The transition function of a DFA returns a state. returns the unique state that is reached after processing the string starting from the start state. F ) = (dfa d S w) ∈ F It remains to construct a deﬁnition of function dfa that takes a transition function. Q. F ) where dfa accept w (d. the language of a DFA is deﬁned as follows. Regular Languages Suppose dfa is a function that reﬂects the behaviour of the DFA. F ) if dfa accept w (d. Deﬁnition 8. S.2 (Acceptance by a DFA). Nondeterministic ﬁnite-state automata This subsection introduces nondeterministic ﬁnite-state automata and deﬁnes their semantics. dfa d q = foldl d q. 126 . F )} 8. and a list of input symbols. a start state and a string. In these cases one can use a nondeterministic ﬁnitestate automaton.1. S. which implies that for all terminal symbols x and for all states t there can only be one edge starting in t labelled with x. fs) = dfa d qs w ∈ fs dfa d qs = foldl d qs Using the predicate dfa accept. Q. F ). The deﬁnition of dfa is straightforward dfa :: (Q → X → Q) → Q → X ∗ → Q dfa d q = q dfa d q (ax) = dfa d (d q a) x Note that both the type and the deﬁnition of dfa match the pattern of the function foldl . i. d. S. Deﬁnition 8. S.3 (Language of a DFA). Then the predicate dfa accept is deﬁned by: dfa accept :: X ∗ → (Q → X → Q. Q. and reﬂects the behaviour of a DFA.

A ba ED a 89:. c} Q = {S. B} d B c = {C} Again d is a partial function. Thus the DFA becomes an NFA. ()*+ / ?>=< C O c a 89:. F ).1. d. Q. b. Q0 . where • • • • • X is the input alphabet. Q. Q0 . F ) with X = {a.4 (Nondeterministic ﬁnite-state automaton. Here is an example of an NFA: @A GFED 89:. / ?>=< B where state transition function d is deﬁned by d S a = {C} d S b = {A} d S c = {S} d A a = {S. A nondeterministic ﬁnite-state automaton (NFA) is a 5-tuple (X. this NFA is deﬁned as M1 = (X. /. nondeterministic ﬁnite-state automaton An NFA diﬀers from a DFA in that there may be more than one start state and that there may be more than one possible move for each state and input symbol. d. A. F ⊆ Q is the set of accepting states. / ?>=< o S c Note that this NFA is very similar to the DFA in the previous section: the only diﬀerence is that there are two outgoing arrows labelled with a from state A. NFA). 127 . d :: Q → X → {Q} is the state transition function. Finite-state automata Deﬁnition 8. which can be made total by adding d D x = { } for all states D and all terminal symbols x for which d D x is undeﬁned.8. B. C} Q0 = {S} F = {C} BC ?>=< 89:.-. Formally. Q0 ⊆ Q is the set of start states. Q is a ﬁnite set of states.

sequence w ∈ X ∗ is accepted by NFA (X.8. r ∈ d q a} The behaviour of the NFA on X-sequences of arbitrary length follows from this “one step” behaviour: nfa :: (Q → X → {Q}) → {Q} → X ∗ → {Q} nfa d qs = qs nfa d qs (ax) = nfa d (deltas d qs a) x Again. qs. Assume that we have a function. F ) if nfa accept w (d. Q0 . we have to be careful in deﬁning what it means that a sequence is accepted by an NFA. Regular Languages Since an NFA can make an arbitrary (nondeterministic) choice for one of its possible moves. F ). called deltas. F ) = nfa d Q0 w ∩ F = ∅ Now it remains to ﬁnd a function nfa d qs of type X ∗ → {Q} that reﬂects the behaviour of the NFA. fs) = nfa d qs w ∩ fs = ∅ nfa d qs = foldl (deltas d) qs deltas d qs a = {r | q ∈ qs. returns all possible states that can be reached after processing the string starting in some start state. In summary we have derived Deﬁnition 8. Q0 . the language of an NFA is deﬁned by 128 . when starting the NFA in a state from Q0 . d. Informally. d. if it is possible. For lists of length 1 such a function. say nfa. This operational description of acceptance is formalised in the predicate nfa accept. Q0 . is deﬁned by deltas :: (Q → X → {Q}) → {Q} → X → {Q} deltas d qs a = {r | q ∈ qs. a set of start states and a string. nfa d qs = foldl (deltas d) qs This concludes the deﬁnition of predicate nfa accept. {Q}) → Bool nfa accept nfa accept w (d.5 (Acceptance by an NFA). That is a function which given a transition function. The sequence w ∈ X ∗ is accepted by NFA (X. Q0 . Then the predicate nfa accept can be expressed as :: X ∗ → (Q → X → {Q}. which reﬂects the behaviour of the NFA. {Q}. F ) where nfa accept w (d. Q. to end in an accepting state after processing w. Q. r ∈ d q a} Using the nfa accept-predicate. it follows that nfa can be written as a foldl.

. This is due to the fact that all possible transitions have to be tried in order to determine whether or not the automaton can end in an accepting state after reading the input. 8. d. Q0 .1. S.3. F ). We will show how to construct a DFA from an NFA in subsection 8. Finite-state automata Deﬁnition 8. Fortunately. deterministic ﬁnite-state automata are preferable. is deﬁned by Lnfa(M ) = {w ∈ X ∗ | nfa accept w (d. Given a DFA M = (X. Furthermore. Q0 . Lnfa(M ).1. The extended transition function dfa and the accept function dfaAccept are now deﬁned by: dfa dfa :: [SymbolM] -> StateM = foldl (flip delta) start :: [SymbolM] -> Bool = elem (dfa xs) finals dfaAccept dfaAccept xs 129 . the language of M . and the symbols of M (the elements of the set X) are listed as constructors of SymbolM. so from a computational view. d. Q.8. Implementation This section describes how to implement ﬁnite state machines. For NFA M = (X. Determining whether or not a list is an element of the language of a deterministic ﬁnite-state automaton can be done in time linear in the length of the input list. for each nondeterministic ﬁnite-state automaton there exists a deterministic ﬁnitestate automaton that accepts the same language. F ).. . deriving Eq where the states of M (the elements of the set Q) are listed as constructors of StateM. F )} Note that it is computationally expensive to determine whether or not a list is an element of the language of a nondeterministic ﬁnite-state automaton. we deﬁne three values (one of which a function): start delta finals :: :: :: StateM SymbolM -> StateM -> StateM [StateM] Note that the ﬁrst two arguments of delta have changed places: this has been done in order to be able to apply ‘partial evaluation’ later.4. we deﬁne two datatypes: data StateM data SymbolM = = .. Q.6 (Language of an NFA)..1. We start with implementing DFA’s.

data StateMEX data SymbolMEX = = A | B | C | S SA | SB | SC deriving Eq So the state A is represented by A....xn]. Regular Languages Given a list of symbols [x1. delta x2 (delta x1 start). and similar for the other states and symbols...x2.. we introduce the following class: class Eq a => start :: delta :: finals :: dfa dfa DFA a b where a b -> a -> a [a] :: [b] -> a = foldl (flip delta) start :: [b] -> Bool = elem (dfa xs) finals dfaAccept dfaAccept xs Note that the functions dfa and dfaAccept are deﬁned once and for all for all instances of the class DFA. This list of states is determined uniquely by the input [x1.. and the symbol a is represented by SA..8. Since we want to use the same function names for diﬀerent automata. As an example.x2.... delta x1 start.x2. we give the implementation of the example DFA (called MEX here) given in the previous subsection. the computation of dfa [x1..xn] and the start state...xn] uses the following intermediate states: start.. The automaton is made an instance of class DFA as follows: instance DFA StateMEX SymbolMEX start = S delta x S = where delta SA A delta SC B finals = = = [C] case x of SA -> C SB -> A SC -> S B C 130 .

.xn] = foldl (flip delta) start [x1.xn] = foldl (flip delta) (delta x1 start) [x2.. Partial evaluation applies to ﬁnite automatons in the following way...... dfaS [] dfaS (x:xs) = = S case x of SA -> dfaC xs SB -> dfaA xs SC -> dfaS xs A case x of SA -> dfaB xs B case x of SC -> dfaC xs C dfaA [] dfaA (x:xs) dfaB [] dfaB (x:xs) dfaC [] = = = = = With this deﬁnition of the ﬁnite automaton.xn] SC -> foldl (flip delta) (delta SC start) [x2..xn] All these equalities are simple transformation steps for functional programs. and the second argument is one of the four states S. A.xn] SB -> foldl (flip delta) (delta SB start) [x2.. B..x2.. Since there are only a ﬁnite number of states (four... 131 ... we can deﬁne a transition function for each state: dfaS. The main idea of partial evaluation is to replace computations that are performed often at run-time by a single computation that is performed only once at compile-time.. dfaC :: [Symbol] -> State partial evaluation Each of these functions is a case expression over the possible input symbols. Note that the ﬁrst argument of foldl is always flip delta..8. the number of steps required for computing the value of dfaS xs for some list of symbols xs is reduced considerably. or C (the result of delta). to be precise).... A very simple example is the replacement of the expression if True then f1 else f2 by the expression f1... Finite-state automata We can improve the performance of the automaton (function) dfa by means of partial evaluation. dfaA..xn] = case x1 of SA -> foldl (flip delta) (delta SA start) [x2.1.. dfaB. dfa [x1.x2.

functions union and intersect are implemented as follows: union union nub nub :: Eq a => [[a]] -> [a] = nub .8. Regular Languages The implementation of NFA’s is similar to the implementation of DFA’s. concat :: Eq a => [a] -> [a] = foldr (\x xs -> x:filter (/=x) xs) [] intersect :: Eq a => [a] -> [a] -> [a] intersect xs ys = intersect’ (nub xs) where intersect’ = foldr (\x xs -> if x ‘elem‘ ys then x:xs else xs) [] 8. so the answer to the above question is no. map (delta a) nfaAccept nfaAccept xs :: [b] -> Bool = intersect (nfa xs) finals /= [] Here. in which we use some names that also appear in the class DFA. Before we give the formal proof of this claim. We will use the following class. we illustrate the construction of a DFA for an NFA in an example. This is a problem if the two classes appear in the same module. class Eq a => start :: delta :: finals :: nfa nfa NFA a b where [a] b -> a -> [a] [a] :: [b] -> [a] = foldl (flip deltas) start deltas :: b -> [a] -> [a] deltas a = union . The only diﬀerence is that the transition and accept functions have to take care of sets (lists) of states now.1. 132 . Constructing a DFA from an NFA Is it possible to express more languages by means of nondeterministic ﬁnite-state automata than by deterministic ﬁnite-state automata? For each nondeterministic automaton it is possible to give a deterministic ﬁnite-state automaton such that both automata accept the same language.4.

-. @A GFED 89:. / ?>=< B c We omit the proof that the language of the latter automaton is equal to the language of the former one. ?>=< A ba ED a 89:. We apply the same procedure as above.c c c b c ccc cc cc a 89:. We obtain the following automaton. Finite-state automata Consider the nondeterministic ﬁnite-state automaton corresponding with the example grammar of the previous subsection. ()*+ / ?>=< C _cc ? O cc cc cc a. ?>=< 89:. /. / ?>=< S a 89:.-. this automaton is still nondeterministic: there are two outgoing arcs labelled with c from D. @A GFED ?>=< / 89:.-. ?>=< 89:. @A GFED 89:. o S c The nondeterminicity of this automaton appears in state A: two outgoing arcs of A are labelled with an a. and remove the two outgoing arcs labelled c from D. ()*+ / ?>=< C euu {= F O uu uu {{ { uu {{ uu uu a {{{a c u b {{ c uuuu uu {{{{ u {u {{ c uu a 89:. Since D is a merge of the states S and B we have to merge the outgoing arcs from S and B into outgoing arcs of D. ()*+ / ?>=< C O c a 89:.-. /.1. ?>=< 89:. Add a new state E and an arc labelled c from D to E. /. Although there is just one outgoing arc from A labelled with a. /. / ?>=< B A D O @A BC b c a 89:. / ?>=< S BC 89:. ()*+ 89:. and we remove the arcs labelled a from A to S and from A to B. 89:. Since E is a merge of the states C and S we have to merge the outgoing arcs from C and S into outgoing arcs of E. Suppose we add a new state D. We obtain the following automaton. with an arc from A to D labelled a.8. ?>=< / ?>=< / ?>=< A D E B O ED b @A BC BC b 133 .

Q0 . Q0 . B}. F ) with X = {a. the construction works as follows. Then the ﬁnite-state automaton M = (X . State E is added to the set of accepting states because it is the merge of a set of states among which at least one belongs to the set of accepting states. F ). D. E} where transition function d is deﬁned by dSa = C dSb = A dSc = S dAa = D dBc = C dDa = C dDb = A dDc = E dEa = C dEb = A dEc = S This construction is called the ‘subset construction’. i. S. b. r ∈ q} Q0 = {Q0 } F = {p | p ∩ F = ∅. The DFA constructed from the NFA above is the 5-tuple (X. c} Q = {S. B} = {{ }. we do not prove that the language of this automaton is equal to the language of the previous automaton.8. Q. For example. F ) is a nondeterministic ﬁnite-state automaton. Q . d.e. this automaton is deterministic. d. {A. {A}. Q. C. E} F = {C. accepts the same language. the components of which are deﬁned below. Note that in this automaton for each state. {B}} For the other components of M we deﬁne d q a = {t | t ∈ d r a. subs Q is also called the powerset of Q. all outgoing arcs are labelled diﬀerently. X Q = X = subs Q where subs returns all subsets of a set. A. which until now consisted just of state C. B. Regular Languages Again. Suppose M = (X. d . subs :: {X} → {{X}} subs {A. provided we add the state E to the set of accepting states. In general. p ∈ Q } 134 .

nfaCS For example. Finite-state automata The proof of the following theorem is given in Section 8. [B]..xn] = foldl (flip deltas) (deltas x1 start) [x2.5.xn] = case x1 of SA -> foldl (flip deltas) (deltas SA start) [x2.1. nfaC.xn] SC -> foldl (flip deltas) (deltas SC start) [x2.. nfaB. 8.. nfa [x1. [A].....xn] Note that the ﬁrst argument of foldl is always flip deltas.S] (the possible results of deltas).. Partial evaluation of NFA’s Given a nondeterministic ﬁnite state automaton we can obtain a deterministic ﬁnite state automaton not just by means of the above construction. and the second argument is one of the six sets of states [S].xn] = foldl (flip deltas) start [x1.8.x2.. Theorem 8.. we can deﬁne a transition function for each state: nfaS.4.. nfaBS. But ﬁrst we show that the transformation from an NFA to a DFA is an instance of partial evaluation.... to be precise).. we can calculate as follows with function nfa.. nfaA [] nfaA (x:xs) = = A case x of SA -> nfaBS xs _ -> error "no transition possible" :: [Symbol] -> [State] 135 . Since there are only a ﬁnite number of results of deltas (six. [C].... Equipped with this knowledge we continue the exploration of regular languages in the following section. nfaA.7 (DFA for NFA).xn] SB -> foldl (flip deltas) (deltas SB start) [x2....S].7 enables us to freely switch between NFA’s and DFA’s.1.x2. For every nondeterministic ﬁnite-state automaton M there exists a ﬁnite-state automaton M such that Lnfa(M ) = Ldfa(M ) Theorem 8.. but also by means of partial evaluation.. [B. Just as for function dfa. [C.

Regular grammars This section deﬁnes regular grammars. Subsection 8. There is a symmetric deﬁnition for left-regular grammars. Regular Languages Each of these functions is a case expression over the possible input symbols. B ∈ N . A regular language is a language that is generated by a regular grammar. There are contextfree languages that are not regular. concatenation and Kleene-star. The regular grammars as deﬁned here are sometimes called right-regular grammars. R.10. then L∪M LM L ∗ is regular is regular is regular 136 . Here it suﬃces to know that regular languages form a proper subset of context-free languages and that we will proﬁt from their speciality in the recognition process. and if there is a nonterminal present. 8. To understand this. Because of its subtlety. we postpone this kind of proofs until Chapter 9. it occurs at the end. a special kind of context-free grammars. regular grammar Deﬁnition 8. an example of such a language is {an bn | n ∈ N}. A ﬁrst similarity between regular languages and context-free languages is that both are closed under union. From the deﬁnition it is clear that each regular language is context-free. The question is now: Is each context-free language regular? The answer is: No.8.8 (Regular Grammar). you have to know how to prove that a language is not regular. By partially evaluating the function nfa we have obtained a function that is the implementation of the deterministic ﬁnite state automaton corresponding to the nondeterministic ﬁnite state automaton. So in every rule there is at most one nonterminal. N.1 gives the correspondence between nondeterministic ﬁnite-state automata and regular grammars. S) in which all production rules in R are of one of the following two forms: A → xB A → x with x ∈ T ∗ and A. A regular grammar G is a context free grammar (T.9 (Regular Language). Let L and M be regular languages.2. regular language Deﬁnition 8. Theorem 8.2.

consider all pairs Y . Let GL = (T. then • for regular grammars. Z of nonterminals of G such that Y ⇒ Z. m ∈ N}.8. The following example illustrates this construction. there may exist more than one regular grammar for a given regular language and these regular grammars may be transformed into each other. As for context-free languages. respectively. RL . V . 137 . with W = S . it follows from the above that there exists a regular grammar for L∗ .11. and all W = S is obtained ∗ as follows. For each regular grammar G there exists a regular grammar G with start-symbol S such that L(G) = L(G ) and such that G has no productions of the form U → V and W → . a regular grammar with the same language but without productions of the form U → V and W → for all U . This is remarkable because context-free languages are not closed under these operations. each production of the form T → x and T → by T → xSM and T → SM . • we obtain a regular grammar for LM if we replace. We conclude this section with a grammar transformation: Theorem 8. Recall the language L = L1 ∩ L2 where L1 = {an bn cm | n. m ∈ N} and L2 = {an bm cm | n. NL . SM ) be regular grammars for L and M respectively. Given a regular grammar G. • since L∗ = { } ∪ LL∗ . SL ) and GM = (T. NM . remove all productions of the form W → for W = S. Regular grammars Proof. In other words: every regular grammar can be transformed to a form where every production has a nonempty terminal string in its right hand side (with a possible exception for S → ). with Z → z a production of the original grammar. in GL . Add productions Y → z to the grammar. and for each production U → xW add the production U → x. In addition to these closure properties. Finally. Remove all productions U → V from G. The proof of this transformation is omitted. First. See the exercises. and illustrate the construction with an example. regular languages are closed under intersection and complement too.2. and z not a single nonterminal. we only brieﬂy describe the construction of such a regular grammar. RM . the well-known union construction for context-free grammars is a regular grammar again.

S → C. It remains to remove productions of the form W → for W = S. S → aA S → bB S → c S → A → bB A → aA A → c A → B → bB B → C → c This grammar generates the same language as G. and the ∗ productions A → aA and A → bB since A ⇒ S. Step 2 ∗ Consider all pairs of nonterminals Y and Z. Furthermore. and has no productions of the form U → V . A → S.8. S → aA S → bB S → A S → C A → bB A → S A → B → bB B → C → c The grammar G of the desired form is constructed in 3 steps. and the production A → c since ∗ A ⇒ C. Regular Languages Consider the following regular grammar G. If Y ⇒ Z. and z not a single nonterminal. 138 . and we add the productions ∗ ∗ S → bB and S → since S ⇒ A. In the example we remove the productions S → A. Step 1 Let G equal G. add the productions Y → z to G . We obtain the grammar with the following productions. remove all productions of the form U → V from G . and the production S → c since S ⇒ C. with Z → z a production of the original grammar.

12 and 8. Regular grammars Step 3 Remove all productions of the form W → for W = S. The basis for both theorems is the direct correspondence between a production A → xB and a transition 89:. Applying this transformation to the above grammar gives the following grammar. / ?>=< B 139 . Here we show that regular grammars and nondeterministic ﬁnite-state automata are two sides of one coin. and for each production U → xW add the production U → x.2. For each NFA M there exists a regular grammar G such that Lnfa(M ) = L(G) 89:. and the language of the grammar G we thus obtain is equal to the language of grammar G.2. we introduced ﬁnite-state automata. formulated in the theorems 8.8.1. Equivalence of Regular grammars and Finite automata In the previous section.13 below. The equivalence consists of two parts. ?>=< A x Theorem 8. S → aA S → bB S → a S → b S → c S → A → bB A → aA A → a A → b A → c B → bB B → b C → c Each production in this grammar is of one of the desired forms: U → x or U → xV .7. 8.12 (Regular grammar for NFA). We will prove the equivalence using theorem 8.

Regular Languages Proof. F ) be a DFA for NFA M . Q. R. Q = N ∪ {Nf } 140 . A → x ∈ R} • The states of the automaton are the nonterminals of the grammar extended with a new state Nf .13 (NFA for regular grammar). d. Q. d. the formal proof can be found in the literature. the nonterminals of the grammar are the states of the automaton. {S}. B ∈ N. d A x = B} ∪ {A → | A ∈ F} Theorem 8. S. We will just sketch the construction. The construction consists of two steps: ﬁrst we give a direct translation of a regular grammar to an automaton and then we transform the automaton into a suitable shape. F ) where • The input alphabet of the automaton are the nonempty terminal strings (!) that occur in the rules of the grammar: X = {x ∈ T + | A. / ?>=< B R = {A → xB | A. S) where • • • • the terminals of the grammar are the input alphabet of the automaton. Let (X. Construct the grammar G = (X. the productions of the grammar correspond to the automaton transitions: a rule A → xB for each transition 89:. we will just sketch the construction. Again. Q. N.8. R. the formal proof can be found in the literature. B ∈ Q. ?>=< A x a rule A → for each terminal state A. the start state of the grammar is the start state of the automaton. From a grammar to an automaton. S) be a regular grammar without productions of the form U → V and W → for W = S. x ∈ X. Construct NFA M = (X. Let G = (T. In formulae: 89:. A → xB ∈ R} ∪ {x ∈ T + | A ∈ N. For each regular grammar G there exists a nondeterministic ﬁnite-state automaton M such that L(G) = Lnfa(M ) Proof.

. There is a minor ﬂaw in the automaton given above: the grammar and the automaton have diﬀerent alphabets. . . / ?>=< B In formulae: For all A ∈ N and x ∈ X: @ABC /GFED Nf d A x = {B | B ∈ N. . . Regular grammars • The transitions of the automaton correspond to the grammar productions: for each rule A → xB we get a transition ?>=< 89:. because of the direct correspondence between derivation steps in G and transitions in M . d p1 x2 = p2 . . The transformation is relatively easy and is depicted in the diagram below. Transforming the automaton to a suitable shape. xk+1 with k > 0 and xi ∈ T for all i. . This shortcoming will be remedied by an automaton transformation which yields an equivalent automaton with transitions labelled by elements of T (instead of T ∗ ).2. d pk xk+1 = q . if S → is a grammar production. we get a transition 89:.8. add new (nonﬁnal) states p1 .bp p p p p p p p p p p p p p p p p p b p2 pk xk+1 -b q It is intuitively clear that the resulting automaton is equivalent to the original one.. . A → xB ∈ R} ∪ {Nf | A → x ∈ R} • The ﬁnal states of the automaton are Nf and possibly S. . pk to the existing ones and new transitions d q x1 = p1 . A x for each rule A → x with nonempty x. 141 . Carry out this transformation for each M -transition d q x = q with |x| > 1 in order to get an automaton for G with the same input alphabet T . . ?>=< A x 89:.. F = {Nf } ∪ {S | S → ∈ R} Lnfa(M ) = L(G).xk+1 -b b q q x1 -b p1 x2 . In order to eliminate transition d q x = q where x = x1 x2 . b q x1 x2 .

and shows that regular expressions and regular grammars are equally expressive formalisms. S ∅ ∈ RET ∈ RET a ∈ RET R + S ∈ RET R S ∈ RET R∗ ∈ RET (R) ∈ RET where a ∈ T . S. 142 . the “semantics”) of a regular expression over T is a set of T sequences compositionally deﬁned on the structure of regular expressions. In formulae this reads. 6]. written as juxtaposition (so x concatenated with y is denoted by xy).e. and concatenation binds stronger than +. deﬁnes the language of a regular expression. and idempotent. As follows. regular expressions over alphabet T ). Deﬁnition 8. Examples of regular expressions are: (bc)∗ + ∅ + b( ∗) The language (i.14 (RET . and is the unit of it. We do not discuss implementations of (datatypes and functions for matching) regular expressions. This section deﬁnes regular expressions. the structure of terminal words. for all regular expressions R. for example. implementations can be found in the literature [8. The set RET of regular expressions over alphabet T is inductively deﬁned as follows: for regular expressions R. The operator + is associative. the concatenation operator.8. R + (S + U ) = (R + S) + U R+S = S+R R+R = R R (S U ) = (R S) U R = R (= R) regular expression Furthermore.3. ∗. commutative. Regular Languages 8. binds stronger than concatenation. the star operator. Regular expressions Regular expressions are a classical and convenient way to describe. and V . is associative.

An identiﬁer is an element in the language of the regular expression letter (letter + digit)∗ where letter digit = a + b + . one or more concatenations of b.. For example..15 (Language of a regular expression). we compute the language of the regular expression ( + bc)d.8. i. + z + A + B + . and idempotent. Note that the language Lreb∗ is the set consisting of zero. Lre(( + bc)d) = (Lre( + bc)) (Lre(d)) = (Lre( ) ∪ Lre(bc)){d} = ({ } ∪ (Lre(b))(Lre(c))){d} = { .. + Z = 0 + 1 + . set concatenation is associative with { } as its unit.3.. bcd} Regular expressions are used to describe the tokens of a language.e. Regular expressions Deﬁnition 8. three of which are identiﬁers. As an example of a language of a regular expression. It is deﬁned inductively by: Lre(∅) = ∅ Lre( ) = { } Lre(b) = {b} Lre(x + y) = Lre(x) ∪ Lre(y) Lre(xy) = Lre(x) Lre(y) Lre(x∗) = (Lre (x))∗ Since ∪ is associative. Lre(b∗) = ({b})∗ . and function Lre is well deﬁned.. + 9 143 . bc} {d} = {d.. the list if p then e1 else e2 contains six tokens. commutative.. Function Lre :: RET → {T ∗ } returns the language of a regular expression.

3. Suppose furthermore that the productions for A. For a speciﬁc example it is not diﬃcult to construct a regular grammar for a regular expression. Consider the following regular expression. • The nonterminal A with productions A → aA A → generates the language Lre(a∗). and C satisfy the conditions imposed upon regular grammars. In the beginning of this section we claimed that regular expressions and regular grammars are equivalent formalisms. nonterminal B generates the language Lre( ). and C. Suppose that nonterminal A generates the language Lre(a∗). the nonterminal B with production B → generates the language { }. • Nonterminal C with productions C → aC C → bC C → generates the language Lre((a + b)∗). It remains to construct productions for nonterminals A. but ﬁrst we illustrate the construction of a regular grammar out of a regular expressions in an example. R = a∗ + + (a + b)∗ We aim at a regular grammar G such that Lre(R) = L(G) and again we take a top-down approach. We will prove this claim later.8. 144 . • Since Lre( ) = { }. Regular Languages see subsection 2. B. and nonterminal C generates the language Lre((a + b)∗). We now give the general result. B.1. Then we obtain a regular grammar G with L(G) = Lre(R) by deﬁning S → A S → B S → C where S is the start-symbol of G.

In general. as we will brieﬂy explain in the proof of the following theorem. For our example DFA we obtain: S = aC + bA + cS A = aB B = cC C = It is easy to merge these four regular expressions into a single regular expression.16 (Regular Grammar for Regular Expression). we have: Theorem 8. For each regular expression R there exists a regular grammar G such that Lre(R) = L(G) The proof of this theorem is given in Section 8. To obtain a regular expression that generates the same language as a given regular grammar we go via an automaton.17 (Regular Expression for Regular Grammar). Merging the regular expressions obtained from a DFA that may loop is more complicated.4.4. Proofs This section contains the proofs of some of the theorems given in this chapter. we can use the theorems from the previous sections to obtain a DFA D such that L(G) = Ldfa(D) So if we can obtain a regular expression for a DFA D. Given a regular grammar G. we interpret each state of D as a regular expression deﬁned as the sum of the concatenation of outgoing terminal symbols with the resulting state.4. 145 .8. To obtain a regular expression for a DFA D. For each regular grammar G there exists a regular expression R such that L(G) = Lre(R) The proof of this theorem is given in Section 8. partially because this is a simple example. we have found a regular expression for a regular grammar. Proofs Theorem 8. 8.4.

{A. Q0 . B} = {{ }. (8.4. B}. F ) as follows. F )} deﬁnition of Ldfa Ldfa(M ) It follows that Lnfa(M ) = Ldfa(M ) provided nfa d Q0 w = dfa d Q0 w We prove this equation as follows. Proof of Theorem 8. {B}} For the other components of M we deﬁne d q a = {t | t ∈ d r a. nfa accept w (d. For example. Deﬁne the ﬁnite-state automaton M = (X . d . r ∈ q} Q0 = Q0 F We have Lnfa(M ) = = = = = = deﬁnition of Lnfa {w | w ∈ X ∗ . F ) is a nondeterministic ﬁnite-state automaton.8. {A}.7 Suppose M = (X. d.1. dfa d Q0 w ∈ F } deﬁnition of dfa accept {w | w ∈ X ∗ . p ∈ Q } 146 . subs {A. Q . Q0 . Q. (nfa d Q0 w ∩ F ) = ∅} deﬁnition of F {w | w ∈ X ∗ . F )} deﬁnition of nfa accept {w | w ∈ X ∗ . Regular Languages 8. Q0 . X Q = X = subs Q where subs returns the powerset of a set.1) = {p | p ∩ F = ∅. Q0 . dfa accept w (d . nfa d Q0 w ∈ F } assume nfa d Q0 w = dfa d Q0 w {w | w ∈ X ∗ .

4. The regular grammar without productions generates the language Lre(∅). For the other three cases the induction hypothesis is that there exist a regular grammar with start-symbol S1 that generates the language Lre(x). For the three base cases we argue as follows.2.16 The proof of this theorem is by induction to the structure of regular expressions. Proof of Theorem 8. assume deltas d = d foldl d Q0 w deﬁnition of dfa dfa d Q0 w So if deltas d = d equation (8. This equality follows from the following calculation.1) holds. t ∈ d r a} deﬁnition of deltas deltas d q a 8. We obtain a regular grammar with start-symbol S that generates the language Lre(x + y) by deﬁning S → S1 S → S2 We obtain a regular grammar with start-symbol S that generates the language Lre(xy) by replacing. and a regular grammar with start-symbol S2 that generates the language Lre(y).8.4. in the regular grammar that generates the language Lre(x). Proofs nfa d Q0 w = = = deﬁnition of nfa foldl (deltas d) Q0 w Q0 = Q0 . The regular grammar with production S → b generates the language Lre(b). 147 . d qa = = deﬁnition of d {t | r ∈ q. The regular grammar with the production S → generates the language Lre( ).

We obtain ¯ the deﬁnition of q by combining all pairs c and C such that d q c = C. if we can show that there exists a regular expression R such that Ldfa(D) = Lre(R) then the theorem follows. Regular Languages each production of the form T → a and T → tively.17 In sections 8. ¯ We prove Ldfa(D) = Lre(S). each production of the form T → a and T → by T → aS and T → S. We deﬁne a regular expression R such that Lre(R) = Ldfa(D) ¯ For each state q ∈ Q we deﬁne a regular expression q . where S1 is the start-symbol of the regular grammar that generates the language Lre(x). by T → aS2 and T → S2 . d. which we have to solve. ¯ q = if q ∈ F ¯ / then foldl (+) ∅ [cC | d q c = C] else + foldl (+) ∅ [cC | d q c = C] This gives a set of possibly mutually recursive equations. and by adding the productions S → S1 and S → .4. S. 8. Let D = (X. in the regular grammar that generates the language Lre(x).8. and we let R be S. F ) be a DFA such that L(G) = Ldfa(D).2. S.4 and 8. Proof of Theorem 8. respec- We obtain a regular grammar with start-symbol S that generates the language Lre(x∗) by replacing.1.1 we have shown that there exists a DFA D = (X.3. 148 . d. F ) such that L(G) = Ldfa(D) So. Q. Q. In solving these equations we use the fact that concatenation distributes over the sum operator: z(x + y) = zx + zy and that recursion can be removed by means of the star operator ∗: A = xA + z (where A ∈ z) ≡ A = x∗z / The algorithm for solving such a set of equations is omitted.

F )} deﬁnition of dfa accept {w | w ∈ X ∗ . namely. E abbreviates the fold expression ¯ q = +E ¯ deﬁnition of Lre. Suppose ax is a list of length n+1. (foldl d S w) ∈ F } assumption ¯ {w | w ∈ Lre(S)} equality for set-comprehensions ¯ Lre(S) It remains to prove the assumption in the above calculation: for w ∈ X ∗ . dfa accept w (d. deﬁnition of q ¯ ∈ Lre(¯) q The induction hypothesis is that for all lists w with |w| F ≡ w ∈ Lre(¯). Proofs Ldfa(D) = = = = = deﬁnition of Ldfa {w | w ∈ X ∗ . q (foldl d q (ax)) ∈ F ≡ ≡ ≡ deﬁnition of foldl (foldl d (d q a) x) ∈ F induction hypothesis x ∈ Lre(d ¯ a) q deﬁnition of q . (foldl d q ) ∈ F ≡ ≡ ≡ deﬁnition of foldl q∈F deﬁnition of q . S.8. (dfa d S w) ∈ F } deﬁnition of dfa {w | w ∈ X ∗ . For the base case w = calculate as follows. D is deterministic ¯ ax ∈ Lre(¯) q n we have (foldl d q w) ∈ 149 . ¯ (foldl d S w) ∈ F ≡ w ∈ Lre(S) We prove a generalisation of this equation. (foldl d q w) ∈ F ≡ w ∈ Lre(¯) q we This equation is proved by induction to the length of w.4. for arbitrary q.

Prove the converse of Theorem 8. 150 . Exercises Exercise 8. Regular Languages Summary This chapter discusses methods for recognising sentences from regular languages. Prove this claim. Exercise 8.4. 8. Exercise 8. By means of (non)deterministic ﬁnite-state automata we construct a recogniser that requires time linear in the length of the input list for recognising an input list. and introduces several concepts related to describing and recognising regular languages. The straightforward translation of a regular expression into a recogniser for the language of that regular expression results in a recogniser that is often very ineﬃcient. Regular languages are used for describing simple languages like ‘the language of identiﬁers’ and ‘the language of keywords’ and regular expressions are convenient for the description of regular languages.8. Given a regular grammar G for language L. Regular languages are closed under complementation. Hint: construct a ﬁnite automaton for L out of an automaton for regular language L. and an input list. a set of start states. Deﬁne a function ndfsa of type ({Q} → X → {Q}) → {Q} → X ∗ → {Q} which given a function d.7: show that for every deterministic ﬁnitestate automaton M there exists a nondeterministic ﬁnite-state automaton M such that Lda(M ) = Lna(M ) Exercise 8.3.5. Suppose that the state transition function d in the deﬁnition of a nondeterministic ﬁnite-state automaton has the following type d :: {Q} → X → {Q} Function d takes a set of states V and an element a. S S A A B B C C → → → → → → → → aA A aS B C cC a Exercise 8. Transform the grammar with the following productions to a grammar without productions of the form U → V and W → with W = S. returns the set of states in which the nondeterministic ﬁnite-state automaton can end after reading the input list. construct a regular grammar for L∗ .1.5. and returns the set of states that are reachable from V with an arc labelled a.2.

A → ) A → ( S → 0A S → 0B S → 1A A → 1 2.10. q2 ) x = (d1 q1 x. A → 0 B → 0 B → Exercise 8. 1.5. Regular languages are closed under intersection. + b( ∗) 2. Deﬁne nondeterministic ﬁnite-state automata that accept languages equal to the languages of the following regular grammars. (S1 . Exercise 8. and 000.8. d2 q2 x) Now prove that Ldfa(M ) = Ldfa(M1 ) ∩ Ldfa(M2 ) Exercise 8. Q1 . F1 ) and M2 = (X. Prove this claim using the result from the previous exercise. d1 . 1.12.7. Exercises Exercise 8. Deﬁne the (product) automaton M = (X. Give a regular expression for the language that consists of all lists of zeros and ones such that the segment 01 occurs nowhere in a list. Give regular expressions V and W .11.9. such that for all regular expressions R and S with S = ∅ Lre(R(S + V )) = Lre(R(S + W )) V and W may be expressed in terms of R and S. d2 . Prove that for arbitrary regular expressions R. Examples of sentences of this language are 1110. and T the following equivalences hold. 151 . (bc)∗ + ∅ 3. 2. with Lre(V ) = Lre(W ). Q2 . Describe the language of the following regular expressions. Give regular expressions S and R such that Lre(RS) = Lre(RS) = Lre(SR) Lre(SR) Exercise 8. a(b∗) + c∗ Exercise 8. S → (A S → S → )A 1. S1 . Lre(R(S + T )) Lre((R + S)T ) = Lre(RS + RT ) = Lre(RT + ST ) Exercise 8.6. S2 ). S2 . A direct proof form this claim is the following: Let M1 = (X. F1 × F2 ) by d (q1 . F2 ) be DFA’s for the regular languages L1 and L2 respectively. d.8. Q1 × Q2 . S.

T → 0T T → 1S Exercise 8. b @A GFED 89:. B → aC B → bB B → C → bB C → b S → 0S S → 1T S → 2. Deﬁne regular expressions for the languages of the following deterministic ﬁnite-state automata. Start state is S.8.14. Start state is S. Construct for each of the following regular expressions a nondeterministic ﬁnite-state automaton that accepts the sentences of the language. 1. ?>=< !"# / '&%$ S a @A GFED 89:. Transform the nondeterministic ﬁnite-state automata into deterministic ﬁnite-state automata. 1. Regular Languages Exercise 8. 1.16. S → bA S → aC S → A → bA A → 1. ?>=< /B BC a b 2. a∗ + b∗ + ab Exercise 8.15. (1 + (12) + 0)∗(30)∗ Exercise 8. a∗ + b∗ + (ab) 2.13.b @A GFED 89:. ?>=< !"# / '&%$ S O @A a b b b 152 . Give regular expressions of which the language equals the language of the following regular grammars. Give regular grammars that generate the language of the following regular expressions. /?>=< A O BC@A a. ((a + bb)∗ + c)∗ 2. b @A GFED 89:. /?>=< A 89:. ?>=< !"# / '&%$ B 89:.

153 . if P does not hold.9. Pumping Lemmas: the expressive power of languages Introduction In these lecture notes we have presented several ways to show that a language is regular or context-free. Although the ideas behind pumping lemmas are very simple. In the regular case this means that one may conclude that L is not regular. a precise formulation is not. Pumping lemmas are used to show that the expressive power of the diﬀerent elements of the Chomsky hierarchy is diﬀerent. pumping lemmas are used in the contrapositive way. In this chapter we ﬁll this gap by introducing the so-called Pumping Lemmas. which consists of four diﬀerent kinds of grammars and their corresponding languages. prove that a language is not context-free. it takes some eﬀort to get familiar with applying pumping lemmas. the pumping lemma for regular languages says IF language L is regular. every time yielding another word of L In applications. For example. THEN it has the following property P : each suﬃciently long sentence w ∈ L has a substring that can be repeated any number of times. give examples of languages that are not regular. As a consequence. explain the Chomsky hierarchy. context-free or none of these. Regular grammars and context-free grammars are part of the Chomsky hierarchy. but until now we did not give any means to show the nonregularity or noncontext-freeness of a language. and/or not context-free. identify languages and grammars as regular. Goals After you have studied this chapter you will be able to • • • • • prove that a language is not regular.

And if we put some more restrictions 154 . instead of a single nonterminal as in context-free grammars.9. where φ ∈ V + . The Chomsky hierarchy explains why the answer is no. in a context-free grammar you can rewrite nonterminals without looking at the context in which they appear.1.1. ψ ∈ V ∗ . Although context-sensitive grammars are less expressive than type-0 grammars. or context-sensitive grammar. in the context of lists of symbols φ and ψ. it is much easier to parse sentences from context-free languages. in which a production has the form φ → ψ. However. parsing is still very diﬃcult for context-sensitive grammars.1. You may now wonder: is it possible to express any language with these grammars? And: is it possible to obtain any context-free language from a regular grammar? The answer to these questions is no. where φ. Type-0 grammars The most powerful grammars are the type-0 grammars. where V is the set of symbols of the grammar.3. and the languages described by these grammars are the recursive enumerable languages. So a production describes how to rewrite a nonterminal A. Pumping Lemmas: the expressive power of languages 9. 9. Actually. δ ∈ V + . The Chomsky hierarchy In the preceding chapters we have seen contex-free grammars and regular grammars. In fact. Context-free grammars are less expressive than context-sensitive grammars. This expressive power comes at a cost though: it is very diﬃcult to parse sentences from type-0 grammars. In a type-1. Type-2 grammars The type-2 grammars are the context-free grammars which you have seen a lot in the preceding chapters. The Chomsky hierarchy consists of four elements. a sentence of length n can be parsed in time at most O(n3 ) (or even a bit less than this) for any sentence of a context-free language.1. each of which is explained below.1.2. it is impossible to look at the context when rewriting symbols. So the left-hand side of a production may consist of a list of nonterminal and terminal symbols. A language generated by a contextsensitive grammar is called a context-sensitive language. Type-1 grammars We can slightly restrict the form of the productions to obtain type-1 grammars. Type-0 grammars have the same expressive power as Turing machines. As the name says. 9. This statement can be proved using the pumping lemma for context-free languages. each production has the form φAψ → φδψ. 9. ψ ∈ V ∗ .

and ‘there exists x ∈ X : · · · ’ is false if X = ∅.4. Also remember that ‘for all x ∈ X : · · · ’ is true if X = ∅. we obtain linear-time algorithms for parsing sentences of such grammars. w i∈N : : xyz ∈ L and |y| n : : y = uvw and |v| > 0 : : xuv i wz ∈ L Note that |y| denotes the length of the string y. The pumping lemma for regular languages In this section we give the pumping lemma for regular languages. which takes linear time and constant space in the size of its input. z u. The idea behind this property is simple: regular languages are accepted by ﬁnite automata. 155 . Any sentence from a regular language can be processed by means of a ﬁnite-state automaton. The set of regular languages is strictly smaller than the set of context-free languages. Given a DFA for a regular language. a sentence of the language describes a path from the start state to some ﬁnite state. y. Theorem 9. Then there exists for all there exist for all n∈N x.4.1 (Regular Pumping Lemma).9. The proof of the following lemma is given in Section 9.2. When the length of such a sentence exceeds the number of states. 9. v. Type-3 grammars The type-3 grammars in the Chomsky hierarchy are the regular grammars. a fact we will prove below by means of the pumping lemma for regular languages. then at least one state is visited twice.1. Let L be a regular language. The lemma gives a property that is satisﬁed by all regular languages.2. 9. The property is a statement of the form: in sentences longer than a certain length a substring can be identiﬁed that can be duplicated while retaining a sentence. The pumping lemma for regular languages on context-free grammars (for example LL(1)). consequently the path contains a cycle that can be repeated as often as desired.

?>=< /. v = aq and w = ar with p + q + r = n and q > 0. y = an . v. The part that is accepted in between the two moments the automaton passes through state A can be pumped up to create sentences that contain an arbitrary number of copies of the string v = bca. As an example.-. Take for n the number of states in the automaton (5). z u. and z = bn . z be such that xyz ∈ L. the statement above holds for all such v (namely none!). a(bca)∗ bcd. y. Take s = an bn with x = . w be such that y = uvw with v = . / ?>=< B _cc cc cc cc c c a ccc cc cc 89:. we will prove that language L = {am bm | m 0} is not regular. they are used negatively to show that a given language does not belong to some family. This pumping lemma is useful in showing that a language does not belong to the family of regular languages.9. Let x. The statement of the pumping lemma amounts to the following. Then we know that in order to accept y. the above automaton has to pass at least twice through state A. Take i = 2. / ?>=< A b 89:. abcabcabcd. and since there is no v with |v| > 0 such that y = uvw. ?>=< C 89:. we can choose y = . this is the formulation we will use. Let n ∈ N. Pumping Lemmas: the expressive power of languages For example. v. Let u.1 enables us to prove that a language L is not regular by showing that for all there exist for all there exists n∈N x. 89:. Note that if n = 0. ?>=< S a 89:. w i∈N : : xyz ∈ L and |y| n : : y = uvw and |v| > 0 : : xuv i wz ∈ L In all applications of the pumping lemma in this chapter. and |y| n. Its application is typical of pumping lemmas in general. that is. u = ap . in general. ()*+ D d This automaton accepts: abcabcd. y. consider the following automaton. and. then 156 . Theorem 9.

1. w. This property is a statement of the form: in sentences exceeding a certain length.2. Show that the following language is not regular. and together with the fact that each regular grammar is also a context-free grammar it follows immediately that the set of regular languages is strictly smaller than the set of context-free languages.3.3. calculus ap+2q+r bn n+q =n arithmetic q>0 q>0 true Note that the language L = {am bm | m 0} is context-free. {x | x ∈ {a.9. The pumping lemma for context-free languages The Pumping Lemma for context-free languages gives a property that is satisﬁed by all context-free languages. Note that here we use the pumping lemma (and not the proof of the pumping lemma) to prove that a language is not regular. Prove that the following language is not regular {ak bm | k m 2k} Exercise 9. The pumping lemma for context-free languages xuv 2 wz ∈ L ⇐ ⇐ ⇐ ⇐ defn. Exercise 9. Prove that the following language is not regular {ak | k 2 ∈L p+q+r =n 0} Exercise 9. u. Exercise 9. two sublists of bounded length can be identiﬁed that can be 157 . z. This kind of proof can be viewed as a kind of game: ‘for all’ is about an arbitrary element which can be chosen by the opponent. {ak bl am | k > 5 ∧ l > 3 ∧ m l} 9. b}∗ ∧ nr a x < nr b x} where nr a x is the number of occurrences of a in x. v.3. Show that the following language is not regular.4. where winning means proving that a language is not regular. x. ‘there exists’ is about a particular element which you may choose. Choosing the right elements helps you ‘win’ the game.

9. Pumping Lemmas: the expressive power of languages

duplicated while retaining a sentence. The idea behind this property is the following. Context-free languages are described by context-free grammars. For each sentence in the language there exists a derivation tree. When sentences have a derivation tree that is higher than the number of nonterminals, then at least one nonterminal will occur twice in a node; consequently a subtree can be inserted as often as desired.

As an example of an application of the Pumping Lemma, consider the context-free grammar with the following productions.

S → aAb A → cBd A → e B → fAg

The following parse tree represents the derivation of the sentence acfegdb.

a

Sc c Ac Bc A e

cc cc cc

cc cc cc c cc cc cc c

b

c

d g

f

If we replace the subtree rooted by the lower occurrence of nonterminal A by the

158

9.3. The pumping lemma for context-free languages

**subtree rooted by the upper occurrence of A, we obtain the following parse tree.
**

Sc c Ac Bc

cc cc cc

a

cc cc cc c cc cc cc c

b

c

d g

f c

Ac Bc A e

cc cc cc c cc cc cc c

d g

f

This parse tree represents the derivation of the sentence acfcfegdgdb. Thus we ‘pump’ the derivation of sentence acfegdb to the derivation of sentence acfcfegdgdb. Repeating this step once more, we obtain a parse tree for the sentence acfcfcfegdgdgdb We can repeatedly apply this process to obtain derivation trees for all sentences of the form a(cf)i e(gd)i b for i 0. The case i = 0 is obtained if we replace in the parse tree for the sentence acfegdb the subtree rooted by the upper occurrence of nonterminal A by the subtree rooted by the lower occurrence of A:

ÐÐ ÐÐ ÐÐ ÐÐ

Sb b A e

bb bb b

a

b

This is a derivation tree for the sentence aeb. This step can be viewed as a negative pumping step. The proof of the following lemma is given in Section 9.4.

159

9. Pumping Lemmas: the expressive power of languages

Theorem 9.2 (Context-free Pumping Lemma). Let L be a context-free language. Then there exist for all there exist for all c, d z u, v, w, x, y i∈N : : : : c, d ∈ N z ∈ L and |z| c z = uvwxy and |vx | > 0 and |vwx | uv i wx i y ∈ L : : :

d

The Pumping Lemma is a tool with which we prove that a given language is not context-free. The proof obligation is to show that the property shared by all contextfree languages does not hold for the language under consideration. Theorem 9.2 enables us to prove that a language L is not context-free by showing that for all there exists for all there exists c, d z u, v, w, x, y i∈N : : : : c, d ∈ N z ∈ L and |z| c z = uvwxy and |vx | > 0 and |vwx | uv i wx i y ∈ L : : :

d

As an example, we will prove that the language T deﬁned by T is not context-free. Proof. Let c, d ∈ N. Take z = ar br cr with r = max(c, d). Let u, v, w, x, y be such that z = uvwxy, |vx| > 0 and |vwx | d Note that our choice for r guarantees that substring vwx has one of the following shapes: • vwx consists of just a’s, or just b’s, or just c’s. • vwx contains both a’s and b’s, or both b’s and c’s. So vwx does not contain a’s, b’s, and c’s. Take i = 0, then • If vwx consists of just a’s, or just b’s, or just c’s, then it is impossible to write the string uwy as as bs cs for some s, since only the number of terminals of one kind is decreased. = {an bn cn | n > 0}

160

9.4. Proofs of pumping lemmas

• If vwx contains both a’s and b’s, or both b’s and c’s it lies somewhere on the border between a’s and b’s, or on the border between b’s and c’s. Then the string uwy can be written as uwy uwy = as bt cr = ar bp cq

for some s, t, p, q, respectively. At least one of s and t or of p and q is less than r. Again this list is not an element of T .

Exercise 9.5. Why does vwx not contain a’s, b’s, and c’s? Exercise 9.6. Prove that the following language is not context-free {ak | k

2

0}

Exercise 9.7. Prove that the following language is not context-free {ai | i is a prime number } Exercise 9.8. Prove that the following language is not context-free {ww | w ∈ {a, b}∗ }

**9.4. Proofs of pumping lemmas
**

This section gives the proof of the pumping lemmas.

**9.4.1. Proof of the Regular Pumping Lemma, Theorem 9.1
**

Since L is a regular language, there exists a deterministic ﬁnite-state automaton D such that L = Ldfa D. Take for n the number of states of D. Let s be an element of L with sublist y such that |y| n, say s = xyz . Consider the sequence of states D passes through while processing y. Since |y| n, this sequence has more than n entries, hence at least one state, say state A, occurs twice. Take u, v, w as follows • u is the initial part of y processed until the ﬁrst occurrence of A, • v is the (nonempty) part of y processed from the ﬁrst to the second occurrence of A, • w is the remaining part of y

161

9. Pumping Lemmas: the expressive power of languages

Note that D could have skipped processing v, and hence would have accepted xuwz . Furthermore, D can repeat the processing in between the ﬁrst occurrence of A and the second occurrence of A as often as desired, and hence it accepts xuv i wz for all i ∈ N. Formally, a simple proof by induction shows that (∀i : i 0 : xuv i wz ∈ L).

**9.4.2. Proof of the Context-free Pumping Lemma, Theorem 9.2
**

Let G = (T, N, R, S) be a context-free grammar such that L = L(G). Let m be the length of the longest right-hand side of any production, and let k be the number of nonterminals of G. Take c = mk . In Lemma 9.3 below we prove that if z is a list with |z| > c, then in all derivation trees for z there exists a path of length at least k+1. Let z ∈ L such that |z| > c. Since grammar G has k nonterminals, there is at least one nonterminal that occurs more than once in a path of length k+1 (which contains k+2 symbols, of which at most one is a terminal, and all others are nonterminals) of a derivation tree for z. Consider the nonterminal A that satisﬁes the following requirements. • A occurs at least twice in the path of length k+1 of a derivation tree for z. Call the list corresponding to the derivation tree rooted at the lower A w, and call the list corresponding to the derivation tree rooted at the upper A (which contains the list w ) vwx . • A is chosen such that at most one of v and x equals . • Finally, we suppose that below the upper occurrence of A no other nonterminal that satisﬁes the same requirements occurs, that is, the two A’s are the lowest pair of nonterminals satisfying the above requirements. First we show that a nonterminal A satisfying the above requirements exists. We prove this statement by contradiction. Suppose that for all nonterminals A that occur twice on a path of length at least k+1 both v and x, the border lists of the list vwx corresponding to the tree rooted at the upper occurrence of A, are both equal to . Then we can replace the tree rooted at the upper A by the tree rooted at the lower A without changing the list corresponding to the tree. Thus we can replace all paths of length at least k+1 by a path of length at most k. But this contradicts Lemma 9.3 below, and we have a contradiction. It follows that a nonterminal A satisfying the above requirements exists. There exists a derivation tree for z in which the path from the upper A to the leaf has length at most k+1, since either below A no nonterminal occurs twice, or there is one or more nonterminal B that occurs twice, but the border lists v and x from the list v w x corresponding to the tree rooted at the upper occurrence of nonterminal B are empty. Since the lists v and x are empty, we can replace the tree rooted at

162

For i = 0 we have to show that the list uwy is a sentence in L. we have that |z| m = mj . and height t j.3. In this proof we apply the tree substitution process described in the example before the lemma. Let G be a context-free grammar. The proof of the Pumping Lemma above frequently refers to the following lemma.3 below that the length of vwx is at most mk+1 . the list corresponding to the subtree to the left (right) of the upper occurrence of A is u (y). For the base case. We prove this statement by induction on j. and suppose the top of the tree 163 . so we deﬁne d = mk+1 . This proves the induction step. then |z| mj . For the induction step. Proofs of pumping lemmas the upper occurrence of B by the tree rooted at the lower occurrence of B without changing the list corresponding to the derivation tree. but the root of t is not necessarily the start-symbol. Let t be a derivation tree for a sentence z ∈ L(G). The list uwy is obtained if the tree rooted at the upper A is replaced by the tree rooted at the lower A. assume that for all derivation trees t of height at most j we have that |z| mj .4. Let A be the root of t. that is. then |z| mj . Suppose we have a tree t of height j+1. S vv rr ```vvv Ò rrrÒÒ `` vvv rr Ò `` vvv rrr ÒÒÒ vv ` Ò rrr vv y u r A `` vv `` vv Ò rrr Ò `` vvv rrr ÒÒÒ rr ÒÒ `` vvv rr Ò v rr Ac v x cc cc cc w We prove by induction that (∀i : i 0 : uv i wx i y ∈ L). Suppose z = uvwxy. Proof. and suppose that the longest righthand side of any production has length m. We prove a slightly stronger result: if t is a derivation tree for a list z. The list uv i+1 wx i+1 y is obtained if the tree rooted at the lower A in the derivation tree for uv i wx i y ∈ L is replaced by the tree rooted above it A. Theorem 9. where z is the list corresponding to t.9. This situation is depicted as follows. suppose j = 1. If height t j. Since we can do this for all nonterminals B that occur twice below the upper occurrence of A. Suppose that for all i n we have uv i wx i y ∈ L. and since the longest right-hand side of any production has length m. there exists a derivation tree for z in which the path from the upper A to the leaf has length at most k+1. It follows from Lemma 9. Then tree t corresponds to a single production in G.

5.9. the list corresponding to the tree rooted at A has length at most m × mj = mj+1 . Summary This section introduces pumping lemmas. Pumping lemmas are used to prove that languages are not regular or not context-free. Is this language context-free? If it is.10. why not? Exercise 9. {A} S AA }{ 1. Exercises Exercise 9. so the induction hypothesis applies to these trees. give a regular grammar and prove that this grammar generates the language. why not? 2. give a regular grammar and prove that this grammar generates the language. For all trees s rooted at the symbols of v we have height s j. why not? 2. which proves the induction step. 9. and the lists corresponding to the trees rooted at the symbols of v all have length at most mj . and returns the number of occurrences of a in x. If it is not. If it is not.11. S → S → A → A → A → Consider the grammar G with the following productions. Is this language regular? If it is. Is this language regular? If it is. why not? Exercise 9. If it is not. Is this language context-free? If it is. give a context-free grammar and prove that this grammar generates the language. Exercise 9. Show that the following language is not regular. Consider the following language: {ai bj | 0 i j} 1. Is this grammar 164 . and since the longest right-hand side of any production has length m. If it is not. 1}∗ ∧ nr 1 x = nr 0 x} where function nr takes an element a and a list x. Since A → v is a production of G. give a context-free grammar and prove that this grammar generates the language.12. {x | x ∈ {0. Consider the following language: {wcw | w ∈ {a. Pumping Lemmas: the expressive power of languages corresponds to the production A → v in G. b}∗ } 1.9.

Give the language of G without referring to G itself.5. Exercises • Context-free? • Regular? Why? 2. Is the language of G • Context-free? • Regular? Why? Exercise 9. S → S → S → S → Consider the grammar G with the following productions. 0 1 S0 1.13. Prove that your description is correct. Is this grammar • Context-free? • Regular? Why? 2. 3.9. Is the language of G • Context-free? • Regular? Why? 165 . Prove that your description is correct. 3. Give the language of G without referring to G itself.

9. Pumping Lemmas: the expressive power of languages 166 .

2 describes an implementation in Haskell of LL(1) parsing and the diﬀerent kinds of grammar analyses needed for checking whether or not a grammar is LL(1). LL(1) parsing is an eﬃcient (linear in the length of the input string) method for parsing that can be used for all LL(1) grammars. such as determining whether or not a nonterminal can derive the empty string. and Section 10. LL Parsing: Background In the previous chapters we have shown how to construct parsers for sentences of context-free languages using combinator parsers. etc. This chapter discusses one such restriction: LL(1). LALR(1).1 describes the background of LL(1) parsing. A grammar is LL(1) if at each step in a derivation the next symbol in the input uniquely determines the production that should be applied. Goals After studying this chapter you will • know the deﬁnition of LL(1) grammars. not discussed in these lecture notes are LR(1). In order to determine whether or not a grammar is LL(1). This chapter is organised as follows. LL Parsing Introduction This chapter introduces LL(1) parsing. There are several ways in which we can put extra restrictions on context-free grammars such that we can parse sentences of the corresponding languages eﬃciently. the resulting parsers are sometimes a bit slow. • be able to apply diﬀerent kinds of grammar analyses in order to determine whether or not a grammar is LL(1). we introduce several kinds of grammar analyses. and determining the set of symbols that can appear as the ﬁrst symbol in a derivation from a nonterminal.1. • know how to parse a sentence from an LL(1) grammar. Other restrictions.10. SLR(1). 167 . Section 10. Since these parsers may backtrack. 10.

let grammar G be the grammar with the productions: S → aS | cS | b The stack machine for G accepts the input string aab because it can perform the following actions (the ﬁrst component of the state (before the |in the picture below) is the symbol stack and the second component of the state (after the |) is the unmatched (remaining part of the) input string): stack S aS S aS S b input aab aab ab ab b b and end with an empty input.2. So not all possible sequences of actions from the state (S. If there is at least one sequence of actions that ends with an empty input on an input string. and the input sentence cannot be accepted. the input 168 . If they are equal. A stack machine for a grammar G has a stack and an input.1. LL Parsing 10.10. A stack machine for G accepts an input if it can terminate with an empty input when starting with the start-symbol from G on the stack. The production is chosen nondeterministically.4. However. We will use this machine in the following subsections to illustrate why we need grammar analysis. it is popped from the stack and a right-hand side from a production of G for the nonterminal is pushed onto the stack. 2.4 and 7. then it is popped from the stack and compared with the next symbol of the input sequence. the machine signals an error. The stack machine we use in this section diﬀers from the stack machines introduced in Sections 5. For example. Expand: If the top stack symbol is a nonterminal. These actions are performed until the stack is empty. Match: If the top stack symbol is a terminal. if the machine had chosen the production S → cS in its ﬁrst step.1. A stack machine for parsing This section presents a stack machine for parsing sentences of context-free grammars. 1. If the stack symbol and the next input symbol do not match. and performs one of the following two actions.aab) lead to an empty input. it would have been stuck. then this terminal symbol is ‘read’.

10. Each of the examples exempliﬁes why diﬀerent kinds of grammar analyses are useful in parsing. b. LL Parsing: Background string is accepted. amongst others. the nondeterministic stack machine can act in a deterministic way by looking ahead one (or two) symbols of the sequence of input symbols.1. corresponding with a leftmost derivation of ccccba.1. the set of nonterminal symbols is {S. Some example derivations This section gives three examples in which the stack machine for parsing is applied. 10. A.2. for all three examples. B. c}. the start symbol is S. the stack machine is similar to a nondeterministic ﬁnite-state automaton. It turns out that. In this sense. The ﬁrst example Our ﬁrst example is gramm1. The stack machine produces. the following sequence. C}. The set of terminal symbols of gramm1 is {a. and the productions are S → cA | b A → cBC | bSA | a B → cc | Cb C → aS | ba We want to know whether or not the string ccccba is a sentence of the language of this grammar. stack S cA A cBC BC ccC cC C ba a input ccccba ccccba cccba cccba ccba ccba cba ba ba a Starting with S the machine chooses between two productions: S → cA | b 169 .

the second production cannot be applied. the only applicable production is the ﬁrst one. The second example A second example is the grammar gramm2 whose productions are S → abA | aa A → bb | bS Now we want to know whether or not the string abbb is a sentence of the language of this grammar. Determinicity is obtained by looking at the set of ﬁrsts of the nonterminals. From the productions C → aS | ba it is immediately clear that a string derived from C starts with either an a or a b. The machine chooses between three productions: A → cBC | bSA | a and again. and. the only applicable production is the ﬁrst one. Clearly. LL Parsing but. amongst others. the following sequence 170 . The machine chooses between two productions: B → cc | Cb The next symbol of the remaining string ccba to be recognised is. Deriving the sentence ccccba using gramm1 is a deterministic computation: at each step of the derivation there is only one applicable alternative for the nonterminal on top of the stack. leads to success. since the next symbol of the remaining string cccba to be recognised is c. After expanding A a match-action removes the leading c from cccba and cBC. after two match-actions. So now we have to derive the string ccba from BC. The machine chooses between two productions C → aS and C → ba. So now we have to derive the string cccba from A. since the ﬁrst symbol of the string ccccba to be recognised is c. The top stack symbol of BC is B. The ﬁrst production is applicable. After expanding B and performing two match-actions it remains to derive the string ba from C. The stack machine produces. b} is called the ﬁrst set of the nonterminal C. only the second one applies. The set {a. but the second production may be applicable as well.10. c. Since the next symbol in the remaining string to be recognised is a c. To decide whether it also applies we have to determine the symbols that can appear as the ﬁrst element of a string derived from B starting with the second production. From the above derivation we conclude the following. After expanding S a match-action removes the leading c from ccccba and cA. once again. The ﬁrst symbol of the alternative Cb is the nonterminal C.

However. looking ahead one symbol in the input string does not give suﬃcient information for choosing one of the two productions A → bb | bS for A. From the above derivation we conclude the following. Again. if we look at the ﬁrst two symbols ab.10. After expanding and matching it remains to derive the string bb from A. LL Parsing: Background stack S abA bA A bb b input abbb abbb bbb bb bb b Starting with S the machine chooses between two productions: S → abA | aa since both alternatives start with an a. The problem is that the lookahead sets (the lookahead set of a production N → α is the set of terminal symbols that can appear as the ﬁrst symbol of a string that can be derived from N starting with the production N → α.1. it is not suﬃcient to look at the ﬁrst symbol a of the string to be recognised. determinicity is obtained by analysing the set of ﬁrsts (of strings of length 2) of the nonterminals. after matching. Each string derived from A starting with the second production starts with a b and. since it is not possible to derive a string starting with another b from S. then we ﬁnd that the only applicable production is the ﬁrst one. the second production does not apply. 171 . the deﬁnition is given in the following subsection) of the two productions for S both contain a. If we look at the ﬁrst two symbols bb of the input string. leads to success). then we ﬁnd that the ﬁrst production applies (and. we can left-factor the grammar to obtain a grammar in which all productions for a nonterminal start with a diﬀerent terminal symbol. Deriving the string abbb using gramm2 is a deterministic computation: at each step of the derivation there is only one applicable alternative for the nonterminal on the top of the stack. Alternatively. Again.

However. we can apply the ﬁrst production. and each nonempty string derived from B starts with a b. since A can also derive the empty string.10. and which terminal symbols can follow upon a nonterminal in a derivation. Deriving the string acbab using gramm3 is a deterministic computation: at each step of the derivation there is only one applicable alternative for the nonterminal on the top of the stack. We do not explain the rest of the leftmost derivation since it does not use any empty strings any more. as required. Determinicity is obtained by analysing whether or not nonterminals can derive the empty string.2. LL Parsing The third example A third example is grammar gramm3 with the following productions: S → AaS | B A → cS | B → b Now we want to know whether or not the string acbab is an element of the language of this grammar. starts with an a. and then apply the empty string for A. Nonterminal symbols that can derive the empty sequence will play a central role in the grammar analysis problems which we will consider in Section 10. there does not seem to be a candidate production to start a leftmost derivation of acabb with. The stack machine produces the following sequence stack S AaS aS input acbab acbab acbab Starting with S the machine chooses between two productions: S → AaS | B since each nonempty string derived from A starts with a c. From the above derivation we conclude the following. 172 . producing aS which.

1. LL Parsing: Background 10. gramm1 is an LL(1) grammar. Deﬁnition 10. To formalise this deﬁnition. In the rest of this chapter we assume that we may look 1 symbol ahead. A grammar for which all derivations are deterministic with 1 symbol lookahead is called LL(1): Leftmost with a Lookahead of 1.1. The lookahead set of a production N → α is the set of terminal symbols that can appear as the ﬁrst symbol of a string that can be derived from N δ (where N δ appears as a tail substring in a derivation from the start-symbol) starting with the production N → α. For gramm2 we have: lookAhead lookAhead lookAhead lookAhead (S → abA) (S → aa) (A → bb) (A → bS) = = = = {a} {a} {b} {b} 173 . the answer to this question depends on the number of symbols we are allowed to look ahead. N → β of G: lookAhead (N → α) ∩ lookAhead (N → β) = ∅ Since all lookAhead sets for productions of the same nonterminal of gramm1 are disjoint. LL(1) is a desirable property of grammars. that is: for all productions N → α. So lookAhead (N → α) = {x | S ⇒ γN δ ⇒ γαδ ⇒ γxβ} For example. A grammar G is LL(1) if all pairs of productions of the same nonterminal have disjoint lookahead sets. Since all derivations of sentences of LL(1) grammars are deterministic.10. as the second example shows. LL(1) grammars The examples in the previous subsection show that the derivations of the example sentences are deterministic. An obvious question now is: for which grammars are all derivations deterministic? Of course. we deﬁne lookAhead sets.3. for the productions of gramm1 we have lookAhead lookAhead lookAhead lookAhead lookAhead lookAhead lookAhead lookAhead lookAhead (S → cA) (S → b) (A → cBC) (A → bSA) (A → a) (B → cc) (B → Cb) (C → aS) (C → ba) = = = = = = = = = {c} {b} {c} {b} {a} {c} {a.2 (LL(1) grammar). b} {a} {b} ∗ ∗ We use lookAhead sets in the deﬁnition of LL(1) grammar.1 (lookAhead set). Deﬁnition 10. provided we can look ahead one or two symbols in the input.

The lookAhead set of a production N → α. The set of ﬁrsts of a nonterminal N is the set of terminal symbols that can appear as the ﬁrst symbol of a string that can be derived from N : firsts N For example. • and which terminal symbols can follow upon a nonterminal in a derivation (follow). for gramm3 we have: firsts S = {a.4 (First). we also have to determine the set of terminal symbols with which a string derived from β can start. and determines whether or not the empty string can be derived from the nonterminal: empty N For example. that is α = P β.3 (Empty). in order to determine the lookAhead sets of productions. In each of the following deﬁnitions we assume that a grammar G is given. for some β? Then we have to determine the set of terminal symbols with which strings derived from P can start. Deﬁnition 10. But what if α starts with a nonterminal P . the lookAhead sets for both nonterminals S and A are not disjoint. gramm2 is an LL(2) grammar. How do we determine whether or not a grammar is LL(1)? Clearly. where α starts with a terminal symbol x. But if P can derive the empty string. c} firsts A = {c} firsts B = {b} = {x | N ⇒ xβ} ∗ = N ⇒ ∗ 174 . Function empty takes a nonterminal N . LL Parsing Here. where an LL(k) grammar for k 2 is deﬁned similarly to an LL(1) grammar: instead of one symbol lookahead we have k symbols lookahead. and it follows that gramm2 is not LL(1).10. for gramm3 we have: empty S = False empty A = True empty B = False Deﬁnition 10. we are interested in • whether or not a nonterminal can derive the empty string (empty). • which terminal symbols can appear as the ﬁrst symbol in a string derived from a nonterminal (firsts). to answer this question we need to know the lookahead sets of the productions of the grammar. is simply x. b. As you see.

10.1. LL Parsing: Background

We could have given more restricted deﬁnitions of empty and firsts, by only looking at derivations from the start-symbol, for example, empty N = S ⇒ αN β ⇒ αβ

∗ ∗

but the simpler deﬁnition above suﬃces for our purposes. Deﬁnition 10.5 (Follow). The follow set of a nonterminal N is the set of terminal symbols that can follow on N in a derivation starting with the start-symbol S from the grammar G: follow N = {x | S ⇒ αN xβ}

∗

For example, for gramm3 we have: follow S = {a} follow A = {a} follow B = {a} In the following section we will give programs with which lookahead, empty, firsts, and follow are computed.

Exercise 10.1. Give the results of the function empty for the grammars gramm1 and gramm2. Exercise 10.2. Give the results of the function firsts for the grammars gramm1 and gramm2. Exercise 10.3. Give the results of the function follow for the grammars gramm1 and gramm2. Exercise 10.4. Give the results of the function lookahead for grammar gramm3. Is gramm3 an LL(1) grammar ? Exercise 10.5. Grammar gramm2 is not LL(1), but it can be transformed into an LL(1) grammar by left factoring. Give this equivalent grammar gramm2’ and give the results of the functions empty, first, follow and lookAhead on this grammar. Is gramm2’ an LL(1) grammar? Exercise 10.6. A non-leftrecursive grammar for Bit-Lists is given by the following grammar (see your answer to Exercise 2.18): L → BR R B → | ,BR → 0 | 1

Give the results of functions empty, firsts, follow and lookAhead on this grammar. Is this grammar LL(1)?

175

10. LL Parsing

**10.2. LL Parsing: Implementation
**

Until now we have written parsers with parser combinators. Parser combinators use backtracking, and this is sometimes a cause of ineﬃciency. If a grammar is LL(1) we do not need backtracking anymore: parsing is deterministic. We can use this fact by either adjusting the parser combinators so that they don’t use backtracking anymore, or by writing a special purpose LL(1) parsing program. We present the latter in this section. This section describes the implementation of a program that parses sentences of LL(1) grammars. The program works for arbitrary context-free LL(1) grammars, so we ﬁrst describe how to represent context-free grammars in Haskell. Another consequence of the fact that the program parses sentences of arbitrary context-free LL(1) grammars is that we need a generic representation of parse trees in Haskell. The second subsection deﬁnes a datatype for parse trees in Haskell. The third subsection presents the program that parses sentences of LL(1) grammars. This program assumes that the input grammar is LL(1), so in the fourth subsection we give a function that determines whether or not a grammar is LL(1). Both this and the LL(1) parsing function use a function that determines the lookahead of a production. This function is presented in the ﬁfth subsection. The last subsections of this section deﬁne functions for determining the empty, ﬁrst, and follow symbols of a nonterminal.

**10.2.1. Context-free grammars in Haskell
**

A context-free grammar may be represented by a pair: its start symbol, and its productions. How do we represent terminal and nonterminal symbols? There are at least two possibilities. • The rigorous approach uses a datatype Symbol: data Symbol a b = N a | T b The advantage of this approach is that nonterminals and terminals are strictly separated, the disadvantage is the notational overhead of constructors that has to be carried around. However, a rigorous implementation of context-free grammars should keep terminals and nonterminals apart, so this is the preferred implementation. But in this section we will use the following implementation: • class isT isN isT Eq s => Symbol s :: s -> Bool :: s -> Bool = not . isN where

where isN and isT determine whether or not a symbol is a nonterminal or a terminal, respectively. This notation is compact, but terminals and nonterminals are no longer strictly separated, and symbols that are used as nonterminals

176

10.2. LL Parsing: Implementation

cannot be used as terminals anymore. For example, the type of characters is made an instance of this class by deﬁning: instance Symbol Char where isN c = ’A’ <= c && c <= ’Z’ that is, capitals are nonterminals, and, by deﬁnition, all other characters are terminals. A context-free grammar is a value of the type CFG: type CFG s = (s,[(s,[s])])

where the list in the second component associates nonterminals to right-hand sides. So an element of this list is a production. For example, the grammar with productions S→ AaS | B | CB A → SC | B→ A|b C→ D D→ d is represented as: exGrammar :: CFG Char exGrammar = (’S’, [(’S’,"AaS"),(’S’,"B"),(’S’,"CB") ,(’A’,"SC"),(’A’,"") ,(’B’,"A"),(’B’,"b") ,(’C’,"D") ,(’D’,"d") ] ) On this type we deﬁne some functions for extracting the productions, nonterminals, terminals, etc. from a grammar. start start prods prods :: CFG s -> s = fst :: CFG s -> [(s,[s])] = snd :: (Symbol s, Ord s) => CFG s -> [s] = unions . map (filter isT . snd) . snd

terminals terminals

177

10. LL Parsing

where unions :: Ord s => [[s]] -> [s] returns the union of the ‘sets’ of symbols in the lists. nonterminals nonterminals :: (Symbol s, Ord s) => CFG s -> [s] = nub . map fst . snd

Here, nub :: Ord s => [s] -> [s] removes duplicates from a list. symbols :: (Symbol s, Ord s) => CFG s -> [s] symbols grammar = union (terminals grammar) (nonterminals grammar) nt2prods :: Eq s => CFG s -> s -> [(s,[s])] nt2prods grammar s = filter (\(nt,rhs) -> nt==s) (prods grammar) where function union returns the set union of two lists (removing duplicates). For example, we have ?start exGrammar ’S’ ? terminals exGrammar "abd"

**10.2.2. Parse trees in Haskell
**

A parse tree is a tree, in which each internal node is labelled with a nonterminal, and has a list of children (corresponding with a right-hand side of a production of the nonterminal). It follows that parse trees can be represented as rose trees with symbols, where the datatype of rose trees is deﬁned by: data Rose a = Node a [Rose a] | Nil The constructor Nil has been added to simplify error handling when parsing: when a sentence cannot be parsed, the ‘parse tree’ Nil is returned. Strictly speaking it should not occur in the datatype of rose trees.

178

10.2. LL Parsing: Implementation

10.2.3. LL(1) parsing

This section deﬁnes a function ll1 that takes a grammar and a terminal string as input, and returns one tuple: a parse tree, and the rest of the inputstring that has not been parsed. So ll1 takes a grammar, and returns a parser with Rose s as its result type. As mentioned in the beginning of this section, our parser doesn’t need backtracking anymore, since parsing with an LL(1) grammar is deterministic. Therefore, the parser type is adjusted as follows: type Parser b a = [b] -> (a,[b]) Using this parser type, the type of the function ll1 is: ll1 :: (Symbol s, Ord s) => CFG s -> Parser s (Rose s) Function ll1 is deﬁned in terms of two functions. Function isll1 :: CFG s -> Bool is a function that checks whether or not a grammar is LL(1). And function gll1 (for generalised LL(1)) produces a list of rose trees for a list of symbols. ll1 is obtained from gll1 by giving gll1 the singleton list containing the start symbol of the grammar as argument. ll1 grammar input = if isll1 grammar then let ([rose], rest) = gll1 grammar [start grammar] input in (rose, rest) else error "ll1: grammar not LL(1)"

So now we have to implement functions isll1 and gll1. Function isll1 is implemented in the following subsection. Function gll1 also uses two functions. Function grammar2ll1table takes a grammar and returns the LL(1) table: the association list that associates productions with their lookahead sets. And function choose takes a terminal symbol, and chooses a production based on the LL(1) table. gll1 :: (Symbol s, Ord s) => CFG s -> [s] -> Parser s [Rose s] gll1 grammar = let ll1table = grammar2ll1table grammar -- The LL(1) table. nt2prods nt = filter (\((n,l),r) -> n==nt) ll1table -- nt2prods returns the productions for nonterminal -- nt from the LL(1) table selectprod nt t = choose t (nt2prods nt) -- selectprod selects the production for nt from the

179

The parse tree is a leaf (a node with -. grammar2ll1table :: (Symbol s. are deﬁned as follows. in ((Node s rts): rrs. ys)] = filter (\((x.rhs).vs) = gll1 grammar ss zs -. which returns the lookahead set of a production and is deﬁned in one of the following subsections.match let (rts. input) -. These functions use function lookaheadp. 180 . else ([Nil].rest) = gll1 grammar ss (tail input) in if s == head input then (Node s []: rts.The input cannot be parsed else -. in \stack input -> case stack of [] -> ([].Try to parse according to the production -.zs) = gll1 grammar (selectprod s t) input -.q) -> t ‘elem‘ q) l in rhs Note that function choose assumes that there is exactly one element in the association list in which the element t occurs.stack.[s])] grammar2ll1table grammar = map (\x -> (x. (rrs.no children). Ord s) => CFG s -> [((s. LL Parsing -.LL(1) table that should be taken when t is the next -.c).lookaheadp grammar x)) (prods grammar) choose :: Eq a => a -> [((b. vs) Functions grammar2ll1table and choose.p).symbol in the input.[s]). rest) -.[a])] -> c choose t l = let [((s.Parse the rest of the symbols on the -. which are used in the above function gll1.obtained from the LL(1) table from s. input) (s:ss) -> if isT s then -.expand let t = head input (rts.10.

It is deﬁned in terms of four functions.2. lookaheadn :: (Symbol s. lookaheadn grammar) (nonterminals grammar)) disjoint disjoint xss :: Ord s => [[s]] -> Bool = length (concat xss) == length (unions xss) Function lookaheadn computes the lookahead sets of a nonterminal by computing all productions of a nonterminal. (This function was called empty in Deﬁnition 10.Symbol s) => CFG s -> s -> Bool Function isEmpty takes a grammar and a nonterminal and determines whether or not the empty string can be derived from the nonterminal in the grammar. the fourth function is deﬁned in this subsection. All sets in a list of sets are disjoint if the length of the concatenation of these sets equals the length of the union of these sets. Symbol s) => CFG s -> [(s. Each of the ﬁrst three functions will be deﬁned in a separate subsection below. It is deﬁned in terms of a function lookaheadn. and computing the lookahead set of each of these productions by means of function lookaheadp.) • firsts :: (Ord s.10. LL Parsing: Implementation 10. nt2prods grammar 10.4.2.5. which determines whether or not all sets in a list of sets are disjoint. isll1 :: (Symbol s. • isEmpty :: (Ord s. 181 . Implementation of lookahead Function lookaheadp takes a grammar and a production. and returns the lookahead set of the production. Ord s) => CFG s -> Bool isll1 grammar = and (map (disjoint . and checking that all of these sets are disjoint. which computes the lookahead sets of a nonterminal. and a function disjoint. It does this by computing for each nonterminal of its argument grammar the set of lookahead sets (one set for each production of the nonterminal). Implementation of isLL(1) Function isll1 checks whether or not a context-free grammar is LL(1).[s])] Function firsts takes a grammar and computes the set of ﬁrsts of each symbol (the set of ﬁrsts of a terminal is the terminal itself).2. Ord s) => CFG s -> s -> [[s]] lookaheadn grammar = map (lookaheadp grammar) .3.

since the nonterminal symbol A can derive the empty string. and a production.[s])] Function follow takes a grammar and computes the follow set of each nonterminal (so it associates a list of symbols with each nonterminal). with the following productions: S→ AaS | B | CB A → SC | B→ A|b C→ D D→ d Consider the production S → AaS. The lookahead set of the production contains the set of symbols which can appear as the ﬁrst terminal symbol of a sequence of symbols derived from A. The lookahead set of the production contains the set of symbols which can appear as the ﬁrst terminal symbol of a sequence of symbols derived from S. the lookahead set also contains the set of symbols which can appear as the ﬁrst terminal symbol of a sequence of symbols derived from C.2. Function lookSet takes a predicate. since the nonterminal symbol S can derive the empty string. Symbol s) => CFG s -> [(s. But. • lookSet :: Ord s (s -> Bool) (s -> [s]) (s -> [s]) (s. and returns the lookahead set of the production. Ord s) => CFG s -> (s. But. LL Parsing • follow :: (Ord s.4. The lookahead set of the production contains the set of symbols which can appear as the ﬁrst terminal symbol of a sequence 182 . Now we deﬁne: lookaheadp :: (Symbol s. Finally. Function lookSet is introduced after the deﬁnition of function lookaheadp. the lookahead set also contains the symbol a. [s]) [s] => -> -> -> -> ------ isEmpty firsts? follow? production lookahead set Note that we use the operator ?. Consider the production A → SC. consider the production B → A.10. on the firsts and follow association lists. see Section 5. two functions that given a nonterminal return the ﬁrst and follow set.[s]) -> [s] lookaheadp grammar = lookSet (isEmpty grammar) ((firsts grammar)?) ((follow grammar)?) We will exemplify the deﬁnition of function lookSet with the grammar exGrammar. respectively.

The function scanrRhs is most naturally deﬁned in terms of the function scanr. The examples show that it is useful to have functions firsts and follow in which. In the exercises we give an alternative characterisation of foldRhs.2. since the nonterminal symbol A can derive the empty string. of course. We will also need a function scanrRhs which is like foldrRhs but accumulates intermediate results in a list. This function is somewhere in between a general purpose and an application speciﬁc function (we could easily have made it more general though). 183 . we can look up the terminal symbols which can appear as the ﬁrst terminal symbol of a sequence of symbols in some derivation from n and the set of terminal symbols which can follow the nonterminal symbol n in a sequence of symbols occurring in some derivation respectively. The examples also illustrate a control structure which will be used very often in the following algorithms: we will fold over right-hand sides. The following function makes this statement more precise.10. It turns out that the deﬁnition of function follow also makes use of a function lasts which is similar to the function firsts. for every nonterminal symbol n. there are always two possibilities: • either we continue folding and return the result of taking the union of the set obtained from the current element and the set obtained by recursively folding over the rest of the right-hand side • or we stop folding and immediately return the set obtained from the current element. But. most naturally deﬁned in terms of the function foldr. Whenever such a list for a symbol is computed. We continue if the current symbol is a nonterminal which can derive the empty sequence and we stop if the current symbol is either a terminal symbol or a nonterminal symbol which cannot derive the empty sequence. foldrRhs :: Ord s => (s -> Bool) -> (s -> [s]) -> [s] -> [s] -> [s] foldrRhs p f start = foldr op start where op x xs = f x ‘union‘ if p x then xs else [] The function foldrRhs is. While doing so we compute sets of symbols for all the symbols of the right-hand side which we encounter and collect them into a ﬁnal set of symbols. LL Parsing: Implementation of symbols derived from A. but which deals with last nonterminal symbols rather than ﬁrst terminal ones. the lookahead set also contains the set of terminal symbols which can follow the nonterminal symbol B in some derivation.

the function foldrRhs continues processing a right-hand side only if it encounters a nonterminal symbol for which p (so isEmpty in the lookSet instance lookaheadp) holds. LL Parsing scanrRhs :: Ord s => (s -> Bool) -> (s -> [s]) -> [s] -> [s] -> [[s]] scanrRhs p f start = scanr op start where op x xs = f x ‘union‘ if p x then xs else [] Finally. reverse We now return to the function lookSet.10.rhs) = foldrRhs p f (g nt) rhs The function lookSet makes use of foldrRhs to fold over a right-hand side. the set g nt (follow?nt in the lookSet instance lookaheadp) is only important for those right-hand sides for nt that consist of nonterminals that can all derive the empty sequence. Thus. scanrRhs p f start . As stated above.rhs) ? look dba ? look dba ? look d ? look ’S’ "AaS" ’S’ "B" ’S’ "CB" ’A’ "SC" 184 . we will also need a function scanlRhs which does the same job as scanrRhs but in the opposite direction. lookSet :: Ord s => (s -> Bool) -> (s -> [s]) -> (s -> [s]) -> (s. We can now (assuming that the deﬁnitions of the auxiliary functions are given) use the function lookaheadp instance of lookSet to compute the lookahead sets of all productions. look nt rhs = lookaheadp exGrammar (nt.[s]) -> [s] lookSet p f g (nt. The easiest way to deﬁne scanlRhs is in terms of scanrRhs and reverse. scanlRhs p f start = reverse .

Let us have a closer look at how these lookahead sets are obtained. and we obtain firsts? ’S’ ‘union‘ firsts? ’C’ == "dba" ‘union‘ "d" == "dba" Finally.10. Folding stops at ’a’ and we obtain firsts? ’A’ ‘union‘ firsts? ’a’ == "dba" ‘union‘ "a" == "dba" For the lookahead set of the production A → SC we fold over the right-hand side SC.2. LL Parsing: Implementation dba ? look ad ? look dba ? look b ? look d ? look d ’A’ "" ’B’ "A" ’B’ "b" ’C’ "D" ’D’ "d" It is clear from this result that exGrammar is not an LL(1)-grammar. We will have to use the functions firsts and follow and the predicate isEmpty for computing intermediate results. Folding stops at C since it cannot derive the empty sequence. For the lookahead set of the production A → AaS we fold over the right-hand side AaS. for the lookahead set of the production B → A we fold over the right-hand side A In this case we fold over the complete (one element) list and and we obtain firsts? ’A’ ‘union‘ follow? ’B’ == "dba" ‘union‘ "d" == "dba" 185 . The corresponding subsections explain how to compute these intermediate results.

Therefore we only have to consider the productions in which no terminal symbols appear in the right-hand sides. until the set does not change anymore. and repeatedly applies the function to the set.2. which takes a function and a set. LL Parsing The other lookahead sets are computed in a similar way. This subsection deﬁnes this function. S → B | CB A → SC | B → A C → D One can immediately see from those productions that the nonterminal A derives the empty string in one step. We now give the Haskell implementation of the algorithm described above. 10. To know whether there are any nonterminals which derive the empty string in more than one step we eliminate the productions for A and we eliminate all occurrences of A in the right hand sides of the remaining productions S → B | CB B → C → D One can now conclude that the nonterminal B derives the empty string in two steps. For this purpose we use function fixedPoint. Consider the grammar exGrammar. | C 186 . Doing the same with B as we did with A gives us the following productions S → C → D One can now conclude that the nonterminal S derives the empty string in three steps.10. The algorithm is iterative: it does the same steps over and over again until some desired condition is met.6. Implementation of empty Many functions deﬁned in this chapter make use of a predicate isEmpty. Doing the same with S as we did with A and B gives us the following productions C → D At this stage we can conclude that there are no more new nonterminals which derive the empty string. We are now only interested in deriving sequences which contain only nonterminal symbols (since it is impossible to derive the empty string if a terminal occurs). which tests whether or not the empty sequence can be derived from a nonterminal.

"b").2.(’d’."A").7. The set of ﬁrsts each symbol consists of that symbol itself.(’A’. Consider the grammar exGrammar again.(’b’. We start with the empty set of nonterminals. Ord s) => CFG s -> s -> Bool isEmpty grammar = (‘elem‘ emptySet grammar) The emptySet of a grammar is obtained by the iterative process described in the example above.2. Function isEmpty determines whether or not a nonterminal can derive the empty string."C"). The ﬁrst set of a terminal symbol is the terminal symbol itself.(’a’. just as the function isEmpty. emptySet :: (Symbol s.10. and at each step n of the computation of the emptySet as a fixedPoint."d") ] 187 .rhs) -> all (‘elem‘ set) rhs) (prods grammar) ) ) 10. We start the iteration with [(’S’. we add the nonterminals that can derive the empty string in n steps."B")."a"). plus the (ﬁrst) symbols that can be derived from that symbol in one or more steps. Ord s) => CFG s -> [s] -> [s] emptyStepf grammar set = nub (map fst (filter (\(nt.(’D’. LL Parsing: Implementation fixedPoint fixedPoint f xs :: Ord a => ([a] -> [a]) -> [a] -> [a] | xs == nexts = xs | otherwise = fixedPoint f nexts where nexts = f xs fixedPoint f is sometimes called the ﬁxed-point of f.(’C’. A nonterminal can derive the empty string if it is a member of the emptySet of a grammar. So the set of ﬁrsts can be computed by an iterative process. isEmpty :: (Symbol s. and returns for each symbol of the grammar (so also the terminal symbols) the set of terminal symbols with which a sentence derived from that symbol can start. Ord s) => CFG s -> [s] emptySet grammar = fixedPoint (emptyStepf grammar) [] emptyStepf :: (Symbol s.(’B’."S")."D") . Function emptyStepf adds a nonterminal if there exists a production for the nonterminal of which all elements can derive the empty string. Implementation of ﬁrst and last Function firsts takes a grammar.

"d")] In two steps we can derive [(’S’."d") ] Function firsts is deﬁned as the fixedPoint of a step function that iterates the ﬁrst computation one more step.[s])] firsts grammar = fixedPoint (firstStepf grammar) (startSingle grammar) startSingle :: (Ord s."Dd") ."D").(’D’. Ord s) => CFG s -> [(s. [(’S’.(’A’."ABC"). We repeat this process until the list doesn’t change anymore."b") .(’D’."")] and again we have to take the union of this list with the previous result."CD").(’d’."CDd") .(’d’."a")."S")."S").(’B’."BAb"). 188 ."Ab")."b").(’D’.(’a’.(’C’. LL Parsing Using the productions of the grammar we can derive in one step the following lists of ﬁrst symbols.(’A’. At each of these iteration steps we have to add the start list with which the iteration started again. Symbol s) => CFG s -> [(s. firsts :: (Symbol s.(’C’."AS").[x])) (symbols grammar) The step function takes the old approximation and performs one more iteration step."SABC").(’B’.(’A’."Dd") ."a") .(’B’."SABCDabd") ."SABCDabd") ."d")] and the union of these two lists is [(’S’."d").(’C’.(’D’.(’A’.[s])] startSingle grammar = map (\x -> (x."SAbD"). For exGrammar this happens when: [(’S’.(’C’.(’a’."ABC").(’b’. The fixedPoint starts with the list that contains all symbols paired with themselves.(’B’.(’b’.10."SABCDabd") .

bs) ((c.foldrRhs (isEmpty grammar) single [] rhs ) ) (prods grammar) ) ) where group groups elements with the same ﬁrst element together group group :: Eq a => [(a.[s])] firstStepf grammar approx = (startSingle grammar) ‘combine‘ (compose (first1 grammar) approx) combine :: Ord s => [(s. 189 .bs) rest) compose :: Ord a => [(a.r1] Finally.[s])] -> [(s. and combines the results for the diﬀerent nonterminals.ds) : (insert (a. function first1 computes the direct ﬁrst symbols of all productions.ds):(insertPair (a. Ord s) => CFG s -> [(s.bs) [] = [(a.[a])] compose r1 r2 = [(a.[b])] = [(a. and unions returns the union of a set of sets.fs) -> (nt.[s])] -> [(s.ds):rest) if a==c then (c.[b])] = else (c.b) rest) insertPair :: Eq a => (a.[a])] -> [(a.[b])] -> [(a. taking into account that some nonterminals can derive the empty string. union bs ds) : rest | otherwise = (c.[s])] -> [(s.[b])] = foldr insertPair [] [(a. unions (map (r2?) bs)) | (a.b) -> insertPair (a.2.bs)] insert (a.[a])] -> [(a.bs) <.10.b) [] insertPair (a.[s])] combine xs = foldr insert xs where insert (a.rhs) -> (nt.b) ((c.(b:ds)):rest function single takes an element and returns the set with the element.[s])] first1 grammar = map (\(nt.ds):rest) | a == c = (a.unions fs)) (group (map (\(nt. LL Parsing: Implementation firstStepf :: (Ord s.b)] -> [(a. first1 :: (Symbol s. Symbol s) => CFG s -> [(s.

an ’A’."d"). The lists of all nonterminal symbols that can appear at the end of sequences of symbols derived from "A" and "Aa" are "ADC" and "" respectively. We can compute pairs of such adjacent symbols by splitting up the right-hand sides with length at least 2 and.2. and a ’B’ can be followed by a ’d’.(’C’."ad"). compute the symbols which appear at the end resp. The lists of all terminal symbols which can appear at the beginning of sequences of symbols derived from "aS" and "S" are "a" and "dba" respectively. which takes a grammar. at the beginning of strings which can be derived from the left resp.(’A’. reverseGrammar 10. Zipping together those lists shows that an ’A’. Implementation of follow The ﬁnal function we have to implement is the function follow. Let’s see what happens with the alternative "AaS". using lasts and firsts. "CB" and "SC". right part of the split alternative. A nonterminal n is associated to a list containing terminal t in follow if n and t follow each other in some sequence of symbols occurring in some leftmost derivation. This idea is implemented in the following functions. The lists produced by these functions are exactly the ones we need: using the function zip from the 190 . a ’D’ and a ’C’ can be be followed by an ’a’.al) -> (s."adb").(’D’."adb")] The function follow uses the functions scanrAlt and scanlAlt. Function follow uses the functions firsts and lasts and the predicate isEmpty for intermediate results.8. LL Parsing Function lasts is deﬁned using function firsts.rhs) -> (nt. Ord s) => CFG s -> [(s. a ’b’ and an ’a’ From the second pair we see that an ’S’. reverseGrammar :: Symbol s => CFG s -> CFG s reverseGrammar = \(s.10. and returns an association list in which nonterminals are associated to symbols that can follow upon the nonterminal in a derivation. The previous subsections explain how to compute these functions. a ’D’."d").[s])] = firsts . Combining all these results gives: [(’S’. a ’C’.reverse rhs)) al) lasts lasts :: (Symbol s.map (\(nt. Splitting the alternative "SC" in the middle produces sets of ﬁrsts and sets of lasts "SDCAB" and "d". Splitting the alternative "CB" in the middle produces sets of ﬁrsts and sets of lasts "CD" and "dba". Suppose we reverse the right-hand sides of all productions of a grammar. Then the set of ﬁrsts of this reversed grammar is the set of lasts of the original grammar.(’B’. From the ﬁrst pair we can see that a ’C’ and a ’D’ can be followed by a ’d’. Our grammar exGrammar has three alternatives with length at least 2: "AaS".

zip (rightscan rhs) (leftscan rhs) 191 .xs) -> (x. Ord s) => [(s.nub rhs)) (group pairs) where pairs = [(l.[s])] followNE grammar = splitProds (prods grammar) (isEmpty grammar) (isTfirsts grammar) (isNlasts grammar) where isTfirsts = map (\(x. f) | rhs <.[s])] follow grammar = combine (followNE grammar) (startEmpty grammar) startEmpty grammar = map (\x -> (x. followNE :: (Symbol s. and removes all nonterminals from the set of ﬁrsts. Function splitProds splits the productions of length at least 2. and pairs the last nonterminals with the ﬁrst terminals. Function followNE passes the right arguments on to function splitProds.[])) (symbols grammar) The real work is done by functions followNE and function splitProds.isEmpty [(s. Thus.terminal firsts [(s.[s])] -> -. LL Parsing: Implementation standard prelude we can combine the lists.10. "a".[s])] -> -. firsts isNlasts = map (\(x. "SDCAB"] ["dba". For example: for the alternative "AaS" the functions scanlAlt and scanrAlt produce the following lists: [[]. Ord s) => CFG s -> [(s. Ord s) => CFG s -> [(s.rhs) -> (nt.productions (s -> Bool) -> -.nonterminal lasts [(s.filter isT xs)) . length rhs >= 2 . "DC". We start the computation of follow with assigning the empty follow set to each symbol: follow :: (Symbol s. []] Only the two middle elements of both lists are important (they correspond to the nontrivial splittings of "AaS").[s])] -> -. "ADC". lasts splitProds :: (Symbol s. and all terminals from the set of lasts. "dba".2.xs) -> (x. we only have to consider alternatives of length at least 2. (fs.[s])] splitProds prods p fset lset = map (\(nt.map snd prods .filter isN xs)) . ls) <.

8. Give the Rose tree representation of the parse tree corresponding to the derivation of the sentence acbab using grammar gramm3. It can be shown that foldrRhs p f [] = unions . Exercise 10. Give the Rose tree representation of the parse tree corresponding to the derivation of the sentence abbb using grammar gramm2’ deﬁned in the exercises of the previous section. the following operations may be used: [] union unions single :: :: :: :: [a] [a] → [a] → [a] [[a]] → [a] a → [a] the the the the empty set of a-elements union of two sets generalised union singleton function These operations satisfy the well-known laws for set operations. Give an informal description of the functions foldrRhs p f [] and foldrRhs p f start.10. Deﬁne list2Set as a foldr. 3. . 5. From the deﬁnitions it is clear that grammar analysis is easily expressed via a calculus for (ﬁnite) sets. Since the code in this module is obscured by several implementation details we will derive the functions foldrRhs and scanrRhs in a stepwise fashion. 4. Deﬁne prefplus p as a foldr.10. 6. Exercise 10. In this derivation we will use the following: A (ﬁnite) set is implemented by a list with no duplicates. Give the Rose tree representation of the parse tree corresponding to the derivation of the sentence ccccba using grammar gramm1. prefplus p for all set-valued functions f.fs = scanlRhs p (lset?) [] = scanrRhs p (fset?) [] Exercise 10. Deﬁne a function prefplus p :: [a] → [a] which given a list xs returns the set of elements in the longest preﬁx of xs all of whose elements satisfy p together with the ﬁrst element of xs that does not satisfy p (if this element exists at all). 2.7. In this exercise we will take a closer look at the functions foldrRhs and scanrRhs which are the essential ingredients of the implementation of the grammar analysis algorithms. In order to construct a set. 192 . A calculus for ﬁnite sets is implicit in the programs for LL(1) parsing.ls f <. Deﬁne a function list2Set :: [a] → [a] which returns the set of elements occurring in the argument. Show that prefplus p = foldrRhs p single []. map f . Deﬁne a function pref p :: [a] → [a] which given a list xs returns the set of elements corresponding to the longest preﬁx of xs all of whose elements satisfy p. 1. LL Parsing .9. Exercise 10. 7. ] leftscan rightscan l <.

The function scanrRhs is deﬁned is a similar way scanrRhs p f start = map (foldrRhs p f start) . For terminal symbols s these functions are deﬁned by empty s = False firsts s = {s} Using the deﬁnitions in the previous exercise. LL Parsing: Implementation 8. The standard function scanr is deﬁned by scanr f q0 = map (foldr f q0) . The computation of the functions empty and firsts is not restricted to nonterminals only. For the example grammar gramm1 and two of its productions A → bSA and B → Cb. 1. Give an informal description of the function scanrRhs. tails where tails is a function which takes a list xs and returns a list with all tailsegments (postﬁxes) of xs in decreasing length.10. Exercise 10. compute the following.11. a) foldrRhs empty firsts [] bSA b) foldrRhs empty firsts [] Cb 2. For the example grammar gramm3 and its production S → AaS a) foldrRhs empty firsts [] AaS b) scanrRhs empty firsts [] AaS tails 193 .2.

LL Parsing 194 .10.

They can recognize sentences described by ambiguous grammars by returning a list of solutions. including the so-called LL(1)-property to which grammars must abide. The parsing algorithms are normally referred to as LL(k) or (LR(k). from a ﬁle. So these algorithms are both suitable for reading an input stream.g. we cannot return an empty list of successes anymore. In Chapter 10 we studied the LL(1) parsing algorithm extensively. LL versus LR parsing The parsers that were constructed using parser combinators in Chapter 3 are nondeterministic. intead there should be some mechanism of raising an error. There are two fundamentally diﬀerent deterministic parsing algorithms: • LL parsing. also known as bottom-up parsing The ﬁrst ‘L’ in these acronyms stands for ‘Left-to-right’. Here. where k is the number of unread symbols that the parser is allowed to ‘look ahead’. as the parsing mimics doing a leftmost derivation of a sentence (see Section 2. They impose some additional constraints on the form of the grammar: not every context free grammar is allowable by these algorithms. [a]) Ambiguous grammars are not allowed anymore. consisting of a parse tree of type b and the remaining part of the input string: type Parser a b = [a] -> (b. Also. There are various algorithms for deterministic parsing. Next. we turn to the LR(1) algorithm. In most practical cases.4). In this chapter we start with an example application of the LL(1) algorithm. also known as top-down parsing • LR parsing. input in processed in the order it is read. The second ‘L’ in ‘LL-parsing’ stands for ‘Leftmost derivation’. In chapter 10 we turned to deterministic parsers. each solution contains a parse tree and the part of the input that was left unprocessed. e. [a]) ] where a denotes the alphabet type. 195 . The ‘R’ in ‘LR-parsing’ stands for ‘Rightmost derivation’ of the sentences. that is. parsers have only one result. Nondeterministic parsers are of the following type: type Parser a b = [a] -> [ (b. k is taken to be 1. rather than just one solution. in the case of parsing input containing syntax errors. and b denotes the parse tree type. The parsing algorithms can be modelled by making use of a stack machine.11.

1. Hence the result of the function is simply a boolean value: check :: String -> Bool check input = run [’S’] input As in chapter 10. Now the function run takes. it merely checks whether or not the input string is a sentence. we push rs on the stack. the input should begin with it. the top of the stack. the function is implemented by calling a generalized function run with a singleton containing just the root nonterminal. 11. when called with an empty stack and an empty string. case distinction is done on a. Function run is deﬁned as follows: run :: Stack -> run [] [] run [] (x:xs) run (a:as) input String -> Bool = True = False | isT a = && && | isN a = where rs = select a (hd not(null input) a==hd input run as (tl input) run (rs++as) input input) So.11. This is a simple tail recursive function. a list of symbols (which is initially called with the above-mentioned singleton). and that’s why the function is named run: it runs the stack machine. The new symbols rs that are pushed on the stack is the right hand side of a production rule for nonterminal a.1. we deﬁne a slightly simpliﬁed version of the algorithm: it doesn’t return a parse tree. generalized in the sense that the function takes an additional list of symbols that need to be recognized. as elements are prepended at its front.2. LL(1) parser example The LL(1) parsing algorithm was implemented in Section 10. Function ll1 deﬁned there takes a grammar and an input string. Therefore we refer to the whole operation as a stack machine. The function was implemented by calling a generalized function named gll1. as there is junk input present. If it is a terminal. An LL(1) checker Here. and returns a single result consisting of a parse tree and rest string.1. That generalization is then called with a singleton list containing only the root nonterminal. it fails.3. the function succeeds. and removed at the front in other occasions. but the input is not. in addition to the input string. If the stack is nonempty. If the stack is empty. In the case of a nonterminal. and leave the input unchanged in the recursive call. and the machine is called recursively for the rest of the input. and could imperatively be implemented in a single loop that runs while the stack is non-empty. It is selected from the available alternatives by function 196 . LL versus LR parsing 11. so it need not be concerned with building it. That list is used as some kind of stack.

LL(1) parser example select. P and M . So let’s study an example grammar: arithmetic expressions with operators of two levels of precedence.5.2.1. this grammar doesn’t abide to the LL(1)-property.5.1. hence the call to snd.11. select a alternative . 197 .2. from all productions of the grammar returned by prods. Unfortunately. we need as much additional notions as there are levels of precedence).8. prods) gram This is how the x = (snd is selected: where ok p@(n.7: we need auxiliary notions of ‘Term’ and ‘Factor’ (actually. An LL(1) grammar The idea for the grammar for arithmetical expressions was given in Section 2. its application in a simple case is rather intuitive.2. we also add an additional rule. and of it only the right hand side is needed. 11. The most straightforward deﬁnition of the grammar is shown in the left column below._) = n==a && x ‘elem‘ lahP gram p So. and moreover the ﬁrst input symbol x is member of the lookahead set of this production. It is described in Sections 10. which says that the input consists of an expression followed by end-of-input (designated as ‘#’ here). if the nonterminal n is a. The result is a grammar where there is only one rule for E and T . as described in Section 2. This is where you can see that the algorithm is LL(1): it is allowed to look ahead one symbol.4. Now a production is ok. the ok ones are taken. the ﬁrst input symbol is passed to select as well. Determining the lookahead sets of all productions of a grammar is the tricky part of the LL(1) parsing algorithm. Though the general formulation of the algorithm is quite complex. the one we are looking for. For being able to treat end-of-input as if it were a character. of which there should be only one (this is ensured by the LL(1)-property). For making the choice. The reason for this is that the grammar contains rules with common preﬁxes on the right hand side (for E and T ).5 to 10. that single production is retrieved by hd. and the non-common part of the right hand sides is described by additional nonterminals. The resulting grammar is shown in the right column below. filter ok . A way out is the application of a grammar transformation known as left factoring. hd .

the parse tree should reﬂect that the 2 and 3 should be multiplied. the parse tree does so: 198 . LL versus LR parsing E→T E→T +E T →F T →F *T F →N F →(E) N →1 N →2 N →3 S→E# E → TP P → P →+E T → FM M→ M →*T F →N F →(E) N →1 N →2 N →3 For determining the lookahead sets of each nonterminal. These properties are named empty and ﬁrst respecively.3. production S→E# E → TP P → P →+E T → FM M→ M →*T F →N F →(E) N →1 N →2 N →3 empty no no yes no yes no no ﬁrst ( 1 2 3 ( 1 2 3 + ( 1 2 3 * ( 1 2 3 1 2 3 lookahead ﬁrst(E) ( 1 ﬁrst(T) ( 1 follow(P) ) # immediate + ﬁrst(F) ( 1 follow(M) + ) immediate * ﬁrst(N) 1 2 immediate ( immediate 1 immediate 2 immediate 3 2 3 2 3 2 3 # 3 11. All properties for the example grammar are summarized in the table below. rather than the 1+2 and 3. Using the LL(1) parser A sentence of the language described by the example grammar is 1+2*3. we also need to analyze whether a nonterminal can produce the empty string. whereas lookahead is a property of a single production rule.11. Indeed. Because of the high precedence of * relative to +.1. Note that empty and ﬁrst are properties of a nonterminal. and which terminals can be the ﬁrst symbol of any string produced by each nonterminal.

though. LR does a rightmost derivation 199 . Initially. of course. what is described in this section is known as Simple LR parsing or SLR(1)-parsing. we notice that the parse tree is traversed in a depth-ﬁrst fashion. The order in which the production rules are applied is a preorder traversal of the tree. where the left subtrees are analysed ﬁrst.2. Each node corresponds to the application of a production rule. the corresponing symbol is read from the input. From the table in Figure 11.1 it is clear that the contents of the stack describes what is still to be expected on the input. or dual.2. LR parsing Now when we do a step-by-step analysis of how the parse tree is constructed by the stack machine. 11.11. It is still rather complicated. is LR(1)parsing. Actually. LR parsing Another algorithm for doing deterministic parsing using a stack machine. A nice property of LR-parsing is that it is in many ways exactly the opposite. Some of these ways are: • LL does a leftmost derivation. of LL-parsing. The tree is constructed top-down: ﬁrst the root is visited. Whenever a terminal is on top of the stack. an then each time the ﬁrst remaining nonterminal is expanded. this is the root nonterminal S.

11. LL versus LR parsing

• LL starts with only the root nonterminal on the stack, LR ends with only the root nonterminal on the stack • LL ends when the stack is empty, LR starts with an empty stack • LL uses the stack for designating what is still to be expected, LR uses the stack for designating what has already been seen • LL builds the parse tree top down, LR builds the parse tree bottom up • LL continuously pops a nonterminal oﬀ the stack, and pushes a corresponding right hand side; LR tries to recognize a right hand side on the stack, pops it, and pushes the corresponding nonterminal • LL thus expands nonterminals, while LR reduces them • LL reads terminal when it pops one oﬀ the stack, LR reads terminals while it pushes them on the stack • LL uses grammar rules in an order which corresponds to pre-order traversal of the parse tree, LR does a post-order traversal.

**11.2.1. A stack machine for SLR parsing
**

As in Section 10.1.1 a stack machine is used to parse sentences. A stack machine for a grammar G has a stack and an input. When the parsing starts the stack is empty. The stack machine performs one of the following two actions. 1. Shift: Move the ﬁrst symbol of the remaining input to the top of the stack. 2. Reduce: Choose a production rule N → α; pop the sequence α from the top of the stack; push N onto the stack. These actions are performed until the stack only contains the start symbol. A stack machine for G accepts an input if it can terminate with only the start symbol on the stack when the whole input has been processed. Let us take the example of Section 10.1.1. stack a aa baa Saa Sa S input aab ab b

Note that with our notation the top of the stack is at the left side. To decide which production rule is to be used for the reduction, the symbols on the stack must be read in reverse order.

200

11.3. LR parse example

The stack machine for G accepts the input string aab because it terminates with the start symbol on the stack and the input is completely processed. The stack machine performs three shift actions and then three reduce actions. The SLR parser only performs a reduction with a grammar rule N → α if the next symbol on the input is in the follow set of N .

Exercise 11.1. Give for gramm1 of Section 10.1.2 all the states of the stack machine for a derivation of the input ccccba. Exercise 11.2. Give for gramm2 of Section 10.1.2 all the states of the stack machine for a derivation of the input abbb. Exercise 11.3. Give for gramm3 of Section 10.1.2 all the states of the stack machine for a derivation of the input acbab.

**11.3. LR parse example
**

11.3.1. An LR checker

for a start, the main function for LR parsing calls a generalized stack function with an empty stack:

check’ :: String -> Bool check’ input = run’ [] input

The stack machine terminates when it ﬁnds the stack containing just the root nonterminal. Otherwise it either pushes the ﬁrst input symbol on the stack (‘Shift’), or it drops some symbols oﬀ the stack (which should be the right hand side of a rule) and pushes the corresponding nonterminal (‘Reduce’).

run’ :: Stack -> String -> Bool run’ [’S’] [] = True run’ [’S’] (x:xs) = False run’ stack (x:xs) = case action of Shift -> run’ (x: stack) xs Reduce a n -> run’ (a:drop n stack) (x:xs) Error -> False where action = select’ stack x

In the case of LL-parsing the hard part was selecting the right rule to expand; here we have the hard decision of whether to reduce according to a (and which?) rule, or to shift the next symbol. This is done by the select’ function, which is allowed to inspect the ﬁrst input symbol x and the entire stack: after all, in needs to ﬁnd the right hand side of a rule on the stack. In the right column in Figure 11.1 the LR derivation of sentence 1+2*3 is shown. Compare closely to the left column, which shows the LL derivation, and note the duality of the processes.

201

11. LL versus LR parsing LL derivation rule S→E E → TP T → FM F →N N →1 read M→ P → +E read E → TP T → FM F →N N →2 read M → *T read T → FM F →N N →3 read M→ P → read ↓stack S E TP FMP NMP 1M P MP P +E E TP FMP NMP 2M P MP *T P TP FMP NMP 3M P MP P remaining 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 +2*3 +2*3 +2*3 2*3 2*3 2*3 2*3 2*3 *3 *3 3 3 3 3 rule shift N →1 F →N M→ T → FM shift shift N →2 F →N shift shift N →3 F →N M→ T → FM M → *T T → FM P → E → TP P → +E E → TP S→E LR derivation read 1 1 1 1 1 1+ 1+2 1+2 1+2 1+2* 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 1+2*3 stack↓ 1 N F FM T T+ T +2 T +N T +F T +F * T +F *3 T +F *N T +F *F T +F *F M T +F *T T +F M T +T T +T P T +E TP E S remaining 1+2*3 +2*3 +2*3 +2*3 +2*3 +2*3 2*3 *3 *3 *3 3

1 1 1 1+ 1+ 1+ 1+ 1+ 1+2 1+2 1+2* 1+2* 1+2* 1+2* 1+2*3 1+2*3 1+2*3

Figure 11.1.: LL and LR derivations of 1+2*3

**11.3.2. LR action selection
**

The choice whether to shift or to reduce is made by function select’. It is deﬁned as follows:

select’ as x | null items = Error | null redItems = Shift | otherwise = Reduce a (length rs) where items = dfa as redItems = filter red items (a,rs,_) = hd redItems

In the selection process a set (list) of so-called items plays a role. If the set is empty, there is an error. If the set contains at least one item, we ﬁlter the red, or reducible items from it. There should be only one, if the grammar has the LR-property. (Or rather: it is the LR-property that there is only one element in this situation). The reducible item is the production rule that can be reduced by the stack machine. Now what are these items? An item is deﬁned to be a production rule, augmented

202

11.3. LR parse example

with a ‘cursor position’ somewhere in the right hand side of the rule. So for the rule F → (E), there are four possible items: F → ·(E), F → ( · E), F → (E · ) and F → (E)·, where · denotes the cursor. for the rule F → 2 we have two items: one having the cursor in front of the single symbol, and one with the cursor after it: F → · 2 and F → 2 ·. For epsilon-rules there is a single item, where the cursor is at position 0 in the right hand side. In the Haskell representation, the cursor is an integer which is added as a third element of the tuple, which already contains nonterminal and right hand side of the rule. The items thus constructed are taken to be the states of a NFA (Nondeterministic Finite-state Automaton), as described in Section 8.1.2. We have the following transition relations in this NFA: • The cursor can be ‘stepped’ to the next position. This transition is labeled with the symbol that is hopped over • If the cursor is on front of a nonterminal, we can jump to an item describing the application of that nonterminal, where the cursor is at the beginning. This relation is an epsilon-transition of the NFA.

As an example, let’s consider an simpliﬁcation of the arithmetic expression grammar: E→T E→T +E T →N T →N *T T →(E) it skips the ‘factor’ notion as compared to the grammar earlier in this chapter, so it misses some well-formed expressions, but it serves only as an example for creating the states here. There are 18 states in this machine, as depicted here:

203

11. LL versus LR parsing

Exercise 11.4 (no answer provided). How can you predict the number of states from inspecting the grammar?

Exercise 11.5 (no answer provided). In the picture, the transition arrow labels are not shown. Add them.

Exercise 11.6 (no answer provided). What makes this FA nondeterministic?

As was shown in Section 8.1.4, another automaton can be constructed that is deterministic (a DFA). That construction involves deﬁning states which are sets of the original states. So in this case, the states of the DFA are sets of items (where items are rules with a cursor position). In the worst case we would have 218 states for the DFA-version of our example NFA, but it turns out that we are in luck: there are only 11 states in the DFA. Its states are rather complicated to depict, as they are sets of items, but it can be done:

204

9 (no answer provided).. How does this condition reﬂect that ‘the cursor is at the end’ ? 205 ._) = hd redItems It runs the DFA.2.. the ‘red’ ones are ﬁltered out.8. where red (a. which is by construction a set of items. let’s return to our function that selects whether to shift or to reduce: select’ as x | null items = Error | null redItems = Shift | otherwise = Reduce a (length rs) where items = dfa as redItems = filter red items (a.rs.11. Both conditions are in the formal deﬁnition: . Exercise 11. using the contents of the stack (as) as input.7 (no answer provided). Check that this FA is indeed deterministic.8 (no answer provided).3. LR parse example Exercise 11. Function follow was also needed in the LL analysis in Section 10. Which of the states in the picture of the DFA contain red items? How many? This is not the only condition that makes an item reducible. An item is ‘red’ (that is: reducible) if the cursor is at the end of the rule. Given this DFA.r.c) = c==length r && x ‘elem‘ follow a Exercise 11. From the state where the DFA ends. the second condition is that the lookahead input symbol is in the follow set of the nonterminal that is reduced to.

3. It leads to a dramatic increase of the number of states in the DFA. This means that it should vary for each item instead of for each nonterminal. as an optimization. which says: . would be to make the follow set context dependent. Its simplicity is in the reducibility test. in some contexts. so this amounts to just storing extra integers on the stack.. A compromise position is taken by the so-called LALR parsers. it is considered a waste of time to do the full DFA transitions each time. Grammars that do not give rise to shift/reduce conﬂicts in this situation are said to be LALR-grammars. ). with a failing parser as a consequence. for making the shift/reduce decision. the DFA is run on the (new) stack. The states of the DFA can easily be numbered. two tables can be precomputed: • Shift. This basically is the transition relation on the DFA. It is not really a natural 206 . that decides what action to take from a given state seeing a given input symbol. tupled with the symbols that used to be on the stack. . By analysing the grammar. .. but when states in the DFA diﬀer only with respect to the follow-part of their set-members (and not with respect to the itempart of them). the set is smaller. the states are merged. where it should have done a shift. as most of the stack remains the same after some popping and pushing at the top. where it actually should not. In practice. or SLR(1). (A rather silly name. the follow sets are context dependent. containing a dummy symbol and the number of the initial DFA state). In each call. where red (a. Both tables can be implemented as a two-dimensional table of integers. pushes and pops the stack continuously.11. by its nature. may designate items as reducible. In LALR parsers.3. An improvement. For some grammars this might lead to a decision to reduce. And the full power of LR parsing is rarely needed.r. or Look Ahead LR parsing. as all parsers look ahead. and does a recursive call afterwards. LL versus LR parsing 11. LR optimizations and generalizations The stack machine run’. (An artiﬁcial bottom element should be placed on the stack initially. under 100). As said in the introduction. at each stack position.c) = c==length r && x ‘elem‘ follow a The follow set is a rough approximation of what might follow a given nonterminal. Therefore. • Action. the corresponding DFA state is also stored. maybe. the SLR red function. of dimensions the number of states (typically. 100s to 1000s) times the number of symbols (typically. the algorithm described here is a mere Simple LR parsing. that decides what is the new state when pushing a terminal symbol on the stack. leading to a wider class of grammars that are allowable. But this set is not dependent of the context in which the nonterminal is used. So.

cs. identiﬁers.and action-tables reasonable. numbers. keywords. keywords etc.edu/~tkprasad/courses/cs780 207 .3. and part of the description of the algorithm were taken from course notes by Alex Aiken and George Necula.11. The micro structure of identiﬁers. Instead. Yacc is designed for doing the context-free aspects of analysing a language. A commonly used clone of yacc is named Bison. Lex is a preprocessor to yacc. rather. is not handled by this grammar. it is described by regular expressions. that subdivides the input character stream into a stream of meaningful tokens. which can be found at www. such as numbers.wright. it is a performance hack that gets most of LR power while keeping the size of the goto. It comes with Unix. Bibliographical notes The example DFA and NFA for LR parsing. A widely used parser generator tool named yacc (for ‘yet another compiler compiler’) is based on an LALR engine. LR parse example notion. and was originally created for implementing the ﬁrst C compilers. operators. comments etc. which are analysed by a accompanying tool to yacc named lex (for ‘lexical scanner’).

11. LL versus LR parsing 208 .

Springer-Verlag. [12] S. 1986. Dobb’s Journal. Proof-directed debugging. Syst. 1999. backtracking. LNCS 201. and software. volume 925 of Lecture Notes in Computer Science. Soisalon-Soininen. Jouannaud. [3] R. volume 15 of EATCS Monographs on THeoretical Computer Science. April:19 – 22. Introduction to Functional Programming. Pike. 209 . Springer. 1984.S. Semantics of context-free languages. In Recursive Programming Techniques. editors. and pattern matching in lazy functional languages. 1988. 1985. 1992.V. algorithms. 2(3):323 – 343. PhD thesis. Wadler. [2] R. Fast. Bird. [11] S. Aho. Sethi R. Using circular programs to eliminate multiple traversals of data.. Bird and P. [8] B. [13] P.R. Garbage collection and memory eﬃciency in lazy functional o languages. How to replace failure by a list of successes: a method for exception handling. To appear. [7] G. Compilers — Principles. Swierstra and P. In SOFSEM’99. Knuth. Sippu and E. Chalmers University of Technology. Wadler. [6] R. Kernighan and R. In J.S. [10] Niklas R¨jemo. 1999. Addison-Wesley. Hutton. Theory. 1: Languages and Parsing.W. Burge. [5] J. pages 113 – 128. Dr. Parsing. [9] D. Springer-Verlag. and J. Functional Programming Languages and Computer Architecture. Ullman. Meijer. 1995. 21:239–250.D. 1968. 1995. Acta Informatica. Azero Alcocer. Regular expressions — languages. Prentice Hall. 1988. Parsing Theory. 2(2):127– 145. Techniques and Tools.P. Advanced Functional Programming. 1999. Functional parsers.D.H. [4] W. Math. Fokker. error correcting parser combinators: a short tutorial. Harper. In J. Journal of Functional Programming. Jeuring and E. Higher-order functions for parsing. Vol. Journal of Functional Programming.E.Bibliography [1] A. editor. Addison-Wesley. 1975.

Bibliography 210 .

The Stack module module Stack ( Stack . isEmptyStack . mystack ) where data Stack x = MkS [x] deriving (Show. popList . push . top . pushList . split .A. pop .Eq) emptyStack :: Stack x emptyStack = MkS [] isEmptyStack :: Stack x -> Bool isEmptyStack (MkS xs) = null xs push :: x -> Stack x -> Stack x push x (MkS xs) = MkS (x:xs) pushList :: [x] -> Stack x -> Stack x pushList xs (MkS ys) = MkS (xs ++ ys) pop :: Stack x -> Stack x pop (MkS xs) = if isEmptyStack (MkS xs) then error "pop on emptyStack" else MkS (tail xs) popIf :: Eq x => x -> Stack x -> Stack x 211 . emptyStack .

Stack x) 0 stack = ([]. stack’) = split n (MkS xs) mystack = MkS [1. stack) n (MkS []) = error "attempt to split the emptystack" (n+1) (MkS (x:xs)) = (x:ys.2.3. stack’) where (ys.A.5. The Stack module popIf x stack = if top stack == x then pop stack else error "argument and top of stack don’t match" popList :: Eq x => [x] -> Stack x -> Stack x popList xs stack = foldr popIf stack (reverse xs) top :: Stack x -> x top (MkS xs) = if isEmptyStack (MkS xs) then error "top on emptyStack" else head xs split split split split :: Int -> Stack x -> ([x].6.7] 212 .4.

2. 2.1 Three of the fours strings are elements of L∗ : abaabaaabaa.3 ∅L = = ∅ The other equalities can be proved in a similar fashion. b} is given by S →ε S →X S X →a|b 2. baaaaabaa. Syntactical deﬁnitions for such sets follow immediately from this. t ∈ L} {s∈∅} 213 . 1. 2. A grammar for X = {a. A grammar for X = {a} is given by S →ε S → aS 2. The former star operator preserves more structure.2 {ε}.1 contains an inductive deﬁnition of the set of sequences over an arbitrary set X .6 A context free grammar for L is given by S →ε S → aS b { Deﬁnition of concatenation of languages } {st | s ∈ ∅. the star operator on languages concatenates the sentences of the language. aaaabaaaa. 2.4 The star operator on sets injects the elements of a set in a list.B. Answers to exercises 2.5 Section 2.

P →ε | a | b | aP a | bP b 2.10 Again. As a consequence we have to perform at least one derivation step from the start symbol before we end up with a sentence of the language.. ∅.12 The language consisting of the empty string only.7 Analogous to the construction of the grammar for PAL.e. 2. The nonterminals of a grammar do not belong to the alphabet (the set of terminals) of the language we describe using the grammar.11 A sentence is a sentential form consisting only of terminals which can be derived in zero or more derivation steps from the start symbol (to be more precise: the sentential form consisting only of the start symbol). Therefore the start symbol cannot be a sentence of the language. 2.8 Analogous to the construction of PAL. 2. Each derivation will always contain nonterminals.e. there are many other solutions.B. i. 2.e. grammars such as this one that have no production rules without nonterminals on the right hand side.14 The sentences in this language consist of zero or more concatenations of ab. so no sentences can be derived. An example of a grammar that can be derived from the inductive deﬁnition is: S → ε | aS b | bS a | SS Again. 214 . establish an inductive deﬁnition for L.9 First establish an inductive deﬁnition for parity sequences. the language is the set {ab}∗ . i. 2.13 This grammar generates the empty language. {ε}.. M →ε | aM a | bM b 2. An example of a grammar that can be derived from the inductive deﬁnition is: P → ε | 1P 1 | 0P | P 0 There are many other solutions.. Answers to exercises 2. In general. cannot produce any sentences with only terminal symbols. The start symbol is a nonterminal. i.

Each ﬁnite language is context free. We then obtain: A → baA | caA A→b|c 2. this prodedure yields S → ab S → aa S → baa 2. Here are two parse trees for the sentence if b then if b then a else a: S if b if then b S else then S a S if if b b then then S S a else S a S a 215 . For the language in Exercise 2.15 Yes. we introduce a new nonterminal: A → AaA A→B B →b|c Now we can remove the ambiguity: A → B aA A→B B →b|c It is now (optionally) possible to undo the auxiliary step of introducing the additional nonterminal by applying the rules for substituting right hand sides for nonterminal and removing unreachable productions.17 1.16 To bring the grammar into the form where we can directly apply the rule for associative separators.1.2. A context free grammar can be obtained by taking one nonterminal and adding a production rule for each sentence in the language.

LZ |. now with names for the productions: Empty: A: B: A2 : B2 : P P P P P →ε →a →b → aP a → bP b 216 .L B →0|1 2. 2.7. for example by introducing operator priorities.B. we can choose how we want to represent the diﬀerent operators in concrete syntax. this is one possibility: Expr → Expr + Expr | Expr * Expr | Int where Int is a nonterminal that produces integers. An equivalent non left recursive grammar is S → AB A → ε | aaA B → ε | bB 2.19 1. This means we prefer the second of the two parse trees above.18 An equivalent grammar for bit lists is L →B Z |B Z →. Note that the grammar given above is ambiguous. Choosing standard symbols. The disambiguating rule is incorporated directly into the grammar: → MatchedS | UnmatchedS → if b then MatchedS else MatchedS | a UnmatchedS → if b then S | if b then MatchedS else UnmatchedS S MatchedS 3. 2.21 Recall the grammar for palindromes from Exercise 2. n ∈ N}. The rule we apply is: match else with the closest previous unmatched else. An else clause is always matched with the closest previous unmatched if. We could also give an unambiguous version. 2.20 Of course. Answers to exercises 2. The grammar generates the language {a 2n b m | m.

22 1. and calls itself recursively wherever a recursive value of Pal occurs in the datatype. but omit all the terminal symbols. Such a pattern is typical for semantic functions. the desired Haskell deﬁnitions are pal 1 = A2 (B2 (A2 Empty)) pal 2 = B2 (A2 A) 2. 217 . The strings abaaba and baaab can be derived as follows: P ⇒ aP a ⇒ abP ba ⇒ abaP aba ⇒ abaaba P ⇒ bP b ⇒ baP ba ⇒ baaab The parse trees corresponding to the derivations are the following: P P a b a P P P ε Consequently. printPal printPal printPal printPal printPal printPal :: Pal → String Empty = "" A = "a" B = "b" (A2 p) = "a" + printPal p + "a" + + (B2 p) = "b" + printPal p + "b" + + a a b a and b a P P a b Note how the function follows the structure of the datatype Pal closely.We construct the datatype Pal by interpreting the nonterminal as datatype and the names of the productions as names of the constructors: data Pal = Empty | A | B | A2 P | B2 P Note that once again we keep the nonterminals on the right hand sides as arguments to the constructors.

this time with names for the productions: MEmpty: M → ε MA: M → aM a MB: M → bM b 1. this time with names for the productions: Stop: POne: PZeroL: PZeroR: P P P P →ε → 1P 1 → 0P → P0 218 .23 Recall the grammar from Exercise 2.9.B. aCountPal aCountPal aCountPal aCountPal aCountPal aCountPal :: Pal → Int Empty = 0 A =1 B =0 (A2 p) = aCountPal p + 2 (B2 p) = aCountPal p 2.24 Recall the grammar from Exercise 2. we obtain the following datatype: data Mir = MEmpty | MA Mir | MB Mir The concrete mirror palindromes cMir 1 and cMir 2 correspond to the following terms of type Mir : aMir 1 = MA (MB (MA Empty)) aMir 2 = MA (MB (MB Empty)) 2. mirToPal mirToPal mirToPal mirToPal :: Mir → Pal MEmpty = Empty (MA m) = A2 (mirToPal m) (MB m) = B2 (mirToPal m) :: Mir → String MEmpty = "" (MA m) = "a" + printMir m + "a" + + (MB m) = "b" + printMir m + "b" + + 2.8. Answers to exercises 2. printMir printMir printMir printMir 3. By systematically transforming the grammar.

data BitList = ConsBit Bit Z | SingleBit Bit data Z = ConsBitList BitList Z | SingleBitList BitList data Bit = Bit 0 | Bit 1 aBitList 1 = ConsBit Bit 0 (ConsBitList (SingleBit Bit 1 ) (SingleBitList (SingleBit Bit 0 ))) aBitList 2 = ConsBit Bit 0 (ConsBitList (SingleBit Bit 0 ) (SingleBitList (SingleBit Bit 1 ))) 2. for instance: aEven 1 = PZeroL (PZeroL (POne (PZeroR Stop))) aEven 2 = PZeroR (PZeroL (POne (PZeroL Stop))) 2.LZ |. printBitList :: BitList → String printBitList (ConsBit b z ) = printBit b + printZ z + printBitList (SingleBit b) = printBit b printZ :: Z → String printZ (ConsBitList bs z ) = "." + printBitList bs + printZ z + + printZ (SingleBitList bs) = ".1.25 A grammar for bit lists that is not left-recursive is the following: L →B Z |B Z →. data Parity = Stop | POne Parity | PZeroL Parity | PZeroR Parity aEven 1 = PZeroL (PZeroL (POne (PZeroL Stop))) aEven 2 = PZeroL (PZeroR (POne (PZeroL Stop))) Note that the grammar is ambiguous. and other representations for cEven 1 and cEven 2 are possible. printParity printParity printParity printParity printParity :: Parity → String Stop = "" (POne p) = "1" + printParity p + "1" + + (PZeroL p) = "0" + printParity p + (PZeroR p) = printParity p + "0" + 2." + printBitList bs + 219 .L B →0|1 1.

We never have to match on the second bit list: concatBitList :: BitList → BitList → BitList concatBitList (ConsBit b z ) cs = ConsBit b (concatZ z cs) concatBitList (SingleBit b) cs = ConsBit b (SingleBitList cs) concatZ :: Z → BitList → Z concatZ (ConsBitList bs z ) cs = ConsBitList bs (concatZ z xs) concatZ (SingleBitList bs) cs = ConsBitList bs (SingleBitList cs) 2. e. Answers to exercises printBit :: Bit → String printBit Bit 0 printBit Bit 1 = "0" = "1" When multiple datatypes are involved. We can still make the concatenation function structurally recursive in the ﬁrst of the two bit lists. 220 . and the functions call each other recursively where appropriate – we say they are mutually recursive. i. We will see in Chapter 9 how to prove such a statement. 3. L1 is generated by: S → ZC Z → aZ b | ε C → c∗ and L2 is generated by: S → AZ A → a∗ Z → bZ c | ε 2. there is no context-free grammar that generates this language. We have that L1 ∩ L2 = {a n b n c n | n ∈ N} However. Digs Int → Dig ∗ → Sign? Nat AlphaNums → AlphaNum ∗ AlphaNum → Letter | Dig 2.26 We only give the EBNF notation for the productions that change. this language is not context-free..28 1.27 L(G?) = L(G) ∪ {ε} 2. semantic functions typically still follow the structure of the datatypes closely. We get one function per datatype.B.

A leftmost derivation is: S ⇒ AA ⇒∗ bm AA ⇒∗ bm Abn A ⇒ bm abn A ⇒∗ bm abn Abp ⇒ bm abn abp 2. Several derivations are possible for the string babbab. 2. given any language L. the language L∗ fulﬁlls the desired property. Thus (L 2. the language L = {aab. 2.2. For any language L. Two of them are S ⇒ AA ⇒ bAA ⇒ bAbA ⇒ babA ⇒ babbA ⇒ babbAb ⇒ babbab S ⇒ AA ⇒ AAb ⇒ bAAb ⇒ bAbAb ⇒ bAbbAb ⇒ babbAb ⇒ babbab 3. / 2.36 The language generated by the grammar is {an bn | n ∈ N} 221 .31 This is only the case when ε ∈ L. The shortest derivation is three steps long and yields the sentence aa.32 No. since ε ∈ L∗ . we have ε ∈ (L∗ ). 2. and aab can all be derived in four steps. m 1. for any language L. other hand. On the / ∗ ∗ ) and (L)∗ cannot be equal.29 No. ε ∈ (L) . The sentences baa. aba. The string aabbaabba does not appear in this language.34 The grammar is equivalent to the grammar S → aaB B → bB ba B →a This grammar generates the string aaa and the strings aabm a(ba)m for m ∈ N.35 The language L is generated by: S → aS a | bS b | c The derivation is: S ⇒ aS a ⇒ abS ba ⇒ abcba 2.33 1. Furthermore. it holds that L∗ = (L∗ )∗ so. baa} also satisﬁes L = LR .30 For example L = {x n | n ∈ N}. 2.

37 The ﬁrst language is {an | n ∈ N} This language is also generated by the grammar S → aS | ε The second language is {ε} ∪ {a2n+1 | n ∈ N} This language is also generated by the grammar S →A|ε A → a | aAa or using EBNF notation S → (a(aa)∗ )? 2. Answers to exercises The same language is also generated by the grammar S → aAb | ε 2.B.39 S →A|ε A → aAb | ab 2.38 All three grammars generate the language {an | n ∈ N} 2.41 The language is 222 .40 The language is L = {a2n+1 | n ∈ N} A grammar for L without left-recursive productions is A → aaA | a And a grammar without right-recursive prodcutions is A → Aaa | a 2.

A grammar for the same language with only two nonterminals: S → aS a | U a | US U → S | SU aS a | SUU a The nonterminal T has been substituted for its alternatives aS a and U a.42 A grammar that uses only productions with two or less symbols on the right hand side: S T X U Y → T | US → Xa | Ua → aS → S | YT → SU The sentential forms aS and SU have been abstracted to nonterminals X and Y .44 The language is generated by the grammar: S → (A) | SS A→S |ε A derivation for ( ) ( ( ) ) ( ) is: S ⇒ S S ⇒ S SS ⇒ (A)SS ⇒ ()S S ⇒ ()(A)S ⇒ ()(S )S ⇒ ()((A))S ⇒ ()(())S ⇒ ()(())(A) ⇒ ()(())() 2.45 The language is generated by the grammar 223 .{abn | n ∈ N} A grammar for L without left-recursive productions is X → aY Y → bY | ε A grammar for L without left-recursive productions that is also non-contracting is X → aY | a Y → bY | b 2. 2.43 S → 1O O → 1O | 0N N → 1∗ 2.

up to renaming.45: S → (E )E | [E ] E E →ε|S 2. as the grammar E →E +T |T T →T *F |F F →0|1 224 .49 Here is a leftmost derivation for ♣ 3 ♣ ♠.B.46 First leftmost derivation: Sentence ⇒ Subject Predicate ⇒ they Predicate ⇒ they Verb NounPhrase ⇒ they are NounPhrase ⇒ they are Adjective Noun ⇒ they are flying Noun ⇒ they are flying planes Second leftmost derivation: Sentence ⇒ Subject Predicate ⇒ they Predicate ⇒ they AuxVerb Verb Noun ⇒ they are Verb Noun ⇒ they are flying Noun ⇒ they are flying planes 2. ⊗⇒♣3⊕ ⊗ ⇒ ⊗⇒⊗ ⊗⇒⊗3⊕ ⊗⇒⊕3⊕ ⇒♣3♣ ⊗⇒♣3♣ ⊕⇒♣3♣ ♠ Notice that the grammar of this exercise is the same. Answers to exercises S → (A) | [A] | SS A→S |ε A derivation for [ ( ) ] ( ) is: S ⇒ S S ⇒ [A]S ⇒ [S ]S ⇒ [(A)]S ⇒ [()]S ⇒ [()](A) ⇒ [()]() 2.48 Here is an unambiguous grammar for the language from Exercise 2.

xs)] succeed (f a) xs 3.5 The type and results of (:) <$> symbol ’a’ are (note that you cannot write this as a deﬁnition in Haskell): 225 . but this is easy to see by induction over the length of derivations. there is a derivation P ⇒∗ t.Char . so are asa.2. bsb. Then (f <$> succeed a) xs = = = { deﬁnition of <$> } { deﬁnition of succeed } { deﬁnition of succeed } [(f x . b. Then s can be written as ata or btb or ctc where the length of t is strictly smaller.4 Let xs :: [s]. 3. and c are palindromes. The palindromes a. Certainly ε. capital = satisfy (λs → (’A’ s) ∧ (s ’Z’)) 3. P ⇒ c. and b are the palindromes with length 1. ys) | (x . b. and csc.3 The function epsilon is a special case of succeed : epsilon :: Parser s () epsilon = succeed () 3. for example P ⇒∗ t ⇒ ata in the ﬁrst situation – the other two cases are analogous. respectively. But then. ys) ← succeed a xs] [(f a. there is also a derivation for s.1 Either we use the predeﬁned predicate isUpper in module Data. and t also is a palindrome. And if s is a palindrome that can be derived. a. They can be derivated with P ⇒ a. Suppose that the palindrome s with a length of 2 or more. by induction hypothesis.2 A symbol equal to a satisﬁes the predicate (= = a): symbol a = satisfy (= = a) 3. P ⇒ b. We have now proved that any palindrome can be generated by the grammar – we still have to prove that anything generated by the grammar is a palindrome. capital = satisfy isUpper or we make use of the ordering deﬁned characters.50 The palindrome ε with length 0 can be generated by de grammar with the derivation P ⇒ ε. Thus.

ys) ← p xs] | otherwise = [] 3.B. ys) | (x . Answers to exercises ((:) <$> symbol ’a’) :: Parser Char (String → String) ((:) <$> symbol ’a’) [] = [] ((:) <$> symbol ’a’) (x : xs) | x = = ’a’ = [((x :). palina :: Parser Char Int palina = (λ y → y + 2) <$> symbol ’a’ <∗> palina <∗> symbol ’a’ <|> (λ y → y) <$> symbol ’b’ <∗> palina <∗> symbol ’b’ <|> const 1 <$> symbol ’a’ <|> const 0 <$> symbol ’b’ <|> succeed 0 226 . data Pal 2 = Nil | Leafa | Leafb | Twoa Pal 2 | Twob Pal 2 2.7 pBool :: Parser Char Bool pBool = const True <$> token "True" <|> const False <$> token "False" 3. palin 2 :: Parser Char Pal 2 palin 2 = (\ y → Twoa y) <$> symbol ’a’ <∗> palin 2 <∗> symbol ’a’ <|> (\ y → Twob y) <$> symbol ’b’ <∗> palin 2 <∗> symbol ’b’ <|> const Leafa <$> symbol ’a’ <|> const Leafb <$> symbol ’b’ <|> succeed Nil 3.9 1.6 The type and results of (:) <$> symbol ’a’ <∗> p are: ((:) <$> symbol ’a’ <∗> p) :: Parser Char String ((:) <$> symbol ’a’ <∗> p) [] = [] ((:) <$> symbol ’a’ <∗> p) (x : xs) | x = = ’a’ = [(’a’ : x . xs)] | otherwise = [] 3.

3.15 The parser combinators <∗> and <.>) :: Parser s a → Parser s b → Parser s (a.14 The transformator <$> does to the result part of parsers what map does to the elements of a list. y). zs) |(x .12 The function is the same as <∗>. ) <$> p <∗> q 227 . it pairs them together: (<.> q) xs = [((x . zs) ← q ys ] 3. or ‘parser modiﬁer’ or ‘parser postprocessor’. but instead of applying the result of the ﬁrst parser to that of the second.> q) p <. english :: Parser Char English english = E1 <$> subject <∗> pred subject = E2 <$> token "they" pred = E3 <$> verb <∗> nounp <|> E4 <$> auxv <∗> verb <∗> noun verb = E5 <$> token "are" <|> E6 <$> token "flying" auxv = E7 <$> token "are" nounp = E8 <$> adj <∗> noun adj = E9 <$> token "flying" noun = E10 <$> token "planes" 3. b) (p <.3.11 As <|> uses + it is more eﬃciently evaluated if right-associative.13 ‘Parser transformator’. +.> q = (. data English = E1 Subject Pred data Subject = E2 String data Pred = E3 Verb NounP | E4 AuxV Verb Noun data Verb = E5 String | E6 String data AuxV = E7 String data NounP = E8 Adj Noun data Adj = E9 String data Noun = E10 String 2. 3. 3.> can be deﬁned in terms of each other: p <∗> q = uncurry ($) <$> (p <. etcetera. (y.10 1. ys) ← p xs .

3.18 To the left.2].20 1.2].([1].2. listofdigits :: Parser Char [Int] listofdigits = listOf newdigit (symbol ’ ’) ?> listofdigits "1 2 3" [([1. Answers to exercises 3. snd ) (p s))) 3.19 The function has to check whether applying the parser p to input s returns at least one result with an empty rest sequence: test p s = not (null (ﬁlter (null . You can combine the parser parameter of <$> with a parser that consumes no input and always yields the function parameter of <$>: f <$> p = succeed f <∗> p 3.3]. plusParser :: Num a ⇒ Parser Char (a → a → a) plusParser [] = [] plusParser (x : xs) | x = = ’+’ = [((+).([1. Yes.([1]. xs)] | otherwise = [] 228 ." 2 a")] 2.17 f :: Char → Parentheses → Char → Parentheses → Parentheses open Parser Char Char :: f <$> open :: Parser Char (Parentheses → Char → Parentheses → Parentheses) parens :: Parser Char Parentheses (f <$> open) <∗> parens :: Parser Char (Char → Parentheses → Parentheses) 3." a").B. In order to use the parser chainr we ﬁrst deﬁne a parser plusParser that recognises the character ’+’ and returns the function (+).16 Yes." 3")." 2 3")] ?> listofdigits "1 2 a" [([1."").

""). xs = []: many (symbol ’a’) [] = = = = { deﬁnition of many and listOfa } { deﬁnition of <|> } { Exercise 3."+2+3")] ?> sumParser "1+2+a" [(3. The parser many should be replaced by the parser greedy in de deﬁnition of listOf .6.(3.5 and 3."+3")."+a").6. 3.(1.(1."")] Note that the parser also recognises a single integer.6."+2+a")] ?> sumParser "1" [(1. [])] + [([]. previous calculation } (listOfa <∗> many (symbol ’a’) <|> succeed []) [’a’] (listOfa <∗> many (symbol ’a’)) [’a’] + succeed [] [’a’] + 229 . deﬁnition of succeed } { deﬁnition of + } + (listOfa <∗> many (symbol ’a’) <|> succeed []) [] (listOfa <∗> many (symbol ’a’)) [] + succeed [] [] + [] + [([]. 3.21 We introduce the abbreviation listOfa = (:) <$> symbol ’a’ and use the results of Exercises 3. [])] xs = [’a’]: many (symbol ’a’) [’a’] = = = { deﬁnition of many and listOfa } { deﬁnition of <|> } { Exercise 3.The deﬁniton of the parser sumParser is: sumParser :: Parser Char Int sumParser = chainr newdigit plusParser ?> sumParser "1+2+3" [(6.

[’a’. ’b’])] 3. []). [’a’.6. 3. ’b’] + succeed [] [’a’. [’a’. ’a’]. [’a’])] xs = [’b’]: many (symbol ’a’) [’b’] = = { as before } { Exercise 3. ([]. ’b’] + [([’a’. ’b’]: many (symbol ’a’) [’a’. This also holds for the recursive calls. previous calculation } (listOfa <∗> many (symbol ’a’)) [’a’. ’b’]).22 The empty alternative is presented last. ’a’. ([]. ’b’] + [([’a’]. [’b’]). [’b’])] xs = [’a’. ’b’] = = { as before } { Exercise 3.6.B. then two a’s with a singleton rest string. then one a. Answers to exercises [([’a’]. thus the ‘greedy’ parsing of all three a’s is presented ﬁrst. and ﬁnally the empty result with the original input as rest string. ’b’]: many (symbol ’a’) [’a’. because the <|> combinator uses list concatenation for concatenating lists of successes. ’b’] + succeed [] [’a’. ’a’. previous calculation } (listOfa <∗> many (symbol ’a’)) [’b’] + succeed [] [’b’] + [([]. previous calculation } (listOfa <∗> many (symbol ’a’)) [’a’. ’a’.6. ’b’] = = { as before } { Exercise 3. ([’a’]. ’a’. ’a’.24 — Combinators for repetition psequence :: [Parser s a] → Parser s [a] psequence [] = succeed [] psequence (p : ps) = (:) <$> p <∗> psequence ps psequence :: [Parser s a] → Parser s [a] psequence = foldr f (succeed []) where f p q = (:) <$> p <∗> q 230 . [’b’]). ’b’])] xs = [’a’. ([].

When the parser encounters a function application.b)": Var "abc" Var "abc" Var "a" :∗: Var "b" :+: Con 1 Var "a" :∗: (Var "b" :+: Con 1) Con (−1) :−: Var "a" Fun "a" [Con 1. This ﬁrst solution will however 231 ."b")] ?> (choice [digit. A function application is a variable followed by an argument list.27 identiﬁer :: Parser Char String identiﬁer = (:) <$> satisfy isAlpha <∗> greedy (satisfy isAlphaNum) 3. satisfy isUpper]) "ab" [] 3. The parser fact ﬁrst tries to parse an integer.choice :: [Parser s a] → Parser s a choice = foldr (<|>) failp ?> (psequence [digit. then a function application and ﬁnally a parenthesised expression.25 token :: Eq s ⇒ [s] → Parser s [s] token = psequence . map symbol 3. satisfy isUpper]) "1A" [("1A". satisfy isUpper]) "Ab" [(’A’. satisfy isUpper]) "1Ab" [("1A". Var "b"] 2."")] ?> (psequence [digit.28 1. satisfy isUpper]) "1ab" [(’1’. then a variable. a variable will ﬁrst be recognised. As Haskell terms: "abc": "(abc)": "a*b+1": "a*(b+1)": "-1-a": "a(1."ab")] ?> (choice [digit."b")] ?> (psequence [digit. satisfy isUpper]) "1ab" [] ?> (choice [digit.

(Var "f".Var "b"]. Answers to exercises not lead to a parse tree for the complete expression because the list of arguments that comes after the variable cannot be parsed.29 A function with no arguments is not accepted by the parser: ?> expr "f()" [(Var "f"."()")] 3."")."").30 expr = chainr (chainl term (const (:−:) <$> symbol ’-’)) (const (:+:) <$> symbol ’+’) 3.b)" [(Fun "a" [Con 1.B.b)")] 3. the parse tree for a function application will be the ﬁrst solution of the parser: fact :: Parser Char Expr fact = Con <$> integer <|> Fun <$> identiﬁer <∗> parenthesised (commaList expr ) <|> Var <$> identiﬁer <|> parenthesised expr ?> expr "a(1.31 The datatype Expr is extended as follows to allow raising an expression to the power of an expression: data Expr = Con Int | Var String 232 ."(1."()")] The parser parenthesised (commaList expr ) that is used in the parser fact does not accept an empty list of arguments because commaList does not. To accept an empty list we modify the parser fact as follows: fact :: Parser Char Expr fact = Con <$> integer <|> Fun <$> identiﬁer <∗> parenthesised (commaList expr <|> succeed []) <|> Var <$> identiﬁer <|> parenthesised expr ?> expr "f()" [(Fun "f" []. If we swap the second and the third line in the deﬁnition of the parser fact.(Var "a".

but here we prefer to exploit the following equation (f <$> p) xs = map (f ∗∗∗ id ) (p xs) where (∗∗∗) is deﬁned by (∗∗∗) :: (a → c) → (b → d ) → (a. (h <$> (f <$> p)) xs = = { (B. powis] Note that because of the use of chainl all the operators listed in addis. b) → (c. map g = map (f . multis and powis are treated as left-associative. 3. (: ˆ :))] expr :: Parser Char Expr expr = foldr gen fact [addis. (h ∗∗∗ k ) = (f .32 The proofs can be given by using laws for list comprehension. concat = concat .2) (B. multis.| Fun String [Expr ] | Expr :+: Expr | Expr :−: Expr | Expr :∗: Expr | Expr :/: Expr | Expr : ˆ : Expr deriving Show Now the parser expr of Listing 3. map (map f ) 1.8 can be extended with a new level of priorities: powis = [(’^’.5) 233 . k ) (B.4) (B. b) = (f a. d ) (f ∗∗∗ g) (a. concatenation. g b) It has the following property: (f ∗∗∗ g) .1) } { (B.1) Furthermore.3) (B. g) map f (x + y) = map f x + map f y + + map f . h) ∗∗∗ (g . we will use the following laws about map in our proof: map distributes over composition. and the function concat: map f .1) } map (h ∗∗∗ id ) ((f <$> p) xs) map (h ∗∗∗ id ) (map (f ∗∗∗ id ) (p xs)) (B.

First note that (p <∗> q) xs can be written as (p <∗> q) xs = concat (map (mc q) (p xs)) where mc q (f .) <$> p) <∗> q) xs = = = = { (B.6) 234 . f ) <$> p) xs 2.2) } { (B.1) } { deﬁnition of <|> } map (h ∗∗∗ id ) ((p <|> q) xs) map (h ∗∗∗ id ) (p xs + q xs) + map (h ∗∗∗ id ) (p xs) + map (h ∗∗∗ id ) (q xs) + (h <$> p) xs + (h <$> q) xs + ((h <$> p) <|> (h <$> q)) xs 3. (h <$> (p <|> q)) xs = = = = = { (B.6) } { (B. mc q) (p xs)) (B.3) } { (B.1) } { (B. ys) = map (f ∗∗∗ id ) (q ys) Now we calculate (((h.) <$> p) xs)) concat (map (mc q) (map ((h.4) } { (B.1) } map ((h ∗∗∗ id ) .B. (f ∗∗∗ id )) (p xs) map ((h .3) } { (B.7). f ) ∗∗∗ id ) (p xs) ((h . see below } concat (map (mc q) (((h.1) } { deﬁnition of <|> } { (B. Answers to exercises = = = { (B.) ∗∗∗ id ) (p xs))) concat (map (map (h ∗∗∗ id ) (map (mc q) (p xs)))) concat (map ((map (h ∗∗∗ id )) .

’ <∗> pBitList pBit = const Bit 0 <$> symbol ’0’ <|> const Bit 1 <$> symbol ’1’ (B. ys) 3. ((h.3) } { (B.6) } { (B.34 pBitList :: Parser Char BitList pBitList = SingleB <$> pBit <|> (λb bs → ConsB b bs) <$> pBit <∗> symbol ’. mc q This claim is also proved by calculation: ((map (h ∗∗∗ id )) . } { deﬁnition of mc q } { map and ∗∗∗ distribute over composition } { deﬁnition of mc q } { deﬁnition of ∗∗∗ } map (h ∗∗∗ id ) (mc q (f .7) 235 . ys) (mc q . f .) ∗∗∗ id ) = map (h ∗∗∗ id ) .5) } { (B.33 pMir :: Parser Char Mir pMir = (λ m → MB m) <$> symbol ’b’ <∗> pMir <∗> symbol ’b’ <|> (λ m → MA m) <$> symbol ’a’ <∗> pMir <∗> symbol ’a’ <|> succeed MEmpty 3. mc q) (f .) ∗∗∗ id )) (f . ys)) map (h ∗∗∗ id ) (map (f ∗∗∗ id ) (q ys)) map ((h . f ) ∗∗∗ id ) (q y) mc q (h .1) } concat (map (map (h ∗∗∗ id )) (map (mc q) (p xs))) map (h ∗∗∗ id ) (concat (map (mc q) (p xs))) map (h ∗∗∗ id ) ((p <∗> q) xs) (h <$> (p <∗> q)) xs It remains to prove the claim mc q .= = = = { (B. ((h. ys) = = = = = { deﬁnition of .

" [(JAssign "x1" (Var "a" :+: Con 1 :*: Con 2).36 ﬂoat :: Parser Char Float ﬂoat = f <$> ﬁxed <∗> (((λ y → y) <$> symbol ’E’ <∗> integer ) ‘option‘ 0) where f m e = m ∗ power e power e | e < 0 = 1."")] ?> assign "x=a+1" [] Note that the second example is not recognised as an assignment because the string does not end with a semicolon.35 — Parser for ﬂoating point numbers ﬁxed :: Parser Char Float ﬁxed = (+) <$> (fromIntegral <$> greedy integer ) <∗> (((λ y → y) <$> symbol ’. 4.0) fractpart :: Parser Char Float fractpart = foldr f 0.’ <∗> fractpart) ‘option‘ 0.0 <$> greedy newdigit where f d n = (n + fromIntegral d ) / 10.0 3.37 Parse trees for Java assignments are of type: data JavaAssign = JAssign String Expr deriving Show The parser is deﬁned as follows: assign :: Parser Char JavaAssign assign = JAssign <$> identiﬁer <∗> ((λ y → y) <$> symbol ’=’ <∗> expr <∗> symbol ’.1 data FloatLiteral = FL1 IntPart FractPart ExponentPart FloatSuﬃx | FL2 FractPart ExponentPart FloatSuﬃx | FL3 IntPart ExponentPart FloatSuﬃx 236 .’) ?> assign "x1=(a+1)*2.0 / power (−e) | otherwise = fromIntegral (10e) 3. Answers to exercises 3.B.

digit digits floatLiteral = f <$> satisfy isDigit where f c = ord c . only the parsers return another (semantic) result.| deriving Show FL4 IntPart ExponentPart FloatSuﬃx = String = String = String = String = String = String type ExponentPart type ExponentIndicator type SignedInteger type IntPart type FractPart type FloatSuﬃx digit digits ﬂoatLiteral = satisfy isDigit = many 1 digit (λa b c d e → FL1 a c d e) <$> intPart <∗> period <∗> optfract <∗> optexp <∗> optﬂoat <|> (λa b c d → FL2 b c d ) <$> period <∗> fractPart <∗> optexp <∗> optﬂoat <|> (λa b c → FL3 a b c) <$> intPart <∗> exponentPart <∗> optﬂoat <|> (λa b c → FL4 a b c) <$> intPart <∗> optexp <∗> ﬂoatSuﬃx = = signedInteger = digits = (+ <$> exponentIndicator <∗> signedInteger +) = (+ <$> option sign "" <∗> digits +) = token "e" <|> token "E" = token "+" <|> token "-" = token "f" <|> token "F" <|> token "d" <|> token "D" = token ".2 The data and type deﬁnitions are the same as before." = option exponentPart "" = option fractPart "" = option ﬂoatSuﬃx "" intPart fractPart exponentPart signedInteger exponentIndicator sign ﬂoatSuﬃx period optexp optfract optﬂoat 4.ord ’0’ = foldl f 0 <$> many1 digit where f a b = 10*a + b = (\a b c d e -> (fromIntegral a + c) * power d) <$> intPart <*> period <*> optfract <*> optexp <*> optfloat <|> (\a b c d -> b * power c) <$> period <*> fractPart <*> optexp <*> optfloat 237 .

.0 <$> many1 digit where f a b = (fromIntegral a + b)/10 exponentPart = (\x y -> y) <$> exponentIndicator <*> signedInteger signedInteger = (\ x y -> x y) <$> option sign id <*> digits exponentIndicator = symbol ’e’ <|> symbol ’E’ sign = const id <$> symbol ’+’ <|> const negate <$> symbol ’-’ floatSuffix = symbol ’f’ <|> symbol ’F’<|> symbol ’d’ <|> symbol ’D’ period = symbol ’. = f <$> many1 digit where f ds = .....0 optfloat = option floatSuffix ’ ’ power e | e < 0 = 1 / power (-e) | otherwise = fromIntegral (10^e) 4.. f2 a b c d = .. intPart fractPart exponentPart 238 . = f <$> exponentIndicator <*> signedInteger where f x y = . = signedInteger = f <$> many1 digit where f ds = .B.. = f1 <$> intPart <*> period <*> optfract <*> optexp <*> optfloat <|> f2 <$> period <*> fractPart <*> optexp <*> optfloat <|> f3 <$> intPart <*> exponentPart <*> optfloat <|> f4 <$> intPart <*> optexp <*> floatSuffix where f1 a b c d e = . f4 a b c = .........’ optexp = option exponentPart 0 optfract = option fractPart 0.....3 The parsing scheme for Java ﬂoats is digit digits floatLiteral = f <$> satisfy isDigit where f c = . f3 a b c = ....... Answers to exercises <|> (\a b c -> (fromIntegral a) * power b) <$> intPart <*> exponentPart <*> optfloat <|> (\a b c -> (fromIntegral a) * power b) <$> intPart <*> optexp <*> floatSuffix intPart = signedInteger fractPart = foldr f 0...

...’ optexp = option exponentPart ?? optfract = option fractPart ?? optfloat = option floatSuffix ?? 5. const 0) signedInteger 239 ... floatSuffix = f1 <$> symbol ’f’ <|> f2 <$> symbol ’F’ <|> f3 <$> symbol ’d’ <|> f4 <$> symbol ’D’ where f1 c = .= f <$> option sign ?? <*> digits where f x y = . sign = f1 <$> symbol ’+’ <|> f2 <$> symbol ’-’ where f1 h = ....1 type LNTreeAlgebra a b x = (a → x ... f4 c = .....2 1.. Deﬁnition of height by case analysis: height (Leaf x ) = 0 height (Bin lt rt) = 1 + (height lt ‘max ‘ height rt) Deﬁnition as a fold: height :: BinTree x → Int height = foldBinTree heightAlgebra heightAlgebra = (λu v → 1 + (u ‘max ‘ v ).. f3 c = ....... node) = fold where fold (Leaf a) = leaf a fold (Node l m r ) = node (fold l ) m (fold r ) 5.. x → b → x → x ) foldLNTree :: LNTreeAlgebra a b x → LNTree a b → x foldLNTree (leaf ... f2 h = .. period = symbol ’. exponentIndicator = f1 <$> symbol ’e’ <|> f2 <$> symbol ’E’ where f1 c = ..... f2 c = ... f2 c = .

Deﬁnition of mapBinTree by case analysis: mapBinTree f (Leaf x ) = Leaf (f x ) mapBinTree f (Bin lt rt) = Bin (mapBinTree f lt) (mapBinTree f rt) Deﬁnition as a fold: mapBinTree :: (a → b) → BinTree a → BinTree b mapBinTree f = foldBinTree (Bin. Deﬁnition of ﬂatten by case analysis: ﬂatten (Leaf x ) = [x ] ﬂatten (Bin lt rt) = ﬂatten lt + ﬂatten rt + Deﬁnition as a fold: ﬂatten :: BinTree x → [x ] ﬂatten = foldBinTree ((+ λx → [x ]) +).3 Using explicit recursion: allPaths (Leaf x ) = [[]] allPaths (Bin lt rt) = map (L:) (allPaths lt) + map (R:) (allPaths rt) + 240 . Answers to exercises 2. 3. Deﬁnition of maxBinTree by case analysis: maxBinTree (Leaf x ) = x maxBinTree (Bin lt rt) = maxBinTree lt ‘max ‘ maxBinTree rt Deﬁnition as a fold: maxBinTree :: Ord x ⇒ BinTree x → x maxBinTree = foldBinTree (max . f ) 5. Deﬁnition of sp by case analysis: sp (Leaf x ) = 0 sp (Bin lt rt) = 1 + (sp lt) ‘min‘ (sp rt) Deﬁnition as a fold: sp :: BinTree x → Int sp = foldBinTree spAlgebra spAlgebra = (λu v → 1 + u ‘min‘ v .B. const 0) 5. id ) 4. Leaf .

(++) 2. 241 .const True . a → a → a. Float → a) foldResist :: ResistAlgebra a → Resist → a foldResist (par .\x -> (&&) ) vars = foldExpr ((++) .\x y -> False . const [[]]) + 5. result :: Resist → Float result = foldResist resultAlgebra resultAlgebra :: ResistAlgebra Float resultAlgebra = (λu v → (u ∗ v ) / (u + v ). (+).As a fold: allPaths :: BinTree a → [[Direction]] allPaths = foldBinTree psAlgebra psAlgebra :: BinTreeAlgebra a [[Direction]] psAlgebra = (λu v → map (L:) u + map (R:) v . seq.(++) .\x y -> False . isSum = foldExpr ((&&) .\x y -> False . basic) = fold where fold (r1 :|: r2 ) = par (fold r1 ) (fold r2 ) fold (r1 :∗: r2 ) = seq (fold r1 ) (fold r2 ) fold (BasicR f ) = basic f 2. data Resist = Resist :|: Resist | Resist :∗: Resist | BasicR Float deriving Show type ResistAlgebra a = (a → a → a.4 1.const True . id ) 5.5 1.

data Exp = Exp ‘Plus‘ Exp | Exp ‘Sub‘ Exp | Con Float | Idf String deriving Show type ExpAlgebra a = (a → a → a .a → a → a .(++) . 2. The function der is not a compositional function on Expr because the righthand sides of the ‘Mul ‘ and ‘Dvd ‘ expressions do not only use der e1 dx and der e2 dx . der computes a symbolic diﬀerentiation of an expression. but also e1 and e2 themselves. 3. Answers to exercises .\x y z -> x : (y ++ z) ) 5. sub. Float → a . idf ) = fold where fold (e1 ‘Plus‘ e2 ) = plus (fold e1 ) (fold e2 ) fold (e1 ‘Sub‘ e2 ) = sub (fold e1 ) (fold e2 ) fold (Con n) = con n fold (Idf s) = idf s 4.6 1. String → a ) foldExp :: ExpAlgebra a → Exp → a foldExp (plus.const [] .B.\x -> [x] . Using explicit resursion: der der der der der :: Exp → String → Exp (e1 ‘Plus‘ e2 ) dx = der e1 dx ‘Plus‘ der e2 dx (e1 ‘Sub‘ e2 ) dx = der e1 dx ‘Sub‘ der e2 dx (Con f ) dx = Con 0 (Idf s) dx = if s = = dx then Con 1 else Con 0 Using a fold: der = foldExp derAlgebra derAlgebra :: ExpAlgebra (String → Exp) 242 . con.

p → p) 243 . p.derAlgebra = (λf g → λs → f s ‘Plus‘ g s . λf g → λs → f s ‘Sub‘ g s . type PalAlgebra p = (p. λs → λt → if s = = t then Con 1 else Con 0 ) 5. λx → λbs → if null bs then x else error "no roothpath") 5. p. λn → λs → Con 0 . p → p.8 Using explicit recursion: path2Value (Leaf x ) = λbs → if null bs then x else error "no roothpath" path2Value (Bin lt rt) = λbs → case bs of [] → error "no roothpath" (L : rs) → path2Value lt rs (R : rs) → path2Value rt rs Using a fold: path2Value :: BinTree a → [Direction] → a path2Value = foldBinTree pvAlgebra pvAlgebra :: BinTreeAlgebra a ([Direction] → a) pvAlgebra = (λﬂ fr → λbs → case bs of [] → error "no roothpath" (L : rs) → ﬂ rs (R : rs) → fr rs .7 Using explicit recursion: replace (Leaf x ) y = Leaf y replace (Bin lt rt) y = Bin (replace lt y) (replace rt y) As a fold: replace :: BinTree a → a → BinTree a replace = foldBinTree repAlgebra repAlgebra = (λf g → λy → Bin (f y) (g y).9 1. λx y → Leaf y) 5.

λp → p + 2. a2cPal = foldPal ("" . 2. pal 4 .B. Answers to exercises 2.mir3) = fMir where fMir Mir1 = mir1 fMir (Mir2 m) = mir2 (fMir m) fMir (Mir3 m) = mir3 (fMir m) a2cMir = foldMir ("". 244 . λp → p) 4.Pal4. pal 2 . "b" . The parser pfoldPal m1 returns the concrete representation of a palindrome. pal 2 .m->m. 5.\m->"a"++m++"a". λp → "a" + p + "a" + + . pal 3 . "a" . pfoldPal :: PalAlgebra p → Parser Char p pfoldPal (pal 1 . type MirAlgebra m = (m. λp → "b" + p + "b" + + ) aCountPal = foldPal (0. pal 3 . pal 5 ) = fPal where fPal Pal 1 = pal 1 fPal Pal 2 = pal 2 fPal Pal 3 = pal 3 fPal (Pal 4 p) = pal 4 (fPal p) fPal (Pal 5 p) = pal 5 (fPal p) 3.mir2.\m->"b"++m++"b") m2pMir = foldMir (Pal1. pal 5 ) = pPal where pPal = const pal 1 <$> ε <|> const pal 2 <$> syma <|> const pal 3 <$> symb <|> (\ p → pal 4 p) <$> syma <∗> pPal <∗> syma <|> (\ p → pal 5 p) <$> symb <∗> pPal <∗> symb syma = symbol ’a’ symb = symbol ’b’ 5.Pal5) 3. 1. pal 4 . 0. The parser pfoldPal m2 returns the number of a’s occurring in a palindrome. foldPal :: PalAlgebra p → Pal → p foldPal (pal 1 .10 1.m->m) foldMir :: MirAlgebra m -> Mir -> m foldMir (mir1.

("0"."1") ) pfoldBitList :: BitListAlgebra bl z b -> Parser Char bl pfoldBitList ((consb.p->p) foldParity :: ParityAlgebra p -> Parity -> p foldParity (empty. 2. 5. The parser pfoldMir m2 returns the abstract representation of a palindrome. 5.4.singlebl).(consbl.\x->"1"++x++"1" .parityR0) = fParity where fParity Empty = empty fParity (Parity1 p) = parity1 (fParity p) fParity (ParityL0 p) = parityL0 (fParity p) fParity (ParityR0 p) = parityR0 (fParity p) a2cParity = foldParity ("" .(bl->z->z.12 1.p->p.((++).\x->x++"0" ) 3. The parser pfoldMir m1 returns the concrete representation of a palindrome.singleb).parityL0.b)) foldBitList :: BitListAlgebra bl z b -> BitList -> bl foldBitList ((consb. 4.bl->z).\x->"0"++x .id) .(bit0.singlebl). type BitListAlgebra bl z b = ((b->z->bl.bit1)) = pBitList where pBitList = consb <$> pBit <*> pZ <|> singleb <$> pBit 3.bit1)) = fBitList where fBitList (ConsB b z) = consb (fBit b) (fZ z) fBitList (SingleB b) = singleb (fBit b) fZ (ConsBL bl z) = consbl (fBitList bl) (fZ z) fZ (SingleBL bl) = singlebl (fBitList bl) fBit Bit0 = bit0 fBit Bit1 = bit1 a2cBitList = foldBitList (((++).(consbl.p->p.id) .singleb). 245 .b->bl).11 1. mir2.(bit0. 2.(b.parity1. mir3) = pMir where pMir = const mir1 <$> epsilon <|> (\_ m _ -> mir2 m) <$> syma <*> pMir <*> syma <|> (\_ m _ -> mir3 m) <$> symb <*> pMir <*> symb syma = symbol ’a’ symb = symbol ’b’ 5. pfoldMir :: MirAlgebra m -> Parser Char m pfoldMir (mir1. type ParityAlgebra p = (p.

(s1 . (r1 . s2 .’ <*> pBitList <*> pZ <|> singlebl <$> pBitList = const bit0 <$> symbol ’0’ <|> const bit1 <$> symbol ’1’ = = B1 Stat Rest = R1 Stat Rest | Nix = S1 Decl | S2 Use | S3 Nest = Dx | Dy = UX | UY = N1 Block The abstract representation of x. (d → s. dy).13 1.(y. (u. Answers to exercises pZ pBit 5. n1 ) = foldB where foldB (B1 stat rest) = b1 (foldS stat) (foldR rest) foldR (R1 stat rest) = r1 (foldS stat) (foldR rest) 246 . (ux . (dx . r ) .b → n ) 3. nix ). data Block data Rest data Stat data Decl data Use data Nest (\_ bl z -> consbl bl z) <$> symbol ’. foldBlock :: BlockAlgebra b r s d u n → Block → b foldBlock (b1 . type BlockAlgebra b r s d u n = (s → r → b . n → s) . (s → r → r .B. s3 ). uy). u) .Y). (d . d ) .X is B1 (S1 Dx ) (R1 stat 2 rest 2 ) where stat 2 = S3 (N1 block ) block = B1 (S1 Dy) (R1 (S2 UY ) Nix ) rest 2 = R1 (S2 UX ) Nix 2. u → s.

foldR Nix foldS (S1 decl ) foldS (S2 use) foldS (S3 nest) foldD Dx foldD Dy foldU UX foldU UY foldN (N1 block ) 4.

= nix = s1 (foldD decl ) = s2 (foldU use) = s3 (foldN nest) = dx = dy = ux = uy = n1 (foldB block )

a2cBlock :: Block → String a2cBlock = foldBlock a2cBlockAlgebra a2cBlockAlgebra :: BlockAlgebra String String String String String String a2cBlockAlgebra = (b1 , (r1 , nix ), (s1 , s2 , s3 ), (dx , dy), (ux , uy), n1 ) where b1 u v = u + v + r1 u v = ";" + u + v + + nix = "" s1 u = u s2 u = u s3 u = u dx = "x" dy = "y" ux = "X" uy = "Y" n1 u = "(" + u + ")" + + 7.1 The type TreeAlgebra and the function foldTree are deﬁned in the rep-min problem. 1. height = foldTree heightAlgebra heightAlgebra = (const 0, \l r -> (l ‘max‘ r) + 1) --frontAtLevel :: Tree -> Int -> [Int] --frontAtLevel (Leaf i) h = if h == 0 then [i] else [] --frontAtLevel (Bin l r) h = -- if h > 0 then frontAtLevel l (h-1) ++ frontAtLevel r (h-1) -else [] frontAtLevel = foldTree frontAtLevelAlgebra frontAtLevelAlgebra = ( \i h -> if h == 0 then [i] else []

247

B. Answers to exercises

, \f g h ) 2.

-> if h > 0 then f (h-1) ++ g (h-1) else []

a) Straightforward Solution: impossible to give a solution like rm.sol1. heightAlgebra = (const 0, \l r -> (l ‘max‘ r) + 1) front t = foldTree frontAtLevelAlgebra t h where h = foldTree heightAlgebra t b) Lambda Lifting front’ t = foldTree frontAtLevelAlgebra t (foldTree heightAlgebra t) c) Tupling Computations htupfr :: TreeAlgebra (Int, Int -> [Int]) htupfr -= (heightAlgebra ‘tuple‘ frontAtLevelAlgebra) = ( \i -> ( 0 , \h -> if h == 0 then [i] else [] ) , \(lh,f) (rh,g) -> ( (lh ‘max‘ rh) + 1 , \h -> if h > 0 then f (h-1) ++ g (h-1) else [] ) ) front’’ t = fr h where (h, fr) = foldTree htupfr t d) Merging Tupled Functions It is helpful to note that the merged algebra is constructed such that foldTree mergedAlgebra t i = (height t, frontAtLevel t i) Therefore, the deﬁnitions corresponding to the fourth solution are mergedAlgebra :: TreeAlgebra (Int -> (Int, [Int])) mergedAlgebra = (\i -> \h -> ( 0 , if h == 0 then [i] else [] ) , \lfun rfun -> \h -> let (lh, xs) = lfun (h-1) (rh, ys) = rfun (h-1) in ( (lh ‘max‘ rh) + 1

248

, if h > 0 then xs ++ ys else [] ) ) front’’’ t = fr where (h, fr) = foldTree mergedAlgebra t h 7.2 The highest front problem is the problem of ﬁnding the ﬁrst non-empty list of nodes which are at the lowest level. The solutions are similar to these of the deepest front problem. 1. --lowest :: Tree -> Int --lowest (Leaf i) = 0 --lowest (Bin l r) = ((lowest l) ‘min‘ (lowest r)) + 1 lowest = foldTree lowAlgebra lowAlgebra = (const 0, \ l r -> (l ‘min‘ r) + 1) 2. a) Straightforward Solution: impossible to give a solution like rm.sol1. lfront t = foldTree frontAtLevelAlgebra t l where l = foldTree lowAlgebra t b) Lambda Lifting lfront’ t = foldTree frontAtLevelAlgebra t (foldTree lowAlgebra t) c) Tupling Computations ltupfr :: TreeAlgebra (Int, Int -> [Int]) ltupfr = ( \i -> ( 0 , (\l -> if l == 0 then [i] else []) ) , \ (lh,f) (rh,g) -> ( (lh ‘min‘ rh) + 1 , \l -> if l > 0 then f (l-1) ++ g (l-1) else [] ) ) lfront’’ t = fr l where (l, fr) = foldTree ltupfr t d) Merging Tupled Functions It is helpful to note that the merged algebra is constructed such that foldTree mergedAlgebra t i = (lowest t, frontAtLevel t i)

249

B. Answers to exercises

Therefore, the deﬁnitions corresponding to the fourth solution are lmergedAlgebra :: TreeAlgebra (Int -> (Int, [Int])) lmergedAlgebra = ( \i -> \l -> ( 0 , if l == 0 then [i] else [] ) , \lfun rfun -> \l -> let (ll, lres) = lfun (l-1) (rl, rres) = rfun (l-1) in ( (ll ‘min‘ rl) + 1 , if l > 0 then lres ++ rres else [] ) ) lfront’’’ t = fr where (l, fr) = foldTree lmergedAlgebra t l 8.1 Let G = (T, N, R, S). Deﬁne the regular grammar G = (T, N ∪ {S }, R , S ) as follows. S is a new nonterminal symbol. For the productions R , divide the productions of R in two sets: the productions R1 of which the right-hand side consists just of terminal symbols, and the productions R2 of which the right-hand side ends with a nonterminal. Deﬁne a set R3 of productions by adding the nonterminal S at the right end of each production in R1. Deﬁne R = R ∪ R3 ∪ S → S | . The grammar G thus deﬁned is regular, and generates the language L∗ . 8.2 In step 2, add the productions S → aS S→ S → cC S→a A→ A → cC A→a B → cC B→a and remove the productions S→A A→B B→C In step 3, add the production S → a, and remove the productions A → , B → .

250

8.3 ndfsa d qs [ ] = qs ndfsa d qs (x : xs) = ndfsa d (d qs x) xs

8.4 Let M = (X, Q, d, S, F ) be a deterministic ﬁnite-state automaton. Then the nondeterministic ﬁnite-state automaton M deﬁned by M = (X, Q, d , {S}, F ), where d q x = {d q x} accepts the same language. 8.5 Let M = (X, Q, d, S, F ) be a deterministic ﬁnite-state automaton that accepts language L. Then the deterministic ﬁnite-state automaton M = (X, Q, d, S, Q − F ), where Q − F is the set of states Q from which all states in F have been removed, accepts the language L. Here we assume that d q a is deﬁned for all q and a. 8.6 1. L1 ∩ L2 = (L1 ∪ L2 ), so it follows that regular languages are closed under intersection if they are closed under complementation and union. Regular languages are closed under union, see Theorem 8.10, and they are closed under complementation, see Exercise 8.5. 2. = {w ∈ X ∗ | dfa accept w (d, (S1 , S2 ), (F1 × F2 ))} = {w ∈ X ∗ | dfa d (S1 , S2 ) w ∈ F1 × F2 } = = {w ∈ X ∗ | dfa d1 S1 w ∈ F1 ∧ dfa d2 S2 w ∈ F2 } = {w ∈ X ∗ | dfa d1 S1 w ∈ F1 } ∩ {w ∈ X ∗ | dfa d2 S2 w ∈ F2 } = Ldfa M1 ∩ Ldfa M2 8.7 1.

89:; ?>=< /.-, ()*+ S @A 89:; / ?>=< A O BC@A 89:; /.-, ()*+ / ?>=< B BCO

Ldfa M

This requires a proof by induction {w ∈ X ∗ | (dfa d1 S1 w, dfa d2 S2 w) ∈ F1 × F2 }

( )

) (

251

(bc∗ ) 3. 8. GF ?>=< 89:. = (Lre(R))(Lre(S + T )) = (Lre(R))(Lre(S) ∪ Lre(T )) = (Lre(R))(Lre(S)) ∪ (Lre(R))(Lre(T )) = Lre(RS + RT ) 2. V = S and W = ∅ satisfy the requirement. at least one of S ∩ R and S ∩ R is not empty. Answers to exercises 2. Another solution is V W = S∩R = S∩R Lre(R(S + T )) ?>=< 89:. 8. Take 1∗0∗. So when a 0 appears somewhere. {a}b∗ ∪ c∗ 8. ()*+ / ?>=< C 8.-. Similar to the above calculation. / ?>=< A 89:. Since S = ∅. ()*+ B 0 Since S = ∅.-. 8. and it follows that V = W .8 Use the deﬁnition of Lre to calculate the languages: 1. and R = a.9 1. b} 2. S 1 0 0 ED 89:. then Lre(R(S + V )) = Lre(R(S + W )).-. /. /. { .11 If both V and W are subsets of S. S = b. /. 8. There exist other solutions than these. only 0’s can follow.B.10 Take R = a.12 The string 01 may not occur in a string in the language of the regular expression. ()*+ / ?>=< D 1 0 89:. S = a.13 252 .

14 First construct a nondeterministic ﬁnite-state automaton for these grammars. a(b + b(b + ab)∗(ab + )) + bb∗ + 2. w be such that y = uvw with v = . we can use the procedure constructed in Exercise 8. v = aq and w = ar with p + q + r = n and q > 0.16 1. (0 + 10∗1)∗ 8.1. y = an . u = ap . Take i = 2. Let u. (b + a(bb) ∗ b)∗ 9. we use Exercise 8. v.1 We follow the proof pattern for non-regular languages given in the text. 2 2 Take s = an with x = . The language of the regular expression a + bb is generated by the regular grammar S2 → a | bb 2. 1. Let n ∈ N. The language of the regular expression a∗ is generated by the regular grammar S0 → | aS0 The language of the regular expression b∗ is generated by the regular grammar S1 → | bS1 The language of the regular expression ab is generated by the regular grammar S2 → ab It follows that the language of the regular expression a∗ + b∗ + ab is generated by the regular grammar S3 → S0 | S1 | S2 8. The following regular grammar generates (a + bb)∗ + c: S0 → S1 | c provided S1 generates (a + bb)∗. Again. b ∗ +b ∗ (aa ∗ b(a + b)∗) 2. If we can give a regular grammar for (a + bb)∗ + c. that is. and z = an −n . then 253 .1 to obtain a regular grammar for the complete regular expression.1 and a regular grammar for a + bb to obtain a regular grammar for (a + bb)∗. and use the automata to construct the following regular expressions.

v = bq and w = br with p + q + r = n and q > 0. v. v. u. Take i = 0. u. Take i = 2. z. Let u. v. x. y = bn . and z = bn . y = bn .3 Using the proof pattern. v. v = bq and w = br with p + q + r = n and q > 0. Take s = an b2n with x = an . and z = b.B. Take s = an bn+1 with x = an . w. then xuwz ∈ L ⇐ ⇐ p+q+r ⇐ q>0 p+r+1 defn. calculus an b2n+q ∈L 254 . that is. w. u = bp . w be such that y = uvw with v = . z. Let u. u. Answers to exercises xuv 2 wz ∈ L ⇐ ⇐ ⇐ n2 < n2 + q < (n + 1)2 ⇐ q < 2n + 1 defn. Let n ∈ N. calculus 2 ap+2q+r an −n ∈L p + q + r = n and q > 0 n2 + q is not a square 9. that is. Let n ∈ N. then xuv 2 wz ∈ L ⇐ ⇐ 2n + q > 2n ⇐ q>0 defn. z. v. calculus an bp+r+1 ∈L 9. x. w be such that y = uvw with v = . x. w. u = bp .2 Using the proof pattern.

v. d ∈ N. and |vwx| r. w = ar . and z = . |vx| > 0 and |vwx | d That is u = ap . calculus 2 ak +q+s ∈L q+s>0 k 2 + q + s is not a square 9. x. calculus a6 bn+4 an+5q ∈ L 9. u. that is.7 Using the proof pattern. Let c. Take i = 6. u = ap . then uv 2 wx 2 y ∈ L ⇐ ⇐ ⇐ k 2 < k 2 + q + s < (k + 1)2 ⇐ q + s < 2k + 1 ⇐ defn. w.5 The length of a substring with a’s. v = aq . v. b’s. Take s = a6 bn+4 an with x = a6 bn+4 . d). x. Let u. Let u. Take i = 2. q + r + s d and q + s > 0. 255 . then xuv 6 wz ∈ L ⇐ ⇐ n + 5q > n + 4 ⇐ q>0 9.6 We follow the proof pattern for proving that a language is not context-free given in the text. y be such that z = uvwxy. x. v = aq and w = ar with p + q + r = n and q > 0. z. d ∈ N. Let c. y. x = as and y = at with p + q + r + s + t = k 2 . and c’s is at least r + 2. u. 2 Take z = ak with k = max(c. k q+s d defn. v. w. v. d defn.9. Let n ∈ N. w. y = an .4 Using the proof pattern. w be such that y = uvw with v = .

w.t. y be such that z = uvwxy. Let u. or just b’s. calculus ak+kq+ks ∈L for some s. t. |vx| > 0 and |vwx | d Note that our choice for k guarantees that substring vwx has one of the following shapes: • vwx consists of just a’s. v. x. w be such that y = uvw with v = . then 256 . v = bq and w = br with p + q + r = n and q > 0. Let n ∈ N. p and q is less than k. At least one of s . u. • If vwx contains both a’s and b’s. Let u. Take z = ak bk ak bk with k = max(c. q. v. v = aq . u = bp . q + r + s d and q + s > 0. then uv k+1 wx k+1 y ∈ L ⇐ ⇐ k(1 + q + s) is not a prime ⇐ q+s>0 9. Take i = 2. y = 0n . y. d ∈ N. Then the string uwy can be written as uwy = as bt ap bq defn. Again this sentence is not an element of the language. Take i = k + 1. then • If vwx consists of just a’s. respectively.9 Using the proof pattern. w. then it is impossible to write the string uwy as ww for some string w. w = ar . while two of them are equal to k. v. x. 9. d). x = as and y = at with p + q + r + s + t = k. |vx| > 0 and |vwx | d That is u = ap . w. p. Answers to exercises Take z = ak with k is prime and k > max(c. since only the number of terminals of one kind is decreased. y be such that z = uvwxy. Let c. Take i = 0. d).8 Using the proof pattern. v. Take s = 1n 0n with x = 1n . x. it lies somewhere on the border between a’s and b’s. and z = . or just b’s. that is. or on the border between b’s and a’s. Let u. • vwx contains both a’s and b’s.B.

x. Let c. then xuv 3 wz ∈ L ⇐ ⇐ ⇐ defn. Take s = an bn+1 with x = . w be such that y = uvw with v = . u. x. These are exactly the strings of the given language. The language {ai bj | 0 i j} is context-free. The language is not regular. 2.10 1. Let n ∈ N. u = ap . w. v = aq and w = ar with p + q + r = n and q > 0. and z = bn+1 . n ∈ N. v. z. calculus 1n 0p+2q+r ∈L p+q+r =n 1n 0n+q ∈ L q>0 true 9. u. We prove this using the proof pattern. b}∗ } is not context-free.11 1. d ∈ N. 257 . The language {wcw | w ∈ {a. v. w. z. v. y = an . Let u. Take i = 3. that is. The language can be generated by the following context-free grammar: S → aSbA S → A → bA A → This grammar generates the strings am bm bn for m. calculus ap+3q+r bn+1 ∈ L p+q+r =n n + 2q > n + 1 q>0 true 9.xuv 2 w ∈ L ⇐ ⇐ ⇐ defn. We prove this using the proof pattern.

We use the proof pattern and choose s = {n }n . If the string uwy does contain c then it has either fewer b’s at the left-hand side than at the right-hand side or fewer a’s at the right-hand side than at the left-hand side. The grammar is context-free. 2.12 1. • If vwx contains c then the string vwx can be written as vwx = bs cat for some s.2. The substring vwx has one of the following shapes: • vwx does not contain c. then • If vwx does not contain c it is a substring at the left-hand side or at the right-hand side of c. since only the number of terminals on one side of c is decreased. Let u. The language is not regular because it is not context-free. The grammar is not right-regular but left-regular because the nonterminal appears at the left hand side in the last production rule. 2. • vwx contains c.13 1. 258 . For uwy there are two possibilities. w. d). y be such that z = uvwxy. The grammar is not regular because there is a production rule with two nonterminals at the right hand side. Take i = 0.B. The language is context-free because it is generated by a context-free grammar. So it is impossible to write the string uwy as scs for some string s. t respectively. The grammar G is context-free because there is only one nonterminal at the left hand side of the production rules and the right hand side is a sequence of terminals and nonterminals. Then it is impossible to write the string uwy as scs for some string s. or both. 9. The language is not regular. The string uwy does not contain c so it is not an element of the language. v. L(G) = {(st)∗ | s ∈ {∗ ∧ t ∈ }∗ ∧ |s| = |t| } L(G) is all sequences of nested braces. Answers to exercises Take z = ak bk cak bk with k = max(c. x. The set of regular languages is a subset of the set of context-free languages. |vx| > 0 and |vwx | d Note that our choice for k guarantees that |vwx | k. The proof is exactly the same as the proof for non-regularity of L = {am bm | m 0} in Section 9. 9. 3.

b.2.b.3 For gramm1 we have: follow follow follow follow S A B C = = = = {a.c} For gramm2: follow = const {} 10.b.c} {a.c} {a.1 For both grammars we have: empty = const False 10. The language is context-free and regular because it is generated by a regular grammar.2 For grammar1 we have: firsts firsts firsts firsts S A B C = = = = {b. 10.4 For the productions of gramm3 we have: 259 .b} {a.b} For grammar2: firsts S = {a} firsts A = {b} 10.c} {a. L(G) = 0∗ ∪ 10∗ Note that the language is also generated by the following right-regular grammar: S → S → 1A S → 0A A → A → 0 A → 0A 3.c} {a.b.c} {a.b.

c } {b} {c} {a} {b} Since all lookAhead sets for productions of the same nonterminal are disjoint.b} {b} {a. gramm2’ is an LL(1) grammar.b} The follow function follow = const {} The lookAhead function lookAhead lookAhead lookAhead lookAhead lookAhead lookAhead (S → aC) (C → bA) (C → a) (A → bD) (D → b) (D → S) = = = = = = {a} {b} {a} {b} {b} {a} Clearly. Answers to exercises lookAhead lookAhead lookAhead lookAhead lookAhead (S → AaS) (S → B) (A → cS) (A → []) (B → b) = = = = = {a.B. 10. 10. gramm3 is an LL(1) grammar.5 After left factoring gramm2 we obtain gramm2’ with productions S → aC C → bA | a A → bD D → b | S For this transformed grammar gramm2’ we have: The empty function empty = const False The first function firsts firsts firsts firsts S C A D = = = = {a} {a.6 For the empty function we have: 260 .

Node D [ Node b [] ] ] ] ] 10.7 Node S [ Node c [] .9 261 . Node C [ Node b []. 1} {} {. Node C [ Node b [] . Node c [] ] .8 Node S [ Node a [] . 10.} {} The lookAhead function is lookAhead lookAhead lookAhead lookAhead lookAhead (L → B R) (R → ) (R → . Node B [ Node c []. B R) (B → 0) (B → 1) = = = = = {0.} {0. Node a [] ] ] ] 10.1} {.} {0} {1} Since all lookAhead sets for productions of the same nonterminal are disjoint. Node A [ Node b []. Node A [ Node c [] .1} The follow function is follow B follow _ = = {. the grammar is LL(1).empty R empty _ = = True False For the firsts function we have firsts L firsts R firsts B = = = {0.

262 .10 1. Node S [ Node B [ Node b []]]] . Answers to exercises Node S [ Node A [] . Node a [] . 6. Node S [ Node A [ Node c []. prefplus p [] = [] prefplus p (x:xs) = if p x then single x ‘union‘ prefplus p xs else single x prefplus p = foldr op [] where op x us = if p x then single x ‘union‘ us else single x prefplus p = foldr op [] where op x us = if p x then single x ‘union‘ us else single x = foldr op [] 3. Node S [ Node B [ Node b [] ]] ] ] 10. 5. 2. list2Set :: Ord s => [s] -> [s] list2Set = unions .B. Node a [] . map single list2Set :: Ord s => [s] -> [s] list2Set = foldr op [] where op x xs = single x ‘union‘ xs pref :: Ord s => (s -> Bool) -> [s] -> [s] pref p = list2Set . takeWhile p or pref p [] = [] pref p (x:xs) = if p x then single x ‘union‘ pref p xs else [] or pref p = foldr op [] where op x xs = if p x then single x ‘union‘ xs else [] 4.

10. 8. The function foldrRhs p f start takes a list xs and returns the set start ‘union‘ foldrRhs p f [] xs. For gramm3 a) = unions (map first (prefplus empty AaS)) foldrRhs empty first [] AaS empty C = False unions (map first [C]) empty b = False unions (map first [b]) foldrRhs empty first [] bSA 263 . The function foldrRhs p f [] takes a list xs and returns the set of all f-images of the elements of the preﬁx of xs all of whose elements satisfy p together with the f-image of the ﬁrst element of xs that does not satisfy p (if this element exists). For gramm1 a) = unions (map first (prefplus empty bSA)) = = unions [{b}] = {b} b) foldrRhs empty first [] Cb = unions (map first (prefplus empty Cb)) = = unions [{a. b } 2.where op x us = single x ‘union‘ rest where rest = if p x then us else [] = foldrRhs p single [] 7. b }] = { a.11 1.

b. b.c} The production rule B → cc cannot be used for a reduction when the input begins with the symbol c.B. . 264 . aS.c} {a. [] ] 11.1 Recall.b.b} {a. {a. c}. ] = foldrRhs foldrRhs foldrRhs foldrRhs empty empty empty empty first first first first [] [] [] [] AaS aS S [] scanrRhs empty first [] AaS calculus [ {a. for gramm1 we have: follow follow follow follow S A B C = = = = {a. {a}. . {a}] = {a.c} {a. c} b) = map (foldrRhs empty first []) (tails AaS) = map (foldrRhs empty first []) [AaS. S. []] = [ . a])) = unions [first A. Answers to exercises = empty A = True unions (map first ([A] ‘union‘ prefplus empty aS)) empty a = False unions (map first ([A] ‘union‘ [a])) = unions (map first ([A. first a] = unions [{c}.b. c }.

stack c cc ccc cccc Bcc bBcc abBcc CBcc Ac S input ccccba cccba ccba cba ba ba a 11.2 Recall.3 Recall. for gramm2 we have: follow S = {} follow A = {} No reduction can be made until the input is empty. stack a ba bba bbba Aba S input abbb bbb bb b 11. 265 . for gramm3 we have: follow S = {a} follow A = {a} follow B = {a} At the beginning a reduction is applied using the production rule A → ε.

Answers to exercises stack A aA caA bcaA BcaA ScaA AaA aAaA baAaA SaAaA SaA S input acbab acbab cbab bab ab ab ab ab b 266 .B.