
Marcus Kracht

Department of Linguistics, UCLA

3125 Campbell Hall

450 Hilgard Avenue

Los Angeles, CA 90095–1543

kracht@humnet.ucla.edu

1 General Remarks

This lecture is an introduction to computational linguistics. As such it is also an

introduction to the use of the computer in general. This means that the course will

not only teach theoretical methods but also practical skills, namely programming.

In principle, this course does not require any knowledge of either mathematics or

programming, but it does require a level of sophistication that is acquired either

by programming or by doing some mathematics (for example, an introduction to

mathematical linguistics).

The course uses the language OCaML for programming. It is, however, not
an introduction to OCaML itself. It offers no comprehensive account of the

language and its capabilities but focuses instead on the way it can be used in

practical computations. OCaML can be downloaded from the oﬃcial site at

http://caml.inria.fr/

together with a reference manual and a lot of other useful material. (The latest

release is version number 3.09.1 as of January 2006.) In particular, there is a

translation of an introduction to programming with OCaML which is still not
published and can therefore be downloaded from that site (see under OCaML

Light). The book is not needed, but helpful. I will assume that everyone has


a version of OCaML installed on his or her computer and has a version of the

manual. If help is needed in installing the program I shall be of assistance.

The choice of OCaML perhaps needs comment, since it is not so widely

known. In the last twenty years the most popular language in computational

linguistics (at least as far as introductions were concerned) was certainly Prolog.
However, its lack of imperative features makes its runtime performance rather bad

for the problems that we shall look at (though this keeps changing, too). Much of

actual programming nowadays takes place in C or C++, but these two are not so

easy to understand, and moreover take a lot of preparation. OCaML actually is

rather fast, if speed is a concern. However, its main features that interest us here

are:

• It is completely typed. This is particularly interesting in a linguistics course,
since it offers a view into a completely typed universe. We shall actually
begin by explaining this aspect of OCaML.

• It is completely functional. In contrast to imperative languages, functional
languages require more explicitness in the way recursion is handled. Rather
than saying that something needs to be done, one has to say how it is done.

• It is object oriented. This aspect will not be so important at earlier stages,
but turns into a plus once more complex applications are looked at.

2 Practical Remarks Concerning OCaML

When you have installed OCaML, you can invoke the program in two ways:

• in Windows, by clicking on the icon for the program; this will open an
interactive window much like an xterminal in Unix;

• by typing ocaml after the prompt in a terminal window.

Either way you have a window with an interactive shell that opens by declaring

what version is running, like this:

(1) Objective Caml version 3.09.1


It then starts the interactive session by giving you a command prompt: #. It is then

waiting for you to type in a command. For example

(2) # 4+5;;

It will execute this and return to you as follows:

(3)

- : int = 9

#

So, it gives you an answer (the number 9), which is always accompanied by some

indication of the type of the object. (Often you ﬁnd that you get only the type in-

formation, see below for reasons.) After that, OCaML gives you back the prompt.

Notice that your line must end in ;; (after which you have to hit < return >, of

course). This will cause OCaML to parse it as a complete expression and it will

try to execute it. Here is another example.

(4) # let a = ’q’;;

In this case, the command says that it should assign the character q to the object

with name a. Here is what OCaML answers:

(5) val a : char = ’q’

Then it will give you the prompt again. Notice that OCaML answers without

using the prompt, #. This is how you can tell your input from OCaML's answers.

Suppose you type

(6) # Char.code a;;

Then OCaML will answer

(7) - : int = 113

Then it will give you the prompt again. At a certain point this way of using

OCaML turns out to be tedious. For even though one can type in entire programs

this way, there is no way to correct mistakes once a line has been entered.
Therefore, the following procedure is advisable.

First, you need to use an editor (I recommend using either emacs or vim;

vim runs on all platforms, and it can be downloaded and installed with no charge


and with a mouseclick). Editors that Windows provides by default are not good.

Either they do not let you edit a raw text ﬁle (like Word or the like) or they do

not let you choose where to store data (like Notepad). So, go for an independent

editor, install it (you can get help on this too from us).

Using the editor, open a file <myfile>.ml (it has to have the extension .ml).

Type in your program or whatever part of it you want to store separately. Then,

whenever you want OCaML to use that program, type after the prompt:

(8) # #use "<myfile>.ml";;

(The line will contain two #, one is the prompt of OCaML, which obviously you

do not enter, and the other is from you! And no space in between the # and the

word use.) This will cause OCaML to look at the content of the ﬁle as if it had

been entered from the command line. OCaML will process it, and return a result,

or, which happens very frequently in the beginning, some error message, upon

which you have to go back to your ﬁle, edit it, write it, and load it again. You may

reload the program; this will just overwrite the assignments you have made earlier.

Notice that the ﬁle name must be enclosed in quotes. This is because OCaML

expects a string as input for #use and it will interpret the string as a ﬁlename.

Thus, the extension, even though technically superﬂuous, has to be put in there as

well.

You may have stored your program in several places, because you may want

to keep certain parts separate. You can load them independently, but only insofar

as the dependencies are respected. If you load a program that calls functions that

have not been deﬁned, OCaML will tell you so. Therefore, make sure you load the

ﬁles in the correct order. Moreover, if you make changes and load a ﬁle again, you

may have to reload every ﬁle that depends on it (you will see why if you try...).

Of course there are higher level tools available, but this technique is rather

eﬀective and fast enough for the beginning practice.
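As an illustration of this workflow, a hypothetical file basics.ml (the name and its contents are invented for the example) might contain a couple of definitions:

```ocaml
(* contents of a hypothetical file basics.ml *)
let greeting = "hello"

let double x = x + x
```

After issuing #use "basics.ml";; at the prompt, the names greeting and double are available in the interactive session.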

Notice that OCaML goes through your program three times. First, it parses the

program syntactically. At this stage it will only tell you if the program is incorrect

syntactically. Everything is left associative, as we will discuss below. The second

time OCaML will check the types. If there is a type mismatch, it returns to you

and tells you so. After it is done with the types it is ready to execute the program.

Here, too, it may hit type mismatches (because some of them are not apparent at

ﬁrst inspection), and, more often, run time errors such as accessing an item that


isn’t there (the most frequent mistake).

3 Welcome To The Typed Universe

In OCaML, every expression is typed. There are certain basic types, but types can

also be created. The type is named by a string. The following are inbuilt types

(the list is not complete, you ﬁnd a description in the user manual):

(9)

character char

string string

integer int

boolean bool

ﬂoat float

There are conventions to communicate a token of a given type. Characters must be

enclosed in single quotes. 'a', for example, refers to the character a. Strings must

be enclosed in double quotes: "mail" denotes the string mail. The distinction

matters: "a" is a string, containing the single letter a. Although identical in

appearance to the character a, they must be kept distinct. The next type, integer

is typed in as such, for example 10763. Notice that "10763" is a string, not

a number! Booleans have two values, typed in as follows: true and false.

Finally, ﬂoat is the type of ﬂoating point numbers. They are distinct from integers,

although they may have identical values. For example, 1.0 is a number of type

ﬂoat, 1 is a number of type int. Type in 1.0 = 1;; and OCaML will respond

with an error message:

(10) # 1.0 = 1;;
This expression has type int but is here used with type
float

And it will underline the oﬀending item. It parses expressions from left to right

and stops at the ﬁrst point where things do not match. First it sees a number of

type ﬂoat and expects that to the right of the equation sign there also is a number

of type ﬂoat. When it reads the integer it complains. For things can only be equal

via = if they have identical type.
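Since the types on the two sides of = must match, the comparison above can be made well-typed only by converting explicitly. A minimal sketch, using the built-in conversion functions float_of_int and int_of_float (the names ok and n are ours):

```ocaml
(* = compares only values of identical type, so convert explicitly *)
let ok = 1.0 = float_of_int 1      (* true: both sides are now float *)
let n = int_of_float 1.7           (* conversion truncates: n = 1 *)
```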

The typing is very strict. Whenever something is declared to be of certain type,

it can only be used with operations that apply to things of that type. Moreover,


OCaML immediately sets out to type the expression you have given. For example,

you wish to deﬁne a function which you name f, and which adds a number to

itself. This is what happens:

(11) # let f x = x + x;;

val f : int −> int = <fun>

This says that f is a function from integers to integers. OCaML knows this be-

cause + is only deﬁned on integers. Addition on ﬂoat is +.:

(12) # let g x = x +. x;;

val g : float −> float = <fun>

There exist type variables, which are ’a, ’b and so on. If OCaML cannot infer

the type, or if the function is polymorphic (it can be used on more than one type)

it writes ’a in place of a speciﬁc type.

(13) # let iden x = x;;

val iden : ’a −> ’a = <fun>

There exists an option to make subtypes inherit properties from the supertype, but

OCaML needs to be explicitly told to do that. It will not do any type conversion

by itself.

Let us say something about types in general. Types are simply terms of a par-

ticular signature. For example, a very popular signature in Montague grammar

consists in just two so-called type constructors, → and •. A type constructor is

something that makes new types from old types; both → and • are binary type
constructors. They require two types each. In addition to type constructors we need

basic types. (There are also unary type constructors, so this is only an example.)

The full deﬁnition is as follows.

Deﬁnition 1 (Types) Let B be a set, the set of basic types. Then Typ_{→,•}(B), the
set of types over B, with type constructors → and •, is the smallest set such that

• B ⊆ Typ_{→,•}(B), and

• if α, β ∈ Typ_{→,•}(B) then also α → β, α • β ∈ Typ_{→,•}(B).


Each type is interpreted by particular elements. Typically, the elements of diﬀerent

types are diﬀerent from each other. (This is when no type conversion occurs.

But see below.) If x is of type α and y is of type β ≠ α then x is distinct from

y. Here is how it is done. We interpret each type b ∈ B by a set M_b in
such a way that M_a ∩ M_b = ∅ if a ≠ b. Notice, for example, that OCaML

has a basic type char and a basic type string, which get interpreted by the

set of ASCII–characters, and strings of ASCII–characters, respectively. We may

construe strings over characters as functions from numbers to characters. Then,

even though a single character is thought of in the same way as the string of

length 1 with that character, they are now physically distinct. The string k, for

example, is the function f : {0} → M_char such that f (0) = k. OCaML makes

sure you respect the diﬀerence by displaying the string as "k" and the character

as ’k’. The quotes are not part of the object, they tell you what its type is.

Given two types α and β we can form the type of functions from α–objects to

β–objects. These functions have type α → β. Hence, we say that

(14) M_{α→β} = { f : f is a function from M_α to M_β }

For example, we can deﬁne a function f by saying that f (x) = 2x + 3. The way to

do this in OCaML is by issuing

(15) # let f x = (2 * x) + 3;;

val f : int −> int = <fun>

Now we understand better what OCaML is telling us. It says that the value of the

symbol f is of type int -> int because it maps integers to integers, and that

it is a function. Notice that OCaML inferred the type from the deﬁnition of the

function. We can take a look how it does that.

First, if x is an element of type α → β and y is an element of type α, then x is

a function that takes y as its argument. In this situation the string for x followed

by the string for y (but separated by a blank) is well–formed for OCaML, and it is

of type β. OCaML allows you to enclose the string in brackets, and sometimes it

is even necessary to do so. Hence, f 43 is well–formed, as is (f 43) or even (f

(43)). Type that in and OCaML answers:

(16) # f 43;;

- : int = 89


The result is an integer, and its value is 89. You can give the function f any term

that evaluates to an integer. It may contain function symbols, but they must be

deﬁned already. Now, the functions * and + are actually predeﬁned. You can look

them up in the manual. It tells you that their type is int -> int -> int. Thus,

they are functions from integers to functions from integers to integers. Hence (2

* x) is an integer, because x is an integer and 2 is. Likewise, (2 * x) + 3 is an

integer.
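Because + and * have type int -> int -> int, applying them to a single argument already yields a function. A small sketch of this partial application (the names add, add3 and r are ours):

```ocaml
let add = ( + )          (* the addition function itself, int -> int -> int *)
let add3 = add 3         (* one argument consumed: int -> int *)
let r = add3 4           (* 7 *)
```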

Likewise, if α and β are types, so is α•β. This is the product type. It contains

the set of pairs (x, y) such that x is type α and y of type β:

(17) M_{α•β} = M_α × M_β = {(x, y) : x ∈ M_α, y ∈ M_β}

Like function types, product types do not have to be oﬃcially created to be used.

OCaML’s own notation is ∗ in place of •, everything else is the same. Type in, for

example, (’a’, 7) and this will be your response:

(18) - : char * int = ('a', 7)

This means that OCaML has understood that your object is composed of two parts,

the left hand part, which is a character, and the right hand part, which is a number.

It is possible to explicitly deﬁne such a type. This comes in the form of a type

declaration:

(19) # type prod = int * int;;

type prod = int * int

The declaration just binds the string prod to the type of pairs of integers, called the

product type int * int by OCaML. Since there is nothing more to say, OCaML

just repeats what you have said. It means it has understood you and prod is now

reserved for pairs of integers. However, such explicit deﬁnition hardly makes

sense for types that can be inferred anyway. For if you issue let h = (3,4)

then OCaML will respond val h : int * int = (3,4), saying that h has

the product type. Notice that OCaML will not say that h is of type prod, even if

that follows from the deﬁnition.

There is often a need to access the ﬁrst and second components of a pair. For

that there exist functions fst and snd. These functions are polymorphic. For

every product type α • β, they are deﬁned and fst maps the element into its ﬁrst

component, and snd onto its second component.
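A quick sketch of fst and snd at work, showing that the same two functions apply at any product type (the variable names are ours):

```ocaml
let p = ('a', 7)
let c = fst p            (* 'a', the first component *)
let n = snd p            (* 7, the second component *)
let q = ("mail", true)
let s = fst q            (* the same fst, at another product type *)
```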


An alternative to the pair constructor is the constructor for records. The latter

is very useful. It is most instructive to look at a particular example:

(20) type car = {brand : string; vintage : int;
used : bool};;

This deﬁnes a type structure called car, which has three components: a brand, a

vintage and a usedness value. On the face of it, a record can be replicated with

products. However, notice that records can have any number of entries, while a

pair must consist of exactly two. So, the record type car can be replicated by

either string * (int * bool) or by (string * int) * bool.

But there are also other diﬀerences. One is that the order in which you give

the arguments is irrelevant. The other is that the names of the projections can be

deﬁned by yourself. For example, you can declare that mycar is a car, by issuing

(21) # let mycar = {brand = "honda"; vintage = 1977;
used = false};;

This will bind mycar to the element of type car. Moreover, the expression

mycar.brand has the value honda. (To communicate that, it has to be enclosed

in quotes, of course. Otherwise it is mistaken for an identiﬁer.) It is not legal to

omit the speciﬁcation of some of the values. Try this, for example:

(22) # let mycar = {brand = "honda"};;
Some record field labels are undefined: vintage used

This is not a warning: the object named "mycar" has not been deﬁned.

(23) # mycar.brand;;

Unbound value mycar

If you look carefully, you will see that OCaML has many more type construc-

tors, some of which are quite intricate. One is the disjunction, written |. We shall

use the more suggestive ∪. We have

(24) M_{α∪β} = M_α ∪ M_β

Notice that by deﬁnition objects of type α are also objects of type α ∪ β. Hence

what we said before about types keeping everything distinct is not accurate. In


fact, what is closer to the truth is that types oﬀer a classifying system for objects.

Each type comes with a range of possible objects that fall under it. An object x

is of type α, in symbols x : α, if and only if x ∈ M_α. The interpretation of basic

types must be given, the interpretation of complex types is inferred from the rules,

so far (14), (17) and (24). The basic types are disjoint by deﬁnition.

There is another constructor, the list constructor. Given a type, it returns the

type of lists over that type. You may think of them as sequences from the numbers

0, 1, . . . , n − 1 to members of α. However, notice that this interpretation makes

them look like functions from integers to M_α. We could think of them as members

of int −> α. However, this is not the proper way to think of them. OCaML

keeps lists as distinct objects, and it calls the type constructor list. It is a postﬁx

operator (it is put after its argument). You can check this by issuing

(25) # type u = int list;;
type u = int list

This binds the string u to the type of lists of integers. As with pairs, there are ways

to handle lists. OCaML has inbuilt functions to do this. Lists are communicated

using [, ; and ]. For example, [3;2;4] is a list of integers, whose ﬁrst element

(called head) is 3, whose second element is 2, and whose third and last element is

4. ['a';'g';'c'] is a list of characters, and it is not a string, while ["mouse"] is

a list containing just a single element, the string mouse. There is a constant, [ ],

which denotes the empty list. The double colon denotes the function that prepends

an element to a list: so 3::[4;5] equals the list [3;4;5]. List concatenation is

denoted by @. The element to the left of the double colon will be the new head

of the list, the list to the right the so–called tail. You cannot easily mix elements.

Try, for example, typing

(26) let h = [3; "st"; 'a'];;

This expression has type string but is here used with

type int

Namely, OCaML tries to type the expression. To do this, it opens a list beginning

with 3, so it thinks this is a list of integers. The next item is a string, so no match

and OCaML complains. This is the way OCaML looks at it. (It could form the

disjunctive type all = int | char | string and then take the object to be a

list of all objects. But it does not do that. It can be done somehow, but then

OCaML needs to be taken by the hand.)
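The list operations just described can be tried directly at the prompt; a short sketch (the variable names are ours, hd and tl come from the List module):

```ocaml
let l = 3 :: [4; 5]      (* prepending an element: [3; 4; 5] *)
let m = [1; 2] @ l       (* concatenation: [1; 2; 3; 4; 5] *)
let h = List.hd l        (* the head of l: 3 *)
let t = List.tl l        (* the tail of l: [4; 5] *)
```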


4 Function Deﬁnitions

Functions can be deﬁned in OCaML in various ways. The ﬁrst is to say explicitly

what it does, as in the following case.

(27) # let appen x = x ^ "a";;

This function takes a string and appends the letter a. Notice that OCaML interprets

the ﬁrst identiﬁer after let as the name of the function to be deﬁned and treats the

remaining ones as arguments to be consumed. So they are bound variables.

There is another way to achieve the same, using a constructor much like the

λ-abstractor. For example,

(28) # let app = (fun x -> x ^ "a");;

Notice that to the left the variable is no longer present. The second method has

the advantage that a function like app can be used without having to give it a

name. You can replace app wherever it occurs by (fun x -> x ^ "a"). The choice

between the methods is basically one of taste and conciseness.

The next possibility is to use a deﬁnition by cases. One method is to use

if then else, which can be iterated as often as necessary. OCaML provides

another method, namely the construct match with. You can use it to match

objects with variables, pairs with pairs of variables, and lists with head and tail (iterated

to any degree needed).
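Here is a small sketch of match ... with, used once on a pair and once on a list (both function names are our own):

```ocaml
(* match a pair against a pair of variables *)
let swap p = match p with
    (x, y) -> (y, x)

(* match a list against the empty list or a head :: tail pattern *)
let describe l = match l with
    [] -> "empty"
  | h :: _ -> "starts with " ^ string_of_int h
```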

Recursion is the most fundamental tool for deﬁning a function. If you do logic,

one of the ﬁrst things you get told is that just about any function on the integers

can be deﬁned by recursion, in fact by what is known as primitive recursion, from

only one function: the successor function! Abstractly, to deﬁne a function by

recursion is to do the following. Suppose that f (x) is a function on one argument.

Then you have to say, ﬁrst, what f (0) is and, second, what f (n + 1) is on

the basis of the value f (n). Here is an example. The function that adds x to y,

written x + y, can be deﬁned by recursion as follows.

(29) 0 + y := y; (n + 1) + y := (n + y) + 1

At ﬁrst it is hard to see that this is indeed a deﬁnition by recursion. So, let us write

‘s’ for the successor function; then the clauses are

(30) 0 + y := y; s(n) + y := s(n + y)


Indeed, to know the value of f (0) = 0 + y you just look up y. The value of

f (n + 1) := (s(n)) + y is deﬁned using the value f (n) = n + y. Just take that value

and add 1 (that is, form the successor).

The principle of recursion is widespread and can be used in other circum-

stances as well. For example, a function can be deﬁned over strings by recursion.

What you have to do is the following: you say what the value is of the empty

string, and then you say what the value is of f (x^a) on the basis of f (x), where x
is a string variable and a a letter. Likewise, functions on lists can be deﬁned by

recursion.

Now, in OCaML there is no type of natural numbers, just that of integers. Still,

deﬁnitions by recursion are very important. In deﬁning a function by recursion

you have to tell OCaML that you intend such a deﬁnition rather than a direct

deﬁnition. For a deﬁnition by recursion means that the function calls itself during

evaluation. For example, if you want to deﬁne exponentiation for numbers, you

can do the following:

(31) let rec expt n x =

if x = 0 then 1 else

n * (expt n (x-1));;

Type this and then say expt 2 4 and OCaML replies:

(32) - : int = 16

As usual, recursions need to be grounded, otherwise the program keeps calling

itself over and over again. If the deﬁnition was just given for natural numbers,

there would be no problem. But as we have said, there is no such type, just the

type of integers. This lets you evaluate the function on arguments where it is not

grounded. In the present example negative numbers let the recursion loop forever.

This is why the above deﬁnition is not good. Negative arguments are accepted,

but never terminate (because the value on which the recursion is based is further

removed from the base, here 0). The evaluation of expt 2 -3 never terminates.

Here is what happens:

(33) # expt 2 (-3);;

Stack overflow during evaluation (looping recursion?)

One way to prevent this is to check whether the second argument is negative, and

then reject the input. It is, however, good style to add a diagnostic so that you


can ﬁnd out why the program rejects the input. To do this, OCaML lets you deﬁne

so–called exceptions. This is done as follows.

(34) exception Input_is_negative;;

Exceptions always succeed, and they succeed by OCaML printing their name. So,

we improve our deﬁnition as follows.

(35) let rec expt n x =

if x < 0 then raise Input_is_negative else

if x = 0 then 1 else

n * (expt n (x-1))

;;

Now look at the following dialog:

(36) # expt 2 (-3);;

Exception: Input_is_negative

Thus, OCaML quits the computation and raises the exception that you have de-

ﬁned (there are also inbuilt exceptions). In this way you can make every deﬁnition

total.

Recursive deﬁnitions of functions on strings or lists are automatically grounded

if you include a basic clause for the value on the empty string or the empty list,

respectively. It is a common source of error to omit such a clause. Beware! How-

ever, recursion can be done in many diﬀerent ways. For strings, for example, it

does not need to be character by character. You can deﬁne a function that takes

away two characters at a time, or even more. However, you do need to make sure

that the recursion ends somewhere. Otherwise you will not get a reply.
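For instance, here are two sketches of properly grounded recursive definitions, one on lists and one on strings (both function names are ours):

```ocaml
(* recursion on a list, grounded by a clause for the empty list *)
let rec sum l = match l with
    [] -> 0
  | h :: t -> h + sum t

(* recursion on a string, grounded by a clause for the empty string;
   this one takes one character at a time, but it could take more *)
let rec count_a s =
  if s = "" then 0
  else (if s.[0] = 'a' then 1 else 0)
       + count_a (String.sub s 1 (String.length s - 1))
```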

5 Modules

Modules are basically convenient ways to package the programs. When the pro-

gram gets larger and larger, it is sometimes diﬃcult to keep track of all the names

and identiﬁers. Moreover, one may want to identify parts of the programs that


can be used elsewhere. The way to do this is to use a module. The syntax is as

follows:

(37) module name-of-module =

struct

any sequence of deﬁnitions

end;;
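A concrete sketch of this syntax; the module name and its contents are invented for illustration:

```ocaml
module Seminar = struct
  let speakers = ["Alice"; "Bob"; "Carol"]

  (* picks the next speaker from a list of names *)
  let next_speaker l = match l with
      [] -> "nobody"
    | h :: _ -> h
end

(* outside the module, names are qualified by the module name *)
let who = Seminar.next_speaker Seminar.speakers
```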

For example, suppose we have deﬁned a module called Seminar. It contains

deﬁnitions of objects and functions and whatever we like, for example a function

next_speaker. In order to use this function outside of the module, we have to

write Seminar.next_speaker. It is useful at this point to draw attention to the

naming conventions (see the reference manual). Some names must start with a

lower–case letter (names of functions, objects and so on), while others must start

with an uppercase–letter. The latter are names of exceptions (see below), and

names of modules. So, the module may not be called seminar. In the manual

you ﬁnd a long list of modules that you can use, for example the module List.

You may look at the source code of this module by looking up where the libraries

are stored. On my PC they sit in

usr/local/lib/ocaml

There you ﬁnd a ﬁle called list.ml (yes, the ﬁle is named with a lower case

letter!) and a ﬁle list.mli. The ﬁrst is the plain source code for the list module.

You can look at it and learn how to code the functions. The ﬁle list.mli contains

the type interface. More on that later. The ﬁrst function that you see deﬁned is

length. This function you can apply to a list and get its length. The way you call

it is however by List.length, since it sits inside the module List. However, you

will search in vain for an explicit deﬁnition for such a module. Instead OCaML

creates this module from the ﬁle list.ml. If you compile this ﬁle separately, its

content will be available for you in the form of a module whose name is obtained

from the ﬁlename by raising the initial letter to upper case (whence the ﬁle name

list.ml as opposed to the module name List).

Let's look at the code. The way this is done is interesting.

(38) let rec length_aux len = function
    [] -> len
  | a::l -> length_aux (len + 1) l

let length l = length_aux 0 l

First, the function is deﬁned via an auxiliary function. You expect therefore to

be able to use a function List.length_aux. But you cannot. The answer to

the puzzle is provided by the ﬁle list.mli. Apart from the commentary (which

OCaML ignores anyway) it only provides a list of functions and their types. This

is the type interface. It tells OCaML which of the functions deﬁned in list.ml

are public. By default all functions are public (for example, if you did not make

a type interface ﬁle). If a type interface exists, only the listed functions can be

used outside of the module. The second surprise is the deﬁnition of the function

length_aux. It is deﬁned with only one argument, but on one occasion used with two! The

answer is that a function deﬁnition of the form let f x = speciﬁes that x is

the ﬁrst argument of f , but the resulting type of the function may be a function

and so take further arguments. Thus, unless you have to, you need not mention

all arguments. However, dropping arguments must proceed from the rightmost

boundary. For example, suppose we have deﬁned the function exp x y, which

calculates x^y. The following function will yield [5; 25; 125] for the list [1;2;3]:

(39) let lexp l = List.map (exp 5) l

You may even deﬁne this function as follows:

(40) let lexp = List.map (exp 5)

However, if you want to deﬁne a function that raises every member of the list to

the ﬁfth power, this requires more thought. For then we must abstract from the

innermost argument position. Here is a solution:

(41) let rexp = List.map (fun y -> exp y 5)

A more abstract version is this:

let switch f x y = f y x;;      (42)
let rexp = List.map ((switch exp) 5)

The function switch switches the order of the arguments (of any function, in fact).
The function so deﬁned can be applied in the same way as the original, with
arguments now in reverse order.
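To see that the two styles of abstraction really coincide, here is a small self-contained sketch; the power function exp is my own stand-in, since the text only assumes it exists:

```ocaml
(* a stand-in for the assumed exp x y, computing x to the power y *)
let rec exp x y = if y = 0 then 1 else x * exp x (y - 1)

(* abstracting the rightmost argument: powers of 5 *)
let lexp l = List.map (exp 5) l

(* abstracting the innermost (leftmost) argument: fifth powers *)
let switch f x y = f y x
let rexp = List.map (fun y -> exp y 5)
let rexp' = List.map ((switch exp) 5)

let () =
  assert (lexp [1; 2; 3] = [5; 25; 125]);
  assert (rexp [1; 2; 3] = [1; 32; 243]);
  assert (rexp' [1; 2; 3] = [1; 32; 243])
```

Both rexp and rexp' compute the same function; the second merely packages the argument swap as a reusable combinator.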


I have said above that a module also has a signature. You may explicitly deﬁne

signatures in the following way.

module type name-of-signature =      (43)
  sig
    any sequence of type deﬁnitions
  end;;

The way you tell OCaML that this is not a deﬁnition of a module is by saying

module type. This deﬁnition is abstract, just telling OCaML what kind of func-

tions and objects reside in a module of this type. The statements that you may put

there are type and val. The ﬁrst declares a ground type (which can actually also
be left abstract!), and the second the type of a value, such as a function. Both can be empty. If

you look into list.mli you will ﬁnd that it declares no types, only functions. To

see a complex example of types and type deﬁnitions, take a look at set.mli.
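As a smaller illustration than set.mli, here is a hypothetical signature of my own (the names COUNTER and Counter are not from the distribution) with one abstract type declaration and a few val declarations:

```ocaml
(* a signature: an abstract type t and three values of stated types *)
module type COUNTER = sig
  type t                  (* abstract: clients never see the representation *)
  val zero : t
  val next : t -> t
  val value : t -> int
end

(* a module matching the signature; outside, the representation int is hidden *)
module Counter : COUNTER = struct
  type t = int
  let zero = 0
  let next x = x + 1
  let value x = x
end

let () =
  assert (Counter.value (Counter.next (Counter.next Counter.zero)) = 2)
```

Because t is abstract in the signature, only the listed functions can produce or consume values of type Counter.t, exactly as the type interface list.mli restricts access to list.ml.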

OCaML is partitioned into three parts: the core language, the basic language,

and extensions. Everything that does not belong to the core language is added in

the form of a module. The manual lists several dozens of such modules. When

you invoke OCaML it will automatically make available the core and everything

in the basic language. Since you are always free to add more modules, the manual

does not even attempt to list all modules; instead, it lists the more common ones

(that come with the distribution).

6 Sets and Functors

Perhaps the most interesting type constructor is that of sets. In type theory we can
simply say that, as for lists, there is a unary constructor s such that

(44) M_{s(α)} = ℘(M_α)

However, this is not the way it can be done in OCaML. The reason is that OCaML

needs to be told how the elements of a set are ordered. For OCaML will internally

store a set as a binary branching tree so that it can search for elements eﬃciently.

To see why this is necessary, look at the way we write sets. We write sets linearly.

This, however, means that we pay a price. We have to write, for example, {∅, {∅}}.

The two occurrences of ∅ are two occurrences of the same set, but you have to


write it down twice since the set is written out linearly. In the same way, OCaML

stores sets in a particular way, here in form of a binary branching tree. Next,

OCaML demands from you that you order the elements linearly, in advance. You

can order them in any way you please, but given two distinct elements a and b,

either a < b or b < a must hold. This is needed to access the set, to deﬁne set

union, and so on. The best way to think of a set is as a list of objects ordered

in a strictly ascending sequence. If you want to access an element, you can say:

take the least of the elements. This picks out an element. And it picks out exactly

one. The latter is important because OCaML operates deterministically. Every

operation you deﬁne must be total and deterministic.

If elements must always be ordered—how can we arrange the ordering? Here

is how.

module OPStrings =      (45)
  struct
    type t = string * string
    let compare x y =
      if x = y then 0
      else if fst x > fst y ||
        (fst x = fst y && snd x > snd y) then 1
      else -1
  end;;

(The indentation is just for aesthetic purposes and not necessary.) This is what

OCaML answers:

module OPStrings :      (46)
  sig
    type t = string * string
    val compare : 'a * 'b -> 'a * 'b -> int
  end

It says that there is now a module called OPStrings with the following signature:

there is a type t and a function compare, and their types inside the signature are

given. A signature, by the way, is a set of functions together with their types. The

signature is given inside sig end.

Now what is this program doing for us? It deﬁnes a module, which is a com-

plete unit that has its own name. We have given it the name OPStrings. The


deﬁnition is given enclosed in struct . . . end;;. First, we say what entities we
make the members from. Here they are pairs of strings. Second, we deﬁne the
function compare. It must have integer values; its value is 0 if the two arguments
are considered equal, 1 if the left argument follows the right argument in the
order, and -1 otherwise.

To deﬁne the predicate compare we make use of two things: ﬁrst, strings can be

compared; there is an ordering predeﬁned on strings. Then we use the projection

functions to access the ﬁrst and the second component.
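A quick check of how this lexicographic comparison behaves on a few pairs (the name cmp is mine, standing in for OPStrings.compare):

```ocaml
(* the same lexicographic comparison as OPStrings.compare above *)
let cmp x y =
  if x = y then 0
  else if fst x > fst y || (fst x = fst y && snd x > snd y) then 1
  else -1

let () =
  assert (cmp ("a", "b") ("a", "b") = 0);    (* equal pairs *)
  assert (cmp ("b", "a") ("a", "z") = 1);    (* first components decide *)
  assert (cmp ("a", "a") ("a", "b") = -1)    (* tie broken by second component *)
```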

Finally, to make sets of the desired kind, we issue

module PStringSet = Set.Make(OPStrings);; (47)

To this, OCaML answers with a long list. This is because Make is a functor. A

functor is a function that makes new modules from existing modules. They are

thus more abstract than modules. The functor Make, deﬁned inside the module

Set, allows you to create sets over entities of any type. It imports into PStringSet

all set functions that it has predeﬁned in the module Set, for example Set.add,

which adds an element to a set, or Set.empty, which denotes the empty set. How-
ever, the functions are now preﬁxed with PStringSet. For example, the empty set is

called PStringSet.empty. Now, type

# let st = PStringSet.empty;;      (48)

and this binds st to the empty set. If you want to have the set {(cat, mouse)}

you can issue

# PStringSet.add ("cat", "mouse") st;; (49)

Alternatively, you can say

# PStringSet.add ("cat", "mouse") PStringSet.empty;; (50)

However, note what happens if you type

# let st = PStringSet.add ("cat", "mouse") st;;      (51)

This does not modify the set bound to st: OCaML has no dynamic assignment as
in imperative languages. The new let simply creates a fresh binding for st that
shadows the old one. How

dynamic assignments can be implemented will be explained later. We only men-

tion the following. The type of sets is PStringSet.t. It is abstract. If you want


to actually see the set, you have to tell OCaML how to show it to you. One way

of doing that is to convert the set into a list, using the function elements.
Since OCaML has a predeﬁned way of communicating lists (which we explained
above), you can then look at the elements without trouble. However, what you
are looking at are the members of a list, not of the set from which the list was
compiled. This can be a source of mistakes in

programming. Now if you type, say,

# let h = PStringSet.elements st;;      (52)

OCaML will give you the elements as a list. It is important to realize that the pro-
gram has no idea how to make itself understood to you if you ask it for the value
of an object of a newly deﬁned type. You have to tell it how you want it to show
you.
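Putting the pieces together, here is a complete sketch of the construction discussed above, run as a single program:

```ocaml
module OPStrings = struct
  type t = string * string
  let compare x y =
    if x = y then 0
    else if fst x > fst y || (fst x = fst y && snd x > snd y) then 1
    else -1
end

(* the functor Set.Make turns the ordered type into a module of sets *)
module PStringSet = Set.Make(OPStrings)

let () =
  let st = PStringSet.add ("cat", "mouse") PStringSet.empty in
  (* elements converts the abstract set into a list we can inspect *)
  assert (PStringSet.elements st = [("cat", "mouse")]);
  assert (PStringSet.cardinal st = 1)
```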

Also, once sets are deﬁned, a comparison predicate is available. That is to say,

the sets are also ordered linearly by PStringSet.compare. This is useful, for it

makes it easy to deﬁne sets of sets. Notice that PStringSet.compare takes its

arguments to the right. The argument immediately to its right is the one that shows

up to the right in inﬁx notation. So, PStringSet.compare f g is the same as g

< f in normal notation. Beware!

7 Hash Tables

This section explains some basics about hash tables. Suppose there is a func-
tion which is based on a ﬁnite look-up table (so there are ﬁnitely many possible
inputs) and you want to compute this function as fast as possible. This is the
moment when you want to consider using hash tables. They are an imple-
mentation of a fast look-up procedure. You make the hash table using magical
incantations similar to those for sets. First you need to declare from what kind
of objects to what kind of objects the function works. Also, you sometimes
(but not always) need to supply a function that assigns every input value a
number (called a key). So, you deﬁne ﬁrst a module of inputs, after which you
issue Hashtbl.Make. This looks as follows.

module HashedTrans =      (53)
  struct
    type t = int * char
    let equal x y = (fst x = fst y) && (snd x = snd y)
    let hash x = (256 * (fst x)) + (int_of_char (snd x))
  end;;

module HTrans = Hashtbl.Make(HashedTrans);;
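To show what the resulting module gives you, here is a hypothetical use of HTrans as the transition table of a ﬁnite automaton (mapping a state–character pair to a successor state); the states and transitions are invented for illustration:

```ocaml
module HashedTrans = struct
  type t = int * char
  let equal x y = (fst x = fst y) && (snd x = snd y)
  let hash x = (256 * (fst x)) + (int_of_char (snd x))
end

module HTrans = Hashtbl.Make(HashedTrans)

let () =
  (* a table with initial capacity 16; it grows as needed *)
  let tbl = HTrans.create 16 in
  HTrans.add tbl (0, 'a') 1;   (* in state 0, reading 'a' leads to state 1 *)
  HTrans.add tbl (1, 'b') 2;
  assert (HTrans.find tbl (0, 'a') = 1);
  assert (HTrans.mem tbl (1, 'b'))
```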

8 Combinators

Combinators are functional expressions that use nothing but application of a func-

tion to another. They can be deﬁned without any symbols except brackets. Func-

tions can only be applied to one argument at a time. If f is a function and x some

object, then f x or ( f x) denotes the result of applying f to x. f x might be another

function, which can be applied to something else. Then ( f x)y is the result of ap-

plying it to y. Brackets may be dropped in this case. Bracketing is left associative

in general. This is actually the convention that is used in OCaML.

The most common combinators are I, K and S. They are deﬁned as follows.

Ix := x      (54)
Kxy := x      (55)
Sxyz := xz(yz)      (56)

This can be programmed straightforwardly as follows. (I give you a transcript of

a session with comments by OCaML.)

# let i x = x;;      (57)
val i : 'a -> 'a = <fun>
# let k x y = x;;
val k : 'a -> 'b -> 'a = <fun>
# let s x y z = x z (y z);;
val s : ('a -> 'b -> 'c) -> ('a -> 'b) -> 'a -> 'c
  = <fun>

Let us look carefully at what OCaML says. First, notice that it types every combi-

nator right away, using the symbols ’a, ’b and so on. This is because as such they


can be used on any input. Moreover, OCaML returns the type using the right (!)

associative bracketing convention. It drops as many brackets as it possibly can.

Using maximal bracketing, the last type is

(58) (('a -> ('b -> 'c)) -> (('a -> 'b) -> ('a -> 'c)))

Notice also that the type of the combinator is not entirely trivial. The arguments x,
y and z must be of type 'a -> ('b -> 'c), 'a -> 'b and 'a, respectively.
Otherwise, the right hand side is not well–deﬁned. You may check, however, that
with the type assignment given everything works well.

A few words on notation. Everything that you deﬁne and is not a concrete

thing (like a string or a digit sequence) is written as a plain string, and the string is

called an identiﬁer. It identiﬁes the object. Identiﬁers must always be unique. If

you deﬁne an identiﬁer twice, only the last assignment will be used. Keeping iden-

tiﬁers unique can be diﬃcult, so there are ways to ensure that one does not make

mistakes. There are certain conventions on identiﬁers. First, identiﬁers for func-
tions and variables must begin with a lower case letter. So, if you try to deﬁne
combinators named I, K or S, OCaML will complain about a syntax error. Second,

if you have a sequence of identiﬁers, you must always type a space between them

unless there is a symbol that counts as a separator (like brackets), otherwise the

whole sequence will be read as a single identiﬁer. Evidently, separators cannot be

part of legal identiﬁers. For example, we typically write Sxyz = xz(yz). But to

communicate this to OCaML you must insert spaces between the variable identi-

ﬁers except where brackets intervene. It is evident that an identiﬁer can contain
neither blanks nor brackets, otherwise the input is misread. (Read Pages 89–90 on the is-

sue of naming conventions. It incidentally tells you that comments are enclosed

in (* and *), with no intervening blanks between these two characters.) Finally,

OCaML knows which is function and which is argument variable because the ﬁrst

identiﬁer after let is interpreted as the name to be bound, and the sequence of
identiﬁers up to the equals sign is interpreted as the argument variables.

Now, having introduced these combinators, we can apply them:

# k i ’a’ ’b’;; (59)

− : char = ’b’

This is because KIa = I and Ib = b. On the other hand, you may verify that

# k (i ’a’) ’b’;; (60)

− : char = ’a’


You can of course now deﬁne

# let p x y = k i x y;; (61)

val p : 'a -> 'b -> 'b = <fun>

and this deﬁnes the function pxy = y. Notice that you cannot ask OCaML to

evaluate a function abstractly. This is because it does not know that when you ask

k i x y the letters x and y are variables. There are no free variables!
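The session above can be collected into a small program and checked on concrete values, as a sketch:

```ocaml
let i x = x
let k x y = x
let s x y z = x z (y z)

(* p x y = k i x y = (k i x) y = i y = y *)
let p x y = k i x y

let () =
  assert (k i 'a' 'b' = 'b');    (* K I 'a' = I, and I 'b' = 'b' *)
  assert (k (i 'a') 'b' = 'a');  (* here K's first argument is i 'a', i.e. 'a' *)
  assert (p 1 2 = 2)
```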

So, you can apply combinators to concrete values. You cannot calculate ab-

stractly using the inbuilt functions of OCaML. One thing you can do, however, is

check whether an expression is typable. For example, the combinator SII has no

type. This is because no matter how we assign the primitive types, the function is

not well–deﬁned. OCaML has this to say:

# s i i;; (62)

This expression has type ('a -> 'b) -> 'a -> 'b but
is used here with type ('a -> 'b) -> 'a

How does it get there? It matches the deﬁnition of s and its ﬁrst argument i.

Since its ﬁrst argument must be of type ’a -> ’b -> ’c, this can only be the

case if ’a = ’b -> ’c. Applying the function yields an expression of type (’a

-> ’b) -> ’a -> ’c. Hence, the next argument must have type (’b -> ’c)

-> ’b. This is what OCaML expects to ﬁnd. It communicates this using ’a

and ’b in place of ’b and ’c, since these are variables and can be renamed at

will. (Yes, OCaML does use variables, but these are variables over types, not over

actual objects.) However, if this is the case, and the argument is actually i, then

the argument actually has type (’b -> ’c) -> ’b -> ’c. Once again, since

OCaML did some renaming, it informs us that the argument has type (’a ->

’b) -> ’a -> ’b. The types do not match, so OCaML rejects the input as not

well formed.

Now, even though OCaML does not have free variables, it does have variables.

We have met them already. Every let–clause deﬁnes some function, but this

deﬁnition involves saying what it does on its arguments. We have deﬁned i, k,

and s in this way. The identiﬁers x, y and z that we have used there are unbound

outside of the let–clause. Now notice that upon the deﬁnition of k, OCaML

issued its type as ’a -> ’b -> ’a. This means that it is a function from objects


of type ’a to objects of ’b -> ’a. It is legitimate to apply it to just one object:

# k "cat";; (63)

− : ’_a −> string = <fun>

This means that if you give it another argument, the result is a string (because the

result is "cat"). You can check this by assigning an identiﬁer to the function and

applying it:

# let katz = k "cat";; (64)

val katz : ’_a −> string = <fun>

# katz "mouse";;

- : string = "cat"

Eﬀectively, this means that you deﬁne functions by abstraction; in fact, this is

the way in which they are deﬁned. However, the way in which you present the

arguments may be important. (However, you can deﬁne the order of the arguments

in any way you want. Once you have made a choice, you must strictly obey that

order, though.)

9 Objects and Methods

The role of variables is played in OCaML by objects. Objects are abstract data

structures: they can be passed around and manipulated. However, OCaML has to
be told every detail about an object, including how it is accessed, how the identity
of that object is communicated to you, and how objects can be manipulated. Here
is a program that deﬁnes a natural number:

class number =      (65)
  object
    val mutable x = 0
    method get = x
    method succ = x <- x + 1
    method assign d = x <- d
  end;;


This does the following: it deﬁnes a data type number. The object of this type

can be manipulated by three so–called methods: get, which gives you the value,

succ which adds 1 to the value, and assign which allows you to assign any

number you want to it. But it must be an integer; you may want to ﬁnd out how

OCaML knows that x must be an integer. For this is what it will say:

class number :      (66)
  object
    val mutable x : int
    method assign : int -> unit
    method get : int
    method succ : unit
  end

Notice that it uses the type unit. This is the type whose values carry no informa-
tion. You may think of it as the type of actions. For succ asks OCaML to add
the number 1 to the number.

Once we have deﬁned the class, we can have as many objects of that class as

we want. Look at the following dialog.

# let m = new number;; (67)

val m : number = <obj>

# m#get;;

- : int = 0

The ﬁrst line binds the identiﬁer m to an object of class number. OCaML repeats

this. The third line asks for the value of m. Notice the syntax: the value of m is

not simply m itself. The latter is an object (it can have several parameters in it).

So, if you want to see the number you have just deﬁned, you need to type m#get.

The reason why this works is that we have deﬁned the method get for that class.

If you deﬁne another object, say pi, then its value is pi#get. And the result of
this method is the number stored in that object. Notice that the third line of the
class deﬁnition (65) asserts that the object contains a value x which can be changed
(it is declared mutable) and which is set to 0 upon the creation of the object.

If you deﬁne another object q and issue, for example q#succ and then ask for


the value you get 1:

# let q = new number;; (68)

val q : number = <obj>

# q#succ;;

- : unit = ()

# q#get;;

- : int = 1
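The whole dialog can be replayed as a program; the extra assign call at the end is my own addition:

```ocaml
class number = object
  val mutable x = 0
  method get = x
  method succ = x <- x + 1
  method assign d = x <- d
end

let () =
  let q = new number in
  q#succ;                 (* the stored value is now 1 *)
  assert (q#get = 1);
  q#assign 41;            (* overwrite the stored value *)
  q#succ;
  assert (q#get = 42)
```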

Notice the following. If you have deﬁned an object which is a set, issuing method

get = x will not help you much in seeing what the current value is. You will have

to say, for example, method get = PStringSet.elements x if the object is

basically a set of pairs of strings as deﬁned above.

10 Characters, Strings and Regular Expressions

We are now ready to do some actual theory and implementation of linguistics.

We shall deal ﬁrst with strings and then with ﬁnite state automata. Before we can

talk about strings, some words are necessary on characters. Characters are drawn

from a special set, which is also referred to as the alphabet. In principle the alpha-

bet can be anything, but in actual implementations the alphabet is always ﬁxed.

OCaML, for example, is based on the character table of ISO Latin 1 (also referred

to as ISO 8859-1). It is included on the web page for you to look at. You may use

characters from this set only. In theory, any character can be used. However, there

arises a problem of communication with OCaML. There are a number of charac-

ters that do not show up on the screen, like carriage return. Other characters are

used as delimiters (such as the quotes). It is for this reason that one has to use

certain naming conventions. They are given on Page 90 – 91 of the manual. If

you write \ followed by 3 digits, this accesses the ASCII character named by that

sequence. OCaML has a module called Char that has a few useful functions. The

function code, for example, converts a character into its 3–digit code. Its inverse

is the function chr. Type Char.code ’L’;; and OCaML gives you 76. So, the

string "\076" refers to the character L. You can try out that function and see that

it actually does support the full character set of ISO Latin 1. Another issue is of

course your editor: in order to put in that character, you have to learn how your
editor lets you do this. (In vi, you type either Ctrl-V and then the numeric code
as deﬁned in ISO Latin 1 (this did not work when I tried it) or Ctrl-K and then
a two-keystroke code for the character (this did work for me). If you want to see
what the options are, type :digraphs or simply :dig in command mode and
you get the list of options.)
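The correspondence between characters and their codes can be checked directly; a small sketch:

```ocaml
let () =
  assert (Char.code 'L' = 76);   (* 'L' has code 76 *)
  assert (Char.chr 76 = 'L');    (* chr is the inverse of code *)
  assert ("\076" = "L")          (* \ followed by 3 digits names the character *)
```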

It is not easy to change the alphabet in dealing with computers. Until Uni-

code becomes standard, diﬀerent alphabets must be communicated using certain

combinations of ASCII–symbols. For example, HTML documents use the com-

bination ä for the symbol ¨a. As far as the computer is concerned this is

a sequence of characters. But some programs allow you to treat this as a single

character. Practically, computer programs dealing with natural language read the

input ﬁrst through a so–called tokenizer. The tokenizer does nothing but deter-

mine which character sequence is to be treated as a single symbol. Or, if you wish,

the tokenizer translates strings in the standard alphabet into strings of an arbitrary,

user deﬁned alphabet. The Xerox tools for ﬁnite state automata, for example, al-

low you to deﬁne any character sequence as a token. This means that if you have

a token of the form ab the sequence abc could be read as a string of length three,

a followed by b followed by c, or as a string of two tokens, ab followed by c.

Careful thought has to go into the choice of token strings.

Now we start with strings. OCaML has a few inbuilt string functions, and a

module called String, which contains a couple of useful functions to manipulate

strings. Mathematically, a string of length n is deﬁned as a function from the set

of numbers 0, 1, 2, . . . , n − 1 into the alphabet. The string "cat", for example, is

the function f such that f (0) = c, f (1) = a and f (2) = t. OCaML uses a similar

convention.

# let u = "cat";; (69)

val u : string = "cat"

# u.[1];;

- : char = 'a'

So, notice that OCaML starts counting with 0, as we generally do in this course.

The ﬁrst letter in u is addressed as u.[0], the second as u.[1], and so on. Notice

that technically n = 0 is admitted. This is a string that has no characters in it. It

is called the empty string. Here is where the naming conventions become really

useful. The empty string can be denoted by "". Without the quotes you would not

see anything. And OCaML would not either. We deﬁne the following operations

on strings:


Deﬁnition 2 Let u be a string. The length of u is denoted by |u|. Suppose that
|u| = m and |v| = n. Then u⌢v is a string of length m + n, which is deﬁned as
follows.

(70) (u⌢v)(j) = u(j) if j < m, and v(j − m) otherwise.

u is a preﬁx of v if there is a w such that v = u⌢w, and a postﬁx if there is a w such
that v = w⌢u. u is a substring of v if there are w and x such that v = w⌢u⌢x.

OCaML has a function String.length that returns the length of a given string.

For example, String.length "cat" will give 3. Notice that by our conventions,

you cannot access the symbol with number 3. Look at the following dialog.

# "cat".[2];; (71)

- : char = ’t’

# "cat".[String.length "cat"];;

Exception: Invalid_argument "String.get".

The last symbol of the string has the number 2, but the string has length 3. If

you try to access an element that does not exist, OCaML raises the exception

Invalid_argument "String.get".

In OCaML, ^ is an inﬁx operator for string concatenation. So, if we write

"tom"^"cat" OCaML returns "tomcat". Here is a useful abbreviation: x^n de-
notes the string obtained by repeating x n times. This can be deﬁned recursively
as follows.

x^0 := ε      (72)
x^(n+1) := x^n⌢x      (73)

So, (vux)^3 = vuxvuxvux.
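The recursive deﬁnition of x^n translates directly into OCaml; the function name rep is mine:

```ocaml
(* rep x n computes x^n: the empty string for n = 0, else x^(n-1) followed by x *)
let rec rep x n = if n = 0 then "" else rep x (n - 1) ^ x

let () =
  assert (rep "vux" 3 = "vuxvuxvux");
  assert (rep "cat" 0 = "");
  assert ("tom" ^ "cat" = "tomcat")
```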

A language over A is a set of strings over A. Here are a few useful operations

on languages.

L M := {xy : x ∈ L, y ∈ M}      (74a)
L/M := {x : there exists y ∈ M such that xy ∈ L}      (74b)
M\L := {x : there exists y ∈ M such that yx ∈ L}      (74c)


The ﬁrst deﬁnes the language that contains all concatenations of elements from

the ﬁrst, and elements from the second. This construction is also used in ‘switch-

board’ constructions. For example, deﬁne

(75) L := {John, my neighbour, . . . }

the set of noun phrases, and

(76) M := {sings, rescued the queen, . . . }

then L {␣} M (␣ denotes the blank here) is the following set:

{John sings, John rescued the queen,      (77)
my neighbour sings, my neighbour rescued the queen, . . . }

Notice that (77) is actually not the same as L M. Secondly, notice that there is no

period at the end, and no uppercase for my. Thus, although we do get what looks

like sentences, some orthographic conventions are not respected.

The construction L/M denotes the set of strings that will be L–strings if some
M–string is appended. For example, let L = {chairs, cars, cats, . . . } and M =
{s}. Then

(78) L/M = {chair, car, cat, . . . }

Notice that if L contains a word that does not end in s, then no suﬃx from M
can be removed from it, so it contributes nothing. Hence

(79) {milk, papers}/M = {paper}

In the sequel we shall study ﬁrst regular languages and then context free lan-

guages. Every regular language is context free, but there are also languages that

are neither regular nor context free. These languages will not be studied here.

A regular expression is built from letters using the following additional sym-
bols: 0, ε, ·, ∪, and ∗. 0 and ε are constants, as are the letters of the alphabet.
· and ∪ are binary operations, ∗ is unary. The language that they specify is given
as follows:

L(0) := ∅      (80)
L(ε) := {ε}      (81)
L(s·t) := L(s)·L(t)      (82)
L(s ∪ t) := L(s) ∪ L(t)      (83)
L(s∗) := {x_0 x_1 · · · x_{n−1} : x_i ∈ L(s) for all i < n, n ∈ N}      (84)


(N is the set of natural numbers. It contains all integers starting from 0.) The
symbol · is often omitted. Hence, st is the same as s·t, and cat = c·a·t. It is
easily seen that every set containing exactly one string is a regular language.
Hence, every ﬁnite set of strings is also regular. With more eﬀort one can show
that a set containing all but ﬁnitely many strings is regular, too.

Notice that whereas s is a term, L(s) is a language: the set of strings that
fall under the term s. In ordinary usage, these two are not distinguished, though.
We write a both for the string a and for the regular term whose language is {a}. It
follows that

L(a·(b ∪ c)) = {ab, ac}      (85)
L((cb)∗·a) = {a, cba, cbcba, . . . }      (86)

A couple of abbreviations are also used:

s? := ε ∪ s      (87)
s^+ := s·s∗      (88)
s^n := s·s· · · ·s (n–times)      (89)

We say that s ⊆ t if L(s) ⊆ L(t). This is the same as saying that L(s ∪ t) = L(t).

Regular expressions are used in a lot of applications. For example, if you are

searching for a particular string, say department in a long document, it may

actually appear in two shapes, department or Department. Also, if you are

searching for two words in a sequence, you may face the fact that they appear on

diﬀerent lines. This means that they are separated by any number of blanks and

an optional carriage return. In order not to lose any occurrence of that sort you

will want to write a regular expression that matches any of these occurrences. The

Unix command egrep allows you to search for strings that match a regular term.

The particular constructors look a little bit diﬀerent, but the underlying concepts

are the same.

Notice that there are regular expressions which are diﬀerent but denote the

same language. We have in general the following laws. (A note of caution. These

laws are not identities between the terms; the terms are distinct. These are identi-

ties concerning the languages that these terms denote.)

s·(t·u) = (s·t)·u      (90a)
s·ε = s    ε·s = s      (90b)
s·0 = 0    0·s = 0      (90c)
s ∪ (t ∪ u) = (s ∪ t) ∪ u      (90d)
s ∪ t = t ∪ s      (90e)
s ∪ s = s      (90f)
s ∪ 0 = s    0 ∪ s = s      (90g)
s·(t ∪ u) = (s·t) ∪ (s·u)      (90h)
(s ∪ t)·u = (s·u) ∪ (t·u)      (90i)
s∗ = ε ∪ s·s∗      (90j)

It is important to note that the regular terms appear as solutions of very simple
equations. For example, let the following equation be given. (· binds more
strongly than ∪.)

(91) X = a ∪ b·X

Here, the variable X denotes a language, that is, a set of strings. We ask: what set

X satisﬁes the equation above? It is a set X such that a is contained in it and if a

string v is in X, so is b⌢v. So, since it contains a, it contains ba, bba, bbba, and
so on. Hence, as you may easily verify,

(92) X = b∗·a

is the smallest solution to the equation. (There are many more solutions, for ex-
ample b∗·a ∪ b∗. The one given is the smallest, however.)

Theorem 3 Let s and t be regular terms. Then the smallest solution of the equa-
tion X = t ∪ s·X is s∗·t.

Now, s·t is the unique solution of the pair Y = t, X = s·Y; and s ∪ t is the solution of
X = Y ∪ Z, Y = s, Z = t. Using this, one can characterize regular languages as
minimal solutions of certain systems of equations.

Let X_i, i < n, be variables. A regular equation has the form

(93) X_i = s_0 X_0 ∪ s_1 X_1 ∪ . . . ∪ s_{n−1} X_{n−1}

(If s_i is 0, the term s_i X_i may be dropped.) The equation is simple if for all i < n,
s_i is either 0, ε, or a single letter. The procedure we have just outlined can be used
to convert any regular expression into a set of simple equations whose minimal
solution is that term. The converse is actually also true.


Theorem 4 Let E be a set of regular equations in the variables X_i, i < n, such
that each X_i occurs exactly once on the left of an equation. Then there are regular
terms s_i, i < n, such that the least solution to E is X_i = L(s_i), i < n.

Notice that the theorem does not require the equations of the set to be simple. The
proof is by induction on n. We suppose that it is true for a set of equations in n − 1
variables. Suppose now the ﬁrst equation has the form

(94) X_0 = s_0 X_0 ∪ s_1 X_1 ∪ . . . ∪ s_{n−1} X_{n−1}

Two cases arise. Case (1): s_0 = 0. Then the equation basically tells us what
X_0 is in terms of the X_i, 0 < i < n. We solve the set of remaining equations by
hypothesis, and we obtain regular solutions t_i for the X_i. Then we insert them:

(95) X_0 = s_1 t_1 ∪ s_2 t_2 ∪ . . . ∪ s_{n−1} t_{n−1}

This is a regular term for X_0. Case (2): s_0 ≠ 0. Then using Theorem 3 we can
solve the equation by

(96) X_0 = s_0∗ (s_1 X_1 ∪ . . . ∪ s_{n−1} X_{n−1})
         = s_0∗ s_1 X_1 ∪ s_0∗ s_2 X_2 ∪ . . . ∪ s_0∗ s_{n−1} X_{n−1}

and proceed as in Case (1).

The procedure is best explained by an example. Take a look at Table 1. It
shows a set of three equations, in the variables X_0, X_1 and X_2. (This is (I).) We
want to ﬁnd out what the minimal solution is. First, we solve the second equation
for X_1 with the help of Theorem 3 and get (II). The simpliﬁcation is that X_1 is now
deﬁned in terms of X_0 alone. The recursion has been eliminated. Next we insert
the solution we have for X_1 into the equation for X_2. This gives (III). Next, we
can also put into the ﬁrst equation the solution for X_2. This is (IV). Now we have
to solve the equation for X_0. Using Theorem 3 again we get (V). Now X_0 is given
explicitly. This solution is ﬁnally inserted into the equations for X_1 and X_2 to yield
(VI).

11 Interlude: Regular Expressions in OCaML

OCaML has a module called Str that allows you to handle regular expressions; see
Page 397. If you have started OCaML via the command ocaml you must make


Table 1: Solving a set of equations

(I)    X_0 = ε ∪ dX_2        X_1 = bX_0 ∪ dX_1    X_2 = cX_1
(II)   X_0 = ε ∪ dX_2        X_1 = d∗bX_0         X_2 = cX_1
(III)  X_0 = ε ∪ dX_2        X_1 = d∗bX_0         X_2 = cd∗bX_0
(IV)   X_0 = ε ∪ dcd∗bX_0    X_1 = d∗bX_0         X_2 = cd∗bX_0
(V)    X_0 = (dcd∗b)∗        X_1 = d∗bX_0         X_2 = cd∗bX_0
(VI)   X_0 = (dcd∗b)∗        X_1 = d∗b(dcd∗b)∗    X_2 = cd∗b(dcd∗b)∗


sure to bind the module into the program. This can be done interactively by

(97) # #load "str.cma";;

The module provides a type regexp and various functions that go between strings

and regular expressions. The module actually “outsources” the matching and uses

the matcher provided by the platform (POSIX in Unix). This has two immediate

eﬀects: ﬁrst, diﬀerent platforms implement diﬀerent matching algorithms and this

may result in diﬀerent results of the same program; second, the syntax of the

regular expressions depends on the syntax of regular expressions of the matcher

that the platform provides. Thus, if you enter a regular expression, there is a two

stage conversion process. The ﬁrst is that of the string that you give to OCaML

into a regular term. The conventions are described in the manual. The second

is the conversion from the regular term of OCaML into one that the platform

uses. This is why the manual does not tell you exactly what the exact syntax of

regular expressions is. In what is to follow, I shall describe the syntax of regular

expressions in Unix (they are based on Gnu Emacs).

A regular expression is issued to OCaML as a string. However, the string itself is not seen by OCaML as a regular expression; to get the regular expression you have to apply the function regexp (to do that, you must, of course, type Str.regexp). It is worthwhile to look at the way regular expressions are defined. First, the characters $^.*+?\[] are special and set aside. Every other symbol is treated as a constant for the character that it is. (We have used the same symbolic overloading above: the symbol a denotes not only the character a but also the regular expression whose associated language is {a}.) Now here are a few of the definitions.

• . matches any character except newline.
• * (postfix) translates into ∗.
• + (postfix) translates into ⁺.
• ? (postfix) translates into ?.
• ^ matches at the beginning of the line.
• $ matches at the end of the line.
• \( left (opening) bracket.
• \) right (closing) bracket.
• \| translates into ∪.

Thus, (a ∪ b)⁺d? translates into \(a\|b\)+d?. Notice that in OCaML there is also a way to quote a character by using \ followed by its 3-digit code, and this method can also be used in regular expressions. For example, the code of the symbol + is 43. Thus, if you are looking for a nonempty sequence of +, you have to give OCaML the following string: ###BOT_TEXT###43+. Notice that ordinarily, that is, in a string, the sequence ###BOT_TEXT###43 is treated in the same way as +. Here, however, we get a difference. The sequence ###BOT_TEXT###43 is treated as a character, while + is now treated as an operation symbol. It is clear that this syntax allows us to define any regular language over the entire set of characters, including the ones we have set aside.
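For instance, here is a minimal sketch of my own (not from the text) of feeding such an expression to Str. Note that inside an OCaml string literal each backslash of the regular expression must itself be doubled:

```ocaml
(* Sketch: compiling \(a\|b\)+d? and matching it at the start of a
   string. Backslashes are doubled inside the OCaml string literal. *)
let () =
  let re = Str.regexp "\\(a\\|b\\)+d?" in
  if Str.string_match re "abbad" 0 then
    (* matched_string recovers the substring that was matched *)
    print_endline (Str.matched_string "abbad")
```

On "abbad" the whole string is matched, since a greedy matcher consumes abba and then the optional d.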

There are several shortcuts. One big group concerns grouping together characters. To do this, one can use square brackets. The square brackets eliminate the need to use the disjunction sign. They also give additional power, as we shall explain now. The expression [abx09] denotes a single character matching any of the symbols occurring in the brackets. Additionally, we can use the hyphen to denote a range: [0-9] denotes the digits, [a-f] the lower case letters from 'a' to 'f', [A-F] the upper case letters from 'A' to 'F'; finally, [a-fA-F] denotes the lower and upper case letters from 'a' to 'f' ('A' to 'F'). If the opening bracket is immediately followed by a caret, the range is inverted: we now take anything that is not contained in the list. Notice that ordinarily the hyphen is a string literal, but when it occurs inside the square brackets it takes on a new meaning. If we do want the hyphen to be included in the character range, it has to be put at the beginning or the end. If we want the caret to be a literal it must be placed last.
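A small sketch of ranges in action (my own example, not from the text):

```ocaml
(* Sketch: [0-9a-fA-F]+ matches a nonempty run of hexadecimal digits.
   search_forward returns the position of the first such run. *)
let () =
  let hex = Str.regexp "[0-9a-fA-F]+" in
  let pos = Str.search_forward hex "x = 1A3f;" 0 in
  Printf.printf "%s at %d\n" (Str.matched_string "x = 1A3f;") pos
```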

Now that we know how to write regular expressions, let us turn to the question of how they can be used. By far the most frequent application of matching against a regular expression is not that of an exact match. Rather, one tries to find an occurrence of a string that falls under the regular expression in a large file. For example, if you are editing a file and you want to look for the word ownership in a case insensitive way, you will have to produce a regular expression that matches either ownership or Ownership. Similarly, looking for a word in German that may have ä or ae, ü next to ue and so on as alternate spellings may prompt the need to use regular expressions. The functions that OCaML provides are geared towards this application. For example, the function search_forward needs a regular expression as input, then a string and next an integer. The integer specifies


the position in the string where the search should start. The result is an integer, and it gives the initial position of the next string that falls under the regular expression. If none can be found, OCaML raises the exception Not_found. If you want to avoid getting the exception you can also use string_match beforehand to see whether or not there is any string of that sort. It is actually superfluous, however, to first check whether a match exists and then actually perform the search: this searches the same string twice. Rather, there are two useful functions, match_beginning and match_end, which return the initial (respectively final) position of the string that was found to match the regular expression. Notice that you have to say Str.match_beginning (), supplying an empty pair of parentheses.
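A minimal sketch of this workflow (the pattern and subject string are my own examples):

```ocaml
(* Sketch: find the first run of digits, then read off its extent.
   search_forward raises Not_found when there is no match. *)
let () =
  let re = Str.regexp "[0-9]+" in
  let start = Str.search_forward re "abc42def" 0 in
  Printf.printf "start=%d end=%d match=%s\n"
    start (Str.match_end ()) (Str.matched_string "abc42def")
```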

There is an interesting difference in the way matching can be done. The Unix matcher tries to match the longest possible substring against the regular expression (this is called greedy matching), while the Windows matcher looks for just enough of the string to get a match (lazy matching). Say we are looking for a* in the string bcagaa.

(98)

let ast = Str.regexp "a*";;

Str.string_match ast "bcagaa" 0;;

In both versions, the answer is "yes". Moreover, if we ask for the first and the last position, both versions will give 0 and 0. This is because there is no way to match more than zero a's at that point. Now try instead

(99) Str.string_match ast "bcagaa" 2;;

Here, the two algorithms yield different values. The greedy algorithm will say that it has found a match between 2 and 3, since it was able to match one a at that point. The lazy algorithm is content with the empty string: the empty string matches, so it does not look further. Notice that the greedy algorithm does not look for the longest occurrence in the entire string. Rather, it proceeds from the starting position and tries to match against the regular expression from the successive positions. If it cannot match at all, it proceeds to the next position. If it can match, however, it will take the longest possible string that matches.
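Example (99) can be replayed as follows (a sketch; the positions printed are the ones worked out above for the greedy matcher):

```ocaml
(* Sketch: greedy matching of a* at position 2 of "bcagaa" consumes
   the single a found there. *)
let () =
  let ast = Str.regexp "a*" in
  ignore (Str.string_match ast "bcagaa" 2);
  Printf.printf "from %d to %d\n" (Str.match_beginning ()) (Str.match_end ())
```

With the greedy Unix-style matcher this reports a match from 2 to 3.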

The greedy algorithm actually allows us to check whether a string falls under a regular expression. The way to do this is as follows.

(100)

let exact_match r s =

if Str.string_match r s 0

then ((Str.match_beginning () = 0)

&& (Str.match_end () = (String.length s)))

else false;;

The most popular application of regular expressions is actually string replacement. String replacements work in two stages. The first is the phase of matching. Given a regular expression s and a string x, the pattern matcher looks for a string that matches the expression s. It always looks for the first possible match. Let this string be y. The next phase is the actual replacement. This can be a simple replacement by a particular string. Very often, however, we want to use parts of the string y in the new expression. An example may help.

An IP address is a sequence of the form c.c.c.c, where c is a string representing a number from 0 to 255. Leading zeros are suppressed, but c may not be empty. Thus, we may have 192.168.0.1, 145.146.29.243, or 127.0.0.1. To define the legal sequences, define the following string:

(101) [01][0-9][0-9]\|2[0-4][0-9]\|25[0-4]

This string can be converted to a regular expression using Str.regexp. It matches exactly the sequences just discussed. Now look at the following:

(102)
let x = "[01][0-9][0-9]\|2[0-4][0-9]\|25[0-4]"
let m = Str.regexp ("\(" ^ x ^ "\." ^ x ^ "\)" ^ "\." ^ "\(" ^ x ^ "\." ^ x ^ "\)")

This expression matches IP addresses. The bracketing that I have introduced can be used for the replacement as follows. The matcher that is sent to find a string matching a regular expression takes note, for each bracketed expression, of where it matched in the string it is looking at. It will have a record that says, for example, that the first bracket matched from position 1230 to position 1235, and the second bracket from position 1237 to position 1244. For example, suppose your data is as follows:

(103) . . . 129.23.145.110 . . .


Suppose that the first character is at position 1230. Then the string 129.23 matches the first bracket and the string 145.110 the second bracket. These strings can be recalled using the function matched_group. It takes as input a number and the original string, and it returns the string of the nth matching bracket. So, if directly after the match on the string assigned to u we define

(104) let s = "The first half of the IP address is " ^ (Str.matched_group 1 u)

we get the following value for s:

(105) "The first half of the IP address is 129.23"

To use this in an automated string replacement procedure, there are the variables \0, \1, \2, ..., \9. After a successful match, \0 is assigned the entire matched string, \1 the first matched group, \2 the second matched group, and so on. A template is a string that in place of characters may also contain these variables (but nothing more). The function global_replace takes as input a regular expression and two strings. The first string is used as a template. Whenever a match is found, it uses the template to execute the replacement. For example, to cut the IP to its first half, we write the template "\1". If we want to replace the original IP address by its first part followed by .0.1, then we use "\1.0.1". If we want to replace the second part by the first, we use "\1.\1".
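As a small sketch of how templates drive global_replace (my own example; the pattern and strings are not from the text):

```ocaml
(* Sketch: the template "\2.\1" swaps the two bracketed groups of
   every match. Backslashes are doubled in the OCaml literals. *)
let () =
  let re = Str.regexp "\\([0-9]+\\)\\.\\([0-9]+\\)" in
  let s = Str.global_replace re "\\2.\\1" "12.34 and 56.78" in
  print_endline s
```

This prints 34.12 and 78.56: each pair of dot-separated numbers is matched in turn and rebuilt from the template.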

12 Finite State Automata

A finite state automaton is a quintuple

(106) A = (A, Q, i_0, F, δ)

where A, the alphabet, is a finite set; Q, the set of states, also is a finite set; i_0 ∈ Q is the initial state; F ⊆ Q is the set of final or accepting states; and, finally, δ ⊆ Q × A × Q is the transition relation. We write x →_a y if (x, a, y) ∈ δ.


We extend the notation to regular expressions as follows.

(107a) x →_0 y :⇔ false
(107b) x →_ε y :⇔ x = y
(107c) x →_(s∪t) y :⇔ x →_s y or x →_t y
(107d) x →_(st) y :⇔ there exists z: x →_s z and z →_t y
(107e) x →_(s*) y :⇔ there exists n: x →_(s^n) y

In this way, we can say that

(108) L(A) := {x ∈ A* : there is q ∈ F: i_0 →_x q}

and call this the language accepted by A. Now, an automaton is partial if for some q and a there is no q′ such that q →_a q′.

Here is part of the listing of the ﬁle fstate.ml.

(109)

class automaton =
  object
    val mutable i = 0
    val mutable x = StateSet.empty
    val mutable y = StateSet.empty
    val mutable z = CharSet.empty
    val mutable t = Transitions.empty
    method get_initial = i
    method get_states = x
    method get_astates = y
    method get_alph = z
    method get_transitions = t
    method initialize_alph = z <- CharSet.empty
    method list_transitions = Transitions.elements t
  end;;


There is a lot of repetition involved in this definition, so it is necessary only to look at part of it. First, notice that prior to this definition, the program contains definitions of sets of states, which are simply sets of integers. The type is called StateSet. Similarly, the type CharSet is defined. Also there is the type of transitions, which has three entries, first, symbol and second. They define the state prior to scanning the symbol, the symbol itself, and the state after scanning the symbol, respectively. Then a type of sets of transitions is defined. These are now used in the definition of the object type automaton. These things need to be formally introduced. The attribute mutable shows that their values can be changed. After the equals sign is a value that is set by default. You do not have to set the value, but to prevent errors it is wise to do so. You need a method even to get at the value of the objects involved, and a method to change or assign a value to them. The idea behind an object type is that the objects can be changed, and you can change them interactively. For example, after you have loaded the file fsa.ml by typing #use "fsa.ml";; you may issue

(110) # let a = new automaton;;

Then you have created an object identified by the letter a, which is a finite state automaton. The initial state is 0; the state set, the alphabet and the set of accepting states are empty, and so is the set of transitions. You can change any of the values using the appropriate method. For example, type

(111) # a#add_alph 'z';;

and you have added the letter 'z' to the alphabet. Similarly with all the other components. By way of illustration, let me show you how to add a transition.

(112) # a#add_transitions {first = 0; second = 2; symbol = 'z'};;

So, eﬀectively you can deﬁne and modify the object step by step.
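Since (109) is only part of the file, the mutator methods used in (111) and (112) are defined in the part not shown. A self-contained sketch of how the missing pieces might look (the module definitions and method bodies are my own guesses, not code from fsa.ml):

```ocaml
(* Hypothetical completion of the automaton object from (109). *)
module StateSet = Set.Make (struct type t = int let compare = compare end)
module CharSet = Set.Make (Char)

type transition = { first : int; symbol : char; second : int }

module Transitions = Set.Make (struct
  type t = transition
  let compare = compare
end)

class automaton =
  object
    val mutable i = 0
    val mutable x = StateSet.empty   (* states *)
    val mutable y = StateSet.empty   (* accepting states *)
    val mutable z = CharSet.empty    (* alphabet *)
    val mutable t = Transitions.empty
    method get_initial = i
    method get_states = x
    method get_astates = y
    method get_alph = z
    method list_transitions = Transitions.elements t
    method add_alph c = z <- CharSet.add c z
    method add_states q = x <- StateSet.add q x
    method add_astates q = y <- StateSet.add q y
    method add_transitions tr = t <- Transitions.add tr t
  end

let () =
  let a = new automaton in
  a#add_alph 'z';
  a#add_transitions { first = 0; second = 2; symbol = 'z' };
  Printf.printf "%d transition(s)\n" (List.length a#list_transitions)
```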

If A is not partial it is called total. It is an easy matter to make an automaton total. Just add a new state q∗ and add a transition (q, a, q∗) whenever there is no transition (q, a, q′) for any q′. Finally, add all transitions (q∗, a, q∗); q∗ is not accepting.

Theorem 5 Let A be a ﬁnite state automaton. Then L(A) is regular.

The idea is to convert the automaton into a set of equations. To this end, for every state q let T_q := {x : i_0 →_x q}. Then if (r, a, q) ∈ δ, we have T_q ⊇ T_r a. In fact, if q ≠ i_0 we have

(113) T_q = ⋃_((r,a,q)∈δ) T_r a

If q = i_0 we have

(114) T_(i_0) = ε ∪ ⋃_((r,a,i_0)∈δ) T_r a

This is a set of regular equations, and it has a solution t_q for every q, so that T_q = L(t_q). Now,

(115) L(A) = ⋃_(q∈F) T_q

which is regular.

On the other hand, for every regular term there is an automaton A that accepts

exactly the language of that term. We shall give an explicit construction of such

an automaton.

First, s = 0. Take an automaton with a single state, and put F = ∅. There is no accepting state, so no string is accepted. Next, s = ε. Take two states, 0 and 1. Let 0 be initial and accepting, and δ = {(0, a, 1), (1, a, 1) : a ∈ A}. For s = a, let Q = {0, 1, 2}, 0 initial, 1 accepting, and δ = {(0, a, 1)} ∪ {(1, b, 2) : b ∈ A} ∪ {(0, b, 2) : b ∈ A − {a}} ∪ {(2, b, 2) : b ∈ A}. Next st. Take an automaton A accepting s and an automaton B accepting t. The new states are the states of A and the states of B (considered distinct) with the initial state of B removed. For every transition (i_0, a, j) of B, remove that transition, and add (q, a, j) for every accepting state q of A. The accepting states are the accepting states of B. (If the initial state of B was accepting, throw in also the accepting states of A.) Now for s*. Make F ∪ {i_0} accepting. Add a transition (u, a, j) for every (i_0, a, j) ∈ δ such that u ∈ F. This automaton recognizes s*. Finally, s ∪ t. We construct the following automaton.

(116) A × B = (A, Q_A × Q_B, (i_0^A, i_0^B), F_A × F_B, δ_A × δ_B)

(117) δ_A × δ_B = {(x_0, x_1) →_a (y_0, y_1) : (x_0, a, y_0) ∈ δ_A, (x_1, a, y_1) ∈ δ_B}

Lemma 6 For every string u:

(118) (x_0, x_1) →_u (y_0, y_1) in A × B ⇔ x_0 →_u y_0 in A and x_1 →_u y_1 in B


The proof is by induction on the length of u. It is now easy to see that L(A × B) = L(A) ∩ L(B). For u ∈ L(A × B) if and only if (i_0^A, i_0^B) →_u (x_0, x_1) for some (x_0, x_1) ∈ F_A × F_B, which is equivalent to i_0^A →_u x_0 ∈ F_A and i_0^B →_u x_1 ∈ F_B. The latter is nothing but u ∈ L(A) and u ∈ L(B).

Now, assume that A and B are total. Define the set G := F_A × Q_B ∪ Q_A × F_B. Make G the accepting set. Then the language accepted by this automaton is L(A) ∪ L(B) = s ∪ t. For suppose u is such that (i_0^A, i_0^B) →_u q ∈ G. Then either q = (q_0, y) with q_0 ∈ F_A, and then u ∈ L(A), or q = (x, q_1) with q_1 ∈ F_B, and then u ∈ L(B). The converse is also easy.

Here is another method.

Definition 7 Let L ⊆ A* be a language. Then put

(119) Lᵖ := {x : there is y such that xy ∈ L}

Let s be a regular term. We define the prefix closure as follows.

(120a) 0† := ∅
(120b) a† := {ε, a}
(120c) ε† := {ε}
(120d) (s ∪ t)† := s† ∪ t†
(120e) (st)† := {su : u ∈ t†} ∪ s†
(120f) (s*)† := s† ∪ {s*u : u ∈ s†}
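The definition can be transcribed almost literally into code. Here is a sketch on a hypothetical datatype of regular terms (the type and constructor names are my own; sets of terms are represented naively as lists):

```ocaml
(* Sketch: the prefix closure s† of (120a)-(120f). *)
type rt =
  | Zero
  | Eps
  | Chr of char
  | Union of rt * rt
  | Conc of rt * rt
  | Star of rt

let rec pclose (s : rt) : rt list =
  match s with
  | Zero -> []                                   (* 0† = ∅ *)
  | Eps -> [Eps]                                 (* ε† = {ε} *)
  | Chr _ -> [Eps; s]                            (* a† = {ε, a} *)
  | Union (s1, s2) -> pclose s1 @ pclose s2      (* s† ∪ t† *)
  | Conc (s1, s2) ->                             (* {su : u ∈ t†} ∪ s† *)
      List.map (fun u -> Conc (s1, u)) (pclose s2) @ pclose s1
  | Star s1 ->                                   (* s† ∪ {s*u : u ∈ s†} *)
      pclose s1 @ List.map (fun u -> Conc (s, u)) (pclose s1)
```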

Notice the following.

Lemma 8 x is a prefix of some string from s iff x ∈ L(t) for some t ∈ s†. Hence,

(121) L(s)ᵖ = ⋃_(t∈s†) L(t)

Proof. We show that L(s)ᵖ is the union of all L(t) where t ∈ s†. The proof is by induction on s. The cases s = ∅ and s = ε are actually easy. Next, let s = a where a ∈ A. Then t = a or t = ε and the claim is verified directly.


Next, assume that s = s_1 ∪ s_2 and let t ∈ s†. Either t ∈ s_1† or t ∈ s_2†. In the first case, L(t) ⊆ L(s_1)ᵖ by inductive hypothesis, and so L(t) ⊆ L(s)ᵖ, since L(s) ⊇ L(s_1) and so L(s)ᵖ ⊇ L(s_1)ᵖ. Analogously for the second case. Now, if x ∈ L(s)ᵖ then either x ∈ L(s_1)ᵖ or x ∈ L(s_2)ᵖ, and hence by inductive hypothesis x ∈ L(u) for some u ∈ s_1† or some u ∈ s_2†, so u ∈ s†, as had to be shown.

Next let s = s_1 s_2 and t ∈ s†. Case 1. t ∈ s_1†. Then by inductive hypothesis, L(t) ⊆ L(s_1)ᵖ, and since L(s_1)ᵖ ⊆ L(s)ᵖ (can you see this?), we get the desired conclusion. Case 2. t = s_1 u with u ∈ s_2†. Now, a string in L(t) is of the form xy, where x ∈ L(s_1) and y ∈ L(u) ⊆ L(s_2)ᵖ, by induction hypothesis. So, xy ∈ L(s_1)L(s_2)ᵖ ⊆ L(s)ᵖ, as had to be shown. Conversely, if a string is in L(s)ᵖ then either it is of the form x ∈ L(s_1)ᵖ or of the form yz, with y ∈ L(s_1) and z ∈ L(s_2)ᵖ. In the first case there is a u ∈ s_1† such that x ∈ L(u); in the second case there is a u ∈ s_2† such that z ∈ L(u) and y ∈ L(s_1). In the first case u ∈ s†; in the second case, yz ∈ L(s_1 u) and s_1 u ∈ s†. This shows the claim.

Finally, let s = s_1*. Then either t ∈ s_1† or t = s_1*u where u ∈ s_1†. In the first case we have L(t) ⊆ L(s_1)ᵖ ⊆ L(s)ᵖ, by inductive hypothesis. In the second case we have L(t) ⊆ L(s_1*)L(u) ⊆ L(s_1*)L(s_1)ᵖ ⊆ L(s)ᵖ. This finishes one direction. Now, suppose that x ∈ L(s)ᵖ. Then there are y_i, i < n, and z such that x = y_0 y_1 · · · y_(n−1) z, y_i ∈ L(s_1) for all i < n, and z ∈ L(s_1)ᵖ. By inductive hypothesis there is a u ∈ s_1† such that z ∈ L(u). Also y_0 y_1 · · · y_(n−1) ∈ L(s_1*), so x ∈ L(s_1*u), and this regular term is in s†.

We remark here that for all s ≠ ∅: ε ∈ s†; moreover, s ∈ s†. Both are easily established by induction.

Let s be a term. For t, u ∈ s† put t →_a u iff ta ⊆ u. (The latter means: L(ta) ⊆ L(u).) The start symbol is ε.

(122)
A(s) := {a ∈ A : a ∈ s†}
δ(t, a) := {u : ta ⊆ u}
A(s) := (A(s), s†, ε, δ)

This is the canonical automaton of s. We shall show that the language recognized by the canonical automaton is s. This follows immediately from the next lemma, which establishes a somewhat more general result.

Lemma 9 In A(s), ε →_x t iff x ∈ L(t). Hence, the language accepted by state t is exactly L(t).


Proof. First, we show that if ε →_x u then x ∈ L(u). This is done by induction on the length of x. If x = ε the claim trivially follows. Now let x = ya for some a. Assume that ε →_x u. Then there is t such that ε →_y t →_a u. By inductive assumption, y ∈ L(t), and by definition of the transition relation, L(t)a ⊆ L(u). Whence the claim follows. Now for the converse, the claim that if x ∈ L(u) then ε →_x u. Again we show this by induction on the length of x. If x = ε we are done. Now let x = ya for some a ∈ A. We have to show that there is a t ∈ s† such that y ∈ L(t) and ta ⊆ u. For then by inductive assumption, ε →_y t, and so ε →_(ya) u, by definition of the transition relation.

Now for the remaining claim: if ya ∈ L(u) then there is a t ∈ s† such that y ∈ L(t) and L(ta) ⊆ L(u). Again induction, this time on u. Notice right away that u† ⊆ s†, a fact that will become useful. Case 1. u = b for some letter b. Clearly, ε ∈ s†, and putting t := ε will do. Case 2. u = u_1 ∪ u_2. Then u_1 and u_2 are both in s†. Now, suppose ya ∈ L(u_1). By inductive hypothesis, there is t such that y ∈ L(t) and L(ta) ⊆ L(u_1) ⊆ L(u), so the claim follows. Similarly if ya ∈ L(u_2). Case 3. u = u_1 u_2. Subcase 3a. y = y_1 y_2, with y_1 ∈ L(u_1) and y_2 a ∈ L(u_2). Now, by inductive hypothesis, there is a t_2 such that y_2 ∈ L(t_2) and L(t_2 a) ⊆ L(u_2). Then t := u_1 t_2 is the desired term. Since it is in u†, it is also in s†. And L(ta) = L(u_1 t_2 a) ⊆ L(u_1 u_2) = L(u). Case 4. u = u_1*. Suppose ya ∈ L(u). Then y has a decomposition z_0 z_1 · · · z_(n−1) v such that z_i ∈ L(u_1) for all i < n, and va ∈ L(u_1). By inductive hypothesis, there is a t_1 such that L(t_1 a) ⊆ L(u_1) and v ∈ L(t_1). And z_0 z_1 · · · z_(n−1) ∈ L(u). Now put t := u t_1. This has the desired properties.

Now, as a consequence, if L is regular, so is Lᵖ. Namely, take the automaton A(s), where s is a regular term of L. Now change this automaton to make every state accepting. This defines a new automaton which accepts every string that falls under some t ∈ s†, by the previous results. Hence, it accepts all prefixes of strings from L.

We discuss an application. Syllables of a given language are subject to certain conditions. One of the most famous constraints (presumed to be universal) is the sonoricity hierarchy. It states that the sonoricity of phonemes in a syllable must rise until the nucleus, and then fall. The nucleus contains the sounds of highest sonoricity (which do not have to be vowels). The rising part is called the onset and the falling part the rhyme. The sonoricity hierarchy translates into a finite state automaton as follows. It has an initial state q_0 together with states (o, i) and (r, i), where i is a number smaller than 10. The transitions are of the form q_0 →_a (o, i) if a has sonoricity i; q_0 →_a (r, i) if a has sonoricity i; (o, i) →_a (o, j) if a has sonoricity j ≥ i; (o, i) →_a (r, j) if a has sonoricity j > i; and (r, i) →_a (r, j) where a has sonoricity j ≤ i. All states of the form (r, i) are accepting. (This accounts for the fact that a syllable must have a rhyme. It may lack an onset, however.) Refinements of this hierarchy can be implemented as well. There are also language specific constraints, for example, that no syllable begins with [N] in English. Moreover, only vowels, nasals and laterals may be nuclear. We have seen above that a conjunction of conditions which each are regular is also a regular condition. Thus, effectively (this has proved to be correct over and over), phonological conditions on syllable and word structure are regular.

13 Complexity and Minimal Automata

In this section we shall look at the problem of recognizing a string by an automaton. Even though computers are nowadays very fast, it is still possible to reach the limit of their capabilities very easily, for example by making a simple task overly complicated. Although finite automata seem very easy at first sight, making the programs run as fast as possible is a complex task and requires sophistication. Given an automaton A and a string x = x_0 x_1 · · · x_(n−1), how long does it take to see whether or not x ∈ L(A)? Evidently, the answer depends on both A and x. Notice that x ∈ L(A) if and only if there are q_i, i < n + 1, such that

(123) i_0 = q_0 →_(x_0) q_1 →_(x_1) q_2 →_(x_2) · · · →_(x_(n−1)) q_n ∈ F

Whether or not q_i →_(x_i) q_(i+1) just takes constant time: it is equivalent to (q_i, x_i, q_(i+1)) ∈ δ. The latter is a matter of looking up δ. Looking up takes time logarithmic in the size of the input, but the input is bounded by the size of the automaton, so each lookup takes roughly the same amount of time.

A crude strategy is to just go through all possible assignments for the q_i and check whether they satisfy (123). This requires checking up to |Q|^n many assignments. This suggests that the time required is exponential in the length of the string. In fact, far more efficient techniques exist. What we do is the following.

We start with i_0. In Step 1 we collect all states q_1 such that i_0 →_(x_0) q_1. Call this the set H_1. In Step 2 we collect all states q_2 such that there is a q_1 ∈ H_1 with q_1 →_(x_1) q_2.


In Step 3 we collect into H_3 all the q_3 such that there exists a q_2 ∈ H_2 with q_2 →_(x_2) q_3. And so on. It is easy to see that

(124) i_0 →_(x_0 x_1 ··· x_j) q_(j+1) iff q_(j+1) ∈ H_(j+1)

Hence, x ∈ L(A) iff H_n ∩ F ≠ ∅, for then there exists an accepting state in H_n.

Here each step takes time quadratic in the number of states: for given H_j, we compute H_(j+1) by doing |Q| many lookups for every q ∈ H_j. However, this number is basically bounded for given A. So the time requirement is down to a constant depending on A times the length of x. This is much better. However, in practical terms this is still not good enough, because the constant it takes to compute a single step is too large. This is because we recompute the transition from H_j to H_(j+1); if the string is very long, this means that we recompute the same problem over and over. Instead, we can precompute all the transitions. It turns out that we can define an automaton in this way that we can use in place of A.
deﬁne an automaton in this way that we can use in the place of A.

Definition 10 Let A = (A, Q, i_0, F, δ) be an automaton. Put F℘ := {U ⊆ Q : U ∩ F ≠ ∅}. And let

(125) δ℘ := {(H, a, J) : J = {q′ : there is q ∈ H such that q →_a q′}}

Then

(126) A℘ := (A, ℘(Q), {i_0}, F℘, δ℘)

is called the exponential of A.

Definition 11 An automaton is deterministic if for all q ∈ Q and a ∈ A there is at most one q′ ∈ Q such that q →_a q′.

It is clear that for a deterministic and total automaton, all we have to do is to look up the next state in (123), which exists and is unique. For these automata, recognizing the language takes time linear in the length of the string.

Theorem 12 For every automaton A, the exponential A℘ is total and deterministic. Moreover, L(A℘) = L(A).


So, the recipe to attack the problem ‘x ∈ L(A)?’ is this: first compute A℘ and then check ‘x ∈ L(A℘)?’. Since the latter automaton is deterministic, the time needed is actually linear in the length of the string, and logarithmic in the size of ℘(Q), the state set of A℘. Hence, the time is also linear in |Q|.

We mention a corollary of Theorem 12.

Theorem 13 If L ⊆ A* is regular, so is A* − L.

Proof. Let L be regular. Then there exists a total deterministic automaton A = (A, Q, i_0, F, δ) such that L = L(A). Now let B = (A, Q, i_0, Q − F, δ). Then it turns out that i_0 →_x q in B if and only if i_0 →_x q in A. Now, x ∈ L(B) if and only if there is a q ∈ Q − F such that i_0 →_x q, if and only if there is no q ∈ F such that i_0 →_x q, if and only if x ∉ L(A) = L. This proves the claim.

Now that we have reduced the problem to recognition by some deterministic automaton, we may still not have done the best possible. It may turn out, namely, that the automaton has more states than are actually necessary. Actually, there is no limit on the complexity of an automaton that recognizes a given language; we can make it as complicated as we want! The art, as always, is to make it simple.

Definition 14 Let L be a language. Given a string x, let [x]_L := {y : xy ∈ L}. We also write x ∼_L y if [x]_L = [y]_L. The index of L is the number of distinct sets [x]_L.

Here is an example. The language L = {ab, ac, bc} has the following index sets:

(127)
[ε]_L = {ab, ac, bc}
[a]_L = {b, c}
[b]_L = {c}
[c]_L = ∅
[ab]_L = {ε}

Let us take a slightly different language M = {ab, ac, bc, bb}.

(128)
[ε]_M = {ab, ac, bb, bc}
[a]_M = {b, c}
[c]_M = ∅
[ab]_M = {ε}


It is easy to check that [b]_M = [a]_M. We shall see that this difference means that an automaton checking membership in M can be based on fewer states than any automaton checking membership in L.

Given two index sets I and J, put I →_a J if and only if J = a\I. This is well defined. For let I = [x]_L, and suppose that xa is a prefix of an accepted string. Then [xa]_L = {y : xay ∈ L} = {y : ay ∈ I} = a\I. This defines a deterministic automaton with initial element L. Accepting states are those sets which contain ε. We call this the index automaton and denote it by I(L).

Theorem 15 L(I(L)) = L.

Proof. By induction on x we show that L →_x I if and only if [x]_L = I. If x = ε the claim reads: L →_ε L if and only if [ε]_L = L. But [ε]_L = L, so the claim holds. Next, let x = ya. By induction hypothesis, L →_y J if and only if [y]_L = J. Now, J →_a a\J = [ya]_L. So, L →_x a\J = [x]_L, as promised. Now, x is accepted by I(L) if and only if there is a computation from L to a set [y]_L containing ε. By the above this is equivalent to ε ∈ [x]_L, which means x ∈ L.

Given an automaton A and a state q, put

(129) [q] := {x : there is q′ ∈ F: q →_x q′}

It is easy to see that for every q there is a string x such that [q] ⊆ [x]_L. Namely, let x be such that i_0 →_x q. Then for all y ∈ [q], xy ∈ L(A), by definition of [q]. Hence [q] ⊆ [x]_L. Conversely, for every [x]_L there must be a state q such that [q] ⊆ [x]_L.

Again, q is found as a state such that i_0 →_x q. Suppose now that A is deterministic and total. Then for each string x there is exactly one state q such that i_0 →_x q, and for it [q] ⊆ [x]_L. Then obviously [q] = [x]_L. For if y ∈ [x]_L then xy ∈ L, whence i_0 →_(xy) q′ ∈ F for some q′. Since the automaton is deterministic, i_0 →_x q →_y q′, whence y ∈ [q].

It follows now that the index automaton is the smallest deterministic and total automaton that recognizes the language. The next question is: how do we make that automaton? There are two procedures; one starts from a given automaton, and the other starts from the regular term. Given an automaton, we know how to make a deterministic total automaton by using exponentiation. On the other hand, let A be an automaton. Call a relation ∼ a net if

• from q ∼ q′, q →_a r and q′ →_a r′ it follows that r ∼ r′, and
• if q ∈ F and q ∼ q′ then also q′ ∈ F.

A net induces a partition of the set of states into sets of the form [q]_∼ = {q′ : q′ ∼ q}. In general, a partition of Q is a set Π of nonempty subsets of Q such that any two distinct S, S′ ∈ Π are disjoint, and every element is in one (and therefore only one) member of Π.

Given a net ∼, put

(130)
[q]_∼ := {q′ : q ∼ q′}
Q/∼ := {[q]_∼ : q ∈ Q}
F/∼ := {[q]_∼ : q ∈ F}
[q′]_∼ ∈ δ/∼([q]_∼, a) :⇔ there is r ∼ q′ such that r ∈ δ(q, a)
A/∼ := (A, Q/∼, [i_0]_∼, F/∼, δ/∼)

Lemma 16 L(A/∼) = L(A).

Proof. By induction on the length of x the following can be shown: if q ∼ q′ and q →_x r then there is an r′ ∼ r such that q′ →_x r′. Now consider x ∈ L(A/∼). This means that there is a [q]_∼ ∈ F/∼ such that [i_0]_∼ →_x [q]_∼. This means that there is a q′ ∼ q such that i_0 →_x q′. Now, as q ∈ F, also q′ ∈ F, by definition of nets. Hence x ∈ L(A). Conversely, suppose that x ∈ L(A). Then i_0 →_x q for some q ∈ F. Hence [i_0]_∼ →_x [q]_∼, by an easy induction. By definition of A/∼, [q]_∼ is an accepting state. Hence x ∈ L(A/∼).

All we have to do next is to fuse together all states which have the same index. Therefore, we need to compute the largest net on A. This is done as follows. In the first step, we put q ∼_0 q′ iff q, q′ ∈ F or q, q′ ∈ Q − F. This need not be a net.


Inductively, we define the following:

(131)  q ∼_{i+1} q′ :⇔ q ∼_i q′ and for all a ∈ A, for all r ∈ Q:
       if q --a--> r there is r′ ∼_i r such that q′ --a--> r′;
       if q′ --a--> r there is r′ ∼_i r such that q --a--> r′.

Evidently, ∼_{i+1} ⊆ ∼_i. Also, if q ∼_{i+1} q′ and q ∈ F then also q′ ∈ F, since this already holds for ∼_0. Finally, if ∼_{i+1} = ∼_i then ∼_i is a net. This suggests the following recipe: start with ∼_0 and construct the ∼_i one by one. If ∼_{i+1} = ∼_i, stop the construction. It takes only finitely many steps to compute this, and it returns the largest net on an automaton.
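This fixpoint computation can be written down almost verbatim. The sketch below is ours, not from the text (the names largest_net and delta are assumptions); it takes a deterministic, total automaton with transition table delta[q][a] and returns the largest net as a set of pairs:

```python
def largest_net(states, alphabet, delta, final):
    """Compute the largest net on a deterministic total automaton.
    delta[q][a] is the unique successor of state q under letter a."""
    # Step 0: q ~0 q' iff both are final or both are non-final.
    sim = {(q, r) for q in states for r in states
           if (q in final) == (r in final)}
    while True:
        # q ~i+1 q' iff q ~i q' and all a-successors are ~i-related.
        refined = {(q, r) for (q, r) in sim
                   if all((delta[q][a], delta[r][a]) in sim for a in alphabet)}
        if refined == sim:
            return sim  # ~i+1 = ~i, so this is a net
        sim = refined
```

Fusing the states along the classes of the returned relation then yields the quotient A/∼ described above.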

Deﬁnition 17 Let A be a ﬁnite state automaton. A is called reﬁned if the only net

on it is the identity.

Theorem 18 Let A and B be deterministic, total and refined, and let every state in each of the automata be reachable. Then if L(A) = L(B), the two are isomorphic.

Proof. Let A be based on the state set Q_A and B on the state set Q_B. For q ∈ Q_A write I(q) := {x : q --x--> r ∈ F}, and similarly for q ∈ Q_B. Clearly, we have I(i_0^A) = I(i_0^B), by assumption. Now, let q ∈ Q_A and q --a--> r. Then I(r) = a\I(q). Hence, if q′ ∈ Q_B and q′ --a--> r′ and I(q) = I(q′), then also I(r) = I(r′). Now we construct a map h : Q_A → Q_B as follows. h(i_0^A) := i_0^B. If h(q) = q′, q --a--> r and q′ --a--> r′, then h(r) := r′. Since all states are reachable in A, h is defined on all states. This map is injective since I(h(q)) = I(q) and A is refined. (Every homomorphism induces a net, so if the identity is the only net, h is injective.) It is surjective since all states in B are reachable and B is refined.

Thus the recipe to get a minimal automaton is this: get a deterministic total automaton for L and refine it. This yields an automaton which is unique up to isomorphism.


The other recipe is dual to Lemma 8. Put

(132a)  0‡ := ∅
(132b)  a‡ := {ε, a}
(132c)  ε‡ := {ε}
(132d)  (s ∪ t)‡ := s‡ ∪ t‡
(132e)  (st)‡ := {ut : u ∈ s‡} ∪ t‡
(132f)  (s∗)‡ := s‡ ∪ {us∗ : u ∈ s‡}

Given a regular expression s, we can effectively compute the sets [x]_L. They are either ∅ or of the form L(t) for some t ∈ s‡. The start symbol is s, and the accepting states are those t ∈ s‡ for which ε ∈ L(t). However, beware that it is not a priori clear for any two given terms t, u whether L(t) = L(u). So we need to know how we can effectively decide this problem. Here is a recipe. Construct an automaton A such that L(A) = L(t) and an automaton B such that L(B) = A∗ − L(u). We can assume that they are deterministic. Then A × B recognizes L(t) ∩ (A∗ − L(u)). This is empty exactly when L(t) ⊆ L(u). Dually, we can construct an automaton that recognizes L(u) ∩ (A∗ − L(t)), which is empty exactly when L(u) ⊆ L(t). So everything turns on the following question: can we decide for any given automaton A = (A, Q, i_0, F, δ) whether or not L(A) is empty? The answer is simple: this is decidable. To see how this can be done, notice that L(A) ≠ ∅ iff there is a word x such that i_0 --x--> q ∈ F. It is easy to see that if there is such a word, then there is one whose length is ≤ |Q|. Now we just have to search through all words of this size.
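The naive decision procedure just described (search all words of length ≤ |Q|) can be sketched as follows; a deterministic, total automaton in dictionary form is assumed, and the function name is ours:

```python
from itertools import product

def is_empty(states, alphabet, delta, i0, final):
    """Naive emptiness test: try every word of length <= |Q| and see
    whether any of them drives the automaton from i0 into a final state."""
    n = len(states)
    for length in range(n + 1):
        for word in product(alphabet, repeat=length):
            q = i0
            for a in word:
                q = delta[q][a]
            if q in final:
                return False  # found an accepted word, so L(A) is nonempty
    return True
```

As the next section points out, this brute-force search takes exponential time; a reachability computation does the same job much faster.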

Theorem 19 The following problems are decidable for given regular terms t, u:

‘L(t) = ∅’, ‘L(t) ⊆ L(u)’ and ‘L(t) = L(u)’.

It now follows that the index machine can effectively be constructed. The way to do this is as follows. Starting with s, we construct the machine based on the t ∈ s‡. Next we compute for given t, t′ whether the equivalence L(t) = L(t′) holds for some t ≠ t′. Then we add a new state t′′. Every arc into t′′ is an arc that goes into t or into t′; every arc leaving t′′ is an arc that leaves t or t′. Erase t and t′.

Theorem 20 L is regular if and only if it has only ﬁnitely many index sets.


14 Digression: Time Complexity

The algorithm that decides whether or not two automata accept the same language, or whether the language accepted by an automaton is empty, takes exponential time. All it requires us to do is to look through the list of words that are of length ≤ n and check whether they are accepted. If A has α many letters, there are α^n words of length n, and

(133)  1 + α + α² + . . . + α^n = (α^(n+1) − 1)/(α − 1)

many words of length ≤ n. For simplicity we assume that α = 2. Then (2^(n+1) − 1)/(2 − 1) = 2^(n+1) − 1. Once again, let us simplify a little bit: say that the number of steps we need is 2^n. Here is how fast the number of steps grows.

(134)
        n     2^n          n     2^n
        1     2            11    2048
        2     4            12    4096
        3     8            13    8192
        4     16           14    16384
        5     32           15    32768
        6     64           16    65536
        7     128          17    131072
        8     256          18    262144
        9     512          19    524288
        10    1024         20    1048576

Suppose that your computer is able to compute 1 million steps per second. Then for n = 10 the number of steps is 1024, still manageable: it takes only a millisecond (1/1000 of a second). For n = 20 it is 1,048,576; this time you need a full second. However, for every 10 you add to the size, the number of steps multiplies by more than 1000! Thus for n = 30 the time needed is about 18 minutes, and with another 10 added you are already waiting for almost two weeks. Given that reasonable applications in natural language require several hundreds of states, you can imagine that your computer might not even be able to come up with an answer in your lifetime.

Thus, even with problems that seem very innocent we may easily run out of

time. However, it is not necessarily so that the operations we make the computer

perform are really needed. Maybe there is a faster way to do the same job. Indeed,

the problem just outlined can be solved much faster than the algorithm just shown


would lead one to believe. Here is another algorithm: call a state q n-reachable if there is a word x of length ≤ n such that i_0 --x--> q; call it reachable if it is n-reachable for some n ∈ ω. L(A) is nonempty iff some q ∈ F is reachable. Now denote by R_n the set of n-reachable states. R_0 := {i_0}. In the nth step we compute R_{n+1} by taking all successors of points in R_n. If R_{n+1} = R_n we are done. (Checking this takes |Q| steps. Alternatively, we need to iterate the construction at most |Q| − 1 times. For if at each step we add some state, then R_n grows by at least 1, and this can happen only |Q| − 1 times, for then R_n has at least |Q| many elements. On the other hand, it is a subset of Q, so it can never have more elements.) Let us now see how much time we need. Each step takes |R_n| × |A| ≤ |Q| × |A| steps. Since n ≤ |Q|, we need at most |Q| − 1 iterations. So we need c·|A|·|Q|² steps for some constant c.
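The computation of the sets R_n can be sketched directly; the sketch below is ours, assuming a deterministic, total automaton with transition table delta[q][a]:

```python
def reachable(states, alphabet, delta, i0):
    """Compute the set of reachable states as the limit of
    R_0 = {i0}, R_{n+1} = R_n plus all successors of R_n."""
    R = {i0}
    while True:
        R_next = R | {delta[q][a] for q in R for a in alphabet}
        if R_next == R:
            return R  # fixpoint reached: these are all reachable states
        R = R_next
```

L(A) is then nonempty exactly when the returned set meets F.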

Consider however an algorithm that takes n² steps.

(135)
        n     n²           n     n²
        1     1            11    121
        2     4            12    144
        3     9            13    169
        4     16           14    196
        5     25           15    225
        6     36           16    256
        7     49           17    289
        8     64           18    324
        9     81           19    361
        10    100          20    400

For n = 10 it takes 100 steps, or a tenth of a millisecond; for n = 20 it takes 400; and for n = 30 only 900, roughly a millisecond. If you double the size of the input, the number of steps only quadruples. Or, if you want the number of steps to grow by a factor of 1024, you have to multiply the length of the input by 32! An input of even a thousand creates a need of only one million steps and so takes a second on our machine.

The algorithm described above is not optimal. There is an algorithm that is even faster. It goes as follows. We start with R_{−1} := ∅ and S_0 := {q_0}. Next, S_1 is the set of successors of S_0 minus R_0, and R_0 := S_0. In each step we have two sets, S_i and R_{i−1}. R_{i−1} is the set of (i−1)-reachable points, and S_i is the set of points that can be reached from it in one step but which are not in R_{i−1}. In the next step we let S_{i+1} be the set of successors of S_i which are not in R_i := R_{i−1} ∪ S_i. The advantage of this algorithm is that it does not recompute the successors of a given point over and over. Instead, only the successors of points that have recently been added are calculated. It is easily seen that this algorithm computes the successors of a point only once. Thus, this algorithm is not even quadratic but linear in |Q|! Linear is much better; in fact, you cannot get below linear time if the entire input matters, for the machine needs at least to take a look at the input, and that alone takes linear time.
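The frontier version keeps the set S_i of freshly added states separate from R_{i−1}; a minimal sketch under the same assumptions as before (the name reachable_frontier is ours):

```python
def reachable_frontier(states, alphabet, delta, i0):
    """Frontier version: only successors of newly added states are
    expanded, so each state's transitions are looked at only once."""
    R = set()    # states found so far (R_{i-1})
    S = {i0}     # fresh states whose successors are still unexplored
    while S:
        R |= S
        S = {delta[q][a] for q in S for a in alphabet} - R
    return R
```

The result is the same as with the quadratic fixpoint, but each transition is inspected only once.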

It is not always the case that algorithms can be made as fast as this one. The problem of satisfiability of a given propositional formula is believed to take exponential time; more precisely, it is NP-complete, though many instances can be solved quickly. Regardless of whether an algorithm takes a lot of time or not, it is wise to try and reduce the number of steps that it actually takes. This can make a big difference when the application has to run on real life examples. Often the best algorithms are not so difficult to understand; one just has to find them. Here is another one. Suppose you are given a list of numbers. Your task is to sort the list in ascending order as fast as possible. Here is the algorithm that one would normally use: scan the list for the least number and put it at the beginning of a new list; then scan the remainder for the least number and put that behind the number you already have, and so on. In step n you have your original list L, reduced by n − 1 elements, and a list L′ containing n − 1 elements. To find the minimal element in L you do the following. You need two memory cells, C and P. Initially C contains the first element of the list and P := 1. Now take the second element and see whether it is smaller than the content of C; if so, put it into C, and let P be the number of its cell; if not, leave C and P as they are. You need to make |L| − 1 comparisons, as you have to go through the entire list. When you are done, P tells you which element to put into L′. Now you start again. The number of comparisons is overall

(136)  (|L| − 1) + (|L| − 2) + . . . + 2 + 1 = |L|(|L| − 1)/2

This number grows quadratically. Not bad, but typically lists are very long, so we should try the best we can.
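The scanning procedure with the two cells C and P can be sketched as follows (this is selection sort; the function name is ours):

```python
def selection_sort(L):
    """Sort by repeatedly scanning for the least element. C holds the
    current minimum, P its position, as in the text."""
    L = list(L)
    out = []
    while L:
        C, P = L[0], 0
        for i in range(1, len(L)):   # |L| - 1 comparisons per scan
            if L[i] < C:
                C, P = L[i], i
        out.append(C)                # move the minimum to the new list
        del L[P]
    return out
```

The scans make (|L| − 1) + (|L| − 2) + . . . + 1 comparisons in total, matching (136).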

Here is another algorithm. Divide the list into blocks of two. (There might be a remainder of one element, but that does no harm.) We order these blocks; this takes just one comparison between the two elements of each block. In the next step we merge two blocks of two into one block of four. In the third step we merge two blocks of four into one block of eight, and so on. How many comparisons are needed to merge two ordered lists M and N of length m and n, respectively, into a list L of length m + n? The answer is: at most m + n − 1. The idea is as follows. We take the first elements, M_0 and N_0. Then L_0 is the smaller of the two, which is then removed from its list. The next element is again obtained by comparing the first elements of the lists, and so on. For example, let M = [1, 3, 5] and N = [2, 4, 5]. We compare the first two elements. The smaller one is put into L; the lists are now

(137)  M = [3, 5], N = [2, 4, 5], L = [1]

Now we compare the first elements of M and N. The smaller element is 2 and is put into L:

(138)  M = [3, 5], N = [4, 5], L = [1, 2]

This is how the algorithm goes on:

(139)  M = [5], N = [4, 5], L = [1, 2, 3]
(140)  M = [5], N = [5], L = [1, 2, 3, 4]
(141)  M = [], N = [5], L = [1, 2, 3, 4, 5]
(142)  M = [], N = [], L = [1, 2, 3, 4, 5, 5]

Each time we put an element into L we need just one comparison, except for the last element, which can be put in without further checking. If we want to avoid repetitions, then we need to check each element against the last member of the list before putting it in (this increases the number of checks by n + m − 1).
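The merge step can be sketched as follows (the function name is ours); each element placed into L costs one comparison, and the tail of the surviving list is appended without further checks:

```python
def merge(M, N):
    """Merge two sorted lists with at most len(M) + len(N) - 1
    comparisons, as in examples (137)-(142)."""
    M, N, L = list(M), list(N), []
    while M and N:
        # one comparison per element moved into L
        if M[0] <= N[0]:
            L.append(M.pop(0))
        else:
            L.append(N.pop(0))
    return L + M + N  # leftover tail needs no comparisons
```

Repeatedly merging blocks of doubling size in this way is the merge sort described in the text.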

In the first step we have n/2 many blocks, and n/2 many comparisons are being made to order them. The next step takes 3n/4 comparisons, the third step needs 7n/8, and so on. Let us round the numbers somewhat: each time we certainly need less than n comparisons. How often do we have to merge? This number is log₂ n, the number x such that 2^x = n. So we need in total at most n log₂ n many steps. We show this number in comparison to n and n².

(143)
                  2^0    2^4    2^8      2^12       2^16
        n         1      16     256      4096       65536
        n log₂ n  0      64     2048     49152      1048576
        n²/2      1/2    128    32768    8388608    2147483648


Consider again your computer. On an input of length 65536 (= 2^16) it takes one second under the algorithm just described, while the naive algorithm would require it to run for 2147 seconds, which is more than half an hour.

In practice, one does not want to spell out in painful detail how many steps an algorithm consumes. Therefore a simplifying notation is used. One writes that a problem is in O(n) if there is a constant C such that, from some n_0 onwards, for an input of length n the algorithm takes at most Cn steps to compute the solution. (One says that the estimate holds for ‘almost all’ inputs if it holds from a certain point onwards.) This notation makes sense also in view of the fact that it is not clear how much time an individual step takes, so that the time consumption cannot really be measured in seconds (which is what is really of interest to us). If tomorrow's computers can compute twice as fast, everything runs in shorter time.

Notice that O(bn + a) = O(bn) = O(n). It is worth understanding why. First, assume that n ≥ a. Then (b + 1)n ≥ bn + n ≥ bn + a. This means that for almost all n: (b + 1)n ≥ bn + a. Next, O((b + 1)n) = O(n), since O((b + 1)n) effectively means that there is a constant C such that for almost all n the complexity is ≤ C(b + 1)n. Now put D := C(b + 1). Then there is a constant (namely D) such that for almost all n the complexity is ≤ Dn. Hence the problem is in O(n).

Also O(cn² + bn + a) = O(n²), and so on. In general, the highest exponent wins by any given margin over the others. Polynomial complexity is therefore measured only in terms of the leading exponent. This makes calculations much simpler.

15 Finite State Transducers

Finite state transducers are similar to finite state automata. You may think of them as finite state automata that leave a trace of their actions in the form of a string. However, the more popular way is to think of them as translation devices with finite memory. A finite state transducer is a sextuple

(144)  T = (A, B, Q, i_0, F, δ)

where A and B are alphabets, Q a finite set (the set of states), i_0 the initial state, F the set of final states, and

(145)  δ ⊆ A_ε × Q × B_ε × Q


(Here, A_ε := A ∪ {ε} and B_ε := B ∪ {ε}.) We write q --a:b--> q′ if (a, q, b, q′) ∈ δ. We say in this case that T makes a transition from q to q′ given input a, and that it outputs b. Again, it is possible to define this notation for strings: if q --u:v--> q′ and q′ --x:y--> q′′, then we write q --ux:vy--> q′′.

It is not hard to show that a finite state transducer from A to B is equivalent to a finite state automaton over A × B. Namely, put

(146)  A := (A × B, Q, i_0, F, θ)

where q′ ∈ θ(q, (a, b)) iff (a, q, b, q′) ∈ δ. Now define

(147)  (u, v)·(x, y) := (ux, vy)

Then one can show that q --x:y--> q′ in T if and only if q --(x,y)--> q′ in A. Thus the theory of finite state automata can be used here. Transducers have been introduced as devices that translate languages. We shall see many examples below. Here we shall indicate a simple one. Suppose that the input alphabet is A := {a, b, c, d} and B := {0, 1}. We want to translate words over A into words over B in the following way: a is translated by 00, b by 01, c by 10 and d by 11. Here is how this is done. The set of states is {0, 1, 2}. The initial state is 0 and the set of accepting states is {0}. The transitions are

(148)  (0, a, 1, 0), (0, b, 2, 0), (0, c, 1, 1), (0, d, 2, 1),
       (1, ε, 0, 0), (2, ε, 0, 1)

This means the following. Upon input a, the machine enters state 1, and outputs

the letter 0. It is not in an accepting state, so it has to go on. There is only one

way: it reads no input, returns to state 0 and outputs the letter 0. Now it is back

to where it was. It has read the letter a and mapped it to the word 00. Similarly it

maps b to 01, c to 10 and d to 11.
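A small simulator for such a transducer can be sketched as follows (names ours). It collects every output the machine can produce for a given input, assuming the transducer has no cycles of ε-input transitions (the machine in (148) has none, so the search terminates):

```python
def translate(delta, i0, final, word):
    """All outputs of a transducer on the given input word. delta is a
    set of quadruples (state, input letter or '', next state, output
    letter or ''), with '' playing the role of epsilon."""
    results = set()
    stack = [(i0, word, "")]
    while stack:
        q, rest, out = stack.pop()
        if not rest and q in final:
            results.add(out)
        for (p, a, r, b) in delta:
            if p != q:
                continue
            if a == "":                   # epsilon-input move
                stack.append((r, rest, out + b))
            elif rest and rest[0] == a:   # consume one input letter
                stack.append((r, rest[1:], out + b))
    return results
```

Running it on the machine of (148) reproduces the letter-by-letter encoding described in the text.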

Let us write R_T for the following relation:

(149)  R_T := {(x, y) : i_0 --x:y--> q ∈ F}

Then, for every word x, put

(150)  R_T(x) := {y : x R_T y}

And for a set S ⊆ A∗,

(151)  R_T[S] := {y : there exists x ∈ S such that x R_T y}

This is the set of all strings over B that are the result of translation via T of a string in S. In the present case, notice that every string over A has a translation, but not every string over B is the translation of some string: it is one if and only if it has even length.

The translation need not be unique. Here is a finite state machine that translates a into bc∗:

(152)  A := {a}, B := {b, c}, Q := {0, 1}, i_0 := 0, F := {1},
       δ := {(0, a, 1, b), (1, ε, 1, c)}

This automaton takes as its only input the word a. However, it outputs any of b, bc, bcc, and so on. Thus the translation of a given input can be highly underdetermined. Notice also the following. For a language S ⊆ A∗, we have:

(153)  R_T[S] = bc∗ if a ∈ S, and ∅ otherwise.

This is because only the string a has a translation. For all words x ≠ a we have R_T(x) = ∅.

The transducer can also be used the other way: then it translates words over B into words over A. We use the same machine, but now we look at the converse relation

(154)  R_T^−1 := {(y, x) : x R_T y}

Similarly to before we define

(155)  R_T^−1(y) := {x : x R_T y}
(156)  R_T^−1[S] := {x : there exists y ∈ S such that x R_T y}

Notice that there is a transducer U such that R_U = R_T^−1.

We quote the following theorem.

Theorem 21 (Transducer Theorem) Let T be a transducer from A to B, and let L ⊆ A∗ be a regular language. Then R_T[L] is regular as well.

Notice, however, that the transducer can be used to translate any language. In fact, the image of a context free language under R_T can be shown to be context free as well.

Let us observe that the construction above can be generalized. Let A and B be alphabets, and f a function assigning to each letter of A a word over B. We extend f to words over A as follows:

(157)  f(ε) := ε,  f(xa) := f(x)f(a)

The translation function f can be effected by a finite state machine in the following way. The initial state is i_0. On input a the machine goes into state q_a and outputs ε. Then it returns to i_0 in one or several steps, outputting f(a). Then it is ready to take the next input. However, there are more complex functions that can be calculated with transducers.

16 Finite State Morphology

One of the most frequent applications of transducers is in morphology. Practically

all morphology is ﬁnite state. This means the following. There is a ﬁnite state

transducer that translates a gloss (= deep morphological analysis) into surface

morphology. For example, there is a simple machine that puts English nouns into

the plural. It has two states, 0, and 1; 0 is initial, 1 is ﬁnal. The transitions are as

follows (we only use lower case letters for ease of exposition).

(158) (0, a, 0, a), (0, b, 0, b), . . . , (0, z, 0, z), (0, ε, 1, s).

The machine takes an input and repeats it, and ﬁnally attaches s. We shall later

see how we can deal with the full range of plural forms including exceptional

plurals. We can also write a machine that takes a deep representation, such as car plus singular or car plus plural, and outputs car in the first case and cars in the second. For this machine the input alphabet has two additional symbols, say ⟨sg⟩ and ⟨pl⟩, and it works as follows.

(159)  (0, a, 0, a), (0, b, 0, b), . . . , (0, z, 0, z), (0, ⟨sg⟩, 1, ε), (0, ⟨pl⟩, 1, s).

This machine accepts one ⟨sg⟩ or ⟨pl⟩ at the end and transforms it into ε in the case of ⟨sg⟩ and into s otherwise. As explained above we can turn the machine around.


Then it acts as a map from surface forms to deep forms. It will translate arm into arm⟨sg⟩, and arms into arm⟨pl⟩ and arms⟨sg⟩. The latter may be surprising, but the machine has no idea about the lexicon of English. It assumes that s can be either the sign of the plural or the last letter of the root. Both cases arise: for example, the word bus has as its last letter indeed s. Thus, in one direction the machine synthesizes surface forms, and in the other direction it analyzes them.

Now let us make the machine more sophisticated. The regular plural is formed by adding es, not just s, when the word ends in sh: bushes, splashes. If the word ends in s, then the plural is obtained by adding ses: busses, plusses. We can account for this as follows. The machine will take the input and end in three different states, according to whether the word ends in s, sh or something else.

(160)  (0, a, 0, a), (0, b, 0, b), . . . , (0, z, 0, z),
       (0, a, 4, a), . . . , (0, r, 4, r), (0, t, 4, t), . . . , (0, z, 4, z),
       (0, s, 2, s), (2, a, 4, a), . . . , (2, g, 4, g), (2, h, 3, h),
       (2, i, 4, i), . . . , (2, z, 4, z), (2, ε, 3, s), (3, ε, 4, e),
       (4, ⟨sg⟩, 1, ε), (4, ⟨pl⟩, 1, s).

This does not exhaust the actual spelling rules for English, but it should suffice.

Notice that the machine, when turned around, will analyze busses correctly as bus⟨pl⟩, and also as busses⟨sg⟩. Once again, the mistake is due to the fact that the machine does not know that busses is not a basic word of English. Suppose we want to implement that kind of knowledge in the machine. Then what we would have to do is write a machine that can distinguish an English word from a nonword. Such a machine probably requires very many states; it is probably no exaggeration to say that several hundred states will be required. This is certainly the case if we take into account that certain nouns form the plural differently: we only mention formulae (from formula), indices (from index), tableaux (from tableau), men, children, oxen, sheep, mice.
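The deep-to-surface direction of this plural machine can be approximated by a plain function. In the sketch below (ours, not from the text) the strings SG and PL stand in for the two marker symbols, and the exception dictionary is a hypothetical stub showing how irregular plurals would be folded in as a word list:

```python
def surface(deep):
    """Spell out a deep form such as 'bush' + 'PL': -es after sh,
    -ses after s, plain -s otherwise; SG simply drops the marker.
    The exceptions dictionary is a tiny illustrative word list."""
    exceptions = {"formula": "formulae", "index": "indices", "ox": "oxen"}
    root, marker = deep[:-2], deep[-2:]
    assert marker in ("SG", "PL")
    if marker == "SG":
        return root
    if root in exceptions:
        return exceptions[root]
    if root.endswith("sh"):
        return root + "es"
    if root.endswith("s"):
        return root + "ses"
    return root + "s"
```

Unlike the transducer, this function only runs in the synthesis direction; the analysis direction would return several candidates, as discussed above.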

Here is another task. In Hungarian, case suﬃxes come in diﬀerent forms. For

example, the dative is formed by adding nak or nek. The form depends on the

following factors. If the root contains a back vowel (a, o, u) then the suﬃx is nak;

otherwise it is nek. The comitative suﬃx is another special case: when added, it

becomes a sequence of consonant plus al or el (the choice of vowel depends in

the same way as that of nak versus nek). The consonant is v if the root ends in a

vowel; otherwise it is the same as the preceding one. (So: dob (‘drum’) dobnak


(‘for a drum’) dobbal (‘with a drum’); szeg (‘nail’) szegnek (‘for a nail’) and

szeggel (‘with a nail’).) On the other hand, there are a handful of words that end

in z (ez ‘this’, az ‘that’) where the ﬁnal z assimilates to the following consonant

(ennek ‘to this’), except in the comitative where we have ezzel ‘with this’. To

write a ﬁnite state transducer, we need to record in the state two things: whether

or not the root contained a back vowel, and what consonant the root ended in.
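The two suffix rules can be sketched as plain functions (names ours). This is a deliberate simplification of Hungarian: long vowels, front rounded vowels and the z-assimilation of ez/az are ignored here:

```python
def dative(root):
    """nak after a root containing a back vowel (a, o, u), nek otherwise."""
    back = set("aou")
    return root + ("nak" if any(v in back for v in root) else "nek")

def comitative(root):
    """val/vel: the consonant is v after a vowel-final root; otherwise
    the root-final consonant is doubled (dob -> dobbal, szeg -> szeggel)."""
    vowels = set("aeiou")
    suffix_vowel = "a" if any(v in set("aou") for v in root) else "e"
    cons = "v" if root[-1] in vowels else root[-1]
    return root + cons + suffix_vowel + "l"
```

In the transducer formulation, the back-vowel flag and the final consonant are exactly the two pieces of information the state has to record.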

Plural in German is a jungle. First, there are many ways in which the plural can be formed: suffix s, suffix en, suffix er, Umlaut, which is the (graphical) change from a to ä, from o to ö and from u to ü of the last vowel; and combinations thereof. Second, there is no way to predict phonologically which word will take which plural. Hence we have to be content with a word list. Translated into a finite state machine, this means that we end up with a machine of several hundred states.

Another area where a transducer is useful is in writing conventions. In English, final y changes to i when a vowel is added: happy : happier, fly : flies. In Hungarian, the palatal sound [dj] is written gy. When this sound is doubled, it becomes ggy and not, as one would expect, gygy. The word hegy should by the above rule become hegygyel, but the orthography dictates heggyel. (Actually, the spelling gets undone in hyphenation: you write hegy-<newline>gyel.) Thus the following procedure suggests itself: we define a machine that regularizes the orthography by reversing the conventions just shown. This machine translates heggyel into hegygyel. Actually, it is not necessary that gy is treated as a digraph: we can define a new alphabet in which gy is written by a single symbol. Next, we take this as input to a second machine which produces the deep morphological representations.

We close with an example from Egyptian Arabic. As in many Semitic languages, roots consist only of consonants. Typically, they have three consonants, for example ktb ‘to write’ and drs ‘to study’. Words are made by adding some material in front (prefixation), some material after (suffixation) and some material in between (infixation). Moreover, all of these typically happen at the same time. Let’s look at the following list.

(161)  [katab] ‘he wrote’      [daras] ‘he studied’
       [Pamal] ‘he did’        [na\al] ‘he copied’
       [baktib] ‘I write’      [badris] ‘I study’
       [baPmil] ‘I do’         [ban\il] ‘I copy’
       [iktib] ‘write!’        [idris] ‘study!’
       [iPmil] ‘do!’           [in\il] ‘copy!’
       [kaatib] ‘writer’       [daaris] ‘studier’
       [Paamil] ‘doer’         [naa\il] ‘copier’
       [maktuub] ‘written’     [madruus] ‘studied’
       [maPmuul] ‘done’        [man\uul] ‘copied’

Now, we want a transducer to translate katab into a sequence ktb plus 3rd, plus singular, plus past. Similarly with the other roots. And it shall translate baktib into ktb plus 1st, plus singular, plus present. And so on. The transducer will take the form CaCaC and translate it into CCC plus the markers 3rd, singular and past. (Obviously, one can reverse this and ask the transducer to spell out the form CCC plus 3rd, singular, past as CaCaC.)
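The analysis direction (surface template to root plus markers) can be sketched with pattern templates. The table below is ours: the templates follow the forms in (161), but the gloss labels are hypothetical stand-ins for the marker sequences:

```python
import re

# Each template fixes the vowel skeleton; C1, C2, C3 are captured.
PATTERNS = [
    (re.compile(r"^(.)a(.)a(.)$"), "3rd.sg.past"),
    (re.compile(r"^ba(.)(.)i(.)$"), "1st.sg.present"),
    (re.compile(r"^i(.)(.)i(.)$"), "imperative"),
    (re.compile(r"^(.)aa(.)i(.)$"), "agent noun"),
    (re.compile(r"^ma(.)(.)uu(.)$"), "passive participle"),
]

def analyze(form):
    """Return (root, gloss) pairs for a surface form by matching it
    against the templates; several answers would signal ambiguity."""
    return [("".join(m.groups()), gloss)
            for rx, gloss in PATTERNS if (m := rx.match(form))]
```

This is only a finite approximation of the transducer, but it shows how prefix, suffix and infix material are peeled off simultaneously.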

17 Using Finite State Transducers

Transducers can be nondeterministic, and this nondeterminism we cannot always eliminate. One example is the following machine. It has two states, 0 and 1; 0 is initial, and 1 is accepting. It has the transitions 0 --a:b--> 1 and 1 --ε:a--> 1. This machine accepts a as input, together with any output string ba^n, n ∈ ℕ. What runs does this machine have? Even given input a there are infinitely many runs:

(162)  0 --a:b--> 1
       0 --a:b--> 1 --ε:a--> 1
       0 --a:b--> 1 --ε:a--> 1 --ε:a--> 1
       0 --a:b--> 1 --ε:a--> 1 --ε:a--> 1 --ε:a--> 1

In addition, given a specific output, sometimes there are several runs on that given pair. Here is an example. Again two states, 0 and 1, and the following transitions: 0 --a:b--> 1, 1 --ε:b--> 1 and 1 --a:ε--> 1. Now, for the pair aaa and bbbb there are several runs:

(163)  0 --a:b--> 1 --a:ε--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1 --ε:b--> 1
       0 --a:b--> 1 --a:ε--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1
       0 --a:b--> 1 --ε:b--> 1 --a:ε--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1
       0 --a:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1

(There are more runs than shown.) In the case of the machine just shown, the number of runs grows with the size of the input/output pair. There is therefore no hope of fixing the number of runs a priori.

Despite all this, we want to give algorithms to answer the following questions:

– Given a transducer T and a pair x : y of strings, is there a run of T that accepts this pair? (That is to say: is (x, y) ∈ R_T?)
– Given x, is there a string y such that (x, y) ∈ R_T?
– Given x, is there a way to enumerate R_T(x), that is to say, all y such that (x, y) ∈ R_T?

Let us take these questions in turn. There are two ways to deal with nondeterminism: we follow a single run to the end and backtrack on need (depth first search), or we store all partial runs and in each cycle try to extend each of them if possible (breadth first search). Let us look at both.

Let us take the second machine. Assume that the transitions are numbered:

(164)  t(0) = (0, a, 1, b)
(165)  t(1) = (1, a, 1, ε)
(166)  t(2) = (1, ε, 1, b)

This numbering is needed to order the transitions. We start with the initial state 0. Let input aaa and output bbbb be given. Initially, no part of the input or output has been read. (‘Output’ is at present a bit of a misnomer, because here it is part of what is given.) Now we look for a transition that starts at the state we are in and matches characters of the input and output strings. There is only one possibility: the transition t(0). This results in the following: we are now in state 1, and one character from each string has been read. The conjunction of these facts, namely (i) we are in state 1, (ii) one character of the input has been read, and (iii) one character of the output has been read, we call a configuration. We visualize this configuration by 1, a|aa : b|bbb. The bar is placed right after the characters that have been consumed or read.

Next we face a choice: either we take t(1) and consume the input a but no output letter, or we take t(2) and consume the output b but no input letter. Faced with this choice, we take the transition with the least number, so we follow t(1). The configuration is now this: 1, aa|a : b|bbb. Now we face the same choice again, to do either t(1) or t(2). We decide in the same way, choosing t(1), giving 1, aaa| : b|bbb. Now no more choice exists, and we can only proceed with t(2); doing this twice gives 1, aaa| : bbb|b. One last time we do t(2), and we are done. This is exactly the first of the runs. Now, backtracking can be done for any reason. In this case the reason is: we want to find another run. To do that, we go back to the last point where there was an actual choice. This was the configuration 1, aa|a : b|bbb. Here we were able to choose between t(1) and t(2), and we chose t(1). This time we decide differently: now we take t(2), giving 1, aa|a : bb|bb. From this moment on we again have a choice, but now we fall back on our typical recipe: choose t(1) whenever possible. This gives the second run above. If we backtrack again, it is the last choice point within the run we are currently considering that we look at. Where we chose t(1) we now choose t(2), and we get this run:

(167)  0 --a:b--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1

Here is how backtracking continues to enumerate the runs:

(168)  0 --a:b--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1
       0 --a:b--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1
       0 --a:b--> 1 --ε:b--> 1 --a:ε--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1
       0 --a:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1
       0 --a:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1
       0 --a:b--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1 --ε:b--> 1 --a:ε--> 1
       0 --a:b--> 1 --ε:b--> 1 --ε:b--> 1 --ε:b--> 1 --a:ε--> 1 --a:ε--> 1


In general, backtracking works as follows. Every time we face a choice, we pick

the available transition with the smallest number. If we backtrack we go back in

the run to the point where there was a choice, and then we change the previously

chosen next transition to the next up. From that point on we consistently choose

the least numbers again. This actually enumerates all runs. It is sometimes a

dangerous strategy since we need to guarantee that the run we are following ac-

tually terminates: this is no problem here, since every step consumes at least one

character.

Important in backtracking is that it does not give the solutions in one blow:

it gives us one at a time. We need to memorize only the last run, and then back-

tracking gives us the next if there is one. (If there is none, it will tell us. Basically,

there is none if going back we cannot ﬁnd any point at which there was a choice

(because you can never choose a lower number than you had chosen).)
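The backtracking recipe can be made concrete. The following sketch (not from the text; the function and variable names are mine) hard-codes the three transitions of the machine above and enumerates the runs on (aaa, bbbb). Trying the transitions in numerical order at each choice point reproduces exactly the order in which backtracking visits the runs, and a generator naturally hands out one run at a time, as described.

```python
# Depth-first enumeration of transducer runs by backtracking.
# t(0): 0 -a:b-> 1,  t(1): 1 -a:eps-> 1,  t(2): 1 -eps:b-> 1
TRANSITIONS = [
    (0, 'a', 'b', 1),   # t(0)
    (1, 'a', '',  1),   # t(1)
    (1, '',  'b', 1),   # t(2)
]
ACCEPTING = {1}

def runs(inp, out):
    """Yield every successful run as a list of transition numbers."""
    def search(state, i, j, path):
        if i == len(inp) and j == len(out) and state in ACCEPTING:
            yield list(path)
        for n, (src, a, b, tgt) in enumerate(TRANSITIONS):
            if src != state:
                continue
            if not (a or b):
                continue          # skip non-advancing eps:eps steps
            if inp[i:i+len(a)] == a and out[j:j+len(b)] == b:
                path.append(n)
                yield from search(tgt, i + len(a), j + len(b), path)
                path.pop()
    yield from search(0, 0, 0, [])

all_runs = list(runs('aaa', 'bbbb'))
```

There are ten runs in total: t(0) followed by every interleaving of two t(1) steps and three t(2) steps; the first one produced is t(0), t(1), t(1), t(2), t(2), t(2), matching the run derived step by step above.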

Now let us look at the other strategy. It consists in remembering partial runs,

and trying to complete them. A partial run on a pair of strings simply is a sequence

of transitions that successfully maps the machine into some state, consuming some
part of the input string and some part of the output string. We do not require that

the partial run is part of a successful run; otherwise we would need to know

what we are about to ﬁnd out: whether any of the partial runs can be completed.

We start with a single partial run: the empty run — no action taken. In the ﬁrst

step we list all transitions that extend this by one step. There is only one, t(0). So,

the next set of runs is t(0). It leads to the situation 1, a¦aa : b¦bbb. From here we

can do two things: t(1) or t(2). We do both, and note the result:

Step 2 (169)

t(0), t(1) : 1, aa¦a : b¦bbb

t(0), t(2) : 1, a¦aa : bb¦bb

In the next round we try to extend the result by one more step. In each of the two,

we can perform either t(1) or t(2):

Step 3 (170)

t(0), t(1), t(1) : 1, aaa¦ : b¦bbb

t(0), t(1), t(2) : 1, aa¦a : bb¦bb

t(0), t(2), t(1) : 1, aa¦a : bb¦bb

t(0), t(2), t(2) : 1, a¦aa : bbb¦b


In the fourth step, we have no choice in the ﬁrst line: we have consumed the input,

now we need to choose t(2). In the other cases we have both options still.

Step 4 (171)

t(0), t(1), t(1), t(2) : 1, aaa¦ : bb¦bb

t(0), t(1), t(2), t(1) : 1, aaa¦ : bb¦bb

t(0), t(1), t(2), t(2) : 1, aaa¦ : bbb¦b

t(0), t(2), t(1), t(1) : 1, aaa¦ : bbb¦b

t(0), t(2), t(1), t(2) : 1, aa¦a : bbb¦b

t(0), t(2), t(2), t(1) : 1, aa¦a : bbb¦b

t(0), t(2), t(2), t(2) : 1, a¦aa : bbbb¦

And so on.

It should be noted that remembering each and every run is actually excessive.

If two runs terminate in the same conﬁguration (state plus pair of read strings),

they can be extended in the same way. Basically, the number of runs initially

shows exponential growth, while the number of conﬁgurations is quadratic in the

length of the smaller of the strings. So, removing this excess can be vital for

computational reasons. In Step 3 we had 4 runs and 3 diﬀerent end situations, in

Step 4 we have 7 runs, but only 4 diﬀerent end conﬁgurations.

Thus we can improve the algorithm once more as follows. We do not keep a

record of the run, only of its resulting conﬁguration, which consists of the state

and positions at the input and the output string. In each step we just calculate the

possible next conﬁgurations for the situations that have recently been added. Each

step will advance the positions by at least one, so we are sure to make progress.

How long does this algorithm take to work? Let’s count. The conﬁgurations are
triples (x, j, k), where x is a state of the machine, j is the position of the character
to the right of the bar on the input string, and k the position of the character to the
right of the bar on the output string. On (x, y) there are |Q| · |x| · |y| many such
triples. All the algorithm does is compute which ones are accessible from (0, 0, 0).
At the end, it looks whether there is one of the form (q, |x|, |y|), where q ∈ F (q is
accepting). Thus the algorithm takes |Q| · |x| · |y| time, time proportional in both
lengths. (Recall here the discussion of Section 13. What we are computing is the
reachability relation in the set of conﬁgurations. This can be done in linear time.)
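The conﬁguration-based procedure can be sketched as follows (the names are mine; the transducer is passed in as a list of transitions). Only conﬁgurations (state, i, j) are stored, never whole runs, which is exactly what brings the cost down to |Q| · |x| · |y|.

```python
from collections import deque

def transduces(transitions, accepting, inp, out):
    """Decide whether some accepting run maps inp to out.

    Breadth-first search over configurations (state, i, j); every
    configuration is enqueued and processed at most once.
    """
    seen = {(0, 0, 0)}
    queue = deque(seen)
    while queue:
        state, i, j = queue.popleft()
        if state in accepting and i == len(inp) and j == len(out):
            return True
        for src, a, b, tgt in transitions:
            if src != state:
                continue
            if inp[i:i+len(a)] == a and out[j:j+len(b)] == b:
                nxt = (tgt, i + len(a), j + len(b))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    return False

# the machine of the running example
TRANS = [(0, 'a', 'b', 1), (1, 'a', '', 1), (1, '', 'b', 1)]
```

The `seen` set is what removes the excess noted above: several runs that end in the same conﬁguration are represented just once.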

One might be inclined to think that the ﬁrst algorithm is faster because it goes

into the run very quickly. However, it is possible to show that if the input is


appropriately chosen, it might take exponential time to reach an answer. So, on

certain examples this algorithm is excessively slow. (Nevertheless, sometimes it

might be much faster than the breadth ﬁrst algorithm.)

Now we look at a diﬀerent problem: given an input string, what are the possible

output strings? We have already seen that there can be inﬁnitely many, so one

might need a regular expression to list them all. Let us not do that here. Let

us look instead into ways of ﬁnding at least one output string. First, an artiﬁcial

example. The automaton has the following transitions.

0 −a:c→ 1    0 −a:d→ 2 (172)
1 −a:c→ 1    2 −a:d→ 2
2 −b:ε→ 3

0 is initial, and 1 and 3 are accepting. This machine has as input strings all strings
of the form a⁺b?. However, the relation it deﬁnes is this:

(173) {(aⁿ, cⁿ) : n > 0} ∪ {(aⁿb, dⁿ) : n > 0}

The translation of each given string is unique; however, it depends on the last

character of the input! So, we cannot simply think of the machine as taking each

character from the input and immediately returning the corresponding output char-

acter. We have to buﬀer the entire output until we are sure that the run we are

looking at will actually succeed.

Let us look at a real problem. Suppose we design a transducer for Arabic.

Let us say that the surface form katab should be translated into the deep form

ktb+3Pf, but the form baktib into ktb+1Pr. Given: input string baktib. Con-

ﬁguration is now: 0, ¦baktib. No output yet. The ﬁrst character is b. We face

two choices: consider it as part of a root (say, of a form badan to be translated

as bdn+3Pf), or to consider it as the ﬁrst letter of the preﬁx ba. So far there is
nothing that we can do to eliminate either option, so we keep them both open. It
is only when we have reached the fourth letter, t, that the situation is clear: if

this was a 3rd perfect form, we should see a. Thus, the algorithm we propose

is this: try to ﬁnd a run that matches the input string and then look at the output

string it deﬁnes. The algorithm to ﬁnd a run for the input string can be done as

above, using depth ﬁrst or breadth ﬁrst search. Once again, this can be done in

quadratic time. (Basically, the possible runs can be compactly represented and

then converted into an output string.)
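The buﬀering strategy can be sketched as follows (names are mine; transitions with empty input are left out of the sketch, since the machine of (172) has none). Runs are searched over the input string alone, the output produced so far is carried along in a buﬀer, and a buﬀer is emitted only once its run is known to succeed.

```python
def outputs(transitions, start, accepting, inp):
    """All output strings that some accepting run assigns to inp."""
    results = set()
    stack = [(start, 0, '')]      # (state, input position, buffered output)
    while stack:
        state, i, buf = stack.pop()
        if i == len(inp) and state in accepting:
            results.add(buf)
        for src, a, b, tgt in transitions:
            # a is never '' here: input-eps transitions are not handled
            if src == state and a and inp[i:i+len(a)] == a:
                stack.append((tgt, i + len(a), buf + b))
    return results

# The machine of (172): the output for a given input is unique, but it
# depends on the last character of the input.
M = [(0, 'a', 'c', 1), (0, 'a', 'd', 2),
     (1, 'a', 'c', 1), (2, 'a', 'd', 2), (2, 'b', '', 3)]
```

On input aa only the c-branch ends in an accepting state, on input aab only the d-branch does; nothing could have been emitted after reading the ﬁrst a without risking a wrong guess.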


18 Context Free Grammars

Deﬁnition 22 Let A be as usual an alphabet. Let N be a set of symbols, N disjoint

with A, Σ ⊆ N. A context free rule over N as the set of nonterminal symbols and

A the set of terminal symbols is a pair (X, α), where X ∈ N and α is a string over

N ∪ A. We write X → α for the rule. A context free grammar is a quadruple

(A, N, Σ, R), where R is a ﬁnite set of context free rules. The set Σ is called the set

of start symbols.

Often it is assumed that a context free grammar has just one start symbol, but in

actual practice this is mostly not observed. Notice that context free rules allow

the use of nonterminals and terminals alike. The implicit convention we use is as

follows. A string of terminals is denoted by a Roman letter with a vector arrow,

a mixed string containing both terminals and nonterminals is denoted by a Greek

letter with a vector arrow. (The vector is generally used to denote strings.) If the

diﬀerence is not important, we shall use Roman letters.

The following are examples of context free rules. The conventions that apply

here are as follows. Symbols in print are terminal symbols; they are denoted us-

ing typewriter font. Nonterminal symbols are produced by enclosing an arbitrary

string in brackets: < >.

< digit > → 0 (174)
< digit > → 1
< number > → < digit >
< number > → < number >< digit >

There are some conventions that apply to display the rules in a more compact

form. The vertical slash ‘|’ is used to merge two rules with the same left hand
symbol. So, when X → α and X → γ are rules, you can write X → α | γ. Notice
that one speaks of one rule in the latter case, but technically we have two rules
now. This allows us to write as follows.
< digit > → 0 | 1 | 2 | . . . | 9 (175)

And this stands for ten rules. Concatenation is not written. It is understood that

symbols that follow each other are concatenated the way they are written down.


A context free grammar can be seen as a statement about sentential structure,

or as a device to generate sentences. We begin with the second viewpoint. We

write α ⇒ γ and say that γ is derived from α in one step if there is a rule X → η,

a single occurrence of X in α such that γ is obtained by replacing that occurrence

of X in α by η:

(176) α = σXτ, γ = σητ

For example, if the rule is < number >→< number >< digit > then replacing

the occurrence of < number > in the string 1 < digit >< number > 0 will give

1 < digit >< number >< digit > 0. Now write α ⇒ⁿ⁺¹ γ if there is a ρ such that
α ⇒ⁿ ρ and ρ ⇒ γ. We say that γ is derivable from α in n + 1 steps. Using the

above grammar we have:

(177) < number > ⇒⁴ < number >< digit >< digit >< digit >< digit >

Finally, α ⇒∗ γ (γ is derivable from α) if there is an n such that α ⇒ⁿ γ.

Now, if X ⇒∗ γ we say that γ is a string of category X. The language generated
by G consists of all terminal strings that have category S ∈ Σ. Equivalently,
(178) L(G) := {x ∈ A∗ : there is S ∈ Σ : S ⇒∗ x}

You may verify that the grammar above generates all strings of digits from the

symbol < number >. If you take this as the start symbol, the language will consist

of all strings of digits. If, however, you take the symbol < digit > as the start

symbol then the language is just the set of all strings of length one. This is because

even though the symbol < number > is present in the nonterminals and the rules,

there is no way to generate it by applying the rules in succession from < digit >. If

on the other hand, Σ contains both symbols then you get just the set of all strings,

since a digit also is a number. (A fourth possibility is Σ = ∅, in which case the

language is empty.)
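The eﬀect of choosing diﬀerent start symbols can be checked mechanically. The sketch below is mine: N encodes < number >, D encodes < digit >, and the digits are restricted to 0 and 1 for brevity; it derives every terminal string of a given category up to a length bound.

```python
RULES = {
    'D': ['0', '1'],        # < digit >  -> 0 | 1   (restricted)
    'N': ['D', 'ND'],       # < number > -> < digit > | < number >< digit >
}

def generate(start, max_len):
    """All terminal strings of category `start` of length <= max_len."""
    result, agenda = set(), [start]
    while agenda:
        s = agenda.pop()
        i = next((k for k, c in enumerate(s) if c in RULES), None)
        if i is None:             # no nonterminal left: a terminal string
            result.add(s)
            continue
        for rhs in RULES[s[i]]:
            t = s[:i] + rhs + s[i+1:]
            # every nonterminal derives at least one terminal, so strings
            # already longer than max_len can be pruned right away
            if len(t) <= max_len:
                agenda.append(t)
    return result
```

With start symbol N and bound 2 this yields all six digit strings of length one or two; with start symbol D it yields only the two strings of length one, exactly as claimed above.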

There is a trick to reduce the set of start symbols to just one. Introduce a new
symbol S∗ together with the rules
(179) S∗ → S

for every S ∈ Σ. This is a diﬀerent grammar but it derives the same strings.

Notice that actual linguistics is diﬀerent. In natural language having a set of start


symbols is very useful. There are several basic types of sentences (assertions,

questions, commands, and so on). These represent diﬀerent sets of strings. In real

grammars (GPSG, HPSG, for example), one does not start from a single symbol

but rather from diﬀerent symbols, one for each class of saturated utterance. This

is useful also for other reasons. It is noted that in answers to questions almost

any constituent can function as a complete utterance. To account for that, one

would need to enter a rule of the kind above for each constituent. But there is

a sense in which these constituents are not proper sentences; they are just parts
of sentences. Rather than having to write new rules to encompass the use of

these constituents, we can just place the burden on the start set. So, in certain

circumstances what shifts is the set of start symbols, the rule set however remains

constant.

A derivation is a full speciﬁcation of the way in which a string has been gen-

erated. This may be given as follows. It is a sequence of strings, where each

subsequent string is obtained by applying a rule to some occurrence of a nonter-

minal symbol. We need to specify in each step which occurrence of a nonterminal

is targeted. We can do so by underlining it. For example, here is a grammar.

(180) A → BA | AA, B → AB | BB, A → a, B → b | bc | cb

Here is a derivation:

(181) _A_, _B_A, B_B_A, BBB_A_, B_B_Ba, _B_cbBa, bcb_B_a, bcbbca

Notice that without underlining the symbol which is being replaced some infor-

mation concerning the derivation is lost. In the second step we could have applied

the rule A → BA instead, yielding the same result. Thus, the following also is a

derivation:

(182) _A_, B_A_, B_B_A, BBB_A_, B_B_Ba, _B_cbBa, bcb_B_a, bcbbca

(I leave it to you to ﬁgure out how one can identify the rule that has been applied.)

This type of information is essential, however. To see this, let us talk about the

second approach to CFGs, the analytic one. Inductively, on the basis of a deriva-

tion, we assign constituents to the strings. First, however, we need to ﬁx some

notation.

Deﬁnition 23 Let x and y be strings. An occurrence of y in x is a pair (u, v) of

strings such that x = uyv. We call u the left context of y in x and v the right

context.


For example, the string aa has three occurrences in aacaaa: (ε, caaa), (aac, a)

and (aaca, ε). It is very important to distinguish a substring from its occurrences

in a string. Often we denote a particular occurrence of a substring by underlining

it, or by other means. In parsing it is very popular to assign numbers to the posi-

tions between (!) the letters. Every letter then spans two adjacent positions, and

in general a substring is represented by a pair of positions (i, j) where i ≤ j. (If

i = j we have an occurrence of the empty string.) Here is an example:

(183)   E d d y
       0 1 2 3 4

The pair (2, 3) represents the occurrence (Ed, y) of the letter d. The pair (1, 2)

represents the occurrence (E, dy). The pair (1, 4) represents the (unique) occur-

rence of the string ddy. And so on. Notice that the pairs are representatives of
the substrings, and they can represent the substring only in virtue of the fact that the
larger string is actually given. This way of talking recommends itself when the

larger string is ﬁxed and we are only talking about its substrings (which is the case

in parsing). For present purposes, it is however less explicit.
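Deﬁnition 23 translates directly into code. In this sketch (names mine) an occurrence is a pair of contexts, and the position-pair representation used in parsing is derived from it:

```python
def occurrences(y, x):
    """All occurrences of y in x as (left context, right context) pairs."""
    return [(x[:i], x[i + len(y):])
            for i in range(len(x) - len(y) + 1)
            if x[i:i + len(y)] == y]

def positions(occ, y):
    """Represent an occurrence of y as a pair of positions (i, i + |y|)."""
    left = occ[0]
    return (len(left), len(left) + len(y))

# the three occurrences of aa in aacaaa discussed above
occs = occurrences('aa', 'aacaaa')
```

This reproduces the three occurrences of aa in aacaaa given above; the second one, (aac, a), corresponds to the position pair (3, 5).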

Let us return to our derivation. The derivation may start with any string α, but

it is useful to think of the derivation as having started with a start symbol. Now,

let the ﬁnal string be γ (again, it need not be a terminal string, but it is useful to

think that it is). We look at derivation (181). The last string is bcbbca. The last

step is the replacement of B in bcbBa by bc. It is this replacement that means that

the occurrence (bcb, a) of the string bc is a constituent of category B.

(184) bcb_bc_a

The step before that was the replacement of the ﬁrst B by b. Thus, the ﬁrst occur-

rence of b is a constituent of category B. The third last step was the replacement

of the second B by cb, showing us that the occurrence (b, bca) is a constituent

of category B. Then comes the replacement of A by a. So, we have the following

ﬁrst level constituents: (ε, cbbca), (b, bca), (bcb, a) and (bcbbc, ε). Now we get to

the replacement of B by BB. Now, the occurrence of BB in the string corresponds

to the sequence of the occurrences (b, bca) of cb and (bcb, a) of bc. Their con-

catenation is cbbc, and the occurrence is (b, a).

(185) b_cbbc_a

We go one step back and ﬁnd that, since now the ﬁrst B is replaced by BB, the new

constituent that we get is

(186) _bcbbc_a


which is of category B. In the last step we get that the entire string is of category

A under this derivation. Thus, each step in a derivation adds a constituent to the

analysis of the ﬁnal string. The structure that we get is as follows. There is a

string α and a set Γ of occurrences of substrings of α together with a map f that

assigns to each member of Γ a nonterminal (that is, an element of N). We call the

members of Γ constituents and f (C) the category of C, C ∈ Γ.

By comparison, the derivation (182) imposes a diﬀerent constituent structure

on the string. The diﬀerence is that it is not (186) that is a constituent but rather
(187) b_cbbca_

It is however not true that the constituents identify the derivation uniquely. For

example, the following is a derivation that gives the same constituents as (181).

(188) _A_, _B_A, B_B_A, BBB_A_, BB_B_a, _B_Bbca, b_B_bca, bcbbca

The diﬀerence between (181) and (188) is regarded as inessential, however. Basi-

cally, only the constituent structure is really important, because it may give rise to

diﬀerent meanings, while diﬀerent derivations that yield the same structure will

give the same meaning.

The constituent structure is displayed by means of trees. Recall the deﬁnition

of a tree. It is a pair (T, <) where < is transitive (that is, if x < y and y < z then

also x < z), it has a root (that is, there is an x such that for all y ≠ x: y < x) and

furthermore, if x < y, z then either y < z, y = z or y > z. We say that x dominates

y if x > y; and that it immediately dominates y if it dominates y but there is no z

such that x dominates z and z dominates y.

Now, let us return to the constituent structure. Let C = (γ₁, γ₂) and D = (η₁, η₂)
be occurrences of substrings. We say that C is dominated by D, in sym-
bols C ≺ D, if C ≠ D and (1) η₁ is a preﬁx of γ₁ and (2) η₂ is a suﬃx of γ₂. (It
may happen that η₁ = γ₁ or that η₂ = γ₂, but not both.) Visually, what the deﬁni-
tion amounts to is that if one underlines the substring of C and the substring of D
then the latter line includes everything that the former underlines. For example,
let C = (abc, dx) and D = (a, x). Then C ≺ D, as they are diﬀerent and (1) and (2) are
satisﬁed. Visually this is what we get.

(189) C: abc_cd_dx    D: a_bccdd_x

Now, let the tree be deﬁned as follows. The set of nodes is the set Γ. The relation

< is deﬁned by ≺.


The trees used in linguistic analysis are often ordered. The ordering is here
implicitly represented in the string. Let C = (γ₁, γ₂) be an occurrence of σ and
D = (η₁, η₂) be an occurrence of τ. We write C ⊏ D and say that C precedes D
if γ₁σ is a preﬁx of η₁ (the preﬁx need not be proper). If one underlines C and D
this deﬁnition amounts to saying that the line of C ends before the line of D starts.

(190) a_bc_c_dd_x

Here C = (a, cddx) and D = (abcc, x).

19 Parsing and Recognition

Given a grammar G, and a string x, we ask the following questions:

• (Recognition:) Is x ∈ L(G)?

• (Parsing:) What derivation(s) does x have?

Obviously, as the derivations give information about the meaning associated with

an expression, the problem of recognition is generally not of interest. Still, some-

times it is useful to ﬁrst solve the recognition task, and then the parsing task. For if

the string is not in the language it is unnecessary to look for derivations. The pars-

ing problem for context free languages is actually not the one we are interested

in: what we really want is only to know which constituent structures are associ-

ated with a given string. This vastly reduces the problem, but still the remaining

problem may be very complex. Let us see how.

Now, in general a given string can have any number of derivations, even in-

ﬁnitely many. Consider by way of example the grammar

(191) A → A | a
It can be shown that if the grammar has no unary rules and no rules of the form
X → ε then a given string x has at most an exponential number of derivations. We shall

show that it is possible to eliminate these rules (this reduction is not semantically

innocent!). Given a rule X → ε and a rule that contains X on the right, say

Z → UXVX, we eliminate the ﬁrst rule (X → ε); furthermore, we add all rules


obtained by replacing any number of occurrences of X on the right by the empty

string. Thus, we add the rules Z → UVX, Z → UXV and Z → UV. (Since other

rules may have X on the left, it is not advisable to replace all occurrences of X

uniformly.) We do this for all such rules. The resulting grammar generates the

same set of strings, with the same set of constituents, excluding occurrences of

the empty string. Now we are still left with unary rules, for example, the rule

X → Y. Let ρ be a rule having Y on the left. We add the rule obtained by replacing

Y on the left by X. For example, let Y → UVX be a rule. Then we add the rule

X → UVX. We do this for all rules of the grammar. Then we remove X → Y.

These two steps remove the rules that do not expand the length of a string. We

can express this formally as follows. If ρ = X → α is a rule, we call |α| − 1 the

productivity of ρ, and denote it by p(ρ). Clearly, p(ρ) ≥ −1. If p(ρ) = −1 then

α = ε, and if p(ρ) = 0 then we have a rule of the form X → Y. In all other cases,

p(ρ) > 0 and we call ρ productive.
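The elimination of a rule X → ε can be sketched as follows (my names; a rule is a pair of a left-hand symbol and a tuple of right-hand symbols). Applied to Z → UXVX together with X → ε, it produces exactly the three additional rules named above.

```python
from itertools import product

def eliminate_epsilon(rules, nullable):
    """Remove the rule `nullable` -> eps, compensating as in the text.

    For every rule, every variant obtained by deleting some occurrences
    of `nullable` on the right hand side is added; variants with an
    empty right hand side are not created.
    """
    out = set()
    for lhs, rhs in rules:
        if lhs == nullable and rhs == ():
            continue                      # drop the eps-rule itself
        spots = [i for i, sym in enumerate(rhs) if sym == nullable]
        for keep in product((True, False), repeat=len(spots)):
            dropped = {i for i, k in zip(spots, keep) if not k}
            variant = tuple(s for i, s in enumerate(rhs) if i not in dropped)
            if variant:
                out.add((lhs, variant))
    return out

R = {('Z', ('U', 'X', 'V', 'X')), ('X', ())}
NEW = eliminate_epsilon(R, 'X')
```

Note that the two occurrences of X are treated independently, in line with the remark that it is not advisable to replace all occurrences of X uniformly.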

Now, if η is obtained in one step from γ by use of ρ, then |η| = |γ| + p(ρ).
Hence |η| > |γ| if p(ρ) > 0, that is, if ρ is productive. So, if the grammar only

contains productive rules, each step in a derivation increases the length of the

string, unless it replaces a nonterminal by a terminal. It follows that a string of

length n has derivations of length 2n − 1 at most. Here is now a very simple minded

strategy to ﬁnd out whether a string is in the language of the grammar (and to ﬁnd

a derivation if it is): let x be given, of length n. Enumerate all derivations of

length < 2n and look at the last member of the derivation. If x is found once,

it is in the language; otherwise not. It is not hard to see that this algorithm is

exponential. We shall see later that there are far better algorithms, which are

polynomial of order 3. Before we do so, let us note, however, that there are

strings which have exponentially many diﬀerent constituent structures, so that the task of

enumerating the derivations is exponential. However, it still is the case that we

can represent them in a very concise way, and this again takes only polynomial
time.

The idea behind the algorithm is surprisingly simple. Start with the string x. Scan
the string for a substring y which occurs on the right hand side of a rule ρ = X → y. Then

write down all occurrences C = (u, v) (which we now represent by pairs of posi-

tions — see above) of y and declare them constituents of category X. There is an

actual derivation that deﬁnes this constituent structure:

(192) uXv, uyv = x


We scan the string for each rule of the grammar. In doing so we have all possible

constituents for derivations of length 1. Now we can discard the original string,

and work instead with the strings obtained by undoing the last step. In the above

case we analyze uXv in the same way as we did with x.

In actual practice, there is a faster way of doing this. All we want to know

is what substrings qualify as constituents of some sort. The entire string is in

the language if it qualiﬁes as a constituent of category S for some S ∈ Σ. The

procedure of establishing the categories is as follows. Let x be given, length n.

Constituents are represented by pairs [i, δ], where i is the ﬁrst position and i +δ the

last. (Hence 0 < δ ≤ n.) We deﬁne a matrix M of dimension (n+1) ×n. The entry

m(i, j) is the set of categories that the constituent [i, j] has given the grammar G.

(It is clear that we do not need to ﬁll the entries m(i, j) where i + j > n. They

simply remain undeﬁned or empty, whichever is more appropriate.) The matrix

is ﬁlled inductively, starting with j = 1. We put into m(i, 1) all symbols X such

that X → x is a rule of the grammar, and x is the string between i and i + 1. Now

assume that we have ﬁlled m(i, k) for all k < j. Now we ﬁll m(i, j) as follows. For

every rule X → α check to see whether the string between the positions i and i + j

has a decomposition as given by α. This can be done by cutting the string into

parts and checking whether they have been assigned appropriate categories. For

example, assume that we have a rule of the form

(193) X → AbuXV

Then x = [i, j] is a string of category X if there are numbers k, m, n, p such that
[i, k] is of category A, [i + k, m] = bu (so m = 2), [i + k + m, n] is of category X and
[i + k + m + n, p] is of category V, and, ﬁnally, k + m + n + p = j. This involves
choosing three numbers, k, n and p, such that k + 2 + n + p = j, and checking
whether the entry m(i, k) contains A, whether m(i + k + 2, n) contains X and whether
m(i + k + 2 + n, p) contains V. The latter entries have been computed, so this is just
a matter of looking them up. Now, given k and n, p is ﬁxed since p = j − k − 2 − n.

There are O(j²) ways to choose these numbers. When we have ﬁlled the relevant
entries of the matrix, we look up the entry m(0, n). If it contains an S ∈ Σ the string
is in the language. (Do you see why?)

The algorithm just given is already polynomial. To see why, notice that in each
step we need to cut up a string into a given number of parts. Depending on the
number ν of nonterminals on the right, this takes O(n^(ν−1)) steps. There are O(n^2)
many steps. Thus, in total we have O(n^(ν+1)) many steps to compute, each of which


consumes time proportional to the size of the grammar. The best we can hope for
is to have ν = 2 in which case this algorithm needs time O(n^3). In fact, this can
always be achieved. Here is how. Replace the rule X → AbuXV by the following
set of rules.

(194) X → AbuY, Y → XV.

Here, Y is assumed not to occur in the set of nonterminals. We remark without
proof that the new grammar generates the same language. In general, a rule of the
form X → Y₁Y₂ · · · Yₙ is replaced by
(195) X → Y₁Z₁, Z₁ → Y₂Z₂, . . . , Zₙ₋₂ → Yₙ₋₁Yₙ

So, given this we can recognize the language in O(n^3) time!

Now, given this algorithm, can we actually ﬁnd a derivation or a constituent

structure for the string in question? This can be done: start with m(0, n). It con-

tains an S ∈ Σ. Pick one of them. Now choose i such that 0 < i < n and look up the

entries m(0, i) and m(i, n−i). If they contain A and B, respectively, and if S → AB

is a rule, then begin the derivation as follows:

(196) S , AB

Now underline one of A or B and continue with them in the same way. This is

a downward search in the matrix. Each step requires linear time, since we only

have to choose a point to split up the constituent. The derivation is linear in the

length of the string. Hence the overall time is quadratic! Thus, surprisingly, the

recognition consumes most of the work. Once that is done, parsing is easy.

Notice that the grammar transformation adds new constituents. In the case

of the rule above we have added a new nonterminal Y and added the rules (194).

However, it is easy to return to the original grammar: just remove all constituents

of category Y. It is perhaps worth examining why adding the constituents saves

time in parsing. The reason is simply that the task of identifying constituents

occurs over and over again. The fact that a sequence of two constituents has been

identiﬁed is knowledge that can save time later on when we have to ﬁnd such a

sequence. But in the original grammar there is no way of remembering that we

did have it. Instead, the new grammar provides us with a constituent to handle the

sequence.


Deﬁnition 24 A CFG is in standard form if all the rules have the form X →
Y₁Y₂ · · · Yₙ, with X, Yᵢ ∈ N, or X → x. If in addition n = 2 for all rules of the ﬁrst
form, the grammar is said to be in Chomsky Normal Form.

There is an easy way to convert a grammar into standard form. Just introduce a
new nonterminal Yₐ for each letter a ∈ A together with the rules Yₐ → a. Next
replace each terminal a that cooccurs with a nonterminal on the right hand side of
a rule by Yₐ. The new grammar generates more constituents, since letters that are

introduced together with nonterminals do not form a constituent of their own in

the old grammar. Such letter occurrences are called syncategorematic. Typical

examples of syncategorematic occurrences of letters are brackets that are inserted

in the formation of a term. Consider the following expansion of the grammar

(174).

< term > → < number > | (<term>+<term>) (197)
         | (<term>*<term>)

Here, operator symbols as well as brackets are added syncategorematically. The

procedure of elimination will yield the following grammar.

< term > → < number > | Y_( < term > Y_+ < term > Y_) (198)
         | Y_( < term > Y_* < term > Y_)
Y_( → (
Y_) → )
Y_+ → +
Y_* → *

Often, however, the conversion to standard form can be avoided. It is
mainly interesting for theoretical purposes.

Now, it may happen that a grammar uses more nonterminals than necessary.

For example, the above grammar distinguishes Y_+ from Y_*, but this is not neces-
sary. Instead the following grammar will do just as well.

< term > → < number > | Y_( < term > Y_o < term > Y_) (199)
Y_( → (
Y_) → )
Y_o → + | *


The reduction of the number of nonterminals has the same signiﬁcance as in the

case of ﬁnite state automata: it speeds up recognition, and this can be signiﬁcant

because not only the number of nonterminals is reduced but also the number of rules.

Another source of time eﬃciency is the number of rules that match a given

right hand side. If there are several rules, we need to add the left hand symbol for

each of them.

Deﬁnition 25 A CFG is invertible if for any pair of rules X → α and Y → α we

have X = Y.

There is a way to convert a given grammar into invertible form. The set of nonter-
minals is ℘(N), the powerset of the set of nonterminals of the original grammar.
The rules are
(200) S → T₀T₁ · · · Tₙ₋₁
where S is the set of all X such that there are Yᵢ ∈ Tᵢ (i < n) such that X →
Y₀Y₁ · · · Yₙ₋₁ ∈ R. This grammar is clearly invertible: for any given sequence
T₀T₁ · · · Tₙ₋₁ of nonterminals the left hand side S is uniquely deﬁned. What needs
to be shown is that it generates the same language (in fact, it generates the same
constituent structures, though with diﬀerent labels for the constituents).
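The heart of the construction is the map from a sequence of subset-labels to its unique left hand side. A sketch (my names; nonterminal sets are frozensets so they can serve as grammar symbols):

```python
def lhs_for(rules, seq):
    """The left hand side S of rule (200) for daughters labelled by seq.

    rules: a set of (X, (Y0, ..., Yn-1)); seq: a tuple of frozensets.
    S collects every X that has a rule X -> Y0 ... Yn-1 with Yi in Ti.
    """
    return frozenset(
        X for X, rhs in rules
        if len(rhs) == len(seq) and all(Y in T for Y, T in zip(rhs, seq))
    )

# X -> AB and Y -> AB share a right hand side, violating invertibility
R = {('X', ('A', 'B')), ('Y', ('A', 'B')), ('Z', ('A', 'A'))}
```

For the daughters ({A}, {B}) the mother is the single symbol {X, Y}: the two oﬀending rules have been merged, so no two rules of the new grammar share a right hand side.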

20 Greibach Normal Form

We have spoken earlier about diﬀerent derivations deﬁning the same constituent

structure. Basically, if in a given string we have several occurrences of nonter-

minals, we can choose any of them and expand it ﬁrst using a rule. This is

because the application of two rules that target the same string but diﬀerent non-

terminals commute:

(201)
             ⋯ X ⋯ Y ⋯
            ↙          ↘
    ⋯ α ⋯ Y ⋯        ⋯ X ⋯ γ ⋯
            ↘          ↙
             ⋯ α ⋯ γ ⋯

This can be exploited in many ways, for example by always choosing a particular

derivation. For example, we can agree to always expand the leftmost nonterminal,

or always the rightmost nonterminal.


Recall that a derivation deﬁnes a set of constituent occurrences, which in turn

constitute the nodes of the tree. Notice that each occurrence of a nonterminal

is replaced by some right hand side of a rule during a derivation that leads to a

terminal string. After it has been replaced, it is gone and can no longer ﬁgure in a

derivation. Given a tree, a linearization is an ordering of the nodes which results

from a valid derivation in the following way. We write x y if the constituent of

x is expanded before the constituent of y is. One can characterize exactly what it

takes for such an order to be a linearization. First, it is linear. Second if x > y then

also x y. It follows that the root is the ﬁrst node in the linearization.

Linearizations are closely connected with search strategies in a tree. We shall
present examples. The ﬁrst is a particular case of the so–called depth–ﬁrst search
and the linearization shall be called leftmost linearization. It is as follows: x ⊲ y
iﬀ x > y or x ⊏ y. (Recall that x ⊏ y iﬀ x precedes y. Trees are always considered
ordered.) For every tree there is exactly one leftmost linearization. We shall
denote the fact that there is a leftmost derivation of α from X by X ⊢_G α. We can
generalize the situation as follows. Let ≼ be a linear ordering uniformly deﬁned
on the leaves of local subtrees. That is to say, if B and C are isomorphic local trees
(that is, if they correspond to the same rule ρ) then ≼ orders the leaves of B linearly
in the same way as it orders the leaves of C (modulo the unique (!) isomorphism).
In the case of the leftmost linearization the ordering is the one given by ⊏. Now
a minute’s reﬂection reveals that every linearization of the local subtrees of a tree
induces a linearization of the entire tree but not conversely (there are orderings
which do not proceed in this way, as we shall see shortly). X ⊢≼_G α denotes
the fact that there is a derivation of α from X determined by ≼. Now call π a
priorization for G = (A, N, Σ, R) if π deﬁnes a linearization on the local tree H_ρ,
for every ρ ∈ R. Since the root is always the ﬁrst element in a linearization, we
only need to order the daughters of the root node, that is, the leaves. Let this
ordering be ≼. We write X ⊢π_G α if X ⊢≼_G α for the linearization ≼ deﬁned by π.

Proposition 26 Let π be a priorization. Then X ⊢^π_G x iff X ⊢_G x.

A diﬀerent strategy is the breadth–ﬁrst search. This search goes through the tree

in increasing depth. Let S_n be the set of all nodes x with d(x) = n. For each n, S_n shall be ordered linearly by ⊏. The breadth–first search is a linearization ∆, which is defined as follows. (a) If d(x) = d(y) then x ∆ y iff x ⊏ y, and (b) if d(x) < d(y) then x ∆ y. The difference between these search strategies, depth–first

and breadth–ﬁrst, can be made very clear with tree domains.


Definition 27 A tree domain is a set T of strings of natural numbers such that (i) if x is a prefix of y ∈ T then also x ∈ T, (ii) if xj ∈ T and i < j then also xi ∈ T. We define x > y iff x is a proper prefix of y, and x ⊏ y iff x = uiv and y = ujw for some sequences u, v, w and numbers i < j.

The depth–ﬁrst search traverses the tree domain in the lexicographical order, the

breadth–ﬁrst search in the numerical order. Let the following tree domain be

given.

(figure: the tree domain {ε, 0, 1, 2, 00, 10, 11, 20} drawn as a tree; ε has daughters 0, 1 and 2; node 0 has daughter 00, node 1 has daughters 10 and 11, and node 2 has daughter 20)

The depth–ﬁrst linearization is

(202) ε, 0, 00, 1, 10, 11, 2, 20

The breadth–ﬁrst linearization, however, is

(203) ε, 0, 1, 2, 00, 10, 11, 20
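The two linearizations can be computed mechanically from the addresses. Here is a small Python illustration (the encoding is mine: tree addresses are written as strings of single digits, so that plain string sorting happens to coincide with the lexicographical order):

```python
# The tree domain from the figure, addresses as strings of single digits.
domain = ["", "0", "1", "2", "00", "10", "11", "20"]

depth_first = sorted(domain)                                # lexicographical order
breadth_first = sorted(domain, key=lambda x: (len(x), x))   # numerical order

print(depth_first)    # ['', '0', '00', '1', '10', '11', '2', '20']
print(breadth_first)  # ['', '0', '1', '2', '00', '10', '11', '20']
```

Note that plain string comparison agrees with the depth-first order only because every label here is a single digit; addresses with multi-digit labels would need a proper componentwise comparison.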

Notice that with these linearizations the tree domain ω∗ cannot be enumerated.

Namely, the depth–ﬁrst linearization begins as follows.

(204) ε, 0, 00, 000, 0000, . . .

So we never reach 1. The breadth–ﬁrst linearization goes like this.

(205) ε, 0, 1, 2, 3, . . .

So, we never reach 00. On the other hand, ω∗ is countable, so we do have a

linearization, but it is more complicated than the given ones.


The ﬁrst reduction of grammars we look at is the elimination of superﬂuous

symbols and rules. Let G = (S, A, N, R) be a CFG. Call X ∈ N reachable if

G ⊢ αXβ for some α and β. X is called completable if there is an x such that X ⇒∗_R x.

(206)

S → AB A → CB

B → AB A → x

D → Ay C → y

In the given grammar A, C and D are completable, and S, A, B and C are reachable.

Since S, the start symbol, is not completable, the grammar generates no terminal strings.

Let N′ be the set of symbols which are both reachable and completable. If S ∉ N′ then L(G) = ∅. In this case we put N′ := {S} and R′ := ∅. Otherwise, let R′ be the restriction of R to the symbols from A ∪ N′. This defines G′ = (S, N′, A, R′).

Throwing away rules may make some nonterminals unreachable or uncompletable. Therefore, this process must be repeated until G′ = G, in which case every element is both reachable and completable. Call the resulting grammar G_s. It is clear that G ⊢ α iff G_s ⊢ α. Additionally, it can be shown that every derivation in G is a derivation in G_s and conversely.

Definition 28 A CFG is called slender if either L(G) = ∅ and G has no nonterminals except for the start symbol and no rules; or L(G) ≠ ∅ and every nonterminal is both reachable and completable.
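Both sets can be computed by a straightforward fixpoint iteration. The following Python sketch does this for grammar (206) (the encoding is an assumption of mine: uppercase letters are the nonterminals, lowercase letters the terminals):

```python
# Grammar (206); uppercase letters are nonterminals (encoding assumed).
rules = [
    ("S", "AB"), ("A", "CB"),
    ("B", "AB"), ("A", "x"),
    ("D", "Ay"), ("C", "y"),
]
nonterminals = {lhs for lhs, _ in rules}

# X is completable iff some rule X -> rhs has every nonterminal in rhs
# already known to be completable (terminals are trivially fine).
completable = set()
changed = True
while changed:
    changed = False
    for lhs, rhs in rules:
        if lhs not in completable and all(
                c not in nonterminals or c in completable for c in rhs):
            completable.add(lhs)
            changed = True

# X is reachable iff it occurs in a string derivable from S.
reachable = {"S"}
changed = True
while changed:
    changed = False
    for lhs, rhs in rules:
        if lhs in reachable:
            for c in rhs:
                if c in nonterminals and c not in reachable:
                    reachable.add(c)
                    changed = True

print(sorted(completable), sorted(reachable))
# ['A', 'C', 'D'] ['A', 'B', 'C', 'S']
```

The output agrees with the text: A, C and D are completable, while S, A, B and C are reachable; no start-symbol derivation terminates, so L(G) = ∅.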

Two slender grammars have identical sets of derivations iﬀ their rule sets are iden-

tical.

Proposition 29 Let G and H be slender. Then G = H iﬀ der(G) = der(H).

Proposition 30 For every CFG G there is an effectively constructible slender CFG G_s = (S, N_s, A, R_s) such that N_s ⊆ N, which has the same set of derivations as G. In this case it also follows that L_B(G_s) = L_B(G).

Deﬁnition 31 Let G = (S, N, A, R) be a CFG. G is in Greibach (Normal) Form if

every rule is of the form S → ε or of the form X → xY, where x ∈ A and Y ∈ N∗.


Proposition 32 Let G be in Greibach Normal Form. If X ⊢_G α then α has a leftmost derivation from X in G iff α = yY for some y ∈ A∗ and Y ∈ N∗, and y = ε only if Y = X.

The proof is not hard. It is also not hard to see that this property characterizes the

Greibach form uniquely. For if there is a rule of the form X → Yγ then there is a leftmost derivation of Yγ from X, but not in the desired form. Here we assume that there are no rules of the form X → X.

Theorem 33 (Greibach) For every CFG one can eﬀectively construct a grammar

G_g in Greibach Normal Form with L(G_g) = L(G).

Before we start with the actual proof we shall prove some auxiliary statements.

We call ρ an X–production if ρ = X → α for some α. Such a production is called left recursive if it has the form X → Xβ. Let ρ = X → α be a rule; define R^{−ρ} as follows. For every factorization α = α_1 Y α_2 of α and every rule Y → β add the rule X → α_1 β α_2 to R, and finally remove the rule ρ. Now let G^{−ρ} := (S, N, A, R^{−ρ}). Then L(G^{−ρ}) = L(G). We call this construction skipping the rule ρ. The reader may convince himself that the trees for G^{−ρ} can be obtained in a very simple way from trees for G, simply by removing all nodes x which dominate a local tree corresponding to the rule ρ, that is to say, which are isomorphic to H_ρ. This technique works only if ρ is not an S–production. In that case we proceed as follows. Replace ρ by all rules of the form S → β where β derives from α by applying a rule. Skipping a rule does not necessarily yield a new grammar. This is so if there are rules of the form X → Y (in particular rules like X → X).

Lemma 34 Let G = (S, N, A, R) be a CFG and let X → Xα_i, i < m, be all left recursive X–productions, as well as X → β_j, j < n, all non left recursive X–productions. Now let G_1 := (S, N ∪ {Z}, A, R_1), where Z ∉ N ∪ A and R_1 consists of all Y–productions from R with Y ≠ X as well as the productions

(207)  X → β_j (j < n),    Z → α_i (i < m),
       X → β_j Z (j < n),  Z → α_i Z (i < m).

Then L(G_1) = L(G).


Proof. We shall prove this lemma rather extensively since the method is relatively

tricky. We consider the following priorization on G_1. In all rules of the form X → β_j and Z → α_i we take the natural ordering (that is, the leftmost ordering), and in all rules X → β_j Z as well as Z → α_i Z we also put the left to right ordering, except that Z precedes all elements from β_j and α_i, respectively. This defines the linearization ≺. Now, let M(X) be the set of all γ such that there is a leftmost derivation from X in G in such a way that γ is the first element not of the form Xδ. Likewise, we define P(X) to be the set of all γ which can be derived from X priorized by ≺ in G_1 such that γ is the first element which does not contain Z. We claim that P(X) = M(X). It can be seen that

(208)  M(X) = ⋃_{j<n} β_j · (⋃_{i<m} α_i)∗ = P(X)

From this the desired conclusion follows thus. Let x ∈ L(G). Then there exists a

leftmost derivation Γ = (A_i : i < n + 1) of x. (Recall that the A_i are instances of rules.) This derivation is cut into segments Σ_i, i < σ, of length k_i, such that

(209)  Σ_i = (A_j : Σ_{p<i} k_p ≤ j < Σ_{p<i+1} k_p)

This partitioning is done in such a way that each Σ_i is a maximal portion of Γ of X–productions or a maximal portion of Y–productions with Y ≠ X. The X–segments can be replaced by a ≺–derivation Σ′_i in G_1, by the previous considerations. The segments which do not contain X–productions are already G_1–derivations. For them we put Σ′_i := Σ_i. Now let Γ′ be the result of stringing together the Σ′_i. This is well–defined, since the first string of Σ′_i equals the first string of Σ_i, and the last string of Σ′_i equals the last string of Σ_i. Γ′ is a G_1–derivation, priorized by ≺. Hence x ∈ L(G_1). The converse is proved analogously, by beginning with a derivation priorized by ≺.

Now to the proof of Theorem 33. We may assume at the outset that G is in

Chomsky Normal Form. We choose an enumeration of N as N = {X_i : i < p}. We claim first that by taking in new nonterminals we can see to it that we get a grammar G_1 such that L(G_1) = L(G), in which the X_i–productions have the form X_i → xY or X_i → X_j Y with j > i. This we prove by induction on i. Let i_0 be the smallest i such that there is a rule X_i → X_j Y with j ≤ i. Let j_0 be the largest j such that X_{i_0} → X_{j_0} Y is a rule. We distinguish two cases. The first is j_0 = i_0.

By the previous lemma we can eliminate the production by introducing some new


nonterminal symbol Z_{i_0}. The second case is j_0 < i_0. Here we apply the induction hypothesis on j_0. We can skip the rule X_{i_0} → X_{j_0} Y and introduce rules of the form (a) X_{i_0} → X_k Y′ with k > j_0. In this way the second case is either eliminated or reduced to the first.

Now let P := {Z_i : i < p} be the set of newly introduced nonterminals. It may happen that for some j, Z_j does not occur in the grammar, but this does not disturb the proof. Let finally P_i := {Z_j : j < i}. At the end of this reduction we have rules of the form

(210a)  X_i → X_j Y   (j > i)
(210b)  X_i → xY   (x ∈ A)
(210c)  Z_i → W   (W ∈ (N ∪ P_i)^+ (ε ∪ Z_i))

It is clear that every X_{p−1}–production already has the form X_{p−1} → xY. If some X_{p−2}–production has the form (210a) then we can skip this rule and get rules of the form X_{p−2} → xY′. Inductively we see that all rules of the form (210a) can be eliminated in favour of rules of the form (210b). Now finally the rules of type (210c). These rules, too, can be skipped, and then we get rules of the form Z → xY for some x ∈ A, as desired.

For example, let the following grammar be given.

(211)  S → SDA | CC    A → a
       D → DC | AB     B → b
       C → c

The production S → SDA is left recursive. We replace it according to the above lemma by

(212)  S → CCZ,  Z → DA,  Z → DAZ

Likewise we replace the production D → DC by

(213)  D → ABY,  Y → C,  Y → CY

With this we get the grammar

(214)  S → CC | CCZ    A → a
       Z → DA | DAZ    B → b
       D → AB | ABY    C → c
       Y → C | CY


Next we skip the D–productions.

(215)  S → CC | CCZ                      A → a
       Z → ABA | ABYA | ABAZ | ABYAZ     B → b
       D → AB | ABY                      C → c
       Y → C | CY

Next D can be eliminated (since it is not reachable) and we can replace on the right

hand side of the productions the ﬁrst nonterminals by terminals.

(216)  S → cC | cCZ
       Z → aBA | aBYA | aBAZ | aBYAZ
       Y → c | cY

Now the grammar is in Greibach Normal Form.
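The left recursion removal of Lemma 34, used twice in this example, is mechanical enough to state as code. The following Python sketch (the function name and the rule encoding are my own) applies it to the S–productions of grammar (211):

```python
# Lemma 34 as a procedure: eliminate the left recursive X-productions
# X -> X alpha_i by introducing a fresh nonterminal Z.
def remove_left_recursion(rules, x, z):
    """rules: list of pairs (lhs, rhs), where rhs is a tuple of symbols."""
    alphas = [rhs[1:] for lhs, rhs in rules if lhs == x and rhs[:1] == (x,)]
    betas = [rhs for lhs, rhs in rules if lhs == x and rhs[:1] != (x,)]
    rest = [(lhs, rhs) for lhs, rhs in rules if lhs != x]
    new = ([(x, b) for b in betas] + [(x, b + (z,)) for b in betas]
           + [(z, a) for a in alphas] + [(z, a + (z,)) for a in alphas])
    return rest + new

# The S-productions of grammar (211): S -> SDA | CC.
rules = [("S", ("S", "D", "A")), ("S", ("C", "C"))]
for lhs, rhs in remove_left_recursion(rules, "S", "Z"):
    print(lhs, "->", "".join(rhs))
```

The result consists of S → CC, S → CCZ, Z → DA and Z → DAZ, that is, the rules of (212) together with the retained non-recursive S–production.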

21 Pushdown Automata

Regular languages can be recognized by a special machine, the ﬁnite state automa-

ton. Recognition here means that the machine scans the string left–to–right and

when the string ends (the machine is told when the string ends) then it says ‘yes’

if the string is in the language and ‘no’ otherwise. (This picture is only accurate

for deterministic machines; more on that later.)

There are context free languages which are not regular, for example {a^n b^n : n ∈ N}. Devices that can check membership in L(G) for a CFG must therefore

be more powerful. The devices that can do this are called pushdown automata.

They are ﬁnite state machines which have a memory in form of a pushdown. A

pushdown memory is potentially inﬁnite. You can store in it as much as you like.

However, the operations that you can perform on a pushdown storage are limited.

You can see only the last item you added to it, and you can either put something

on top of that element, or you can remove that element and then the element that

was added before it becomes visible. This behaviour is also called LIFO (last–

in–ﬁrst–out) storage. It is realized for example when storing plates. You always

add plates on top, and remove from the top, so that the bottommost plates are only

used when there are a lot of people. It is easy to see that you can decide whether

a given string has the form a^n b^n given an additional pushdown storage. Namely,

you scan the string from left to right. As long as you get a, you put it on the


pushdown. Once you hit on a b you start popping the as from the pushdown, one

for each b that you ﬁnd. If the pushdown is emptied before the string is complete,

then you have more bs than as. If the pushdown is not emptied but the string is

complete, then you have more as than bs. So, you can tell whether you have a

string of the required form if you can tell whether you have an empty storage. We

assume that this is the case. In fact, what one typically does is to put into the storage, before starting, a special symbol #, the end-of-storage marker. The storage

is represented as a string over an alphabet D that contains #, the storage alphabet.

Then we are only in need of the following operations and predicates:

For each d ∈ D we have an operation push_d : x ↦ xd.
For each d ∈ D we have a predicate top_d, which is true of x iff x = yd.
We have an operation pop : xd ↦ x.

(If we do not have an end of stack marker, we also need a predicate 'empty', which is true of a stack x iff x = ε.)
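The a^n b^n test described above can be written out with exactly these operations. A minimal Python sketch (assumptions: a list plays the pushdown, '#' is the end-of-storage marker):

```python
# Decide membership in { a^n b^n : n in N } with one pushdown storage.
def check_anbn(s):
    stack = ["#"]                   # storage pre-filled with the marker
    i = 0
    while i < len(s) and s[i] == "a":
        stack.append("a")           # push_a
        i += 1
    while i < len(s) and s[i] == "b":
        if stack[-1] != "a":        # top_a fails: more b's than a's
            return False
        stack.pop()                 # pop
        i += 1
    # accept iff the whole string was consumed and only '#' remains
    return i == len(s) and stack == ["#"]

print(check_anbn("aaabbb"), check_anbn("aabbb"))  # True False
```

The two failure cases of the text show up directly: the storage emptying down to '#' too early (too many b's), and a leftover 'a' at the end (too many a's).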

Now, notice that the control structure is a ﬁnite state automaton. It schedules

the actions using the stack as a storage. This is done as follows. We have two

alphabets, A, the alphabet of letters read from the tape, and I, the stack alphabet,

which contains a special symbol, #. Initially, the stack contains one symbol, #. A transition instruction is a quintuple (s, c, t, s′, p), where s and s′ are states, c is a character or empty (the character read from the string), t is a character (read from the top of the stack), and finally p is an instruction to either pop from the stack or push a character (different from #) onto it. A PDA contains a set of instructions. Formally, it is defined to be a sextuple (A, I, Q, i_0, F, σ), where A is the input alphabet, I the stack alphabet, Q the set of states, i_0 ∈ Q the start state, F ⊆ Q the set of accepting states, and σ a set of instructions. If the automaton is reading a string, then it does the following. It is initialized with the stack containing # and the initial state i_0. Each instruction is an option for the machine to proceed. However, it can use that option only if it is in state s, if the topmost stack symbol is t and, if c ≠ ε, if the next character matches c (which is then consumed). The next state is s′ and the stack is determined from p. If p = pop, then the topmost symbol is popped; if it is push_a, then a is pushed onto the stack. PDAs can be nondeterministic. For a given

situation we may have several options. If given the current stack, the current state

and the next character there is at most one operation that can be chosen, we call


the PDA deterministic. We add here that theoretically the operation that reads the

top of the stack removes the topmost symbol. The stack really is just a memory

device. In order to look at the topmost symbol we actually need to pop it off the stack. However, if we put it back, then this is as if we had just 'peeked' into the top

of the stack. (We shall not go into the details here: but it is possible to peek into

any number of topmost symbols. The price one pays is an exponential blowup of

the number of states.)

We say that the PDA accepts x by state if it is in a ﬁnal state at the end of x.

To continue the above example, we put the automaton in an accepting state if after

popping as the topmost symbol is #. Alternatively, we say that the PDA accepts

x by stack if the stack is empty after x has been scanned. A slight modiﬁcation of

the machine results in a machine that accepts the language by stack. Basically, it

needs to put one a less than needed on the stack and then cancel # on the last move.

It can be shown that the class of languages accepted by PDAs by state is the same

as the class of languages accepted by PDAs by stack, although for a given machine

the two languages may be diﬀerent. We shall establish that the class of languages

accepted by PDAs by stack are exactly the CFGs. There is a slight problem in

that the PDAs might actually be nondeterministic. While in the case of ﬁnite state

automata there was a way to turn the machine into an equivalent deterministic

machine, this is not possible here. There are languages which are CF but cannot be

recognized by a deterministic PDA. An example is the language of palindromes:

{xx^T : x ∈ A∗}, where x^T is the reversal of x. For example, (abddc)^T = cddba.

The obvious mechanism is this: scan the input and start pushing the input onto the

stack until you are half through the string, and then start comparing the stack with

the string you have left. You accept the string if at the end the stack is #. Since

the stack is popped in reverse order, you recognize exactly the palindromes. The

trouble is that there is no way for the machine to know when to shift gear: it cannot

tell when it is half through the string. Here is the dilemma. Let x = abc. Then

abccba is a palindrome, but so is abccbaabccba and abccbaabccbaabccba. In

general, (abccba)^n is a palindrome. If you are scanning a word like this, there is no

way of knowing when you should turn and pop symbols, because the string might

be longer than you have thought.

It is for this reason that we need to review our notion of acceptance. First, we

say that a run of the machine is a series of actions that it takes, given the input. Al-

ternatively, the run speciﬁes what the machine chooses each time it faces a choice.

(The alternatives are simply diﬀerent actions and diﬀerent subsequent states.) A


machine is deterministic iﬀ for every input there is only one run. We say that the

machine accepts the string if there is an accepting run on that input. Notice that

the deﬁnition of a run is delicate. Computers are not parallel devices, they can

only execute one thing at a time. They also are deterministic. The PDA has the

same problem: it chooses a particular run but has no knowledge of what the out-

come would have been had it chosen a diﬀerent run. Thus, to check whether a run

exists on an input we need to emulate the machine, and enumerate all possible runs

and check the outcome. Alternatively, we keep a record of the run and backtrack.

Now back to the recognition problem. To show the theorem we use the Greibach

Normal Form. Observe the following. The Greibach Form always puts a terminal

symbol in front of a series of nonterminals. We deﬁne the following machine.

Its stack alphabet is N, and the initial stack content is the start symbol of G. Now let X → aY_0 Y_1 ··· Y_{n−1} be a rule. Then we translate this rule into the following actions. Whenever the scanned symbol is a and the topmost symbol of the stack is X, pop that symbol from the stack and put Y_{n−1}, then Y_{n−2}, then Y_{n−3}, and so on, onto the stack. (To schedule this correctly, the machine needs to go through several steps, since each step allows it to put only one symbol onto the stack. But we ignore that finesse here.) Thus the last symbol put on the stack is Y_0, which is then visible. It can be shown that the machine ends with an empty stack on input x iff there is a leftmost derivation of x that corresponds to the run.

Let us see an example. Take the grammar

(217)

0 S → cC

1 S → cCZ

2 Z → aBA

3 Z → aBYA

4 Z → aBAZ

5 Z → aBYZ

6 Y → c

7 Y → cY

8 A → a

9 B → b

10 C → c

We take the string ccabccaba. Here is a leftmost derivation (to the left we show


the string, and to the right the last rule that we applied):

(218)

S
cCZ 1
ccZ 10
ccaBYZ 5
ccabYZ 9
ccabcYZ 7
ccabccZ 6

ccabccaBA 2

ccabccabA 9

ccabccaba 8

The PDA is parsing the string as follows. (We show the successful run.) The

stack is initialized to S (Line 1). It reads c and deletes S, but puts ﬁrst Z and then

C on it. The stack now is ZC (leftmost symbol is at the bottom!). We are in Line

2. Now it reads c and deletes C from stack (Line 3). Then it reads a and pops

Z, but pushes ﬁrst Z, then Y and then B (Line 4). Then it pops B on reading b,

pushing nothing on top (Line 5). It is clear that the strings to the left represent the

following: the terminal string is that part of the input string that has been read,

and the nonterminal string is the stack (in reverse order). As said, this is just one

of the possible runs. There is an unsuccessful run which starts as follows. The

stack is S. The machine reads c and decides to go with Rule 0: so it pops S but

pushes only C. Then it reads c, and is forced to go with Rule 10, popping oﬀ C,

but pushing nothing onto it. Now the stack is empty, and no further operation can

take place. That run fails.

Thus, we have shown that context free languages can be recognized by PDAs

by empty stack. The converse is a little trickier, we will not spell it out here.
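The successful and the failed run above can both be reproduced with a small simulation. The following Python sketch is a toy, not the formal PDA of the definition: it collapses the multi-step pushes into one step and explores all runs by backtracking, accepting by empty stack, using the rules of grammar (217):

```python
# Rules of grammar (217), keyed by (nonterminal, terminal); each body is
# the string of nonterminals that replaces the popped symbol.
RULES = {
    ("S", "c"): ["C", "CZ"],                  # rules 0, 1
    ("Z", "a"): ["BA", "BYA", "BAZ", "BYZ"],  # rules 2-5
    ("Y", "c"): ["", "Y"],                    # rules 6, 7
    ("A", "a"): [""],                         # rule 8
    ("B", "b"): [""],                         # rule 9
    ("C", "c"): [""],                         # rule 10
}

def accepts(word, stack="S"):
    """Accept by empty stack; backtracking explores all runs."""
    if not word:
        return stack == ""
    if not stack:
        return False               # the dead end of the failed run
    top, rest = stack[0], stack[1:]   # stack[0] is the visible top symbol
    return any(accepts(word[1:], body + rest)
               for body in RULES.get((top, word[0]), []))

print(accepts("ccabccaba"), accepts("ccab"))  # True False
```

The recursive calls enumerate exactly the runs discussed in the text; the run that chooses Rule 0 on the first c dies with an empty stack and unread input, while the run through Rule 1 succeeds.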

22 Shift–Reduce–Parsing

We have seen above that by changing the grammar to Greibach Normal Form we

can easily implement a PDA that recognizes the strings using a pushdown of the

nonterminals. It is not necessary to switch to Greibach Normal Form, though. We

can translate the grammar directly into a PDA. Also, grammar and automaton are

not uniquely linked to each other. Given a particular grammar, we can deﬁne quite


diﬀerent automata that recognize the languages based on that grammar. The PDA

implements what is often called a parsing strategy. The parsing strategy makes

use of the rules of the grammar, but depending on the strategy in quite different

ways. One very popular method of parsing is the following. We scan the string,

putting the symbols one by one on the stack. Every time we hit a right hand side

of a rule we undo the rule. That is, we do the following: suppose the top of the

stack contains the right hand side of a rule (in reverse order). Then we pop that

sequence and push the left–hand side of the rule on the stack. So, if the rule is

A → a and we get a, we ﬁrst push it onto the stack. The top of the stack now

matches the right hand side, we pop it again, and then push A. Suppose the top of

the stack contains the sequence BA and there is a rule S → AB, then we pop twice,

removing that part and push S. To do this, we need to be able to remember the top

of the stack. If a rule has two symbols to its right, then the top two symbols need

to be remembered. We have seen earlier that this is possible, if the machine is

allowed to do some empty transitions in between. Again, notice that the strategy

is nondeterministic in general, because several options can be pursued at each

step. (a) Suppose the top part of the stack is BA, and we have a rule S → AB. Then

either we push the next symbol onto the stack, or we use the rule that we have. (b)

Suppose the top part of the stack is BA and we have the rules S → AB and C → A.

Then we may either use the ﬁrst or the second rule.

The strongest variant is to always reduce when the right hand side of a rule is

on top of the stack. Despite not being always successful, this strategy is actually

useful in a number of cases. The condition under which it works can be spelled out

as follows. Say that a CFG is transparent if for every string α and nonterminal X with X ⇒∗ α, if there is an occurrence of a substring γ, say α = κ_1 γ κ_2, and if there is a rule Y → γ, then (κ_1, κ_2) is a constituent occurrence of category Y in α. This

means that up to inessential reorderings there is just one parse for any given string

if there is any. An example of a transparent language is arithmetical terms where

no brackets are omitted. Polish Notation and Reverse Polish Notation also belong

to this category. A broader class of languages where this strategy is successful is the class of NTS–languages. A grammar has the NTS–property if whenever there is a string S ⇒∗ κ_1 γ κ_2 and Y → γ is a rule, then S ⇒∗ κ_1 Y κ_2. (Notice that this does

not mean that the constituent is a constituent under the same parse; it says that one

can ﬁnd a parse that ends in Y being expanded in this way, but it might just be a

diﬀerent parse.) Here is how the concepts diﬀer. There is a grammar for numbers

that runs as follows.

(219)  < digit > → 0 | 1 | ··· | 9
       < number > → < digit > | < number >< number >

This grammar has the NTS–property. Any digit is of category < digit >, and a

number can be broken down into two numbers at any point. The standard deﬁni-

tion is as follows.

(220)  < digit > → 0 | 1 | ··· | 9
       < number > → < digit > | < digit >< number >

This grammar is not an NTS–grammar, because in the expression

(221)  < digit >< digit >< digit >

the occurrence (< digit >, ε) of the string < digit >< digit > cannot be a

constituent occurrence under any parse. Finally, the language of numbers has no

transparent grammar! This is because the string 732 possesses occurrences of the

strings 73 and 32. These occurrences must be occurrences under one and the same

parse if the grammar is transparent. But they overlap. Contradiction.
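The "always reduce" strategy on the NTS grammar (219) can be sketched directly (Python; the encoding of <number> as the single symbol 'N' is mine):

```python
# The NTS grammar (219): <digit> -> 0 | ... | 9,
# <number> -> <digit> | <number><number>.  'N' encodes <number>.
DIGITS = set("0123456789")

def parse_number(s):
    stack = []
    for ch in s:
        stack.append(ch)                 # shift
        while True:                      # reduce as long as possible
            if stack[-2:] == ["N", "N"]:
                stack[-2:] = ["N"]       # <number> -> <number><number>
            elif stack[-1] in DIGITS:
                stack[-1] = "N"          # <number> -> <digit>; <digit> -> ch
            else:
                break
    return stack == ["N"]

print(parse_number("732"), parse_number(""))  # True False
```

Because any two adjacent numbers may be fused into one, eagerly reducing never paints the parser into a corner here; with grammar (220) the same greedy strategy would fail.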

Now, the strategy above is deterministic. It is easy to see that there is a non-

deterministic algorithm implementing this idea that actually deﬁnes a machine

parsing exactly the CFLs. It is a machine where you only have a stack, and sym-

bols scanned are moved onto the stack. The top k symbols are visible (with PDAs

we had k = 1). If the last m symbols, m ≤ k, are the right hand side of a rule,

you are entitled to replace it by the corresponding left hand side. (This style of

parsing is called shift–reduce parsing. One can show that PDAs can emulate a

shift–reduce parser.) The stack gets cleared by m − 1 symbols, so you might end

up seeing more of the stack after this move. Then you may either decide that once

again there is a right hand side of a rule which you want to replace by the left hand

side (reduce), or you want to see one more symbol of the string (shift). In general,

like with PDAs, the nondeterminism is unavoidable.

There is an important class of grammars, the so–called LR(k)–grammars. These are grammars where the question whether to shift or to reduce can be based on a

lookahead of k symbols. That is to say, the parser might not know directly if it

needs to shift or to reduce, but it can know for sure if it sees the next k symbols (or

whatever is left of the string). One can show that for a language for which there is an LR(k)–grammar with k > 0 there is also an LR(1)–grammar. So, a lookahead of just

one symbol is enough to make an informed guess (at least with respect to some

grammar, which however need not generate the constituents we are interested in).


If the lookahead is 0 and we are interested in acceptance by stack, then the PDA

also tells us when the string should be finished: in consuming the last letter it should know that it is the last, because it will erase the symbol # at

the last moment. Unlike nondeterministic PDAs which may have alternative paths

they can follow, a deterministic PDA without lookahead must make a firm guess:

if the last letter is there it must know that this is so. This is an important class.

Languages of this form have the following property: no proper prefix of a string of the language is a string of the language. The language B = {a^n b^n : n ∈ N} of balanced strings is in LR(0), while for example B^+ is not (it contains the string abab, which has a proper prefix ab).

23 Some Metatheorems

It is often useful to know whether a given language can be generated by grammars of a given type. For example: how do we decide whether a language is regular? The following is a very useful criterion.

Theorem 35 Suppose that L is a regular language. Then there exists a number

k such that every x ∈ L of length at least k has a decomposition x = uvw with nonempty v such that for all n ∈ N:

(222)  uv^n w ∈ L

Before we enter the proof, let us see some consequences of this theorem. First,

we may choose n = 0, in which case we get uw ∈ L.

The proof is as follows. Since L is regular, there is a fsa A such that L = L(A).

Let k be the number of states of A. Let x = x_0 x_1 ··· x_{n−1} be a string of length n ≥ k. If x ∈ L then there is an accepting run of A on x:

(223)  q_0 →[x_0] q_1 →[x_1] q_2 →[x_2] q_3 ··· q_{n−1} →[x_{n−1}] q_n

This run visits n + 1 states. But A has only k ≤ n states, so there is a state which has been visited twice. There are i and j such that i < j and q_i = q_j. Then put u := x_0 x_1 ··· x_{i−1}, v := x_i x_{i+1} ··· x_{j−1} and w := x_j x_{j+1} ··· x_{n−1}. We claim that there exists an accepting run for every string of the form uv^n w. For example,

(224)  q_0 →[u] q_i = q_j →[w] q_n
(225)  q_0 →[u] q_i →[v] q_j = q_i →[w] q_n
(226)  q_0 →[u] q_i →[v] q_j = q_i →[v] q_j →[w] q_n
(227)  q_0 →[u] q_i →[v] q_j = q_i →[v] q_j = q_i →[v] q_j = q_i →[w] q_n
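The extraction of a pumpable decomposition from a run can be made concrete. The following Python sketch uses a hypothetical two-state automaton (accepting the strings over {a, b} with an even number of a's) and splits the input at the first repeated state of the run:

```python
# A two-state DFA: state counts the a's modulo 2; state 0 accepts.
def delta(state, ch):
    return state ^ 1 if ch == "a" else state

def accepts(word):
    state = 0
    for ch in word:
        state = delta(state, ch)
    return state == 0

def decompose(word):
    # Follow the run and split u, v, w at the first repeated state.
    run = [0]
    for ch in word:
        run.append(delta(run[-1], ch))
    seen = {}
    for pos, q in enumerate(run):
        if q in seen:
            i, j = seen[q], pos
            return word[:i], word[i:j], word[j:]
        seen[q] = pos

u, v, w = decompose("abab")   # splits 'abab' as ('a', 'b', 'ab')
print(accepts(u + w), accepts(u + v * 3 + w))  # True True
```

Exactly as in runs (224)-(227), the loop v can be skipped or repeated any number of times without leaving the language.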

There are examples of this kind in natural languages. An amusing example is from

Steven Pinker. In the days of the arms race one produced not only missiles but also anti missile missiles, to be used against missiles; and then anti anti missile missile missiles to attack anti missile missiles. And to attack those, one needed anti anti anti missile missile missile missiles. And so on. The general recipe for these expressions is as follows:

(228)  (anti )^n (missile )^n missile

We shall show that this is not a regular language. Suppose it is regular. Then there is a k satisfying the conditions above. Now take the word (anti )^k (missile )^k missile. There must be a subword of nonempty length that can be omitted or repeated without punishment. No such word exists: let us break the original string into a prefix (anti )^k of length 5k and a suffix (missile )^k missile of length 8(k+1)−1. The entire string has length 13k + 7. The string we take out must therefore have length 13n. We assume for simplicity that n = 1; the general argument is similar. It is easy to see that it must contain the letters of anti missile. Suppose we decompose the original string as follows:

(229)  u = (anti )^{k−1},  v = anti missile ,  w = (missile )^{k−1} missile

Then uw is of the required form. Unfortunately,

(230)  uv^2 w = (anti )^{k−1} anti missile anti missile (missile )^{k−1} missile ∉ L

Similarly for any other attempt to divide the string.


This proof can be simpliﬁed using another result. Consider a map µ : A → B

∗

,

which assigns to each letter a ∈ A a string (possibly empty) of letters from B. We

extend µ to strings as follows.

(231) µ(x

0

x

1

x

2

x

n−1

) = µ(x

0

)µ(x

1

)µ(x

2

) µ(x

n−1

)

For example, let B = {c, d}, and µ(c) := anti and µ(d) := missile. Then

(232) µ(d) = missile
(233) µ(cdd) = anti missile missile
(234) µ(ccddd) = anti anti missile missile missile

So if M = {c^k d^(k+1) : k ∈ N} then the language above is the µ-image of M (modulo the blank at the end).

Theorem 36 Let µ : A → B∗ and L ⊆ A∗ be a regular language. Then the set {µ(v) : v ∈ L} ⊆ B∗ also is regular.

However, we can also do the following: let ν be the map

(235) a ↦ c, n ↦ ε, t ↦ ε, i ↦ ε, m ↦ d, s ↦ ε, l ↦ ε, e ↦ ε, ␣ ↦ ε

(where ␣ is the blank). Then

(236) ν(anti) = c, ν(missile) = d, ν(␣) = ε
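For concreteness, the map ν can also be written down in OCaML. This is only an illustration; the names nu_letter and nu_string are mine, and the blank is treated as just another letter.

```ocaml
(* The letter map nu of (235): a goes to "c", m to "d",
   and every other letter (n, t, i, s, l, e, blank) to the empty string. *)
let nu_letter = function
  | 'a' -> "c"
  | 'm' -> "d"
  | _   -> ""

(* Extension to strings, as in (231): apply nu letterwise and concatenate. *)
let nu_string s =
  String.concat "" (List.map nu_letter (List.init (String.length s) (String.get s)))
```

So nu_string maps "anti missile" to "cd", mirroring (236).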

So, M is also the image of L under ν. Now, to disprove that L is regular it is enough to show that M is not regular. The proof is similar. Choose a number k. We show that the conditions are not met for this k; and since k is arbitrary, the condition is not met for any number. We take the string c^k d^(k+1). We try to decompose it into uvw such that uv^j w ∈ M for every j. Three cases are to be considered. (a) v = c^p for some p (which must be > 0):

(237) cc⋯c • c⋯c • c⋯cdd⋯d

Then uw = c^(k−p) d^(k+1), which is not in M. (b) v = d^p for some p > 0: similarly. (c) v = c^p d^q, with p, q > 0:

(238) cc⋯c • c⋯cd⋯d • dd⋯d


Then uv^2 w contains a substring dc, so it is not in M. Contradiction in all cases. Hence, k does not satisfy the conditions.

A third theorem is this: if L and M are regular, so is L ∩ M. This follows from Theorem 13. For if L and M are regular, so are A∗ − L and A∗ − M. Then so is (A∗ − L) ∪ (A∗ − M) and, again by Theorem 13,

(239) A∗ − ((A∗ − L) ∪ (A∗ − M)) = L ∩ M

This can be used in the following example. Infinitives of German are stacked inside each other as follows:

(240) Maria sagte, dass Hans die Kinder spielen ließ.
(241) Maria sagte, dass Hans Peter die Kinder spielen lassen ließ.
(242) Maria sagte, dass Hans Peter Peter die Kinder spielen lassen lassen ließ.
(243) Maria sagte, dass Hans Peter Peter Peter die Kinder spielen lassen lassen lassen ließ.

Call the language of these strings H. Although it is from a certain point on hard to follow what the sentences mean, they are grammatically impeccable. We shall show that this language is not regular. From this we shall deduce that German as a whole is not regular; namely, we claim that H is the intersection of German with the following language:

(244) Maria sagte, dass Hans (Peter)^k die Kinder spielen (lassen)^k ließ.

Hence if German is regular, so is H. But H is not regular. The argument is similar

to the ones we have given above.

Now, there is a similar property of context free languages, known as the

Pumping Lemma or Ogden’s Lemma. I shall give only the weakest form of

it.

Theorem 37 (Pumping Lemma) Suppose that L is a context free language. Then there is a number k ≥ 0 such that every string x ∈ L of length at least k possesses a decomposition x = uvwyz with v ≠ ε or y ≠ ε such that for every j ∈ N (which can be zero):

(245) uv^j w y^j z ∈ L

The condition v ≠ ε or y ≠ ε must be put in, otherwise the claim is trivial: every language has the property for k = 0. Simply put u = x, v = w = y = z = ε. Notice however that u, w and z may be empty.

I shall briefly indicate why this theorem holds. Suppose that x is very large (for example: let π be the maximum length of the right hand side of a production and ν = |N| the number of nonterminals; then put k > π^(ν+1)). Then any tree for x contains a branch leading from the top to the bottom that has length > ν + 1. Along this branch you will find that some nonterminal label, say A, must occur twice. This means that the string is cut up in this way:

(246) x_0 x_1 ⋯ x_{i−1} ◦ x_i x_{i+1} ⋯ x_{j−1} • x_j ⋯ x_{ℓ−1} • x_ℓ ⋯ x_{m−1} ◦ x_m ⋯ x_{k−1}

where the pair of • encloses a substring of label A and the ◦ encloses another one.

So,

(247) x_i x_{i+1} ⋯ x_{m−1} and x_j ⋯ x_{ℓ−1}

have that same nonterminal label. Now we deﬁne the following substrings:

(248) u := x_0 x_1 ⋯ x_{i−1} , v := x_i ⋯ x_{j−1} , w := x_j ⋯ x_{ℓ−1} , y := x_ℓ ⋯ x_{m−1} , z := x_m ⋯ x_{k−1}

Then one can easily show that uv^j w y^j z ∈ L for every j. We mention also the following theorems.

Theorem 38 Let L be a context free language over A and M a regular language. Then L ∩ M is context free. Also, let µ : A → B∗ be a function. Then the image of L under µ is context free as well.

I omit the proofs. Instead we shall see how the theorems can be used to show that certain languages are not context free. A popular example is {a^n b^n c^n : n ∈ N}.

Suppose it is context free. Then we should be able to name a k ∈ N such that for all strings x of length at least k a decomposition of the kind above exists. Now, we take x = a^k b^k c^k. We shall have to find appropriate v and y. First, it is easy to see that vy must contain the same number of a, b and c (can you see why?). The product is nonempty, so at least one a, b and c must occur. We show that uv^2 w y^2 z ∉ L. Suppose v contains a and b. Then v^2 contains a b before an a. Contradiction. Likewise, v cannot contain a and c. So, v contains only a. Now y contains b and c. But then y^2 contains a c before a b. Contradiction. Hence, uv^2 w y^2 z ∉ L.

Now, the Pumping Lemma is not an exact characterization. Here is a language that satisfies the test but is not context free:

(249) C := {xx : x ∈ A∗}

This is known as the copy language. We shall leave it as an assignment to show that C fulfills the conclusion of the pumping lemma. We concentrate on showing that it is not context free. The argument is a bit involved. Basically, we take a string as in the pumping lemma and fix a decomposition so that uv^j wy^j z ∈ C for

every j. The idea is now that no matter what string y_0 y_1 ⋯ y_{n−1} we are given, if it contains a constituent A:

(250) y_0 y_1 ⋯ y_{p−1} • y_p ⋯ y_{q−1} • y_q ⋯ y_{n−1}

then we can insert the pair v, y like this:

(251) y_0 y_1 ⋯ y_{p−1} • x_i ⋯ x_{j−1} ◦ y_p ⋯ y_{q−1} ◦ x_ℓ ⋯ x_{m−1} • y_q ⋯ y_{n−1}

Now one can show that since there are a limited number of nonterminals we are

bound to ﬁnd a string which contains such a constituent where inserting the strings

is inappropriate.

Natural languages do contain a certain amount of copying. For example,

Malay (or Indonesian) forms the plural of a noun by reduplicating it. Chinese

has been argued to employ copying in yes-no questions.


a version of OCaML installed on his or her computer and has a version of the manual. If help is needed in installing the program I shall be of assistance.

The choice of OCaML perhaps needs comment, since it is not so widely known. In the last twenty years the most popular language in computational linguistics (at least as far as introductions were concerned) was certainly Prolog. However, its lack of imperative features makes its runtime performance rather bad for the problems that we shall look at (though this keeps changing, too). Much of actual programming nowadays takes place in C or C++, but these two are not so easy to understand, and moreover take a lot of preparation. OCaML actually is rather fast, if speed is a concern. However, its main features that interest us here are:

- It is completely typed. This is particularly interesting in a linguistics course, since it offers a view into a completely typed universe. We shall actually begin by explaining this aspect of OCaML.
- It is completely functional. In contrast to imperative languages, functional languages require more explicitness in the way recursion is handled. Rather than saying that something needs to be done, one has to say how it is done.
- It is object oriented. This aspect will not be so important at earlier stages, but turns into a plus once more complex applications are looked at.

2

Practical Remarks Concerning OCaML

When you have installed OCaML, you can invoke the program in two ways: In Windows by clicking on the icon for the program. This will open an interactive window much like the xterminal in Unix. by typing ocaml after the prompt in a terminal window. Either way you have a window with an interactive shell that opens by declaring what version is running, like this: (1) Objective Caml version 3.09.1


It then starts the interactive session by giving you a command prompt: #. It is then waiting for you to type in a command. For example (2) # 4+5;;

It will execute this and return to you as follows: (3) - : int = 9

So, it gives you an answer (the number 9), which is always accompanied by some indication of the type of the object. (Often you ﬁnd that you get only the type information, see below for reasons.) After that, OCaML gives you back the prompt. Notice that your line must end in ;; (after which you have to hit < return >, of course). This will cause OCaML to parse it as a complete expression and it will try to execute it. Here is another example. (4) # let a = ’q’;;

In this case, the command says that it should assign the character q to the object with name a. Here is what OCaML answers: (5) val a : char = ’q’

Then it will give you the prompt again. Notice that OCaML answers without using the prompt, #. This is how you can tell your input from OCaML's answers. Suppose you type (6) # Char.code a;;

Then OCaML will answer (7) - : int = 113

Then it will give you the prompt again. At a certain point this way of using OCaML turns out to be tedious. For even though one can type in entire programs this way, there is no way to correct mistakes once a line has been entered. Therefore, the following is advisable to do. First, you need to use an editor (I recommend to use either emacs or vim; vim runs on all platforms, and it can be downloaded and installed with no charge

and with a mouseclick). Editors that Windows provides by default are not good: either they do not let you edit a raw text file (like Word or the like) or they do not let you choose where to store data (like Notepad). So, go for an independent editor, install it (you can get help on this too from us), and then do the following. Using the editor, open a file <myfile>.ml (it has to have the extension .ml). Type in your program or whatever part of it you want to store separately. Then, whenever you want OCaML to use that program, type after the prompt:

(8) # #use "<myfile>.ml";;

(The line will contain two #: one is the prompt of OCaML, which obviously you do not enter, and the other is from you! And no space in between the # and the word use.) Notice that the file name must be enclosed in quotes. This is because OCaML expects a string as input for #use and it will interpret the string as a filename. Moreover, the extension, even though technically superfluous, has to be put in there as well. This will cause OCaML to look at the content of the file as if it had been entered from the command line. OCaML will process it and return a result, or, more often, some error message, upon which you have to go back to your file, edit it, write it, and load it again. Of course there are higher level tools available, but this technique is rather effective and fast enough for the beginning practice.

You may reuse the program, because you may want to keep certain parts separate. You may have stored your program in several places. You can load the files independently, but only insofar as the dependencies are respected: if you load a program that calls functions that have not been defined, OCaML will tell you so. So, make sure you load the files in the correct order. Moreover, if you make changes and load a file again, this will just overwrite the assignments you have made earlier, so you may have to reload every file that depends on it (you will see why if you try). Everything is left associative, as we will discuss below.

Notice that OCaML goes through your program three times. First, it parses the program syntactically. At this stage it will only tell you if the program is incorrect syntactically. The second time OCaML will check the types; here it may hit type mismatches (because some of them are not apparent at first inspection). If there is a type mismatch, it returns to you and tells you so. After it is done with the types it is ready to execute the program. Here it may hit run time errors such as accessing an item that

isn't there (the most frequent mistake).

3 Welcome To The Typed Universe

In OCaML, every expression is typed. There are certain basic types, but types can also be created. The type is named by a string. The following are inbuilt types (the list is not complete; you find a description in the user manual):

(9)
character   char
string      string
integer     int
boolean     bool
float       float

There are conventions to communicate a token of a given type. Characters must be enclosed in single quotes. 'a', for example, refers to the character a. Strings must be enclosed in double quotes: "mail" denotes the string mail. The distinction matters: "a" is a string containing the single letter a. Although identical in appearance to the character a, they must be kept distinct. An integer is typed in as such, for example 10763. Notice that "10763" is a string, not a number! Booleans have two values, typed in as follows: true and false. They are distinct from integers. Finally, float is the type of floating point numbers. For example, 1.0 is a number of type float, while 1 is a number of type int. They are distinct, although they may have identical values.

The typing is very strict. Whenever something is declared to be of a certain type, it can only be used with operations that apply to things of that type. For things can only be equal via = if they have identical type. Type in 1.0 = 1, and OCaML will respond with an error message:

(10) # 1.0 = 1;;
This expression has type int but is here used with type float

And it will underline the offending item. It parses expressions from left to right and stops at the first point where things do not match: first it sees a number of type float and expects that to the right of the equation sign there also is a number of type float. When it reads the integer it complains.
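As a sketch, here are tokens of each basic type bound to identifiers; the names are arbitrary, and the annotations only record what OCaML would infer anyway.

```ocaml
let c : char   = 'a'       (* single quotes: a character *)
let s : string = "10763"   (* double quotes: a string, not a number *)
let i : int    = 10763     (* an integer, typed in as such *)
let f : float  = 1.0       (* a floating point number, distinct from the int 1 *)
let b : bool   = true      (* booleans: true and false *)
(* Writing 1.0 = 1 here would be rejected:
   = requires both sides to have identical type. *)
```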

OCaML immediately sets out to type the expression you have given. For example, you wish to define a function which you name f, and which adds a number to itself. This is what happens:

(11) # let f x = x + x;;
val f : int -> int = <fun>

This says that f is a function from integers to integers. OCaML knows this because + is only defined on integers. It will not do any type conversion by itself. Addition on float is +.:

(12) # let g x = x +. x;;
val g : float -> float = <fun>

If OCaML cannot infer the type, or if the function is polymorphic (it can be used on more than one type), it writes 'a in place of a specific type:

(13) # let iden x = x;;
val iden : 'a -> 'a = <fun>

There exist type variables, which are 'a, 'b and so on. There also exists an option to make subtypes inherit properties from the supertype, but OCaML needs to be explicitly told to do that.

Let us say something about types in general. Types are simply terms of a particular signature. For example, a very popular signature in Montague grammar consists in just two so-called type constructors, → and •. A type constructor is something that makes new types from old types. Both → and • are binary type constructors: they require two types each. (There are also unary type constructors, so this is only an example.) In addition to type constructors we need basic types. The full definition is as follows.

Definition 1 (Types) Let B be a set, the set of basic types. Then Typ→,•(B), the set of types over B, with type constructors → and •, is the smallest set such that
• B ⊆ Typ→,•(B), and
• if α, β ∈ Typ→,•(B) then also α → β, α • β ∈ Typ→,•(B).
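The inferred typings of (11)-(13) can be reproduced in a source file; a minimal sketch:

```ocaml
let f x = x + x      (* int -> int: + is integer addition *)
let g x = x +. x     (* float -> float: +. is float addition *)
let iden x = x       (* 'a -> 'a: polymorphic, works at any type *)
```

Note that no type annotations are needed: OCaML infers all three types from the operations used.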

Each type is interpreted by particular elements. Typically, the elements of different types are different from each other. (This is when no type conversion occurs. But see below.) We interpret each type b ∈ B by a set M_b in such a way that M_a ∩ M_b = ∅ if a ≠ b. Notice, for example, that OCaML has a basic type char and a basic type string, which get interpreted by the set of ASCII characters, and strings of ASCII characters, respectively. We may construe strings over characters as functions from numbers to characters. The string k, for example, is the function f : {0} → M_char such that f(0) = k. Hence, even though a single character is thought of in the same way as the string of length 1 with that character, they are now physically distinct. OCaML makes sure you respect the difference by displaying the string as "k" and the character as 'k'. The quotes are not part of the object.

Given two types α and β we can form the type of functions from α-objects to β-objects. These functions have type α → β. Hence, we say that

(14) M_{α→β} = { f : f is a function from M_α to M_β }

If x is of type α and y is of type β ≠ α then x is distinct from y. If x is an element of type α → β and y is an element of type α, then x is a function that takes y as its argument. In this situation the string for x followed by the string for y (but separated by a blank) is well-formed for OCaML, and it is of type β.

For example, we can define a function f by saying that f(x) = 2x + 3. The way to do this in OCaML is by issuing

(15) # let f x = (2 * x) + 3;;
val f : int -> int = <fun>

Notice that OCaML inferred the type from the definition of the function. Now we understand better what OCaML is telling us: it says that the value of the symbol f is of type int -> int, and that it is a function. Then f 43 is well-formed, as is (f 43) or even (f (43)). OCaML allows you to enclose the string in brackets, and sometimes it is even necessary to do so. Type that in and OCaML answers:

(16) # f 43;;
- : int = 89

The result is an integer: (2 * x) + 3 is an integer, because x is an integer and 2 is, hence (2 * x) is an integer, and so is (2 * x) + 3. Its value here is 89. You can give the function f any term that evaluates to an integer. The functions * and + are actually predefined; you can look them up in the manual. It tells you that their type is int -> int -> int: they are functions from integers to functions from integers to integers.

Now, if α and β are types, so is α • β. This is the product type. It contains the pairs (x, y) such that x is of type α and y of type β:

(17) M_{α•β} = M_α × M_β = { (x, y) : x ∈ M_α, y ∈ M_β }

OCaML's own notation is * in place of •. Like function types, product types do not have to be officially created to be used. Type in, for example, ('a', 7) and this will be your response:

(18) - : char * int = ('a', 7)

This means that OCaML has understood that your object is composed of two parts: the left hand part, which is a character, and the right hand part, which is a number. Likewise, if you issue let h = (3,4) then OCaML will respond val h : int * int = (3, 4), saying that h has the product type int * int. It is possible to explicitly define such a type. This comes in the form of a type declaration:

(19) # type prod = int * int;;
type prod = int * int

Since there is nothing more to say, OCaML just repeats what you have said. It means it has understood you, and prod is now reserved for pairs of integers. The declaration just binds the string prod to the type of pairs of integers. Notice that OCaML will not say that h is of type prod, even if that follows from the definition. However, such explicit definition hardly makes sense for types that can be inferred anyway.

There is often a need to access the first and second components of a pair. For that there exist functions fst and snd. These functions are polymorphic: for every product type α • β they are defined, and fst maps the element onto its first component, and snd onto its second component.

An alternative to the pair constructor is the constructor for records. It is most instructive to look at a particular example:

(20) type car = {brand : string; vintage : int; used : bool};;

This defines a type structure called car, which has three components: a brand, a vintage and a usedness value. On the face of it, a record can be replicated with products: for example, the record type car can be replicated by either string * (int * bool) or by (string * int) * bool. But there are also other differences. One is that the order in which you give the arguments is irrelevant. The other is that the names of the projections can be defined by yourself. This is very useful. Moreover, notice that records can have any number of entries, while a pair must consist of exactly two. For example, you can declare that mycar is a car, by issuing

(21) # let mycar = {brand = "honda"; vintage = 1977; used = false};;

This will bind mycar to the element of type car. Then the expression mycar.brand has the value honda. (To communicate that, it has to be enclosed in quotes; otherwise it is mistaken for an identifier.)

(23) # mycar.brand;;
- : string = "honda"

It is not legal to omit the specification of some of the values. Try, for example:

(22) # let mycar = {brand = "honda"};;
Some record field labels are undefined: vintage used

This is not a warning: the object named "mycar" has not been defined, and asking for it afterwards yields Unbound value mycar.

Of course, if you look carefully, you will see that OCaML has many more type constructors, some of which are quite intricate. One is the disjunction, written |. We shall use the more suggestive ∪. We have

(24) M_{α∪β} = M_α ∪ M_β

Notice that by definition objects of type α are also objects of type α ∪ β. Hence what we said before about types keeping everything distinct is not accurate.
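The car record of (20)-(23) can be collected in a file as follows:

```ocaml
(* The record type of (20): field order is irrelevant when building values. *)
type car = { brand : string; vintage : int; used : bool }

(* (21): all fields must be given, or the definition is rejected. *)
let mycar = { brand = "honda"; vintage = 1977; used = false }

(* (23): projections are named by the field labels themselves. *)
let its_brand = mycar.brand
```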

In fact, what is closer to the truth is that types offer a classifying system for objects. An object x is of type α, in symbols x : α, if and only if x ∈ M_α. Each type comes with a range of possible objects that fall under it. The basic types are disjoint by definition, but the interpretation of complex types is inferred from the rules, so far (14), (17) and (24). The interpretation of basic types must be given.

There is another constructor, the list constructor. It is a postfix operator (it is put after its argument). Given a type, it returns the type of lists over that type. You can check this by issuing

(25) # type u = int list;;
type u = int list

This binds the string u to the type of lists of integers. Lists are communicated using [ and ]. For example, [3;2;4] is a list of integers, whose first element (called head) is 3, whose second element is 2, and whose third and last element is 4. ['a';'g';'c'] is a list of characters. There is a constant, [ ], which denotes the empty list, while ["mouse"] is a list containing just a single element, the string mouse. You may think of lists over α as sequences from the numbers 0, 1, ..., n − 1 to members of α; we could thus think of them as members of int → α. Notice that this interpretation makes them look like functions from integers to M_α. However, this is not the proper way to think of them, and OCaML does not look at it this way: it keeps lists as distinct objects, and it calls the type constructor list.

The double colon denotes the function that appends an element to a list: 3::[4;5] equals the list [3;4;5]. The element to the left of the double colon will be the new head of the list, the list to the right the so-called tail. List concatenation is denoted by @. As with pairs, there are ways to handle lists, and OCaML has inbuilt functions to do this.

You cannot easily mix elements. Try, for example, typing

(26) let h = [3; "st"];;
This expression has type string but is here used with type int

Namely, OCaML tries to type the expression. It opens a list beginning with 3, so it thinks this is a list of integers. The next item is a string, and it is not an integer, so no match and OCaML complains. It can be done somehow, but then OCaML needs to be taken by the hand. (It could form the disjunctive type all = int | char | string and then take the object to be a list of all objects. But it does not do that.)

4 Function Definitions

Functions can be defined in OCaML in various ways. The first is to say explicitly what it does, as in the following case:

(27) # let appen x = x ^ "a";;

This function takes a string and appends the letter a. Notice that OCaML interprets the first identifier after let as the name of the function to be defined and treats the remaining ones as arguments to be consumed. There is another way to achieve the same, using a constructor much like the λ-abstractor:

(28) # let app = (fun x -> x ^ "a");;

The second method has the advantage that a function like app can be used without having to give it a name: you can replace app wherever it occurs by (fun x -> x ^ "a"). Notice that to the left the variable is no longer present, so the variables are bound. The choice between the methods is basically one of taste and conciseness.

There are further methods. One is to use if··· then··· else, which can be iterated as often as necessary. The next possibility is to use a definition by cases, namely the construct match··· with. You can use it to match objects with variables, pairs with pairs of variables, and lists with head and tail (iterated to any degree needed).

Recursion is the most fundamental tool to defining a function. If you do logic, one of the first things you get told is that just about any function on the integers can be defined by recursion, in fact by what is known as primitive recursion, from only one function: the successor function! Abstractly, to define a function by recursion is to do the following. Suppose that f(x) is a function of one argument. Then you have to say what f(0) is, and second, you have to say what f(n + 1) is on the basis of the value f(n). Here is an example. The function that adds x to y, written x + y, can be defined by recursion as follows:

(29) 0 + y := y, (n + 1) + y := (n + y) + 1

At first it is hard to see that this is indeed a definition by recursion. So, let us write 's' for the successor function. Then the clauses are

(30) 0 + y := y, s(n) + y := s(n + y)
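The definition styles above can be sketched side by side; describe is my own example name for a small match··· with definition.

```ocaml
let appen x = x ^ "a"              (* explicit, named definition, as in (27) *)
let app = (fun x -> x ^ "a")       (* the same function via fun, as in (28) *)

(* A definition by cases with match: lists are matched
   against the empty list and against head :: tail. *)
let describe l = match l with
  | []     -> "empty"
  | _ :: _ -> "nonempty"
```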

Indeed, to know the value of f(0) = 0 + y you just look up y. The value of f(n + 1) = (s(n)) + y is defined using the value f(n) = n + y: just take that value and add 1 (that is, form the successor). As usual, definitions by recursion are very important. In defining a function by recursion you have to tell OCaML that you intend such a definition rather than a direct definition. For a definition by recursion means that the function calls itself during evaluation. For example, if you want to define exponentiation for numbers, you can do the following:

(31) let rec expt n x = if x = 0 then 1 else n * (expt n (x-1));;

Type this and then say expt 2 4 and OCaML replies:

(32) - : int = 16

The principle of recursion is widespread and can be used in other circumstances as well. For example, a function can be defined over strings by recursion. What you have to do is the following: you say what the value is of the empty string, and then you say what the value is of f(x^a) on the basis of f(x), where x is a string variable and a a letter. Likewise, functions on lists can be defined by recursion.

Now, recursions need to be grounded, otherwise the program keeps calling itself over and over again. This is why the above definition is not good. In OCaML there is no type of natural numbers, just that of integers, and negative arguments are accepted. If the definition was just given for natural numbers, there would be no problem. But in the present example negative numbers let the recursion loop forever: the evaluation of expt 2 (-3) never terminates, because the value on which the recursion is based is further removed from the base (here 0). Here is what happens:

(33) # expt 2 (-3);;
Stack overflow during evaluation (looping recursion?)

One way to prevent this is to check whether the second argument is negative, and then rejecting the input. It is however good style to add a diagnostic so that you
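The recursive definition (31), together with a list function whose recursion is grounded by a clause for the empty list (sum is my own example name):

```ocaml
(* (31): exponentiation by recursion on the second argument.
   Note it is only grounded for x >= 0. *)
let rec expt n x = if x = 0 then 1 else n * (expt n (x - 1))

(* Recursion over a list: the clause for [] grounds the definition. *)
let rec sum l = match l with
  | []        -> 0
  | x :: rest -> x + sum rest
```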

can find out why the program rejects the input. To do this, OCaML lets you define so-called exceptions. This is done as follows:

(34) exception Input_is_negative;;

Exceptions always succeed, and they succeed by OCaML typing their name. So, we improve our definition as follows:

(35) let rec expt n x =
if x < 0 then raise Input_is_negative
else if x = 0 then 1
else n * (expt n (x-1));;

Now look at the following dialog:

(36) # expt 2 (-3);;
Exception: Input_is_negative

Thus, OCaML quits the computation and raises the exception that you have defined (there are also inbuilt exceptions). In this way you can make every definition total.

However, you do need to make sure that the recursion ends somewhere. Recursive definitions of functions on strings or lists are automatically grounded if you include a basic clause for the value on the empty string or the empty list, respectively. It is a common source of error to omit such a clause: otherwise you will not get a reply. Beware! Moreover, recursion can be done in many different ways: for example, it does not need to be character by character. You can define a function that takes away two characters at a time, or even more.

5 Modules

Modules are basically convenient ways to package the programs. When the program gets larger and larger, it is sometimes difficult to keep track of all the names and identifiers. Moreover, one may want to identify parts of the programs that

The latter are names of exceptions (see below). for example the module List. However. The way to do this is to use a module. For example. The way you call it is however by List.next_speaker. let rec length_aux len = function [] −> len | a::l −> length_aux (len + 1) l (38) . The way this is done is interesting. objects and so on).ml.. It contains deﬁnitions of objects and functions and whatever we like. The ﬁle list. Instead OCaML creates this module from the ﬁle list.length. since it sits inside the module List.ml (yes. the module may not be called seminar. The syntax is as follows: (37) module name-of-module = struct any sequence of deﬁnitions end.5. while others must start with an uppercase–letter. its content will be available for you in the form of a module whose name is obtained from the ﬁlename by raising the initial letter to upper case (whence the ﬁle name list. You may look at the source code of this module by looking up where the libraries are stored. the ﬁle is named with a lower case letter!) and a ﬁle list. Modules 14 can be used elsewhere. The ﬁrst is the plain source code for the list module. You can look at it and learn how to code the functions. If you compile this ﬁle separately. we have to write Seminar.mli. Some names must start with a lower–case letter (names of functions. Lets look at the code. On my PC they sit in usr/local/lib/ocaml There you ﬁnd a ﬁle called list. This function you can apply to a list and get its length. suppose we have deﬁned a module called Seminar.mli contains the type interface. and names of modules. The ﬁrst function that you see deﬁned is length. So. for example a function next_speaker. you will search in vain for an explicit deﬁnition for such a module. In order to use this function outside of the module. In the manual you ﬁnd a long list of modules that you can use.ml as opposed to the module name List). It is useful at this point to draw attention to the naming conventions (see the reference manual). 
More on that later.
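To see (37) in action, here is a minimal module in the spirit of the Seminar example. The body of next_speaker is invented purely for illustration; the text only says that Seminar contains such a function.

```ocaml
module Seminar = struct
  (* speakers are numbered 0, 1, 2, ...; an assumption made for this sketch *)
  let next_speaker current = current + 1
end

let () =
  (* outside the module, the qualified name Seminar.next_speaker must be used *)
  assert (Seminar.next_speaker 0 = 1);
  assert (Seminar.next_speaker 1 = 2)
```

Inside the module the function can be called plainly as next_speaker; only code outside needs the Seminar. prefix.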

The definition of length itself is

     let length l = length_aux 0 l

Two things here deserve comment. First, the function is defined via an auxiliary function. You would expect therefore to be able to use a function List.length_aux as well. But you cannot. The answer to this puzzle is provided by the file list.mli. Apart from the commentary (which OCaML ignores anyway) it only provides a list of functions and their types. This is the type interface. It tells OCaML which of the functions defined in list.ml are public. By default all functions are public (for example, if you did not make a type interface file). If a type interface exists, only the listed functions can be used outside of the module.

The second surprise is the definition of the function length_aux. It is written with only one argument, but at one occasion uses two! The answer is that a function definition of the form let f x = ... specifies that x is the first argument of f, but the resulting type of the function may itself be a function and so take further arguments. Thus, you need not mention all arguments. However, dropping arguments must proceed from the rightmost boundary. For example, suppose we have defined the function exp x y, which calculates x^y. The following function will yield [5; 25; 125] for the list [1; 2; 3]:

(39) let lexp l = List.map (exp 5) l

You may even define this function as follows:

(40) let lexp = List.map (exp 5)

However, if you want to define a function that raises every member of the list to the fifth power, this requires more thought, for then we must abstract from the innermost argument position. Here is a solution:

(41) let rexp = List.map (fun y -> exp y 5)

A more abstract version is this:

(42) let switch f x y = f y x;;
     let rexp = List.map ((switch exp) 5)

The function switch switches the order of the arguments of a function (in fact of any function). The so defined reversal can be applied in the same way as the original, with arguments now in reverse order.
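The definitions (39)-(42) can be run as they stand once exp is supplied. The integer power function below is our own stand-in for the exp x y of the text.

```ocaml
(* our own helper: exp x y computes x^y on integers *)
let rec exp x y = if y = 0 then 1 else x * exp x (y - 1)

(* (40): the rightmost argument of lexp is dropped *)
let lexp = List.map (exp 5)

(* (41): abstracting from the innermost argument position *)
let rexp = List.map (fun y -> exp y 5)

(* (42): the switch combinator reverses the argument order *)
let switch f x y = f y x
let rexp' = List.map ((switch exp) 5)

let () =
  assert (lexp [1; 2; 3] = [5; 25; 125]);
  assert (rexp [1; 2; 3] = [1; 32; 243]);
  assert (rexp' [1; 2; 3] = [1; 32; 243])
```

Note that rexp and rexp' agree on every list: switch exp 5 is exactly fun y -> exp y 5.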

I have said above that a module also has a signature. You may explicitly define signatures in the following way.

(43) module type name-of-signature = sig any sequence of type definitions end;;

The way you tell OCaML that this is not a definition of a module is by saying module type. (The sequence of definitions may also be empty.) The statements that you may put inside a signature are type and val. The first declares a ground type (which can actually also be abstract!), and the second the derived type of a function. Such a declaration is abstract: it just tells OCaML what kind of functions and objects reside in a module of this type.

OCaML is partitioned into three parts: the core language, the basic language, and extensions. Everything that does not belong to the core language is added in the form of a module. When you invoke OCaML it will automatically make available the core and everything in the basic language. The manual lists several dozens of such modules. Since you are always free to add more modules, the manual does not even attempt to list all modules; instead it lists the more frequent ones (that come with the distribution).

6 Sets and Functors

Perhaps the most interesting type constructor is that of sets. To see a complex example of types and type definitions, take a look at set.mli. (If you look into list.mli, by contrast, you will find that it declares no types, only functions.) In type theory we can simply say that, as for lists, there is a unary constructor s such that

(44) M_s(α) = ℘(M_α)

However, this is not the way it can be done in OCaML. The reason is that OCaML needs to be told how the elements of a set are ordered, for OCaML will internally store a set as a binary branching tree so that it can search for elements efficiently. This, however, means that we pay a price. To see why the ordering is necessary, look at the way we write sets. We write sets linearly, for example {∅, {∅}}. The two occurrences of ∅ are two occurrences of the same set.

Yet you have to write it down twice, since the set is written out linearly. OCaML stores sets in a particular way, here in form of a binary branching tree. If you want to access an element, you can say, for example: take the least of the elements. This picks out an element, and it picks out exactly one. The latter is important because OCaML operates deterministically: every operation you define must be total and deterministic. So OCaML demands from you that you order the elements linearly, in advance. You can order them in any way you please, but given two distinct elements a and b, either a < b or b < a must hold. This is needed to access the set. The best way to think of a set is as a list of objects ordered in a strictly ascending sequence.

If elements must always be ordered, how can we arrange the ordering? Here is how.

(45) module OPStrings = struct
       type t = string * string
       let compare x y =
         if x = y then 0
         else if fst x > fst y || (fst x = fst y && snd x > snd y) then 1
         else -1
     end;;

(The indentation is just for aesthetic purposes and not necessary.) This is what OCaML answers:

(46) module OPStrings :
       sig
         type t = string * string
         val compare : 'a * 'b -> 'a * 'b -> int
       end

It says that there is now a module called OPStrings with the following signature: there is a type t and a function compare, and their types inside the signature are given. The signature is given inside sig ... end. A signature, by the way, is a set of functions together with their types. Now what is this program doing for us? It defines a module, which is a complete unit that has its own name; we have given it the name OPStrings.

First, we say what entity we make the members from; here they are pairs of strings. Next, we define the function compare. It must have integer values: its value is 0 if the two arguments are considered equal, 1 if the left argument follows the right hand argument in the order, and -1 otherwise. To define the predicate compare we make use of two things: first, there is an ordering predefined on strings, so strings can be compared; second, we use the projection functions fst and snd to access the first and the second component of a pair.

Now, to make sets of the desired kind, we issue

(47) module PStringSet = Set.Make(OPStrings);;

OCaML answers with a long list. It imports into PStringSet all set functions that it has predefined in the module Set, for example Set.add, which adds an element to a set, or Set.empty, which denotes the empty set. The functions are now called PStringSet.add, PStringSet.empty, and so on. The type of sets is PStringSet.t; it is abstract. All this works because Make is a functor, defined inside the module Set. A functor is a function that makes new modules from existing modules; functors are thus more abstract than modules. The functor Make allows to create sets over entities of any type.

First, type

(48) # let st = PStringSet.empty;;

and this binds st to the empty set. If you want to have the set {(cat, mouse)} you can issue

(49) # PStringSet.add ("cat", "mouse") PStringSet.empty;;

Alternatively, you can say

(50) # PStringSet.add ("cat", "mouse") st;;

However, what will not work is if you type

(51) # let st = PStringSet.add ("cat", "mouse") st;;

This is because OCaML does not have dynamic assignments like imperative languages: st has been assigned already, and you cannot change its binding. How dynamic assignments can be implemented will be explained later.
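The whole construction (45)-(50) fits into a few lines; the assertions at the end stand in for the toplevel dialog.

```ocaml
module OPStrings = struct
  type t = string * string
  let compare x y =
    if x = y then 0
    else if fst x > fst y || (fst x = fst y && snd x > snd y) then 1
    else -1
end

(* the functor Make builds a set module from the ordered type *)
module PStringSet = Set.Make (OPStrings)

let () =
  let st = PStringSet.add ("cat", "mouse") PStringSet.empty in
  assert (PStringSet.mem ("cat", "mouse") st);
  assert (not (PStringSet.mem ("dog", "bone") st))
```

PStringSet.mem is one of the functions imported by the functor; it tests membership.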

It is important to realize that the program has no idea how to make itself understood to you if you ask it for the value of an object of a newly defined type. If you want to actually see the set, you have to tell OCaML how to show it to you. One way of doing that is to convert the set into a list: there is a function called elements that converts the set into a list.

(52) # let h = PStringSet.elements st;;

Since OCaML has a predefined way of communicating lists (which we explained above), you can now look at the elements without trouble. However, what you are looking at are members of a list, not of the set from which the list was compiled. This can be a source of mistakes in programming. Beware!

Also, notice that PStringSet.compare takes its arguments to the right: the argument immediately to its right is the one that shows up to the right in infix notation. That is to say, PStringSet.compare f g is the same as g < f in normal notation. Moreover, once sets are defined, a comparison predicate is available on them: the sets themselves are also ordered linearly by PStringSet.compare. This is useful, for it makes it easy to define sets of sets.

7 Hash Tables

This section explains some basics about hash tables. They are an implementation of a fast look up procedure. Suppose there is a function which is based on a finite look up table (so, there are finitely many possible inputs) and you want to compute this function as fast as possible. This is the moment when you want to consider using hash tables. First you need to declare from what kind of objects to what kind of objects the function is working. Also, you sometimes (but not always) need to issue a function that assigns every input value a unique number (called a key). So, you define first a module of inputs, after which you issue Hashtbl.Make. The module of inputs looks as follows.

(53) module HashedTrans = struct
You make the hash table using magical incantations similar to those for sets.

       type t = int * char
       let equal x y = (fst x = fst y) && (snd x = snd y)
       let hash x = (256 * (fst x)) + (int_of_char (snd x))
     end;;

     module HTrans = Hashtbl.Make(HashedTrans);;

8 Combinators

Combinators are functional expressions that use nothing but application of a function to another. They can be defined without any symbols except brackets. If f is a function and x some object, then f x or (f x) denotes the result of applying f to x. f x might be another function, which can be applied to something else; (f x) y is the result of applying it to y. Bracketing is left associative in general, and brackets may be dropped in this case. This is actually the convention that is used in OCaML. Functions can only be applied to one argument at a time. The most common combinators are I, K and S. They are defined as follows.

(54) I x   := x
(55) K xy  := x
(56) S xyz := xz(yz)

This can be programmed straightforwardly as follows. (I give you a transcript of a session; the val lines are the answers by OCaML.)

(57) # let i x = x;;
     val i : 'a -> 'a = <fun>
     # let k x y = x;;
     val k : 'a -> 'b -> 'a = <fun>
     # let s x y z = x z (y z);;
     val s : ('a -> 'b -> 'c) -> ('a -> 'b) -> 'a -> 'c = <fun>

Let us look carefully at what OCaML says. First, notice that it types every combinator right away, using the symbols 'a, 'b and so on. This is because, as such, the combinators can be used on any input.

Second, OCaML returns the type using the right (!) associative bracketing convention: it drops as many brackets as it possibly can. Using maximal bracketing, the last type would be

(58) (('a -> ('b -> 'c)) -> (('a -> 'b) -> ('a -> 'c)))

Notice also that the type of the combinator is not entirely trivial. You may check, however, that with the type assignment given everything works well. Now, having introduced these combinators, we can apply them:

(59) # k i 'a' 'b';;
     - : char = 'b'

This is because KIa = I and Ib = b. On the other hand, you may verify that

(60) # k (i 'a') 'b';;
     - : char = 'a'

A few words on notation. Evidently, we typically write S xyz = xz(yz). But to communicate this to OCaML you must insert spaces between the variable identifiers except where brackets intervene; otherwise the whole sequence will be read as a single identifier. OCaML knows which is the function and which are the argument variables because the first identifier after let is interpreted as the value to be bound, and the sequence of identifiers up to the equation sign is interpreted as the argument variables.

There are certain conventions on identifiers. Everything that you define and that is not a concrete thing (like a string or a digit sequence) is written as a plain string, and the string is called an identifier. It identifies the object. Identifiers must always be unique; if you define an identifier twice, only the last assignment will be used. Keeping identifiers unique can be difficult, so there are ways to ensure that one does not make mistakes. It is evident that an identifier can contain neither blanks nor brackets: if you have a sequence of identifiers, you must always type a space between them unless there is a symbol that counts as a separator (like brackets), and separators cannot be part of legal identifiers. Moreover, identifiers for functions and variables must begin with a lower case letter. So, if you try to define combinators named I, K or S, OCaML will complain about a syntax error. (Read Page 89-90 on the issue of naming conventions.) Finally, OCaML incidentally tells you that comments are enclosed in (* and *), with no intervening blanks between these two characters; otherwise the input is misread.
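The session of (57)-(60) can be replayed as a compiled program; the last assertion adds the classical observation that SKK behaves like the identity.

```ocaml
let i x = x
let k x y = x
let s x y z = x z (y z)

let () =
  assert (k i 'a' 'b' = 'b');    (* (59): KIa = I, and Ib = b *)
  assert (k (i 'a') 'b' = 'a');  (* (60): K(Ia)b = Ia = a *)
  assert (s k k 'c' = 'c')       (* SKK z = Kz(Kz) = z *)
```

Note that SKK is typable even though SII is not; the type mismatch discussed below arises only when both arguments of S are the identity.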

You can of course now define

(61) # let p x y = k i x y;;
     val p : 'a -> 'b -> 'b = <fun>

and this defines the function p x y = y.

We have defined i, k, and s in this way. Every let clause defines some function, but the definition involves saying what the function does on its arguments. The identifiers x, y and z that we have used there are unbound outside of the let clause. There are no free variables! So, even though OCaML does not have free variables, it does have variables, and we have met them already: upon the definition of k, for instance, OCaML issued its type as 'a -> 'b -> 'a. But these are variables over types, not over actual objects, and they can be renamed at will. Consequently, you cannot calculate abstractly using the inbuilt functions of OCaML: you cannot ask OCaML to evaluate a function abstractly, because it does not know that when you type k i x y the letters x and y are meant as variables. You can, however, apply combinators to concrete values.

One thing you can do, though, is check whether an expression is typable. For example, the combinator SII has no type: no matter how we assign the primitive types, the right hand side of its definition is not well defined. OCaML has this to say:

(62) # s i i;;
     This expression has type ('a -> 'b) -> 'a -> 'b but is used here
     with type ('a -> 'b) -> 'a

How does this arise? OCaML matches the definition of s against its first argument, i. The first argument of s must be of type 'a -> 'b -> 'c; since the argument is actually i, this can only be the case if 'a = 'b -> 'c. But if this is so, then the first argument actually has type ('b -> 'c) -> 'b -> 'c, and the next argument must have type ('b -> 'c) -> 'b. That next argument is again i, and OCaML informs us that it has type ('a -> 'b) -> 'a -> 'b instead (OCaML did some renaming here, using 'a and 'b in place of 'b and 'c). The types do not match, so OCaML rejects the input as not well formed.

Let us return to k. OCaML typed it as 'a -> 'b -> 'a. This means that it is a function from objects of type 'a to objects of type 'b -> 'a. It is therefore legitimate to apply it to just one object:

(63) # k "cat";;
     - : '_a -> string = <fun>

This means that if you give it another argument, the result is a string (because the result is "cat"). You can check this by assigning an identifier to the function and applying it:

(64) # let katz = k "cat";;
     val katz : '_a -> string = <fun>
     # katz "mouse";;
     - : string = "cat"

Effectively, this means that you define functions by abstraction. You can define the order of the arguments in any way you want; however, once you have made a choice, you must strictly obey that order. (The way in which you present the arguments may therefore be important.)

9 Objects and Methods

The role of variables is played in OCaML by objects. Objects are abstract data structures; they can be passed around and manipulated. However, OCaML has to be told every detail about an object: how it is accessed, how it can communicate the identity of that object to you, and how it can manipulate objects. In fact, this is the way in which objects are defined. Here is a program that defines a natural number:

(65) class number =
       object
         val mutable x = 0
         method get = x
         method succ = x <- x + 1
         method assign d = x <- d
       end;;

This does the following: it defines a data type number. The object of this type can be manipulated by three so-called methods: get, which gives you the value; succ, which adds 1 to the value; and assign, which allows you to assign any value you want to it. The latter takes an argument (a method can have several parameters), but the argument must be an integer. You may want to find out how OCaML knows that x must be an integer. Ask it to display the class, for this is what it will say:

(66) class number :
       object
         val mutable x : int
         method assign : int -> unit
         method get : int
         method succ : unit
       end

Notice that it uses the type unit. This is the type that has no object associated with it; you may think of it as the type of actions. For succ asks OCaML to add the number 1 to the number, and that is an action rather than a value. Notice also that the class definition (66) asserts that the object contains some object x which can be changed (it is declared mutable) and which is set to 0 upon the creation of the object.

Once we have defined the class, we can have as many objects of that class as we want. Look at the following dialog.

(67) # let m = new number;;
     val m : number = <obj>
     # m#get;;
     - : int = 0

The first line binds the identifier m to an object of class number, and OCaML repeats this. The third line asks for the value of m. Notice the syntax: the value of m is not simply m itself; if you want to see the number you have just defined, you need to type m#get. The reason why this works is that we have defined the method get for that class, and the result of this method is the number associated with m. If you define another object, say pi, then its value is pi#get. If you define another object q and issue, for example, q#succ, and then ask for its value, you will see the effect.
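The dialog just described can be packaged as a self-contained program, with the class definition of (65) and the method calls as assertions.

```ocaml
class number = object
  val mutable x = 0
  method get = x
  method succ = x <- x + 1
  method assign d = x <- d
end

let () =
  let m = new number in
  assert (m#get = 0);    (* freshly created: the value is 0, as in (67) *)
  let q = new number in
  q#succ;                (* an action of type unit *)
  assert (q#get = 1);
  q#assign 41;
  q#succ;
  assert (q#get = 42);
  assert (m#get = 0)     (* m and q are independent objects *)
```

The last assertion makes the point of the text explicit: each new number carries its own mutable x.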

The value you get is 1:

(68) # let q = new number;;
     val q : number = <obj>
     # q#succ;;
     - : unit = ()
     # q#get;;
     - : int = 1

Notice the following. If you have defined an object which is basically a set, say a set of pairs of strings as defined above, issuing method get = x will not help you much in seeing what the current value is: there arises the same problem of communication with OCaML as before. You will have to say, for example, method get = PStringSet.elements x.

10 Characters, Strings and Regular Expressions

We are now ready to do some actual theory and implementation of linguistics. We shall deal first with strings and then with finite state automata. Before we can talk about strings, some words are necessary on characters. Characters are drawn from a special set, which is also referred to as the alphabet. In theory, any character can be used, but in actual implementations the alphabet is always fixed. OCaML, for example, is based on the character table of ISO Latin 1 (also referred to as ISO 8859-1). It is included on the web page for you to look at. You may use characters from this set only. There are a number of characters that do not show up on the screen, like carriage return. Other characters are used as delimiters (such as the quotes). It is for this reason that one has to use certain naming conventions; they are given on Page 90-91 of the manual. If you write \ followed by 3 digits, this accesses the character named by that sequence. For example, the string "\076" refers to the character L. OCaML has a module called Char that has a few useful functions. The function code, for example, converts a character into its 3-digit code: type Char.code 'L', and OCaML gives you 76. Its inverse is the function chr.
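The two functions and the escape notation can be checked directly:

```ocaml
let () =
  assert (Char.code 'L' = 76);   (* code converts a character to its number *)
  assert (Char.chr 76 = 'L');    (* chr is its inverse *)
  assert ("\076" = "L")          (* the 3-digit decimal escape inside a string *)
```

Note that Char.chr raises an exception (Invalid_argument) when its argument lies outside 0-255, in keeping with the fixed character table.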
You can try out that function and see that it actually does support the full character set of ISO Latin 1. Another issue is of course your editor: in order to put in such a character, you have to learn how your editor lets you do this. (In vi, you type either Ctrl-V and then the numeric code as defined in ISO Latin 1 (this did not work when I tried it) or Ctrl-K and then a two-keystroke code for the character (this did work for me). If you want to see what the options are, type :digraphs or simply :dig in command mode and you get the list of options.)

It is not easy to change the alphabet in dealing with computers. Until Unicode becomes standard, different alphabets must be communicated using certain combinations of ASCII symbols; HTML documents, for example, use a combination such as &auml; for the umlaut character ä. But some programs allow you to treat such a sequence as a single character. In principle the alphabet can be anything. Practically, computer programs dealing with natural language read the input first through a so-called tokenizer. The tokenizer does nothing but determine which character sequence is to be treated as a single symbol. Effectively, the tokenizer translates strings in the standard alphabet into strings of an arbitrary, user defined alphabet. The Xerox tools for finite state automata, for example, allow you to define any character sequence as a token. Careful thought has to go into the choice of token strings: if you have a token of the form ab, then the sequence abc could be read as a string of length three, a followed by b followed by c, or as a string of two tokens, ab followed by c.

Now we start with strings. Mathematically, a string of length n is defined as a function from the set of numbers 0, 1, ..., n-1 into the alphabet. The string "cat", for example, is the function f such that f(0) = c, f(1) = a and f(2) = t. Notice that technically n = 0 is admitted: this is a string that has no characters in it, and it is called the empty string. The empty string can be denoted by "". Without the quotes you would not see anything, and OCaML would not either.

As far as the computer is concerned, a string is a sequence of characters. OCaML has a few inbuilt string functions, and a module called String, which contains a couple of useful functions to manipulate strings. The first letter in a string u is addressed as u.[0], the second as u.[1], and so on. So OCaML starts counting with 0, a convention similar to the mathematical definition just given.

(69) # let u = "cat";;
     val u : string = "cat"
     # u.[1];;
     - : char = 'a'

The length of u is denoted by |u|. Suppose that u⌢v is a string of length m + n, with |u| = m and |v| = n. Then

(70) (u⌢v)(j) = u(j) if j < m, and v(j - m) otherwise.

In OCaML, ^ is an infix operator for string concatenation: if we write "tom"^"cat", OCaML returns "tomcat". OCaML also has a function String.length that returns the length of a given string, so String.length "cat" will give 3. Notice that by our conventions the last symbol of the string has the number 2, while the string has length 3; you cannot access a symbol with number 3. If you try to access an element that does not exist, OCaML raises the exception Invalid_argument "String.get". Look at the following dialog.

(71) # "cat".[2];;
     - : char = 't'
     # "cat".[String.length "cat"];;
     Exception: Invalid_argument "String.get".

Here is a useful abbreviation: x^n denotes the string obtained by repeating x n times. This can be defined recursively as follows.

(72) x^0     = ε
(73) x^{n+1} = x^n⌢x

So, for example, (vux)^3 = vuxvuxvux.

Definition 2. Let u be a string. u is a prefix of v if there is a w such that v = u⌢w, a postfix if there is a w such that v = w⌢u, and a substring if there are w and x such that v = w⌢u⌢x.

A language over A is a set of strings over A. Here are a few useful operations on languages.

(74a) L · M := {xy : x ∈ L, y ∈ M}
(74b) L/M  := {x : there exists y ∈ M such that xy ∈ L}
(74c) M\L  := {x : there exists y ∈ M such that yx ∈ L}
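The string operations of this passage can be tried out in one go; repeat is our own name for a function implementing the x^n of (72)/(73).

```ocaml
(* x^n as in (72)/(73): x^0 = "", x^(n+1) = x^n followed by x *)
let rec repeat x n = if n = 0 then "" else repeat x (n - 1) ^ x

let () =
  let u = "cat" in
  assert (u.[1] = 'a');                   (* counting starts at 0 *)
  assert (String.length u = 3);
  assert ("tom" ^ "cat" = "tomcat");
  assert (repeat "vux" 3 = "vuxvuxvux");  (* (vux)^3 *)
  assert (repeat "x" 0 = "")              (* x^0 is the empty string *)
```

An out-of-range access such as u.[3] would raise Invalid_argument "String.get", exactly as in (71).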

The first operation defines the language that contains all concatenations of elements from the first language with elements from the second. This construction is also used in 'switchboard' constructions. For example, define

(75) L := {John, my neighbour, ...}

to be a set of noun phrases and

(76) M := {sings, rescued the queen, ...}

a set of verb phrases. Then L · {␣} · M (␣ denotes the blank here) is the following set:

(77) {John sings, John rescued the queen, my neighbour sings, my neighbour rescued the queen, ...}

Notice first that (77) is actually not the same as L · M. Secondly, some orthographic conventions are not respected: although we do get what looks like sentences, there is no period at the end, and no uppercase for my.

The construction L/M denotes the set of strings that will be L-strings if some M-string is appended. For example, let L = {chairs, cars, cats, ...} and M = {s}. Then

(78) L/M = {chair, car, cat, ...}

Notice that if L contains a word that does not end in s, then no suffix from M exists for it. Hence

(79) {milk, papers}/M = {paper}

In the sequel we shall study first regular languages and then context free languages. Every regular language is context free, but there are also languages that are neither regular nor context free; these will not be studied here.

A regular expression is built from letters using the following additional symbols: 0, ε, ·, ∪, and ∗. The first two are constants, as are the letters of the alphabet. · and ∪ are binary operations, ∗ is unary. The language that a regular expression specifies is given as follows.

(80) L(0)     := ∅
(81) L(ε)     := {ε}
(82) L(s · t) := L(s) · L(t)
(83) L(s ∪ t) := L(s) ∪ L(t)
(84) L(s∗)    := {x_1 x_2 · · · x_n : n ∈ N, x_i ∈ L(s) for all i ≤ n}
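Returning to the operations (74a) and (74b) for a moment: for finite languages they can be modelled directly, with a language represented as a list of strings. The helper names concat_lang, strip_suffix and right_quotient are our own; this sketch assumes an OCaml version with List.concat_map and List.filter_map (4.10 or later).

```ocaml
(* (74a): L · M for finite L and M *)
let concat_lang l m =
  List.concat_map (fun x -> List.map (fun y -> x ^ y) m) l

(* remove a given suffix y from w, if w ends in y *)
let strip_suffix y w =
  let lw = String.length w and ly = String.length y in
  if lw >= ly && String.sub w (lw - ly) ly = y
  then Some (String.sub w 0 (lw - ly))
  else None

(* (74b): L/M = {x : xy ∈ L for some y ∈ M} *)
let right_quotient l m =
  List.concat_map (fun w -> List.filter_map (fun y -> strip_suffix y w) m) l

let () =
  assert (concat_lang ["John"; "my neighbour"] [" sings"]
          = ["John sings"; "my neighbour sings"]);
  assert (right_quotient ["chairs"; "cars"; "cats"] ["s"]
          = ["chair"; "car"; "cat"]);
  assert (right_quotient ["milk"; "papers"] ["s"] = ["paper"])
```

The last assertion reproduces (79): milk does not end in s, so it contributes nothing to the quotient.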

(N is the set of natural numbers; it contains all integers starting from 0.) In ordinary usage, · is often omitted: st is the same as s · t. Hence, cat = c · a · t. We write a both for the string a and for the regular term whose language is {a}. (A note of caution: whereas s is a term, L(s) is a language, the language of strings that fall under the term s. In ordinary usage these two are often not distinguished.)

Notice that there are regular expressions which are different but denote the same language. It follows from the definitions, for example, that

(85) L(a · (b ∪ c)) = {ab, ac}
(86) L((cb)∗ a)     = {a, cba, cbcba, ...}

A couple of abbreviations are also used:

(87) s?  := ε ∪ s
(88) s+  := s · s∗
(89) s^n := s · s · ... · s   (n times)

We say that s ⊆ t if L(s) ⊆ L(t). This is the same as saying that L(s ∪ t) = L(t).

It is easily seen that every set containing exactly one string is a regular language. Hence, every set containing a finite set of strings is also regular. With more effort one can show that a set containing all but a finite set of strings is regular, too.

Regular expressions are used in a lot of applications. For example, the Unix command egrep allows you to search for strings that match a regular term. If you are searching for a particular string, say department in a long document, it may actually appear in two shapes, department or Department. Also, if you are searching for two words in a sequence, you may face the fact that they appear on different lines; this means that they are separated by any number of blanks and an optional carriage return. In order not to lose any occurrence of that sort you will want to write a regular expression that matches any of these occurrences. The particular constructors of such search patterns look a little bit different, but the underlying concepts are the same.

We have in general the following laws. (A note of caution: these laws are not identities between the terms; as terms, the two sides are distinct. They are identities concerning the languages that these terms denote.)

(90a) s · (t · u) = (s · t) · u
(90b) s · ε = s,   ε · s = s

(90c) s · 0 = 0,   0 · s = 0
(90d) s ∪ (t ∪ u) = (s ∪ t) ∪ u
(90e) s ∪ t = t ∪ s
(90f) s ∪ s = s
(90g) s ∪ 0 = s,   0 ∪ s = s
(90h) s · (t ∪ u) = (s · t) ∪ (s · u)
(90i) (s ∪ t) · u = (s · u) ∪ (t · u)
(90j) s∗ = ε ∪ s · s∗

It is important to note that regular languages appear as solutions of very simple equations. (· binds stronger than ∪.) For example, let the following equation be given:

(91) X = a ∪ b · X

Here, the variable X denotes a language, that is, a set of strings. We ask: what set X satisfies the equation above? It is a set X such that a is contained in it and, if a string v is in X, so is b⌢v. So, since it contains a, it also contains ba, bba, bbba, and so on. Indeed,

(92) X = b∗a

is the smallest solution to the equation. (There are many more solutions, for example b∗a ∪ b∗, as you may easily verify. The one given is the smallest.) The converse is actually also true: one can characterize regular languages as minimal solutions of certain systems of equations.

Theorem 3. Let s and t be regular terms. Then the smallest solution of the equation X = t ∪ s · X is s∗ · t.

For example, s ∪ t is the solution of X = Y ∪ Z with Y = s and Z = t, and s · t is the unique solution of X = s · Y with Y = t. Now let Xi, i < n, be variables. A regular equation has the form

(93) Xi = s0 · X0 ∪ s1 · X1 ∪ ... ∪ sn−1 · Xn−1

(If si is 0, the term si · Xi may be dropped.) The equation is simple if for all i < n, si is either 0, ε, or a single letter. The procedure outlined below can be used to convert any regular expression into a set of simple equations whose minimal solution is that term.
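A finite sanity check of (91)/(92) is easy to program: we list all strings b^k a up to a bound and verify that this approximation of b∗a behaves as the equation demands. The function name approx is ours.

```ocaml
(* all strings b^k a with k <= bound: a finite approximation of b*a *)
let rec approx k =
  if k < 0 then [] else (String.make k 'b' ^ "a") :: approx (k - 1)

let () =
  let x = approx 4 in
  (* X = a ∪ b·X: every member is either "a" itself,
     or "b" followed by another member of X *)
  List.iter
    (fun w ->
      let rest = String.sub w 1 (String.length w - 1) in
      assert (w = "a" || (w.[0] = 'b' && List.mem rest x)))
    x
```

The check is one-directional and bounded, of course; the full claim that b∗a is the smallest solution is what Theorem 3 establishes.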

Theorem 4. Let E be a set of regular equations in the variables Xi, i < n, such that each Xi occurs exactly once to the left of an equation. Then there are regular terms si, i < n, such that the least solution to E is Xi = L(si), i < n.

Notice that the theorem does not require the equations of the set to be simple. The proof is by induction on n. We suppose that the claim is true for a set of equations in n − 1 variables. Suppose now that the first equation has the form

(94) X0 = s0 · X0 ∪ s1 · X1 ∪ ... ∪ sn−1 · Xn−1

Two cases arise. Case (1): s0 = 0. Then the equation tells us what X0 is in terms of the Xi, 0 < i < n. We solve the set of remaining equations by inductive hypothesis and obtain regular solutions ti for the Xi, 0 < i < n. Then we insert them:

(95) X0 = s1 · t1 ∪ s2 · t2 ∪ ... ∪ sn−1 · tn−1

This is a regular term for X0: the recursion has been eliminated, and X0 is given explicitly. Case (2): s0 ≠ 0. Then using Theorem 3 we can solve the equation by

(96) X0 = s0∗ · (s1 · X1 ∪ ... ∪ sn−1 · Xn−1)
        = s0∗s1 · X1 ∪ s0∗s2 · X2 ∪ ... ∪ s0∗sn−1 · Xn−1

and proceed as in Case (1).

The procedure is best explained by an example. Take a look at Table 1. It shows a set of three equations in the variables X0, X1 and X2; this is (I). We want to find out what the minimal solution is. First, we solve the second equation for X1 with the help of Theorem 3 and get (II). The simplification is that X1 is now defined in terms of X0 alone. Next we insert the solution we have for X1 into the equation for X2. This gives (III). Next, we can also put the solution for X2 into the first equation. This is (IV). Now we have to solve the equation for X0; using Theorem 3 again we get (V). This solution is finally inserted into the equations for X1 and X2 to yield (VI).

11 Interlude: Regular Expressions in OCaML

OCaML has a module called Str that allows to handle regular expressions (see Page 397 of the manual). If you have started OCaML via the command ocaml you must make sure to bind the module into the program.

Table 1: Solving a set of equations

(I)   X0 = ε ∪ d·X2          (II)  X0 = ε ∪ d·X2          (III) X0 = ε ∪ d·X2
      X1 = b·X0 ∪ d·X1             X1 = d∗b·X0                  X1 = d∗b·X0
      X2 = c·X1                    X2 = c·X1                    X2 = cd∗b·X0

(IV)  X0 = ε ∪ dcd∗b·X0      (V)   X0 = (dcd∗b)∗          (VI)  X0 = (dcd∗b)∗
      X1 = d∗b·X0                  X1 = d∗b·X0                  X1 = d∗b(dcd∗b)∗
      X2 = cd∗b·X0                 X2 = cd∗b·X0                 X2 = cd∗b(dcd∗b)∗

Binding the module can be done interactively by

(97) # #load "str.cma";;

The module provides a type regexp and various functions that go between strings and regular expressions. A regular expression is issued to OCaML as a string. This has two immediate effects. First, the string itself is not seen by OCaML as a regular expression; to get the regular expression you have to apply the function regexp (to do that, you must, of course, type Str.regexp). Second, if you enter a regular expression, the characters $^.*+?\[] are special and set aside. Thus, there is a two stage conversion process: the first converts the string that you give to OCaML into a regular term, and the second converts the regular term of OCaML into one that the platform uses. The module actually "outsources" the matching and uses the matcher provided by the platform (POSIX in Unix). This is why the manual does not tell you exactly what the syntax of regular expressions is. First, the syntax of the regular expressions depends on the syntax of regular expressions of the matcher that the platform provides; second, different platforms implement different matching algorithms, and this may result in different results of the same program. The conventions are described in the manual. In what is to follow, I shall describe the syntax of regular expressions in Unix (they are based on Gnu Emacs).

It is worthwhile to look at the way regular expressions are defined. The symbol a denotes not only the character a but also the regular expression whose associated language is {a}. (We have used the same symbolic overload above.) Now here are a few of the definitions. Every other symbol is treated as a constant for the character that it is.

• . matches any character except newline
• * (postfix) translates into ∗
• + (postfix) translates into +
• ? (postfix) translates into ?
• ^ matches at the beginning of the line
• $ matches at the end of the line
• \( left (opening) bracket

• \) right (closing) bracket
• \| translates into ∪

For example, (a ∪ b)+d? translates into \(a\|b\)+d?. It is clear that this syntax allows us to define any regular language over the entire set of characters, including the ones we have set aside. Notice that in OCaML there is also a way to quote a character by using \ followed by its 3-digit character code, and this method can also be used in regular expressions. For example, the code of the symbol + is 43. Thus, in a string, the sequence ###BOT_TEXT###43 is treated in the same way as +. In a regular expression, however, we get a difference: the sequence ###BOT_TEXT###43 is treated as a character, while + is now treated as an operation symbol. Thus, if you are looking for a nonempty sequence of +, you have to give OCaML the following string: ###BOT_TEXT###43+.

There are several shortcuts. One big group concerns grouping together characters: one can use square brackets. The square brackets eliminate the need to use the disjunction sign, and they also give additional power, as we shall explain now. The expression [abx09] denotes a single character matching any of the symbols occurring in the bracket. Additionally, we can use the hyphen to denote from-to: [0-9] denotes the digits, [a-f] the lower case letters from 'a' to 'f', [A-F] the upper case letters from 'A' to 'F', and [a-fA-F] the lower and upper case letters from 'a' to 'f' ('A' to 'F'). Notice that ordinarily the hyphen is a string literal; when it occurs inside the square brackets it takes on this new meaning. If we do want the hyphen itself to be included in the character range, it has to be put at the beginning or the end. Similarly, if the opening bracket is immediately followed by a caret, the range is inverted: we now take anything that is not contained in the list. Notice that ordinarily the caret matches at the beginning of the line; if we want the caret to be a literal inside square brackets, it must be placed last.

Now that we know how to write regular expressions, let us turn to the question how they can be used. By far the most frequent application of matching against a regular expression is not that of an exact match. Rather, one tries to find an occurrence of a string that falls under the regular expression in a large file. For example, if you are editing a file and you want to look for the word ownership in a case insensitive way, you will have to produce a regular expression that matches either ownership or Ownership. Similarly, looking for a word in German that may have ä or ae, ü next to ue, and so on as alternate spellings may prompt the need to use regular expressions. The functions that OCaML provides are geared towards this application. The function search_forward needs a regular expression as input, then a string, and next an integer. The integer specifies

the position in the string where the search should start. The search proceeds from the starting position and tries to match from the successive positions against the regular expression; if it cannot match at a given position, it proceeds to the next position. The result is an integer, and it gives the initial position of the next string that falls under the regular expression. If it cannot match at all, OCaML raises the exception Not_found. If you want to avoid getting the exception, you can also use string_match beforehand to see whether or not there is any string of that sort. It is actually superfluous, however, to first check whether a match exists and then actually perform the search, since this requires searching the same string twice.

There is an interesting difference in the way matching can be done. The Unix matcher tries to match the longest possible substring against the regular expression (this is called greedy matching), while the Windows matcher looks for just enough of the string to get a match (lazy matching). Say we are looking for a∗ in the string bcagaa. Try

(98) let ast = Str.regexp "a*";;
     Str.string_match ast "bcagaa" 0;;

In both versions the answer is "yes": the empty string matches. Moreover, there are two useful functions, match_beginning and match_end, which return the initial position (final position) of the string that was found to match the regular expression. (Notice that you have to say Str.match_beginning ().) In both versions, if we ask for the first and the last position, we get 0 and 0. This is because there is no way to match more than zero a at that point. Now try instead

(99) Str.string_match ast "bcagaa" 2;;

Here the two algorithms yield different values. The greedy algorithm will say that it has found a match between 2 and 3, since it was able to match one a at that point; if it can match, it will take the longest possible string that matches. The lazy algorithm is content with the empty string, so it does not look further. Notice that the greedy algorithm does not look for the longest occurrence in the entire string. The greedy algorithm actually allows one to check whether a string falls under a

regular expression exactly. Given a regular expression r and a string s:

(100) let exact_match r s =
        if Str.string_match r s 0 then
          ((Str.match_beginning () = 0) &&
           (Str.match_end () = (String.length s)))
        else false;;

Notice that the matcher always looks for the first possible match. The most popular application for which regular expressions are actually used is string replacement. String replacements work in two stages. The first is the phase of matching: given a regular expression s and a string x, the pattern matcher looks for a string that falls under the regular expression s. Let this string be y. The next phase is the actual replacement. This can be a simple replacement by a particular string. Very often, however, we do want to use parts of the string y in the new expression. The way to do this is as follows. An example may help. An IP address is a sequence of the form c.c.c.c, where c is a string representing a number from 0 to 255. Leading zeros are suppressed, but c may not be empty. For example, we may have 192.168.145.1, or 129.146.0.1, or 127.0.0.1. To define the legal values of c, define the following string:

(101) [01][0-9][0-9]\|2[0-4][0-9]\|25[0-5]

This string can be converted to a regular expression using Str.regexp. Now look at the following:

(102) let x = "[01][0-9][0-9]\|2[0-4][0-9]\|25[0-5]"
      let m = Str.regexp ("\(" ^ x ^ "\." ^ x ^ "\)" ^ "\." ^ "\(" ^ x ^ "\." ^ x ^ "\)")

This expression matches IP addresses; it matches exactly the sequences just discussed. The bracketing that I have introduced can be used for the replacement as follows. The matcher that is sent to find a string that matches a regular expression actually takes note, for each bracketed expression, of where it matched in the string that it is looking at. For example, if your data is as follows:

(103) . . .

it will have a record that says, for example, that the first bracket matched from position 1230 to position 1235, and the second bracket from position 1237 to position 1244.
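Here is a small end-to-end sketch of the functions discussed in this section: search_forward, matched_group and global_replace are the real Str functions, but the pattern below is a simplified stand-in for (102) (plain [0-9]+ per component, and the data string is invented). Note that in actual OCaml source the backslashes of (102) must be doubled inside string literals.

```ocaml
(* Needs the str library: #load "str.cma";; in the toplevel, or link str. *)
let num = "[0-9]+"

(* Two groups: first half and second half of an IP-like address. *)
let ip_halves =
  Str.regexp ("\\(" ^ num ^ "\\." ^ num ^ "\\)\\.\\(" ^ num ^ "\\." ^ num ^ "\\)")

let () =
  let data = "address: 129.23.145.110 end" in
  (* search_forward: regular expression, string, start position *)
  let pos = Str.search_forward ip_halves data 0 in
  Printf.printf "match found at position %d\n" pos;
  (* matched_group recalls what the n-th bracket matched *)
  Printf.printf "first half: %s\n" (Str.matched_group 1 data);
  (* global_replace with template "\1" cuts every address to its first half *)
  print_endline (Str.global_replace ip_halves "\\1" data)
```

The template string "\\1" passed to global_replace is exactly the "\1" template of the text, written with OCaml's string escaping.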

Suppose that the first character of the match is at position 1230. Then the string 129.023 matches the first bracket and the string 145.110 the second bracket. To use this in an automated string replacement procedure, we have the variables \0, \1, \2, . . . , \9. After a successful match, \0 is assigned to the entire string, \1 to the first matched string, \2 to the second matched string, and so on. These strings can be recalled using the function matched_group. It takes as input a number and the original string, and it returns the string of the nth matching bracket. For example, if directly after the match on the string assigned to u we define

(104) let s = "The first half of the IP address is "^(Str.matched_group 1 u)

we get the following value for s:

(105) "The first half of the IP address is 129.023"

A template is a string that in place of characters may also contain these variables (but nothing more). The function global_replace takes as input a regular expression and two strings. The first string is used as a template: whenever a match is found, it uses the template to execute the replacement. So, to cut the IP to its first half, we write the template "\1". If we want to replace the second part by the first, we use "\1.\1".

12 Finite State Automata

A finite state automaton is a quintuple

(106) A = ⟨A, Q, i0, F, δ⟩

where A, the alphabet, is a finite set; Q, the set of states, also is a finite set; i0 ∈ Q is the initial state; F ⊆ Q is the set of final or accepting states; and, finally, δ ⊆ Q × A × Q is the transition relation. We write x →a y if ⟨x, a, y⟩ ∈ δ.

We extend the notation to regular expressions as follows.

(107a) x →0 y :⇔ false
(107b) x →ε y :⇔ x = y
(107c) x →s∪t y :⇔ x →s y or x →t y
(107d) x →st y :⇔ there exists z such that x →s z and z →t y
(107e) x →s∗ y :⇔ there exists n such that x →sⁿ y

In this way, we can say that

(108) L(A) := {x ∈ A∗ : there is q ∈ F such that i0 →x q}

and call this the language accepted by A. An automaton is partial if for some q and a there is no q′ such that q →a q′. Here is part of the listing of the file fstate.ml.

(109) class automaton =
        object
          val mutable i = 0
          val mutable x = StateSet.empty
          val mutable y = StateSet.empty
          val mutable z = CharSet.empty
          val mutable t = Transitions.empty
          method get_initial = i
          method get_states = x
          method get_astates = y
          method get_alph = z
          method get_transitions = t
          method initialize_alph = z <- CharSet.empty
          · · ·
          method list_transitions = Transitions.elements t
        end;;
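The listing in (109) presupposes that the set and transition types have been defined earlier in the file. A minimal version of those definitions might look as follows; this is a sketch of what fstate.ml plausibly contains, and the actual file may differ in details.

```ocaml
(* Sets of states are simply sets of integers; alphabets are sets of chars. *)
module StateSet = Set.Make (struct type t = int let compare = compare end)
module CharSet = Set.Make (Char)

(* A transition has three entries: the state before scanning the symbol,
   the symbol itself, and the state after scanning it. *)
type transition = { first : int; symbol : char; second : int }

(* Sets of transitions, ordered structurally. *)
module Transitions =
  Set.Make (struct type t = transition let compare = compare end)
```

With these in scope, the class definition in (109) type-checks as written.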

There is a lot of repetition involved in this definition, so it is necessary only to look at part of it. The idea behind an object type is that the objects can be changed; effectively, you can define and modify the object step by step. The attribute mutable shows that the values can be changed. After the equation sign is a value that is set by default. You do not have to set the value, but to prevent error it is wise to do so. Thus the initial state is 0, the state set is empty, the set of accepting states is empty, and so is the set of transitions.

Notice that prior to this definition the types that it uses need to be formally introduced. First, the type CharSet is defined. Similarly, the program contains definitions of sets of states, which are simply sets of integers; this type is called StateSet. Also there is the type of a transition, which has three entries, first, symbol and second. They define the state prior to scanning the symbol, the symbol, and the state after scanning the symbol, respectively. Then a type of sets of transitions is defined. These types are now used in the definition of the object type automaton.

You need a method to even get at the value of the objects involved, and a method to change or assign a value to them. You can change any of the values using the appropriate method, and you can do this interactively. For example, after you have loaded the file fsa.ml by typing #use "fsa.ml";; you may issue

(110) # let a = new automaton;;

Then you have created an object identified by the letter a, which is a finite state automaton. By way of illustration, let me show you how to add a letter and a transition. If you type

(111) # a#add_alph ’z’;;

you have added the letter ‘z’ to the alphabet. Similarly, typing

(112) # a#add_transitions {first = 0; symbol = ’z’; second = 2};;

adds a transition. Similarly with all the other components.

Theorem 5 Let A be a finite state automaton. Then L(A) is regular.

If A is not partial it is called total. It is an easy matter to make an automaton total: just add a state q⋆, add a transition ⟨q, a, q⋆⟩ whenever there was no transition ⟨q, a, q′⟩ for any q′, and add all transitions ⟨q⋆, a, q⋆⟩. Finally, q⋆ is not accepting. The idea of the proof of Theorem 5 is to convert the automaton into a set of equations. To this end, for every state q let Tq := {x : i0 →x q}. Then if ⟨r, a, q⟩ ∈ δ we have Tq ⊇ Tr · a. If q ≠ i0 we

have

(113) Tq = ⋃ { Tr · a : ⟨r, a, q⟩ ∈ δ }

If q = i0 we have

(114) Ti0 = ε ∪ ⋃ { Tr · a : ⟨r, a, i0⟩ ∈ δ }

This is a set of regular equations, and it has a solution tq for every q, so that Tq = L(tq). Now,

(115) L(A) = ⋃ { Tq : q ∈ F }

which is regular.

On the other hand, for every regular term there is an automaton A that accepts exactly the language of that term. We shall give an explicit construction of such an automaton. First, s = 0. Take an automaton with a single state, 0, which is initial, and put F = ∅. No accepting state, so no string is accepted. Next, s = ε. Take again a single state, 0, which is both initial and accepting, and no transitions. For s = a, take two states, 0 and 1, with 0 initial, 1 accepting, and δ = {⟨0, a, 1⟩}. (To get a total automaton, let Q = {0, 1, 2} and δ = {⟨0, a, 1⟩} ∪ {⟨0, b, 2⟩ : b ∈ A − {a}} ∪ {⟨1, b, 2⟩ : b ∈ A} ∪ {⟨2, b, 2⟩ : b ∈ A}.)

Next st. Take an automaton A accepting s and an automaton B accepting t. The new states are the states of A and the states of B (considered distinct), with the initial state of B removed. For every transition ⟨iB, a, j⟩ of B, add a transition ⟨u, a, j⟩ for every u ∈ FA. The accepting states are the accepting states of B. (If the initial state of B was accepting, throw in also the accepting states of A.)

Now for s∗. Take an automaton accepting s. For every transition ⟨i0, a, j⟩ ∈ δ, add a transition ⟨u, a, j⟩ for every accepting state u. Make F ∪ {i0} accepting. This automaton recognizes s∗.

Finally, s ∪ t. Take an automaton A accepting s and an automaton B accepting t, and assume that both are total. We construct the following automaton.

(116) A × B = ⟨A, QA × QB, ⟨iA, iB⟩, FA × FB, δA × δB⟩
(117) δA × δB = {⟨⟨x0, x1⟩, a, ⟨y0, y1⟩⟩ : ⟨x0, a, y0⟩ ∈ δA and ⟨x1, a, y1⟩ ∈ δB}

Lemma 6 For every string u:

(118) ⟨x0, x1⟩ →u ⟨y0, y1⟩ ⇔ x0 →u y0 and x1 →u y1

The proof is by induction on the length of u. It is now easy to see that L(A × B) = L(A) ∩ L(B). For u ∈ L(A × B) if and only if ⟨iA, iB⟩ →u ⟨x0, x1⟩ ∈ FA × FB, which is equivalent to iA →u x0 ∈ FA and iB →u x1 ∈ FB. The latter is nothing but u ∈ L(A) and u ∈ L(B).

Next, define the set G := FA × QB ∪ QA × FB and make G the accepting set. Then the language accepted by this automaton is L(A) ∪ L(B), so it accepts s ∪ t. For suppose u is such that ⟨iA, iB⟩ →u q ∈ G. Then either q = ⟨q0, y⟩ with q0 ∈ FA, and then u ∈ L(A); or q = ⟨x, q1⟩ with q1 ∈ FB, and then u ∈ L(B). The converse is also easy.

Here is another method. We define the prefix closure as follows.

Definition 7 Let L ⊆ A∗ be a language. Then put

(119) L^p = {x : there is y such that xy ∈ L}

Let s be a regular term. Put

(120a) 0† := ∅
(120b) a† := {ε, a}
(120c) ε† := {ε}
(120d) (s ∪ t)† := s† ∪ t†
(120e) (st)† := {su : u ∈ t†} ∪ s†
(120f) (s∗)† := s† ∪ {s∗u : u ∈ s†}

Notice the following.

Lemma 8 x is a prefix of some string from s iff x ∈ L(t) for some t ∈ s†. That is to say,

(121) L(s)^p = ⋃ { L(t) : t ∈ s† }

Proof. We show that L(s)^p is the union of all L(t) where t ∈ s†. The proof is by induction on s. The cases s = 0 and s = ε are actually easy. Next, let s = a where a ∈ A, and suppose that t ∈ s†. Then t = a or t = ε, and the claim is verified directly.

12. Finite State Automata

42

Next, assume that s = s1 ∪ s2 and let t ∈ s†. Either t ∈ s1† or t ∈ s2†. In the first case, L(t) ⊆ L(s1)^p by inductive hypothesis, and so L(t) ⊆ L(s)^p, since L(s) ⊇ L(s1) and so L(s)^p ⊇ L(s1)^p. Analogously for the second case. Now, if x ∈ L(s)^p then either x ∈ L(s1)^p or x ∈ L(s2)^p, and hence by inductive hypothesis x ∈ L(u) for some u ∈ s1† or some u ∈ s2†; in either case u ∈ s†, as had to be shown.

Next let s = s1 · s2 and t ∈ s†. Case 1. t ∈ s1†. Then by inductive hypothesis, L(t) ⊆ L(s1)^p, and since L(s1)^p ⊆ L(s)^p (can you see this?), we get the desired conclusion. Case 2. t = s1u with u ∈ s2†. Now, a string in L(t) is of the form xy, where x ∈ L(s1) and y ∈ L(u) ⊆ L(s2)^p, by induction hypothesis. So, xy ∈ L(s1)L(s2)^p ⊆ L(s)^p, as had to be shown. Conversely, if a string is in L(s)^p then either it is of the form x ∈ L(s1)^p or yz, with y ∈ L(s1) and z ∈ L(s2)^p. In the first case there is a u ∈ s1† such that x ∈ L(u), and then u ∈ s†. In the second case there is a u ∈ s2† such that z ∈ L(u) and y ∈ L(s1); then yz ∈ L(s1u), and s1u ∈ s†. This shows the claim.

Finally, let s = s1∗. Then either t ∈ s1† or t = s1∗u where u ∈ s1†. In the first case we have L(t) ⊆ L(s1)^p ⊆ L(s)^p, by inductive hypothesis. In the second case we have L(t) ⊆ L(s1∗)L(u) ⊆ L(s1∗)L(s1)^p ⊆ L(s)^p. This finishes one direction. Now, suppose that x ∈ L(s)^p. Then there are yi, i < n, and z such that x = y0y1 · · · yn−1z, yi ∈ L(s1) for all i < n, and z ∈ L(s1)^p. By inductive hypothesis there is a u ∈ s1† such that z ∈ L(u). Also y0y1 · · · yn−1 ∈ L(s1∗), so x ∈ L(s1∗u), and this regular term is in s†.

We remark here that for all s ≠ 0: ε ∈ s†; moreover, s ∈ s†. Both are easily established by induction.
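The clauses (120a) to (120f) translate directly into a recursive function. Here is a sketch over a symbolic term type (the names are my own, the type is redefined here so the sketch is self-contained, and lists stand in for sets):

```ocaml
type term =
  | Zero | Eps | Sym of char
  | Union of term * term
  | Conc of term * term
  | Star of term

(* Remove duplicates, keeping the first occurrence of each element. *)
let nub xs =
  List.fold_left (fun acc x -> if List.mem x acc then acc else acc @ [x]) [] xs

(* The prefix closure s-dagger of (120a-f). *)
let rec dagger = function
  | Zero -> []
  | Eps -> [Eps]
  | Sym a -> [Eps; Sym a]
  | Union (s, t) -> nub (dagger s @ dagger t)
  | Conc (s, t) ->
      (* (st)-dagger = { s.u : u in t-dagger } union s-dagger *)
      nub (List.map (fun u -> Conc (s, u)) (dagger t) @ dagger s)
  | Star s ->
      (* (s*)-dagger = s-dagger union { s*.u : u in s-dagger } *)
      nub (dagger s @ List.map (fun u -> Conc (Star s, u)) (dagger s))
```

By the remark above, for every term s other than 0 the result contains Eps, and the computed set is always finite.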

Let s be a term. For t, u ∈ s† put t →a u iff ta ⊆ u. (The latter means L(ta) ⊆ L(u).) The start symbol is ε. Put

(122) A(s) := {a ∈ A : a ∈ s†}
      δ(t, a) := {u : ta ⊆ u}
      A(s) := ⟨A(s), s†, ε, δ⟩

We call A(s) the canonical automaton of s. We shall show that the language recognized by the canonical automaton is s. This follows immediately from the next theorem, which establishes a somewhat more general result.

Lemma 9 In A(s), ε →x t iff x ∈ L(t). Hence, the language accepted by state t is exactly L(t).


Proof. First, we show that if ε →x u then x ∈ L(u). This is done by induction on the length of x. If x = ε the claim trivially follows. Now let x = ya for some a. Assume that ε →ya u. Then there is t such that ε →y t →a u. By inductive assumption, y ∈ L(t), and by definition of the transition relation, L(t)a ⊆ L(u). Whence the claim follows.

Now for the converse, the claim that if x ∈ L(u) then ε →x u. Again we show this by induction on the length of x. If x = ε we are done. Now let x = ya for some a ∈ A. We have to show that there is a t ∈ s† such that y ∈ L(t) and ta ⊆ u. For then, by inductive assumption, ε →y t, and so ε →ya u, by definition of the transition relation. Now for the remaining claim: if ya ∈ L(u) then there is a t ∈ s† such that y ∈ L(t) and L(ta) ⊆ L(u). Again induction, this time on u. Notice right away that u† ⊆ s†, a fact that will become useful. Case 1. u = b for some letter. Clearly, ε ∈ s†, and putting t := ε will do. Case 2. u = u1 ∪ u2. Then u1 and u2 are both in s†. Now, suppose ya ∈ L(u1). By inductive hypothesis, there is a t such that y ∈ L(t) and L(ta) ⊆ L(u1) ⊆ L(u), so the claim follows. Similarly if ya ∈ L(u2). Case 3. u = u1u2. Then ya = y1y2a, with y1 ∈ L(u1) and y2a ∈ L(u2). Now, by inductive hypothesis, there is a t2 such that y2 ∈ L(t2) and L(t2a) ⊆ L(u2). Then t := u1t2 is the desired term. Since it is in u†, it is also in s†. And L(ta) = L(u1t2a) ⊆ L(u1u2) = L(u). Case 4. u = u1∗. Suppose ya ∈ L(u). Then ya has a decomposition z0z1 · · · zn−1va such that zi ∈ L(u1) for all i < n, and va ∈ L(u1). By inductive hypothesis, there is a t1 such that v ∈ L(t1) and L(t1a) ⊆ L(u1). And z0z1 · · · zn−1 ∈ L(u). Now put t := ut1. This has the desired properties.

Now, as a consequence, if L is regular, so is L^p. Namely, take the automaton A(s), where s is a regular term for L. Now change this automaton so as to make every state accepting. This defines a new automaton which accepts every string that falls under some t ∈ s†, by the previous results. Hence, it accepts all prefixes of strings from L.

We discuss an application. Syllables of a given language are subject to certain conditions. One of the most famous constraints (presumed to be universal) is the sonoricity hierarchy. It states that the sonoricity of phonemes in a syllable must rise until the nucleus, and then fall. The nucleus contains the sounds of highest sonoricity (which do not have to be vowels). The rising part is called the onset and the falling part the rhyme. The sonoricity hierarchy translates into a finite state automaton as follows. It has states ⟨o, i⟩ and ⟨r, i⟩, where i is a number smaller



than 10. The transitions are of the form q0 →a ⟨o, i⟩ if a has sonoricity i; q0 →a ⟨r, i⟩ if a has sonoricity i; ⟨o, i⟩ →a ⟨o, j⟩ if a has sonoricity j ≥ i; ⟨o, i⟩ →a ⟨r, j⟩ if a has sonoricity j > i; and ⟨r, i⟩ →a ⟨r, j⟩ if a has sonoricity j ≤ i. All states of the form ⟨r, i⟩ are accepting. (This accounts for the fact that a syllable must have a rhyme. It may lack an onset, however.) Refinements of this hierarchy can be implemented as well. There are also language specific constraints; for example, there is no syllable that begins with [ŋ] in English, and only vowels, nasals and laterals may be nuclear. We have seen above that a conjunction of conditions which each are regular is also a regular condition. Thus, effectively (and this has proved to be correct over and over), phonological conditions on syllable and word structure are regular.
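The sonoricity automaton can be simulated deterministically: a word conforms to the hierarchy exactly when its sonoricity profile first (weakly) rises and then (weakly) falls, since the nondeterministic choice of where the rhyme starts can always be made at the peak. The sketch below uses an invented toy sonoricity assignment; the function names are my own.

```ocaml
(* A toy sonoricity scale, for illustration only; a real phonology
   would refine these values and classes. *)
let son = function
  | 'a' | 'e' | 'i' | 'o' | 'u' -> 9
  | 'l' | 'r' -> 6
  | 'm' | 'n' -> 5
  | 'z' | 's' -> 3
  | _ -> 1

(* Accept w iff its sonoricity profile is rising-then-falling:
   stay in the onset as long as sonoricity does not drop, then
   require the remainder (the rhyme) never to rise again. *)
let well_formed w =
  let n = String.length w in
  if n = 0 then false
  else begin
    let i = ref 1 in
    (* onset: weakly rising *)
    while !i < n && son w.[!i] >= son w.[!i - 1] do incr i done;
    (* rhyme: weakly falling *)
    let ok = ref true in
    while !i < n do
      if son w.[!i] > son w.[!i - 1] then ok := false;
      incr i
    done;
    !ok
  end
```

For example, "plan" has the profile 1, 6, 9, 5 and is accepted, while "lpan" (profile 6, 1, 9, 5) is rejected.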

13 Complexity and Minimal Automata

In this section we shall look at the problem of recognizing a string by an automaton. Even though computers are nowadays very fast, it is still possible to reach the limit of their capabilities very easily, for example by making a simple task overly complicated. Although finite automata seem very easy at first sight, making the programs run as fast as possible is a complex task and requires sophistication. Given an automaton A and a string x = x0x1 . . . xn−1, how long does it take to see whether or not x ∈ L(A)? Evidently, the answer depends on both A and x. Notice that x ∈ L(A) if and only if there are qi, i < n + 1, such that

(123) i0 = q0 →x0 q1 →x1 q2 →x2 · · · →xn−1 qn ∈ F


Whether or not qi →xi qi+1 just takes constant time: it is equivalent to ⟨qi, xi, qi+1⟩ ∈ δ, and the latter is a matter of looking up δ. Looking up takes time logarithmic in the size of the input; but the input is bounded by the size of the automaton, so it takes roughly the same amount of time each time. A crude strategy is to just go through all possible assignments for the qi and check whether they satisfy (123). This requires checking up to |Q|^(n+1) many assignments, which suggests that the time required is exponential in the length of the string. In fact, far more efficient techniques exist. What we do is the following. We start with i0. In Step 1 we collect all states q1 such that i0 →x0 q1. Call this the set H1. In Step 2 we collect all states q2 such that there is a q1 ∈ H1 and q1 →x1 q2.

In Step 3 we collect into H3 all the q3 such that there exists a q2 ∈ H2 with q2 →x2 q3. And so on. It is easy to see that

(124) i0 →x0 x1 ... xj qj+1 iff qj+1 ∈ Hj+1

Hence, x ∈ L(A) iff Hn ∩ F ≠ ∅, for then there exists an accepting state in Hn. Here each step takes time quadratic in the number of states: for given Hj, we compute Hj+1 by doing |Q| many lookups for every q ∈ Hj. However, this number is basically bounded for given A, so the time requirement is down to a constant depending on A times the length of x. This is much better. However, in practical terms this is still not good enough, because the constant it takes to compute a single step is too large. This is because we recompute the transition from Hj to Hj+1 each time; if the string is very long, this means that we recompute the same problem over and over. Instead, we can precompute all the transitions. It turns out that we can define an automaton in this way that we can use in the place of A.

Definition 10 Let A = ⟨A, Q, i0, F, δ⟩ be an automaton. Put F℘ := {U ⊆ Q : U ∩ F ≠ ∅}, and let

(125) δ℘ := {⟨H, a, J⟩ : J = {q′ : there is q ∈ H such that q →a q′}}

Then

(126) A℘ := ⟨A, ℘(Q), {i0}, F℘, δ℘⟩

is called the exponential of A.

Definition 11 An automaton is deterministic if for all q ∈ Q and a ∈ A there is at most one q′ ∈ Q such that q →a q′.

Theorem 12 For every automaton A, L(A℘) = L(A). Moreover, the exponential A℘ is total and deterministic.

It is clear that for a deterministic and total automaton all we have to do is to look up the next state in (123), which exists and is unique. For these automata, recognizing the language is linear in the length of the string.
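The stepping through the sets Hj is exactly one transition of the exponential automaton, so the whole procedure fits in a few lines. This is a sketch with invented names, not the fstate.ml code:

```ocaml
(* States are ints, letters are chars; delta is a list of triples. *)
type nfa = {
  start : int;
  final : int list;
  delta : (int * char * int) list;
}

module S = Set.Make (struct type t = int let compare = compare end)

(* One step of the exponential automaton (Definition 10): the set of
   all a-successors of the states in h. *)
let step m h a =
  List.fold_left
    (fun acc (q, b, q') -> if b = a && S.mem q h then S.add q' acc else acc)
    S.empty m.delta

(* Run the deterministic exponential automaton on a word:
   H0 = {i0}, H(j+1) = step on the j-th letter, accept iff Hn meets F. *)
let accepts m w =
  let h = ref (S.singleton m.start) in
  String.iter (fun a -> h := step m !h a) w;
  List.exists (fun q -> S.mem q !h) m.final
```

For instance, with start 0, final state 1 and delta [(0,'a',1); (1,'a',1); (1,'b',1)] this accepts exactly the strings matching a(a ∪ b)∗.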

We mention a corollary of Theorem 12.

Theorem 13 If L ⊆ A∗ is regular, so is A∗ − L.

Proof. Let L be regular. Then there exists a total deterministic automaton A = ⟨A, Q, i0, F, δ⟩ such that L = L(A). Now let B = ⟨A, Q, i0, Q − F, δ⟩. Then it turns out that i0 →x q in B if and only if i0 →x q in A. Since A is deterministic and total, x ∈ L(B) if and only if there is a q ∈ Q − F such that i0 →x q, if and only if there is no q ∈ F such that i0 →x q, if and only if x is not in L(A) = L. This proves the claim.

Now we have reduced the problem to recognition by some deterministic automaton. So, the recipe to attack the problem ‘x ∈ L(A)?’ is this: first compute A℘ and then check ‘x ∈ L(A℘)?’. Since the latter is deterministic, the time needed is actually linear in the length of the string. (Looking up a transition is logarithmic in the size of ℘(Q), the state set of A℘, and hence linear in |Q|.)

However, we may still not have done the best possible. It may turn out that the automaton has more states than are actually necessary. In general, there is no limit on the complexity of an automaton that recognizes a given language; we can make it as complicated as we want! The art, as always, is to make it simple. Here is an example.

Definition 14 Let L be a language. Given a string x, let [x]L = {y : xy ∈ L}. The index of L is the number of distinct sets [x]L. We also write x ∼L y if [x]L = [y]L.

The language L = {ab, ac, bc} has the following index sets:

(127) [ε]L  = {ab, ac, bc}
      [a]L  = {b, c}
      [b]L  = {c}
      [c]L  = ∅
      [ab]L = {ε}

Let us take a slightly different language M = {ab, ac, bb, bc}:

(128) [ε]M  = {ab, ac, bb, bc}
      [a]M  = {b, c}
      [c]M  = ∅
      [ab]M = {ε}
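For a finite language the index sets can be computed mechanically, which is a useful check on tables like (127) and (128). The sketch below is my own (names invented); it counts the distinct sets [x]L over prefixes of words in L and adds the empty index set, which every other string receives.

```ocaml
(* [x]_L = { y : xy in L }, for a finite language given as a string list. *)
let residual lang x =
  let n = String.length x in
  List.filter_map
    (fun w ->
      if String.length w >= n && String.sub w 0 n = x
      then Some (String.sub w n (String.length w - n))
      else None)
    lang

(* The index of L: distinct residuals of all prefixes of words in L,
   plus the empty set (the residual of every non-prefix string). *)
let index lang =
  let prefixes =
    List.concat_map
      (fun w -> List.init (String.length w + 1) (fun i -> String.sub w 0 i))
      lang
  in
  let sets =
    List.map (fun x -> List.sort_uniq compare (residual lang x)) prefixes
  in
  List.length (List.sort_uniq compare ([] :: sets))
```

Running it on the two examples gives index 5 for L = {ab, ac, bc} and index 4 for M = {ab, ac, bb, bc}, in agreement with the tables.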

It is easy to check that [b]M = [a]M, while [b]L ≠ [a]L. We shall see that this difference means that an automaton checking membership in M can be based on fewer states than any automaton checking membership in L. Given an automaton A and a state q, put

(129) [q] := {x : there is q′ ∈ F such that q →x q′}

Suppose now that A is deterministic and total, and let L = L(A). Then for each string x there is exactly one state q such that i0 →x q, and for this q we have [q] = [x]L. For if y ∈ [q] then q →y q′ ∈ F for some q′, so i0 →x q →y q′, whence xy ∈ L(A), which means y ∈ [x]L. Conversely, if y ∈ [x]L then xy ∈ L, so i0 →xy q′ ∈ F for some q′; since the automaton is deterministic, this computation passes through q, and so y ∈ [q]. So, for every [x]L there must be a state q such that [q] = [x]L.

The next question is: how do we make that automaton? There are two procedures. One starts from a given automaton, and the other starts from the regular term. Given two index sets I and J, put I →a J if and only if J = a\I, where a\I := {y : ay ∈ I}. This is well-defined. For let I = [x]L. Then [xa]L = {y : xay ∈ L} = {y : ay ∈ I} = a\I. This defines a deterministic automaton with initial element L = [ε]L; accepting sets are those which contain ε. We call this the index automaton and denote it by I(L).

Theorem 15 L(I(L)) = L.

Proof. By induction on x we show that L →x I if and only if I = [x]L. If x = ε the claim reads: L →ε I if and only if I = [ε]L = L, so the claim holds. Next, let x = ya. By induction hypothesis, L →y J if and only if J = [y]L. Since J →a I if and only if I = a\J, we get L →ya I if and only if I = a\[y]L = [ya]L. Now, x is accepted by I(L) if and only if there is a computation from L to a set [x]L containing ε. By the above, this is equivalent to ε ∈ [x]L, which means x ∈ L.

By the first part of this argument, every deterministic and total automaton for L must contain, for each index set [x]L, a state q with [q] = [x]L; distinct index sets require distinct states. It follows that the index automaton is the smallest deterministic and total automaton that recognizes the language. Given an automaton, we know how to

13. In general. we need to compute the largest net on A. x x x x x x . also q ∈ F. F/ ∼. Now. as q ∈ F. a) A/ ∼ := A. Then i0 → q for some q ∈ F. The is done as follows. and if q ∈ F and q ∼ q then also q ∈ F. Given a net ∼. a partition of Q is a set Π of nonempty subsets of Q such that any two S . [q]∼ is an accepting state. δ/ ∼ a a (130) Lemma 16 L(A/ ∼) = L(A). Hence [i0 ]∼ → [q]∼ . Hence x ∈ L(A). by deﬁnition of nets. let A by an automaton. Q/ ∼. Hence x ∈ L(A/ ∼). This need not be a net. This means that there is a q ∼ q such that i0 → q . Therefore. Proof. All we have to do next is to fuse together all states which have the same index. Now consider x ∈ L(A/ ∼). we put q ∼0 q iﬀ q. [i0 ]∼ . In the ﬁrst step. by an easy induction. S ∈ Π which are distinct are disjoint. q ∈ Q − F. A net induces a partition of the set of states into sets of the form [q]∼ = {q : q ∼ q}. q ∈ F or q. Call a relation ∼ a net if from q ∼ q and q → r. By induction on the length of x the following can be shown: if q ∼ q and q → r then there is a r ∼ r such that q → r . and every element is in one (and therefore only one) member of Π. suppose that x ∈ L(A). On the other hand. Complexity and Minimal Automata 48 make a deterministic total automaton by using the exponentiation. put [q]∼ := {q : q ∼ q } Q/∼ := {[q]∼ : q ∈ Q} F/∼ := {[q]∼ : q ∈ F} [q ]∼ ∈ δ/ ∼ ([q]∼ . q → r follows q ∼ r . a) ⇔ there is r ∼ q : r ∈ δ(q. This means that there is a [q]∼ ∈ F/ ∼ such that [i0 ]∼ → [q]∼ . By deﬁnition of A/ ∼. Conversely.

So we refine it step by step:

(131) q ∼i+1 q′ :⇔ q ∼i q′ and for all a ∈ A and all r ∈ Q:
        if q →a r there is r′ ∼i r such that q′ →a r′, and
        if q′ →a r there is r′ ∼i r such that q →a r′.

Evidently, ∼i+1 ⊆ ∼i. Also, if q ∼i+1 q′ and q ∈ F then also q′ ∈ F, since this already holds for ∼0. Moreover, if ∼i+1 = ∼i then ∼i is a net. This suggests the following recipe: start with ∼0 and construct the ∼i one by one; if ∼i+1 = ∼i, stop the construction. It takes only finitely many steps to compute this, and it returns the largest net on the automaton.

Definition 17 Let A be a finite state automaton. A is called refined if the only net on it is the identity.

Thus the recipe to get a minimal automaton is this: get a deterministic total automaton for L and refine it. This yields an automaton which is unique up to isomorphism.

Theorem 18 Let A and B be deterministic, total and refined, and let every state in each of the automata be reachable. Then if L(A) = L(B), the two are isomorphic.

Proof. Let A be based on the state set QA and B on the state set QB. For q ∈ QA write I(q) := {x : q →x r ∈ F}, and similarly for q ∈ QB. Since L(A) = L(B), we have I(iA) = I(iB). Notice also that if q →a r then I(r) = a\I(q); so if I(q) = I(q′) then also I(r) = I(r′) for the respective a-successors. Now we construct a map h : QA → QB as follows. First, h(iA) := iB. Next, if h(q) = q′, q →a r and q′ →a r′, then h(r) := r′. Since all states in A are reachable, h is defined on all states. This map is injective, since I(h(q)) = I(q) and A is refined. (Every homomorphism induces a net, so if the identity is the only net, h is injective.) It is surjective since all states in B are reachable and B is refined.
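The refinement (131) is again mechanical. The following sketch (my own names; it assumes a deterministic total automaton, where `trans q a` yields the unique successor, so the two conditions of (131) coincide) computes the largest net as an equivalence predicate:

```ocaml
(* Compute the largest net by refining ~0, as in (131). *)
let refine ~states ~alphabet ~final ~trans =
  (* ~0: q ~ q' iff both are final or both are non-final *)
  let block0 q = List.mem q final in
  let rec loop cls =
    (* q ~(i+1) q' iff q ~i q' and all a-successors are ~i *)
    let cls' q q' =
      cls q q'
      && List.for_all (fun a -> cls (trans q a) (trans q' a)) alphabet
    in
    let changed =
      List.exists
        (fun q -> List.exists (fun q' -> cls q q' && not (cls' q q')) states)
        states
    in
    if changed then loop cls' else cls
  in
  loop (fun q q' -> block0 q = block0 q')
```

Fusing the equivalence classes of the returned relation as in (130) then yields the minimal automaton.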

The other recipe is dual to Lemma 8. Put

(132a)  0‡ := ∅
(132b)  a‡ := {ε, a}
(132c)  ε‡ := {ε}
(132d)  (s ∪ t)‡ := s‡ ∪ t‡
(132e)  (st)‡ := {ut : u ∈ s‡} ∪ t‡
(132f)  (s∗)‡ := s‡ ∪ {us∗ : u ∈ s‡}

Given a regular expression s, we construct the machine based on the t ∈ s‡. The start symbol is s. The states are either 0 or of the form t for t ∈ s‡, and the accepting states are those t ∈ s‡ where ε ∈ t. Next we compute for given t whether the equivalence L(t) = L(t′) holds for some t′ ≠ t. If so, we add a new state t″: every arc into t″ is an arc that goes into t or into t′, and every arc leaving t″ is an arc that leaves t or t′. Then we erase t and t′. However, beware that it is not clear a priori, for any two given terms t, u, whether or not L(t) = L(u). So we need to know how we can effectively decide this problem.

Theorem 19 The following problems are decidable for given regular terms t, u: ‘L(t) = ∅’, ‘L(t) ⊆ L(u)’ and ‘L(t) = L(u)’.

The way to do this is as follows. We can assume that the automata are deterministic. Construct an automaton A such that L(A) = L(t) and an automaton B such that L(B) = A∗ − L(u). Then A × B recognizes L(t) ∩ (A∗ − L(u)). This is empty exactly when L(t) ⊆ L(u). Dually, we can construct an automaton that recognizes L(u) ∩ (A∗ − L(t)), which is empty exactly when L(u) ⊆ L(t). So, everything turns on the following question: can we decide, for any given automaton A = ⟨A, Q, i0, F, δ⟩, whether or not L(A) is empty? The answer is simple: this is decidable. To see how this can be done, notice that L(A) ≠ ∅ iff there is a word x such that i0 →ˣ q ∈ F. It is easy to see that if there is such a word, then there is a word whose length is ≤ |Q|. Now we just have to search through all words of this size.

Starting with s, we can therefore effectively compute the sets [x]_L. It now follows that the index machine can effectively be constructed.

Theorem 20 L is regular if and only if it has only finitely many index sets.
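The decision recipe of Theorem 19 — a product with a complement, then an emptiness test by reachability — can be sketched like this for total deterministic automata; the tuple encoding is again my own assumption.

```python
# Decide L(A) subset-of L(B) by testing L(A) ∩ (Σ* − L(B)) for emptiness.
# A DFA is assumed to be a tuple (alphabet, delta, init, finals) with a
# total transition dict delta.

def product_minus(A, B):
    alph, dA, iA, fA = A
    _, dB, iB, fB = B
    init = (iA, iB)
    states, todo, delta = {init}, [init], {}
    while todo:
        p, q = todo.pop()
        for a in alph:
            r = (dA[(p, a)], dB[(q, a)])
            delta[((p, q), a)] = r
            if r not in states:
                states.add(r)
                todo.append(r)
    # accepting: A accepts, the complement of B accepts too
    finals = {(p, q) for (p, q) in states if p in fA and q not in fB}
    return states, delta, init, finals

def is_empty(states, delta, init, finals):
    seen, todo = {init}, [init]
    while todo:
        q = todo.pop()
        for (p, a), r in delta.items():
            if p == q and r not in seen:
                seen.add(r)
                todo.append(r)
    return not (seen & finals)

def subset(A, B):
    return is_empty(*product_minus(A, B))

def equivalent(A, B):
    return subset(A, B) and subset(B, A)
```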

14 Digression: Time Complexity

The algorithm that decides whether or not two automata accept the same language, or whether the language accepted by an automaton is empty, takes exponential time. All it requires us to do is to look through the list of words that are of length ≤ n and check whether they are accepted. If A has α many letters, there are αⁿ words of length n, and

(133)  1 + α + α² + … + αⁿ = (α^(n+1) − 1)/(α − 1)

many words of length ≤ n. Once again, let us simplify a little bit. For simplicity we assume that α = 2. Then (2^(n+1) − 1)/(2 − 1) = 2^(n+1) − 1. Say that the number of steps we need is 2ⁿ. Here is how fast the number of steps grows.

(134)
n    1   2   3   4    5    6    7    8    9    10
2ⁿ   2   4   8   16   32   64   128  256  512  1024

n    11    12    13    14     15     16     17      18      19      20
2ⁿ   2048  4096  8192  16384  32768  65536  131072  262144  524288  1048576

Suppose that your computer is able to compute 1 million steps in a second. Then for n = 10 the number of steps is 1024. It only takes a millisecond (1/1000 of a second). For n = 20, it is 1,048,576; this time you need 1 second. For n = 30 the time needed is 18 minutes. Another 10 added, and you can wait for months already. Thus, for each 10 you add in size, the number of steps multiplies by more than 1000! Given that reasonable applications in natural language require several hundreds of states, you can imagine that your computer might not even be able to come up with an answer in your lifetime. Thus, even with problems that seem very innocent, we may easily run out of time. However, it is not necessarily so that the operations we make the computer perform are really needed. Maybe there is a faster way to do the same job. Indeed, the problem just outlined can be solved much faster than the algorithm just shown would lead one to believe.

The algorithm described above is not optimal. Here is another algorithm. Call a state q n-reachable if there is a word of length ≤ n such that i0 →ˣ q. Call q reachable if it is n-reachable for some n ∈ ω. L(A) is nonempty if some q ∈ F is reachable. Now, denote by Rn the set of n-reachable states. R0 := {i0}. In the nth step we compute Rn+1 by taking all successors of points in Rn. If Rn+1 = Rn, we are done. (Checking this takes |Q| steps.) Otherwise Rn grows by at least 1 at each step; since Rn is a subset of Q, it can never have more elements than Q, so this can happen only |Q| − 1 times. Alternatively, since n ≤ |Q|, we need to iterate the construction at most |Q| − 1 times, for then Rn has at least |Q| many elements. Each step takes |Rn| × |A| ≤ |Q| × |A| steps. So, we need c · |A||Q|² steps for some constant c.

Consider, then, an algorithm that takes n² steps.

(135)
n    1   2   3   4    5    6    7    8    9    10
n²   1   4   9   16   25   36   49   64   81   100

n    11   12   13   14   15   16   17   18   19   20
n²   121  144  169  196  225  256  289  324  361  400

For n = 10 it takes 100 steps, or a tenth of a millisecond. For n = 20 it takes 400, and for n = 30 only 900, roughly a millisecond. Only if you double the size of the input does the number of steps quadruple. Or, if you want the number of steps to grow by a factor of 1024, you have to multiply the length of the input by 32! An input of even a thousand creates a need of only one million steps and takes a second on our machine.

There is an algorithm that is even faster. It goes as follows. We start with R₋₁ := ∅ and S0 := {q0}, and R0 := S0. In each step we have two sets, Si and Ri−1. Ri−1 is the set of (i − 1)-reachable points, and Si is the set of points that can be reached from it in one step, but which are not in Ri−1. So S1 is the set of successors of S0 minus R0. In the next step we let Si+1 be the set of successors of Si which are not in Ri := Ri−1 ∪ Si.
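The frontier idea just described can be sketched as follows; the successor map `succ` is an assumed encoding of the transition graph, not notation from the text.

```python
# Emptiness test by reachability: R collects the reachable states, S is
# the frontier; only the freshly added states are expanded, so the
# successors of each state are computed only once.

def reachable(succ, q0):
    R = {q0}
    S = {q0}
    while S:
        S = {r for q in S for r in succ.get(q, ())} - R
        R |= S
    return R

def language_nonempty(succ, q0, finals):
    # L(A) is nonempty iff some accepting state is reachable
    return bool(reachable(succ, q0) & set(finals))
```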

It is easily seen that this algorithm computes the successors of a point only once. That is its advantage: it does not recompute the successors of a given point over and over; instead, only the successors of points that have recently been added are calculated. Thus, this algorithm is not even quadratic but linear in |Q|. And if the algorithm is linear — that is much better.

It is not always the case that algorithms can be made as fast as this one. The problem of satisfiability of a given propositional formula is believed to take exponential time — more exactly, it is in NP — though it may take polynomial time in many cases. Regardless of whether an algorithm takes a lot of time or not, it is wise to try and reduce the number of steps that it actually takes. This can make a big difference when the application has to run on real life examples. Of course, the machine needs to take a look at the input, and this takes linear time at least; you cannot get less than that if the entire input matters. Often, the best algorithms are not so difficult to understand — one just has to find them.

Here is another problem. Suppose you are given a list of numbers. Your task is to sort the list in ascending order as fast as possible. Here is the algorithm that one would normally use: scan the list for the least number and put it at the beginning of a new list; then scan the remainder for the least number and put that behind the number you already have; and so on. To find the minimal element in L you do the following. You need two memory cells, C and P. C initially contains the first element of the list, and P := 1. Now take the second element and see whether it is smaller than the content of C. If so, you put it into C and let P be the number of the cell; if not, you leave C and P as is. You need |L| − 1 comparisons, as you have to go through the entire list. When you are done, P tells you which element to put into M. Now you start again. In step n you have your original list L, reduced by n − 1 elements, and a list M containing n − 1 elements. The number of comparisons is overall

(136)  (|L| − 1) + (|L| − 2) + … + 2 + 1 = |L| · (|L| − 1)/2

This number grows quadratically. Not bad — but typically lists are very long, so we should try the best we can. Here is another algorithm. Divide the list into blocks of two. (There might be a remainder of one element, but that does no harm.) We order these blocks. This takes just one comparison between the two elements of the block. In the next step we merge two blocks of two into one block of four. In the third step we merge two blocks of four into one block of eight, and so on.
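The quadratic algorithm above — the C/P scan repeated once per element — can be sketched in code; the names C, P and M follow the text, everything else is an assumption of the sketch.

```python
# Selection sort: repeatedly scan for the least element (|L| - 1
# comparisons per scan) and move it into the sorted list M.

def find_min(L):
    C, P = L[0], 0          # C: least value so far, P: its cell number
    for i in range(1, len(L)):
        if L[i] < C:        # one comparison per remaining element
            C, P = L[i], i
    return C, P

def selection_sort(L):
    L, M = list(L), []
    while L:
        C, P = find_min(L)
        M.append(C)         # P tells us which cell to remove from L
        del L[P]
    return M
```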

How often do we have to merge? This number is log₂ n, that is, the number x such that 2ˣ = n. In the first step we have n/2 many blocks, each ordered with a single comparison. The next step takes 3n/4 comparisons, the third step needs 7n/8, and so on. Let us round the numbers somewhat: each time we certainly need less than n comparisons. So we need in total n log₂ n many steps.

How many comparisons are needed to merge two ordered lists M and N of length m and n, respectively, into a list L of length m + n? The answer is: m + n − 1. The idea is as follows. We compare the first elements, M0 and N0. Then L0 is the smaller of the two, which is then removed from its list. The next element is again obtained by comparing the first elements of the lists, and so on. Each time we put an element into L we need just one comparison, except for the last element, which can be put in without further checking. For example, let M = [1, 3, 5] and N = [2, 4, 5]. We take the first elements, 1 and 2. The smaller one is put into L; the lists are now

(137)  M = [3, 5], N = [2, 4, 5], L = [1]

Now the smallest element is 2, and it is put into L:

(138)  M = [3, 5], N = [4, 5], L = [1, 2]

This is how the algorithm goes on:

(139)  M = [5], N = [4, 5], L = [1, 2, 3]
(140)  M = [5], N = [5], L = [1, 2, 3, 4]
(141)  M = [], N = [5], L = [1, 2, 3, 4, 5]
(142)  M = [], N = [], L = [1, 2, 3, 4, 5, 5]

If we want to avoid repetitions, then we need to check each element against the last member of the list before putting it in (this increases the number of checks by n + m − 1). We show n log₂ n in comparison to n and n²/2:

(143)
n          2⁰ = 1   2⁴ = 16   2⁸ = 256   2¹² = 4096   2¹⁶ = 65536
n log₂ n   0        64        2048       49152        1048576
n²/2       1/2      128       32768      8388608      2147483648
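The merge step and the resulting n log₂ n sort can be sketched like this:

```python
# Merging two ordered lists costs at most m + n - 1 comparisons: one per
# element placed into L, none for the leftover tail.

def merge(M, N):
    M, N, L = list(M), list(N), []
    while M and N:
        if M[0] <= N[0]:
            L.append(M.pop(0))
        else:
            L.append(N.pop(0))
    return L + M + N        # the tail goes in without further checking

def merge_sort(L):
    if len(L) <= 1:
        return list(L)
    mid = len(L) // 2
    return merge(merge_sort(L[:mid]), merge_sort(L[mid:]))
```

On the worked example this reproduces (137)–(142): merge([1, 3, 5], [2, 4, 5]) yields [1, 2, 3, 4, 5, 5].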

Consider again your computer. On an input of length 65536 (= 2¹⁶) it takes one second under the algorithm just described, while the naive algorithm would require it to run for over 2000 seconds, which is more than half an hour.

In practice, one does not want to spell out in painful detail how many steps an algorithm consumes. First, it is not clear how much time an individual step takes, so that the time consumption cannot really be measured in seconds (which is what is really of interest for us); if tomorrow computers can compute twice as fast, everything runs in shorter time. Instead, a simplifying notation is used. One writes that a problem is in O(n) if there is a constant C such that from some n₀ on, for an input of length n the algorithm takes at most C · n steps to compute the solution. (One says that the estimate holds for ‘almost all’ inputs if it holds only from a certain point onwards.) This makes calculations much simpler. Notice that O(bn + a) = O(bn) = O(n). It is worth understanding why. First, assume that n ≥ a. Then (b + 1)n ≥ bn + n ≥ bn + a. This means that for almost all n: (b + 1)n ≥ bn + a, so the problem is in O((b + 1)n). Next, O((b + 1)n) = O(n), since O((b + 1)n) effectively means that there is a constant C such that for almost all n the complexity is ≤ C(b + 1)n. Now put D := C(b + 1). Then there is a constant (namely D) such that for almost all n the complexity is ≤ Dn. Hence the problem is in O(n). Also, O(cn² + bn + a) = O(n²), and so on. In general, the highest exponent wins by any given margin over the others. Polynomial complexity is therefore measured only in terms of the leading exponent.

15 Finite State Transducers

Finite state transducers are similar to finite state automata. You can think of them as finite state automata that leave a trace of their actions in the form of a string. However, the more popular way is to think of them as translation devices with finite memory. A finite state transducer is a sextuple

(144)  T = ⟨A, B, Q, i0, F, δ⟩

where A and B are alphabets, Q a finite set (the set of states), i0 the initial state, F the set of final states, and

(145)  δ ⊆ Aε × Q × Bε × Q

(Here, Aε := A ∪ {ε} and Bε := B ∪ {ε}.) We write q –a:b→ q′ iff ⟨a, q, b, q′⟩ ∈ δ. We say in this case that T makes a transition from q to q′ given input a, and that it outputs b. It is not hard to show that a finite state transducer from A to B is equivalent to a finite state automaton over pairs. Namely, put

(146)  Ā := Aε × Bε

Now define

(147)  ⟨u, v⟩ · ⟨x, y⟩ := ⟨ux, vy⟩

Similarly, it is possible to define this notation for strings: if q –u:v→ q′ and q′ –x:y→ q″, then we write q –ux:vy→ q″. Then one can show that q –x:y→ q′ in T if and only if q →⟨x,y⟩ q′ in the corresponding automaton A over Ā. Thus, the theory of finite state automata can be used here.

Transducers have been introduced as devices that translate languages. We shall see many examples below; here, we shall indicate a simple one. Suppose that A := {a, b, c, d} and B := {0, 1}. We want to translate words over A into words over B in the following way: a is translated by 00, b by 01, c by 10 and d by 11. Here is how this is done. The set of states is {0, 1, 2}. The initial state is 0, and the accepting states are {0}. The transitions are

(148)  ⟨a, 0, 0, 1⟩, ⟨b, 0, 0, 2⟩, ⟨c, 0, 1, 1⟩, ⟨d, 0, 1, 2⟩,
       ⟨ε, 1, 0, 0⟩, ⟨ε, 2, 1, 0⟩

This means the following. Upon input a, the machine enters state 1 and outputs the letter 0. It is not in an accepting state, so it has to go on. There is only one way: it reads no input, returns to state 0, and outputs the letter 0. Now it is back to where it was: it has read the letter a and mapped it to the word 00. Similarly, it maps b to 01, c to 10 and d to 11. Let us write R_T for the following relation:

(149)  R_T := {⟨x, y⟩ : i0 –x:y→ q ∈ F}

Then, for every word x, put

(150)  R_T(x) := {y : x R_T y}
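The coding transducer (148) happens to be deterministic, so it can be run directly; the dictionary encoding below is an assumption of this sketch.

```python
# Transitions as in (148): keys are (input letter, or None for epsilon,
# and the current state); values are (output letter, next state).
# State 1 still owes a 0, state 2 still owes a 1; 0 is the only
# accepting state.

DELTA = {('a', 0): ('0', 1), ('b', 0): ('0', 2),
         ('c', 0): ('1', 1), ('d', 0): ('1', 2),
         (None, 1): ('0', 0), (None, 2): ('1', 0)}

def translate(word):
    q, out = 0, []
    for ch in word:
        o, q = DELTA[(ch, q)]
        out.append(o)
        while (None, q) in DELTA:   # take the epsilon-input moves
            o, q = DELTA[(None, q)]
            out.append(o)
    assert q == 0                   # must end in the accepting state
    return ''.join(out)
```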

For a language S ⊆ A∗,

(151)  R_T[S] := {y : there exists x ∈ S : x R_T y}

This is the set of all strings over B that are the result of translation via T of a string in S. Notice that in the example above, every string over A has a translation: for all words x, R_T(x) ≠ ∅. However, not every string over B is the translation of a string; a string over B is such a translation if and only if it has even length.

The translation need not be unique, either; the translation of a given input can be highly underdetermined. Here is a finite state machine that translates a into bc∗:

(152)  A := {a}, B := {b, c}, Q := {0, 1}, i0 := 0, F := {1},
       δ := {⟨a, 0, b, 1⟩, ⟨ε, 1, c, 1⟩}

This automaton takes as only input the word a; it outputs any of b, bc, bcc, and so on. In the present case we have the following:

(153)  R_T[S] = bc∗ if a ∈ S, and ∅ otherwise.

This is because only the string a has a translation; for all other words x, R_T(x) = ∅.

The transducer can also be used the other way: then it translates words over B into words over A. We use the same machine, but now we look at the relation

(154)  R_T˘ := {⟨y, x⟩ : x R_T y}

(This is the converse of the relation R_T.) Similarly we define

(155)  R_T˘(y) := {x : x R_T y}
(156)  R_T˘[S] := {x : there exists y ∈ S : x R_T y}

Notice that there is a transducer U such that R_U = R_T˘. We quote the following theorem.

Theorem 21 (Transducer Theorem) Let T be a transducer from A to B, and let L ⊆ A∗ be a regular language. Then R_T[L] is regular as well.

Notice, however, that the transducer can be used to translate any language, not just a regular one. In fact, the image of a context free language under R_T can be shown to be context free as well.

Let us observe that the construction above can be generalized. Let A and B be alphabets, and f a function assigning to each letter of A a word over B. We extend f to words over A as follows:

(157)  f(x · a) := f(x) · f(a)

The translation function f can be effected by a finite state machine in the following way. The initial state is i0. On input a the machine goes into state q_a, outputting f(a); then it returns to i0 in one or several steps, outputting ε, and it is ready to take the next input.

16 Finite State Morphology

One of the most frequent applications of transducers is in morphology. Practically all morphology is finite state. This means the following: there is a finite state transducer that translates a gloss (= deep morphological analysis) into surface morphology. For example, there is a simple machine that puts English nouns into the plural. We can also write a machine that takes a deep representation, such as car plus singular or car plus plural, and outputs car in the first case and cars in the second. For this machine, the input alphabet has two additional symbols, one for singular and one for plural (written ⟨sg⟩ and ⟨pl⟩ here). The machine takes the input and repeats it, and works as follows. It has two states, 0 and 1; 0 is initial, 1 is final. It copies every letter; at the end it accepts ⟨sg⟩ or ⟨pl⟩, transforming the former into ε and the latter into s. The transitions are as follows (we only use lower case letters for ease of exposition):

(158)  ⟨a, 0, a, 0⟩, ⟨b, 0, b, 0⟩, …, ⟨z, 0, z, 0⟩
(159)  ⟨⟨sg⟩, 0, ε, 1⟩, ⟨⟨pl⟩, 0, s, 1⟩

This machine accepts the singular or plural symbol at the end and transforms it into ε in the case of ⟨sg⟩, and into s otherwise. We shall later see how we can deal with the full range of plural forms, including exceptional plurals.
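A sketch of the toy pluralizer (158)–(159) run in both directions; the strings standing in for ⟨sg⟩ and ⟨pl⟩ below are ad hoc choices of this illustration.

```python
# Forward: spell out root + number marker. Backward (the machine turned
# around): list all deep forms, since a final s may be the plural sign
# or the last letter of the root.

SG, PL = '<sg>', '<pl>'

def generate(deep):
    if deep.endswith(SG):
        return deep[:-len(SG)]
    if deep.endswith(PL):
        return deep[:-len(PL)] + 's'
    raise ValueError('missing number marker')

def analyze(surface):
    deep = {surface + SG}
    if surface.endswith('s'):
        deep.add(surface[:-1] + PL)
    return deep
```

analyze('arms') returns both arm⟨pl⟩ and arms⟨sg⟩ — exactly the ambiguity discussed next.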

As explained above, we can turn the machine around. Then it acts as a map from surface forms to deep forms. It will translate arm into arm⟨sg⟩, and arms both into arm⟨pl⟩ and into arms⟨sg⟩.

Now, let us make the machine more sophisticated. The plural is not always formed by adding just s. If the word ends in s, the plural is obtained by adding ses: busses, plusses. When the word ends in sh, the plural is formed by adding es: bushes, splashes. We can account for this as follows. The machine will take the input and end in three different states, according to whether the word ends in s, in sh, or in something else. The transitions are of the following kind:

(160)  ⟨a, 0, a, 4⟩, …, ⟨s, 0, s, 2⟩, …, ⟨h, 2, h, 3⟩, …

From state 2 (last letter s) the plural symbol is spelled out as ses, from state 3 (last letters sh) as es, and from state 4 as s. This does not exhaust the actual spelling rules for English, but it should suffice.

Notice that the machine, when turned around, will analyze busses correctly as bus⟨pl⟩ — and also as busses⟨sg⟩. The latter may be surprising, but the machine has no idea about the lexicon of English. It assumes that s can be either the sign of the plural or the last letter of the root, and the word bus has as its last letter indeed s. The mistake is due to the fact that the machine does not know that busses is no basic word of English. Suppose we want to implement that kind of knowledge into the machine. Then what we would have to do is write a machine that can distinguish an English word from a nonword. Such a machine probably requires very many states. It is probably no exaggeration to say that several hundreds of states will be required. This is certainly the case if we take into account that certain nouns form the plural differently: we only mention formulae (from formula), indices (from index), tableaux (from tableau), and then men, mice, children, oxen, sheep.

Here is another task. In Hungarian, case suffixes come in different forms. For example, the dative is formed by adding nak or nek. The form depends on the following factor: if the root contains a back vowel (a, o, u), then the suffix is nak; otherwise it is nek. (So: dob (‘drum’) dobnak (‘for a drum’), szeg (‘nail’) szegnek (‘for a nail’).)
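The harmony rule for the dative can be sketched in a few lines. Long vowels, front rounded vowels and the many real exceptions are deliberately ignored in this illustration.

```python
# nak after a root containing a back vowel (a, o, u), nek otherwise.
# The choice is a single bit of information about the root, which is
# why one extra bit of state suffices in a transducer.

BACK = set('aou')

def dative(root):
    return root + ('nak' if any(ch in BACK for ch in root) else 'nek')
```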

The comitative works similarly: dobbal (‘with a drum’), szeggel (‘with a nail’). The comitative suffix is another special case: when added, it becomes a sequence of consonant plus al or el (the choice of vowel depends in the same way as that of nak versus nek). The consonant is v if the root ends in a vowel; otherwise it is the same as the preceding one. To write a finite state transducer, we need to record in the state two things: whether or not the root contained a back vowel, and what consonant the root ended in. Translated into a finite state machine, this means that we end up with a machine of several hundreds of states. (On top of that, there are a handful of words that end in z (ez ‘this’, az ‘that’) where the final z assimilates to the following consonant (ennek ‘to this’), except in the comitative, where we have ezzel ‘with this’.)

Another area where a transducer is useful is in writing conventions. In Hungarian, the palatal sound [dj] is written gy. When this sound is doubled it becomes ggy and not, as one would expect, gygy. The word hegy with the comitative should by the above rule become hegygyel, but the orthography dictates heggyel. Moreover, the spelling gets undone in hyphenation: you write hegy-⟨newline⟩gyel. Thus, the following procedure suggests itself. First, we define a machine that regularizes the orthography by reversing the conventions just shown; this machine translates heggyel into hegygyel. Next, we take this as input to a second machine which produces the deep morphological representations. (Actually, it is not necessary that gy is treated as a digraph: we can define a new alphabet in which gy is written by a single symbol.)

In English, final y changes to i when a vowel follows: happy : happier, fly : flies. Plural in German is a jungle. There are many ways in which the plural can be formed: suffix s, suffix er, suffix en, Umlaut — which is the change (graphically) from a to ä, from o to ö and from u to ü of the last vowel — and combinations thereof. Actually, there is no way to predict phonologically which word will take which plural; we have to be content with a word list.

We close with an example from Egyptian Arabic. Like in many Semitic languages, roots consist only of consonants. Typically, they have three consonants, for example ktb ‘to write’ and drs ‘to study’. Words are made by adding some material in front (prefixation), some material after (suffixation) and some material in between (infixation). Typically, all these happen at the same time.

Let us look at the following list.

(161)
[katab]   ‘he wrote’   [daras]   ‘he studied’
[Pamal]   ‘he did’     [na\al]   ‘he copied’
[baktib]  ‘I write’    [badris]  ‘I study’
[baPmil]  ‘I do’       [ban\il]  ‘I copy’
[iktib]   ‘write!’     [idris]   ‘study!’
[iPmil]   ‘do!’        [in\il]   ‘copy!’
[kaatib]  ‘writer’     [daaris]  ‘studier’
[Paamil]  ‘doer’       [naa\il]  ‘copier’
[maktuub] ‘written’    [madruus] ‘studied’
[maPmuul] ‘done’       [man\uul] ‘copied’

Now, we want a transducer to translate katab into a sequence ktb plus 3rd, singular plus past. And it shall translate baktib into ktb plus 1st, singular plus present. Similarly with the other roots. The transducer will take the form CaCaC and translate it into CCC plus the markers 3rd, singular and past. Conversely, given a specific output, one can reverse this and ask the transducer to spell out the form that corresponds to CCC plus 3rd, singular, past — namely CaCaC. And so on.

17 Using Finite State Transducers

Transducers can be nondeterministic, and this nondeterminism we cannot always eliminate. One example is the following machine. It has two states, 0 and 1; 0 is initial, and 1 is accepting. It has the transitions 0 –a:b→ 1 and 1 –ε:b→ 1. This machine accepts a as input, together with any nonempty output string of b's. What runs does this machine have? Even given input a there are infinitely many runs:

(162)
0 –a:b→ 1
0 –a:b→ 1 –ε:b→ 1
0 –a:b→ 1 –ε:b→ 1 –ε:b→ 1
0 –a:b→ 1 –ε:b→ 1 –ε:b→ 1 –ε:b→ 1
…

In addition, sometimes there are several runs on one given pair. Here is an example. Again two states, 0 and 1, and the following transitions: 0 –a:b→ 1, 1 –a:ε→ 1, 1 –ε:b→ 1.
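Two of the templates in (161) can be matched with simple patterns. The gloss labels and the use of regular expressions are my own illustration here, not the transducer itself.

```python
# CaCaC -> root + 3rd sg past; baCCiC -> root + 1st sg present.
import re

PATTERNS = [
    (re.compile(r'^(\w)a(\w)a(\w)$'), '+3rd.sg.past'),
    (re.compile(r'^ba(\w)(\w)i(\w)$'), '+1st.sg.present'),
]

def analyze(form):
    for pattern, gloss in PATTERNS:
        m = pattern.match(form)
        if m:
            # the captured consonants are the root
            return ''.join(m.groups()) + gloss
    return None
```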

we want to give algorithms to answer the following questions. Given a transducer T and a pair x : y of strings, is there a run of T that accepts this pair? (That is to say: is ⟨x, y⟩ ∈ R_T?) Given x, is there a string y such that ⟨x, y⟩ ∈ R_T? Given x, is there a way to enumerate R_T(x), that is to say, all y such that ⟨x, y⟩ ∈ R_T? Let us do these questions in turn.

In the case of the machine just shown, for the pair aaa and bbbb there are several runs:

(163)
0 –a:b→ 1 –a:ε→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –ε:b→ 1
0 –a:b→ 1 –a:ε→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1
0 –a:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1
0 –a:b→ 1 –ε:b→ 1 –a:ε→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1

(There are more runs than shown.) The number of runs grows with the size of the input/output pair, so there is no hope of fixing the number of runs a priori. There are two ways to deal with this nondeterminism: we follow a run to the end and backtrack on need (depth first search), or we store all partial runs and in each cycle try to extend each run if possible (breadth first search). Let us look at them in turn.

Let us take the second machine. Assume that the transitions are numbered:

(164)  t(0) = ⟨a, 0, b, 1⟩
(165)  t(1) = ⟨a, 1, ε, 1⟩
(166)  t(2) = ⟨ε, 1, b, 1⟩

This numbering is needed to order the transitions. Let input aaa and output bbbb be given. (‘Output’ is at present a bit of a misnomer, because it is part of the input to our problem.) Initially, no part of input or output is read. We start with the initial state 0. Now we want to look for a transition that starts at the state we are in and matches the next characters of the input and output strings.

There is only one possibility: the transition t(0). This results in the following: we are now in state 1, and one character from both strings has been read. The conjunction of facts — (i) the machine is in state 1, (ii) it has read one character from the input, (iii) it has read one character from the output — we call a configuration. We visualize this configuration by 1, a|aa : b|bbb. The bar is placed right after the characters that have been consumed or read.

Next we face a choice: either we take t(1) and consume the input letter a but no output letter, or we take t(2) and consume the output letter b but no input letter. Faced with this choice, we take the transition with the least number, so we follow t(1). The configuration is now this: 1, aa|a : b|bbb. Now we face the same choice again. We decide in the same way, choosing t(1), giving 1, aaa| : b|bbb. Now no more choice exists: we have consumed the input, and we can only proceed with t(2), giving 1, aaa| : bb|bb. One more time we do t(2), giving 1, aaa| : bbb|b. One last time we do t(2), giving 1, aaa| : bbbb|, and we are done. This is exactly the first of the runs:

(167)  0 –a:b→ 1 –a:ε→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –ε:b→ 1

Now, backtracking can be done for any reason. In this case the reason is: we want to find another run. To do that, we go back to the last point where there was an actual choice; more precisely, it is the last choice point within the run we are currently considering that we look at. This was the configuration 1, aa|a : b|bbb, where we were able to choose between t(1) and t(2), and we chose t(1). This time we decide differently: now we take t(2). It leads to the situation 1, aa|a : bb|bb. From this moment we have again a choice, but now we fall back on our typical recipe: choose t(1) whenever possible. This gives the second run. If again we backtrack, the last choice point is now the configuration 1, aa|a : bb|bb, where we chose t(1); we now choose t(2) instead. This gives the third run above. Here is how backtracking continues to enumerate the runs:

(168)
0 –a:b→ 1 –a:ε→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1
0 –a:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1
0 –a:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –ε:b→ 1 –a:ε→ 1
0 –a:b→ 1 –ε:b→ 1 –a:ε→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1
0 –a:b→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1
0 –a:b→ 1 –ε:b→ 1 –a:ε→ 1 –ε:b→ 1 –ε:b→ 1 –a:ε→ 1
0 –a:b→ 1 –ε:b→ 1 –ε:b→ 1 –a:ε→ 1 –a:ε→ 1 –ε:b→ 1

In general, backtracking works as follows. Every time we face a choice, we pick the available transition with the smallest number, and note the result. If we backtrack, we go back in the run to the last point where there was a choice, and then we change the previously chosen transition to the next one up. (There is no such point if, going back, we cannot find any point at which there was a choice — because you can never choose a lower number than you had chosen.) From that point on we consistently choose the least numbers again. This actually enumerates all runs. Important in backtracking is that it does not give the solutions in one blow: it gives us one at a time. We ask for one solution, and then backtracking gives us the next, if there is one; if there is none, it will tell us. We need to memorize only the last run. It is sometimes a dangerous strategy, since we need to guarantee that the run we are following actually terminates. This is no problem here, since every step consumes at least one character.

Now let us look at the other strategy. It consists in remembering partial runs and trying to complete them. A partial run on a pair of strings simply is a sequence of transitions that successfully maps the machine into some state, consuming some part of the input string and some part of the output string. We do not require that the partial run is part of a successful run — otherwise we would need to know what we are about to find out: whether any of the partial runs can be completed.

We start with a single partial run: the empty run, no action taken. In the first step we list all transitions that extend this by one step. There is only one, t(0). In the next round we try to extend the result by one more step. From here we can do two things: t(1) or t(2). We do both, and note the result:

(169)  Step 2
t(0), t(1) : 1, aa|a : b|bbb
t(0), t(2) : 1, a|aa : bb|bb

In each of the two, we can perform either t(1) or t(2):

(170)  Step 3
t(0), t(1), t(1) : 1, aaa| : b|bbb
t(0), t(1), t(2) : 1, aa|a : bb|bb
t(0), t(2), t(1) : 1, aa|a : bb|bb
t(0), t(2), t(2) : 1, a|aa : bbb|b

In the fourth step we try to extend each of these runs in the same way, if possible.
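The depth first enumeration can be written as a recursive generator that always tries the lower-numbered transition first, which reproduces the run order of the text; the encoding of transitions is an assumption of this sketch.

```python
# t(0): 0 -a:b-> 1, t(1): 1 -a:eps-> 1, t(2): 1 -eps:b-> 1,
# as quadruples (input, state, output, next state); '' plays epsilon.
TRANS = [('a', 0, 'b', 1), ('a', 1, '', 1), ('', 1, 'b', 1)]
FINAL = {1}

def runs(x, y, state=0, i=0, j=0, run=()):
    """Yield all runs on the pair x : y as tuples of transition numbers."""
    if i == len(x) and j == len(y) and state in FINAL:
        yield run
    for n, (a, p, b, q) in enumerate(TRANS):
        # a transition applies if it starts at the current state and its
        # labels match the unread parts of x and y
        if p == state and x.startswith(a, i) and y.startswith(b, j):
            yield from runs(x, y, q, i + len(a), j + len(b), run + (n,))
```

On aaa : bbbb the first run produced is (167), that is, t(0) t(1) t(1) t(2) t(2) t(2), and there are ten runs in all.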

In the fourth step, we have no choice in the first line: we have consumed the input, so we now need to choose t(2). In the other cases we have both options still.

(171) Step 4: the seven runs end in only four distinct configurations, namely aaa| : bb|bb, aaa| : bbb|b, aa|a : bbb|b and a|aa : bbbb| (the full table of runs and their transition sequences t(0), t(1), t(2) is not recoverable in this copy).

And so on. In Step 3 we had 4 runs and 3 different end situations; in Step 4 we have 7 runs, but only 4 different end configurations. It should be noted that remembering each and every run is actually excessive, and removing this excess can be vital for computational reasons. If two runs terminate in the same configuration (state plus pair of read strings), they can be extended in the same way. So we do not keep a record of the run, only of its resulting configuration. Thus we can improve the algorithm once more as follows. In each step we just calculate the possible next configurations for the situations that have recently been added, trying to complete them. At the end, the algorithm checks whether there is a configuration of the form ⟨q, |x|, |y|⟩, where q ∈ F (q is accepting).

What we are computing is the reachability relation in the set of configurations. A configuration is a triple ⟨q, i, k⟩, where q is a state of the machine, i is the position of the character to the right of the bar on the input string, and k the position of the character to the right of the bar on the output string. On ⟨x, y⟩ there are |Q| · |x| · |y| many such triples, and all the algorithm does is compute which ones are accessible from ⟨0, 0, 0⟩. Each step will advance the positions by at least one, so we are sure to make progress. Thus the algorithm takes |Q| · |x| · |y| time, that is, time proportional in both lengths. One might be inclined to think that the first algorithm is faster because it goes into the run very quickly. However, it is possible to show that if the input is appropriately chosen, the number of runs initially shows exponential growth, while the number of configurations is quadratic in the length of the smaller of the strings. (Recall here the discussion of Section 13.)
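The configuration-reachability idea can be sketched in a few lines. The toy transducer below (which nondeterministically maps each a to b or to bb) is my own example, not the machine from the text; only the algorithm follows the description above.

```python
from collections import deque

# A transition is (source state, input symbol, output string, target state).
# Hypothetical example machine: each a is copied as b or as bb.
TRANS = [
    (0, "a", "b", 0),
    (0, "a", "bb", 0),
]
FINAL = {0}

def reachable(x, y, trans=TRANS, final=FINAL, start=0):
    """Compute all configurations <q, i, k> reachable from <start, 0, 0>,
    and whether some accepting <q, |x|, |y|> is among them."""
    seen = {(start, 0, 0)}
    queue = deque(seen)
    while queue:
        q, i, k = queue.popleft()
        for src, a, out, tgt in trans:
            if src != q:
                continue
            ni, nk = i + len(a), k + len(out)
            # The transition must match the next piece of input and output.
            if x[i:ni] != a or y[k:nk] != out:
                continue
            c = (tgt, ni, nk)
            if c not in seen:
                seen.add(c)
                queue.append(c)
    accept = any(q in final and i == len(x) and k == len(y)
                 for q, i, k in seen)
    return seen, accept
```

Since there are at most |Q| · (|x|+1) · (|y|+1) configurations and each is processed once, the running time is as stated above.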

(Nevertheless, the possible runs can be compactly represented and then converted into an output string.) Now we look at a different problem: given an input string, what are the possible output strings? We have already seen that there can be infinitely many, so one might need a regular expression to list them all. Let us not do that here. Let us look instead into ways of finding at least one output string. Basically, the algorithm we propose is this: try to find a run that matches the input string and then look at the output string it defines. The algorithm to find a run for the input string can be done as above, using depth first or breadth first search; this can be done in quadratic time. However, on certain examples the depth first algorithm is excessively slow: it might take exponential time to reach an answer. (Sometimes, on the other hand, it might be much faster than the breadth first algorithm.)

First, an artificial example. The automaton has the following transitions.

(172)  0 −a:c→ 1    1 −a:c→ 1    0 −a:d→ 2    2 −a:d→ 2    2 −b:ε→ 3

0 is initial, and 1 and 3 are accepting. This machine has as input strings all strings of the form a+b?, and the relation it defines is this:

(173) { ⟨an, cn⟩ : n > 0 } ∪ { ⟨anb, dn⟩ : n > 0 }

The translation of each given string is unique. However, we cannot simply think of the machine as taking each character from the input and immediately returning the corresponding output character: whether a is translated by c or by d depends on the last character of the input! So we have to buffer the entire output until we are sure that the run we are looking at will actually succeed.

Let us look at a real problem. Suppose we design a transducer for Arabic. Let us say that the surface form katab should be translated into the deep form ktb+3Pf, but the form baktib into ktb+1Pr. Given: input string baktib. The first character is b. We face two choices: to consider it as part of a root (say, of a form badan to be translated as bdn+3Pf), or to consider it as the first letter of the prefix ba. So far there is nothing that we can do to eliminate either option, so we keep them both open. The configuration is now: ⟨0, |baktib⟩. No output yet. It is only when we have reached the fourth letter that the situation is clear: if this was a 3rd perfect form, we should see a.
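A depth first run finder with backtracking and output buffering, applied to the automaton (172), might look like the following sketch. The encoding of the transition table is my own assumption; the output is only committed once the whole run has succeeded, exactly because a may map to c or to d depending on the last input letter.

```python
# Transitions of automaton (172): state -> list of (input, output, target).
TRANS = {
    0: [("a", "c", 1), ("a", "d", 2)],
    1: [("a", "c", 1)],
    2: [("a", "d", 2), ("b", "", 3)],
}
FINAL = {1, 3}

def translate(x, state=0, pos=0):
    """Return one output string for input x, or None if no run succeeds.
    The recursion backtracks over the nondeterministic choices."""
    if pos == len(x):
        return "" if state in FINAL else None
    for a, out, tgt in TRANS.get(state, []):
        if x.startswith(a, pos):
            rest = translate(x, tgt, pos + len(a))
            if rest is not None:      # this branch succeeded
                return out + rest
    return None                        # all branches failed: backtrack
```

On input aaa the first branch (via state 1) succeeds immediately; on input aaab the c-branch fails only at the very end and the whole prefix must be retried via state 2, which is the buffering problem described above.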

18 Context Free Grammars

Definition 22 Let A be as usual an alphabet, and let N be a set of symbols, N disjoint with A. A context free rule over N as the set of nonterminal symbols and A the set of terminal symbols is a pair ⟨X, α⟩, where X ∈ N and α is a string over N ∪ A. We write X → α for the rule. A context free grammar is a quadruple ⟨A, N, Σ, R⟩, where Σ ⊆ N and R is a finite set of context free rules. The set Σ is called the set of start symbols. Often it is assumed that a context free grammar has just one start symbol, but in actual practice this is mostly not observed.

Notice that context free rules allow the use of nonterminals and terminals alike. The implicit convention we use is as follows. Symbols in print are terminal symbols; they are denoted using typewriter font. Nonterminal symbols are produced by enclosing an arbitrary string in brackets: < · · · >. A string of terminals is denoted by a Roman letter with a vector arrow; a mixed string containing both terminals and nonterminals is denoted by a Greek letter with a vector arrow. (The vector is generally used to denote strings.) If the difference is not important, we shall use Roman letters. Concatenation is not written: it is understood that symbols that follow each other are concatenated the way they are written down. The following are examples of context free rules.

(174) < digit > → 0
      < digit > → 1
      < number > → < digit >
      < number > → < number >< digit >

There are some conventions that apply to display the rules in a more compact form. The vertical slash '|' is used to merge two rules with the same left hand symbol. So, when X → α and X → γ are rules, you can write X → α | γ. Notice that one speaks of one rule in the latter case, but technically we have two rules now. This allows us to write as follows.

(175) < digit > → 0 | 1 | 2 | . . . | 9

And this stands for ten rules.
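The grammar (174)/(175) can be written down directly as data, together with a helper that replaces a single occurrence of a nonterminal by a right hand side. The representation of rules as pairs of a left hand symbol and a tuple is my own choice, not the text's.

```python
# Grammar (174)/(175): <digit> -> 0 | ... | 9, <number> -> <digit>,
# <number> -> <number><digit>. Strings are tuples of symbols.
RULES = [
    *[("digit", (d,)) for d in "0123456789"],
    ("number", ("digit",)),
    ("number", ("number", "digit")),
]
NONTERMINALS = {"digit", "number"}

def one_step(alpha):
    """All strings obtainable from alpha by replacing one occurrence
    of a nonterminal by the right hand side of a matching rule."""
    out = []
    for i, sym in enumerate(alpha):
        if sym in NONTERMINALS:
            for lhs, rhs in RULES:
                if lhs == sym:
                    out.append(alpha[:i] + rhs + alpha[i + 1:])
    return out
```

For example, one_step(("number",)) yields the two strings ("digit",) and ("number", "digit"), corresponding to the two < number >-rules.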

A context free grammar can be seen as a statement about sentential structure, or as a device to generate sentences. We begin with the second viewpoint. We write α ⇒ γ and say that γ is derived from α in one step if there is a rule X → η and a single occurrence of X in α such that γ is obtained by replacing that occurrence of X in α by η:

(176) α = σ X τ,    γ = σ η τ

For example, if the rule is < number > → < number >< digit > then replacing the occurrence of < number > in the string 1 < digit >< number > 0 will give 1 < digit >< number >< digit > 0. Now write α ⇒n+1 γ if there is a ρ such that α ⇒n ρ and ρ ⇒ γ; we say that γ is derivable from α in n + 1 steps. Using the above grammar we have:

(177) < number > ⇒4 < digit >< digit >< digit >< number >

Finally, α ⇒∗ γ (γ is derivable from α) if there is an n such that α ⇒n γ. If X ⇒∗ γ we say that γ is a string of category X. The language generated by G consists of all terminal strings that have category S ∈ Σ. Equivalently,

(178) L(G) := {x ∈ A∗ : there is S ∈ Σ : S ⇒∗ x}

You may verify that the grammar above generates all strings of digits from the symbol < number >. If, however, you take the symbol < digit > as the start symbol, then the language is just the set of all strings of length one. This is because even though the symbol < number > is present in the nonterminals and the rules, there is no way to generate it by applying the rules in succession from < digit >. If on the other hand Σ contains both symbols, then you get just the set of all strings of digits, since a digit also is a number. (A fourth possibility is Σ = ∅, in which case the language is empty.)

There is a trick to reduce the set of start symbols to just one. Introduce a new symbol S∗ together with the rules

(179) S∗ → S    for every S ∈ Σ.

This is a different grammar, but it derives the same strings. Notice that actual linguistics is different: one does not start from a single symbol but rather from different symbols. In natural language having a set of start

symbols is very useful. There are several basic types of sentences (assertions, questions, commands, and so on). These represent different sets of strings, and this type of information is essential. It is noted that in answers to questions almost any constituent can function as a complete utterance. To account for that, one would need to enter a rule of the kind above for each constituent. Rather than having to write new rules to encompass the use of these constituents, however, we can just place the burden on the start set, with one start symbol for each class of saturated utterance. In real grammars (GPSG, HPSG, for example), in certain circumstances what shifts is the set of start symbols; the rule set, however, remains constant.

So, let us talk about the second approach to CFGs, the analytic one. On the basis of a derivation, we assign constituents to the strings. A derivation is a full specification of the way in which a string has been generated. It is a sequence of strings, where each subsequent string is obtained by applying a rule to some occurrence of a nonterminal symbol. We need to specify in each step which occurrence of a nonterminal is targeted; we can do so by underlining it. First, we need to fix some notation.

Definition 23 Let x and y be strings. An occurrence of y in x is a pair ⟨u, v⟩ of strings such that x = u y v. We call u the left context of y in x and v the right context.

Now, here is a grammar.

(180) A → BA | AA | a
      B → AB | BB | b | bc | cb

Here is a derivation (the underlining marking the occurrence that is replaced in each step is not reproduced in this copy):

(181) A, BA, BBA, BBBA, BBBa, BcbBa, bcbBa, bcbbca

(I leave it to you to figure out how one can identify the rule that has been applied.) In the second step we could have applied the rule A → BA instead, yielding the same result. Thus the following also is a derivation:

(182) A, BA, BBA, BBBA, BBBa, BcbBa, bcbBa, bcbbca

Notice that without underlining the symbol which is being replaced, some information concerning the derivation is lost: written this way, (181) and (182) look alike.

It is very important to distinguish a substring from its occurrences in a string. For example, the string aa has three occurrences in aacaaa: ⟨ε, caaa⟩, ⟨aac, a⟩ and ⟨aaca, ε⟩. Notice that the occurrences are representatives of the strings, and they can represent the substring only in virtue of the fact that the larger string is actually given. Often we denote a particular occurrence of a substring by underlining it, or by other means. This way of talking recommends itself when the larger string is fixed and we are only talking about its substrings (which is the case in parsing).

In parsing it is very popular to assign numbers to the positions between (!) the letters. Every letter then spans two adjacent positions, and in general a substring is represented by a pair of positions ⟨i, j⟩ where i ≤ j. (If i = j we have an occurrence of the empty string.) Here is an example:

(183) 0 E 1 d 2 d 3 y 4

The pair ⟨1, 2⟩ represents the occurrence ⟨E, dy⟩ of the letter d. The pair ⟨2, 3⟩ represents the occurrence ⟨Ed, y⟩ of the letter d. The pair ⟨1, 4⟩ represents the (unique) occurrence of the string ddy.

Now, let us return to our derivation. We look at derivation (181). The derivation may start with any string α (it need not be a terminal string), but it is useful to think of the derivation as having started with a start symbol; let the final string be γ (again, γ need not be a terminal string, but it is useful to think that it is). The last string is bcbbca. The last step is the replacement of B in bcbBa by bc, showing us that the occurrence ⟨bcb, a⟩ of bc is a constituent of category B (since underlining is not available here, we enclose the occurrence in square brackets):

(184) bcb[bc]a

The step before that was the replacement of the first B by b. We go one step back and find that the first occurrence of b is a constituent of category B:

(185) [b]cbbca

The third last step was the replacement of the second B by cb. It is this replacement that means that the occurrence ⟨b, bca⟩ of cb is a constituent of category B. Then comes the replacement of A by a, so the occurrence ⟨bcbbc, ε⟩ of a is a constituent of category A. Now we get to the replacement of B by BB. The occurrence of BB in the string corresponds to the sequence of the occurrences ⟨b, bca⟩ of cb and ⟨bcb, a⟩ of bc. Their concatenation is cbbc. Thus, the new constituent that we get is

(186) b[cbbc]a
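Occurrences as context pairs (Definition 23) and as position pairs between the letters can be computed directly; here is a small sketch (the function names are mine).

```python
def occurrences(x, y):
    """All occurrences of y in x as pairs (left context, right context)."""
    occs = []
    for i in range(len(x) - len(y) + 1):
        if x[i:i + len(y)] == y:
            occs.append((x[:i], x[i + len(y):]))
    return occs

def positions(x, y):
    """The same occurrences as position pairs (i, j) with i <= j,
    numbering the positions between the letters."""
    return [(len(u), len(x) - len(v)) for u, v in occurrences(x, y)]
```

For instance, occurrences("aacaaa", "aa") returns the three context pairs listed above, and positions("Eddy", "d") returns the pairs (1, 2) and (2, 3) of example (183).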

which is of category B. In the last step we get that the entire string is of category A under this derivation. Basically, each step in a derivation adds a constituent to the analysis of the final string. The structure that we get is as follows. (The tree diagram is not reproduced here.)

It is, however, not true that the constituents identify the derivation uniquely. For example, the following is a derivation that gives the same constituents as (181):

(188) A, BA, BBA, BBBA, BBBa, BBbca, bBbca, bcbbca

The difference between (181) and (188) is regarded as inessential; only the constituent structure is really important. The derivation (182), however, imposes a different constituent structure on the string. The difference is that it is not (186) that is a constituent but rather

(187) b[cbbca]

This matters, because it may give rise to different meanings, while different derivations that yield the same structure will give the same meaning.

The constituent structure is displayed by means of trees. Recall the definition of a tree. It is a pair ⟨T, <⟩ where < is transitive (that is, if x < y and y < z then also x < z); it has a root (that is, there is an x such that for all y ≠ x: y < x); and furthermore, if x < y, z then either y < z, y = z or y > z. We say that x dominates y if x > y, and that it immediately dominates y if it dominates y but there is no z such that x dominates z and z dominates y.

Now, let us return to the constituent structure. There is a string α and a set Γ of occurrences of substrings of α together with a map f that assigns to each member of Γ a nonterminal (that is, an element of N). We call the members of Γ constituents and f(C) the category of C. Now, let the tree be defined as follows. The set of nodes is the set Γ. The relation < is defined by comparison of the contexts. Let C = ⟨γ1, γ2⟩ and D = ⟨η1, η2⟩ be occurrences of substrings, C, D ∈ Γ. We say that C is dominated by D, in symbols C ⊏ D, if C ≠ D and (1) η1 is a prefix of γ1 and (2) η2 is a suffix of γ2. (It may happen that η1 = γ1 or that η2 = γ2, but not both, as C and D are different.) Visually, what the definition amounts to is that if one underlines the substring of C and the substring of D, then the latter line includes everything that the former underlines. For example, let C = ⟨abc, dx⟩ and D = ⟨a, x⟩ in the string

(189) abccddx

Then C ⊏ D, as they are different and (1) and (2) are satisfied. Visually this is what we get. (The diagram is not reproduced here.)
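The dominance condition on occurrences is easy to check mechanically. A minimal sketch (names are mine), representing an occurrence as its pair of contexts:

```python
def dominated(C, D):
    """C = (g1, g2) is dominated by D = (h1, h2) iff C != D,
    h1 is a prefix of g1 and h2 is a suffix of g2."""
    g1, g2 = C
    h1, h2 = D
    return C != D and g1.startswith(h1) and g2.endswith(h2)
```

With the occurrences of example (189), C = ("abc", "dx") and D = ("a", "x"), the check confirms C ⊏ D but not D ⊏ C.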

The trees used in linguistic analysis are often ordered. Let C = ⟨γ1, γ2⟩ be an occurrence of σ and D = ⟨η1, η2⟩ be an occurrence of τ. We write C ≺ D and say that C precedes D if γ1 σ is a prefix of η1 (the prefix need not be proper). If one underlines C and D, this definition amounts to saying that the line of C ends before the line of D starts.

(190) abccddx

Here C = ⟨a, cddx⟩ and D = ⟨abcc, x⟩, and C precedes D. The ordering is here implicitly represented in the string.

19 Parsing and Recognition

Given a grammar G and a string x, we ask the following questions:

• (Recognition:) Is x ∈ L(G)?
• (Parsing:) What derivation(s) does x have?

Obviously, the problem of recognition alone is generally not of interest: the derivations matter, as the derivations give information about the meaning associated with an expression. The parsing problem for context free languages is actually not quite the one we are interested in: what we really want is only to know which constituent structures are associated with a given string. Still, sometimes it is useful to first solve the recognition task and then the parsing task. For if the string is not in the language, it is unnecessary to look for derivations. This vastly reduces the problem, but still the remaining problem may be very complex. Note that, in general, a given string can have any number of derivations, even infinitely many. Consider by way of example the grammar

(191) A → A | a

It can be shown that if the grammar has no unary rules and no rules of the form X → ε, then a given string x has at most an exponential number of derivations. We shall show that it is possible to eliminate these rules (this reduction is not semantically innocent!). Let us see how. Given a rule X → ε and a rule that contains X on the right, say Z → UXVX, we eliminate the first rule (X → ε), and we add all rules

obtained by replacing any number of occurrences of X on the right by the empty string. For example, for Z → UXVX we add the rules Z → UVX, Z → UXV and Z → UV. (Since other rules may have X on the left, it is not advisable to replace all occurrences of X uniformly.) We do this for all rules of the grammar. Now we are still left with unary rules, that is, rules of the form X → Y. Let ρ be a rule having Y on the left, for example Y → UVX. We add the rule obtained by replacing Y on the left by X, that is, X → UVX. We do this for all such rules, and then we remove X → Y. The resulting grammar generates the same set of strings.

These two steps remove the rules that do not expand the length of a string. We can express this formally as follows. If ρ = X → α is a rule, we call |α| − 1 the productivity of ρ, and denote it by p(ρ). Clearly, p(ρ) ≥ −1. If p(ρ) = −1 then α = ε, and if p(ρ) = 0 then we have a rule of the form X → Y, unless it replaces a nonterminal by a terminal. In all other cases p(ρ) > 0, and we call ρ productive. Now, if η is obtained in one step from γ by use of ρ, then |η| = |γ| + p(ρ). Hence |η| > |γ| if p(ρ) > 0. Thus, if the grammar only contains productive rules, each step in a derivation increases the length of the string, unless it replaces a nonterminal by a terminal. It follows that a string of length n has derivations of length 2n − 1 at most.

Here is now a very simple minded strategy to find out whether a string is in the language of the grammar (and to find a derivation if it is): let x be given, of length n. Enumerate all derivations of length < 2n and look at the last member of the derivation. If x is found once, it is in the language; otherwise not. It is not hard to see that this algorithm is exponential. We shall see later that there are far better algorithms, which are polynomial of order 3. Before we do so, let us note, however, that there are strings which have exponentially many different constituents, so that the task of enumerating the derivations is exponential. Even so, it still is the case that we can represent them in a very concise way.

The idea to the algorithm is surprisingly simple. Start with the string x. Scan the string for a substring y which occurs to the right of a rule ρ = X → y. Then write down all occurrences C = ⟨u, v⟩ of y (which we now represent by pairs of positions; see above), excluding occurrences of the empty string, and declare them constituents of category X. There is an actual derivation that defines this constituent structure:

(192) uXv, uyv = x

Now we can discard the original string and work instead with the strings obtained by undoing the last step.
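The ε-elimination step just described can be sketched as follows. The sketch assumes, as in the example Z → UXVX, that only directly nullable symbols occur; a full version would first compute the set of all nullable nonterminals by iteration.

```python
from itertools import combinations

def eliminate_epsilon(rules):
    """rules: list of (lhs, rhs) with rhs a tuple of symbols.
    For each rule X -> eps, add every variant of the other rules
    obtained by dropping any number of occurrences of X on the right,
    then remove the eps-rules themselves."""
    nullable = {lhs for lhs, rhs in rules if rhs == ()}
    out = set()
    for lhs, rhs in rules:
        if rhs == ():
            continue  # drop X -> eps
        pos = [i for i, s in enumerate(rhs) if s in nullable]
        for k in range(len(pos) + 1):
            for drop in combinations(pos, k):
                new = tuple(s for i, s in enumerate(rhs) if i not in drop)
                if new:  # never introduce a fresh eps-rule
                    out.add((lhs, new))
    return out
```

For Z → UXVX together with X → ε this produces exactly the four rules Z → UXVX, Z → UVX, Z → UXV and Z → UV from the text.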

We scan the string for each rule of the grammar. In the above case we analyze uXv in the same way as we did with x. In actual practice, however, there is a faster way of doing this. All we want to know is what substrings qualify as constituents of some sort. Constituents are represented by pairs [i, δ], where i is the first position and i + δ the last. (Hence 0 < δ ≤ n.) Let x be given, of length n. We define a matrix M of dimension (n + 1) × n. The entry m(i, j) is the set of categories that the constituent [i, j] has, given the grammar G. (It is clear that we do not need to fill the entries m(i, j) where i + j > n; they simply remain undefined or empty.)

The matrix is filled inductively, starting with j = 1. We put into m(i, 1) all symbols X such that X → x is a rule of the grammar, where x is the string between i and i + 1. In doing so, we have all possible constituents for derivations of length 1. Now assume that we have filled m(i, k) for all k < j. Now we fill m(i, j) as follows. For every rule X → α, check to see whether the string between the positions i and i + j has a decomposition as given by α. This can be done by cutting the string into parts and checking whether they have been assigned appropriate categories, whichever is more appropriate. For example, assume that we have a rule of the form

(193) X → A bu X V

Then [i, j] is a string of category X if there are numbers k, m, n, p such that k + m + n + p = j, [i, k] is of category A, [i + k, m] = bu (so m = 2), [i + k + 2, n] is of category X and [i + k + 2 + n, p] is of category V. This is just a matter of checking whether m(i, k) contains A, whether m(i + k + 2, n) contains X and whether m(i + k + 2 + n, p) contains V. The latter entries have been computed, so this is just a matter of looking them up. This involves choosing three numbers; but given k and n, p is fixed, since p = j − k − 2 − n. So there are O(n²) ways to choose these numbers. (Do you see why?) In general, depending on the number ν of nonterminals to the right, this takes O(nν−1) steps. There are O(n²) many entries to fill, so in total we have O(nν+1) many steps to compute. The algorithm just given is already polynomial.

When we have filled the relevant entries of the matrix, we look up the entry m(0, n). If it contains an S ∈ Σ, the string is in the language: the entire string is in the language if it qualifies as a constituent of category S for some S ∈ Σ.
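For a grammar whose rules all have the form X → YZ or X → a (Chomsky Normal Form, discussed below), the matrix-filling procedure specializes to the familiar cubic-time recognizer. The representation below is my own sketch; m[i][j] holds the categories of the substring starting at position i with length j.

```python
def recognize(rules, start, x):
    """CKY-style recognition. rules: list of (lhs, rhs) with rhs either
    a 1-tuple (terminal) or a 2-tuple (nonterminals)."""
    n = len(x)
    m = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # j = 1: constituents of length one.
    for i in range(n):
        for lhs, rhs in rules:
            if rhs == (x[i],):
                m[i][1].add(lhs)
    # Longer spans, cut at every possible split point k.
    for j in range(2, n + 1):
        for i in range(n - j + 1):
            for k in range(1, j):
                for lhs, rhs in rules:
                    if (len(rhs) == 2 and rhs[0] in m[i][k]
                            and rhs[1] in m[i + k][j - k]):
                        m[i][j].add(lhs)
    return start in m[0][n]
```

The three nested loops over j, i and k give the O(n³) bound; the lookup of previously computed entries is exactly the matrix lookup described above.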

Each of these steps consumes time proportional to the size of the grammar. The best we can hope for is to have ν = 2, in which case this algorithm needs time O(n³). Surprisingly, this can always be achieved. Here is how. In the case of the rule above we add a new nonterminal Y and replace the rule X → AbuXV by the following set of rules.

(194) X → AbuY,    Y → XV

Y is assumed not to occur in the set of nonterminals. In general, a rule of the form X → Y1 Y2 . . . Yn is replaced by

(195) X → Y1 Z1,    Z1 → Y2 Z2,    . . . ,    Zn−2 → Yn−1 Yn

We remark without proof that the new grammar generates the same language. Given this, we can recognize the language in O(n³) time!

Now, given this algorithm, can we actually find a derivation or a constituent structure for the string in question? This can be done: start with m(0, n). It contains an S ∈ Σ. Now choose i such that 0 < i < n and look up the entries m(0, i) and m(i, n − i), respectively. If they contain A and B, and if S → AB is a rule, then begin the derivation as follows:

(196) S, AB

Now underline one of A or B and continue with them in the same way. This is a downward search in the matrix. Each step requires linear time, since we only have to choose a point to split up the constituent, and the derivation is linear in the length of the string. Hence the overall time is quadratic! Thus the recognition consumes most of the work; once that is done, parsing is easy. Afterwards, it is easy to return to the original grammar: just remove all constituents of category Y.

Notice that the grammar transformation adds new constituents. It is perhaps worth examining why adding the constituents saves time in parsing. The reason is simply that the task of identifying constituents occurs over and over again. The fact that a sequence of two constituents has been identified is knowledge that can save time later on when we have to find such a sequence: the new grammar provides us with a constituent to handle the sequence, while in the original grammar there is no way of remembering that we did have it.
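The transformation (195) can be sketched in a few lines. The fresh names Z0, Z1, . . . are assumed not to occur among the grammar's nonterminals.

```python
from itertools import count

def binarize(rules):
    """Replace each rule X -> Y1 ... Yn (n > 2) by the chain
    X -> Y1 Z1, Z1 -> Y2 Z2, ..., Z_{n-2} -> Y_{n-1} Y_n."""
    fresh = count()
    out = []
    for lhs, rhs in rules:
        while len(rhs) > 2:
            z = f"Z{next(fresh)}"      # assumed-fresh nonterminal
            out.append((lhs, (rhs[0], z)))
            lhs, rhs = z, rhs[1:]
        out.append((lhs, rhs))
    return out
```

Undoing the transformation on a parse is equally simple: remove all constituents labelled with one of the fresh symbols.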

Definition 24 A CFG is in standard form if all the rules have the form X → Y1 Y2 · · · Yn, with X, Yi ∈ N, or X → x. If in addition n = 2 for all rules of the first form, the grammar is said to be in Chomsky Normal Form.

There is an easy way to convert a grammar into standard form. Just introduce a new nonterminal Ya for each letter a ∈ A together with the rules Ya → a. Next replace each terminal a that cooccurs with a nonterminal on the right hand side of a rule by Ya. The new grammar generates more constituents, since letters that are introduced together with nonterminals do not form a constituent of their own in the old grammar. Such letter occurrences are called syncategorematic. Typical examples of syncategorematic occurrences of letters are brackets that are inserted in the formation of a term. Consider the following expansion of the grammar (174).

(197) < term > → < number > | (<term>+<term>) | (<term>*<term>)

Here, operator symbols as well as brackets are added syncategorematically. The procedure of elimination will yield the following grammar.

(198) < term > → < number > | Y( < term > Y+ < term > Y) | Y( < term > Y∗ < term > Y)
      Y( → (
      Y) → )
      Y+ → +
      Y∗ → *

The conversion to standard form is mainly interesting for theoretic purposes; in actual practice it can often be avoided, however. Now, it may happen that a grammar uses more nonterminals than necessary. For example, the above grammar distinguishes Y+ from Y∗, but this is not necessary. Instead, the following grammar will do just as well.

(199) < term > → < number > | Y( < term > Yo < term > Y)
      Y( → (
      Y) → )
      Yo → + | *

The reduction of the number of nonterminals has the same significance as in the case of finite state automata: it speeds up recognition, and this can be significant because not only the number of states is reduced but also the number of rules. Another source of time efficiency is the number of rules that match a given right hand side.

Definition 25 A CFG is invertible if for any pair of rules X → α and Y → α we have X = Y.

There is a way to convert a given grammar into invertible form. The set of nonterminals of the new grammar is ℘(N), the powerset of the set of nonterminals of the original grammar. The rules are

(200) S → T0 T1 . . . Tn−1

where S is the set of all X such that there are Yi ∈ Ti (i < n) with X → Y0 Y1 . . . Yn−1 ∈ R. If there are several such rules, we need to add the left hand symbol for each of them. This grammar is clearly invertible: for any given sequence T0 T1 · · · Tn−1 of nonterminals the left hand side S is uniquely defined. What needs to be shown is that it generates the same language (in fact, it generates the same constituent structures, though with different labels for the constituents).

20 Greibach Normal Form

We have spoken earlier about different derivations defining the same constituent structure. Basically, this is because applications of two rules that target the same string but different nonterminals commute:

(201) · · · X · · · Y · · ·
      · · · α · · · Y · · ·        · · · X · · · γ · · ·
      · · · α · · · γ · · ·

This can be exploited in many ways, for example by always choosing a particular derivation. If in a given string we have several occurrences of nonterminals, we can choose any of them and expand it first using a rule; alternatively, we can agree to always expand the leftmost nonterminal, or always the rightmost nonterminal.

Recall that a derivation defines a set of constituent occurrences, which in turn constitute the nodes of the tree. Trees are always considered ordered. Given a tree, a linearization is an ordering of the nodes which results from a valid derivation in the following way: we write x ◁ y if the constituent of x is expanded before the constituent of y is. Notice that each occurrence of a nonterminal is replaced by some right hand side of a rule during a derivation that leads to a terminal string; after it has been replaced, it is gone and can no longer figure in the derivation. One can characterize exactly what it takes for such an order to be a linearization. First, it is linear. Second, if x > y then also x ◁ y. It follows that the root is the first node in the linearization.

Linearizations are closely connected with search strategies in a tree. We shall present two, depth–first and breadth–first, with examples. The first is a particular case of the so–called depth–first search, and the linearization shall be called the leftmost linearization. It is as follows: x ◁ y iff x > y or x ≺ y. (Recall that x ≺ y iff x precedes y.) For every tree there is exactly one leftmost linearization. We shall denote the fact that there is a leftmost derivation of α from X by X ⊢ℓG α.

We can generalize the situation as follows. Let ⊑ be a linear ordering uniformly defined on the leaves of local subtrees; that is to say, if B and C are isomorphic local trees (that is, if they correspond to the same rule ρ), then ⊑ orders the leaves of B linearly in the same way as it orders the leaves of C (modulo the unique (!) isomorphism). Since the root is always the first element in a linearization, we only need to order the daughters of the root node. Now call π a priorization for G = ⟨S, N, A, R⟩ if π defines a linearization on the local tree Hρ for every ρ ∈ R. We write X ⊢πG α to denote the fact that there is a derivation of α from X determined by the linearization defined by π. In the case of the leftmost linearization the ordering is the one given by ≺. Now a minute's reflection reveals that every linearization of the local subtrees of a tree induces a linearization of the entire tree, but not conversely (there are orderings which do not proceed in this way, as we shall see shortly).

Proposition 26 Let π be a priorization. Then X ⊢πG x iff X ⊢G x.

A different strategy is the breadth–first search. This search goes through the tree in increasing depth. For each n, let Sn be the set of all nodes x with d(x) = n; Sn shall be ordered linearly by ≺. The breadth–first search is a linearization ∆, which is defined as follows: (a) if d(x) = d(y) then x ∆ y iff x ≺ y, and (b) if d(x) < d(y) then x ∆ y. The difference between these search strategies can be made very clear with tree domains.

Definition 27 A tree domain is a set T of strings of natural numbers such that (i) if x is a prefix of y ∈ T then also x ∈ T, and (ii) if x j ∈ T and i < j then also xi ∈ T. We define x > y if x is a proper prefix of y, and x ≺ y iff x = uiv and y = u jw for some sequences u, v, w and numbers i < j. The depth–first search traverses the tree domain in the lexicographical order, the breadth–first search in the numerical order. Let the following tree domain be given: {ε, 0, 1, 2, 00, 10, 11, 20}, where ε has the daughters 0, 1 and 2, the node 0 has the daughter 00, the node 1 has the daughters 10 and 11, and the node 2 has the daughter 20. The depth–first linearization is

(202) ε, 0, 00, 1, 10, 11, 2, 20

The breadth–first linearization, however, is

(203) ε, 0, 1, 2, 00, 10, 11, 20

Notice that with these linearizations the tree domain ω∗ cannot be enumerated. The depth–first linearization begins as follows:

(204) ε, 0, 00, 000, 0000, . . .

So we never reach 1. The breadth–first linearization goes like this:

(205) ε, 0, 1, 2, 3, . . .

So we never reach 00. On the other hand, ω∗ is countable, so there is a linearization; but it is more complicated than the given ones.
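For the finite tree domain above, both linearizations can be obtained simply by sorting the addresses. The sketch below relies on all addresses being single digits, so plain string comparison coincides with the lexicographical order of Definition 27.

```python
# The tree domain {eps, 0, 1, 2, 00, 10, 11, 20}, with "" for eps.
T = ["", "0", "1", "2", "00", "10", "11", "20"]

def depth_first(domain):
    # Lexicographical order on the address strings: (202).
    return sorted(domain)

def breadth_first(domain):
    # By depth first, then numerically within each level: (203).
    return sorted(domain, key=lambda x: (len(x), x))

print(depth_first(T))    # ['', '0', '00', '1', '10', '11', '2', '20']
print(breadth_first(T))  # ['', '0', '1', '2', '00', '10', '11', '20']
```

Note that neither sort generalizes to the infinite domain ω∗, mirroring the observation that neither search enumerates it.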

The first reduction of grammars we look at is the elimination of superfluous symbols and rules.

Definition 28 A CFG is called slender if either L(G) = ∅ and G has no nonterminals except for the start symbol and no rules, or L(G) ≠ ∅ and every nonterminal is both reachable and completable.

Proposition 29 Let G and H be slender. Then G = H iff der(G) = der(H).

That is, two slender grammars have identical sets of derivations iff their rule sets are identical. Let G = ⟨S, N, A, R⟩ be a CFG. Call X ∈ N reachable if ⊢G α X β for some α and β. X is called completable if there is an x such that X ⇒∗ x. Consider the following grammar.

(206) S → AB    A → CB
      B → AB    A → x
      D → Ay    C → y

In the given grammar A, B and C are reachable, and A, C and D are completable. Since S, the start symbol, is not completable, the grammar generates no terminal strings.

Let N′ be the set of symbols which are both reachable and completable, and let R′ be the restriction of R to the symbols from A ∪ N′. This defines G′ = ⟨S, N′, A, R′⟩. If S ∉ N′ then L(G) = ∅; in this case we put N′ := {S} and R′ := ∅. Otherwise, it may be that throwing away rules makes some nonterminals unreachable or uncompletable; therefore, this process must be repeated until G′ = G. Call the resulting grammar Gs. It is clear that ⊢G α iff ⊢Gs α. Additionally, it can be shown that every derivation in G is a derivation in Gs and conversely.

Proposition 30 For every CFG G there is an effectively constructible slender CFG Gs = ⟨S, Ns, A, Rs⟩ such that Ns ⊆ N, which has the same set of derivations as G. In this case it also follows that LB(Gs) = LB(G).

Definition 31 Let G = ⟨S, N, A, R⟩ be a CFG. G is in Greibach (Normal) Form if every rule is of the form S → ε or of the form X → x Y.
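The two closure computations used to slenderize a grammar can be sketched as follows, on grammar (206). The convention that terminals are lowercase letters is my own assumption for the sketch.

```python
# Grammar (206), rules as (lhs, rhs-string); terminals are lowercase.
RULES = [("S", "AB"), ("B", "AB"), ("D", "Ay"),
         ("A", "CB"), ("A", "x"), ("C", "y")]

def completable(rules):
    """Iterate to a fixed point: X is completable if some rule for X
    has only terminals and already-completable symbols on the right."""
    done, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs not in done and all(s.islower() or s in done for s in rhs):
                done.add(lhs)
                changed = True
    return done

def reachable(rules, start="S"):
    """Iterate to a fixed point: every nonterminal on the right of a rule
    whose left hand symbol is reachable is itself reachable."""
    seen, changed = {start}, True
    while changed:
        changed = False
        for lhs, rhs in rules:
            if lhs in seen:
                for s in rhs:
                    if s.isupper() and s not in seen:
                        seen.add(s)
                        changed = True
    return seen
```

Here completable(RULES) is {A, C, D} and reachable(RULES) is {S, A, B, C} (the start symbol counts as trivially reachable); since S is not completable, the language is empty, as stated above.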

20. Greibach Normal Form 81

Proposition 32 Let G be in Greibach Normal Form. If X ⊢_G α then α has a leftmost derivation from X in G iff α = y Y for some y ∈ A∗ and Y ∈ N∗, and y = ε only if Y = X.

The proof is not hard. It is also not hard to see that this property characterizes the Greibach form uniquely. For if there is a rule of the form X → Y γ then there is a leftmost derivation of Y γ from X, but not in the desired form.

Theorem 33 (Greibach) For every CFG one can effectively construct a grammar G_g in Greibach Normal Form with L(G_g) = L(G).

Before we start with the actual proof we shall prove some auxiliary statements. Let ρ = X → α be a rule. Define R−ρ as follows: for every factorization α = α1 Y α2 of α and every rule Y → β add the rule X → α1 β α2 to R, and finally remove the rule ρ. Now let G−ρ := ⟨S, N, A, R−ρ⟩. Then L(G−ρ) = L(G). We call this construction skipping the rule ρ. The reader may convince himself that the trees for G−ρ can be obtained in a very simple way from trees for G, simply by removing all nodes x which dominate a local tree corresponding to the rule ρ. This technique works only if ρ is not an S–production. If ρ is an S–production we proceed as follows: we replace ρ by all rules of the form S → β where β derives from α by applying a rule.

Skipping a rule does not necessarily yield a new grammar. This is so if there are rules of the form X → Y (in particular rules like X → X). Here we assume that there are no rules of the form X → X.

We call ρ an X–production if ρ = X → α for some α. Such a production is called left recursive if it has the form X → X β.

Lemma 34 Let G = ⟨S, N, A, R⟩ be a CFG and let X → X αi, i < m, be all left recursive X–productions, as well as X → β j, j < n, all non left recursive X–productions. Now let G1 := ⟨S, N ∪ {Z}, A, R1⟩, where Z ∉ N ∪ A and R1 consists of all Y–productions from R with Y ≠ X as well as the productions

(207) X → β j      X → β j Z      ( j < n)
      Z → αi       Z → αi Z       (i < m)

Then L(G1) = L(G).
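The construction of Lemma 34 is mechanical and can be sketched as a small function. The rule representation (pairs with a tuple right hand side) and the assumption that Z is a fresh symbol are made for the illustration.

```python
def remove_left_recursion(X, productions, Z):
    # Split the X-productions into the left recursive ones X -> X a_i and
    # the rest X -> b_j, then emit the rules (207):
    #   X -> b_j, X -> b_j Z, Z -> a_i, Z -> a_i Z.
    # Productions are (lhs, rhs) pairs, rhs a tuple of symbols; Z is fresh.
    alphas = [rhs[1:] for lhs, rhs in productions if lhs == X and rhs[:1] == (X,)]
    betas = [rhs for lhs, rhs in productions if lhs == X and rhs[:1] != (X,)]
    new = [(lhs, rhs) for lhs, rhs in productions if lhs != X]
    for b in betas:
        new += [(X, b), (X, b + (Z,))]
    for a in alphas:
        new += [(Z, a), (Z, a + (Z,))]
    return new

# The S-productions of example (211): S -> SDA | CC.
prods = [('S', ('S', 'D', 'A')), ('S', ('C', 'C'))]
print(remove_left_recursion('S', prods, 'Z'))
# [('S', ('C', 'C')), ('S', ('C', 'C', 'Z')), ('Z', ('D', 'A')), ('Z', ('D', 'A', 'Z'))]
```

This reproduces exactly the replacement (212) carried out below: S → CC | CCZ and Z → DA | DAZ.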

20. Greibach Normal Form 82

Proof. We shall prove this lemma rather extensively since the method is relatively tricky. We consider the following priorization on G1: in all rules of the form X → β j and Z → αi we take the natural ordering (that is, the leftmost ordering), and in all rules X → β j Z as well as Z → αi Z we also put the left to right ordering, except that Z precedes all elements from β j and αi. This defines the linearization ≺.

Let M(X) be the set of all γ such that there is a leftmost derivation from X in G in such a way that γ is the first element not of the form X δ. Likewise, we define P(X) to be the set of all γ which can be derived from X priorized by ≺ in G1 such that γ is the first element which does not contain Z. We claim that P(X) = M(X). It can be seen that

(208) M(X) = ⋃_{j<n} β j · (⋃_{i<m} αi)∗ = P(X)

From this the desired conclusion follows thus. Let x ∈ L(G). Then there exists a leftmost derivation Γ = ⟨Ai : i < n + 1⟩ of x. (Recall that the Ai are instances of rules.) This derivation is cut into segments Σi, i < σ, of length ki, such that

(209) Σi = ⟨A j : Σ_{p<i} k p ≤ j < Σ_{p<i+1} k p⟩

This partitioning is done in such a way that each Σi is a maximal portion of Γ of X–productions or a maximal portion of Y–productions with Y ≠ X. The segments which do not contain X–productions are already G1–derivations; for them we put Σ′i := Σi. The X–segments can be replaced by a ≺–derivation Σ′i in G1. Now let Γ′ be the result of stringing together the Σ′i. This is well defined, since the first string of Σ′i equals the first string of Σi, and the last string of Σ′i equals the last string of Σi. Γ′ is a G1–derivation, priorized by ≺. Hence x ∈ L(G1). The converse is proved analogously.

Now to the proof of Theorem 33. We may assume at the outset that G is in Chomsky Normal Form. We choose an enumeration of N as N = {Xi : i < p}. We claim first that by taking in new nonterminals we can see to it that we get a grammar G1 such that L(G1) = L(G) in which the Xi–productions have the form Xi → x Y or Xi → X j Y with j > i. This we prove by induction on i. Let i0 be the smallest i such that there is a rule Xi → X j Y with j ≤ i. Let j0 be the largest j such that Xi0 → X j Y is a rule. We distinguish two cases. The first is j0 = i0. By the previous lemma we can eliminate the production by introducing some new

nonterminal symbol Zi0. The second case is j0 < i0. We can skip the rule Xi0 → X j0 Y and introduce rules of the form (a) Xi0 → Xk Y with k > j0. Here we apply the induction hypothesis on j0. In this way the second case is either eliminated or reduced to the first.

Now let P := {Zi : i < p} be the set of newly introduced nonterminals. (It may happen that for some j, Z j does not occur in the grammar, but this does not disturb the proof.) Let finally Pi := {Z j : j < i}. At the end of this reduction we have rules of the form

(210a) Xi → X j Y     ( j > i)
(210b) Xi → x Y       (x ∈ A)
(210c) Zi → W         (W ∈ (N ∪ Pi)+ (ε ∪ Zi))

It is clear that every X p−1 –production already has the form X p−1 → x Y. If some X p−2 –production has the form (210a) then we can skip this rule and get rules of the form X p−2 → x Y. Inductively we see that all rules of the form (210a) can be eliminated in favour of rules of the form (210b). Now finally the rules of type (210c). Also these rules can be skipped, and then we get rules of the form Z → x Y for some x ∈ A.

For example, let the following grammar be given.

(211) S → SDA | CC      A → a
      D → DC | AB       B → b
                        C → c

The production S → SDA is left recursive. We replace it according to the above lemma by

(212) S → CCZ,    Z → DA,    Z → DAZ

Likewise we replace the production D → DC by

(213) D → ABY,    Y → C,    Y → CY

With this we get the grammar

(214) S → CC | CCZ      A → a
      Z → DA | DAZ      B → b
      D → AB | ABY      C → c
      Y → C | CY

Next we skip the D–productions.

(215) S → CC | CCZ                       A → a
      Z → ABA | ABYA | ABAZ | ABYAZ      B → b
      D → AB | ABY                       C → c
      Y → C | CY

Next D can be eliminated (since it is not reachable) and we can replace on the right hand side of the productions the first nonterminals by terminals.

(216) S → cC | cCZ
      Z → aBA | aBYA | aBAZ | aBYZ
      Y → c | cY

Now the grammar is in Greibach Normal Form.

21 Pushdown Automata

Regular languages can be recognized by a special machine, the finite state automaton. Recognition here means that the machine scans the string left–to–right and when the string ends (the machine is told when the string ends) then it says 'yes' if the string is in the language and 'no' otherwise. (This picture is only accurate for deterministic machines; more on that later.) There are context free languages which are not regular, for example {a^n b^n : n ∈ N}. Thus devices that can check membership in L(G) for a CFG must be more powerful. The devices that can do this are called pushdown automata. They are finite state machines which have a memory in form of a pushdown. A pushdown memory is potentially infinite: you can store in it as much as you like. However, the operations that you can perform on a pushdown storage are limited. You can see only the last item you added to it, and you can either put something on top of that element, or you can remove that element, and then the element that was added before it becomes visible. You always add on top, and remove from the top. This behaviour is also called LIFO (last–in–first–out) storage. It is realized for example when storing plates: you always add plates on top, so that the bottommost plates are only used when there is a lot of people.

It is easy to see that you can decide whether a given string has the form a^n b^n given an additional pushdown storage. Namely, you scan the string from left to right. As long as you get a, you put it on the

Initially. If the pushdown is not emptied but the string is complete. and σ a set of instructions. Now. we also need a predicate ‘empty’. the storage alphabet. If p = pop. which contains a special symbol. This is done as follows. PDAs can be nondeterministic. then you have more bs than a. It schedules the actions using the stack as a storage. it is deﬁned to be a quintuple A. In fact. which is true of a stack x iﬀ x = ε. you can tell whether you have a string of the required form if you can tell whether you have an empty storage. we call . where s and s are states. We have two alphabets. Each instruction is an option for the machine to proceed. typically what one does is to ﬁll the storage before we start with a special symbol #. It is initialized with the stack containing # and the initial state i0 . I the stack alphabet. We assume that this is the case. For a given situation we may have several options. If given the current stack. the stack contains one symbol. We have an operation pop : x d → x. A. the current state and the next character there is at most one operation that can be chosen. A PDA contains a set of instructions. and I. The next state is s and the stack is determined from p. where A is the input alphabet. then the topmost symbol is popped. then a is pushed onto stack.21. the end-of-storage marker. c is a character or empty (the character read from the string). Once you hit on a b you start popping the as from the pushdown. The storage is represented as a string over an alphabet D that contains #. Q. p . Q the set of states. the next character must match c (and is then consumed). if it is pusha . t. i0 . then you have more as than bs. (If we do not have an end of stack marker. I. it can use that option only if it is in state s. and t is a character (read from the top of the stack) and ﬁnally p an instruction to either pop from the stack or push a character (diﬀerent from #) onto it. c. #. Pushdown Automata 85 pushdown. F. one for each b that you ﬁnd. 
σ . the alphabet of letters read from the tape. A transition instruction is a a quintuple s. if the topmost stack symbol is t and if c ε. s . So. However. If A is reading a string. notice that the control structure is a ﬁnite state automaton. F ⊆ Q the set of accepting states. #. If the pushdown is emptied before the string is complete. Formally. the stack alphabet. then it does the following. For each d ∈ D we have a predicate topd which if true of x iﬀ x = y d. i0 ∈ Q the start state. Then we are only in need of the following operations and predicates: For each d ∈ D we have an operation pushd : x → x d.

the PDA deterministic. There is a slight problem in that the PDAs might actually be nondeterministic; it is for this reason that we need to review our notion of acceptance.

We add here that theoretically the operation that reads the top of the stack removes the topmost symbol: in order to look at the topmost symbol we actually need to pop it off the stack. However, if we put it back, then this is as if we had just 'peeked' into the top of the stack. (We shall not go into the details here, but it is possible to peek into any number of topmost symbols. The price one pays is an exponential blowup of the number of states.) The stack really is just a memory device.

We say that the PDA accepts x by state if it is in a final state at the end of x. Alternatively, we say that the PDA accepts x by stack if the stack is empty after x has been scanned. It can be shown that the class of languages accepted by PDAs by state is the same as the class of languages accepted by PDAs by stack, although for a given machine the two languages may be different. We shall establish that the class of languages accepted by PDAs by stack is exactly the class of CF languages.

To continue the above example: we put the automaton in an accepting state if after popping the as the topmost symbol is #; you accept the string if at the end the stack is #. A slight modification of the machine results in a machine that accepts the language by stack. Basically, it needs to put one a less than needed on the stack and then cancel # on the last move.

In general, we say that a run of the machine is a series of actions that it takes. Given the input, the run specifies what the machine chooses each time it faces a choice. (The alternatives are simply different actions and different subsequent states.) While in the case of finite state automata there was a way to turn the machine into an equivalent deterministic machine, this is not possible here. There are languages which are CF but cannot be recognized by a deterministic PDA. An example is the language of palindromes: {x x^T : x ∈ A∗}, where x^T is the reversal of x. For example, let x = abc; then x x^T = abccba is a palindrome. (Likewise (abddc)^T = cddba.) The obvious mechanism is this: scan the input and start pushing the input onto the stack until you are half through the string, and then start comparing the stack with the string you have left. Since the stack is popped in reverse order, you recognize exactly the palindromes. The trouble is that there is no way for the machine to know when to shift gear: it cannot tell when it is half through the string. Here is the dilemma. abccba is a palindrome, but so is abccbaabccba and abccbaabccbaabccba. If you are scanning a word like this, there is no way of knowing when you should turn and pop symbols, because the string might be longer than you have thought.

A machine is deterministic iff for every input there is only one run. We say that the machine accepts the string if there is an accepting run on that input. Notice that the definition of a run is delicate. Computers are not parallel devices: they can only execute one thing at a time. They also are deterministic. The PDA has the same problem: it chooses a particular run but has no knowledge of what the outcome would have been had it chosen a different run. Thus, to check whether a run exists on an input we need to emulate the machine: we keep a record of the run and backtrack, and so enumerate all possible runs and check the outcome.

Now back to the recognition problem. To show the theorem we use the Greibach Normal Form. The Greibach Form always puts a terminal symbol in front of a series of nonterminals. We define the following machine. Its stack alphabet is N, and the beginning of stack is the start symbol of G. Now let X → a Y0 Y1 · · · Yn−1 be a rule. Then we translate this rule into the following actions: whenever the scanned symbol is a, and whenever the topmost symbol of the stack is X, then pop that symbol from the stack and put Yn−1, then Yn−2, then Yn−3 etc. on the stack. Thus the last symbol put on the stack is Y0, which is then visible. (To schedule this correctly, the machine needs to go through several steps, since each step allows to put only one symbol onto the stack. But we ignore that finesse here.) It can be shown that the machine ends with an empty stack on input x iff there is a leftmost derivation of x that corresponds to the run. Let us see an example. Take the grammar

(217)  0  S → cC        6  Y → c
       1  S → cCZ       7  Y → cY
       2  Z → aBA       8  A → a
       3  Z → aBYA      9  B → b
       4  Z → aBAZ     10  C → c
       5  Z → aBYZ

We take the string ccabccaba. Here is a leftmost derivation (to the left we show


the string, and to the right the last rule that we applied):

(218)  S            −
       cCZ           1
       ccZ          10
       ccaBYZ        5
       ccabYZ        9
       ccabcYZ       7
       ccabccZ       6
       ccabccaBA     2
       ccabccabA     9
       ccabccaba     8

The PDA is parsing the string as follows. (We show the successful run.) The stack is initialized to S (Line 1). It reads c and deletes S, but puts first Z and then C on it. The stack now is ZC (leftmost symbol is at the bottom!). We are in Line 2. Now it reads c and deletes C from the stack (Line 3). Then it reads a and pops Z, but pushes first Z, then Y and then B (Line 4). Then it pops B on reading b, pushing nothing on top (Line 5). It is clear that the strings to the left represent the following: the terminal string is that part of the input string that has been read, and the nonterminal string is the stack (in reverse order). As said, this is just one of the possible runs. There is an unsuccessful run which starts as follows. The stack is S. The machine reads c and decides to go with Rule 0: so it pops S but pushes only C. Then it reads c, and is forced to go with Rule 10, popping off C, but pushing nothing onto it. Now the stack is empty, and no further operation can take place. That run fails. Thus, we have shown that context free languages can be recognized by PDAs by empty stack. The converse is a little trickier; we will not spell it out here.
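The machine induced by grammar (217) can be sketched directly: the stack holds nonterminals, a rule X → a Y0 · · · Yn−1 fires when the next letter is a and X is visible, and the Yi are pushed in reverse so that Y0 ends up on top. The backtracking over the possible runs is a simulation device, not part of the machine itself.

```python
# Grammar (217), keyed by nonterminal: each rule is (terminal, tail).
RULES = {
    'S': [('c', 'C'), ('c', 'CZ')],
    'Z': [('a', 'BA'), ('a', 'BYA'), ('a', 'BAZ'), ('a', 'BYZ')],
    'Y': [('c', ''), ('c', 'Y')],
    'A': [('a', '')],
    'B': [('b', '')],
    'C': [('c', '')],
}

def accepts(word, stack='S'):
    # Acceptance by empty stack; the stack is a string whose last
    # character is the visible top. Explore all runs by backtracking.
    if not stack:
        return not word
    if not word:
        return False
    top, rest = stack[-1], stack[:-1]
    for a, tail in RULES.get(top, ()):
        # Pop the top, push the tail in reverse (its first symbol becomes visible).
        if a == word[0] and accepts(word[1:], rest + tail[::-1]):
            return True
    return False

print(accepts('ccabccaba'))  # True
```

The successful run found here corresponds to the leftmost derivation (218); every rule consumes one input letter, so the search always terminates.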

22 Shift–Reduce–Parsing

We have seen above that by changing the grammar to Greibach Normal Form we can easily implement a PDA that recognizes the strings using a pushdown of the nonterminals. It is not necessary to switch to Greibach Normal Form, though: we can translate the grammar directly into a PDA. Also, grammar and automaton are not uniquely linked to each other. Given a particular grammar, we can define quite


different automata that recognize the languages based on that grammar. The PDA implements what is often called a parsing strategy. The parsing strategy makes use of the rules of the grammar, but depending on the grammar in quite different ways. One very popular method of parsing is the following. We scan the string, putting the symbols one by one on the stack. Every time we hit a right hand side of a rule we undo the rule. That is, we do the following: suppose the top of the stack contains the right hand side of a rule (in reverse order). Then we pop that sequence and push the left hand side of the rule on the stack. So, if the rule is A → a and we get a, we first push it onto the stack. The top of the stack now matches the right hand side; we pop it again, and then push A. Suppose the top of the stack contains the sequence BA and there is a rule S → AB; then we pop twice, removing that part, and push S. To do this, we need to be able to remember the top of the stack. If a rule has two symbols on its right hand side, then the top two symbols need to be remembered. We have seen earlier that this is possible if the machine is allowed to do some empty transitions in between. Again, notice that the strategy is nondeterministic in general, because several options can be pursued at each step. (a) Suppose the top part of the stack is BA, and we have a rule S → AB. Then either we push the next symbol onto the stack, or we use the rule that we have. (b) Suppose the top part of the stack is BA and we have the rules S → AB and C → A. Then we may either use the first or the second rule. The strongest variant is to always reduce when the right hand side of a rule is on top of the stack. Though not always successful, this strategy is actually useful in a number of cases. The condition under which it works can be spelled out as follows.
Say that a CFG is transparent if for every string α with X ⇒∗ α for some nonterminal X, and every occurrence of a substring γ, say α = κ1 γ κ2, such that there is a rule Y → γ, the pair ⟨κ1, κ2⟩ is a constituent occurrence of category Y in α. This means that up to inessential reorderings there is just one parse for any given string, if there is any. An example of a transparent language is that of arithmetical terms where no brackets are omitted. Polish Notation and Reverse Polish Notation also belong to this category. A broader class of languages where this strategy is successful is the class of NTS–languages. A grammar has the NTS–property if whenever S ⇒∗ κ1 γ κ2 and Y → γ then S ⇒∗ κ1 Y κ2. (Notice that this does not mean that the constituent is a constituent under the same parse; it says that one can find a parse that ends in Y being expanded in this way, but it might just be a different parse.) Here is how the concepts differ. There is a grammar for numbers that runs as follows.

(219) < digit > → 0 | 1 | · · · | 9

      < number > → < digit > | < number >< number >


This grammar has the NTS–property. Any digit is of category < digit >, and a number can be broken down into two numbers at any point. The standard definition is as follows.

(220) < digit > → 0 | 1 | · · · | 9
      < number > → < digit > | < digit >< number >

This grammar is not an NTS–grammar, because in the expression

(221) < digit >< digit >< digit >

the occurrence ⟨< digit >, ε⟩ of the string < digit >< digit > cannot be a constituent occurrence under any parse. Finally, the language of digit strings has no transparent grammar! This is because the string 732 possesses occurrences of the strings 73 and 32. These occurrences must be constituent occurrences under one and the same parse if the grammar is transparent. But they overlap. Contradiction. Now, the strategy above is deterministic. It is easy to see that there is a nondeterministic algorithm implementing this idea that actually defines a machine parsing exactly the CFLs. It is a machine where you only have a stack, and symbols scanned are moved onto the stack. The top k symbols are visible (with PDAs we had k = 1). If the last m symbols, m ≤ k, are the right hand side of a rule, you are entitled to replace them by the corresponding left hand side. (This style of parsing is called shift–reduce parsing. One can show that PDAs can emulate a shift–reduce parser.) The stack gets cleared by m − 1 symbols, so you might end up seeing more of the stack after this move. Then you may either decide that once again there is a right hand side of a rule which you want to replace by the left hand side (reduce), or you want to see one more symbol of the string (shift). In general, like with PDAs, the nondeterminism is unavoidable. There is an important class of languages, the so–called LR(k)–grammars. These are languages where the question whether to shift or to reduce can be based on a lookahead of k symbols. That is to say, the parser might not know directly if it needs to shift or to reduce, but it can know for sure if it sees the next k symbols (or whatever is left of the string). One can show that for a language for which there is an LR(k)–grammar with k > 0 there also is an LR(1)–grammar. So, a lookahead of just one symbol is enough to make an informed guess (at least with respect to some grammar, which however need not generate the constituents we are interested in).
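The greedy reduce-always strategy works on the NTS grammar (219), and can be sketched as follows. The symbol names ('D' for < digit >, 'N' for < number >) are an assumption of the sketch; digits are reduced to 'D' as they are shifted.

```python
# Grammar (219): <number> -> <digit> | <number><number>.
RULES = [
    (('N', 'N'), 'N'),   # <number> -> <number><number>
    (('D',), 'N'),       # <number> -> <digit>
]

def parse(word):
    # Shift each symbol, then reduce as long as the top of the stack is
    # the right hand side of some rule (the "always reduce" variant).
    stack = []
    for ch in word:
        if not ch.isdigit():
            return False
        stack.append('D')        # shift, reducing the digit to <digit>
        reduced = True
        while reduced:
            reduced = False
            for rhs, lhs in RULES:
                n = len(rhs)
                if tuple(stack[-n:]) == rhs:
                    del stack[-n:]       # pop the right hand side...
                    stack.append(lhs)    # ...and push the left hand side
                    reduced = True
                    break
    return stack == ['N']

print(parse('732'))  # True
```

Because (219) is an NTS grammar, reducing eagerly can never paint the parser into a corner here; on the non-NTS grammar (220) the same greedy strategy would fail on some inputs, which is exactly the distinction the text draws.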

If the lookahead is 0 and we are interested in acceptance by stack, then the PDA also tells us when the string should be finished: in consuming the last letter it must know that this is the last letter, because it will erase the symbol # at the last moment. Unlike nondeterministic PDAs, which may have alternative paths they can follow, a deterministic PDA without lookahead must make a firm guess: if the last letter is there, it must know that this is so. Languages of this form have the following property: no proper prefix of a string of the language is a string of the language. This is an important class. The language B = {a^n b^n : n ∈ N} of balanced strings is in LR(0), while for example B+ is not (it contains the string abab, which has the proper prefix ab).

23 Some Metatheorems

It is often useful to know whether a given language can be generated by grammars of a given type. For example: how do we decide whether a language is regular? The following is a very useful criterion.

Theorem 35 Suppose that L is a regular language. Then there exists a number k such that every x ∈ L of length at least k has a decomposition x = uvw with nonempty v such that for all n ∈ N:

(222) u v^n w ∈ L

Before we enter the proof, note a consequence: we may choose n = 0, in which case we get uw ∈ L. The proof is as follows. Since L is regular, there is a fsa A such that L = L(A). Let k be the number of states of A, and let x be a string of length at least k. If x ∈ L then there is an accepting run of A on x:

(223) q0 →x0 q1 →x1 q2 →x2 · · · qn−1 →xn−1 qn

This run visits n + 1 states. But A has only k ≤ n states, so there is a state which has been visited twice: there are i and j such that i < j and qi = q j. Then put u := x0 x1 · · · xi−1, v := xi xi+1 · · · x j−1 and w := x j x j+1 · · · xn−1. We claim that there

exists an accepting run for every string of the form u v^q w:

(224) q0 →u qi = q j →w qn
(225) q0 →u qi →v q j = qi →w qn
(226) q0 →u qi →v q j = qi →v q j →w qn
(227) q0 →u qi →v q j = qi →v q j = qi →v q j →w qn

There are examples of this kind in natural languages. An amusing example is from Steven Pinker. In the days of the arms race one produced not only missiles but also anti missile missiles, to be used against missiles. To attack those, one needed anti anti missile missile missiles, and then anti anti anti missile missile missile missiles to attack anti anti missile missile missiles. And so on. The general recipe for these expressions is as follows:

(228) (anti )^n (missile )^n missile

We shall show that this is not a regular language. Suppose it is regular. Then there is a k satisfying the conditions above. Now take the word (anti )^k (missile )^k missile; break it into a prefix (anti )^k of length 5k and a suffix (missile )^k missile of length 8(k + 1) − 1, so that the entire string has length 13k + 7. There must be a subword of nonempty length that can be omitted or repeated without punishment. No such word exists: it is easy to see that such a subword must contain the letters of anti missile, so the string we take out must have length 13n. We assume for simplicity that n = 1; the general argument is similar. Suppose we decompose the original string as follows:

(229) u = (anti )^{k−1},  v = anti missile ,  w = (missile )^{k−1} missile

Then uw is of the required form, but

(230) u v^2 w = (anti )^{k−1} anti missile anti missile (missile )^{k−1} missile

is not. Similarly for any other attempt to divide the string.
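The proof idea of Theorem 35 is itself effective: run the automaton, find the first state visited twice, and cut the word there. A sketch, assuming a DFA given as a transition dictionary (the even-a's automaton here is an illustrative example, not from the text):

```python
def pump_decomposition(delta, start, x):
    # Record the state after each letter; the first repeated state gives
    # indices i < j with q_i = q_j, and we cut x = uvw at those visits.
    states = [start]
    for ch in x:
        states.append(delta[(states[-1], ch)])
    seen = {}
    for j, q in enumerate(states):
        if q in seen:
            i = seen[q]
            return x[:i], x[i:j], x[j:]   # the automaton loops on v
        seen[q] = j
    return None  # |x| < number of states: no repetition is forced

# DFA over {a, b} accepting strings with an even number of a's.
delta = {('even', 'a'): 'odd', ('odd', 'a'): 'even',
         ('even', 'b'): 'even', ('odd', 'b'): 'odd'}
u, v, w = pump_decomposition(delta, 'even', 'abab')
print(u, v, w)  # a b ab
```

Pumping the middle part v any number of times never changes the number of a's modulo 2, so every u v^n w is accepted, exactly as (224)–(227) depict.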

This proof can be simplified using another result. Consider a map µ : A → B∗, which assigns to each letter a ∈ A a string (possibly empty) of letters from B. We extend µ to strings as follows.

(231) µ(x0 x1 x2 · · · xn−1) = µ(x0)µ(x1)µ(x2) · · · µ(xn−1)

Theorem 36 Let µ : A → B∗ and let L ⊆ A∗ be a regular language. Then the set {µ(v) : v ∈ L} ⊆ B∗ also is regular.

For example, let B = {c, d} and put µ(c) := anti  and µ(d) := missile . Then

(232) µ(c) = anti
(233) µ(cdd) = anti missile missile
(234) µ(ccddd) = anti anti missile missile missile

So if M = {c^k d^{k+1} : k ∈ N} then the language above is the µ–image of M (modulo the blank at the end). However, we can also do the following: let ν be the map

(235) ν : a → c, m → d, and n, t, i, s, l, e,  → ε

Then

(236) ν(anti) = c,  ν(missile) = d

and M is also the image of L under ν. So, to disprove that L is regular it is enough to show that M is not regular: for if L were regular, then by Theorem 36 so would be its ν–image M.

Now, choose a number k. We show that the conditions are not met for this k. We take the string c^k d^{k+1} and try to decompose it into uvw such that u v^j w ∈ M for all j. Three cases are to be considered. (a) v = c^p for some p (which must be > 0):

(237) cc · · · c • c · · · c • c · · · c dd · · · d

Then uw = c^{k−p} d^{k+1}, which is not in M. (b) v = d^p for some p > 0. The proof is similar. (c) v = c^p d^q, with p, q > 0:

(238) cc · · · c • c · · · c d · · · d • dd · · · d

Then u v^2 w contains a substring dc, so it is not in M. Contradiction in all cases. Hence k does not satisfy the conditions, and since k was arbitrary, the condition is not met for any number.

A third theorem is this: if L and M are regular, so is L ∩ M. This follows from Theorem 13. For if L is regular, so is A∗ − L, and likewise A∗ − M. Then so is (A∗ − L) ∪ (A∗ − M) and, again by Theorem 13, its complement:

(239) A∗ − ((A∗ − L) ∪ (A∗ − M)) = L ∩ M

This can be used in the following example. Infinitives of German are stacked inside each other as follows:

(240) Maria sagte, dass Hans die Kinder spielen ließ.
(241) Maria sagte, dass Hans Peter die Kinder spielen lassen ließ.
(242) Maria sagte, dass Hans Peter Peter die Kinder spielen lassen lassen ließ.
(243) Maria sagte, dass Hans Peter Peter Peter die Kinder spielen lassen lassen lassen ließ.

Although it is from a certain point on hard to follow what the sentences mean, they are grammatically impeccable. Call the language of these strings H. We shall show that H is not regular; the argument is similar to the ones we have given above. From this we shall deduce that German as a whole is not regular. Namely, H is the intersection of German with the following language:

(244) Maria sagte, dass Hans (Peter )^k die Kinder spielen (lassen )^k ließ.

Hence, if German were regular, so would H be. But H is not regular. Contradiction.

There is a similar property of context free languages, known as the Pumping Lemma or Ogden's Lemma. I shall give only the weakest form of it.

Theorem 37 (Pumping Lemma) Suppose that L is a context free language. Then there is a number k ≥ 0 such that every string x ∈ L of length at least k possesses

a decomposition x = uvwyz with v ≠ ε or y ≠ ε such that for every j ∈ N (which can be zero):

(245) u v^j w y^j z ∈ L

The condition v ≠ ε or y ≠ ε must be put in, otherwise the claim is trivial: every language has the property for k = 0. Simply put u = x and v = w = y = z = ε. Notice however that u, w and z may be empty.

I shall briefly indicate why this theorem holds. Suppose that x is very large (for example, let π be the maximum length of the right hand side of a production and ν = |N| the number of nonterminals; then put k > π^{ν+1}). Then any tree for x contains a branch leading from the top to the bottom which has length > ν + 1. Along this branch you will find that some nonterminal label, say A, must occur twice. This means that the string is cut up in this way:

(246) x0 x1 · · · xi−1 ◦ xi xi+1 · · · x j−1 • x j · · · xℓ−1 • xℓ · · · xm−1 ◦ xm · · · xk−1

where the pair of ◦ encloses a substring of label A and the pair of • encloses another one; that is,

(247) xi xi+1 · · · xm−1 and x j · · · xℓ−1

have that same nonterminal label. Now we define the following substrings:

(248) u := x0 x1 · · · xi−1,  v := xi · · · x j−1,  w := x j · · · xℓ−1,  y := xℓ · · · xm−1,  z := xm · · · xk−1

Then one can easily show that u v^j w y^j z ∈ L for every j.

We mention also the following theorems. I omit the proofs.

Theorem 38 Let L be a context free language over A and M a regular language. Then L ∩ M is context free. Also, let µ : A → B∗ be a function. Then the image of L under µ is context free as well.

Instead we shall see how the theorems can be used to show that certain languages are not context free. A popular example is {a^n b^n c^n : n ∈ N}.

We concentrate on showing that it is not context free. Suppose it is context free. Then we should be able to name a k ∈ N such that for all strings x of length at least k a decomposition of the kind above exists. We take x = a^k b^k c^k and fix a decomposition x = uvwyz as in the pumping lemma; we shall have to find appropriate v and y. First, it is easy to see that vy must contain the same number of as, bs and cs (can you see why?). The pair is nonempty, so at least one a, one b and one c must occur in vy. Suppose v contains a and b. Then v^2 contains a b before an a, so u v^2 w y^2 z ∉ L. Contradiction. Likewise, v cannot contain both a and c. So, v contains only as, and then y contains b and c. But then y^2 contains a c before a b, and again u v^2 w y^2 z ∉ L. Contradiction in all cases.

However, the Pumping Lemma is not an exact characterization. Here is a language that satisfies the test but is not context free:

(249) C := {xx : x ∈ A∗}

This is known as the copy language. We shall leave it as an assignment to show that C fulfills the conclusion of the pumping lemma, and concentrate on showing that it is not context free. The argument is a bit involved. Basically, we take a string as in the pumping lemma and fix a decomposition so that u v^j w y^j z ∈ C for every j. The idea is now that no matter what string y0 y1 · · · y p−1 we are given, if it contains a constituent A:

(250) y0 y1 · · · y p−1 • y p · · · yq−1 •

then we can insert the pair v, y like this:

(251) y0 y1 · · · y p−1 • x p · · · x j−1 ◦ yi · · · y j−1 • x · · · xm−1 y p · · · yq−1

Now one can show that, since there are a limited number of nonterminals, we are bound to find a string which contains such a constituent where inserting the strings is inappropriate. Natural languages do contain a certain amount of copying. Malay (or Indonesian) forms the plural of a noun by reduplicating it. Chinese has been argued to employ copying in yes–no questions.
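The case analysis for {a^n b^n c^n : n ∈ N} can be checked exhaustively for a small k; this brute-force sketch (an illustration, not part of the proof) confirms that no decomposition of a^3 b^3 c^3 with vy nonempty survives even a single round of pumping.

```python
# Membership test for {a^n b^n c^n : n in N}.
def in_L(s):
    n = len(s) // 3
    return len(s) == 3 * n and s == 'a' * n + 'b' * n + 'c' * n

# Try every decomposition x = uvwyz of a^3 b^3 c^3 and collect those with
# vy nonempty for which u v^2 w y^2 z is still in the language.
x = 'aaabbbccc'
ok = []
for i in range(len(x) + 1):
    for j in range(i, len(x) + 1):
        for l in range(j, len(x) + 1):
            for m in range(l, len(x) + 1):
                u, v, w, y, z = x[:i], x[i:j], x[j:l], x[l:m], x[m:]
                if (v or y) and in_L(u + v * 2 + w + y * 2 + z):
                    ok.append((u, v, w, y, z))
print(ok)  # [] : no decomposition pumps, as the argument above predicts
```

Of course, the exhaustive check for one k is only a sanity test of the reasoning; the actual proof above works uniformly for every k.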
