
Regular languages

Definition: Let Σ be an alphabet. The class R of regular languages
over Σ is defined as follows:

1. ∅ is an element of R.

2. {λ} is an element of R.

3. For each a ∈ Σ, {a} is an element of R.

4. If L1 and L2 are any elements of R, then

(a) L1 ∪ L2 is an element of R.

(b) L1L2 is an element of R.

(c) L1∗ is an element of R.

Only languages that can be obtained by using statements 1–4 are
regular languages.

Example:

Let Σ = {0, 1}. Some regular languages over


{0, 1} are:

1. ∅

2. {λ}

3. {0}, {1}

4.(a) {0} ∪ {1} = {0, 1}

(b) {0}{1} = {01}, {1}{0} = {10}

(c) {0}∗, {1}∗

Further examples:

1. {0} ∪ {10} = {0, 10}

2. {0}{01} = {001}

3. ({1} ∪ {λ}) {001}

4. {110}∗ {0, 1}

5. {10, 111, 11010}∗

6. {0, 10}∗ ({11}∗ ∪ {001, λ})

A regular language can therefore be described
by an explicit formula. By convention, we may
simplify the formula slightly: We

• leave out {} or replace with (), and

• replace ∪ with +,

to obtain a regular expression.

Take care not to confuse the two notations:
either write {0, 1}∗, or (0 + 1)∗, but not
{0 + 1}∗.

Example:

Some regular expressions are:

1. ∅

2. λ

3. 0, 1

4.(a) 0 + 1

(b) 01, 10

(c) 0∗, 1∗

Further examples:

1. 0 + 10

2. 001

3. (1 + λ) 001

4. (110)∗ (0 + 1)

5. (10 + 111 + 11010)∗

6. (0 + 10)∗ ((11)∗ + (001 + λ))

Regular expressions

Definition: Let Σ be an alphabet. The class RE of regular expressions
over Σ is defined as follows:

1. ∅ is an element of RE.

2. λ is an element of RE.

3. For each a ∈ Σ, a is an element of RE.

4. If r1 and r2 are any elements of RE, then

(a) r1 + r2 is an element of RE.

(b) r1r2 is an element of RE.

(c) r1∗ is an element of RE.

Only expressions that can be obtained by using statements 1–4 are
regular expressions.

Simplifications

 
1. Exponential notation: write (rr) as r^2

2. ‘Plus’ notation: write ((r∗) r) as r^+

If we form regular expressions strictly according to the definition,
they are fully parenthesized. In order to simplify such expressions, we
establish an order of precedence:

1. Highest: ∗

2. Then: concatenation

3. Lowest: +

Convention

If r1 and r2 are regular expressions, then r1 = r2 means that they
correspond to the same language.

Thus
(a + ((b∗) c)) = a + b∗c

However,
(a + b)∗ ≠ a + b∗

Some examples of simplification:

1∗ (1 + λ) = 1∗
1∗1∗ = 1∗
0∗ + 1∗ = 1∗ + 0∗
(0∗1∗)∗ = (0 + 1)∗
(0 + 1)∗01(0 + 1)∗ + 1∗0∗ = (0 + 1)∗

We won’t study the algebra of regular expressions. Think of these
statements as statements about the languages that the regular
expressions represent.

Consider

(0 + 1)∗01(0 + 1)∗ + 1∗0∗ = (0 + 1)∗

(0 + 1)∗ ≈ ({0} ∪ {1})∗
represents the language of all strings of 0’s and 1’s.

(0 + 1)∗ 01 (0 + 1)∗ ≈ ({0} ∪ {1})∗ {0}{1} ({0} ∪ {1})∗
represents the language of all strings of 0’s and 1’s having the
substring 01.

1∗0∗ ≈ {1}∗{0}∗
represents the language of all strings of 0’s and 1’s where the 1’s
precede the 0’s, i.e., that do not have the substring 01.
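
The identity (0 + 1)∗01(0 + 1)∗ + 1∗0∗ = (0 + 1)∗ can be spot-checked by brute force. The following is a minimal sketch, assuming Python's `re` module, where `|` plays the role of +. The helper name `strings_up_to` and the length bound 8 are illustrative choices, and a finite check is evidence rather than a proof.

```python
import re
from itertools import product

def strings_up_to(n, alphabet="01"):
    """Yield every string over `alphabet` of length 0..n."""
    for k in range(n + 1):
        for symbols in product(alphabet, repeat=k):
            yield "".join(symbols)

# (0 + 1)*01(0 + 1)* + 1*0* written in Python's regex syntax.
LEFT_SIDE = re.compile(r"(?:0|1)*01(?:0|1)*|1*0*")

# The right side (0 + 1)* matches every string over {0, 1},
# so the identity holds iff the left side matches every string too.
all_match = all(LEFT_SIDE.fullmatch(s) for s in strings_up_to(8))
print(all_match)
```

The check works because every binary string either contains 01 or has all of its 1's before its 0's.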
Example: The language L ⊆ {0, 1}∗ of all strings of even length:

Any string in L can be divided into a number of strings of length 2.

Any concatenation of strings of length 2 produces a string in L.

Therefore we can write

L = {00, 01, 10, 11}∗

which corresponds to the regular expression

(00 + 01 + 10 + 11)∗

We can also write

L = ({0, 1}{0, 1})∗

which corresponds to

((0 + 1)(0 + 1))∗

Example: The language L ⊆ {0, 1}∗ of all strings containing
an odd number of 1’s:

Any string in L must contain at least one 1, therefore we start with a
string of the form

0^i 1 0^j

There is an even number of further 1’s, each followed by zero or more
0’s. This we can see as a repetition of strings of the form

1 0^m 1 0^n

Therefore we can represent L by

0∗10∗(10∗10∗)∗

If we decide to stop the initial substring immediately after the 1, we
can represent L as:

0∗1(0∗10∗1)∗0∗

We can also concentrate on the last 1 in the string:

(0∗10∗1)∗0∗10∗

or on a 1 in the middle, and make sure it has an even number of 1’s on
either side of it:

0∗(10∗10∗)∗1(0∗10∗1)∗0∗
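
That these different expressions all describe the strings with an odd number of 1’s can be spot-checked by enumeration. The following is a sketch only, assuming Python's `re` syntax for the four expressions; the names `CANDIDATES`, `odd_ones` and `agrees`, and the length bound 10, are illustrative.

```python
import re
from itertools import product

# Four expressions for "odd number of 1's", in Python regex syntax.
CANDIDATES = [
    r"0*10*(?:10*10*)*",             # start with the first 1
    r"0*1(?:0*10*1)*0*",             # stop right after the first 1
    r"(?:0*10*1)*0*10*",             # concentrate on the last 1
    r"0*(?:10*10*)*1(?:0*10*1)*0*",  # a 1 in the middle
]

def odd_ones(s):
    """Direct definition of the language."""
    return s.count("1") % 2 == 1

def agrees(pattern, max_len=10):
    """True if the pattern matches exactly the odd-1's strings up to max_len."""
    rx = re.compile(pattern)
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            s = "".join(symbols)
            if (rx.fullmatch(s) is not None) != odd_ones(s):
                return False
    return True

results = [agrees(p) for p in CANDIDATES]
print(results)
```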

There may be many ways to describe a typical
element of a language, depending on which as-
pect we want to emphasize. There isn’t always
one that is the simplest or most natural.

The regular expression must be general enough


to describe every string in the language, but
not so general that it also describes strings
not in the language.

Note: In this course, don’t spend a lot of time


trying to simplify a regular expression if it isn’t
necessary for further work.

For example, for the strings with an odd number of 1’s, the expression

(10∗10∗)∗10∗

is not general enough, since it doesn’t include strings beginning with 0.

Adding 0∗ at the beginning to obtain

0∗(10∗10∗)∗10∗

rectifies that.

Let L ⊆ {0, 1}∗ be the set of all strings of
length 6 or less.

A regular expression for L is

λ + 0 + 1 + 00 + 01 + 10 + 11 + 000 + . . . + 111110 + 111111

which isn’t very elegant.

Let’s first describe the set of strings of length exactly 6:

(0 + 1)(0 + 1)(0 + 1)(0 + 1)(0 + 1)(0 + 1)
or
(0 + 1)^6

To include strings of length less than 6, we allow each of the factors
to be λ, i.e.,

(0 + 1 + λ)^6
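
The claim that (0 + 1 + λ)^6 describes exactly the strings of length 6 or less can be checked by expanding the six factors. A minimal sketch; the variable names are illustrative and the empty Python string stands in for λ.

```python
from itertools import product

# Each of the six factors contributes "0", "1", or λ (the empty string).
FACTOR = ["0", "1", ""]
generated = {"".join(choice) for choice in product(FACTOR, repeat=6)}

# All strings over {0, 1} of length 0..6, built directly.
expected = {
    "".join(symbols)
    for k in range(7)
    for symbols in product("01", repeat=k)
}

print(generated == expected, len(expected))
```

Many different choices of factors produce the same string (for example, λ can be placed in any position), which is why the comparison is done on sets.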

Let

L = {x ∈ {0, 1}∗ | x ends with 1 and does not contain the substring 00}

We want a regular expression for L.

A string does not contain the substring 00


⇐⇒ no 0 can be followed by 0
⇐⇒ every 0 either comes at the end, or is
followed by 1.

Every string in L ends with 1


⇒ copies of the strings 01 and 1 account for
the entire string
⇒ every string in L corresponds to the regular
expression (1 + 01)∗.

(1 + 01)∗ allows λ, which doesn’t end with 1.

Therefore, (1 + 01)∗ is too general.

Possible fix: add 1 at the end, giving (1 + 01)∗1

Not correct, since we now exclude 01.

Add that possibility: (1 + 01)∗ (1 + 01)

Therefore, the final answer is (1 + 01)^+ .
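
The three candidate expressions can be compared against the definition of L by enumeration. This is a sketch under the assumption that Python's `re` syntax is an acceptable stand-in for the slide notation; the names `FINAL`, `TOO_GENERAL`, `TOO_NARROW`, `in_L` and `agrees_with_L` are illustrative, and the check only covers strings up to a chosen length.

```python
import re
from itertools import product

FINAL = re.compile(r"(?:1|01)+")        # (1 + 01)^+
TOO_GENERAL = re.compile(r"(?:1|01)*")  # (1 + 01)*, also matches λ
TOO_NARROW = re.compile(r"(?:1|01)*1")  # (1 + 01)*1, misses 01

def in_L(s):
    """x ends with 1 and does not contain the substring 00."""
    return s.endswith("1") and "00" not in s

def agrees_with_L(pattern, max_len=10):
    """True if the pattern matches exactly the strings of L up to max_len."""
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            s = "".join(symbols)
            if (pattern.fullmatch(s) is not None) != in_L(s):
                return False
    return True

print(agrees_with_L(FINAL), agrees_with_L(TOO_GENERAL), agrees_with_L(TOO_NARROW))
```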

Let l (for ‘letter’) denote

a + b + ... + z + A + B + ... + Z

and d (for ‘digit’) denote

0 + 1 + ... + 9 .

An identifier in C is any string of length 1 or more that contains only
letters, digits, and underscores (_), and begins with a letter or an
underscore.

Written as a regular expression:

(l + _)(l + d + _)∗
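
In a practical regex dialect, l and d become character classes. The following is a minimal sketch using Python's `re` module; the names `IDENTIFIER` and `is_identifier` are illustrative, and the pattern checks only the shape described above (it does not, for instance, exclude reserved words).

```python
import re

# (l + _)(l + d + _)*  with l = [A-Za-z] and d = [0-9]
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_identifier(text):
    """True if the whole string has the shape of an identifier."""
    return IDENTIFIER.fullmatch(text) is not None

for candidate in ["x", "_tmp1", "1abc", "a-b", ""]:
    print(repr(candidate), is_identifier(candidate))
```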

Let

• a denote plus,

• m denote minus,

• s (for ‘sign’) denote λ + a + m, and

• p denote a point.

A constant in Pascal:

s d^+ (p d^+ + p d^+ E s d^+ + E s d^+)

• First a sign,

• then one or more digits,

• then

– either a point and one or more digits,


which may or may not be followed by an
E, a sign and one or more digits,

– or just the E, the sign and one or more


digits.

If the constant is in exponential format, then


no decimal point is needed. If there is a deci-
mal point, there must be at least one digit on
either side.
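
The expression s d^+ (p d^+ + p d^+ E s d^+ + E s d^+) translates almost symbol for symbol into a practical regex. A sketch, assuming that plus is written "+", minus "-", the point ".", and that E is an uppercase letter; the names `PASCAL_CONSTANT` and `is_constant` are illustrative. As in the expression above, a bare integer such as 12 is not matched.

```python
import re

# s   -> [+-]?      (λ + plus + minus)
# d^+ -> [0-9]+     (one or more digits)
# p   -> \.         (a point)
PASCAL_CONSTANT = re.compile(
    r"[+-]?[0-9]+"
    r"(?:\.[0-9]+"                 # p d^+
    r"|\.[0-9]+E[+-]?[0-9]+"       # p d^+ E s d^+
    r"|E[+-]?[0-9]+)"              # E s d^+
)

def is_constant(text):
    """True if the whole string has the shape described by the expression."""
    return PASCAL_CONSTANT.fullmatch(text) is not None

for candidate in ["3.14", "-2E10", "+1.5E-3", "12", ".5", "5."]:
    print(candidate, is_constant(candidate))
```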
22
Memory required to recognize a language

We want to discuss the problem of recognizing


the strings in a language. Let’s first establish
two conventions:

• We’ll have a single pass from left to right.

– this restriction is somewhat arbitrary: two-way FA accept only
regular languages (proof in Hopcroft and Ullman)

– but it simplifies the discussion of how much info must be remembered
during the process

– and thus allows us to classify languages on the basis of how much we
must remember at each step to recognize them

• We won’t wait until we reach the end of a
given string to decide if the string is in the
language or not. At each step we make a
tentative decision on the basis of the string
of input symbols we’ve seen so far.

When we reach the end, the final decision


is simply the most recent tentative deci-
sion, which was made for the substring that
is actually the entire string.

How much must we remember about a prefix
in order to make a decision about it?

Two extremes:

• Everything, i.e., the entire substring

• Nothing, e.g.,

– language is empty: we can ignore input


and answer ‘no’ at each step,

– language is Σ∗: we can ignore input and


answer ‘yes’ at each step.

In both cases the answer we return is al-


ways the same, thus we don’t need to re-
member anything about the substring, or
distinguish one substring from another.

Between these extremes, the answer is not al-
ways the same. There are at least two strings
x and y for which the answers are different.

Thus the info we remember when we’ve re-


ceived input x must be different from what we
remember when we’ve received input y.

Thus in at least one of these two situations we


must remember something.

Let

L = {x ∈ {0, 1}∗ | x ends with 0}

The decision for any x ≠ λ depends only on its last symbol.

Therefore we needn’t distinguish

• between one string ending with 0 and any


other ending with 0, or

• between one string ending with 1 and any


other ending with 1.

We haven’t considered λ yet.

Compare λ to any string ending in 1:

1. Neither is in L.

2. When you add one more symbol to each of


them, they both end in the same symbol.
The two resulting strings are either both in
or both not in L.

Therefore λ can be treated in exactly the same


way as any string ending in 1.

In summary, we have only two cases:

1. The input string ends with 0.

2. The input string doesn’t end with 0.

Let

L = {x ∈ {0, 1}∗ | x ends with 10}




From the definition it follows that we have to


distinguish between

• a string ending with 10, and

• a string not ending with 10.

Therefore we immediately have at least two


cases. Let’s call them 10 and N .

The decision for any x depends only on its last
two symbols.

Therefore we needn’t distinguish, say,

• between one string ending with 10 and any


other ending with 10, or

• between one string ending with 00 and any


other ending with 00.

Can we take this one step further, and say that


it’s sufficient to remember whether the current
prefix ends with 10 or not? In other words, are
cases 10 and N the only ones?

Consider the string 01011: if you remember
only that it doesn’t end with 10, you cannot
distinguish it from 100, now or later.

Suppose the next input is 0.

• Case 01011: current prefix is 010110,


which is in L.

• Case 100: current prefix is 1000, which is


not in L.

Therefore you have to remember enough now


to distinguish between 01011 and 100.

In how many cases do we have to split case N ?

Consider two strings x and y, where x = 0 and


y is any string ending in 00:

• Neither is in L. Starting from either one


we need at least one more symbol to have
a string in L.

• Input 0: Both x0 and y0 end in 00, and are


not in L.

Input 1: Both x1 and y1 end in 01, and are


not in L.

Thus x and y represent cases that don’t need


to be distinguished. We merge them into one
case, called A.

Now consider the three strings x, y and z,
where x = 1, y ends in 01, and z ends in 11:

• Not one is in L. Starting from any one, we


need at least one more symbol to have a
string in L.

• Input 0: Each of x0, y0, and z0 ends in 10,


and is in L.

Input 1: Each of x1, y1, and z1 ends in 11,


and is not in L.

Thus these three strings represent three cases


that need not be distinguished at all. We
merge them into one case, called B.

Now consider the strings x and y, where x = λ
and y is any string in case A:

• Neither is in L.

• Input 0: Both x0 and y0 are strings represented by case A.

Input 1: Both x1 and y1 are strings represented by case B.

Thus x and y represent two cases that need


not be distinguished. We can thus include λ in
case A.

The total number of states is now three, and


this cannot be reduced any further.

Let

L = {x ∈ {0, 1}∗ | second-last symbol of x is 0}

From the definition it follows that we have to


distinguish between

• a string having second-last symbol 0, and

• a string having second-last symbol 1.

Therefore we immediately have at least two


cases. Let’s call them 0a and 1a.

Are two cases sufficient?

Consider case 0a:

It includes both 01 and 00.

Suppose the next input is 0:

• 010 ∉ L, but

• 000 ∈ L

Thus, we need to split case 0a into two cases,


say with labels 01 and 00.

We could have chosen input 1 to distinguish


the two strings.

Consider case 1a:

It includes both 10 and 11.

Suppose the next input is 0:

• 100 ∈ L, but

• 110 ∉ L

Thus, we need to split case 1a into two cases,


say with labels 10 and 11.

We could have chosen input 1 to distinguish


the two strings.

Thus we have established that it is necessary to
remember both the last two symbols. There-
fore we already have four different cases.

What about the strings λ, 0, and 1? Does any


of them make a further case necessary?

Consider λ. Is it perhaps in case 00?

If we take input 1, then

• λ1 ∉ L, but

• 001 ∈ L

Therefore, it is necessary to distinguish be-


tween λ and 00.

Similarly, one could show that it is necessary


to distinguish λ from 01 and 10.
39
Finally, is λ in case 11?

For both, at least two more input symbols are


needed before the result can be in L.

By then, it will be irrelevant whether we started


off with λ or 11. Thus, λ is covered by case
11.

In a similar way one would show that 1 is cov-


ered by case 11.

Finally, consider 0: It must be distinguished from

• 00, since 00 ∈ L, but 0 ∉ L.

• 01, since 01 ∈ L, but 0 ∉ L.

However, compare 0 and 10: Neither is in L, but after one more input
both resulting strings will have the same last two symbols and then
it is irrelevant what came before. Thus 0 is covered by case 10.

Let

L = {x ∈ {0, 1}∗ | x ends with 11}




From the definition it follows that we have to


distinguish between

• a string ending with 11, and

• a string not ending with 11.

Therefore we immediately have at least two


cases. Let’s call them 11 and N .

The decision for any x depends only on its last
two symbols.

Therefore we needn’t distinguish, say,

• between one string ending with 11 and any


other ending with 11, or

• between one string ending with 00 and any


other ending with 00.

Can we take this one step further, and say that


it’s sufficient to remember whether the current
prefix ends with 11 or not? In other words, are
cases 11 and N the only ones?

Consider the strings x and y, where x ends in
01 and y in 00. Both are in case N .

On input 1,

• x1 ∈ L, but

• y1 ∉ L.

Therefore we have to distinguish between x


and y. Thus case N contains at least two
cases.

Do strings ending with 10 also form another
case?

Consider the strings x and y, where x ends in


00 and y in 10.

Neither x nor y is in L. Regardless of the next input symbol, the two
resulting strings will end in the same two symbols. From then on, it is
irrelevant whether we started off with x or y.

Thus we don’t have to distinguish between


strings ending in 00 or 10.

Finally, there are the strings λ, 0 and 1.

The strings λ and 0 don’t need to be distin-


guished from strings ending in 00, since we
need at least two more input symbols before
the result can be in L. By then, it will be
irrelevant with which we had started.

The string 1 doesn’t need to be distinguished


from strings ending in 01, since after one in-
put symbol the resulting strings will end in the
same two symbols.

L = {x ∈ {0, 1}∗ | x contains an even number of 0’s and
an odd number of 1’s}

We don’t have to remember the entire string


we’ve seen so far.

Steps of abstraction:

1. Only the number of 0’s and the number of 1’s matter, not the way the
symbols are arranged, e.g., 011 and 101 are equivalent.

2. What is crucial is whether the number of 0’s, respectively the
number of 1’s, is odd or even.

E.g., 011 and 0001111 are equivalent, but 011 and 001111 are not (add
a 1 to see why).

Therefore

1. we have four distinct cases, and

2. it is enough to remember which case we


are in currently.
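
The four cases can be written down directly as a recognizer that remembers only two parities. A minimal sketch; the tuple encoding of the cases and the names `step`, `accepts` and `agrees_with_definition` are assumptions made for illustration.

```python
from itertools import product

def step(state, symbol):
    """Update the pair (odd number of 0's?, odd number of 1's?)."""
    zeros_odd, ones_odd = state
    if symbol == "0":
        return (not zeros_odd, ones_odd)
    return (zeros_odd, not ones_odd)

def accepts(s):
    """Single left-to-right pass; only the current case is remembered."""
    state = (False, False)          # λ: even number of 0's, even number of 1's
    for symbol in s:
        state = step(state, symbol)
    return state == (False, True)   # even 0's and odd 1's

def agrees_with_definition(max_len=10):
    """Compare with a direct count on all strings up to max_len."""
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            s = "".join(symbols)
            direct = s.count("0") % 2 == 0 and s.count("1") % 2 == 1
            if accepts(s) != direct:
                return False
    return True

print(accepts("001"), accepts("01"), agrees_with_definition())
```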

L = {x ∈ {0, 1}∗ | x ends in 1 and does not contain the substring 00}

Let s denote the string we’ve seen so far.

Case N: s contains 00. Then s ∉ L, and s is not a prefix of any string
in L.

s doesn’t contain 00: Here there are three possible cases.

Case 0: Last symbol of s is 0.

1. Next input is 0: go to Case N .

2. Next input is 1: go to Case 1.

Case 1: Last symbol of s is 1. (Note errors in


Edition 2 of textbook.)

1. Next input is 0: go to Case 0.

2. Next input is 1: go to Case 1.

Case λ: s = λ.

1. Differs from Case N: E.g., λ01 ∈ L.

2. Differs from Case 0: E.g., if the next input is 0, λ0 leads to
Case 0, but 00 leads to Case N.

3. Differs from Case 1: λ ∉ L, while Case 1 represents all the strings
that are in L.

Therefore

1. we have divided the strings in {0, 1}∗ into


four different types, and

2. in order to recognize strings in L, we need


only remember which of the four types we
have so far.
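
The four types and the moves between them can be collected in a small table. The following sketch assumes the transitions listed for Case 0 and Case 1, takes the moves out of Case λ to be the obvious ones (0 leads to Case 0, 1 leads to Case 1), and treats Case N as a trap; the dictionary encoding and the names `NEXT_CASE`, `accepts` and `agrees_with_definition` are illustrative.

```python
from itertools import product

# NEXT_CASE[case][symbol] is the case that results from reading the symbol.
NEXT_CASE = {
    "λ": {"0": "0", "1": "1"},
    "0": {"0": "N", "1": "1"},
    "1": {"0": "0", "1": "1"},
    "N": {"0": "N", "1": "N"},   # once 00 has been seen, we stay in Case N
}

def accepts(s):
    case = "λ"
    for symbol in s:
        case = NEXT_CASE[case][symbol]
    return case == "1"           # Case 1 holds exactly the strings in L

def agrees_with_definition(max_len=10):
    """Compare with the definition of L on all strings up to max_len."""
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            s = "".join(symbols)
            if accepts(s) != (s.endswith("1") and "00" not in s):
                return False
    return True

print(accepts("0101"), accepts("1001"), agrees_with_definition())
```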

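Consider the transition diagrams for the last two languages (diagrams
not reproduced here): the first for strings with an even number of 0’s
and an odd number of 1’s, the second for strings that end in 1 and do
not contain the substring 00.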

We can interpret the diagrams in two ways:

1. as a flowchart of the algorithm we follow


when processing strings, or

2. as specifying an abstract machine.

Interpretation as flowchart:

The circles correspond to the distinct cases the


algorithm is keeping track of, or the distinct
types of strings we have divided Σ∗ into.

The label used in each circle describes

1. in the first chart, the parities of the # of


0’s and the # of 1’s in the current string,
or

2. in the second chart, the various cases we


considered regarding the string s.

Note the following aspects of such a diagram:

• The short arrow that doesn’t originate at


one of the circles indicates the starting
point—this case includes λ.

• The double circle indicates the case in which the current string
is in L.

• The arrows starting at each circle indicate,


for each possible input, which case results,
therefore how much info we need to re-
member.

Interpretation as abstract machine:

At any time the machine is in one of four pos-


sible states, labeled λ, 0, 1, and N .

Initially, when the machine is activated, it’s in


state λ.

It receives successive inputs of 0 or 1. As a


result of being in a certain state and getting
a certain input, it moves to the state specified
by the corresponding arrow.

Finally, certain states are accepting states.

A string of 0’s and 1’s is in L iff the machine is


in an accepting state as a result of processing
that string.

The term abstract machine means that it’s a
specification of the capabilities that the ma-
chine must have.

Important are:

• the set of states, and

• the function that specifies, for each (state,


input symbol)-pair, the state the machine
goes to next.

Crucial is that the set of states is finite: the
number of states puts an absolute limit on the
amount of info the machine needs to, or is able
to, remember.

Strings in the language can be arbitrarily long,


but remembering a fixed amount of info is suf-
ficient.

The machine has only one form of memory,


namely being able to distinguish between these
states.

Such a machine can recognize only simple lan-


guages.

Finite automata

Definition: A finite automaton (FA for short) is a
5-tuple (Q, Σ, q0, A, δ), where

• Q is a finite set of states,

• Σ is a finite alphabet of input symbols,

• q0 ∈ Q is the initial state,

• A ⊆ Q is the set of accepting states,

• δ : Q × Σ → Q is the transition function.

For q ∈ Q, a ∈ Σ, δ (q, a) is the state to which


the FA moves if it is in state q and receives
input a.

Note: order in 5-tuple in Edition 2 wrong


Consider again

L = {x ∈ {0, 1}∗ | x ends with 10}

Until this FA gets at least two input symbols,
it remembers exactly what it has received.
Therefore there is a separate state for each
of λ, 0 and 1.

Once it has got two inputs, it remembers the


last two symbols it has seen. Therefore it cy-
cles between the states 00, 01, 11, and 10.

q     δ(q, 0)   δ(q, 1)
λ     0         1
0     00        01
1     10        11
00    00        01
01    10        11
10    00        01
11    10        11

We have already seen that we can reduce the
number of states.

Consider the three states 1, 01, and 11. Let


x, y and z be strings that take the FA to these
states.

• x, y and z don’t need to be distinguished


now, since no state is accepting.

• The rows of the states are exactly the


same. For all three, input 0 takes the FA to
state 10, and input 1 takes it to state 11.
Thus, after one input, we won’t be able to
distinguish between x, y and z anymore.

These three states do not represent three cases


that need to be distinguished. We merge them
into one state, called B.
63
Consider the three states 0, 00 and 10.

Their rows are identical.

However, 10 is an accepting state, while the


other two are not. Therefore we cannot simply
merge all three.

However, we can merge the two states 0 and


00. Call the resulting state A.

q δ (q, 0) δ (q, 1)
λ A B
A A B
B 10 B
10 A B

The total number of states is now four.


Now consider the states λ and A:

• Neither is an accepting state.

• Their rows are the same.

We can thus include λ in A.

q δ (q, 0) δ (q, 1)
A A B
B 10 B
10 A B

The total number of states is now three, and


this cannot be reduced any further.
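
That merging states did not change the accepted language can be spot-checked by running the original seven-state table and the final three-state table side by side. A sketch only; the tuple encoding of the tables and the names `SEVEN_STATES`, `THREE_STATES`, `accepts` and `same_language` are illustrative.

```python
from itertools import product

# Each entry is (δ(q, 0), δ(q, 1)).
SEVEN_STATES = {
    "λ": ("0", "1"), "0": ("00", "01"), "1": ("10", "11"),
    "00": ("00", "01"), "01": ("10", "11"),
    "10": ("00", "01"), "11": ("10", "11"),
}
THREE_STATES = {"A": ("A", "B"), "B": ("10", "B"), "10": ("A", "B")}

def accepts(table, start, s):
    state = start
    for symbol in s:
        state = table[state][int(symbol)]
    return state == "10"   # 10 is the only accepting state in both tables

def same_language(max_len=10):
    """True if both FAs agree on every string up to max_len."""
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            s = "".join(symbols)
            if accepts(SEVEN_STATES, "λ", s) != accepts(THREE_STATES, "A", s):
                return False
    return True

print(same_language())
```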

Consider

We could say that

• state A represents ‘no progress toward 10’,


(note error in Edition 3) since the FA has
received either no input at all or an input
string ending with 0, but not ending with
10,

• state B represents ‘halfway there’, since


the FA has received an input string end-
ing with 1.

We’ll later have a systematic procedure for


minimizing the number of states in a given FA.
66
We have
δ : Q × Σ → Q ,
the state to which the FA goes, if it is in state
q and receives input a.

We want
δ ∗ : Q × Σ∗ → Q ,
the state in which the FA ends up, if it begins
in state q and receives the string x of input
symbols.

We’ll develop a recursive definition:

• define δ ∗ (q, λ),

• assume we know what δ ∗ (q, y ) is,

• for a ∈ Σ, define δ ∗ (q, ya)

• Basis: input string is λ.

The FA shouldn’t change state as result of


getting λ, thus ∀q ∈ Q, δ ∗ (q, λ) = q.

• Assume we know what δ ∗ (q, y ) is.

• Consider δ ∗ (q, ya): It’s the state that the


FA reaches when it begins in q and receives
first input string y, then symbol a.

The FA is in δ ∗ (q, y ) after reading y. By


definition, for every p ∈ Q, the state to
which the FA moves from p on input a is
δ (p, a).

Thus δ ∗ (q, ya) = δ (δ ∗ (q, y ) , a).

Definition

Let M = (Q, Σ, q0, A, δ ) be an FA. We define

δ ∗ : Q × Σ∗ → Q
as follows:

• ∀q ∈ Q, δ ∗ (q, λ) = q.

• ∀q ∈ Q, y ∈ Σ∗, and a ∈ Σ,

δ∗(q, ya) = δ(δ∗(q, y), a)

In other words, δ ∗ (q, x) processes the symbols


of x one at a time, using δ to move from one
state to the next.
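
The recursive definition carries over to code almost literally. The following is a minimal sketch; as the example FA it assumes the three-state machine for "ends with 10" (states A, B, 10), and the names `DELTA` and `delta_star` are illustrative.

```python
# δ for the three-state FA accepting strings that end with 10.
DELTA = {
    ("A", "0"): "A",  ("A", "1"): "B",
    ("B", "0"): "10", ("B", "1"): "B",
    ("10", "0"): "A", ("10", "1"): "B",
}

def delta_star(q, x):
    """Extended transition function, following the recursive definition."""
    if x == "":                              # basis: δ*(q, λ) = q
        return q
    y, a = x[:-1], x[-1]                     # write x as ya, a a single symbol
    return DELTA[(delta_star(q, y), a)]      # δ*(q, ya) = δ(δ*(q, y), a)

print(delta_star("A", "0110"))
```

An iterative loop over the symbols of x computes the same state; the recursion is used here only to mirror the definition.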

Example

δ∗(q, abc) = δ(δ∗(q, ab), c)
= δ(δ(δ∗(q, a), b), c)
= δ(δ(δ∗(q, λa), b), c)
= δ(δ(δ(δ∗(q, λ), a), b), c)
= δ(δ(δ(q, a), b), c)
= δ(δ(q1, b), c)
= δ(q2, c)
= q3

where q1 = δ(q, a), q2 = δ(q1, b), and q3 = δ(q2, c).

Note that as part of the calculation above, we
worked out

δ∗(q, a) = δ∗(q, λa)
= δ(δ∗(q, λ), a)
= δ(q, a)

Therefore, for strings of length 1, δ and δ ∗ can


be used interchangeably.

Another property that δ∗ satisfies:

∀q ∈ Q, x, y ∈ Σ∗, δ∗(q, xy) = δ∗(δ∗(q, x), y)


Definitions

Let M = (Q, Σ, q0, A, δ ) be an FA. A string


x ∈ Σ∗ is accepted by M if

δ ∗ (q0, x) ∈ A .

If a string is not accepted, it is rejected by M .

The language accepted or recognized by M , is


the set

L (M ) = {x ∈ Σ∗ | x is accepted by M }

If L is any language over Σ, L is accepted, or


recognized, by M if and only if L = L (M ).

Note the if and only if in the last statement.

It is not sufficient to say that L is accepted by


M if every string in L is accepted by M .

That would mean that the FA that accepts Σ∗


would accept any conceivable language over Σ.

The power of any abstract machine, such as


an FA, does not lie in the number of strings
that it accepts, but in its ability to discrimi-
nate between strings, and accept some, while
rejecting others.

To accept a language L, an FA must accept


all the strings in L and reject all the strings in
Σ∗ \ L.
73
Theorem

A language L over the alphabet Σ is regular if


and only if there is an FA that accepts L.

In other words,

• if M is an FA, then there is a regular expres-


sion corresponding to the language L (M ),
and

• if r is a regular expression, there is an FA


that accepts the language corresponding
to r.

There exist algorithms for constructing the


regular expression of the first case, and the
FA of the second case. We won’t cover that
in the course.

In the following we’ll do two examples of each


direction.
74
Consider the following FA with initial state A:

We want the regular expression corresponding


to the language L accepted by the FA.

We notice that there are two accepting states,


namely A and B.

• State A:

– it’s also the initial state, therefore λ ∈ L.

– the only other strings x for which


δ ∗ (A, x) = A are concatenations of 00.

This corresponds to (00)∗.

• State B:

– The only way to reach B from the initial


state A without looping through A is
with 11.

– δ ∗ (B, x) = B iff x consists of copies


of 11.

This corresponds to 11(11)∗.

It is not possible to reach A or B in any other


way. Therefore the regular expression is
(00)∗ + (00)∗11(11)∗ = (00)∗(11)∗ .
76
We want the regular expression for the lan-
guage L accepted by the following FA M :

We notice the following:

• Regardless of which state we are in, the


input b takes us to state B. Thus any string
ending in b takes us to state B.

• From state B, the input aaa takes us to


state E, the accepting state.

Therefore, if a string ends in baaa, it is ac-


cepted.

The only way to reach E is from D with input


a; the only way to reach D is from C with input
a; the only way to reach C is from B with input
a. The only way to reach B is with input b;
this can happen in any state. Thus any string
accepted by M must end in baaa.

Therefore the regular expression is


(a + b)∗baaa .
78
Consider

(a + b)∗ (ab + bba) (a + b)∗


We want the FA that accepts the correspond-
ing language L, the set of all strings over {a, b}
that contain at least one of ab and bba.

We note

• ab and bba should be accepted,

• if a string x is accepted, then xy, y ∈ {a, b}∗


should be accepted.

At states p, r and s, we need transitions la-
beled a, a, and b, respectively. We may need
additional states.

• state p, input a: let’s think of p as the non-


accepting state we are in if the last input
was a. If we are in p and the next input
is a, no progress towards a result has been
made; thus δ (p, a) = p.

• state r, input a: let’s think of r as the non-


accepting state we are in if the last input
was b, and that b wasn’t preceded by a b.
If we now get an a, the last b becomes
worthless. By our informal definition of p,
δ (r, a) = p.

• state s, input b: let’s think of s as the non-


accepting state we are in if the string thus
far ended in bb. If we are in s and the next
input is b, no progress towards a result has
been made; thus δ (s, b) = s.

Consider
(11 + 110)∗0
We want the FA that accepts the correspond-
ing language L.

• λ ∉ L, therefore q0 cannot be an accepting state.

• 0 ∈ L, therefore 0 must take the FA to an accepting state.

• 1 ∉ L

Moreover, we cannot lump 1 and λ into one case, since there is at
least one string that discriminates between them:

1 110 ∉ L, but λ 110 ∈ L

L doesn’t contain

• strings beginning with 0, apart from 0 it-


self, or

• strings beginning with 10.

Therefore we introduce a state s that repre-


sents all strings that cannot be the prefix of
an element of L. s is a trap state.
84
Consider state r, input 1: The FA should not

• stay in r, since we need to differentiate between 1 and 11:

1 10 ∈ L, but 11 10 ∉ L.

• return to q0, since we need to differentiate between λ and 11:

λ 00 ∉ L, but 11 00 ∈ L.

Therefore we need a new state, say t.


Consider state t, input 0:

110 ∈ L, therefore the FA must go to an ac-


cepting state.

Can it go to state p, or do we need another


accepting state?

Suppose it went to p. Then it wouldn’t be able to distinguish anymore
between strings starting with 0 and those starting with 110. But

00 ∉ L while 1100 ∈ L

Therefore we need another accepting state, say u, and δ(t, 0) = u.

Consider

• state u, input 0: The FA is in the same


situation as after reading 0 only - 1100 ∈ L
but not the prefix of any longer string in
L.

Therefore δ (u, 0) = p.

• state t, input 1: We can interpret t as in-


dicating the end of a copy of 11. If we get
another 1, we can see it as the start of the
next copy of 11.

Therefore δ (t, 1) = r.

• state u, input 1: We can interpret u as


indicating the end of a copy of 110. If we
get another 1, we can see it as the start of
the next copy of 110.

Therefore δ (u, 1) = r.
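
The transitions derived in this construction can be collected and compared with the expression (11 + 110)∗0 by enumeration. This is a sketch under stated assumptions: the diagram is not reproduced in the text, so the moves δ(q0, 0) = p, δ(q0, 1) = r, δ(r, 0) = s, the moves from p to s, and the trap behaviour of s are read off the discussion; the names `DELTA`, `ACCEPTING`, `accepts` and `agrees_with_expression` are illustrative.

```python
import re
from itertools import product

DELTA = {
    "q0": {"0": "p", "1": "r"},
    "p":  {"0": "s", "1": "s"},   # 0 is in L but is no prefix of a longer string in L
    "r":  {"0": "s", "1": "t"},   # strings beginning with 10 are dead
    "t":  {"0": "u", "1": "r"},
    "u":  {"0": "p", "1": "r"},
    "s":  {"0": "s", "1": "s"},   # trap state
}
ACCEPTING = {"p", "u"}

def accepts(x):
    state = "q0"
    for symbol in x:
        state = DELTA[state][symbol]
    return state in ACCEPTING

EXPRESSION = re.compile(r"(?:11|110)*0")   # (11 + 110)*0 in Python syntax

def agrees_with_expression(max_len=12):
    """True if the FA and the expression agree on all strings up to max_len."""
    for k in range(max_len + 1):
        for symbols in product("01", repeat=k):
            x = "".join(symbols)
            if accepts(x) != (EXPRESSION.fullmatch(x) is not None):
                return False
    return True

print(agrees_with_expression())
```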

We continued to add states as long as it was
necessary. We stopped as soon as we were
able to define the required transitions in a way
that involved only the states we had already.

It is only due to our theorem (a language is


regular iff accepted by an FA) that we know
that the addition of states will stop.

If our language weren’t regular, we wouldn’t


be able to stop.

Perhaps the most difficult part of finding an FA


for a given language is determining whether or
not a new state is needed.

Distinguishing one string from another

It is possible to use a finite automaton to recognize an infinite
language,

• if we can group input strings in such a way


that strings in the same group do not need
to be distinguished from each other,

• which will make it unnecessary for the FA


to remember every input string.

The number of states in an FA

depends on

the number of distinct strings that must be


distinguished from each other.

Definitions

Let L be a language in Σ∗, and x ∈ Σ∗. Let

L/x = {z ∈ Σ∗ | xz ∈ L} .
Two strings x and y in Σ∗ are called distinguishable w.r.t. L if
L/x ≠ L/y.

Any string z ∈ Σ∗ that is in one of the two sets but not the other,
i.e., for which

xz ∈ L and yz ∉ L, or vice versa

is said to distinguish x and y w.r.t. L.

If
L/x = L/y ,
x and y are indistinguishable w.r.t. L.

In other words, x and y are indistinguishable


w.r.t. L if for every z, either both xz and yz
are in L, or neither xz nor yz is in L.
91
In order to show that two strings x and y are
distinguishable, it is sufficient to find one z so
that either

xz ∈ L and yz ∉ L
or
xz ∉ L and yz ∈ L .

In order to show that two strings x and y are


indistinguishable, you must be able to show
that no such z exists. This is usually much
more difficult than the previous case.

Example

Let

L = {s ∈ {0, 1}∗ | s ends with 10}

• Let x = 01011 and y = 100.

Consider z = 0.

Then xz = 010110 ∈ L, but yz = 1000 ∉ L. In other words, z ∈ L/x, but
z ∉ L/y. Thus x and y are distinguishable.

• Let x = 0 and y = 100.

For any z that ends with 10, xz ∈ L and yz ∈ L. For any z that doesn’t
end in 10, xz ∉ L and yz ∉ L. In other words, L/x = L = L/y. Thus x and
y are indistinguishable w.r.t. L.
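
Searching for a distinguishing string z is easy to mechanize. A minimal sketch; the names `in_L` and `find_distinguishing` and the bound `max_len` are illustrative. Finding a z shows that two strings are distinguishable, whereas finding none up to a bound is only evidence of indistinguishability, not a proof.

```python
from itertools import product

def in_L(s):
    """L = strings over {0, 1} that end with 10."""
    return s.endswith("10")

def find_distinguishing(x, y, member, max_len=6, alphabet="01"):
    """Return the first z (shortest first) with exactly one of xz, yz in L."""
    for k in range(max_len + 1):
        for symbols in product(alphabet, repeat=k):
            z = "".join(symbols)
            if member(x + z) != member(y + z):
                return z
    return None   # no distinguishing string of length <= max_len

print(repr(find_distinguishing("01011", "100", in_L)))
print(repr(find_distinguishing("0", "100", in_L)))
```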

Another example:

Let

L = {0^k 1^k | k ≥ 1} .

We can show that there is an infinite set of strings in {0, 1}∗, any
two of which are distinguishable with respect to L.

Let S = {0, 00, 000, . . .}.

Let x and y be two distinct elements of S.
Then x = 0^i, i ≥ 1,
and y = 0^j, j ≥ 1, j ≠ i.

Let z = 1^i.
Then xz = 0^i 1^i ∈ L
and yz = 0^j 1^i ∉ L.
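
The choice z = 1^i can be tried out for many pairs at once. A sketch; the names `in_L` and `pairwise_distinguished` and the bound `limit` are illustrative, and the loop only covers finitely many of the infinitely many pairs.

```python
def in_L(s):
    """L = { 0^k 1^k | k >= 1 }."""
    k = len(s) // 2
    return k >= 1 and s == "0" * k + "1" * k

def pairwise_distinguished(limit):
    """Check that z = 1^i separates 0^i from 0^j for all i != j up to limit."""
    for i in range(1, limit + 1):
        z = "1" * i
        for j in range(1, limit + 1):
            if i == j:
                continue
            if not in_L("0" * i + z) or in_L("0" * j + z):
                return False
    return True

print(pairwise_distinguished(20))
```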

Consider the language

L = {x ∈ {0, 1}∗ | x ends with 01} .

The strings λ, 0, and 01 are pairwise distinguishable with respect
to L.

• λ and 01:

λλ ∉ L, but 01λ ∈ L

• 0 and 01:

0λ ∉ L, but 01λ ∈ L

• λ and 0:

λ1 ∉ L, but 01 ∈ L

Therefore, λ, 0, and 01 are pairwise distinguishable.
95
We can put a lower bound on the memory re-
quirements of any FA that is capable of recog-
nizing a given language L.

Before we can prove the corresponding theo-


rem, we first need a lemma.

Lemma

Suppose that L ⊆ Σ∗ and M = (Q, Σ, q0, A, δ )


is an FA recognizing L. If x and y are strings
in Σ∗ that are distinguishable w.r.t. L, then
δ ∗ (q0, x) 6= δ ∗ (q0, y ).

Proof

x and y are distinguishable w.r.t. L.

Then there is a z such that z is in exactly one
of L/x and L/y.

Then exactly one of xz and yz is in L.

Then exactly one of δ∗(q0, xz) and δ∗(q0, yz) is
accepting.

Then
δ∗(q0, xz) ≠ δ∗(q0, yz)
97
However,
δ∗(q0, xz) = δ∗(δ∗(q0, x), z)

δ∗(q0, yz) = δ∗(δ∗(q0, y), z)

If δ∗(q0, x) and δ∗(q0, y) were equal, the two
right-hand sides would be equal as well.

Thus
δ∗(q0, x) ≠ δ∗(q0, y)
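
The lemma can be observed on a concrete machine. The sketch below uses a hypothetical three-state FA for the earlier language of strings ending with 10. The state names q0, q1, q2 and the transition table are assumptions made for illustration and are not taken from the slides.

```python
# Hypothetical FA for "ends with 10":
# q0 = no progress, q1 = last symbol was 1, q2 = last two symbols were 10.
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q1",
    ("q2", "0"): "q0", ("q2", "1"): "q1",
}
ACCEPTING = {"q2"}


def delta_star(state: str, s: str) -> str:
    """Extended transition function: process s symbol by symbol."""
    for a in s:
        state = DELTA[(state, a)]
    return state


# x and y are distinguishable via z = 0 (see the earlier example).
x, y, z = "01011", "100", "0"
state_x = delta_star("q0", x)
state_y = delta_star("q0", y)
# Processing xz is the same as processing z starting from delta_star(q0, x).
composed = delta_star(state_x, z)
direct = delta_star("q0", x + z)
```

Since x and y are distinguishable, the lemma predicts that `state_x` and `state_y` differ, which is what this machine shows.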

98
Theorem

Suppose that L ⊆ Σ∗ and, for some positive
integer n, there are n strings in Σ∗, pairwise
distinguishable w.r.t. L. Then every FA recognizing
L must have at least n states.

Proof

Suppose x1, x2, . . . , xn are n strings, pairwise
distinguishable w.r.t. L.

Suppose M = (Q, Σ, q0, A, δ) is an FA with
fewer than n states.

The states δ∗(q0, x1), δ∗(q0, x2), . . . , δ∗(q0, xn)
cannot all be distinct. Therefore, for some
i ≠ j, δ∗(q0, xi) = δ∗(q0, xj).

Since xi and xj are distinguishable w.r.t. L, the
lemma implies that M cannot recognize L.
99
Example

Suppose that n ≥ 1, and let

Ln = {x ∈ {0, 1}∗ | |x| ≥ n and
the nth symbol from the right in x is 1} .

Naive approach: we create a distinct state for
every possible string of length n or less, i.e.,
for λ, 0, 1, 00, 01, 11, 10, 000, . . .

Then the FA can remember the last n symbols
of the current input string.

The total number of states is:

∑ (i = 0 to n) (number of strings of length i)

Since the input alphabet has only two symbols,
the number of strings of any length i is 2^i.

Then the number of states is 2^(n+1) − 1.
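
The count can be confirmed by enumerating the strings directly. This is a small illustrative check, and the helper name `strings_up_to` is an assumption.

```python
from itertools import product


def strings_up_to(n: int) -> list[str]:
    """All strings over {0, 1} of length 0, 1, ..., n."""
    return ["".join(p) for i in range(n + 1) for p in product("01", repeat=i)]


# Number of states of the naive FA for n = 1, ..., 5.
counts = {n: len(strings_up_to(n)) for n in range(1, 6)}
```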


100
The (incomplete) FA for L3:

The FA has 2^(3+1) − 1 = 15 states. The number
of states can be reduced.
101
We’ll show that any FA recognizing Ln must
have at least 2^n states.

Let x and y be two distinct strings of length n.

Then there must be at least one i, 1 ≤ i ≤ n,
such that x and y differ in position i (counting
from the left). Mark that position. To the right
of the marker there are

n − i

symbols. Now choose any string z ∈ {0, 1}∗ of
length i − 1.

Then
|xz| = |yz| = n + i − 1 .
To the right of the marker there are now

(n − i) + (i − 1) = n − 1

symbols. Thus the marked position is the nth
from the right.
102
Since the strings xz and yz differ in this po-
sition, one of them must have a 1 in the nth
position from the right, and the other a 0.

Therefore one is in Ln, and the other not.
Thus x and y are distinguishable w.r.t. Ln.

There are 2^n strings of length n. We have
shown that any two of them are distinguishable
w.r.t. Ln. By our theorem every FA that
recognizes Ln has at least 2^n states.
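
The construction of z can be checked exhaustively for small n. In this sketch z is taken to be 0^(i−1), where i is the first position in which x and y differ. The argument allows any z of length i − 1 and any differing position, so these particular choices are assumptions made for illustration.

```python
from itertools import combinations, product


def in_Ln(s: str, n: int) -> bool:
    """True if |s| >= n and the nth symbol from the right is 1."""
    return len(s) >= n and s[-n] == "1"


def distinguishing_suffix(x: str, y: str) -> str:
    """Return z = 0^(i-1), where i is the first (1-based) position where x, y differ."""
    i = next(k for k, (a, b) in enumerate(zip(x, y), start=1) if a != b)
    return "0" * (i - 1)


def all_pairwise_distinguishable(n: int) -> bool:
    """Check every pair of distinct strings of length n."""
    strings = ["".join(p) for p in product("01", repeat=n)]
    for x, y in combinations(strings, 2):
        z = distinguishing_suffix(x, y)
        if in_Ln(x + z, n) == in_Ln(y + z, n):
            return False
    return True


result_n3 = all_pairwise_distinguishable(3)
```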

103
Recall the theorem: Suppose that L ⊆ Σ∗
and, for some positive integer n, there are n
strings in Σ∗, pairwise distinguishable w.r.t. L.
Then every FA that recognizes L has at least
n states.

Let L be any language: if for every positive integer
n, there are n strings which are pairwise
distinguishable w.r.t. L, then L cannot be recognized
by an automaton with a finite number
of states.

104
Often we

• don’t show that, for every n ≥ 1, there are
n pairwise distinguishable strings,

• but instead that there is an infinite number
of pairwise distinguishable strings.

After all, if we have an infinite set S of pairwise
distinguishable strings, then, for any n,
any subset of S of size n has n pairwise distinguishable
elements.

It is sensible to choose a subset of S for which
it is easy to prove that all strings are pairwise
distinguishable.

105
Example:

Let L = {ww | w ∈ {0, 1}∗} .

Let S = {0, 00, 000, . . .}.

Let x, y ∈ S, x ≠ y.
Then x = 0^i and y = 0^j, j ≠ i.

Let z = 1 0^i 1.
Then xz = 0^i 1 0^i 1 ∈ L, but yz = 0^j 1 0^i 1 ∉ L.

S is therefore an infinite set of pairwise
distinguishable strings, and L is not regular.
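
The choice z = 1 0^i 1 can be tested for small i and j. The sketch below is a finite check under an assumed bound and uses illustrative function names. It does not replace the argument above.

```python
def in_L(s: str) -> bool:
    """Membership test for L = {ww | w in {0,1}*}."""
    half, rest = divmod(len(s), 2)
    return rest == 0 and s[:half] == s[half:]


def pairs_distinguished(limit: int) -> bool:
    """Check that z = 1 0^i 1 separates 0^i from 0^j for all i != j <= limit."""
    for i in range(1, limit + 1):
        for j in range(1, limit + 1):
            if i == j:
                continue
            z = "1" + "0" * i + "1"
            if not in_L("0" * i + z) or in_L("0" * j + z):
                return False
    return True


ok = pairs_distinguished(10)
```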

106
Moreover, sometimes we

• don’t show that there is an infinite number
of pairwise distinguishable strings,

• but that all strings over the alphabet are
pairwise distinguishable.

Example:

Let Lpal be the language of palindromes over
{0, 1}, i.e., the set of strings of 0’s and 1’s that
are identical to their reverse. E.g., 0, 0110,
010 are in Lpal, but 1101 is not.

Lpal is very simple to describe, but we can
show that we need to distinguish all strings
over {0, 1}. Therefore we need much more
memory to recognize it than that offered by
any FA.
107
For Lpal we need to distinguish all strings:
after processing any string x, we must remember
enough to distinguish x from every other
string, say y.

For every x and y, there is at least one possible
string z of subsequent inputs that would cause
us to make one decision for x and the opposite
for y.

In other words, for any two distinct strings x and y,
there is a string z so that exactly one of xz
and yz is in Lpal.

108
Theorem:

Lpal cannot be accepted by any FA, and therefore
is not regular.

Proof:

Let x, y ∈ {0, 1}∗, x ≠ y. We’ll show that x and
y are distinguishable w.r.t. Lpal.

We consider two cases.

• |x| = |y|:

Let z = x^r.

Then xz = x x^r ∈ Lpal, but yz = y x^r ∉ Lpal.

109
• |x| ≠ |y|:

Assume, without loss of generality, |x| < |y|, and
write y = y1y2, with |y1| = |x|.

Let w ∈ {0, 1}∗. For z = w w^r x^r, xz =
x w w^r x^r ∈ Lpal.

Now narrow down the choice of w: Let
|w| = |y2|, but w ≠ y2.

Consider yz = y1 y2 z = y1 y2 w w^r x^r.

By choice, |y1| = |x^r|, and |y2| = |w| = |w^r|.

yz ∈ Lpal ⇒ w^r = y2^r ⇒ w = y2. Contradiction.
Hence yz ∉ Lpal.

Any two distinct strings over {0, 1} are thus
distinguishable w.r.t. Lpal. We have therefore
shown that Lpal is not regular.
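
The two cases of the proof translate into a procedure that builds a distinguishing z for any pair of distinct strings. In this sketch w is obtained by flipping the first symbol of y2. The proof only requires |w| = |y2| and w ≠ y2, so that choice and the function names are assumptions made for illustration.

```python
def is_palindrome(s: str) -> bool:
    return s == s[::-1]


def distinguishing_suffix(x: str, y: str) -> str:
    """Return z such that exactly one of xz and yz is a palindrome (x != y)."""
    if x == y:
        raise ValueError("x and y must be distinct")
    if len(x) > len(y):
        x, y = y, x                      # ensure |x| <= |y|
    if len(x) == len(y):
        return x[::-1]                   # case |x| = |y|: z = x^r
    y2 = y[len(x):]                      # y = y1 y2 with |y1| = |x|
    flipped = "1" if y2[0] == "0" else "0"
    w = flipped + y2[1:]                 # |w| = |y2| and w != y2
    return w + w[::-1] + x[::-1]         # z = w w^r x^r


z_example = distinguishing_suffix("0", "00")
```

For x = 0 and y = 00 this yields z = 110, so xz = 0110 is a palindrome while yz = 00110 is not.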

110
The concept of distinguishability gives us

• a method for showing that a given language
is not regular. There are other
methods, e.g., the so-called pumping
lemma for regular languages.

• a tool for finding the FA with the fewest
possible states that recognizes a given language.

111
Unions, Intersections, Complements

Suppose both L1 and L2 are regular languages
over an alphabet Σ.

We know

• there are FAs M1 and M2 accepting L1 and
L2, respectively,

• L1 ∪ L2, L1L2, and L1∗ are regular and can
therefore be accepted by FAs.

Given M1 and M2, can we produce new FAs
that accept the three new languages?

At this point we’ll consider the union, from
which we can also derive methods for intersection
and difference.
112
Consider union:

Suppose

M1 = (Q1, Σ, q1, A1, δ1)


and
M2 = (Q2, Σ, q2, A2, δ2) .

A machine M can decide whether a string

x ∈ L1 ∪ L2
if it has enough information at each step to
decide separately whether

x ∈ L1
and whether
x ∈ L2 .

113
It needs to

• keep track of M1 and M2 simultaneously,

• therefore remember at each step the current
state of both,

• therefore remember the ordered pair (p, q),
where p ∈ Q1, q ∈ Q2.

114
Thus, the set of states is

Q1 × Q2 .
Then the initial state must be

(q1, q2) .

Accepting states? A string x should be accepted
if it is in L1 or L2, thus a state (p, q)
should be accepting if

p ∈ A1 or q ∈ A2 .

Transition function? If the FA is in (p, q) and
receives input a, it should move to

(δ1(p, a), δ2(q, a)) ,

since
δ1(p, a) and δ2(q, a)
are the states to which M1 and M2 would
move.
115
Theorem

Suppose that

M1 = (Q1, Σ, q1, A1, δ1)

and
M2 = (Q2, Σ, q2, A2, δ2)
accept languages L1 and L2, respectively.

Let M = (Q, Σ, q0, A, δ), where

Q = Q1 × Q2 ,

q0 = (q1, q2) ,

A = {(p, q) | p ∈ A1 or q ∈ A2} ,

and, for p ∈ Q1, q ∈ Q2, and a ∈ Σ,

δ((p, q), a) = (δ1(p, a), δ2(q, a)) .

Then M accepts the language L1 ∪ L2.
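
A minimal sketch of this product construction follows. It represents an FA as a tuple (states, alphabet, start, accepting, delta), with delta as a dictionary. The two small machines (even number of 0s, ends with 1) and all names are hypothetical and chosen only for illustration. The `accept` parameter decides how the two acceptance conditions are combined, so passing "or" gives the union.

```python
from itertools import product


def product_fa(m1, m2, accept):
    """Build the product FA; accept(in_a1, in_a2) decides the accepting pairs."""
    q1s, sigma, s1, a1, d1 = m1
    q2s, _, s2, a2, d2 = m2
    states = set(product(q1s, q2s))
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for (p, q) in states for a in sigma}
    accepting = {(p, q) for (p, q) in states if accept(p in a1, q in a2)}
    return states, sigma, (s1, s2), accepting, delta


def accepts(m, s):
    """Run the FA m on the string s."""
    _, _, state, accepting, delta = m
    for a in s:
        state = delta[(state, a)]
    return state in accepting


# Hypothetical M1: even number of 0s. Hypothetical M2: ends with 1.
M1 = ({"e", "o"}, {"0", "1"}, "e", {"e"},
      {("e", "0"): "o", ("e", "1"): "e", ("o", "0"): "e", ("o", "1"): "o"})
M2 = ({"n", "y"}, {"0", "1"}, "n", {"y"},
      {("n", "0"): "n", ("n", "1"): "y", ("y", "0"): "n", ("y", "1"): "y"})

M_UNION = product_fa(M1, M2, lambda in1, in2: in1 or in2)
```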


116
Example

Let L1 and L2 be languages over {0, 1}, where


L1 = {x | 00 is not a substring of x}

L2 = {x | x ends with 01} .

Corresponding FAs:

117
L1 ∪ L2 is recognized by
Mu = (Qu, Σ, qu, Au, δu), where

Qu = {A, B, C} × {P, Q, R}
= {(A, P ) , (A, Q) , (A, R)} ∪
{(B, P ) , (B, Q) , (B, R)} ∪
{(C, P ) , (C, Q) , (C, R)}


qu = (A, P )

Au = {(A, P ) , (A, Q) , (A, R)} ∪


{(B, P ) , (B, Q) , (B, R)} ∪
{(C, R)}

118

δu :

q         δu(q, 0)   δu(q, 1)
(A, P)    (B, Q)     (A, P)
(A, Q)    (B, Q)     (A, R)
(A, R)    (B, Q)     (A, P)
(B, P)    (C, Q)     (A, P)
(B, Q)    (C, Q)     (A, R)
(B, R)    (C, Q)     (A, P)
(C, P)    (C, Q)     (C, P)
(C, Q)    (C, Q)     (C, R)
(C, R)    (C, Q)     (C, P)

States (A, Q), (B, P) and (B, R) cannot be
reached, therefore we can leave them out.
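
The unreachable states can be found mechanically with a breadth-first search from the initial state. In the sketch below, the component transition functions D1 and D2 are read off the δu table, since the diagrams of M1 and M2 are not reproduced here. Treat them as an inferred assumption.

```python
from collections import deque

# Component transitions inferred from the table for δu.
D1 = {("A", "0"): "B", ("A", "1"): "A",
      ("B", "0"): "C", ("B", "1"): "A",
      ("C", "0"): "C", ("C", "1"): "C"}
D2 = {("P", "0"): "Q", ("P", "1"): "P",
      ("Q", "0"): "Q", ("Q", "1"): "R",
      ("R", "0"): "Q", ("R", "1"): "P"}


def reachable(start):
    """Breadth-first search over the product transition function."""
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        for a in "01":
            nxt = (D1[(p, a)], D2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


all_states = {(p, q) for p in "ABC" for q in "PQR"}
unreachable = all_states - reachable(("A", "P"))
```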

119
120
Consider intersection:

A string x should be accepted if it is in L1 and
L2, thus a state (p, q) should be accepting if

p ∈ A1 and q ∈ A2 .

Therefore, in our previous theorem, we only
need to define

A = {(p, q) | p ∈ A1 and q ∈ A2}

to have M accepting the language L1 ∩ L2.

Example (cont.):

Ai = {(A, R) , (B, R)}

121
122
Consider the difference L1 \ L2:

A string x should be accepted if it is in L1,
but not in L2, thus a state (p, q) should be
accepting if

p ∈ A1 and q ∉ A2 .

Therefore, in our previous theorem, we only
need to define

A = {(p, q) | p ∈ A1 and q ∉ A2}

to have M accepting the language L1 \ L2.

Example (cont.):

Ac = {(A, P ) , (B, P ) , (A, Q) , (B, Q)}
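
The three accepting sets differ only in how membership in A1 and A2 is combined. The sketch below recomputes them for the running example. The sets A1 = {A, B} and A2 = {R} are inferred from Au, so treat them as an assumption.

```python
Q1, A1 = {"A", "B", "C"}, {"A", "B"}   # A1 inferred from Au
Q2, A2 = {"P", "Q", "R"}, {"R"}        # A2 inferred from Au

pairs = {(p, q) for p in Q1 for q in Q2}

# Union: p in A1 or q in A2.
union_accepting = {(p, q) for (p, q) in pairs if p in A1 or q in A2}
# Intersection: p in A1 and q in A2.
intersection_accepting = {(p, q) for (p, q) in pairs if p in A1 and q in A2}
# Difference L1 \ L2: p in A1 and q not in A2.
difference_accepting = {(p, q) for (p, q) in pairs if p in A1 and q not in A2}
```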

123
124
Special case: L1 = Σ∗, so that L1 \ L2 is the
complement of L2.

Two methods:

• Use

M1 = ({q1} , Σ, q1, {q1} , δ1)

where
∀a ∈ Σ, δ1 (q1, a) = q1
and proceed as above.

• Notice that a string x should be accepted
if it is not in L2. Therefore the complement
L2′ of L2 will be accepted by

M2′ = (Q2, Σ, q2, Q2 \ A2, δ2)
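
The second method amounts to swapping accepting and non-accepting states. Here is a minimal sketch, using the machine for "ends with 01" with transitions inferred from the earlier δu table (an assumption, since the diagram is not reproduced). It relies on δ2 being defined for every state and symbol, as the FA definition used here requires.

```python
# M2 for "ends with 01"; transitions inferred from the δu table.
M2 = (
    {"P", "Q", "R"},
    {"0", "1"},
    "P",
    {"R"},
    {("P", "0"): "Q", ("P", "1"): "P",
     ("Q", "0"): "Q", ("Q", "1"): "R",
     ("R", "0"): "Q", ("R", "1"): "P"},
)


def complement(m):
    """Same machine, but with accepting set Q \\ A."""
    states, sigma, start, accepting, delta = m
    return states, sigma, start, states - accepting, delta


def accepts(m, s):
    """Run the FA m on the string s."""
    _, _, state, accepting, delta = m
    for a in s:
        state = delta[(state, a)]
    return state in accepting


M2_COMPLEMENT = complement(M2)
```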

125
