
FORMAL MODELS OF COMPUTATION [CSC 427]

LECTURE MATERIAL

INTRODUCTION TO AUTOMATA THEORY


Automata theory is the study of abstract computing devices, or ‘machines’.
Before there were computers, in the 1930s, Alan Turing studied an abstract
machine (the Turing machine) that had all the capabilities of today’s computers
(concerning what they could compute). His goal was to describe precisely the
boundary between what a computing machine could do and what it could not do.
Simpler kinds of machines (finite automata) were studied by a number of
researchers and are useful for a variety of purposes. These theoretical
developments bear directly on what computer scientists do today.

Why Study Automata Theory?


Finite automata are a useful model for many important kinds of software and
hardware:
1. Software for designing and checking the behaviour of digital circuits
2. The lexical analyser of a typical compiler, that is, the compiler component
that breaks the input text into logical units
3. Software for scanning large bodies of text, such as collections of Web pages,
to find occurrences of words, phrases or other patterns
4. Software for verifying systems of all types that have a finite number of
distinct states, such as communications protocols or protocols for the secure
exchange of information.

Central Concepts of Automata Theory


Formal Languages
A language is a collection of sentences of finite length, all constructed from a
finite alphabet of symbols. It can also be seen as a system suitable for the
expression of certain ideas, facts and concepts. From our natural (human)
language, one may broadly see that a language is a collection of sentences; a
sentence is a sequence of words; and a word is a combination of syllables. If
one considers a language that has a script, then it can be observed that a word is
a sequence of symbols of its underlying alphabet. To formally learn a language,
you must follow these three steps:
1. Learning its alphabet - the symbols that are used in the language.
2. Its words - as various sequences of symbols of its alphabet.
3. Formation of sentences - sequence of various words that follow certain
rules of the language.
For example, the English sentence
"The English articles - a, an and the – are categorized into two types: indefinite
and definite." may be treated as a sequence of symbols from the Roman
alphabet along with enough punctuation marks such as the comma, full-stop and
colon, and one more special symbol, namely the blank-space, which is used to
separate two words. Thus, abstractly, a sentence or a word may be used
interchangeably for a sequence of symbols from an alphabet.

Alphabet
A finite, non-empty set of symbols. We normally use the symbols a, b, c, . . .
with or without subscripts or 0, 1, 2, . . ., etc. for the elements of an alphabet.
A set of alphabets is represented by the symbol
Examples:
The binary alphabet: = {0, 1}
The set of all lower-case letters: = {a, b, . . . , z}
Example: Σ = {a, b, c, d} is an alphabet set where ‘a’, ‘b’, ‘c’, and ‘d’ are
symbols.

Strings
A string is a finite sequence, possibly empty, of symbols drawn from some
alphabet Σ. Given any alphabet Σ, the shortest string that can be formed from Σ
is the empty string, which we will write as ε. The set of all possible strings
over an alphabet Σ is written as Σ*. A string over an alphabet Σ is a finite
sequence of symbols of Σ. Although one writes a sequence as (a1, a2, . . . , an),
in the present context, we prefer to write it as a1a2 · · · an, i.e. by juxtaposing
the symbols in that order.

Thus, a string is also known as a word or a sentence. Normally, we use lower
case letters towards the end of the English alphabet, namely z, y, x, w, etc., to
denote strings.
Example 1: ‘cabcad’ is a valid string on the alphabet set Σ = {a, b, c, d}

Example 2: Let Σ = {a, b} be an alphabet; then aa, ab, bba, baaba, . . .


are some examples of strings over Σ. Since the empty sequence is a finite
sequence, it is also a string; it is ( ) in the earlier notation, but presently it is
denoted as ε.
The set of all strings over an alphabet Σ is denoted by Σ∗. For example,
if Σ = {0, 1}, then
Σ∗ = {ε, 0, 1, 00, 01, 10, 11, 000, 001, . . .}.
Although the set Σ∗ is infinite, it is a countable set. In fact, Σ∗ is countably
infinite for any alphabet Σ.
Example: 01101 and 111 are strings from the binary alphabet Σ = {0, 1}

Operations on Strings
Empty String: An empty string is a string with zero occurrences of symbols.
This string is denoted by ϵ and may be chosen from any alphabet whatsoever.

Length of a String: This is the number of positions for symbols in the string
Example: 01101 has length 5
Note that there are only two symbols (0 and 1) in the string 01101, but 5
positions for symbols. Length of string w is denoted by |w|.
Example: |011| = 3 and |ϵ| = 0

Concatenation
One of the most fundamental operations used for string manipulation is
concatenation. Let x = a1a2 · · · an and y = b1b2 · · · bm be two strings. The
concatenation of the pair x, y denoted by xy is the string a1a2 · · · anb1b2 · · ·
bm. Clearly, the binary operation concatenation on Σ∗ is associative, i.e., for all
x, y, z ∈ Σ∗, x(yz) = (xy)z. Thus, x(yz) may simply be written as xyz. Also,
since ε is the empty string, it satisfies the property εx = xε = x for any string x ∈
Σ∗. Hence, Σ∗ is a monoid with respect to concatenation.

The operation concatenation is not commutative on Σ∗.


For a string x and an integer n ≥ 0, we write x^(n+1) = x^n x with the base
condition x^0 = ε. That is, x^n is obtained by concatenating n copies of x. Also,
whenever n = 0, the string x1 · · · xn represents the empty string ε. Let x be a
string over an alphabet Σ. For a ∈ Σ, the number of occurrences of a in x shall
be denoted by |x|a. The length of a string x, denoted by |x|, is defined as

|x| = ∑_{a ∈ Σ} |x|a

Essentially, the length of a string is obtained by counting the number of
symbols in the string. For example, |aab| = 3, |a| = 1. Note that |ε| = 0.
If we denote An to be the set of all strings of length n over Σ, then one
can easily ascertain that

Σ* = ⋃_{n ≥ 0} An

and hence, each An being a finite set, Σ* is a countably infinite set.
We say that x is a substring of y if x occurs in y, that is y = uxv for
some strings u and v. The substring x is said to be a prefix of y if u = ε.
Similarly, x is a suffix of y if v = ε.
Generalizing the notation used for number of occurrences of symbol a in a
string x, we adopt the notation |y|x as the number of occurrences of a string
x in y.

Powers of an Alphabet
If is an alphabet, we can express the set of all strings of a certain length from
that alphabet by using the exponential notation: the set of strings of length k,
k
each of whose is in
Examples:
0
: { ϵ }, regardless of what alphabet k
is. That is ϵ is the only string of
length 0
k
If = {0, 1}, then:
1
1. = {0, 1}
2
2. = {00, 01, 10, 11}
3
3. = {000, 001, 010, 011, 100, 101, 110, 111}
k 1
Note: confusion between and :
k
1. is an alphabet; its members 0 and 1 are symbols
1
2. is a set of strings; its members are strings (each one of length 1)
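As a small illustration (not part of the original notes), Σ^k can be enumerated
mechanically; the following Python sketch uses itertools.product, and the
function name power is our own choice.

```python
# A sketch: enumerating the set Σ^k of all length-k strings over an alphabet.
from itertools import product

def power(sigma, k):
    """Return Σ^k as a set of strings."""
    return {''.join(p) for p in product(sigma, repeat=k)}

print(sorted(power({'0', '1'}, 2)))  # ['00', '01', '10', '11']
print(power({'0', '1'}, 0))          # {''}: only the empty string ϵ
```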

Kleene Star
The Kleene star, Σ*, is a unary operator on a set of symbols or strings, Σ, that
gives the infinite set of all possible strings of all possible lengths over Σ,
including λ.
i.e. Σ* = Σ^0 ∪ Σ^1 ∪ Σ^2 ∪ . . ., where Σ^p is the set of all possible strings
of length p.
Example: If Σ = {a, b}, Σ* = {λ, a, b, aa, ab, ba, bb, . . .}
Σ* is thus the set of all strings over an alphabet, e.g. {0, 1}* = { ϵ, 0, 1, 00, 01,
10, 11, 000, . . .}
The symbol ∗ is called the Kleene star and is named after the mathematician and
logician Stephen Cole Kleene.

Kleene Closure / Plus

The set Σ+ is the infinite set of all possible strings of all possible lengths
over Σ excluding λ.
i.e. Σ+ = Σ^1 ∪ Σ^2 ∪ Σ^3 ∪ . . .
Σ+ = Σ* − { λ }
Example: If Σ = { a, b }, Σ+ = { a, b, aa, ab, ba, bb, . . . }
Thus: Σ* = Σ+ ∪ { ϵ }
Examples:
1. x = 01101 and y = 110
Then xy = 01101110 and yx = 11001101
2. For any string w, the equations ϵw = wϵ = w hold.
That is, ϵ is the identity for concatenation (when concatenated with any string it
yields the other string as a result).
If S and T are subsets of Σ*, then
S.T = {s.t | s ∈ S, t ∈ T}

Functions on Strings
The length of a string s, which we will write as |s|, is the number of symbols in
s. For example:
|| = 0
|10011001| = 8
For any symbol c and string s, we define the function #c(s) to be the number of
times that the symbol c occurs in s. E.g. #a(abbaaa) = 4.

The concatenation of two strings s and t, written s || t or simply st, is the string
formed by appending t to s. For example, if x = good and y = bye, then
xy = goodbye. So |xy| = |x| + |y|.
The empty string, , is the identity for concatenation of strings.
So x (x  =  x = x).
Concatenation, as a function defined on strings, is associative.
So s, t, w ((st)w = s(tw)).

String Replication
For each string w and each natural number i, the string w^i is defined as:
w^0 = ε
w^(i+1) = w^i w
For example:
a^3 = aaa
(bye)^2 = byebye
a^0 b^3 = bbb
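As a minimal sketch (ours, not the notes'), the inductive definition above
translates directly into a recursive function:

```python
# w^0 = ε and w^(i+1) = w^i w, written recursively.
def replicate(w: str, i: int) -> str:
    return '' if i == 0 else replicate(w, i - 1) + w

print(replicate('bye', 2))                    # 'byebye'
print(replicate('a', 0) + replicate('b', 3))  # 'bbb', i.e. a^0 b^3
```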

String Reversal
For each string w, the reverse of w, which we will write wR, is defined as:
if |w| = 0 then wR = w = 

7
if |w|  1 then a   (u  * (w = ua)). (i.e., the last character of w is a.)
Then wR = a u R.

Theorem Concatenation and Reverse of Strings


Theorem: If w and x are strings, then (w x)R = xR wR.
For example, (nametag)R = (tag)R (name)R = gateman.
Proof: The proof is by induction on |x|:
Base case: |x| = 0. Then x = ε, and (wx)R = (w ε)R = wR = ε wR = εR wR = xR wR.
Induction step. Prove: ∀n ≥ 0 (((|x| = n) → ((w x)R = xR wR)) →
((|x| = n + 1) → ((w x)R = xR wR))).
Consider any string x, where |x| = n + 1. Then x = u a for some character a and
|u| = n. So:
(w x)R = (w (u a))R rewrite x as ua
= ((w u) a)R associativity of concatenation
= a (w u)R definition of reversal
= a (uR wR) induction hypothesis
= (a uR) wR associativity of concatenation
= (ua)R wR definition of reversal
= xR wR rewrite ua as x
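The theorem is easy to sanity-check in code; the sketch below (not from the
source) uses Python's slice notation s[::-1] for string reversal.

```python
# Checking (w x)^R = x^R w^R on the example from the text.
def rev(s: str) -> str:
    return s[::-1]          # reverse of a string

w, x = 'name', 'tag'
assert rev(w + x) == rev(x) + rev(w)
print(rev(w + x))           # gateman
```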

Relations on Strings
A string s is a substring of a string t iff s occurs contiguously as part of t.
For example:
aaa is a substring of aaabbbaaa
aaaaaa is not a substring of aaabbbaaa

8
A string s is a proper substring of a string t iff s is a substring of t and s ≠ t.
Every string is a substring (although not a proper substring) of itself. The empty
string, ε, is a substring of every string.

A string s is a prefix of t iff x  * (t = sx). A string s is a proper prefix of a


string t iff s is a prefix of t and s  t.
Every string is a prefix (although not a proper prefix) of itself. The empty string,
, is a prefix of every string.
For example, the prefixes of abba are: , a, ab, abb, abba.

A string s is a suffix of t iff x  * (t = xs). A string s is a proper suffix of a


string t iff s is a suffix of t and s  t. Every string is a suffix (although not a
proper suffix) of itself. The empty string, , is a suffix of every string. For
example, the suffixes of abba are: , a, ba, bba, abba.
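Both relations are easy to compute; here is a brief Python sketch (our own)
that lists all prefixes and suffixes of a string, matching the abba examples
above.

```python
# All prefixes and suffixes of a string, including ε and the string itself.
def prefixes(t: str):
    return [t[:i] for i in range(len(t) + 1)]

def suffixes(t: str):
    return [t[i:] for i in range(len(t) + 1)]

print(prefixes('abba'))  # ['', 'a', 'ab', 'abb', 'abba']
print(suffixes('abba'))  # ['abba', 'bba', 'ba', 'a', '']
```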

Languages [Definitions]
(1) A language is a (finite or infinite) set of strings over a finite alphabet Σ.
(2) A language is a subset of Σ* for some alphabet Σ. It can be finite or
infinite.
(3) In order to define the notion of a language in a broad spectrum, it is felt that
it can be any collection of strings over an alphabet. Thus we define a language
over an alphabet Σ as a subset of Σ∗.
(4) If Σ is an alphabet, and L ⊆ Σ*, then L is a (formal) language over Σ.

When we are talking about more than one language, we will use the notation ΣL
to mean the alphabet from which the strings in the language L are formed.
“Given some string s and some language L, is s in L?”

Examples:
(1) Let Σ = {a, b}. Σ* = {ε, a, b, aa, ab, ba, bb, aaa, aab, …}.
Some examples of languages over Σ are:
∅, {ε}, {a, b}, {ε, a, aa, aaa, aaaa, aaaaa}, {ε, a, aa, aaa, aaaa, aaaaa, …}
(2) If the language takes all possible strings of length 2 over Σ = {a, b},
then L = {aa, ab, ba, bb}
(3) Let L = {} = ∅. L is the language that contains no strings.
(4) The empty language is different from the empty string. Let L = {ε}, the
language that contains a single string, ε. Note that L is different from ∅.
(5) The empty set ∅ is a language over any alphabet. Similarly, {ε} is also
a language over any alphabet.
(6) The set of all strings over {0, 1} that start with 0.
(7) The set of all strings over {a, b, c} having ac as a substring. Note that
∅ ≠ {ε}, because the language ∅ does not contain any string but {ε} contains a
string, namely ε. Also it is evident that |∅| = 0, whereas |{ε}| = 1.
Since languages are sets, we can apply various well-known set operations.
NOTE: ∅ ≠ { ϵ } since ∅ has no strings and { ϵ } has one.

A language over Σ need not include strings with all symbols of Σ.
Thus, a language over Σ is also a language over any alphabet that is a superset
of Σ.
Other language examples
(8) The language of all strings consisting of n 0s followed by n 1s (n ≥ 0):
{ ϵ, 01, 0011, 000111, . . .}
(9) The set of strings of 0s and 1s with an equal number of each:
{ ϵ, 01, 10, 0011, 0101, 1001, . . .}
(10) Σ* is a language for any alphabet Σ.

(11) {w | w consists of an equal number of 0s and 1s}
(12) {0n1n | n ≥ 1}
(13) {0i1j | 0 ≤ i ≤ j}

Important operators on languages:


Union:- The union of two languages L and M, denoted L∪M, is the set of
strings that are in either L, or M, or both.
Example
If L = {001, 10, 111} and M = { ϵ, 001} then L∪M = { ϵ, 001, 10, 111}

Concatenation:- The concatenation of languages L and M, denoted L.M or


just LM, is the set of strings that can be formed by taking any string in L and
concatenating it with any string in M.
Example
If L = {001, 10, 111} and M = { ϵ, 001} then
L.M = {001, 10, 111, 001001, 10001, 111001}

Closure:- The closure of a language L is denoted L*and represents the set of


those strings that can be formed by taking any number of strings from L,
possibly with repetitions (i.e., the same string may be selected more than once)
and concatenating all of them.
Examples:
If L = {0, 1} then L* is all strings of 0s and 1s.
If L = {0, 11} then L* consists of strings of 0s and 1s such that the 1s come in
pairs, e.g., 011, 11110 and ϵ, but not 01011 or 101.
Formally, L* is the infinite union ⋃_{i ≥ 0} L^i, where L^0 = { ϵ }, L^1 = L, and
for i > 1 we have L^i = LL…L (the concatenation of i copies of L).
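These three operators can be mirrored on finite languages represented as
Python sets; in this sketch (ours, not the notes'), the star is truncated at a
length bound, since L* is in general infinite. The sample values match the
union and concatenation examples above.

```python
def union(L, M):
    return L | M                          # set union

def concat(L, M):
    return {x + y for x in L for y in M}  # all pairwise concatenations

def star(L, max_len):
    """All strings of L*, truncated to length <= max_len (L* is infinite)."""
    result, frontier = {''}, {''}
    while frontier:
        frontier = {x + y for x in frontier for y in L
                    if len(x + y) <= max_len} - result
        result |= frontier
    return result

L, M = {'001', '10', '111'}, {'', '001'}
print(sorted(union(L, M)))   # ['', '001', '10', '111']
print(sorted(concat(L, M)))  # ['001', '001001', '10', '10001', '111', '111001']
print(sorted(star({'0', '11'}, 4)))  # 1s occur only in pairs, e.g. '0110'
```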

Intersection: Suppose L1 and L2 are languages over some common alphabet,
the intersection L1 ∩ L2 of L1 and L2 consists of all strings which are
contained in both languages

The complement ¬L of a language: The complement ¬L of a language with


respect to a given alphabet consists of all strings over the alphabet that are not
in the language.
• The Kleene star: the language consisting of all words that are concatenations
of 0 or more words in the original language;

Such string operations are used to investigate closure properties of classes of


languages. A class of languages is closed under a particular operation when the
operation, applied to languages in the class, always produces a language in the
same class again. For instance, the context-free languages are known to be
closed under union, concatenation, and intersection with regular languages, but
not closed under intersection or complement.

The empty set Ø and the set {ε} are languages over every alphabet. Ø is a
language that contains no string. {ε} is a language that contains just the empty
string.
The union of two languages L1 and L2, denoted L1 ∪ L2, refers to the language
that consists of all the strings that are either in L1 or in L2, that is, to { x | x is in
L1 or x is in L2 }.

The intersection of L1 and L2, denoted L1 ∩ L2, refers to the language that
consists of all the strings that are both in L1 and L2, that is, to { x | x is in L1
and in L2 }.
The complementation of a language L over Σ, or just the complementation of L
when Σ is understood, denoted L̄, refers to the language that consists of all the
strings over Σ that are not in L, that is, to { x | x is in Σ* but not in L }.

Example: Consider the languages L1 = {ε, 0, 1} and L2 = {ε, 01, 11}. The union
of these languages is L1 ∪ L2 = {ε, 0, 1, 01, 11}, their intersection is
L1 ∩ L2 = {ε}, and the complementation of L1 is
L̄1 = {00, 01, 10, 11, 000, 001, . . . }.

The difference of L1 and L2, denoted L1 - L2, refers to the language that
consists of all the strings that are in L1 but not in L2, that is, to { x | x is in L1
but not in L2 }.

The cross product of L1 and L2, denoted L1 × L2, refers to the set of all the
pairs (x, y) of strings such that x is in L1 and y is in L2, that is, to the relation
{ (x, y) | x is in L1 and y is in L2 }.

The composition of L1 with L2, denoted L1L2, refers to the language { xy | x


is in L1 and y is in L2 }.

Example: If L1 = {ε, 1, 01, 11} and L2 = {1, 01, 101} then L1 - L2 = {ε, 11} and
L2 - L1 = {101}.
On the other hand, if L1 = {ε, 0, 1} and L2 = {01, 11}, then the cross product of
these languages is L1 × L2 = {(ε, 01), (ε, 11), (0, 01), (0, 11), (1, 01), (1, 11)},
and their composition is L1L2 = {01, 11, 001, 011, 101, 111}.
L - Ø = L, Ø - L = Ø, ØL = Ø, and {ε}L = L for each language L.
L^i will also be used to denote the composing of i copies of a language L, where
L^0 is defined as {ε}. The set L^0 ∪ L^1 ∪ L^2 ∪ L^3 ∪ . . . , called the Kleene
closure or just the closure of L, will be denoted by L*. The set
L^1 ∪ L^2 ∪ L^3 ∪ . . . , called the positive closure of L, will be denoted by L+.
L^i consists of those strings that can be obtained by concatenating i strings from
L. L* consists of those strings that can be obtained by concatenating an arbitrary
number of strings from L.
Example: Consider the pair of languages L1 = {ε, 0, 1} and L2 = {01, 11}. For
these languages L1^2 = {ε, 0, 1, 00, 01, 10, 11}, and L2^3 = {010101, 010111,
011101, 011111, 110101, 110111, 111101, 111111}. In addition, ε is in L1*, in
L1+, and in L2* but not in L2+.
The operations above apply in a similar way to relations in Σ1* × Σ2*, when Σ1
and Σ2 are alphabets.

Specifically, the union of the relations R1 and R2, denoted R1 ∪ R2, is the
relation { (x, y) | (x, y) is in R1 or in R2 }. The intersection of R1 and R2,
denoted R1 ∩ R2, is the relation { (x, y) | (x, y) is in R1 and in R2 }. The
composition of R1 with R2, denoted R1R2, is the relation { (x1x2, y1y2) | (x1,
y1) is in R1 and (x2, y2) is in R2 }.

Prefix Relation
We define the following languages in terms of the prefix relation on strings:
L1 = {w ∈ {a, b}*: no prefix of w contains b}
= {ε, a, aa, aaa, aaaa, aaaaa, aaaaaa, …}.
L2 = {w ∈ {a, b}*: no prefix of w starts with b}
= {w ∈ {a, b}*: the first character of w is a} ∪ {ε}.
L3 = {w ∈ {a, b}*: every prefix of w starts with b} = ∅.

L3 is equal to ∅ because ε is a prefix of every string. Since ε does not start with
b, no strings meet L3’s requirement.
Recall that we defined the replication operator on strings: For any string s and
integer n, s^n = n copies of s concatenated together. For example,
(bye)^2 = byebye. We can use replication as a way to define a language, rather
than a single string, if we allow n to be a variable, rather than a specific
constant.

Using Replication to Define a Language


Let L = {a^n : n ≥ 0}. L = {ε, a, aa, aaa, aaaa, aaaaa, …}.
Languages are sets. So, if we want to provide a computational definition of a
language, we could specify either:
• a language generator, which enumerates (lists) the elements of the language, or
• a language recognizer, which decides whether or not a candidate string is in
the language and returns True if it is and False if it isn't.

For example, the logical definition, L = {x : ∃y ∈ {a, b}* (x = ya)}, can be


turned into either a language generator (enumerator) or a language recognizer.
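For instance, here is a hypothetical Python sketch (not from the source) of
both views for that definition: a recognizer that tests membership, and a
generator that enumerates elements in order of length.

```python
# L = {x : ∃y ∈ {a,b}* (x = ya)}, i.e. strings over {a, b} ending in a.
from itertools import product

def recognize(x: str) -> bool:
    return set(x) <= {'a', 'b'} and x.endswith('a')

def generate(max_len: int):
    for n in range(1, max_len + 1):
        for p in product('ab', repeat=n - 1):
            yield ''.join(p) + 'a'   # every generated string ends in a

print(recognize('ba'))       # True
print(list(generate(2)))     # ['a', 'aa', 'ba']
```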
In some cases, when considering an enumerator for a language L, we may care
about the order in which the elements of L are generated. If there exists a total
order D of the elements of L (as there does, for example, on the letters of
the Roman alphabet or the symbols for the digits 0 – 9), then we can use D to
define on L a useful total order called lexicographic order (written <L):
• Shorter strings precede longer ones: ∀x (∀y ((|x| < |y|) → (x <L y))), and
• Of strings that are the same length, sort them in dictionary order using D.

We will say that a program lexicographically enumerates the elements of L iff


it enumerates them in lexicographic order.

Lexicographic Enumeration
Let L = {x ∈ {a, b}* : all a's precede all b's}. The lexicographic enumeration of
L is:
ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, aaaa, aaab, aabb, abbb, bbbb, aaaaa, …
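A lexicographic (shortlex) enumerator for this L can be sketched as follows
(our code, not the notes'): lengths are generated in increasing order, and within
each length the tuples from itertools.product already come out in dictionary
order.

```python
from itertools import product

def lex_enumerate(max_len: int):
    for n in range(max_len + 1):              # shorter strings first
        for p in product('ab', repeat=n):     # dictionary order within a length
            w = ''.join(p)
            if 'ba' not in w:                 # all a's precede all b's
                yield w

print(list(lex_enumerate(3)))
# ['', 'a', 'b', 'aa', 'ab', 'bb', 'aaa', 'aab', 'abb', 'bbb']
```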

L* always contains an infinite number of strings as long as L is not equal to
either ∅ or {ε} (i.e., as long as there is at least one nonempty string any number
of which can be concatenated together). If L = ∅, then L* = {ε}, since there
are no strings that could be concatenated to ε to make it longer. If L = {ε}, then
L* is also {ε}. It is sometimes useful to require that at least one element of L be
selected. So we define: L+ = L L*.
Another way to describe L+ is that it is the closure of L under concatenation.
Note that L+ = L* - {ε} iff ε ∉ L.

Concatenation and Reverse of Languages


Let L be a language defined over some alphabet Σ. Then the reverse of L,
written LR, is: LR = {w ∈ Σ* : w = xR for some x ∈ L}.
In other words, LR is the set of strings that can be formed by taking some string
in L and reversing it. Since we have defined the reverse of a language in terms
of the definition of reverse applied to strings, we expect it to have analogous
properties.

Theorem: If L1 and L2 are languages, then (L1 L2)R = L2R L1R.


Proof: If x and y are strings, then ∀x (∀y ((xy)R = yRxR)).
(L1L2)R = {(xy)R : x ∈ L1 and y ∈ L2} Definition of concatenation of
languages
= {yRxR : x ∈ L1 and y ∈ L2} Lines 1 and 2
= L2R L1R Definition of concatenation of languages

Examples
1. If L1 = {0, 1, 01} and L2 = {1, 00}, then
L1L2 = {01, 11, 011, 000, 100, 0100}.
2. For L1 = {b, ba, bab} and L2 = {ε, b, bb, abb}, we have

L1L2 = {b, ba, bb, bab, bbb, babb, baabb, babbb, bababb}.
Note:
1. Since concatenation of strings is associative, so is the concatenation of
languages. That is, for all languages L1, L2 and L3, (L1L2)L3 = L1(L2L3).
Hence, (L1L2)L3 may simply be written as L1L2L3.
2. The number of strings in L1L2 is always less than or equal to the
product of individual numbers, i.e. |L1L2| ≤ |L1||L2|.
3. L1 ⊆ L1L2 if and only if ε ∈ L2.

Proof. The “if part” is straightforward; for instance, if ε ∈ L2, then for any x ∈
L1, we have x = xε ∈ L1L2. On the other hand, suppose ε ∉ L2. Now, note
that a string x ∈ L1 of shortest length in L1 cannot be in L1L2. This is because,
if x = yz for some y ∈ L1 and a nonempty string z ∈ L2, then |y| < |x|, a
contradiction to our assumption that x is of shortest length in L1.
Hence L1 ⊈ L1L2.
We write Ln to denote the language which is obtained by concatenating
n copies of L. More formally,
L0 = {ε} and
Ln = Ln−1L, for n ≥ 1.

In the context of formal languages, another important operation is Kleene


star. Kleene star or Kleene closure of a language L, denoted by L∗.
Example
1. Kleene star of the language {01} is
{ε, 01, 0101, 010101, . . .} = {(01)n | n ≥ 0}.
2. If L = {0, 10}, then L∗ = {ε, 0, 10, 00, 010, 100, 1010, 000, . . .}
Since an arbitrary string in L^n is of the form x1x2 · · · xn, with xi ∈ L,
one can easily observe that
L∗ = {x1x2 · · · xn | n ≥ 0 and xi ∈ L, for 1 ≤ i ≤ n}
Thus, a typical string in L∗ is a concatenation of finitely many strings of L.
Remark: Note that the Kleene star of the language L = {0, 1} over
the alphabet Σ = {0, 1} is
L∗ = L0 ∪ L ∪ L2 ∪ · · ·
= {ε} ∪ {0, 1} ∪ {00, 01, 10, 11} ∪ · · ·
= {ε, 0, 1, 00, 01, 10, 11, · · · }
= the set of all strings over Σ.
Thus, the earlier introduced notation Σ∗ is consistent with the notation of
Kleene star by considering Σ as a language over Σ.

Properties of Languages
The properties of languages with respect to the newly introduced operations:
concatenation, Kleene closure, and positive closure are as follows, where L, L1,
L2, L3 and L4 are languages.
1. Recall that concatenation of languages is associative.
2. Since concatenation of strings is not commutative, we have L1L2 ≠ L2L1,
in general.
3. L{ε} = {ε}L = L.
4. L∅ = ∅L = ∅.

Proof. Let x ∈ L∅; then x = x1x2 for some x1 ∈ L and x2 ∈ ∅. But ∅, being the
empty set, contains no element. Hence there cannot be any element x ∈ L∅,
so that L∅ = ∅. Similarly, ∅L = ∅ as well.

5. Distributive Properties:
1. (L1 ∪ L2)L3 = L1L3 ∪ L2L3.
Proof. Suppose x ∈ (L1 ∪ L2)L3
⇒ x = x1x2, for some x1 ∈ L1 ∪ L2, and some x2 ∈ L3
⇒ x = x1x2, for some x1 ∈ L1 or x1 ∈ L2, and x2 ∈ L3
⇒ x = x1x2, for some x1 ∈ L1 and x2 ∈ L3,
or x1 ∈ L2 and x2 ∈ L3
⇒ x ∈ L1L3 or x ∈ L2L3
⇒ x ∈ L1L3 ∪ L2L3.

Conversely, suppose x ∈ L1L3 ∪ L2L3; then x ∈ L1L3 or x ∈ L2L3.
Without loss of generality, assume x ∉ L1L3. Then x ∈ L2L3.
⇒ x = x3x4, for some x3 ∈ L2 and x4 ∈ L3
⇒ x = x3x4, for some x3 ∈ L1 ∪ L2, and some x4 ∈ L3
⇒ x ∈ (L1 ∪ L2)L3.
Hence, (L1 ∪ L2)L3 = L1L3 ∪ L2L3.

2. L1(L2 ∪ L3) = L1L2 ∪ L1L3.


Proof. Similar to the above.

From these properties it is clear that concatenation is distributive
over finite unions. Moreover, we can observe that concatenation is also
distributive over countably infinite unions. That is,
L(⋃_{i ≥ 1} Li) = ⋃_{i ≥ 1} LLi and (⋃_{i ≥ 1} Li)L = ⋃_{i ≥ 1} LiL.

6. If L1 ⊆ L2 and L3 ⊆ L4, then L1L3 ⊆ L2L4.


7. ∅∗ = {ε}.
8. {ε}∗ = {ε}.
9. If ε ∈ L, then L∗ = L+.

10. L∗L = LL∗ = L+.

Proof. Suppose x ∈ L∗L. Then x = yz for some y ∈ L∗ and z ∈ L. But y ∈ L∗


implies y = y1 · · · yn with yi ∈ L for all i. Hence,
x = yz = (y1 · · · yn)z = y1(y2 · · · ynz) ∈ LL∗. Converse is similar. Hence,
L∗L = LL∗. Further, when x ∈ L∗L, as above, we have x = y1 · · · ynz is clearly
in L+. On the other hand, x ∈ L+ implies x = x1 · · · xm with m ≥ 1 and xi ∈ L
for all i. Now write x′ = x1 · · · xm−1 so that x = x′xm.
Here, note that x′ ∈ L*; particularly, when m = 1 then x′ = ε. Thus,
x ∈ L*L. Hence, L+ = L*L.

11. (L∗)∗ = L∗.


12. L∗L∗ = L∗.
13. (L1L2)∗L1 = L1(L2L1)∗.

Proof. Let x ∈ (L1L2)∗L1. Then x = yz, where z ∈ L1 and y = y1 · · · yn ∈


(L1L2)∗ with yi ∈ L1L2. Now each yi = uivi, for ui ∈ L1 and
vi ∈ L2. Note that viui+1 ∈ L2L1, for all i with 1 ≤ i ≤ n − 1. Hence,
x = yz = (y1 · · · yn)z = (u1v1 · · · unvn)z = u1(v1u2 · · · vn−1unvnz) ∈
L1(L2L1)∗. Converse is similar. Hence, (L1L2)∗L1 = L1(L2L1)∗.

14. (L1 ∪ L2)* = (L1*L2*)*.

Proof. Observe that L1 ⊆ L1* and {ε} ⊆ L2*. Hence, by properties 3
and 6, we have L1 = L1{ε} ⊆ L1*L2*. Similarly, L2 ⊆ L1*L2*. Hence,
L1 ∪ L2 ⊆ L1*L2*, and consequently (L1 ∪ L2)* ⊆ (L1*L2*)*. Conversely, since
L1 ⊆ L1 ∪ L2 and L2 ⊆ L1 ∪ L2, we have L1*L2* ⊆ (L1 ∪ L2)*, and hence
(L1*L2*)* ⊆ ((L1 ∪ L2)*)* = (L1 ∪ L2)*. Thus (L1 ∪ L2)* = (L1*L2*)*.
Relation between languages, grammars and automata

TURING MACHINES
A Turing machine (TM) is a device that manipulates symbols on a strip of tape
according to a table of rules. Despite its simplicity, a Turing machine can be
adapted to simulate the logic of any computer algorithm, and is particularly
useful in explaining the functions of a CPU inside a computer. It is a
mathematical model which consists of an infinite length tape divided into cells
on which input is given. The Turing machine was described in 1936 by Alan
Turing, who called it an "a-machine" (automatic machine). The Turing machine
is not intended as practical computing technology, but rather as a hypothetical
device representing a computing machine. Turing machines help computer
scientists understand the limits of mechanical computation.

Constituents of a Turing Machine


A Turing machine consists of:
(i) A Read-Write Head:- This reads the input tape. After reading the symbol on
the tape and overwriting it with another symbol (which can be the same), the
head moves to the next character, either on the left or on the right.
(ii) A finite controller that specifies the behaviour of the machine (for each
state of the automaton and each symbol read from the tape, what symbol to
write on the tape and which direction to move next).
(iii) A halting state: In addition to moving left or right, the machine may also
halt. In this case, the TM is usually said to accept the input. (A TM is thus an
automaton with only one accepting state H). The initial position of the Turing
machine is usually explicitly stated (otherwise, the machine can for instance
start by moving first to the left-most symbol).
(iv) A state register that stores the state of the Turing machine. There is one
special start state with which the state register is initialized. After reading an
input symbol, it is replaced with another symbol, its internal state is changed,
and it moves from one cell to the right or left. If the TM reaches the final state,
the input string is accepted, otherwise rejected.

A TM can be formally described as a 7-tuple (Q, X, Σ, δ, q0, B, F) where:


Q is a finite set of states
X is the tape alphabet
Σ is the input alphabet
δ is a transition function; δ : Q × X → Q × X × {Left_shift, Right_shift}.
q0 is the initial state
B is the blank symbol
F is the set of final states

A TM accepts a string w if it enters a final state when run on input w; the
language accepted by a TM is the set of all such strings. A language is
recursively enumerable (generated by a Type-0 grammar) if it is accepted by a
Turing machine. A TM decides a language if it accepts every string in the
language and enters a rejecting state on every input not in the language. A
language is recursive if it is decided by a Turing machine. There may be some
cases where a TM does not stop; such a TM may accept a language without
deciding it.
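To make these definitions concrete, here is a minimal Python sketch of a
single-tape TM simulator (not part of the original notes). We assume the
convention that the machine halts as soon as no transition is defined for the
current (state, symbol) pair, and accepts iff it halts in a final state; run_tm
and its parameter names are our own.

```python
# delta maps (state, symbol) -> (new_state, symbol_to_write, 'L' or 'R').
def run_tm(delta, q0, F, blank, tape, max_steps=10_000):
    cells = dict(enumerate(tape))      # sparse tape: position -> symbol
    state, head = q0, 0
    for _ in range(max_steps):
        symbol = cells.get(head, blank)
        if (state, symbol) not in delta:
            return state in F          # halt: accept iff in a final state
        state, write, move = delta[(state, symbol)]
        cells[head] = write
        head += 1 if move == 'R' else -1
    raise RuntimeError("no halt within the step bound")
```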

Time and Space Complexity of a Turing Machine
For a Turing machine, the time complexity is the number of moves the
machine makes before halting on an input, and the space complexity is the
number of tape cells it writes.
Time complexity (for all reasonable functions):
T(n) = O(n log n)
The TM's space complexity:
S(n) = O(n)

Designing a Turing Machine


The basic guidelines of designing a Turing machine have been explained below
with the help of a couple of examples.

Example 1
Design a TM to recognize all strings consisting of an odd number of α’s.
Solution
The Turing machine M can be constructed by the following moves:
 Let q1 be the initial state.
 If M is in q1; on scanning α, it enters the state q2 and writes B (blank).
 If M is in q2; on scanning α, it enters the state q1 and writes B (blank).
From the above moves, we can see that M enters the state q1 if it scans an even
number of α’s, and it enters the state q2 if it scans an odd number of α’s. Hence
q2 is the only accepting state.
Hence, M = ({q1, q2}, {α, B}, {α}, δ, q1, B, {q2}),
where δ is given by:
δ(q1, α) = (q2, B, R)
δ(q2, α) = (q1, B, R)
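Assuming the run_tm sketch given earlier, the machine of Example 1 can be
exercised as follows (with α encoded as the symbol 'a', an arbitrary choice of
ours):

```python
delta = {('q1', 'a'): ('q2', 'B', 'R'),   # odd count reached: go to q2
         ('q2', 'a'): ('q1', 'B', 'R')}   # even count reached: back to q1
print(run_tm(delta, 'q1', {'q2'}, 'B', 'aaa'))  # True: odd number of a's
print(run_tm(delta, 'q1', {'q2'}, 'B', 'aa'))   # False: even number of a's
```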
Example 2
Design a Turing Machine that reads a string representing a binary number and
erases all leading 0’s in the string. However, if the string consists only of 0’s,
it keeps one 0.
Solution
Let us assume that the input string is terminated by a blank symbol, B, at each
end of the string.
The Turing Machine, M, can be constructed by the following moves:
 Let q0 be the initial state.
 If M is in q0, on reading 0, it moves right, enters the state q1 and erases 0.
On reading 1, it enters the state q2 and moves right.
 If M is in q1, on reading 0, it moves right and erases 0, i.e., it replaces 0’s
by B’s. On reaching the leftmost 1, it enters q2 and moves right. If it
reaches B, i.e., the string consists only of 0’s, it moves left and enters
the state q3.
 If M is in q2, on reading either 0 or 1, it moves right. On reaching B, it
moves left and enters the state q4. This validates that the string consists
only of 0’s and 1’s.
 If M is in q3, it replaces B by 0, moves left and reaches the final state qf.
 If M is in q4, on reading either 0 or 1, it moves left. On reaching the
beginning of the string, i.e., when it reads B, it reaches the final state qf.
Hence, M = ({q0, q1, q2, q3, q4, qf}, {0, 1, B}, {0, 1}, δ, q0, B, {qf}),
where δ is given by:
δ(q0, 0) = (q1, B, R)    δ(q0, 1) = (q2, 1, R)
δ(q1, 0) = (q1, B, R)    δ(q1, 1) = (q2, 1, R)    δ(q1, B) = (q3, B, L)
δ(q2, 0) = (q2, 0, R)    δ(q2, 1) = (q2, 1, R)    δ(q2, B) = (q4, B, L)
δ(q3, B) = (qf, 0, L)
δ(q4, 0) = (q4, 0, L)    δ(q4, 1) = (q4, 1, L)    δ(q4, B) = (qf, B, R)
Non-Deterministic Turing Machine
A non-deterministic Turing machine can be formally defined as a tuple (Q, X,
Σ, δ, q0, B, F) where:
Q is a finite set of states
X is the tape alphabet
Σ is the input alphabet
δ is a transition function; δ : Q × X → P(Q × X × {Left_shift, Right_shift}).
q0 is the initial state
B is the blank symbol
F is the set of final states

In a Non-Deterministic Turing Machine, for every state and symbol, there are a
group of actions the TM can have. So, here the transitions are not deterministic.
The computation of a non-deterministic Turing Machine is a tree of
configurations that can be reached from the start configuration. An input is
accepted if there is at least one node of the tree which is an accept
configuration, otherwise it is not accepted. If all branches of the computational
tree halt on all inputs, the non-deterministic Turing Machine is called a Decider
and if for some input, all branches are rejected, the input is also rejected.

Semi-Infinite Tape Turing Machine
A Turing Machine with a semi-infinite tape has a left end but no right end. The
left end is limited with an end marker. It is a two-track tape:
1. Upper track: It represents the cells to the right of the initial head position.
2. Lower track: It represents the cells to the left of the initial head position in
reverse order.
The input string is initially written on the tape in contiguous tape
cells. The machine starts from the initial state q0 and the head scans from the
left end marker ‘End’. In each step, it reads the symbol on the tape under its
head. It writes a new symbol on that tape cell and then it moves the head either
left or right by one tape cell. A transition function determines the actions to be
taken. It has two special states called the accept state and the reject state. If at
any point in time it enters the accept state, the input is accepted, and if it
enters the reject state, the input is rejected by the TM. For some inputs it may
also continue to run forever, neither accepting nor rejecting.
Note: Turing machines with semi-infinite tape are equivalent to standard Turing
machines.

Linear Bounded Automata
The language accepted by a deterministic linear bounded automaton is always
context-sensitive, and the emptiness problem for linear bounded automata is
undecidable.
Decidability
A language is called Decidable or Recursive if there is a Turing machine which
accepts and halts on every input string w. Every decidable language is
Turing-acceptable. A decision problem P is decidable if the language L of all yes
instances to P is decidable. For a decidable language, for each input string, the
TM halts either at the accept or the reject state.

Example
Find out whether the following problem is decidable or not: Is a number ‘m’
prime?
Solution
Prime numbers = {2, 3, 5, 7, 11, 13, …}
Divide the number ‘m’ by all the numbers between ‘2’ and ‘√m’, starting from
‘2’. If any of these numbers produces a remainder of zero, the machine goes to
the “Rejected state”; otherwise it goes to the “Accepted state”. So here the
answer is ‘Yes’ or ‘No’. Hence, it is a decidable problem.

For an undecidable language, there is no Turing Machine which accepts the


language and makes a decision for every input string w (TM can make decision
for some input string though). A decision problem P is called “undecidable” if
the language L of all yes instances to P is not decidable. Undecidable languages
are not recursive languages, but sometimes, they may be recursively enumerable
languages.
Examples:
 The halting problem of Turing machine
 The mortality problem
 The mortal matrix problem
 The Post correspondence problem, etc.

UNIVERSAL TURING MACHINE [UTM]


A Turing machine that is able to simulate any other Turing machine is called a
Universal Turing Machine (UTM, or simply a universal machine). A more
mathematically oriented definition with a similar "universal" nature was
introduced by Alonzo Church, whose work on lambda calculus intertwined with
Turing's in a formal theory of computation known as the Church-Turing
thesis. The thesis states that Turing machines indeed capture the informal
notion of effective method in logic and mathematics, and provide a precise
definition of an algorithm or mechanical procedure.

The existence of UTM is one of the most astonishing contributions of Alan


Turing. A UTM is a Turing machine which, when supplied with a tape on which
the code of some Turing machine M is written (together with some input string),
will produce the same output as the machine M. In other words, a UTM is a
machine which can be programmed to simulate any other possible Turing
machine. We now take this remarkable finding for granted. But at the time
(1936), it was so astonishing that it is considered by some to have been the
fundamental theoretical breakthrough that led to modern computers.

Church Turing Thesis


In computability theory, the Church–Turing thesis (also known as the Turing-
Church thesis, the Church-Turing conjecture, Church's thesis, Church's
conjecture, and Turing's thesis) is a combined hypothesis ("thesis") about the
nature of functions whose values are effectively calculable; or, in more modern
terms, functions whose values are algorithmically computable. In simple terms,
the Church-Turing thesis states that a function is algorithmically computable if
and only if it is computable by a Turing machine.

Several attempts were made in the first half of the 20th Century to formalize the
notion of computability:
(i) American mathematician Alonzo Church created a method for defining
functions called the λ-calculus.
(ii) British mathematician Alan Turing created a theoretical model for a
machine, now called a universal Turing machine, that could carry out
calculations from inputs.
(iii) Church, along with mathematician Stephen Kleene and logician J.B. Rosser
created a formal definition of a class of functions whose values could be
calculated by recursion.

All three computational processes (recursion, the λ-calculus, and the Turing
machine) were shown to be equivalent—all three approaches define the same
class of functions. This has led mathematicians and computer scientists to
believe that the concept of computability is accurately characterized by these
three equivalent processes. Informally the Church–Turing thesis states
that if some method (algorithm) exists to carry out a calculation, then the same
calculation can also be carried out by a Turing machine (as well as by a
recursively definable function, and by a λ-function).
The thesis can be stated as follows:
• Every effectively calculable function is a computable function.
Turing stated it this way:
• "It was stated ... that 'a function is effectively calculable if its values can be
found by some purely mechanical process.' We may take this literally,
understanding that by a purely mechanical process one which could be carried
out by a machine. The development ... leads to ... an identification of
computability with effective calculability.

FINITE STATE AUTOMATA


Problem 1
Construct a regular expression corresponding to the automaton given below:

Solution
Here the initial state and the final state is q1.
The equations for the three states q1, q2, and q3 are as follows:
q1 = q1a + q3a + є (the є is added because q1 is the initial state)
q2 = q1b + q2b + q3b

q3 = q2a
Now, we will solve these equations using Arden’s theorem (if an equation has
the form R = Q + RP, where є ∉ L(P), then its unique solution is R = QP*):
q2 = q1b + q2b + q3b
= q1b + q2b + (q2a)b (Substituting value of q3)
= q1b + q2(b + ab)
= q1b (b + ab)* (Applying Arden’s Theorem)
q1 = q1a + q3a + є
= q1a + q2aa + є (Substituting value of q3)
= q1a + q1b(b + ab)*aa + є (Substituting value of q2)
= q1(a + b(b + ab)*aa) + є
= є (a+ b(b + ab)*aa)*
= (a + b(b + ab)*aa)*
Hence, the regular expression is (a + b(b + ab)*aa)*.

Problem 2
Construct a regular expression corresponding to the automaton given below:

Solution:
Here the initial state is q1 and the final state is q2
Now we write down the equations:
q1 = q10 + є

q2 = q11 + q20
q3 = q21 + q30 + q31
Now, we will solve these three equations:
q1 = є0* [As, εR = R]
So, q1 = 0*
q2 = 0*1 + q20
So, q2 = 0*1(0)* [By Arden’s theorem]
Hence, the regular expression is 0*10*.

Pushdown Automaton (PDA)


• In computer science, a pushdown automaton (PDA) is a type of automaton that
uses a stack for temporary data storage. The PDA is used in theories about what
can be computed by machines. The PDA is more capable than finite-state
machines but less capable than Turing machines. Because its input can be
described with a formal grammar, it can be used in parser design. Pushdown
automata differ from finite state machines in two ways:
 They can use the top of the stack to decide which transition to take.
 They can manipulate the stack as part of performing a transition.

Pushdown automata choose a transition by indexing a table by input signal,


current state, and the symbol at the top of the stack. This means that those three
parameters completely determine the transition path that is chosen. Finite state
machines just look at the input signal and the current state: they have no stack to
work with. Pushdown automata add the stack as a parameter for choice.

Pushdown automata can also manipulate the stack, as part of performing a


transition. Finite state machines choose a new state, the result of following the
transition. The manipulation can be to push a particular symbol to the top of the
stack, or to pop off the top of the stack. The automaton can alternatively ignore
the stack, and leave it as it is. The choice of manipulation (or no manipulation)
is determined by the transition table.
Given an input signal, current state, and stack symbol, the automaton can
follow a transition to another state, and optionally manipulate (push or pop) the
stack. In general, pushdown automata may have several computations on a
given input string, some of which may be halting in accepting configurations
while others are not.
There are two classes of PDAs:

Deterministic Pushdown Automaton (DPDA): The machine has only one


possible choice of action for all situations. Their application is limited to
deterministic context-free grammars.

Nondeterministic Pushdown Automaton (NDPDA or NPDA)


The automaton can have two or more possible choices of action for some or all
situations. The choices may or may not be mutually exclusive. When they are
not, the automaton will create branches, each following one of the correct
choices. If more than one of the branches created during the execution of the
automaton complete successfully, multiple outputs will be produced. This kind
of PDAs can handle all context-free grammars.

Non-determinism means that there may be more than just one transition
available to follow, given an input signal, state, and stack symbol. If in every
situation only one transition is available as continuation of the computation,
then the result is a deterministic pushdown automaton (DPDA), a strictly
weaker device. Unlike the situation for finite-state machines, an NPDA cannot,
in general, be converted into an equivalent DPDA.

If we allow a finite automaton access to two stacks instead of just one, we
obtain a more powerful device, equivalent in power to a Turing machine. A
linear bounded automaton is a device which is more powerful than a pushdown
automaton but less so than a Turing machine.

Nondeterministic pushdown automata are equivalent to context-free grammars:


for every context-free grammar, there exists a pushdown automaton such that
the language generated by the grammar is identical with the language generated
by the automaton, which is easy to prove. The reverse is true, though harder to
prove: for every pushdown automaton there exists a context-free grammar such
that the language generated by the automaton is identical with the language
generated by the grammar.

PDA Transitions:
First:
δ(q, a, z) = {(p1,γ1), (p2,γ2),…, (pm,γm)}
– Current state is q
– Current input symbol is a
– Symbol currently on top of the stack is z
– Move to state pi from q
– Replace z with γi on the stack (leftmost symbol on top)
– The input symbol a is consumed
Second:
δ(q, ε, z) = {(p1,γ1), (p2,γ2),…, (pm,γm)}
– Current state is q
– Current input symbol is not considered
– Symbol currently on top of the stack is z
– Move to state pi from q
– Replace z with γi on the stack (leftmost symbol on top)
– No input symbol is read
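As a small illustration (not from the source), the following Python sketch
recognizes the context-free language {0^n 1^n : n ≥ 1} using an explicit stack,
mimicking how a deterministic PDA pushes a marker for every 0 and pops one
for every 1; the function name is our own.

```python
def accepts_0n1n(w: str) -> bool:
    stack = []
    i = 0
    while i < len(w) and w[i] == '0':   # push a marker for every leading 0
        stack.append('X')
        i += 1
    while i < len(w) and w[i] == '1':   # pop one marker per 1
        if not stack:
            return False
        stack.pop()
        i += 1
    return i == len(w) and not stack and len(w) > 0

print(accepts_0n1n('0011'))  # True
print(accepts_0n1n('001'))   # False: unmatched 0 left on the stack
```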

Deterministic Finite Automata


A deterministic automaton (DA) is a tuple A = (Q, Σ, δ, q0, F), where
• Q is a nonempty set of states,
• Σ is an alphabet,
• δ: Q × Σ → Q is a transition function,
• q0 ∈ Q is the initial state, and
• F ⊆ Q is the set of final states.

From an operational point of view, a deterministic automaton can be seen as the


control unit of a machine that reads input from a tape divided into cells by
means of a reading head. Initially, the automaton is in the initial state, the tape
contains the word to be read, and the reading head is positioned on the first cell
of the tape. At each step, the machine reads the content of the cell occupied by
the reading head, updates the current state according to the transition function,
and advances the head one cell to the right. The machine accepts a word if the
state reached after reading it completely is final.

Figure 2: Tape with reading head.

A run of A on input a0a1 · · · an−1 is the sequence of states q0, q1, . . . , qn such
that qi+1 = δ(qi, ai) for 0 ≤ i ≤ n − 1. A run is accepting if qn ∈ F. The
automaton A accepts a word w ∈ Σ∗ if it has an accepting run on input w. The
language recognized by A is the set
L(A) = {w ∈ Σ∗ | w is accepted by A}.
A deterministic finite automaton (DFA) is a DA with a finite set of states.
Notice that a DA has exactly one run on a given word. Given a DA, we often
say “the word w leads from q0 to q”, meaning that the unique run of the DA on
the word w ends at the state q. Graphically, non-final states of a DFA are
represented by circles, and final states by double circles (see the example
below). The transition function is represented by labeled directed edges:
if δ(q, a) = q′ then we draw an edge from q to q′ labeled by a. We also draw an
edge into the initial state.

Example: The Figure below shows the graphical representation of the DFA
A = (Q, Σ, δ, q0, F), where
Q = {q0, q1, q2, q3}, Σ = {a, b}, F = {q0}, and δ is given by the following table
δ(q0, a) = q1 δ(q1, a) = q0 δ(q2, a) = q3 δ(q3, a) = q2
δ(q0, b) = q3 δ(q1, b) = q2 δ(q2, b) = q1 δ(q3, b) = q0

The runs of A on aabb and abbb are
q0 →a q1 →a q0 →b q3 →b q0 and q0 →a q1 →b q2 →b q1 →b q2.
The first one is accepting, but the second one is not. The DFA recognizes the
language of all words over the alphabet {a, b} that contain an even number of
a’s and an even number of b’s. The DFA is in the states on the left, respectively
on the right, if it has read an even, respectively an odd, number of a’s. Similarly,
it is in the states at the top, respectively at the bottom, if it has read an even,
respectively an odd, number of b’s.
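A run of a DFA is straightforward to simulate; the sketch below (ours, not the
notes') encodes the transition table of this even-a/even-b automaton as a
Python dictionary.

```python
delta = {('q0','a'):'q1', ('q1','a'):'q0', ('q2','a'):'q3', ('q3','a'):'q2',
         ('q0','b'):'q3', ('q1','b'):'q2', ('q2','b'):'q1', ('q3','b'):'q0'}

def dfa_accepts(w: str) -> bool:
    q = 'q0'                   # initial state
    for a in w:
        q = delta[(q, a)]      # exactly one successor: the run is unique
    return q == 'q0'           # F = {q0}

print(dfa_accepts('aabb'))     # True: even number of a's and of b's
print(dfa_accepts('abbb'))     # False
```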

Trap States
Consider the DFA with a trap state of the figure below over the alphabet {a, b,
c}.

The automaton recognizes the language {ab}. The pink state on the left is
often called a trap state or a garbage collector: if a run reaches this state, it gets
trapped in it, and so the run cannot be accepting. DFAs often have a trap state
with many ingoing transitions, and this makes it difficult to find a nice graphical
representation. So when drawing DFAs we often omit the trap state. For
instance, we only draw the black part of the automaton in the Figure. Notice that
no information is lost: if a state q has no outgoing transition labeled by a, then
we know that δ(q, a) = qt, where qt is the trap state.

Non-Deterministic Finite Automata


A non-deterministic automaton (NA) is a tuple A = (Q, Σ, δ, Q0, F), where
Q, Σ, and F are as for DAs,
Q0 is a nonempty set of initial states and
δ: Q × Σ → P(Q) is a transition relation.
In a deterministic automaton the next state is completely determined by the
current state and the letter read by the head. In particular, this implies that the
automaton has exactly one run for each word. Nondeterministic automata have
the possibility to choose the state out of a set of candidates (which may also be
empty), and so they may have zero, one, or many runs on the same word. The
automaton is said to accept a word if at least one of these runs is accepting.

A run of A on input a0a1 · · · an−1 is a sequence of states p0, p1, . . . , pn such
that pi ∈ Q for 0 ≤ i ≤ n, p0 ∈ Q0, and pi+1 ∈ δ(pi, ai) for 0 ≤ i ≤ n − 1. That
is, a run starts at some initial state. A run is accepting if pn ∈ F. A accepts a
word w ∈ Σ∗ if there is an accepting run on input w. The language recognized
by A is the set L(A) = {w ∈ Σ∗ | w is accepted by A}. So the runs of NAs are
defined as for DAs, but replacing the condition qi+1 = δ(qi, ai) by
pi+1 ∈ δ(pi, ai), and the initial state q0 by some p0 ∈ Q0. Acceptance and the
language recognized by a NA are defined as for DAs. A nondeterministic finite
automaton (NFA) is a NA with a finite set of states.

We often identify the transition function δ of a DA with the set of triples (q, a,
q′) such that q′ = δ(q, a), and the transition relation δ of a NA with the set of
triples (q, a, q′) such that q′ ∈ δ(q, a); so we often write (q, a, q′) ∈ δ, meaning
q′ = δ(q, a) for a DA, or q′ ∈ δ(q, a) for a NA. If a NA has several initial states,
then its language is the union of the sets of words accepted by runs starting at
each initial state.

Example: The figure below shows a NFA A = (Q, Σ, δ, Q0, F) where Q = {q0,
q1, q2, q3}, Σ = {a, b}, Q0 = {q0}, F = {q3}, and the transition relation δ is
given by the following table
δ(q0, a) = {q1} δ(q1, a) = {q1} δ(q2, a) = ∅ δ(q3, a) = {q3}
δ(q0, b) = ∅ δ(q1, b) = {q1, q2} δ(q2, b) = {q3} δ(q3, b) = {q3}

A has no run for any word starting with a b. It has exactly one run for abb, and
four runs for abbb, namely
q0 →a q1 →b q1 →b q1 →b q1,
q0 →a q1 →b q1 →b q1 →b q2,
q0 →a q1 →b q1 →b q2 →b q3,
q0 →a q1 →b q2 →b q3 →b q3.
Two of these runs are accepting, the other two are not. L(A) is the set of words
that start with a and contain two consecutive bs.
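Nondeterminism can be simulated by tracking the set of all states reachable
after each letter; here is a minimal Python sketch (not from the source) for the
NFA above.

```python
delta = {('q0','a'):{'q1'}, ('q1','a'):{'q1'}, ('q3','a'):{'q3'},
         ('q1','b'):{'q1','q2'}, ('q2','b'):{'q3'}, ('q3','b'):{'q3'}}

def nfa_accepts(w: str) -> bool:
    current = {'q0'}                      # Q0
    for a in w:                           # all states reachable after reading a
        current = {p for q in current for p in delta.get((q, a), set())}
    return bool(current & {'q3'})         # accept if some run ends in F

print(nfa_accepts('abb'))   # True
print(nfa_accepts('ba'))    # False: no run on words starting with b
```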

After a DA reads a word, we know if it belongs to the language or not. This is


no longer the case for NAs: if the run on the word is not accepting, we do not
know anything; there might be a different run leading to a final state. So NAs
are not very useful as language acceptors. However, they are very important.
From the operational point of view, it is often easier to find a NFA for a given
language than to find a DFA, and also, NFAs can be automatically transformed
into DFAs. From a data structure point of view, there are two further reasons to
study NAs. First, many sets can be represented far more compactly as NFAs
than as DFAs. So using NFAs may save memory. Second, and more
importantly, many operations on sets and relations can take as input a
DFA and return a NFA. Therefore, NFAs are not only convenient, but also
necessary to obtain a data structure implementing all operations.

PARSING
In computer science and linguistics, parsing, or, more formally, syntactic
analysis, is the process of analysing a text, made of a sequence of tokens (for
example, words), to determine its grammatical structure with respect to a given
(more or less) formal grammar. Parsing is also an earlier term for the
diagramming of sentences of natural languages, and is still used for the
diagramming of inflected languages, such as the Romance languages or Latin.
Parsing is a common term used in psycholinguistics when describing language
comprehension.

In this context, parsing refers to the way that human beings, rather than
computers, analyze a sentence or phrase (in spoken language or text) in terms of
grammatical constituents, identifying the parts of speech, syntactic relations,
etc. This term is especially common when discussing what linguistic cues help
speakers to parse garden path sentences.

Types of Parsing
(1) Top Down Parsing
Top-down parsing is a parsing strategy where one first looks at the highest level
of the parse tree and works down the parse tree by using the rewriting rules of a
formal grammar. LL parsers are a type of parser that uses a top-down parsing
strategy. Top-down parsing is a strategy of analysing unknown data
relationships by hypothesizing general parse tree structures and then
considering whether the known fundamental structures are compatible with the
hypothesis. It occurs in the analysis of both natural languages and computer
languages.

Top-down parsing can be viewed as an attempt to find left-most derivations of


an input stream by searching for parse-trees using a top-down expansion of the
given formal grammar rules. Tokens are consumed from left to right. Inclusive
choice is used to accommodate ambiguity by expanding all alternative right-
hand-sides of grammar rules. In top-down parsing, you start with the start
symbol and apply the productions until you arrive at the desired string.
For example, let’s trace through the two approaches on this simple grammar,
which recognizes strings consisting of any number of a’s followed by at least
one (and possibly more) b’s:
S –> AB
A –> aA | ε
B –> b | bB
Here is a top-down parse of aaab. We begin with the start symbol and at each
step, expand one of the remaining non-terminals by replacing it with the right
side of one of its productions. We repeat until only terminals remain. The
top-down parse produces a leftmost derivation of the sentence.
S
AB      S –> AB
aAB     A –> aA
aaAB    A –> aA
aaaAB   A –> aA
aaaεB   A –> ε
aaab    B –> b
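A top-down parser for this grammar can be written as a recursive-descent
recognizer with one procedure per non-terminal; the Python sketch below
(ours, not the notes') mirrors the derivation just traced.

```python
# Recognizer for S -> AB, A -> aA | ε, B -> b | bB.
def parse(s: str) -> bool:
    i = 0
    def A():                        # A -> aA | ε: consume zero or more a's
        nonlocal i
        while i < len(s) and s[i] == 'a':
            i += 1
    def B():                        # B -> b | bB: consume one or more b's
        nonlocal i
        if i == len(s) or s[i] != 'b':
            return False
        while i < len(s) and s[i] == 'b':
            i += 1
        return True
    A()
    return B() and i == len(s)      # S -> AB, with all input consumed

print(parse('aaab'))  # True
print(parse('aa'))    # False: at least one b is required
```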

(2) Bottom Up Parsing


A bottom-up parse works in reverse. We begin with the sentence of terminals
and each step applies a production in reverse, replacing a substring that matches
the right side with the nonterminal on the left. We continue until we have
substituted our way back to the start symbol. If you read from the bottom to top,
the bottom-up parse prints out a rightmost derivation of the sentence.

aaab
aaaεb   (insert ε)
aaaAb   A –> ε
aaAb    A –> aA
aAb     A –> aA
Ab      A –> aA
AB      B –> b
S       S –> AB
