You are on page 1of 88

.6cm.

Introduction to Formal Languages,


Automata and Computability
K. Krithivasan and R. Rama

Introduction to Formal Languages, Automata and Computability – p.1/74


Language
Strings are defined over an alphabet which is finite.
Alphabet may vary depending upon the application.
Elements of an alphabet are called symbols. Usually
we denote the basic alphabet set either as Σ or T . For
example, the following are a few examples of an
alphabet set.
Σ1 = {a, b}
Σ2 = {0, 1, 2}
Σ3 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.

Introduction to Formal Languages, Automata and Computability – p.2/74


contd.
A string or word is a finite sequence of symbols from
the alphabet, usually written as concatenated symbols
and not separated by gaps or commas. For example if
Σ = {a, b}, a string abbab is a string or word over Σ.
If w is a string over an alphabet Σ, then the length of
w written as len(w) or |w| is the number of symbols it
contains. If |w| = 0, then w is called as empty string
denoted either as λ or .

Introduction to Formal Languages, Automata and Computability – p.3/74


contd.
For any word w, w = w = w. For any string
w = a1 . . . an of length n, the reverse of w is writ-
ten as w R which is the string an an−1 . . . a1 , where each
symbol ai belongs to the basic alphabet Σ. A string z
that is appearing consecutively within another string w
is called a substring or subword of w. For example aab
is a substring of baabb.

Introduction to Formal Languages, Automata and Computability – p.4/74


contd.
The set of all strings over an alphabet Σ is denoted by
Σ∗ which includes the empty string . For example for
Σ = {0, 1}, Σ∗ = {, 0, 1, 00, 01, 10, 11, . . . }. Note
that Σ∗ is a countably infinite set. Also Σn denotes the
set of all strings over Σ whose length is n. Hence
Σ∗ = Σ0 ∪ Σ1 ∪ Σ2 ∪ Σ3 . . . and
Σ+ = Σ1 ∪ Σ2 ∪ Σ3 . . . . Subsets of Σ∗ are called
languages. For example if Σ = {a, b}
L1 = {, a, b}
L2 = {ab, aabb, aaabbb, . . . }
L3 = {w ∈ Σ∗ /|w|a = |w|b }

In the above example, L1 is finite, L2 and L3 are


infinite languages. φ denotes an empty language.
Introduction to Formal Languages, Automata and Computability – p.5/74
contd.
DefinitionA set is an unordered collection of objects.

Introduction to Formal Languages, Automata and Computability – p.6/74


contd.
DefinitionA set is an unordered collection of objects.
Example Let W denote the set of well formed
parentheses. It can be defined inductively as follows:
Basis clause : [ ] ∈ W
Inductive clause : if x, y ∈ W , xy ∈ W and [x] ∈ W
Extremal clause : No object is a member of W unless
its being so follows from a finite number of applica-
tions of the basis and the inductive clauses.

Introduction to Formal Languages, Automata and Computability – p.6/74


Language
Definition Let Σ be any alphabet set. Σ+ is a set of nonempty
strings over Σ defined as follows:
1. Basis : If a ∈ Σ, then a ∈ Σ+ .
2. Induction : If α ∈ Σ+ and a ∈ Σ, αa, aα are in Σ+ .
3. No other element belong to Σ+ .
Clearly the set Σ+ contains all strings of length n, n ≥ 1.

Introduction to Formal Languages, Automata and Computability – p.7/74


Language
Definition Let Σ be any alphabet set. Σ+ is a set of nonempty
strings over Σ defined as follows:
1. Basis : If a ∈ Σ, then a ∈ Σ+ .
2. Induction : If α ∈ Σ+ and a ∈ Σ, αa, aα are in Σ+ .
3. No other element belong to Σ+ .
Clearly the set Σ+ contains all strings of length n, n ≥ 1.
Definition Let Σ be any alphabet set. Σ∗ is defined as follows:
1. Basis :  ∈ Σ∗ .
2. Induction : If α ∈ Σ∗ , a ∈ Σ, then aα, αa ∈ Σ∗ .
3. No other element is in Σ∗ .

Introduction to Formal Languages, Automata and Computability – p.7/74


contd.
Since languages are sets, one can define the set-
theoretic operations of union, intersection, difference,
complement in the usual fashion. The following oper-
ations are also defined for languages. If x = a1 . . . an ,
y = b1 . . . bm , the concatenation of x and y is defined
as xy = a1 . . . an b1 . . . bm . The catenation (or con-
catenation) of two languages L1 and L2 is defined by,
L1 L2 = {w1 w2 /w1 ∈ L1 and w2 ∈ L2 }. Note that
concatenation of languages is associative because con-
catenation of strings is associative. Also L0 = {} and
Lφ = φL = φ, L = L = L. Introduction to Formal Languages, Automata and Computability – p.8/74
contd.
The concatenation closure (Kleene closure) of a
language L, in symbols L∗ is defined to be the union
of all powers of L:
[

L∗ = Li
i=0

[

Also L+ = Li .
i=1

Introduction to Formal Languages, Automata and Computability – p.9/74


contd.
The right quotient and right derivative are the
following sets respectively.
L1 \L2 = {y|yz ∈ L1 for some z ∈ L2 }

∂zr L = L/{z} = {y/yz ∈ L}


Similarly left quotient of a language L1 by a language
L2 is defined by
L2 /L1 = {z|yz ∈ L1 for some y ∈ L2 }.

The left derivative of a language L with respect to a


word y is denoted as ∂y L which is equal to {z|yz ∈ L}.
Introduction to Formal Languages, Automata and Computability – p.10/74
contd.
The mirror image (or reversal) of a language is the
collection of the mirror images of its words and
mir(L) = {mir(w)/w ∈ L} or LR = {wR /w ∈ L}.
The operations substitution and homomorphism are
defined as follows.
For each symbol a of an alphabet Σ, let σ(a) be a
language over Σa . Also σ() = , σ(αβ) = σ(α).σ(β)
for α, β ∈ Σ . σ is a mapping from Σ to 2 where
+ ∗ V∗

V is the union of the alphabets Σa , is called a


substitution. For a language L over Σ, we define
σ(L) = {α/α ∈ σ(β) for some β ∈ L}.

Introduction to Formal Languages, Automata and Computability – p.11/74


contd.
A substitution is -free if and only if none of the
language σ(a) contains . A family of languages is
closed under substitution if and only if whenever L is
in the family and σ is a substitution such that σ(a) is
in the family, then σ(L) is also in the family.
A substitution σ such that σ(a) consists of a single
word wa is called a homomorphism. It is called -free
homomorphism if none of σ(a) is .
Algebraically, one can see that Σ∗ is a free semigroup
with  as its identity.

Introduction to Formal Languages, Automata and Computability – p.12/74


contd.
The homomorphism which is defined above agrees
with the customary definition of homomorphism of
one semigroup into another.
Inverse homomorphism can be defined as follows:

h−1 (w) = {x|h(x) = w}

h−1 (L) = {x|h(x) is in L}.


It should be noted that h(h−1 (L)) need not be equal to
L. Generally h(h−1 (L)) ⊆ L and h−1 (h(L)) ⊇ L.

Introduction to Formal Languages, Automata and Computability – p.13/74


Grammar
Definition A phrase-structure grammar or a type 0
grammar is a 4-tuple G = (N, T, P, S), where N is a
finite set of nonterminal symbols called the
nonterminal alphabet, T is a finite set of terminal
symbols called the terminal alphabet, S ∈ N is the
start symbol and P is a set of productions (also called
production rules or simply rules) of the form u → v,
where u ∈ (N ∪ T )∗ N (N ∪ T )∗ and v ∈ (N ∪ T )∗ .
Derivations are defined as follows:
If αuβ is a string in (N ∪ T )∗ and u → v is a rule in
P , from αuβ we get αvβ by replacing u by v. This
is denoted as αuβ ⇒ αvβ. ⇒ is read as ‘directly
derives’. Introduction to Formal Languages, Automata and Computability – p.14/74
contd.
If α1 ⇒ α2 , α2 ⇒ α3 , . . . , αn−1 ⇒ αn , the derivation
is denoted as α1 ⇒ α2 ⇒ · · · ⇒ αn or α1 ⇒ αn . ⇒ is
∗ ∗

the reflexive, transitive closure of ⇒.


Definition The language generated by a grammar
G = (N, T, P, S) is the set of terminal strings
derivable in the grammar from the start symbol.
∗ ∗
L(G) = {w/w ∈ T , S ⇒ w}

Introduction to Formal Languages, Automata and Computability – p.15/74


Example
Consider the grammar G = (N, T, P, S) where
N = {S, A}, T = {a, b, c}, production rules in P are
S → aSc, S → aAc, A → b
A typical derivation in the grammar is
S ⇒ aSc
⇒ aaScc
⇒ aaaAccc
⇒ aaabccc

The language generated is L(G) = {an bcn /n ≥ 1}.

Introduction to Formal Languages, Automata and Computability – p.16/74


Lengths
DefinitionIf the rules are of the form αAβ → αγβ,
α, β ∈ (N ∪ T )∗ , A ∈ N , γ ∈ (N ∪ T )+ , the
grammar is called context-sensitive grammar.

Introduction to Formal Languages, Automata and Computability – p.17/74


Lengths
DefinitionIf the rules are of the form αAβ → αγβ,
α, β ∈ (N ∪ T )∗ , A ∈ N , γ ∈ (N ∪ T )+ , the
grammar is called context-sensitive grammar.
DefinitionIf in the rule u → v, |u| 6 |v|, the grammar
is called length increasing grammar.

Introduction to Formal Languages, Automata and Computability – p.17/74


Lengths
DefinitionIf the rules are of the form αAβ → αγβ,
α, β ∈ (N ∪ T )∗ , A ∈ N , γ ∈ (N ∪ T )+ , the
grammar is called context-sensitive grammar.
DefinitionIf in the rule u → v, |u| 6 |v|, the grammar
is called length increasing grammar.
Example Let G = (N, T, P, S) where N = {S, B},
T = {a, b, c},
P has the following rules:
1. S → aSBc
2. S → abc
3. cB → Bc
4. bB → bb
Introduction to Formal Languages, Automata and Computability – p.17/74
contd..
Let us consider the language generated. The number
appearing above ⇒ denotes the rule being used.

S ⇒ abc; here abc ∈ L(G)


2

1
S ⇒ aSBc
2
⇒ aabcBc
3
⇒ aabBcc
4
⇒ aabbcc, a2 b2 c2 ∈ L(G)

Introduction to Formal Languages, Automata and Computability – p.18/74


contd.
Similarly
1
S ⇒ aSBc
1
⇒ aaSBcBc
2
⇒ aaabcBcBc
3
⇒ aaabBccBc
3
⇒ aaabBcBcc
3
⇒ aaabBBccc
4
⇒ aaabbBccc
4
⇒ aaabbbccc, a3 b3 c3 ∈ L(G)
Introduction to Formal Languages, Automata and Computability – p.19/74
contd.
In general any string of the form an bn cn will be
generated.

S ⇒ an−1 S(Bc)n−1 (by applying rule 1 (n-1) times)


⇒ an bc(Bc)n−1 (rule 2 once)


n(n − 1)
⇒ a bB c (by applying rule 3 times)
∗ n n−1 n
2
⇒ an bn cn (by applying rule 4,(n − 1) times)

Hence L(G) = {an bn cn /n ≥ 1}. This is a type 1


language.
Introduction to Formal Languages, Automata and Computability – p.20/74
Type2 language
Definition If in a grammar, the production rules are of the form, A → α,
where A ∈ N and α ∈ (N ∪ T )∗ , the grammar is called a type 2 grammar or
context-free grammar. The language generated is called a type 2 language or
context-free language.

Introduction to Formal Languages, Automata and Computability – p.21/74


Type2 language
Definition If in a grammar, the production rules are of the form, A → α,
where A ∈ N and α ∈ (N ∪ T )∗ , the grammar is called a type 2 grammar or
context-free grammar. The language generated is called a type 2 language or
context-free language.

Definition If the rules are of the form A → αB, A → β, A, B ∈ N, α, β ∈

T ∗ , the grammar is called a right linear grammar or type 3 grammar and the

language generated is called a type 3 language or regular set. We can even

put the restriction that the rules can be of the form A → aB, A → b, where

A, B ∈ N, a ∈ T, b ∈ T ∪ . This is possible because a rule A → a1 . . . ak B

can be split into A → a1 B1 , B1 → a2 B2 , . . . , Bk−1 → ak B by introducing

new nonterminals B1 , . . . , Bk .

Introduction to Formal Languages, Automata and Computability – p.21/74


Example
Let G = (N, T, P, S) where N = {S}, T = {a, b}
P consists of the following rules.
1. S → aS
2. S → bS
3. S → 
This grammar generates all strings in T ∗ . For example, the string abbaab is generated as follows:

S ⇒ aS (rule 1)
⇒ abS (rule 2)
⇒ abbS (rule 2)
⇒ abbaS (rule 1)
⇒ abbaaS (rule 1)
⇒ abbaabS (rule 2)
⇒ abbaab (rule 3)

Introduction to Formal Languages, Automata and Computability – p.22/74


Derivation tree
We have considered the definition of a grammar and
derivation. Each derivation can be represented by a
tree called a derivation tree (sometimes called parse
tree). A derivation tree for the derivation considered
in previous example with grammar
S → aSc, S → aAc, A → b is
S

a S c

a S c

a A c

Introduction to Formal Languages, Automata and Computability – p.23/74


b
Example
Consider the following CFG, G = (N, T, P, S),
N = {S, A, B}, T = {a, b}. P consists of the
following productions
1. S → aB
2. B → b
3. B → bS
4. B → aBB
5. S → bA
6. A → a
7. A → aS
8. A → bAA
Introduction to Formal Languages, Automata and Computability – p.24/74
contd..
The derivation tree for aaabbb is as follows,
S

a B

a B B

a B B b

b b

Introduction to Formal Languages, Automata and Computability – p.25/74


Example
Consider the grammar
G = ({S}, {a, b}, {S → SaSbS, S → SbSaS,
S → }, S). The language generated by this grammar
is the same as the language generated by the grammar
G = (N, T, P, S), N = {S, A, B}, T = {a, b}, P
consists of the following productions
S → aB, B → b, B → bS, B → aBB, S → bA, A →
a, A → aS, A → bAA, except that , the empty string
is also generated here.

Introduction to Formal Languages, Automata and Computability – p.26/74


Example
Consider the grammar
G = ({S}, {a, b}, {S → SaSbS, S → SbSaS,
S → }, S). The language generated by this grammar
is the same as the language generated by the grammar
G = (N, T, P, S), N = {S, A, B}, T = {a, b}, P
consists of the following productions
S → aB, B → b, B → bS, B → aBB, S → bA, A →
a, A → aS, A → bAA, except that , the empty string
is also generated here.

0
X

−1

−2

Introduction to Formal Languages, Automata and Computability – p.26/74


contd..
Consider a string w having equal number of a’s and
b’s. We use induction.
Basis
|w| = 0, S ⇒ 
|w| = 2, it is either ab or ba, then

S ⇒ SaSbS ⇒ ab

S ⇒ SbSaS ⇒ ba.

Introduction to Formal Languages, Automata and Computability – p.27/74


contd..
Induction Assume that the result holds up to strings
of length k − 1. Prove that the result holds for strings
of length k. Draw a graph where the x axis represents
the length of the prefixes of the given string. y axis
represents the number of a’s - number of b’s. For the
string aabbabba, the graph will look as given in the
previous figure. For a given string w with equal
number of a’s and b’s there are 3 possibilities.
1. The string begins with ‘a’ and ends with ‘b’.
2. The string begins with ‘b’ and ends with ‘a’.
3. Other two cases (begins with a and ends with a,
begins with b and ends with b)
Introduction to Formal Languages, Automata and Computability – p.28/74
contd..
In the first case w = aw1 b and w1 has equal numbers
of a’s and b’s. So we have
S ⇒ SaSbS ⇒ aSbS ⇒ aSb ⇒ aw1 b as S ⇒ w1 by
∗ ∗

inductive hypotheses. A similar argument holds for


case 2. In case 3, the graph mentioned above will
cross the x axis.
Consider w = w1 w2 where w1 , w2 have equal number
of a’s and b’s. Let us say w1 begins with a, and w1
corresponds to the portion where the graph touches
the x axis for the first time. In the above example
w1 = aabb and w2 = abba.

Introduction to Formal Languages, Automata and Computability – p.29/74


contd..
In this case we can have a derivation as follows:
1
S ⇒ SaSbS
3
⇒ aSbS

⇒ aw10 bS(w1 = aw10 b)

⇒ aw10 bw2

⇒ w1 w2 .

S ⇒ w follows from inductive hypothesis.


Introduction to Formal Languages, Automata and Computability – p.30/74


CS
Theorem Every context-sensitive language is length increasing and
conversely.
That every context-sensitive language is length increasing can be seen
from definitions.
Every length increasing language is context-sensitive can be seen from
the following construction.
Let L be a length increasing language generated by G = (N, T, P, S).
Without loss of generality, one can assume that the productions in P
are of the form X → a, X → X1 . . . Xm , X1 . . . Xm → Y1 . . . Yn ,
2 ≤ m ≤ n, X, X1 , . . . , Xm , Y1 , . . . , Yn ∈ N , a ∈ T . Productions in
P which are already context-sensitive productions are not modified.
Hence consider, a production of the form

X1 . . . X m → Y 1 . . . Y n , 2 ≤ m ≤ n
Introduction to Formal Languages, Automata and Computability – p.31/74
contd..
It is replaced by the following set of context-sensitive
productions:
X1 . . . X m → Z1 X2 . . . X m
Z1 X 2 . . . X m → Z 1 Z2 X 3 . . . X m
..
.
Z1 Z2 . . . Zm−1 Xm → Z1 Z2 . . . Zm Ym+1 . . . Yn
Z1 Z2 . . . Zm Ym+1 . . . Yn → Y1 Z2 . . . Zm Ym+1 . . . Yn
..
.
Y1 Y2 . . . Ym−1 Zm Ym+1 . . . Yn → Y1 Y2 . . . Ym Ym+1 . . . Yn
where Zk , 1 ≤ k ≤ m are new nonterminals.
Introduction to Formal Languages, Automata and Computability – p.32/74
contd..
Each production that is not context-sensitive is to be
replaced by a set of context-sensitive productions as
mentioned above. Application of this set of rules has
the same effect as applying X1 . . . Xm → Y1 . . . Yn .
Hence a new grammar G0 thus obtained is
context-sensitive that is equivalent to G.
Example Let L = {an bm cn dm /n, m ≥ 1}.
The type 1 grammar generating this CSL is given by
G = (N, T, P, S) with N = {S, A, B, X, Y },
T = {a, b, c, d}

Introduction to Formal Languages, Automata and Computability – p.33/74


contd..
and P =
S → aAB|aB
A → aAX|aX
B → bBd|bY d
Xb → bX
XY → Yc
Y → c.

Introduction to Formal Languages, Automata and Computability – p.34/74


contd..
Sample Derivations
S ⇒ aB ⇒ abY d ⇒ abcd.
S ⇒ aB ⇒ abBd ⇒ abbY dd ⇒ ab2 cd2 .
S ⇒ aAB ⇒ aaXB ⇒ aaXbY d
⇒ aabXY d
⇒ aabY cd
⇒ a2 bc2 d.

Introduction to Formal Languages, Automata and Computability – p.35/74


contd..

S ⇒ aAB ⇒ aaAXB
⇒ aaaXXB
⇒ aaaXXbY d
⇒ aaaXbXY d
⇒ aaabXXY d
⇒ aaabXY cd
⇒ aaabY ccd
⇒ aaabcccd.

Introduction to Formal Languages, Automata and Computability – p.36/74


Exercise
Consider the length increasing grammar with
productions P =
S → aSBc, S → abc, cB → Bc, bB → bb. All rules
except the rule cB → Bc are context-sensitive. The
following grammar is a context-sensitive grammar
equivalent to the above grammar.
S → aSBC
S → abc
C → c
CB → DB
DB → DC
DC → BC.
Introduction to Formal Languages, Automata and Computability – p.37/74
Ambiguity
Definition Let G = (N, T, P, S) be a CFG. A word w
in L(G) is said to be ambiguously derivable in G, if it
has two or more different derivation trees in G.
Since the correspondence between derivation trees
and leftmost derivations is a bijection, an equivalent
definition in terms of leftmost derivations can be
given.

Introduction to Formal Languages, Automata and Computability – p.38/74


Ambiguity
Definition Let G = (N, T, P, S) be a CFG. A word w
in L(G) is said to be ambiguously derivable in G, if it
has two or more different derivation trees in G.
Since the correspondence between derivation trees
and leftmost derivations is a bijection, an equivalent
definition in terms of leftmost derivations can be
given.
DefinitionLet G = (N, T, P, S) be a CFG. A word w
in L(G) is said to be ambiguously derivable in G, if it
has two or more different leftmost derivations in G.

Introduction to Formal Languages, Automata and Computability – p.38/74


Ambiguity
Definition Let G = (N, T, P, S) be a CFG. A word w
in L(G) is said to be ambiguously derivable in G, if it
has two or more different derivation trees in G.
Since the correspondence between derivation trees
and leftmost derivations is a bijection, an equivalent
definition in terms of leftmost derivations can be
given.
DefinitionLet G = (N, T, P, S) be a CFG. A word w
in L(G) is said to be ambiguously derivable in G, if it
has two or more different leftmost derivations in G.
DefinitionA CFG is said to be ambiguous if there is a
word w in L(G) which is ambiguously derivable. Oth-
erwise it is unambiguous. Introduction to Formal Languages, Automata and Computability – p.38/74
Example
Consider the grammar G with rules S → SS, S → a
where S is the nonterminal and a is the terminal
symbol. L(G) = {an /n ≥ 1}.This grammar is
ambiguous as a3 has two different derivation trees as
follows

Introduction to Formal Languages, Automata and Computability – p.39/74


Example
Consider the grammar G with rules S → SS, S → a
where S is the nonterminal and a is the terminal
symbol. L(G) = {an /n ≥ 1}.This grammar is
ambiguous as a3 has two different derivation trees as
follows
S S

S S S S

a S S S S a

a a a a

Introduction to Formal Languages, Automata and Computability – p.39/74


Ambiguity contd..
Definition A CFL L is said to be inherently ambiguous if all the grammars
generating it are ambiguous or in other words, there is no unambiguous
grammar generating it.
Example L = {an bn cp /n, m, p ≥ 1, n = m or m = p}.
This can be looked at as L = L1 ∪ L2

L1 = {an bn cp /n, p ≥ 1}
L2 = {an bm cm /n, m ≥ 1}.

L1 and L2 can be generated individually by unambiguous grammars, but any

grammar generating L1 ∪ L2 will be ambiguous. Since strings of the form

an bn cn will have two different derivation trees, one corresponding to L1 and

another corresponding to L2 . Hence L is inherently ambiguous.


Introduction to Formal Languages, Automata and Computability – p.40/74
Ambiguity contd..
Definition A CFL L is said to be inherently ambiguous if all the grammars
generating it are ambiguous or in other words, there is no unambiguous
grammar generating it.
Example L = {an bn cp /n, m, p ≥ 1, n = m or m = p}.
This can be looked at as L = L1 ∪ L2

L1 = {an bn cp /n, p ≥ 1}
L2 = {an bm cm /n, m ≥ 1}.

L1 and L2 can be generated individually by unambiguous grammars, but any

grammar generating L1 ∪ L2 will be ambiguous. Since strings of the form

an bn cn will have two different derivation trees, one corresponding to L1 and

another corresponding to L2 . Hence L is inherently ambiguous.


Introduction to Formal Languages, Automata and Computability – p.40/74
contd..
Theorem It is undecidable to determine whether a
given CFG G is ambiguous or not.

Introduction to Formal Languages, Automata and Computability – p.41/74


contd..
Theorem It is undecidable to determine whether a
given CFG G is ambiguous or not.
Theorem It is undecidable to determine whether a
given CFL L is ambiguous or not.

Introduction to Formal Languages, Automata and Computability – p.41/74


contd..
Theorem It is undecidable to determine whether a
given CFG G is ambiguous or not.
Theorem It is undecidable to determine whether a
given CFL L is ambiguous or not.
DefinitionA CFL L is bounded if there exists strings
w1 , . . . , wk such that L ⊆ w1∗ w2∗ . . . wk∗ .

Introduction to Formal Languages, Automata and Computability – p.41/74


contd..
Theorem It is undecidable to determine whether a
given CFG G is ambiguous or not.
Theorem It is undecidable to determine whether a
given CFL L is ambiguous or not.
DefinitionA CFL L is bounded if there exists strings
w1 , . . . , wk such that L ⊆ w1∗ w2∗ . . . wk∗ .
TheoremThere exists an algorithm to find out whether
a given bounded CFL is inherently ambiguous or not.

Introduction to Formal Languages, Automata and Computability – p.41/74


contd..
Theorem It is undecidable to determine whether a
given CFG G is ambiguous or not.
Theorem It is undecidable to determine whether a
given CFL L is ambiguous or not.
DefinitionA CFL L is bounded if there exists strings
w1 , . . . , wk such that L ⊆ w1∗ w2∗ . . . wk∗ .
TheoremThere exists an algorithm to find out whether
a given bounded CFL is inherently ambiguous or not.
DefinitionLet G = (N, T, P, S) be a CFG then the de-
gree of ambiguity of G is the maximum number of
derivation trees a string w ∈ L(G) can have in G.

Introduction to Formal Languages, Automata and Computability – p.41/74


contd.
We can also use the idea of power series and find out
the number of different derivation trees a string can
have. Consider the grammar with rules
S → SS, S → a write an equation S = SS + a
Initial solution is S = a, S1 = a
Use this in the equation for S on the right-hand side
S2 = S1 S1 + a
= aa + a.
S3 = S2 S2 + a
= (aa + a)(aa + a) + a
= a4 + 2a3 + a2 + a.
Introduction to Formal Languages, Automata and Computability – p.42/74
contd.

S4 = S3 S3 + a
= (a4 + 2a3 + a2 + a)2 + a
= a8 + 4a7 + 6a6 + 6a5 + 5a4 + 2a3 + a2 + a.
We can proceed like this using Si = Si−1 Si−1 + a
In Si , upto strings of length i, the coefficient of the
string will give the number of different derivation trees
it can have in G. For example, in S4 coefficient of a4 is
5 and a4 has 5 different derivation trees in G. The co-
efficient of a3 is 2 - the number of different derivation
trees for a3 is 2 in G. Introduction to Formal Languages, Automata and Computability – p.43/74
Simplification
Definition Let G = (N, T, P, S) be a CFG. A variable X in N is said
to be useful if and only if there is at least a string α ∈ L(G) such that
∗ ∗
S ⇒ α1 Xα2 ⇒ α,

where α1 , α2 ∈ (N ∪ T )∗ i.e., X is useful because it appears in at least


one derivation from S to a word α in L(G). Otherwise X is useless.
Consequently the production involving X is useful.
One can understand the ‘useful symbol’ concept in two steps.
Step 1 For a symbol X ∈ N to be useful it should occur in some

derivation starting from S, i.e., S ⇒ α1 Xα2 .

Introduction to Formal Languages, Automata and Computability – p.44/74


contd.
Step 2 Also X has to derive a string α ∈ T ∗ i.e.,

X ⇒α
These two conditions are necessary. But they are not
sufficient. These two conditions may be satisfied still
α1 or α2 may contain a nonterminal from which a
terminal string cannot be derived. So the usefulness of
a symbol has to be tested in two steps as above.
Lemma Let G = (N, T, P, S) be a CFG such that
L(G) 6= φ. Then there exists an equivalent context-
free grammar G0 = (N 0 , T 0 , P 0 , S) that does not con-
tain any useless symbol or productions.
Introduction to Formal Languages, Automata and Computability – p.45/74
contd.
The context-free grammar G0 is obtained by the
following elimination procedures
I. First eliminate all those symbols X such that X
does not derive any string over T ∗ . Let
G2 = (N2 , T, P2 , S) be the grammar thus modified.
As L(G) 6= φ, S will not be eliminated. The following
algorithm identifies symbols not to be eliminated.
Algorithm GENERATING
Step 1 Let GEN = T ;
Step 2 If A → α and every symbol of α belongs to
GEN , then add A to GEN .
Remove from N all those symbols that are not in the
set GEN and all the rules using them.
Introduction to Formal Languages, Automata and Computability – p.46/74
contd.
Let the resultant grammar be G2 = (N2 , T, P2 , S).
II. Now eliminate all symbols in the grammar G2 that are not occurring

in any derivation from ‘S’. i.e., S 6⇒ α1 Xα2 .
Algorithm REACHABLE
Let REACH = {S}.
If A ∈ REACH and A → α ∈ P then every symbol of α is in
REACH.
The above algorithm terminates as any grammar has only finite set of
rules. It collects all those symbols which are reachable from S through
derivations from S.
Now in G2 remove all those symbols that are not in REACH and
also productions involving them. Hence one gets the modified gram-
mar G0 = (N 0 , T 0 , P 0 , S), a new CF G. Introduction to Formal Languages, Automata and Computability – p.47/74
contd.
without useless symbols and productions using them.
The equivalence of G with G0 can be easily seen since
only symbols and productions leading to derivation of
terminal strings from S are present in G0 . Hence
L(G) = L(G0 ).
Theorem Given a CFG G = (N, T, P, S). Procedure I
of the previous lemma is executed to get
G2 = (N2 , T, P2 , S) and procedure II of previous
lemma is executed to get G0 = (N 0 , T 0 , P 0 , S). Then
G0 contains no useless symbols.
Suppose G0 contains a symbol X (say) which is use-
less. It is easily seen that N 0 ⊆ N2 , T 0 ⊆ T , P 0 ⊆ P2 .
Introduction to Formal Languages, Automata and Computability – p.48/74
contd.
Since X is obtained after execution of II,
S ⇒ α1 Xα2 , α1 , α2 ∈ (N 0 ∪ T 0 )∗ . Every symbol of

N 0 is also in N2 . Since G2 is obtained by execution of


I, it is possible to get a terminal string from every
symbol of N2 and hence from N : α1 ⇒ w1 and
0 ∗

α2 ⇒ w2 , w1 , w2 ∈ T ∗ . Thus

S ⇒ α1 Xα2 ⇒ w1 Xw2 ⇒ w1 ww2.


∗ ∗ ∗

Clearly S is not useless as supposed. Hence G0


contains only useful symbols.

Introduction to Formal Languages, Automata and Computability – p.49/74


Example
Let G = (N, T, P, S) be a CFG with
N = {S, A, B, C}
T = {a, b} and
P = {S → Sa|A|C, A → a, B → bb, C → aC|B}
First ‘GEN ’ set will be {S, A, B, a, b}. Then
G2 = (N2 , T, P2 , S) where N2 = {S, A, B}
P = {S → Sa|A, A → a, B → bb}
In G2 , REACH set will be {S, A, a}. Hence
G0 = (N 0 , T 0 , P 0 , S) where
N 0 = {S, A}
T 0 = {a}
P 0 = {S → Sa|A, A → a}
L(G) = L(G0 ) = {an |n ≥ 1}.Introduction to Formal Languages, Automata and Computability – p.50/74
-rule Elimination
Definition Any production of the form A →  is called
a -rule. If A ⇒ , then we call A a nullable symbol.

Theorem Let G = (N, T, P, S) be a CFG such that


∈/ L(G). Then there exists a CFG without -rules
generating L(G).
Before modifying the grammar one has to identify the
set of null-able symbols of G. This is done by the
following procedure.
Algorithm NULL
1. Let N U LL := φ,
2. If A →  ∈ P , then A ∈ N U LL,
3. If A → B1 . . . Bt ∈ P , and each Bi is in N U LL,
then A ∈ N U LL Introduction to Formal Languages, Automata and Computability – p.51/74
contd.
Run algorithm N U LL for G and get the N U LL set.
The modification of G to get G0 = (N, T, P 0 , S) with
respect to N U LL is given below.
If A → A1 . . . At ∈ P , t ≥ 1 and if n (n < t) of these
Ai s are in N U LL. Then P will contain 2n rules where
the variables in N U LL are either present or absent in
all possible combinations. If n = t then remove
A →  from P . The grammar G0 = (N, T, P 0 , S) thus
obtained is -free. To prove that a word w ∈ L(G) if
and only if w ∈ L(G0 ). As G and G0 do not differ in
N and T , one can equivalently show that,

A ⇒G w
Introduction to Formal Languages, Automata and Computability – p.52/74
contd.
if and only if

A ⇒ G0 w
for A ∈ N . Clearly w 6= . Let A ⇒G w. n =
n

1, A ⇒G w then A → w is in P and hence in P 0 .


Then A ⇒G0 w. Assume the result to be true for all


derivations from A of length n − 1. Let A ⇒G w


n

i.e., A ⇒ w. Let w = w1 . . . wk
n−1
Y1 , Y2 , . . . , Yk =⇒G
G
and let X1 , X2 , . . . , Xm be those Yj ’s in order such that
Yj ⇒ wj , wj 6= . Clearly k ≥ 1 as w 6= . Hence

A → X1 . . . Xm is a rule in G0 .
Introduction to Formal Languages, Automata and Computability – p.53/74
contd.
We can see that X1 . . . Xm ⇒G w as some Yj derive

only . Since each Yj ⇒ wj , wj takes fewer than n


steps, by induction, Yj ⇒ wj , for wj 6= . Hence



A ⇒G X1 . . . Xm ⇒ w.
Conversely if A ⇒G0 w. Then to show that A ⇒G w
∗ ∗

also. Again the proof is by induction on the number of


derivation steps.
Basis If A ⇒0 w, then A → w ∈ G0 . By the
G
construction of G0 , one can see that there exists a rule
A → α in G such that α and w differ only in nullable
symbols. Hence A ⇒ α ⇒G w where for α ⇒ w only
∗ ∗
G
-rules are used. Introduction to Formal Languages, Automata and Computability – p.54/74
contd.
Induction
Assume A ⇒G0 w n > 1. Then let
n

A ⇒0 Y1 . . . Yk ⇒G w. For A ⇒0 Y1 . . . Yk the

G G
corresponding equivalent derivation by G will be
A ⇒0 X1 . . . Xm ⇒G Y1 . . . Yk as some of the Xi0 are

G
nullable. Hence A ⇒G Y1 . . . Yk ⇒G w1 . . . wk = w
∗ ∗

by induction hypothesis. Hence A ⇒G w. Hence


L(G) = L(G0 ).

Introduction to Formal Languages, Automata and Computability – p.55/74


Example
Consider G = (N, T, P, S) where
N = {S, A, B, C, D}, T = {0, 1} and
P = {S → AB0C, A → BC, B → 1|, C → D|,
D → }
The set N U LL = {A, B, C, D}.
Then G0 will be with
P 0 = {S → AB0C|AB0|A0C|B0C|A0|B0|0C|0,
A → BC|B|C, B → 1, C → D}.

Introduction to Formal Languages, Automata and Computability – p.56/74


Procedure to Eliminate Unit Rules
Definition Any rule of the form X → Y , X, Y ∈ N is
called a unit rule. Note that A → a with a ∈ T is not
a unit rule.
In simplification of context-free grammars another
important step is to make the given context-free
grammar unit rule free i.e., to eliminate unit rules.
This is essential for the application in compilers. As
the compiler spends time on each rule used in parsing
by generating semantic routines, having unnecessary
unit rules will increase compiler time.
Lemma Let G be a CFG, then there exists a CFG G0
without unit rules such that L(G) = L(G0 ).

Introduction to Formal Languages, Automata and Computability – p.57/74


contd.
Let G = (N, T, P, S) be a CFG. First find the sets of pairs of

nonterminals (A, B) in G such that A ⇒ B by unit rules. Such a pair
is called a unit-pair.
Algorithm UNIT-PAIR

1. (A, A) ∈ U N IT − P AIR, for every variable A ∈ N .


2. If(A, B) ∈ U N IT − P AIR and B → C ∈ P then
(A, C) ∈ U N IT − P AIR

Now construct G0 = (N, T, P 0 , S) as follows. Remove all unit produc-


tions. For every unit-pair (X, Y ), if Y → α ∈ P is a nonunit rule, add
to P 0 , X → α. Thus G0 has no unit rules.

Introduction to Formal Languages, Automata and Computability – p.58/74


contd.
Let us consider leftmost derivations in G and G0 . If w ∈ L(G0 ), then

S ⇒G0 w. Clearly there exists a derivation of w from S by G where

zero or more applications of unit rules are used. Hence S ⇒G w whose

length may be different from S ⇒G0 w.

If w ∈ L(G), then one can consider S ⇒G0 w by G. A sequence of
unit-productions applied in this derivation is to be written by a nonunit
production. Since this is a leftmost derivation, for such sequence there
exists one rule in G0 doing the same job. Hence one can see the simula-
tion of a derivation of G with G0 .

Introduction to Formal Languages, Automata and Computability – p.59/74


contd.
Hence L(G) = L(G0 ).
Example Let G = (N, T, P, S) be a CFG where
N = {X, Y }, T = {a, b} and
P = {X → aX|Y |b, Y → bK|K|b, K → a}
U N IT − P AIR =
{(X, X), (Y, Y ), (K, K), (X, Y ), (Y, K), (X, K)}
Then G0 = (N, T, P 0 , S) where
P 0 = {X → aX|bK|b|a, Y → bK|b|a, K → a}.
Remark Since removal of -rule can introduce unit
productions, to get a simplified CFG to generate
L(G) − {}, the following steps have to be used in
the order given. Introduction to Formal Languages, Automata and Computability – p.60/74
contd.
1. Remove -rules
2. Remove unit-rules
3. Remove useless symbols.
(i) Remove symbols not deriving terminal
strings.
(ii) Remove symbols not reachable from S.
Example It is essential that steps 3(i) and 3(ii) have to
be executed in that order. If step 3(ii) is executed first
and then step 3(i), we may not get the required
reduced grammar. Consider the CFG
G = (N, T, P, S) where N = {S, A, B, C},
T = {a, b} and P = {S → ABC|a, A → a, C → b}
Introduction to Formal Languages, Automata and Computability – p.61/74
L(G) = {a}
contd.
Applying step 3(ii) first removes nothing. Then apply
step 3(i) which removes B and S → ABC leaving
S → a, A → a, C → b. Though A and C do not
contribute to the set L(G), they are not removed.
On the other hand applying step 3(i) first, removes B,
S → ABC. Afterwards apply step 3(ii), removes
A, C, A → a, C → b. Hence S → a is the only rule
left which is the required result.

Introduction to Formal Languages, Automata and Computability – p.62/74


Normal form
The most popular normal forms are Weak Chomsky
Normal Form, Chomsky Normal Form, Strong
Chomsky Normal Form, Greibach Normal Form.
Definition Let G = (N, T, P, S) be a CFG. If each
rule in P is of the form A → ∆, A → a or A → ,
where A ∈ N , ∆ ∈ N + , a ∈ T , then G is said to be in
Weak Chomsky Normal Form (WCNF).
Example Let G = (N, T, P, S) be a CFG where
N = {S, A, B}, T = {a, b} and
P = {S → ASB|AB, A → a, B → b}. G is in
WCNF.

Introduction to Formal Languages, Automata and Computability – p.63/74


contd.
Theorem For any CFG G = (N, T, P, S) there exists a
CFG G0 in WCNF such that L(G) = L(G0 ).
Let G = (N, T, P, S) be a CFG. One can construct an
equivalent CFG in WCNF as below. Let
G0 = (N 0 , T 0 , P 0 , S 0 ) be an equivalent CFG where
N 0 = N ∪ {Aa /a ∈ T }, none of Aa ’s belong to N .
P 0 = {A → ∆|A → α ∈ P and every occurrence of a
symbol from T present in α is replaced by Aa , giving
∆} ∪ {Aa → a|a ∈ T }. Clearly ∆ ∈ N 0+ and P 0 gets
the required form. G0 is in WCNF. That G and G0
equivalent can be seen easily.

Introduction to Formal Languages, Automata and Computability – p.64/74


Chomsky Normal Form
Definition Let  ∈/ L(G) and G = (N, T, P, S) be a CFG. G is said to
be in Chomsky Normal Form (CNF) if all its productions are of the
form A → BC or A → a, A, B, C ∈ N, a ∈ T .
Example The following CFGs are in CNF.
1. G1 = (N, T, P, S) where N = {S, A, B, C}, T = {0, 1} and
P = {S → AB|AC|SS, C → SA, A → 0, B → 1}.
2. G2 = (N, T, P, S) where N = {S, A, B, C}, T = {a, b} and
P = {S → AS|SB, A → AB|a, B → b}.
No CFG in CNF can generate . If  is to be added to L(G), then a new
start symbol S 0 is to be taken and S 0 →  should be added. For every
rule S → α, S 0 → α should be added which make sure that the new
start symbol does not appear on the right-hand side of any production.

Introduction to Formal Languages, Automata and Computability – p.65/74


contd.
Theorem Given CFG G, there exists an equivalent CFG G00 in CNF.
Let G = (N, T, P, S) be a CFG without  rules, unit-rules and useless
symbols and also  ∈ / L(G). Modify G to G0 such that
G0 = (N 0 , T, P 0 , S) is in WCNF. Let A → ∆ ∈ P . If |∆| = 2, such
rules need not be modified. If |∆| ≥ 3, the modification is as below:
If A → ∆ = A1 A2 A3 , the new set of equivalent rules will be:

A → A 1 B1

B1 → A 2 A 3 .

Similarly if A → A1 A2 . . . An ∈ P , it is replaced by

Introduction to Formal Languages, Automata and Computability – p.66/74


contd.
A → A 1 B1
B1 → A 2 B2
..
.
Bn−2 → An−1 An .
Let P 00 be the collection of modified rules and
G00 = (N 00 , T, P 00 , S) be the modified grammar which
is clearly in CNF. Also L(G) = L(G00 ).

Introduction to Formal Languages, Automata and Computability – p.67/74


contd.
Example Let G = (N, T, P, S) be a CFG where
N = {S, A, B}, T = {a, b}
P = {S → SAB|AB|SBC, A → AB|a,
B → BAB|b, C → b}.
Clearly G is not in CNF but in WCNF. Hence the
modification of rules in P are as below:
For S → SAB, the equivalent rules are
S → SB1 , B1 → AB.
For S → SBC, the equivalent rules are
S → SB2 , B2 → BC.
For B → BAB, the equivalent rules are
B → BB3 , B3 → AB.

Introduction to Formal Languages, Automata and Computability – p.68/74


contd.
Hence G00 = (N 00 , T, P 00 , S) will be with
N 00 = {S, A, B, C, B1 , B2 , B3 }, T = {a, b}

P 00 ={S → SB1 |SB2 |SA, B1 → AB, B2 → BC, B3 → AB,


A → AB|a, B → BB3 |b, C → b}.

Clearly G00 is in CNF.

Introduction to Formal Languages, Automata and Computability – p.69/74


Strong Chomsky Normal Form
Definition A CFG G = (N, T, P, S) is said to be in
Strong Chomsky Normal Form (SCNF) when rules in
P are only of the forms A → a, A → BC where
A, B, C ∈ N , a ∈ T subject to the following
conditions:
(i) if A → BC ∈ P , then B 6= C.
(ii) if A → BC ∈ P , then for each rule
X → DE ∈ P , we have E 6= B and D 6= C.

Introduction to Formal Languages, Automata and Computability – p.70/74


contd.
Theorem For every CFG G = (N, T, P, S) there
exists an equivalent CFG in SCNF.
Let G = (N, T, P, S) be a CFG in CNF. One can
construct an equivalent CFG.
G0 = (N 0 , T, P 0 , S 0 ) in SCNF as below.
N 0 = {S 0 } ∪ {AL , AR |A ∈ N }
T =T

Introduction to Formal Languages, Automata and Computability – p.71/74


contd.

P 0 ={AL → BL CR , AR → BL CR |A → BC ∈ P }
∪ {S 0 → XL YR |S → XY ∈ P }
∪ {S 0 → a|S → a ∈ P }
∪ {AL → a, AR → a|A → a ∈ P, a ∈ T }.

Clearly L(G) = L(G0 ) and G0 is in SCNF.

Introduction to Formal Languages, Automata and Computability – p.72/74


contd.
Example Let G = (N, T, P, S) be a CFG where
N = {S, A, B}, T = {0, 1} and
P = {S → AB|0, B → BA|1, A → AB|0}.
Then G0 = (N 0 , T, P 0 , S 0 ) in SCNF will be with

N 0 = {S 0 , SL , SR , AL , AR , BL , BR }
T = {0, 1}
P 0 = {S 0 → AL BR |0, SL → AL BR |0,
SR → AL BR |0, AL → AL BR |0,
AR → AL BR |0, BL → BL AR |1,
BR → BL AR |1}.
Introduction to Formal Languages, Automata and Computability – p.73/74
Greibach Normal Form
Definition Let  ∈/ L(G) and G = (N, T, P, S) be a
CFG. G is said to be in Greibach Normal Form
(GNF), if each rule in P rewrites a variable into a
word in T N ∗ i.e., each rule will be of the form
A → aα, a ∈ T , α ∈ N ∗ .

Introduction to Formal Languages, Automata and Computability – p.74/74

You might also like