Chapter_2 (2-2)
Regular Expressions
Regular expressions: a method for describing
regular languages in formal language theory.
Lexical Analysis (scanner)
[Figure: Source program (character stream) → Lexical analyzer → Tokens]
▪ The lexical analyzer is
also known as a
scanner.
◼ The lexical analyzer needs to scan and identify
only a finite set of valid strings (tokens/lexemes)
that belong to the language at hand.
Example: average = (sum/count)
Lexeme     Token
average    identifier
=          assignment operator
(          open parenthesis
sum        identifier
/          division operator
count      identifier
)          close parenthesis
Valid Token/string
▪ There are some predefined rules for every lexeme to
be identified as a valid token.
▪ These rules are given by the grammar of the language, by
means of patterns.
o A pattern describes what can be a token, and these
patterns are defined by means of regular expressions.
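The link between patterns and tokens can be sketched in Python. This is only an illustrative sketch: the token names and the patterns in TOKEN_PATTERNS are assumptions chosen to fit the example statement, and the function tokenize is a hypothetical helper, not part of any real compiler.

```python
import re

# Hypothetical token patterns (assumptions for this example only).
# Each pair is (token name, regular expression describing its lexemes).
TOKEN_PATTERNS = [
    ("identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("assignment operator", r"="),
    ("open parenthesis", r"\("),
    ("close parenthesis", r"\)"),
    ("division operator", r"/"),
    ("whitespace", r"\s+"),
]

# Combine all patterns into one alternation, one named group per token type.
MASTER = re.compile(
    "|".join(f"(?P<T{i}>{p})" for i, (_, p) in enumerate(TOKEN_PATTERNS))
)


def tokenize(text):
    """Return (lexeme, token name) pairs; raise ValueError on an invalid character."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"invalid character {text[pos]!r} at position {pos}")
        # The group name "T<i>" tells us which pattern matched.
        name = TOKEN_PATTERNS[int(m.lastgroup[1:])][0]
        if name != "whitespace":  # whitespace separates tokens but is not one
            tokens.append((m.group(), name))
        pos = m.end()
    return tokens
```

Calling tokenize("average = (sum/count)") yields the seven lexeme/token pairs listed for that statement, while a character that fits no pattern is rejected.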
History
Stephen Cole Kleene
Regular expressions originated in 1951, when
mathematician Stephen Cole Kleene described regular
languages using his mathematical notation called regular
events.
These arose in theoretical computer science, in the
subfields of automata theory (models of computation)
and the description and classification of formal
languages.
Formal Language and Natural Language
• Natural language: a language normally spoken by
people (Arabic, English, …).
• Formal language: a language that can be specified
precisely and is amenable to use with computers.
• A (formal) language is a set of strings from a given
alphabet.
• The syntax of Java is an example of a formal language.
Regular Language
A regular language (also called a rational language) is a formal
language that can be defined by a regular expression.
A simple example of a language that is not regular is the set of
strings { a^n b^n | n ≥ 0 }.
Intuitively, it cannot be recognized by a finite automaton, since a
finite automaton has finite memory and cannot remember the exact
number of a's.
Equivalent formalisms
A regular language satisfies the following equivalent properties:
✓ it is the language of a regular expression (by the above definition)
✓ it is the language accepted by a nondeterministic finite automaton (NFA)
✓ it is the language accepted by a deterministic finite automaton (DFA)
✓ it can be generated by a regular grammar
✓ it is the language accepted by an alternating finite automaton
✓ it is the language accepted by a two-way finite automaton
✓ it can be generated by a prefix grammar
✓ it can be accepted by a read-only Turing machine
✓ it can be defined in monadic second-order logic (Büchi–Elgot–
Trakhtenbrot theorem)
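The equivalence between regular expressions and DFAs can be illustrated on one small language. The sketch below is an assumption-laden example, not taken from the slides: it picks the language of strings over {a, b} that end in "ab", writes a hypothetical three-state DFA for it, and compares the DFA with the expression (a|b)*ab using Python's re module.

```python
import re
from itertools import product

# Hypothetical DFA over {a, b} accepting strings that end in "ab".
# q0: no progress, q1: last symbol was 'a', q2: last two symbols were "ab".
TRANSITIONS = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2",
    ("q2", "a"): "q1", ("q2", "b"): "q0",
}
START, ACCEPTING = "q0", {"q2"}


def dfa_accepts(s):
    """Run the DFA on s; reject symbols outside the alphabet."""
    state = START
    for ch in s:
        if (state, ch) not in TRANSITIONS:
            return False
        state = TRANSITIONS[(state, ch)]
    return state in ACCEPTING


def regex_accepts(s):
    """Same language, described by a regular expression."""
    return re.fullmatch(r"(a|b)*ab", s) is not None


def all_strings(alphabet, max_len):
    """Every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)
```

Comparing dfa_accepts and regex_accepts on every string from all_strings("ab", 6) is a finite spot check of the equivalence for this one language, not a proof.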
What is a Regular Expression?
Regular expressions are used to represent regular
languages. If a language cannot be represented by any
regular expression, then that language is not
regular.
A regular expression, regex or regexp (sometimes called
a rational expression) is a sequence of characters that
defines a search pattern.
Usually, this pattern is used by string-searching algorithms
for "find" or "find and replace" operations on strings, or for
input validation. It is a technique that was developed in
theoretical computer science and formal language theory.
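The three uses named above (find, find and replace, input validation) can be sketched with Python's standard re module. The sample text, the pattern colou?r and the four-digit check are assumptions made up for this illustration.

```python
import re

# Sample text and pattern are illustrative assumptions.
text = "The colour of the sky and the color of the sea."
pattern = re.compile(r"colou?r")  # 'u' is optional

# "Find": collect every substring that matches the pattern.
found = pattern.findall(text)

# "Find and replace": substitute every match with a fixed string.
replaced = pattern.sub("COLOR", text)

# Input validation: the whole input must match, not just a part of it.
is_valid_year = re.fullmatch(r"[0-9]{4}", "2024") is not None
is_valid_other = re.fullmatch(r"[0-9]{4}", "20x4") is not None
```

Note that practical regex engines such as this one offer notation (character classes, counted repetition) beyond the three basic operations of formal regular expressions.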
"find" or "find and replace" operations in word file
REGULAR EXPRESSIONS
• Regular expressions are the metalanguage used to
define the token types of a programming language.
• Regular expressions consist of constants, which
denote sets of strings, and operator symbols, which
denote operations over these sets.
Example of a regular expression:
( a ( b + c )* )* d
Given a finite alphabet Σ, the following constants are
defined as regular expressions:
• (empty set) ∅, denoting the set ∅.
• (empty string) ε, denoting the set containing only the "empty"
string, which has no characters at all.
• (literal character) a in Σ, denoting the set containing only the
character a.
Language Elements:
An alphabet Σ is a finite set of symbols (characters)
A string s is a finite sequence of symbols from Σ
• |s| denotes the length of string s
• ε denotes the empty string, thus |ε| = 0
A language is a specific set of strings over some fixed
alphabet Σ
Example (assume the alphabet of the language is {0, 1}):
Then Σ = {0, 1}
L1 is {0, 10, 1011}
L2 is {ε, 0, 00, 000, 0000, 00000, ...}
A language is a set of strings
A string is a finite sequence of symbols taken
from a finite alphabet
• The C language is the (infinite) set of all strings that
constitute legal C programs
• The language of C reserved words is the (finite) set of all
alphabetic strings that cannot be used as identifiers in
C programs
There are a number of algebraic laws that are obeyed by
regular expressions, which can be used to manipulate
regular expressions into equivalent forms.
• Regular expressions are formulas built from
three operations on languages:
1. Union.
2. Concatenation.
3. Kleene Star.
Language Operators
(1) Union of two languages:
• L ∪ M = all strings that are in L or in M (or in both)
• Note: A union of two languages produces a third
language
The union of two sets is that set which contains all the
elements in each of the two sets and nothing else.
In regular expressions, the union operation on languages is written with a '+'.
For example,
1. {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
2. L + ∅ = L
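Since a language is a set of strings, the union examples above can be reproduced with Python's built-in set type. This is a minimal sketch; the variable names are chosen only for this illustration.

```python
# Languages as Python sets of strings (names are illustrative).
L = {"abc", "ab", "ba"}
M = {"ba", "bb"}
EMPTY = set()  # the empty language ∅

# Union keeps one copy of each string, so "ba" appears only once.
union = L | M

# The empty language is the identity for union: L + ∅ = L.
union_with_empty = L | EMPTY
```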
(2) Concatenation of two languages:
• L . M = all strings of the form xy
such that x ∈ L and y ∈ M
• The dot operator is usually omitted
• i.e., LM is the same as L.M
In order to define concatenation of languages, we must
first define concatenation of strings.
Concatenating two strings simply joins them end to end to form a new string.
For example,
abc . ba = abcba
Note that any string concatenated with the null string is
that string itself:
s . ε = s.
The concatenation of two languages is that language
formed by concatenating each string in one language with
each string in the other language.
For example,
{ab, a, c} . {b, ε} = {ab.b, ab.ε, a.b, a.ε, c.b, c.ε}
= {abb, ab, a, cb, c}
In this example, the string ab need not be listed twice.
Note that if L1 and L2 are two languages, then L1 . L2 is not
necessarily equal to L2 . L1.
Also, L . {ε} = L, but L . ∅ = ∅.
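Concatenation of languages can be sketched the same way, with Python sets standing in for languages and the empty Python string standing in for ε. The helper name concat is an assumption introduced for this illustration.

```python
def concat(L1, L2):
    """Concatenate every string of L1 with every string of L2."""
    return {x + y for x in L1 for y in L2}


EPSILON = ""     # the empty string ε
EMPTY = set()    # the empty language ∅

L = {"ab", "a", "c"}
example = concat(L, {"b", EPSILON})

# {ε} is the identity for concatenation, ∅ is the annihilator.
with_epsilon = concat(L, {EPSILON})
with_empty = concat(L, EMPTY)

# Concatenation is not commutative in general.
ab = concat({"a"}, {"b"})
ba = concat({"b"}, {"a"})
```

Because the result is a set, the string ab (produced both as ab.ε and as a.b) is stored only once.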
(3) Kleene Closure (the * operator)
This operation is a unary operation (designated by
a postfix asterisk) and is often called closure.
If L is a language, we define:
L^0 = {ε}
L^1 = L
L^2 = L . L
L^3 = L . L . L
L^n = L . L^(n-1)
L* = L^0 + L^1 + L^2 + L^3 + L^4 + L^5 + ...
Here "i" is the number of strings from the parent language L that are
concatenated to produce each string in the language L^i
Kleene Closure of a given language L:
• L^0 = {ε}
• L^1 = {w | w ∈ L}
• L^2 = {w1w2 | w1 ∈ L, w2 ∈ L (duplicates allowed)}
• L^i = {w1w2…wi | every wj is in L (duplicates allowed)}
• (Note: the choice of each wi is independent)
• L* = ∪ (i ≥ 0) L^i (arbitrary number of concatenations)
special notes
L* is an infinite set iff |L| ≥ 1 and L ≠ {ε}
If L = {ε}, then L* = {ε}
If L = ∅, then L* = {ε}
Σ* denotes the set of all words over an
alphabet Σ
• Therefore, an abbreviated way of saying there
is an arbitrary language L over an alphabet Σ
is:
• L ⊆ Σ*
Precedence of Operators
Highest to lowest:
(*) operator (star) has the highest precedence
(.) (concatenation) has the second highest
precedence
(+) operator has the lowest precedence of all
Example:
01* + 1 = ( 0 . ((1)*) ) + 1
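The precedence rule can be checked experimentally. In the sketch below, which uses Python's re module, union is written | instead of +; the helper name language and the length bound of 5 are assumptions made for this illustration. The expression written without parentheses is compared with the fully parenthesized form and with a deliberately different grouping.

```python
import re
from itertools import product

implicit = re.compile(r"01*|1")        # relies on operator precedence
explicit = re.compile(r"(0((1)*))|1")  # fully parenthesized reading
other = re.compile(r"(01)*|1")         # a different grouping, for contrast


def language(pattern, max_len=5):
    """Binary strings of length 0..max_len matched in full by the pattern."""
    return {
        "".join(bits)
        for n in range(max_len + 1)
        for bits in product("01", repeat=n)
        if pattern.fullmatch("".join(bits))
    }
```

Up to length 5, language(implicit) and language(explicit) contain the same strings, while language(other) differs: for example, it contains 0101 but not 011.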
Example:
Let L = {1, 00}
L^0 = {ε}
L^1 = {1, 00}
L^2 = {11, 100, 001, 0000}
L^3 = {111, 1100, 1001, 10000, 000000, 00001, 00100, 0011}
…
L* = L^0 ∪ L^1 ∪ L^2 ∪ …
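The powers in this example can be computed mechanically. The sketch below represents languages as Python sets and ε as the empty string; the helper names concat, power and star_up_to are assumptions introduced here. Since L* is usually infinite, star_up_to only builds the finite approximation L^0 ∪ L^1 ∪ … ∪ L^k.

```python
def concat(L1, L2):
    """Concatenate every string of L1 with every string of L2."""
    return {x + y for x in L1 for y in L2}


def power(L, n):
    """Compute L^n using L^0 = {ε} and L^n = L . L^(n-1)."""
    result = {""}  # L^0 = {ε}
    for _ in range(n):
        result = concat(L, result)
    return result


def star_up_to(L, k):
    """Finite approximation of L*: the union of L^0 through L^k."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result


L = {"1", "00"}
```

For this L, power(L, 2) and power(L, 3) give the four and eight strings listed above, and star_up_to(set(), k) is {ε} for every k, in line with ∅* = {ε}.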
Example:
If M is a language:
, we define:
Examples
Regular expression    Language
a|b                   L = {a, b}
(a | b)(a | b)        L = {aa, ab, ba, bb}
a*                    L = {ε, a, aa, aaa, ...}
(a | b)*              The set of all strings of a's and b's.
                      L = {ε, a, b, aa, ab, ba, bb, ...}
a | a*b               The set containing the string a and
                      all strings consisting of zero or
                      more a's followed by b.
                      L = {a, b, ab, aab, …, a…ab}
Examples
a|b* denotes {ε, "a", "b", "bb", "bbb", …}
(a|b)* denotes the set of all strings with no symbols other than "a"
and "b", including the empty string: {ε, "a", "b", "aa", "ab", "ba", "bb",
"aaa", …}
ab*(c|ε) denotes the set of strings starting with "a", then zero or
more "b"s and finally optionally a "c": {"a", "ac", "ab", "abc", "abb",
"abbc", …}
(0|(1(01*0)*1))* denotes the set of binary numbers that are
multiples of 3: { ε, "0", "00", "11", "000", "011", "110", "0000", "0011",
"0110", "1001", "1100", "1111", "00000", … }
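These examples can be spot-checked with Python's re module, where the empty alternative in (c|) plays the role of ε. This is a sketch under that assumption; the helper names in_language and is_multiple_of_3 are made up for the illustration, and checking a finite range of numbers is evidence, not a proof.

```python
import re

MULTIPLE_OF_3 = re.compile(r"(0|(1(01*0)*1))*")


def in_language(pattern, s):
    """True if the whole string s matches the regular expression."""
    return re.fullmatch(pattern, s) is not None


def is_multiple_of_3(bits):
    """Decide divisibility by 3 of a binary numeral using only the regex."""
    return MULTIPLE_OF_3.fullmatch(bits) is not None


# Compare the regex with ordinary arithmetic on the first few hundred numbers.
agrees_with_arithmetic = all(
    is_multiple_of_3(format(n, "b")) == (n % 3 == 0) for n in range(300)
)
```

For instance, in_language(r"ab*(c|)", "abbc") holds, while in_language(r"a|b*", "ab") does not, because a|b* means either a single a or a run of b's.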
Algebraic Laws of Regular Expressions
Commutative:
• E+F = F+E
Associative:
• (E+F)+G = E+(F+G)
• (EF)G = E(FG)
Identity:
• E + ∅ = E
• εE = Eε = E
Annihilator:
• ∅E = E∅ = ∅
Distributive:
• E(F+G) = EF + EG
• (F+G)E = FE+GE
Idempotent: E + E = E
Involving Kleene closures:
• (E*)* = E*
• ∅* = ε
• ε* = ε
• E^+ = EE*
• E? = ε + E
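The laws that do not involve the star can be checked on concrete finite languages. The sketch below models languages as Python sets, with | as union and a hypothetical helper concat for concatenation; the particular languages E, F and G are arbitrary assumptions. Passing on one example illustrates the laws but does not prove them.

```python
def concat(L1, L2):
    """Concatenate every string of L1 with every string of L2."""
    return {x + y for x in L1 for y in L2}


# Arbitrary sample languages (assumptions for this illustration).
E, F, G = {"a", "ab"}, {"b", ""}, {"ba", "c"}
EMPTY, EPS = set(), {""}  # ∅ and {ε}

checks = {
    "E+F = F+E": (E | F) == (F | E),
    "(E+F)+G = E+(F+G)": ((E | F) | G) == (E | (F | G)),
    "(EF)G = E(FG)": concat(concat(E, F), G) == concat(E, concat(F, G)),
    "E+∅ = E": (E | EMPTY) == E,
    "εE = Eε = E": concat(EPS, E) == E and concat(E, EPS) == E,
    "∅E = E∅ = ∅": concat(EMPTY, E) == EMPTY and concat(E, EMPTY) == EMPTY,
    "E(F+G) = EF+EG": concat(E, F | G) == (concat(E, F) | concat(E, G)),
    "(F+G)E = FE+GE": concat(F | G, E) == (concat(F, E) | concat(G, E)),
    "E+E = E": (E | E) == E,
}
```

Each entry of checks maps a law to the result of testing it on the sample languages.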
Summary
• Regular expressions are formulas built from three
operations on languages: union, concatenation, and Kleene
star.
• Union: the union of two sets is the set which contains all the
elements in each of the two sets and nothing else. It is
written with a '+'.
• For example: {abc, ab, ba} + {ba, bb} = {abc, ab, ba, bb}
• Concatenation: concatenating each string in one set with each
string in the other set. It is written with a '.'
• For example, {ab, a, c} . {b} = {ab.b, a.b, c.b} = {abb, ab, cb}
• Kleene star: generates zero or more concatenations of strings
from the language to which it is applied. It is written
with a '*'.
• For example, a* = {ε, a, aa, aaa, aaaa, aaaaa, ...}
◼ Optional characters: ?, * and +
➢ ? (0 or 1)
◼ /colou?r/ ➔ color or colour
➢ * (0 or more)
◼ /oo*h!/ ➔ oh! or ooh! or ooooh!
➢ + (1 or more)
◼ /o+h!/ ➔ oh! or ooh! or ooooh!
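These three operators behave the same way in Python's re module, so the patterns above can be tried directly. This is a small sketch; the helper name matches is an assumption for the illustration. Matching is case-sensitive, so only lowercase strings are tested.

```python
import re


def matches(pattern, s):
    """True if the whole string s matches the pattern."""
    return re.fullmatch(pattern, s) is not None


samples = ["h!", "oh!", "ooh!", "ooooh!"]

# oo*h! and o+h! describe the same language: one or more o's, then h!
star_results = [matches(r"oo*h!", s) for s in samples]
plus_results = [matches(r"o+h!", s) for s in samples]

# ? makes the preceding character optional.
colour_results = [matches(r"colou?r", s) for s in ["color", "colour", "colouur"]]
```

The agreement between star_results and plus_results is an instance of the law E^+ = EE*.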
Examples
• For each of the following regular expressions,
list six strings which are in its language.
1. (a(b+c)*)*d
2. (a+b)*(c+d)
3. (a*b*)*
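Regular Expression for a Language
Example:
Strings of a's and b's that start and end with a.
a (a | b)* a
(This covers strings of length two or more; writing a | a (a | b)* a
also includes the single-letter string a.)
Example:
All strings of lowercase letters in which the letters
are in ascending lexicographic order.
a* b* c* … z*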
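Both expressions can be written out and tried in Python. In this sketch the second pattern is generated from string.ascii_lowercase instead of being typed letter by letter; the constant and function names are assumptions made for the illustration.

```python
import re
import string

# a (a | b)* a, exactly as written in the first example.
START_END_A = re.compile(r"a(a|b)*a")

# a* b* c* ... z*, built by appending '*' to each lowercase letter in order.
ASCENDING = re.compile("".join(ch + "*" for ch in string.ascii_lowercase))


def starts_and_ends_with_a(s):
    return START_END_A.fullmatch(s) is not None


def is_ascending(s):
    return ASCENDING.fullmatch(s) is not None
```

For example, is_ascending accepts "aabbcc" and "best" but rejects "ba", since b would have to come after a.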
Exercises
• Suppose L1 represents the set of all strings
over the alphabet {0,1} which contain an even
number of ones (even parity). Which of the
following strings belong to L1?
(a) 0101
(b) 110211
(c) 000
(d) 010011
(e) ε
• Suppose L2 represents the set of all strings
over the alphabet {a,b,c} which contain an
equal number of a's, b's, and c's. Which of
the following strings belong to L2?
(a) bca
(b) accbab
(c) ε
(d) aaa
(e) aabbcc
• Which of the following strings belong to the
language specified by this regular expression:
(a+bb)*a
(a) ε
(b) aaa
(c) ba
(d) bba
(e) abba
TRUE OR FALSE?
Let R and S be two regular expressions.
Then:
1. ((R*)*)* = R* ?
2. (R+S)* = R* + S* ?
3. (RS + R)* RS = (RR*S)* ?
More Examples of Regular Expression
1. A regular expression for strings with no 0 or many triples of 0's, and many 1's.
2. A regular expression for strings with one or many occurrences of 11, or no 11.
3. A regular expression for strings ending with abb.
4. A regular expression for all strings containing 010 or 101.
5. A regular expression for even-length strings over {a,b}.
6. A regular expression for strings having at least one double 0 or double 1.
7. A regular expression for strings starting with 0 and having multiple even 1's or no 1.
8. A regular expression for strings with an odd number of 0's or an odd number of 1's.
9. A regular expression for strings of multiple double 1's, or the null string.
10. A regular expression for strings starting with 0 and ending with 1.