You are on page 1of 25

Regular Expressions

Reading: Chapter 3

1
Regular Expressions vs. Finite
Automata
◼ Offers a declarative way to express the pattern of any
string we want to accept
◼ E.g., 01*+ 10*

◼ Automata => more machine-like


< input: string , output: [accept/reject] >
◼ Regular expressions => more program syntax-like
◼ Unix environments heavily use regular expressions
◼ E.g., bash shell, grep, vi & other editors, sed
◼ Perl scripting – good for string processing
◼ Lexical analyzers such as Lex or Flex(used to generate lexical analyzers
(scanners or lexers) for programming languages)

2
Regular Expressions
Regular = Finite Automata
expressions (DFA, NFA, -NFA)
Syntactical
expressions Automata/machines

Regular
Languages

Formal language
classes

3
regular expressions and finite automata are two different ways of
describing the same thing. They are both used to describe regular
languages, which are a subset of all possible languages.
Regular languages are important in computer science because they are
used to define the syntax of programming languages, regular
expressions, and other formal languages.
A formal language is a set of strings of symbols that may be constrained
by rules that are specific to it. It is used to describe the syntax of
programming languages, regular expressions, and other formal
languages.
A regular language is a formal language that can be recognized by a
finite automaton. It is a subset of all possible languages.
4
Language Operators
◼ Union of two languages:
◼ L U M = all strings that are either in L or M
◼ Note: A union of two languages produces a third
language

◼ Concatenation of two languages:


◼ L . M = all strings that are of the form xy
s.t., x  L and y  M
◼ The dot operator is usually omitted
◼ i.e., LM is same as L.M
5
Union of two languages (L U M):
◼ Given two languages L and M, the union of L and M denoted as L U M is
defined as the set of all strings that are either in L or in M. In other words, L U
M is the language that contains all the strings that belong to L or M or both.
◼ Example:
Let L be the language of all strings consisting of only the letter 'a': L = {a, aa,
aaa, ...}
Let M be the language of all strings consisting of only the letter 'b': M = {b, bb,
bbb, ...}
The union of L and M, denoted as L U M, would be the language that contains
all strings consisting of only 'a's, all strings consisting of only 'b's, and any
combination thereof: L U M = {a, aa, aaa, ..., b, bb, bbb, ...}

6
Concatenation of two languages
(L . M):
◼ Given two languages L and M, the concatenation of L and M is denoted as L . M,
and it represents the set of all strings that can be formed by concatenating a
string from L with a string from M.
◼ Example:
Let L be the language of all strings consisting of the letter 'a' repeated one or
more times: L = {a, aa, aaa, ...}
Let M be the language of all strings consisting of the letter 'b' repeated one or
more times: M = {b, bb, bbb, ...}
The concatenation of L and M, denoted as L . M or simply LM, would be the
language that contains all possible combinations of strings from L followed by
strings from M: LM = {ab, aab, aaab, ..., abb, aabb, aaabb, ...}

7
Regular expression patterns
The pattern can include literal characters, metacharacters, and operators.
Metacharacters:
• . (dot): Matches any single character except a newline.
• \ (backslash): Escapes a special character, allowing it to be used as a literal
character.
• [] (square brackets): Matches any single character within the brackets.
• [^] (caret within square brackets): Matches any single character not within the
brackets.
• \d, \w, \s: Shorthand character classes representing digits, word characters, and
whitespace characters, respectively.
Operators:
• (asterisk): Matches zero or more occurrences of the preceding element.
• (plus): Matches one or more occurrences of the preceding element.
• ? (question mark): Matches zero or one occurrence of the preceding element.
• | (pipe): Acts as an OR operator, allowing alternative matches.
• () (parentheses): Groups expressions and establishes operator precedence 8
“i” here refers to how many strings to concatenate from the parent
language L to produce strings in the language Li

Kleene Closure (the * operator)


◼ Kleene Closure of a given language L:
◼ L0= {}
◼ L1= {w | for some w  L}
◼ L2= { w1w2 | w1  L, w2  L (duplicates allowed)}
◼ Li= { w1w2…wi | all w’s chosen are  L (duplicates allowed)}
◼ (Note: the choice of each wi is independent)
◼ L* = Ui≥0 Li (arbitrary number of concatenations)
Example:
◼ Let L = { 1, 00}
◼ L0= {}
◼ L1= {1,00}
◼ L2= {11,100,001,0000}
◼ L3= {111,1100,1001,10000,000000,00001,00100,0011}
◼ L* = L0 U L1 U L2 U …

9
Kleene Closure (special notes)

◼ L* is an infinite set iff |L|≥1 and L≠{} Why?

◼ If L={}, then L* = {} Why?


◼ If L = Φ, then L* = {} Why?

Σ* denotes the set of all words over an


alphabet Σ
◼ Therefore, an abbreviated way of saying
there is an arbitrary language L over an
alphabet Σ is:
◼ L  Σ*
10
Building Regular Expressions
◼ Let E be a regular expression and the
language represented by E is L(E)
◼ Then:
◼ (E) = E
◼ L(E + F) = L(E) U L(F)
◼ L(E F) = L(E) L(F)
◼ L(E*) = (L(E))*

11
Example: how to use these regular expression
properties and language operators?
◼ L = { w | w is a binary string which does not contain two consecutive 0s
or two consecutive 1s anywhere)
◼ E.g., w = 01010101 is in L, while w = 10010 is not in L
◼ Goal: Build a regular expression for L
◼ Four cases for w:
◼ Case A: w starts with 0 and |w| is even
◼ Case B: w starts with 1 and |w| is even
◼ Case C: w starts with 0 and |w| is odd
◼ Case D: w starts with 1 and |w| is odd
◼ Regular expression for the four cases:
◼ Case A: (01)*
◼ Case B: (10)*
◼ Case C: 0(10)*
◼ Case D: 1(01)*
◼ Since L is the union of all 4 cases:
◼ Reg Exp for L = (01)* + (10)* + 0(10)* + 1(01)*
◼ If we introduce  then the regular expression can be simplified to:
◼ Reg Exp for L = ( +1)(01)*( +0) 12
• (ε + 1): This part represents either the empty string (ε) or the digit "1".
It matches either an empty input or the single character "1".

• (01)*: This part represents zero or more occurrences of the string "01".
It matches strings that consist of any number of repetitions of "01".

• (ε + 0): This part represents either the empty string (ε) or the digit "0".
It matches either an empty input or the single character "0".

13
Precedence of Operators
◼ Highest to lowest
◼ * operator (star)
◼ . (concatenation)
◼ + operator

◼ Example:
◼ 01* + 1 = ( 0 . ((1)*) ) + 1

14
◼ (Star) Operator:
The * operator, also known as the Kleene star operator, is
used to denote zero or more occurrences of the preceding
element or group in a regular expression.
◼ Example: ab*c
This regular expression matches strings that start with an
'a', followed by zero or more 'b's, and ends with a 'c'. It
would match strings like "ac", "abc", "abbc", "abbbc", and
so on.

15
◼ (Dot) Operator (Concatenation):
The . operator, also known as the dot operator or
concatenation operator, is used to concatenate or join
two regular expressions or elements together.
◼ Example: ab.c
This regular expression matches strings that start with an
'a', followed by a 'b', then any character (except newline),
and ends with a 'c'. It would match strings like "abc",
"ab1c", "abXc", and so on.

16
◼ (Plus) Operator:
The + operator is used to denote one or more occurrences
of the preceding element or group in a regular expression.
◼ Example: a+b
This regular expression matches strings that start with one
or more 'a's, followed by a 'b'. It would match strings like
"ab", "aab", "aaab", and so on.

17
Finite Automata (FA) & Regular
Expressions (Reg Ex)
◼ To show that they are interchangeable,
consider the following theorems:
◼ Theorem 1: For every DFA A there exists a regular
Proofs expression R such that L(R)=L(A)
in the book ◼ Theorem 2: For every regular expression R there
exists an  -NFA E such that L(E)=L(R)

 -NFA NFA
Theorem 2 Kleene Theorem

Reg Ex DFA
Theorem 1
18
Kleene Theorem
◼ The theorem states that for any language that can be
described by a regular expression, there exists a
corresponding finite automaton that recognizes the
same language. Likewise, for any language recognized
by a finite automaton, there exists a regular expression
that describes the same language.
◼ In other words, Kleene's theorem provides a way to
convert between regular expressions and finite
automata, demonstrating their equivalence in terms of
expressive power
19
DFA Reg Ex
Theorem 1

DFA to RE construction
Informally, trace all distinct paths (traversing cycles only once)
from the start state to each of the final states
and enumerate all the expressions along the way

Example: 1 0 0,1

q0 0 q1 1 q2

(1*) 0 (0*) 1 (0 + 1)*

1* 00* 1 (0+1)*

Q) What is the language?

1*00*1(0+1)*
20
Reg Ex  -NFA
Theorem 2

RE to -NFA construction
Example: (0+1)*01(0+1)*

(0+1)* 01 (0+1)*

 

0
 
0

 0 1

 1
  1

 

21
Algebraic Laws of Regular
Expressions
◼ Commutative:
◼ E+F = F+E
◼ Associative:
◼ (E+F)+G = E+(F+G)
◼ (EF)G = E(FG)
Φ represents the empty set or the empty
◼ Identity: language, while  represents the empty
◼ E+Φ = E string or the absence of any characters.
◼ E=E=E Φ denotes the absence of elements or
strings, while  denotes the absence of
◼ Annihilator: characters or the possibility of an empty
◼ ΦE = EΦ = Φ string being present.
22
Algebraic Laws…
◼ Distributive:
◼ E(F+G) = EF + EG
◼ (F+G)E = FE+GE
◼ Idempotent: E + E = E
◼ Involving Kleene closures:
◼ (E*)* = E*
◼ Φ* =
◼ * =
◼ E+ =EE*
◼ E? =  +E
◼ ("ε + E" matches either the empty string or the symbol or expression "E")
23
True or False?
Let R and S be two regular expressions. Then:

1. ((R*)*)* = R* ?

2. (R+S)* = R* + S* ?

3. (RS + R)* RS = (RR*S)* ?

24
Summary
◼ Regular expressions
◼ Equivalence to finite automata
◼ DFA to regular expression conversion
◼ Regular expression to -NFA
conversion
◼ Algebraic laws of regular expressions
◼ Unix regular expressions and Lexical
Analyzer
25

You might also like