You are on page 1of 15

Chapter: 3 Regular Expression

Topic 1: Regular Expressions – REs and Python:

Many a times, we are needed to extract required information from given data. For example,
we want to know the number of people who contacted us in the last month through Gmail or
we want to know the phone numbers of employees in a company whose names start with 'A'
or we want to retrieve the date of births of the patients in a hospital who joined for treatment
for hypertension, etc.

To get such information, we have to conduct the searching operation on the data. Once we get
required information, we have to extract that data for further use. Regular expressions are
useful to perform such operations on data.

 Regular Expressions

 A regular expression is a string that contains special symbols and characters to find and
extract the information needed by us from the given data.
 A regular expression helps us to search information, match, find and split information as
per our requirements.
 A regular expression is also called simply regex Regular expressions are available not
only in Python but also in many languages like Java, Perl, AWK, etc.
 Python provides re module that stands for regular expressions.
 This module contains methods like compile(), search0, match0, findall(), split(), etc,
which are used in finding the information in the available data.
 So, when we write a regular expression, we should import re module as:
import re

 Regular expressions are nothing but strings containing characters and special symbols.
Simple regular expression may look like this:
reg = r'm\w\w'

Meaning: any word starting with letter m and having length 3 letter.

 Sequence Character in RE

 Some of the special sequences beginning with '\' represent predefined sets of characters
that are often useful, such as the set of digits, the set of letters, or the set of anything that
isn’t whitespace.
 The following predefined special sequences are a subset of those available.
\d
Matches any decimal digit; this is equivalent to the class [0-9].
\D
Matches any non-digit character; this is equivalent to the class [^0-9].
\s
Matches any whitespace character; this is equivalent to the class [\t\n\r\f\v].
\S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w
Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W
Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
\A
Matches only at the start of the string.
\b
Matches the empty string, but only at the beginning or end of a word.
\B
Matches the empty string, but only when it is not at the beginning or end of a word. This
means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the
opposite of \b,
\Z
Matches only at the end of the string.
 Special Characters and Pattern Matching

Python allows you to specify a pattern to match when searching for a substring contained in a
string. This pattern is known as a regular expression. For example, if you want to find a
North American-style phone number in a string, you can define a pattern consisting of three
digits, followed by a dash (-), and followed by four more digits.

In a regular expression, certain characters have special meanings. The table below lists these
special characters and their meanings.

Character Meaning Example


* Zero or more ab*c matches ac, abc, abbc, and so on
occurrences of a
character
+ One or more ab+c matches abc, abbc, and so on
occurrences of a
character
? Zero or one occurrences ab?c matches ac and abc
of a character
. Any character a.*c matches any substring starting with a and
ending with c
[chars] Any character inside the a[bB]c matches abc and aBc
brackets
[char1-char2] A range of characters a[a-z]c matches a, followed by any non-
capitalized letter, followed by c
[^chars] Any character not inside a[^bB]c matches a, followed by anything
the brackets but b or B, followed by c
[^char1- Any character not in a a[^a-z]c matches a, followed by anything that is
char2] range of characters not a lower-case letter, followed by c
{num} An exact number of ab{3}c matches abbbc
occurrences of a
character
{num1,num2} A number of ab{1,3}c matches abc, abbc and abbbc
occurrences of a
character in a specified
range
| Matches either of two abc|aBc matches abc or aBc
alternatives
^ Matches the start of the ^abc matches abc in abcd, but does not
string only match abc in dabc
$ Matches the end of the abc$ matches abc in dabc, but does not
string only match abc in abcd
^(?!str) Matches anything other ^(?!abc) matches anything other than abc
than the specified string
^(?!str1|str2) Matches anything other ^(?!abc|def) matches anything other
than the two specified than abc and def
strings

If you use multiple special characters in a pattern, you can use parentheses () to specify the
order in which the characters are to be interpreted.

How to write RE Program:


 First step is define the RE
i.e reg= r’m\w\w’
It will return 3 letter word starting with letter m
 Next step is to compile this RE using compile() method of re module.
i.e prog = re.compile(r’m\w\w’)
Now prog represents an object that contain RE.
 Next step is to run this expression on string ‘str’ using search() method or match()
method.
str ='cat mat man mom'
#This is the string on which RE will act.
res = prog.search(str) # searching for RE in str
 Result is stored in res object and we can display it by calling the group() method on
object
print(res.group())

Following methods belong to the ‘re' module that are used in the regular expressions:

The match() method searches in the beginning of the string and if the matching string
is found, it returns an object that contains the resultant string, otherwise it returns
None. We can access the string from the returned object using group() method.

The search() method searches the string from beginning till the end and returns the
first occurrence of the matching string, otherwise it returns None. We can use group0
method to retrieve the string from the object returned by this method.

The findall() method searches the string from beginning till the end and returns all
occurrences of the matching string in the form of a list object. If the matching strings
are not found, then it returns an empty list. We can retrieve the resultant strings
from the list using a for loop.

The split0 method splits the string according to the regular expression and the
resultant pieces are returned as a list. If there are no string pieces, then it returns an
empty list. We can retrieve the resultant string pieces from the list using a for loop.

The sub() method-substitutes (or replaces) new strings in the place of existing strings.
After substitution, the main string is returned by this method

Program 1: A python program to create Regular Expression to search for string


starting with m and having total 3 character using search() method.

reg = r'm\w\w’
str ='cat mat man mom'
prog = re.compile(r'\m\w\w')
res = prog.search(str)
print res.group()

# it is not required to compile it again


str ="frormet"
res=prog.search (str)
print(res.group())
#instead of compiling and running re again we can write single statement
import re
str ='name nest abc nmc'
result=re.search(r'n\w\w',str)
print result.group()

note: here statement


result=re.search(r'n\w\w',str)
is equivalent to
prog = re.compile(r'n\w\w')
result = prog.search(str)
so in place of these two statement we can use single statement

Program 2: A python program to create Regular Expression to search for string


starting with m and having total 3 character using findall() method.

# Search method will return only first search so in place of search use findall() method
import re
str ='cat mat man mom'
prog = re.compile(r'\m\w\w')
res = prog.findall(str)
print "Findall Method return list"
print res #it will print as list

#we can use for loop also


print "Using for Loop"
for r in res:
print r
Program 3: A python program to create Regular Expression to search for string
starting with m and having total 3 character using match() method.

# match method will return matched string if it is in starting otherwise it returns None
import re
str = 'mat man mom'
res = re.match(r'\m\w\w',str)
res1 = re.match(r'\n\w\w',str)
print res.group()
print res1

Program 4: A python program to create Regular Expression to split a string into pieces
where one or more non alpha numeric characters are found.

#split method
import re
str = "this: is; python\'s book"
res = re.split(r'\W+',str)
print "Split with any charecter which is not alpha numeric"
print res

str = "this*is *python*s book"


res = re.split(r'\*',str)
print "split with *"
print res
Program 5: A python program to create Regular Expression to replace a string with
new string.

#RE also used to find and replace word. For this sub() method will be used
# Syntax : sub(regular expression, new string, string)
import re
str ="Today is sunday"
res = re.sub(r'sunday','Monday',str)
print res

 Repetition
Things get more interesting when you use + and * to specify repetition in the pattern
 + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
 * -- 0 or more occurrences of the pattern to its left
 ? -- match 0 or 1 occurrences of the pattern to its left

Program 6: A python program to create Regular Expression to retrieve all the word
starting with given string.

#program to retrieve all word starting with a


import re
str = 'an appple a day keeps the doctor away'
res= re.findall (r'a[\w]*',str)

for r in res:
print r
# But there is problem in output. The ay is not word but still we get as output. So to solve this
#use \b.
#using \b
res= re.findall (r'\ba[\w]*\b',str)
print "\nUsing \\b"
for r in res:
print r

Program 7: A python program to create Regular Expression to retrieve all the word
starting with given string.

import re
Str1 = '1 2 3 one 1st two three four five six seven 8eight 9nine 10ten as on in'
print "String is : " + Str1

print "\nWords with starting letter Digit"


res = re.findall(r'\d[\w]*',Str1)
for r in res:
print r

print "\nAll the words having three characters"


res = re.findall (r'\b\w{3}\b',Str1)
print res

print "\nword have length atleast 3 letter"


res = re.findall (r'\b\w{3,}\b',Str1)
print res

print "\nWords having length 2,3 or 4 letter"


res = re.findall (r'\b\w{2,4}\b',Str1)
print res

print "Retrive only digits"


res = re.findall (r'\b\d\b',Str1)
print res

print "\nLast word of string starting with i"


res = re.findall (r'i[\w]*\Z',Str1)
print res

print "\nFirst word of string starting with 1"


res = re.findall (r'\A1[\w]*',Str1)
print res

Output:

Program 7: Program using special characters

import re
str1 ="hello world"
print str1
print "\nSearch in starting r'^He'"
res = re.search(r'^He',str1,re.IGNORECASE)
if res:
print ("String start with 'He'")
else:
print "string does not start with 'He'"

print "\nSearch at end r'World$'"

res = re.search(r'World$',str1)
if res:
print ("String end with 'world'")
else:
print "string does not end with 'world'"

#for world output will be else part because we don’t have used ignore case
Output:

 Using RE on Files

 We can use RE not only on string, but also on file where huge data is available.
 Following program illustrate how can we apply RE on file

Program 8: Python Program to create RE that reads email-ids from a text file.

Text file: mail.txt

Code to retrieve email ids:


import re
f=open('E:\Varsha\LJIET\Python\Notes\mails.txt','r')
lst=[]
for line in f:
res = re.findall(r'\S+@\S+',line)
lst.append(res)
for i in lst:
print i
f.close()

Output:

Program 9: Python Program to retrieve data from file using RE and then write data in
another file.

Text File : Salary

Program:
import re
f1=open('E:\Varsha\LJIET\Python\Notes\salary.txt','r')
f2=open('newfile.txt','w')
print "ID\tSalary"
for line in f1:
res1=re.search(r'\d{4}',line)
res2=re.search(r'\d{4,}.\d{2}',line)
print res1.group() +"\t"+res2.group()
f2.write(res1.group()+"\t"+res2.group()+"\n")
f1.close()
f2.close()

Output:

Text file : newfile

 Explain the parameters of Match function of RE.

The match() method searches in the beginning of the string and if the matching string
is found, it returns an object that contains the resultant string, otherwise it returns
None. We can access the string from the returned object using group() method.

This function attempts to match RE pattern to string with optional flags.

Here is the syntax for this function −

re.match(pattern, string, flags=0)


Here is the description of the parameters −
Parameter & Description
1 pattern
This is the regular expression to be matched.
2 string
This is the string, which would be searched to match the pattern at the beginning of string.
3 flags
You can specify different flags using bitwise OR (|). These are modifiers, which are listed
in the table below.
 re Flags: Many Python Regex Methods and Regex functions(search, match) take an
optional argument called Flags. This flags can modify the meaning of the given Regex
pattern. Various flags used in Python includes

Syntax for Regex Flags What does this flag do


[re.M] Make begin/end consider each line
[re.I] It ignores case
[re.S] Make [ . ]
[re.U] Make { \w,\W,\b,\B} follows Unicode rules
[re.L] Make {\w,\W,\b,\B} follow locale
[re.X] Allow comment in Regex

The re.match function returns a match object on success, None on failure. We use
group(num) or groups() function of match object to get matched expression.
Match Object Method & Description
1 group(num=0) This method returns entire match (or specific subgroup num)
2 groups() This method returns all matching subgroups in a tuple (empty if there weren't
any)
Program: Demonstrate use of group and groups function using re.match():
import re

line = "Cats are smarter than dogs"

matchObj = re.match( r'(.*) are (.*?) .*', line, re.M|re.I)

if matchObj:
print "matchObj.group() : ", matchObj.group()
print "matchObj.group(1) : ", matchObj.group(1)
print "matchObj.group(2) : ", matchObj.group(2)
else:
print "No match!!"

Output:

 Explain the parameters of search function of RE.


This function searches for first occurrence of RE pattern within string with optional flags.

Here is the syntax for this function −

re.search(pattern, string, flags=0)


Here is the description of the parameters −
Parameter & Description
1 pattern This is the regular expression to be matched.
2 string This is the string, which would be searched to match the pattern anywhere in
the string.
3 flags You can specify different flags using bitwise OR (|). These are modifiers,
which are listed in the table below.

The re.search function returns a match object on success, none on failure. We


use group(num) or groups() function of match object to get matched expression.
Match Object Methods & Description
1 group(num=0) This method returns entire match (or specific subgroup num)
2 groups() This method returns all matching subgroups in a tuple (empty if
there weren't any)
Program: Demonstrate use of group and groups function using re.search()and use of
flag re.I:
import re

line = "Cats are smarter than dogs";

searchObj = re.search( r'(.*) are (.*?) .*', line, re.M|re.I)

if searchObj:
print "searchObj.group() : ", searchObj.group()
print "searchObj.group(1) : ", searchObj.group(1)
print "searchObj.group(2) : ", searchObj.group(2)
else:
print "Nothing found!!"

Output:

Program – Example that shows the use of re module

File: ReAllFunctionDemo.py

# Example of w+ and ^ Expression


import re

data = "ljiet32,education is fun"


r1 = re.findall(r"^\w+",data)
print r1

# Example of \s expression in re.split function

data = "ljiet32,education is fun"


r1 = re.findall(r"^\w+", data)
print (re.split(r'\s','we are splitting the words'))
print (re.split(r's','split the words'))

# Using re.findall for text

list = ["ljiet32 get", "ljiet32 give", "ljiet Selenium"]


for element in list:
z = re.match("(g\w+)\W(g\w+)", element)
if z:
print(z.groups())

patterns = ['software testing', 'ljiet32']


text = 'software testing is fun?'
for pattern in patterns:
print 'Looking for "%s" in "%s" ->' % (pattern, text),
if re.search(pattern, text):
print 'found a match!'
else:
print 'no match'
abc = 'ljiet32@google.com, careerljiet32@hotmail.com,
users@yahoomail.com'
emails = re.findall(r'[\w\.-]+@[\w\.-]+', abc)
for email in emails:
print email

# Example of re.M or Multiline Flags

data = """ljiet32
careerljiet32
selenium"""
k1 = re.findall(r"^\w", data)
k2 = re.findall(r"^\w", data, re.MULTILINE)
print k1
print k2
Output:

You might also like