Professional Documents
Culture Documents
REGULAR EXPRESSIONS
Regular Expressions is a special sequence of characters that helps to match , find ,replace and extract string
substrings or sets of strings, in a given file or string using a specialized syntax held in a pattern.
• It is extremely useful for extracting information from text such as code, files, log, spreadsheets or even
documents.
– Parsing a strings
The regular expressions are themselves little programs to search and parse strings. To use them in
our program, the library/module re must be imported.
In [2]:
import re
a) Meta characters: As the name suggests, these characters have a special meaning, similar to * in wild card.
These characters don’t match themselves. Instead, they signal that characters should be matched.
Example : . ^ $ * + ? { } [ ] \ | ( )
b) Literals (like a,b,1,2…) : Most letters and characters will simply match themselves.
For example, the regular expression test will match the string test exactly.
The ‘re’ package provides multiple methods to perform queries on an input string. Here are the most
commonly used methods, I will discuss:
1. re.match()
2. re.search()
3. re.findall()
4. re.split()
5. re.sub()
6. re.compile()
In [4]:
help(re.match)
Example : Call match on string () : Corona virus pandemic in 2020 , it will return all the string that matches
corona but not virus
In [61]:
import re
result = re.match('Corona ', 'Corona virus pandemic in 2020')
print(result.group())
Corona
In [ ]:
In [56]:
import re
result = re.match('virus', 'Corona virus pandemic in 2020')
print(result.group())
--------------------------------------------------------------------------
-
AttributeError Traceback (most recent call las
t)
<ipython-input-56-ef82acfa5718> in <module>
1 import re
2 result = re.match('virus', 'Corona virus pandemic in 2020')
----> 3 print(result.group())
re.search()
It is similar to match() but it doesn’t restrict us to find matches at the beginning of the string only.
In [63]:
import re
result = re.search('virus', 'Corona virus pandemic in 2020')
print(result.group())
virus
In [ ]:
In [69]:
import re
fhand = open('myfile.txt')
for line in fhand:
line = line.rstrip()
if re.search('how', line):
print(line)
In [75]:
import re
hand = open('myfile.txt')
for line in hand:
line = line.rstrip()
if re.search('you\?$', line):
print(line)
Eg 1 : Consider the following example, where the regular expression is for searching lines which
starts with I and has any two characters (any character represented by two dots) and then has a
character m.
In [18]:
import re
fhand = open('myfile.txt')
for line in fhand:
line = line.rstrip()
if re.search('^I..m', line):
print(line)
I am doing fine
In the previous program, we knew that there are exactly two characters between I and m. Hence, we could
able to give two dots. But, when we don’t know the exact number of characters between two characters (or
strings), we can make use of dot and + symbols together
In [19]:
import re
hand = open('myfile.txt')
for line in hand:
line = line.rstrip()
if re.search('^h.+u', line):
print(line)
Observe the regular expression ^h.+u here. It indicates that, the string should be starting with h and ending
with u and there may by any number of (dot and +) characters inbetween.
In [28]:
import re
fhand = open('mbox-short.txt')
for line in fhand:
line = line.rstrip()
if re.search('^From: ', line):
print(line)
From: stephen.marquard@uct.ac.za
In [29]:
From: stephen.marquard@uct.ac.za
In [30]:
# Search for lines that start with From and have an at sign
import re
hand = open('mbox-short.txt')
for line in hand:
line = line.rstrip()
if re.search('^From:.+@', line):
print(line)
From: stephen.marquard@uct.ac.za
• If we want to extract data from a string in Python we can use the findall() method to extract all of the
substrings which match a regular expression.
In [31]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall('\S+@\S+', s)
print(lst)
['csev@umich.edu', 'cwen@iupui.edu']
In [32]:
['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
In [33]:
['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
Assume that we need to extract the data in a particular syntax. For example, we need to extract the lines
containing following format –
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
The line should start with X-, followed by 0 or more characters. Then, we need a colon and white-space.
They are written as it is. Then there must be a number containing one or more digits with or without a
decimal point. Note that, we want dot as a part of our pattern string, but not as meta character here. The
pattern for regular expression would be
In [34]:
# Search for lines that start with 'X' followed by any non
# whitespace characters and ':'
# followed by a space and any number.
# The number can include a decimal.
import re
hand = open('mbox-short.txt')
for line in hand:
line = line.rstrip()
if re.search('^X\S*: [0-9.]+', line):
print(line)
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
In [ ]:
In [35]:
['0.8475']
['0.0000']
In [36]:
['39772']
In [37]:
['18']
Escape character
In [39]:
import re
x = 'we just received $10.00 for cookies.'
y = re.findall('\$[0-9.]+',x)
y
Out[39]:
['$10.00']
In [ ]: