Python regular expressions tutorial shows how to use regular expressions in Python. For regular
expressions in Python we use the re module.
Regular expressions are used for text searching and more advanced text manipulation. Regular
expressions are built into tools like grep and sed, text editors like vi and emacs, and programming
languages like Tcl, Perl, and Python.
Python re module
In Python, the re module provides regular expression matching operations.
A pattern is a regular expression that defines the text we are searching for or manipulating. It
consists of text literals and metacharacters. The pattern is compiled with the compile function.
Because regular expressions often include special characters, it is recommended to use raw strings.
(Raw strings are preceded with the r character.) This way the characters are not interpreted before
they are compiled to a pattern.
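The following is a minimal sketch of why raw strings matter. The sample text and the variable names are our own illustration, not part of the original listings. In a normal string, Python turns \b into a backspace character before the re module sees it. In a raw string, the backslash is kept and the regex engine reads \b as a word boundary.

```python
import re

text = 'the cat sat'

# Normal string: Python converts \b into a single backspace character.
normal = '\bcat\b'
# Raw string: the backslash survives, so re sees the word-boundary token \b.
raw = r'\bcat\b'

print(len(normal))  # 5
print(len(raw))     # 7

normal_hit = re.search(normal, text)
raw_hit = re.search(raw, text)

print(normal_hit)       # None
print(raw_hit.group())  # cat
```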
After we have compiled a pattern, we can use one of the functions to apply the pattern to a text
string. The functions include match, fullmatch, search, findall, and finditer.
Regular expressions
The following table shows some basic regular expressions:
Regex Meaning
| Alternation operator.
[abc] Matches a, b, or c.
Function Description
search Scans through a string, looking for any location where this RE matches.
findall Finds all substrings where the RE matches, and returns them as a list.
finditer Finds all substrings where the RE matches, and returns them as an iterator.
The match, fullmatch, and search functions return a match object if they are successful.
Otherwise, they return None.
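As a small sketch of these return values (the sample string is an assumption made for illustration), a successful call yields a match object, while a failed call yields None. The findall and finditer functions return a list and an iterator, respectively.

```python
import re

pattern = re.compile(r'book')
text = 'my cookbook and my bookmark'

m_match = re.match(pattern, text)      # 'book' is not at the start
m_search = re.search(pattern, text)    # first occurrence anywhere
all_found = re.findall(pattern, text)  # list of matched substrings
spans = [m.span() for m in re.finditer(pattern, text)]  # from match objects

print(m_match)           # None
print(m_search.group())  # book
print(m_search.span())   # (7, 11)
print(all_found)         # ['book', 'book']
print(spans)             # [(7, 11), (19, 23)]
```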
match_fun.py
#!/usr/bin/env python
import re
pattern = re.compile(r'book')
for word in words:
    if re.match(pattern, word):
        print(f'The {word} matches')
In the example, we have a tuple of words. The compiled pattern will look for a 'book' string in each
of the words.
pattern = re.compile(r'book')
With the compile function, we create a pattern. The regular expression is a raw string and consists
of four normal characters.
if re.match(pattern, word):
    print(f'The {word} matches')
We go through the tuple and call the match function. It applies the pattern on the word. The match
function returns a match object if there is a match at the beginning of a string. It returns None if
there is no match.
$ ./match_fun.py
The book matches
The bookworm matches
The bookish matches
The bookstore matches
Four of the words in the tuple match the pattern. Note that the words that do not start with the
'book' term do not match. To include these words as well, we use the search function.
fullmatch_fun.py
#!/usr/bin/env python
import re
pattern = re.compile(r'book')
for word in words:
    if re.fullmatch(pattern, word):
        print(f'The {word} matches')
In the example, we use the fullmatch function to look for the exact 'book' term.
$ ./fullmatch_fun.py
The book matches
search_fun.py
#!/usr/bin/env python
import re
pattern = re.compile(r'book')
for word in words:
    if re.search(pattern, word):
        print(f'The {word} matches')
In the example, we use the search function to look for the 'book' term.
$ ./search_fun.py
The book matches
The bookworm matches
The bookish matches
The cookbook matches
The bookstore matches
The pocketbook matches
This time the cookbook and pocketbook words are included as well.
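The three functions can be compared side by side. The sketch below assumes a word list made of the six words that appear in the outputs above; the original tuple may have contained other words as well.

```python
import re

# Assumed word list: the six words that appear in the outputs above.
words = ('book', 'bookworm', 'bookish', 'cookbook', 'bookstore', 'pocketbook')
pattern = re.compile(r'book')

starts_with = [w for w in words if re.match(pattern, w)]    # at the beginning
exact = [w for w in words if re.fullmatch(pattern, w)]      # the whole string
contains = [w for w in words if re.search(pattern, w)]      # anywhere

print(starts_with)  # ['book', 'bookworm', 'bookish', 'bookstore']
print(exact)        # ['book']
print(contains)     # all six words
```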
Dot metacharacter
The dot (.) metacharacter stands for any single character in the text.
dot_meta.py
#!/usr/bin/env python
import re
pattern = re.compile(r'.even')
In the example, we have a tuple with eight words. We apply a pattern containing the dot
metacharacter on each of the words.
pattern = re.compile(r'.even')
The dot stands for any single character in the text. The character must be present.
$ ./dot_meta.py
The seven matches
The revenge matches
Two words match the pattern: seven and revenge.
question_mark_meta.py
#!/usr/bin/env python
import re
pattern = re.compile(r'.?even')
for word in words:
    if re.match(pattern, word):
        print(f'The {word} matches')
In the example, we add a question mark after the dot character. This means that in the pattern we
can have one arbitrary character or we can have no character there.
$ ./question_mark_meta.py
The seven matches
The even matches
The revenge matches
The event matches
This time, in addition to seven and revenge, the even and event words match as well.
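The difference between the two patterns can be checked with a short sketch. The word list here is an assumption: it contains the four words from the outputs above plus 'maven', a hypothetical word added to show a non-match. We assume the match function is used for both patterns.

```python
import re

# Assumed word list: four words from the outputs above plus 'maven',
# a hypothetical non-matching word added for contrast.
words = ('seven', 'even', 'revenge', 'event', 'maven')

one_char = re.compile(r'.even')   # exactly one character, then 'even'
optional = re.compile(r'.?even')  # zero or one character, then 'even'

with_dot = [w for w in words if re.match(one_char, w)]
with_question = [w for w in words if re.match(optional, w)]

print(with_dot)       # ['seven', 'revenge']
print(with_question)  # ['seven', 'even', 'revenge', 'event']
```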
Anchors
Anchors match positions of characters inside a given text. When using the ^ anchor the match
must occur at the beginning of the string and when using the $ anchor the match must occur at the
end of the string.
anchors.py
#!/usr/bin/env python
import re
sentences = ('I am looking for Jane.',
'Jane was walking along the river.',
'Kate and Jane are close friends.')
pattern = re.compile(r'^Jane')
for sentence in sentences:
    if re.search(pattern, sentence):
        print(sentence)
In the example, we have three sentences. The search pattern is ^Jane. The pattern checks if the
"Jane" string is located at the beginning of the text. The Jane\.$ pattern would look for "Jane." at
the end of the sentence.
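Both anchors can be tried on the same three sentences. This is a minimal sketch; the Jane\.$ pattern and the variable names are our own additions for illustration.

```python
import re

sentences = ('I am looking for Jane.',
             'Jane was walking along the river.',
             'Kate and Jane are close friends.')

starts = re.compile(r'^Jane')   # "Jane" at the beginning of the string
ends = re.compile(r'Jane\.$')   # "Jane." at the end of the string

at_start = [s for s in sentences if re.search(starts, s)]
at_end = [s for s in sentences if re.search(ends, s)]

print(at_start)  # ['Jane was walking along the river.']
print(at_end)    # ['I am looking for Jane.']
```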
Exact match
An exact match can be performed with the fullmatch function or by placing the term between the
anchors: ^ and $.
exact_match.py
#!/usr/bin/env python
import re
pattern = re.compile(r'^book$')
for word in words:
    if re.search(pattern, word):
        print(f'The {word} matches')
In the example, we look for an exact match for the 'book' term.
$ ./exact_match.py
The book matches
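The two approaches usually give the same result, but they are not identical in every case. The sketch below uses a shortened, assumed word list and shows one edge case: the $ anchor also matches just before a trailing newline, while fullmatch requires the whole string to match.

```python
import re

words = ('book', 'bookworm', 'cookbook')  # shortened, assumed word list

via_fullmatch = [w for w in words if re.fullmatch(r'book', w)]
via_anchors = [w for w in words if re.search(r'^book$', w)]

print(via_fullmatch)  # ['book']
print(via_anchors)    # ['book']

# Edge case: $ also matches just before a newline at the end of the string.
line = 'book\n'
anchors_hit = bool(re.search(r'^book$', line))
fullmatch_hit = bool(re.fullmatch(r'book', line))

print(anchors_hit)    # True
print(fullmatch_hit)  # False
```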
Character classes
A character class defines a set of characters, any one of which can occur in an input string for a
match to succeed.
character_class.py
#!/usr/bin/env python
import re
pattern = re.compile(r'gr[ea]y')
for word in words:
    if re.search(pattern, word):
        print(f'{word} matches')
In the example, we use a character class to include both gray and grey words.
pattern = re.compile(r'gr[ea]y')
The [ea] class allows either the 'e' or the 'a' character at that position in the pattern.
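A runnable sketch of the character class follows. The three phrases in the tuple are hypothetical; they were chosen so that two contain gray or grey and one does not.

```python
import re

# Hypothetical input phrases, chosen for illustration only.
words = ('a gray bird', 'grey hair', 'great look')
pattern = re.compile(r'gr[ea]y')

matched = [w for w in words if re.search(pattern, w)]

for word in matched:
    print(f'{word} matches')
```

The 'great look' phrase does not match, because the character after 'grea' is 't' and not 'y'.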
named_character_class.py
#!/usr/bin/env python
import re
text = 'We met in 2013. She must be now about 27 years old.'
pattern = re.compile(r'\d+')
found = re.findall(pattern, text)
if found:
    print(f'There are {len(found)} numbers')
pattern = re.compile(r'\d+')
The \d+ pattern matches a sequence of one or more digits. The text contains two such sequences:
2013 and 27.
$ ./named_character_class.py
There are 2 numbers
case_insensitive.py
#!/usr/bin/env python
import re
$ ./case_insensitive.py
dog matches
Dog matches
DOG matches
Doggy matches
All four words match the pattern.
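A minimal sketch that reproduces this output is shown below. It rests on two assumptions: the pattern is compiled with the re.IGNORECASE flag, and the tuple holds exactly the four printed words.

```python
import re

# Assumed word list: the four words from the output above.
words = ('dog', 'Dog', 'DOG', 'Doggy')

# re.IGNORECASE (short alias: re.I) makes the match case-insensitive.
pattern = re.compile(r'dog', re.IGNORECASE)

matched = []
for word in words:
    if re.match(pattern, word):
        matched.append(word)
        print(f'{word} matches')
```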
Alternations
The alternation operator | creates a regular expression with several choices.
alternations.py
#!/usr/bin/env python
import re
pattern = re.compile(r'Jane|Beky|Robert')
for word in words:
    if re.match(pattern, word):
        print(word)
pattern = re.compile(r'Jane|Beky|Robert')
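The pattern matches any of the three names. The following sketch uses a hypothetical list of names to show this; it also shows that match only checks the beginning of the string, so a longer name such as Janet is accepted unless we use fullmatch.

```python
import re

# Hypothetical list of names, chosen for illustration only.
words = ('Jane', 'Thomas', 'Robert', 'Lucy', 'Beky', 'Janet')
pattern = re.compile(r'Jane|Beky|Robert')

chosen = [w for w in words if re.match(pattern, w)]     # prefix is enough
exact = [w for w in words if re.fullmatch(pattern, w)]  # whole word only

print(chosen)  # ['Jane', 'Robert', 'Beky', 'Janet']
print(exact)   # ['Jane', 'Robert', 'Beky']
```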
finditer_fun.py
#!/usr/bin/env python
import re
text = 'I saw a fox in the wood. The fox had red fur.'
pattern = re.compile(r'fox')
found = re.finditer(pattern, text)
for item in found:
    s = item.start()
    e = item.end()
    print(f'Found {text[s:e]} at {s}:{e}')
In the example, we search for the 'fox' term in the text. We go over the iterator of the found
matches and print them with their indexes.
s = item.start()
e = item.end()
The start and end methods of the match object return the starting and ending index, respectively.
$ ./finditer_fun.py
Found fox at 8:11
Found fox at 29:32
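As a variation, the match object also offers the span and group methods. The sketch below is an alternative way to produce the same output; it is not the listing from the example.

```python
import re

text = 'I saw a fox in the wood. The fox had red fur.'

# group() returns the matched text; span() returns (start, end) as a tuple.
found = [(m.group(), m.span()) for m in re.finditer(r'fox', text)]

for matched_text, (s, e) in found:
    print(f'Found {matched_text} at {s}:{e}')
```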
Capturing groups
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing
characters inside a set of round brackets. For instance, (book) is a single group containing the 'b',
'o', 'o', 'k' characters.
The capturing groups technique allows us to find out which parts of a string match the regular
expression pattern.
capturing_groups.py
#!/usr/bin/python3
import re
pattern = re.compile(r'(</?[a-z]*>)')
The code example prints all HTML tags from the supplied string by capturing a group of
characters.
$ ./capturing_groups.py
<p>
<code>
</code>
</p>
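A runnable sketch that produces this output is shown below. The content string is an assumption made for illustration. When the pattern contains one capturing group, findall returns the text captured by that group for each match.

```python
import re

# Assumed HTML snippet, chosen so that it contains the four printed tags.
content = ('<p>The <code>Pattern</code> is a compiled '
           'representation of a regular expression.</p>')

pattern = re.compile(r'(</?[a-z]*>)')

# With a single group, findall returns a list of the captured strings.
tags = re.findall(pattern, content)

for tag in tags:
    print(tag)
```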
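import re
# The addresses are taken from the program output below.
emails = ('luke@gmail.com', 'andy@yahoocom', '34234sdfa#2345', 'f344@gmail.com')
pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')
for email in emails:
    if re.match(pattern, email):
        print(f'{email} matches')
    else:
        print(f'{email} does not match')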
pattern = re.compile(r'^[a-zA-Z0-9._-]+@[a-zA-Z0-9-]+\.[a-zA-Z.]{2,18}$')
The first ^ and the last $ characters provide an exact pattern match. No characters before or after
the pattern are allowed. The email is divided into five parts. The first part is the local part. This is
usually the name of a company, an individual, or a nickname. The [a-zA-Z0-9._-]+ class lists all
the characters that we can use in the local part. They can be used one or more times.
The second part consists of the literal @ character. The third part is the domain part. It is usually
the domain name of the email provider, such as Yahoo or Gmail. The [a-zA-Z0-9-]+ is a
character class providing all characters that can be used in the domain name. The + quantifier
allows one or more of these characters.
The fourth part is the dot character. It is preceded by the escape character (\) to get a literal dot.
The final part is the top-level domain: [a-zA-Z.]{2,18}. Top-level domains can have from 2 to 18
characters, such as sk, net, info, travel, cleaning, travelinsurance. The maximum length can be 63
characters, but most domains are shorter than 18 characters today. The class also contains a dot
character. This is because some top-level domains have two parts; for instance co.uk.
$ ./emails.py
luke@gmail.com matches
andy@yahoocom does not match
34234sdfa#2345 does not match
f344@gmail.com matches
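The five parts can be made visible by wrapping each of them in a capturing group. This is only a sketch: the grouped pattern and the second address are our own additions, not part of the example.

```python
import re

# The same pattern as above, with each of the five parts in its own group.
pattern = re.compile(
    r'^([a-zA-Z0-9._-]+)(@)([a-zA-Z0-9-]+)(\.)([a-zA-Z.]{2,18})$')

first = re.match(pattern, 'luke@gmail.com').groups()
# Hypothetical address with a two-part top-level domain.
second = re.match(pattern, 'joe@example.co.uk').groups()

print(first)   # ('luke', '@', 'gmail', '.', 'com')
print(second)  # ('joe', '@', 'example', '.', 'co.uk')
```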