REGULAR EXPRESSIONS AND OOP CONCEPTS
UNIT – 4 | I SEM | MCA |2024-26 BATCH | RIT
REGEX IN PYTHON
What is Regular expression?
Applications of regular expressions
How to create Regular expressions in Python?
REGULAR EXPRESSION
A Regular Expression (regex) is a sequence of characters that defines a search
pattern.
It is commonly used for string matching, searching, and replacing text.
It is a compact way of describing the kind of text to look for within a larger
body of text.
Python provides the re module to work with regular expressions.
APPLICATIONS OF REGULAR EXPRESSIONS
Data validations
Ex: mobile number validation, email validation, etc
Data extraction
Specific info from data can be extracted
Data cleaning, web scraping
Functionality such as
Ctrl+F find and replace, grep commands (UNIX), and the LIKE operator in SQL
To create translators – compilers, interpreters, assemblers
For syntax analysis and lexical analysis
Password Policies
Used in NLP to identify specific patterns in data.
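As a sketch of the data-validation use case listed above, the snippet below checks whether a string looks like a 10-digit mobile number. The format (first digit 6 to 9, then nine more digits) and the names MOBILE_PATTERN and is_valid_mobile are assumptions made for illustration; re.fullmatch() is a standard re function that succeeds only when the whole string matches.

```python
import re

# Assumed format: exactly 10 digits, first digit 6-9 (illustrative only).
MOBILE_PATTERN = re.compile(r"[6-9][0-9]{9}")

def is_valid_mobile(number):
    """Return True if the whole string matches the assumed mobile format."""
    # fullmatch() succeeds only if the entire string matches the pattern.
    return MOBILE_PATTERN.fullmatch(number) is not None

print(is_valid_mobile("9876543210"))    # True
print(is_valid_mobile("12345"))         # False (too short)
print(is_valid_mobile("98765432101"))   # False (11 digits)
```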
BASIC SEARCH FUNCTIONS
search()
match()
finditer()
findall()
re.match()
Purpose: search for a pattern at the beginning of a string.
Syntax: re.match (pattern, string, flags = 0)
pattern: The regular expression pattern you want to search for
string: input string in which you want to search for pattern
Returns: if a match is found at the beginning of the string, it returns a
match object; otherwise it returns None.
Using the re Module in Python
Python’s re module provides powerful tools for regex operations.
re.match() – Matches the Beginning of a String. It only checks the start of the string.
import re
pattern = r"Hello"
text = "Hello, world!"
match = re.match(pattern, text)
if match:
    print("Match found!")
else:
    print("No match")
# Output: Match found!
.group() → Returns the actual match.
.span() → Returns the start and end positions of the match.
if match:
    print("Matched text:", match.group())             # Returns matched text ("Hello")
    print("Start and End positions:", match.span())   # Returns (0, 5)
What is a Raw String (r"")?
• In Python, the r before a string (like r"^\d$") makes it a raw string literal.
• In a normal string, backslashes (\) are treated as escape characters
(e.g., "\n" for a newline, "\t" for a tab).
• A raw string (r"") tells Python not to interpret backslashes as escape
sequences.
• In regex, we often use \d, \s, \b, etc., where \ has a special meaning.
Using r"" prevents Python from treating \ as an escape character.
• Always use r"" for regex patterns to avoid unexpected errors.
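A small sketch of why the r prefix matters. The sample sentence is an assumption chosen for illustration; the point is that "\b" in a normal string is the backspace character, while r"\b" reaches the regex engine as a word boundary.

```python
import re

# "\n" is one character (a newline); r"\n" is two: a backslash and the letter n.
print(len("\n"))     # 1
print(len(r"\n"))    # 2

text = "The cat is here"

# Normal string: "\b" becomes a backspace character, so nothing is found.
normal_result = re.search("\bcat\b", text)
# Raw string: r"\b" is passed through unchanged and means "word boundary".
raw_result = re.search(r"\bcat\b", text)

print(normal_result)         # None
print(raw_result.group())    # cat
```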
re.search()
Purpose: The search() function in the re module scans a string for the
first occurrence of a pattern.
Syntax: re.search (pattern, data, flags = 0)
pattern: The regular expression pattern you want to search for
data: input string in which you want to search for pattern
Returns: match object if match is found or None if no match found
re.search() – Finds the First Match Anywhere
Unlike match(), search() checks the entire string.
import re
pattern = r"world"
text = "Hello, world!"
match = re.search(pattern, text)
if match:
    print("Match found!")
# Output: Match found!
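To make the contrast between match() and search() concrete, here is a short sketch that runs both functions with the same pattern on the same text. The variable names at_start and anywhere are illustrative.

```python
import re

text = "Hello, world!"

at_start = re.match(r"world", text)     # "world" is not at position 0
anywhere = re.search(r"world", text)    # scans the whole string

print(at_start)            # None
print(anywhere.group())    # world
print(anywhere.span())     # (7, 12)
```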
In Python, you can use regular expressions in two ways:
1. Directly as a string pattern
You pass a raw string directly to functions like re.search(), re.match(), etc.
2. Using Regular Expression Objects
You first compile the pattern using re.compile(), creating a reusable regex
object. This is useful for repeated searches.
Using Regular Expression Objects
import re
pattern = re.compile(r"World") # Compile the regex pattern
text = "Hello, World!"
match = pattern.search(text) # Using the compiled object
if match:
    print("Found:", match.group())
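A compiled object pays off when the same pattern is applied to many strings. The sketch below assumes a small list of sample lines and uses the re.IGNORECASE flag so that "World" and "world" both match.

```python
import re

# Compile once; re.IGNORECASE makes the match case-insensitive.
word = re.compile(r"world", re.IGNORECASE)

lines = ["Hello, World!", "Goodbye", "world peace"]   # sample data (assumed)

# Reuse the same compiled object for every line.
found = [line for line in lines if word.search(line)]
print(found)    # ['Hello, World!', 'world peace']
```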
re.finditer()
Purpose: re.finditer() returns an iterator yielding match objects for all
non-overlapping occurrences of a pattern in a string.
Syntax: re.finditer (pattern, data, flags = 0)
pattern: The regular expression pattern you want to search for
data: input string in which you want to search for pattern
Returns: iterator object containing match info.
re.finditer() – Returns Matches as an Iterator
Example 1:
import re
pattern = r"Hello"
text = "Hello, world!"
matches = re.finditer(pattern, text)
for match in matches:
    print(match.group())
# Output: Hello
Example 2:
import re
pattern = re.compile('ab', re.IGNORECASE)
data = 'abaababa'
match_iter = re.finditer(pattern, data)
count = 0
for match in match_iter:
    count += 1
    print(f"start:{match.start()}, end:{match.end()}, element:{match.group()}")
print("total:", count)
Useful when handling large data, as it yields results lazily.
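A minimal sketch of the lazy behaviour: matches are produced one at a time as the iterator is advanced, so the first result can be used without collecting the rest. The built-in next() is used here only for demonstration.

```python
import re

data = "abaababa"
match_iter = re.finditer(r"ab", data)

# Nothing is collected up front; next() pulls one match at a time.
first = next(match_iter)
print(first.group(), first.span())    # ab (0, 2)

# The iterator continues from where it stopped.
remaining = [m.span() for m in match_iter]
print(remaining)                      # [(3, 5), (5, 7)]
```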
re.findall()
Purpose: re.findall() returns a list of all non-overlapping matches of a
pattern in a string.
Syntax: re.findall (pattern, data, flags = 0)
pattern: The regular expression pattern you want to search for
data: input string in which you want to search for pattern
Returns: A list containing all matching substrings
re.findall() – Returns All Matches in a List
import re
pattern = r"[0-9]"  # Find all digits
text = "My number is 123 and my friend's is 456"
matches = re.findall(pattern, text)
print(matches) # Output: ['1', '2', '3', '4', '5', '6']
import re
pattern = re.compile('ab', re.IGNORECASE)
data = 'abaababa'
match_list = re.findall(pattern, data)
print(match_list) # Output: ['ab', 'ab', 'ab']
DIFFERENCE BETWEEN findall() AND finditer()
Both re.findall() and re.finditer() are used to search for all occurrences of a
pattern in a string, but they differ in how they return results.
Feature | re.findall() | re.finditer()
Return Type | Returns a list of matching substrings. | Returns an iterator yielding match objects.
Memory Usage | Stores all matches in a list (higher memory usage for large data). | Uses an iterator (more memory-efficient).
Accessing Match Info | Only returns matched substrings, no details like position. | Provides full match details (start, end, groups).
Use Case | When only matched strings are needed. | When additional match details (index, groups) are needed.
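The difference can be seen by running both functions on the same data. This sketch uses the pattern [0-9]+ (one or more digits) so that whole numbers are returned rather than single digits.

```python
import re

data = "My number is 123 and my friend's is 456"

# findall(): a plain list of the matched substrings.
numbers = re.findall(r"[0-9]+", data)
print(numbers)      # ['123', '456']

# finditer(): match objects, so start and end positions are available too.
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"[0-9]+", data)]
print(positions)    # [('123', 13, 16), ('456', 36, 39)]
```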
Understanding Non-Overlapping Matches in re.findall() and finditer()
In re.findall() and finditer(), matches are non-overlapping, meaning
once a match is found, the search continues after the match, rather
than inside it.
import re
data = "ababab"
matches = re.findall(r"aba", data)
print(matches)
#Output: ['aba']
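If overlapping occurrences are needed, one common workaround (not covered in these notes, shown here only as a sketch) is a lookahead with a capture group. The lookahead (?=...) checks for the pattern without consuming characters, so the scan moves forward one position at a time.

```python
import re

data = "ababab"

non_overlapping = re.findall(r"aba", data)
print(non_overlapping)    # ['aba']

# The lookahead consumes nothing; the capture group records each hit.
overlapping = re.findall(r"(?=(aba))", data)
print(overlapping)        # ['aba', 'aba']
```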
CHARACTER CLASS IN PYTHON REGEX
A character class is a set of characters defined inside a regular
expression.
Character classes specify the range or group of characters you want to
search for in the data.
These classes help in defining flexible patterns for text searching and
validation.
Character Classes in Python Regex
Square Brackets [ ]
•Used to define a set of characters.
•Example: [abc] matches 'a', 'b', or 'c'.
Range of Characters
•[a-z] → Matches any lowercase letter (a to z).
•[A-Z] → Matches any uppercase letter (A to Z).
•[0-9] → Matches any digit (0 to 9).
Negation [^ ] (Caret Inside Brackets)
•Matches anything except the characters inside the brackets.
•Example: [^0-9] matches anything except digits.
Predefined Character Classes
•\d → Matches any digit (equivalent to [0-9] for ASCII text).
•\D → Matches any non-digit character (equivalent to [^0-9] for ASCII text).
•\w → Matches any word character (letters, digits, underscore); [a-zA-Z0-9_] for ASCII text.
•\W → Matches any non-word character (opposite of \w).
•\s → Matches any whitespace character (space, tab, newline).
•\S → Matches any non-whitespace character.
Special Character Classes
•[aeiou] → Matches any vowel.
•[13579] → Matches any odd digit.
•[02468] → Matches any even digit.
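The classes above can be tried on a short sample string. The text "Room 42, Block B_7!" is an assumption chosen so that letters, digits, an underscore, and punctuation all appear.

```python
import re

text = "Room 42, Block B_7!"   # sample text (assumed)

vowels = re.findall(r"[aeiou]", text)           # lowercase vowels only
digits = re.findall(r"\d", text)                # any digit
others = re.findall(r"[^0-9a-zA-Z ]", text)     # not a letter, digit, or space
words = re.findall(r"\w+", text)                # runs of word characters

print(vowels)   # ['o', 'o', 'o']
print(digits)   # ['4', '2', '7']
print(others)   # [',', '_', '!']
print(words)    # ['Room', '42', 'Block', 'B_7']
```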
Regex Pattern | Meaning | Example | Matches | Does Not Match
\b | Word boundary (start or end of a word) | \bcat\b | "cat" in "The cat is here" | "caterpillar", "wildcat"
\A | Matches only at the start of a string | \AHello | "Hello world" | "world Hello"
\Z | Matches only at the end of a string | tutorial\Z | "This is a tutorial" | "tutorial on regex"
. | Matches any character except a newline | | |
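The rows of the table above can be checked directly. This sketch reuses the table's own example strings; only the variable names are added.

```python
import re

# \b: "cat" must stand alone as a word.
whole_word = re.findall(r"\bcat\b", "The cat is here")
inside_word = re.findall(r"\bcat\b", "caterpillar wildcat")
print(whole_word)     # ['cat']
print(inside_word)    # []

# \A: the pattern must begin at the very start of the string.
print(bool(re.search(r"\AHello", "Hello world")))    # True
print(bool(re.search(r"\AHello", "world Hello")))    # False

# \Z: the pattern must finish at the very end of the string.
print(bool(re.search(r"tutorial\Z", "This is a tutorial")))    # True
print(bool(re.search(r"tutorial\Z", "tutorial on regex")))     # False
```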
Find digits in given data
Using findall():
import re
pattern = r'[0-9]'
data = "The price is $."
match_list = re.findall(pattern, data)
if match_list:
    print("digits present")
else:
    print("not present")
Using finditer():
import re
pattern = r'[0-9]'
data = "The price is $100."
match_iter = re.finditer(pattern, data)
for match in match_iter:
    print(match)
Table 1: Basic Regex Metacharacters
Symbol Description
. Matches any character except a newline
^ Matches the start of a string
$ Matches the end of a string
* Matches 0 or more occurrences of the preceding character
+ Matches 1 or more occurrences of the preceding character
? Matches 0 or 1 occurrence of the preceding character
{n} Matches exactly n occurrences
{n,} Matches n or more occurrences
{n,m} Matches between n and m occurrences
\ Escape character (e.g., \. matches a literal dot .)
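A quick way to see what each quantifier in Table 1 allows is to test it against a few sample strings. The sample list and the helper name matching are assumptions for illustration; re.fullmatch() requires the whole string to match.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]   # sample strings (assumed)

def matching(pattern):
    """Return the samples that the pattern matches in full."""
    return [s for s in samples if re.fullmatch(pattern, s)]

print(matching(r"ca*t"))       # ['ct', 'cat', 'caat', 'caaat']
print(matching(r"ca+t"))       # ['cat', 'caat', 'caaat']
print(matching(r"ca?t"))       # ['ct', 'cat']
print(matching(r"ca{2}t"))     # ['caat']
print(matching(r"ca{2,}t"))    # ['caat', 'caaat']
print(matching(r"ca{1,2}t"))   # ['cat', 'caat']

# Escaping: \. matches a literal dot, not "any character".
literal_dot = re.findall(r"3\.14", "3.14 and 3x14")
print(literal_dot)             # ['3.14']
```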
Table 2: Character Classes and Groups
Pattern Description
\d Matches any digit (0-9)
\D Matches any non-digit character
\w Matches any word character (a-z, A-Z, 0-9, _)
\W Matches any non-word character
\s Matches any whitespace (space, tab, newline)
\S Matches any non-whitespace character
[abc] Matches any one of a, b, or c
[^abc] Matches anything except a, b, or c
\b Matches word boundaries (e.g., \bword\b matches the word "word" exactly)
re.sub() – Replaces Text in a String
import re
text = "Python is fun!"
new_text = re.sub(r"Python", "Java", text)
print(new_text)
# Output: Java is fun!
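re.sub() also accepts regex patterns, which makes it handy for data cleaning. The sketch below uses an assumed sample string; count is a standard keyword argument of re.sub() that limits the number of replacements.

```python
import re

raw = "Call   me  at 98765-43210   today"   # sample text (assumed)

# Collapse each run of whitespace into a single space.
cleaned = re.sub(r"\s+", " ", raw)
print(cleaned)    # Call me at 98765-43210 today

# Mask digits; count=5 stops after the first five replacements.
masked = re.sub(r"\d", "X", cleaned, count=5)
print(masked)     # Call me at XXXXX-43210 today
```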
https://regexr.com/
https://www.kaggle.com/code/albeffe/regex-exercises-solutions/notebook
Character classes
. any character except newline
\w\d\s word, digit, whitespace
\W\D\S not word, digit, whitespace
[abc] any of a, b, or c
[^abc] not a, b, or c
[a-g] character between a & g
Anchors
^abc$ start / end of the string
\b word boundary
Escaped characters
\. \* \\ escaped special characters
\t \n \r tab, linefeed, carriage return
Groups
(abc) capture group
Quantifiers & Alternation
a* a+ a? 0 or more, 1 or more, 0 or 1
a{5} a{2,} exactly five, two or more
a{1,3} between one & three
a+? a{2,}? match as few as possible
ab|cd match ab or cd
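The last three cheat-sheet entries (capture groups, lazy quantifiers, alternation) can be sketched as follows. All sample strings are assumptions made for illustration.

```python
import re

# Capture groups: parentheses pull out parts of a match.
m = re.search(r"(\d{2})-(\d{2})-(\d{4})", "Exam date: 15-03-2025")
print(m.group(0))    # 15-03-2025
print(m.groups())    # ('15', '03', '2025')

# Greedy (.+) takes as much as possible; lazy (.+?) takes as little as possible.
html = "<b>one</b><b>two</b>"
greedy = re.findall(r"<b>.+</b>", html)
lazy = re.findall(r"<b>.+?</b>", html)
print(greedy)    # ['<b>one</b><b>two</b>']
print(lazy)      # ['<b>one</b>', '<b>two</b>']

# Alternation: match either alternative.
pets = re.findall(r"cat|dog", "cat, dog, cow")
print(pets)      # ['cat', 'dog']
```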