Regular Expressions in Python With Code Examples
Regular expressions, also known as regex, are a powerful tool for searching and manipulating text. Python provides robust support for regular expressions through the ‘re’ module. In this article, we will explore regular expressions in Python and learn how to use them.
What are Regular Expressions?
A regular expression is a pattern that describes a set of strings. It consists of a combination of literal characters and special characters, known as metacharacters, that have special meanings. Regular expressions can be used to search for specific patterns of text, extract data from text, or replace text with other text.
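To make the idea of "a set of strings" concrete, here is a minimal sketch. The pattern and the candidate words are made up for illustration; re.fullmatch() is used so that a word counts only when the whole string fits the pattern.

```python
import re

# One pattern, several strings: "gr[ae]y" describes both spellings
pattern = r"gr[ae]y"
candidates = ["gray", "grey", "groy", "greys"]

# re.fullmatch succeeds only when the entire string fits the pattern
in_set = [word for word in candidates if re.fullmatch(pattern, word)]
print(in_set)
```

Only "gray" and "grey" belong to the set the pattern describes; "groy" has the wrong vowel and "greys" has an extra character.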
Regex Functions
Python's ‘re’ module provides several functions that are commonly used for regular expressions. Some of the most commonly used functions are:
- re.search(pattern, string, flags=0): Searches for the first occurrence of the pattern in the string and returns a match object if found.
- re.findall(pattern, string, flags=0): Searches for all occurrences of the pattern in the string and returns a list of all matches found.
- re.sub(pattern, repl, string, count=0, flags=0): Replaces all occurrences of the pattern in the string with the specified replacement string (‘repl’).
- re.compile(pattern, flags=0): Compiles the regex pattern into a regex object, which can be used for efficient matching operations.
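The signatures above hide a few details that are easy to miss, so here is a small sketch of them. The sample text and variable names are illustrative only. It shows that re.search() returns None when nothing matches, and how the count argument of re.sub() limits the number of replacements.

```python
import re

text = "cat bat rat"

# re.search returns None when nothing matches, so check before calling .group()
hit = re.search(r"dog", text)
found = hit.group() if hit else None

# re.findall returns every non-overlapping match as a list of strings
words = re.findall(r"[cbr]at", text)

# count=1 limits re.sub to the first replacement; count=0 (the default) replaces all
first_only = re.sub(r"at", "ow", text, count=1)
every_one = re.sub(r"at", "ow", text)

# A compiled pattern exposes the same operations as methods
three_letters = re.compile(r"[a-z]{3}")
compiled_hits = three_letters.findall(text)

print(found, words, first_only, every_one, compiled_hits, sep="\n")
```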
Metacharacters
Metacharacters are characters that have special meanings in regular expressions. Some of the most commonly used metacharacters are:
- . - Matches any character except newline
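- ^ - Matches the beginning of the string (and of each line when re.MULTILINE is set)
- $ - Matches the end of the string or just before a trailing newline (and the end of each line when re.MULTILINE is set)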
- * - Matches zero or more occurrences of the preceding character
- *? - Matches zero or more occurrences of the preceding character (non-greedy)
- + - Matches one or more occurrences of the preceding character
- +? - Matches one or more occurrences of the preceding character (non-greedy)
- ? - Matches zero or one occurrence of the preceding character
- {m} - Matches exactly m occurrences of the preceding character
- {m,n} - Matches between m and n occurrences of the preceding character
- [] - Matches any character inside the brackets
- [^] - Matches any character not inside the brackets
- () - Creates a capturing group for extracting a substring
- | - Matches either the expression before or after the pipe
- \ - Escapes metacharacters, allowing them to be treated as literals
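Several of the metacharacters above are easier to understand side by side. The following is a rough sketch with made-up sample strings: it contrasts greedy and non-greedy matching, then shows bounded repetition, grouping with alternation, and escaping.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'
greedy = re.findall(r"<.*>", html)
# Non-greedy: .*? stops at the first '>' that lets the pattern succeed
lazy = re.findall(r"<.*?>", html)

# {m,n} bounds the repeat count; here 2 to 3 digits in a row
runs = re.findall(r"[0-9]{2,3}", "7 42 123 98765")

# () captures a substring; | chooses between alternatives
pets = re.findall(r"(cat|dog)s", "cats, dogs and birds")

# A backslash turns a metacharacter into a literal: \. matches a real dot
dots = re.findall(r"\.", "v1.2.3")

print(greedy, lazy, runs, pets, dots, sep="\n")
```

Note that when a pattern contains one capturing group, re.findall() returns the text of the group rather than the whole match, which is why pets holds "cat" and "dog" without the trailing "s".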
Special Sequences
Special sequences are backslash escapes and extension groups that have predefined meanings in regular expressions. Some of the most commonly used special sequences are:
- \d - Matches any digit character (equivalent to [0-9] for ASCII text; with str patterns it also matches other Unicode digits unless re.ASCII is set)
- \D - Matches any non-digit character (the opposite of \d)
- \s - Matches any whitespace character (space, tab, newline, etc.)
- \S - Matches any non-whitespace character
- \w - Matches any word character: letters, digits, and the underscore (equivalent to [a-zA-Z0-9_] when re.ASCII is set)
- \W - Matches any non-word character (the opposite of \w)
- \b - Matches a word boundary (the point between a word character and a non-word character)
- \B - Matches any position that is not a word boundary
- (?i) - Turns on case-insensitive matching for the whole pattern (must be placed at the start of the expression)
- (?x) - Allows whitespace and comments in the regular expression
- (?P<name>...) - Creates a named capture group
- (?P=name) - Matches the same text as the named capture group specified by name
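The special sequences listed above can be tried out with a short sketch like the one below. All sample strings, the group names year, month and word, and the phone-number format are hypothetical and exist only for illustration.

```python
import re

# \b anchors on word boundaries, so "cat" inside "concatenate" is skipped
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")

# In Python 3 str patterns, \d also matches non-ASCII digits unless re.ASCII is set
unicode_digits = re.findall(r"\d", "5\u0663")
ascii_digits = re.findall(r"\d", "5\u0663", flags=re.ASCII)

# (?P<name>...) names a group; read it back with .group("name")
date = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 1957-10")
year, month = date.group("year"), date.group("month")

# (?P=name) must repeat the exact text the named group captured
doubled = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it was the the best")

# (?i) at the start of the pattern makes the whole pattern case-insensitive
any_case = re.findall(r"(?i)nobel", "Nobel NOBEL nobel")

# (?x) lets the pattern carry whitespace and comments
verbose = re.compile(r"""(?x)
    \d{3}   # first block of digits (hypothetical phone format)
    -
    \d{4}   # second block of digits
""")
phone = verbose.search("call 555-0199 now").group()

print(whole_words, ascii_digits, year, month, doubled.group("word"), any_case, phone)
```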
Code Examples Of Metacharacters And Special Sequences:
import re
# Define a sample string to search
txt = "Albert Camus was the recipient of the 1957 Nobel Prize in Literature"
# use ' [] ' to find all lower case characters alphabetically between "a" and "g"
pattern1 = re.findall("[a-g]", txt)
# use ' \d ' to find all the digit characters
pattern2 = re.findall(r"\d", txt)
# use ' . ' to search for a sequence that starts with "Ca" followed by any two
# characters and ends with "s"
pattern3 = re.findall("Ca..s", txt)
# use ' ^ ' to check if the string starts with "Albert"
pattern4 = re.findall("^Albert", txt)
# use ' $ ' to check if the string ends with "Literature"
pattern5 = re.findall("Literature$", txt)
# now print the results
print(pattern1)
print(pattern2)
print(pattern3)
if pattern4:
    print("Yes, the string starts with 'Albert'")
if pattern5:
    print("Yes, the string ends with 'Literature'")
Output
['b', 'e', 'a', 'a', 'e', 'e', 'c', 'e', 'f', 'e', 'b', 'e', 'e', 'e', 'a', 'e']
['1', '9', '5', '7']
['Camus']
Yes, the string starts with 'Albert'
Yes, the string ends with 'Literature'
Some Useful Functions Of the “re” Module
import re
# Search for a pattern in a string
string = "The quick brown fox jumps over the lazy dog"
pattern = "brown"
match = re.search(pattern, string)
print(match.group())
# Find all occurrences of a pattern in a string
string = "The quick brown fox jumps over the lazy dog"
pattern = "the"
matches = re.findall(pattern, string, flags=re.IGNORECASE)
print(matches)
# Replace all occurrences of a pattern in a string
string = "The quick brown fox jumps over the lazy dog"
pattern = "brown"
replacement = "red"
new_string = re.sub(pattern, replacement, string)
print(new_string)
# Compile a regex pattern and use it for matching
pattern = re.compile(r"\d+")
string = "The quick brown fox jumps over the 1234 lazy dogs"
matches = pattern.findall(string)
print(matches)
Output
brown
['The', 'the']
The quick red fox jumps over the lazy dog
['1234']
Code Explanation
In the first example, we search for the pattern “brown” in the string “The quick brown fox jumps over the lazy dog”. The “re.search()” function returns a match object, which we can use to extract the matched text using the “group()” method.
In the second example, we use the “re.findall()” function to find all occurrences of the pattern “the” in our string. We use the “flags” parameter to specify the “re.IGNORECASE” flag, which makes the search case-insensitive.
In the third example, we use the “re.sub()” function to replace all occurrences of the pattern “brown” with the string “red” in our string. The function returns a new string with the replacements made.
In the fourth example, we compile a regex pattern using the “re.compile()” function. The pattern ‘ \d+ ’ matches one or more decimal digits. We then use the compiled pattern to find all matches in our string using its “findall()” method, which returns a list of all matches found.
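As a follow-up to the explanations above, the sketch below looks a little closer at match objects and compiled patterns. It reuses the sentence from the fourth example; the pattern “purple” and the variable names are illustrative. The finditer() and sub() methods of compiled patterns are not used elsewhere in this article and are shown here only as one possible way to reuse a compiled pattern.

```python
import re

sentence = "The quick brown fox jumps over the 1234 lazy dogs"

# A match object records where the match was found, not just its text
m = re.search(r"brown", sentence)
text, start, end = m.group(), m.start(), m.end()

# re.search returns None on failure, so guard before using the result
missing = re.search(r"purple", sentence)
label = missing.group() if missing else "no match"

# A compiled pattern can be reused across calls; finditer yields match objects
digits = re.compile(r"\d+")
positions = [(d.group(), d.span()) for d in digits.finditer(sentence)]
masked = digits.sub("#", sentence)

print(text, start, end, label, positions, masked, sep="\n")
```

Compiling is mainly worthwhile when the same pattern is applied many times, for example inside a loop over many lines of input.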