To use regular expressions in Python, we have to import the re module. Let’s start with a simple example that searches for the pattern “word:” followed by three word characters:
import re
str = 'batman starts with the word:bat!!'
match = re.search(r'word:\w\w\w', str)
# If-statement after search() tests if it succeeded
if match:
    print 'found', match.group() ## 'found word:bat'
else:
    print 'did not find'
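The test-then-group() pattern above can be wrapped in a small helper. This is only a sketch: the name find is my own, not something from the re module, and print() is called with parentheses so that it also runs on Python 3.

```python
import re

def find(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    # re.search() returns None on failure, so test before calling group()
    return match.group() if match else None

found = find(r'word:\w\w\w', 'batman starts with the word:bat!!')
missing = find(r'word:\w\w\w', 'no keyword here')

print(found)    # word:bat
print(missing)  # None
```

Returning None instead of printing keeps the helper reusable for the other patterns in this post.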
Why the prefix r?
I wondered why the pattern has the prefix r. Google’s Python course says: The ‘r’ at the start of the pattern string designates a Python “raw” string, which passes through backslashes without change, which is very handy for regular expressions.
I didn’t quite get that, so I searched and found this StackOverflow post. It becomes clear with the following example: in a normal string '\n' is a single newline character, so printing it shows only a blank line, while in a raw string it stays as two characters, a backslash and the letter n.
>>> '\n'
'\n'
>>> r'\n'
'\\n'
>>> print '\n'
>>> print r'\n'
\n
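To check the difference in code rather than at the prompt, here is a small sketch. The \b example is my own addition: in a normal string '\b' is a backspace character, while in a raw pattern the regex engine reads \b as a word boundary.

```python
import re

normal = '\n'   # one character: a newline
raw = r'\n'     # two characters: a backslash and the letter n
lengths = (len(normal), len(raw))
print(lengths)  # (1, 2)

# A raw pattern is the same string as its doubled-backslash spelling
same_pattern = (r'\d\d' == '\\d\\d')
print(same_pattern)  # True

# With r, \b reaches the regex engine as a word boundary;
# without r, Python has already turned it into a backspace character
with_raw = re.search(r'\bbat\b', 'the bat flew')
without_raw = re.search('\bbat\b', 'the bat flew')
print(with_raw.group())  # bat
print(without_raw)       # None
```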
Search Examples
Here are some more re.search examples. Each of them can be plugged into the if/else test from the first example in this post:
re.search(r'iii', 'niiice') # found iii
re.search(r'igs', 'niiice') # did not find
## . = any char but \n
re.search(r'..e', 'niiice') # found ice
## \d = digit char, \w = word char
re.search(r'\d\d\d', 'n123ce') # found 123
re.search(r'\w\w\w', '$$batman&&') # found bat
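To try all of these in one go, a loop like the following should do. It is a sketch that reuses the patterns above; the names examples and results are my own.

```python
import re

# (pattern, text) pairs taken from the examples above
examples = [
    (r'iii', 'niiice'),
    (r'igs', 'niiice'),
    (r'..e', 'niiice'),
    (r'\d\d\d', 'n123ce'),
    (r'\w\w\w', '$$batman&&'),
]

results = []
for pattern, text in examples:
    match = re.search(pattern, text)
    # store the matched text, or None when the search fails
    results.append(match.group() if match else None)

print(results)  # ['iii', None, 'ice', '123', 'bat']
```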
Repetition
Here’s what I learned about finding repeated patterns in a given string:
- Plus sign (+): 1 or more occurrences of the pattern to its left, e.g. ‘i+’ = one or more i’s
- Star sign (*): 0 or more occurrences of the pattern to its left
- Question mark (?): match 0 or 1 occurrences of the pattern to its left
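## \s = whitespace char, so \s* = zero or more whitespace chars
re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx') # found 1 2 3
re.search(r'\d\s*\d\s*\d', 'xx12 3xx') # found 12 3
re.search(r'\d\s*\d\s*\d', 'xx123xx') # found 123
## ^ = matches the start of the string, so this fails
re.search(r'^b\w+', 'foobatman') # did not find
## without ^ the match can start anywhere in the string
re.search(r'b\w+', 'foobatman') # found batman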
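The examples above cover + and *, but not ?. Here is a small sketch of all three; the test strings are my own and are only there for illustration.

```python
import re

# + needs at least one i and then takes as many as it can
plus = re.search(r'i+', 'niiice').group()
print(plus)  # iii

# search() stops at the first (leftmost) match, even if a longer run comes later
leftmost = re.search(r'i+', 'piigiiii').group()
print(leftmost)  # ii

# * is happy with zero occurrences, so 'n' alone still matches
star = re.search(r'ni*', 'nce').group()
print(star)  # n

# ? makes the u optional, so both spellings match
with_u = re.search(r'colou?r', 'my favourite colour').group()
without_u = re.search(r'colou?r', 'my favorite color').group()
print(with_u)     # colour
print(without_u)  # color
```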
Finding An Email using Regular Expression
import re
str = 'contact superman at supes@earth.com'
# search for 1 or more word characters, then @, then 1 or more word characters
match = re.search(r'\w+@\w+', str)
if match:
    print match.group() ## 'supes@earth'
But it only returns part of the email address, because \w does not match the dot. We need to adjust the pattern so that it picks up the .com part as well.
The following code accommodates dots and dashes:
import re
str = 'contact superman at supes@g-mail.com'
# both sets can contain word characters, dashes or dots
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print match.group() ## 'supes@g-mail.com'
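Two details of the square brackets are worth checking by hand. Inside [...] a dot means a literal dot, and because the set accepts dots, a period that ends the sentence gets pulled into the match too. The addresses below are made up for illustration.

```python
import re

# Inside [...] the dot is a literal dot, not "any character"
dots_and_dashes = re.findall(r'[.-]', 'a.b-c d')
print(dots_and_dashes)  # ['.', '-']

# Dots and dashes are now accepted on both sides of the @
dotted = re.search(r'[\w.-]+@[\w.-]+', 'write to first.last@mail-server.org today').group()
print(dotted)  # first.last@mail-server.org

# Caveat: a period right after the address is matched by the host set as well
trailing = re.search(r'[\w.-]+@[\w.-]+', 'mail supes@earth.com.').group()
print(trailing)  # supes@earth.com.
```

So this pattern is fine for learning, but it is not a full email validator.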
Now that we have a way to find the email address, can we extract the username from it? Yes, we can! This can be done using group extraction in Python. Just add parentheses around the username and host parts of the pattern as follows:
import re
str = 'contact superman at supes@g-mail.com'
# the parentheses create two groups: the username and the host
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print match.group() ## 'supes@g-mail.com' (the whole match)
    print match.group(1) ## 'supes' (the username)
    print match.group(2) ## 'g-mail.com' (the host)
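If you want both groups at once, match.groups() returns them as a tuple that can be unpacked. A minimal sketch, using my own variable names username and host:

```python
import re

text = 'contact superman at supes@g-mail.com'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

username = host = None
if match:
    # groups() returns every parenthesised group as a tuple
    username, host = match.groups()
    # group(0) is the same as group(): the whole match
    whole = match.group(0)

print(username)  # supes
print(host)      # g-mail.com
```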
findall()
re.findall() finds all matches of a given pattern in the string, as opposed to re.search(), which only finds the first match. When the pattern contains groups, findall() returns a list of tuples, one tuple per match.
import re
str = 'contact superman at supes@g-mail.com and batman at batsy@justice.com'
# with two groups in the pattern, findall() returns a list of (username, host) tuples
matches = re.findall(r'([\w.-]+)@([\w.-]+)', str)
for match in matches:
    print match
    # print match[0] # prints supes, batsy
    # print match[1] # prints g-mail.com, justice.com
## prints
# ('supes', 'g-mail.com')
# ('batsy', 'justice.com')
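What findall() returns depends on whether the pattern has groups. The sketch below runs the same text through the pattern with and without parentheses; the variable names are my own.

```python
import re

text = 'contact superman at supes@g-mail.com and batman at batsy@justice.com'

# No groups: a list of whole matches (plain strings)
whole = re.findall(r'[\w.-]+@[\w.-]+', text)
print(whole)  # ['supes@g-mail.com', 'batsy@justice.com']

# Two groups: a list of (username, host) tuples
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(pairs)  # [('supes', 'g-mail.com'), ('batsy', 'justice.com')]

# The tuples unpack nicely in a loop or a comprehension
usernames = [user for user, host in pairs]
print(usernames)  # ['supes', 'batsy']
```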
Bonus
Before ending this post, I want to add a point about greedy vs non-greedy matching in regular expressions, which I learned from Google’s Python course.
Let’s say we want to match html tags in the following string:
<b>boldman</b> and <i>italicman</i>
It’s common to come up with a solution like <.*>, which matches any string that starts with < and ends with >. However, .* is greedy: it grabs as much text as it can, so the pattern matches the whole string instead of the individual tags:
import re
str = '<b>boldman</b> and <i>italicman</i>'
# .* is greedy, so the match runs all the way to the last >
match = re.findall(r'<.*>', str)
if match:
    print match
# result
# ['<b>boldman</b> and <i>italicman</i>']
It can be fixed by adding ? after the *, which makes the repetition non-greedy, so it stops at the first > it reaches: <.*?>
import re
str = '<b>boldman</b> and <i>italicman</i>'
# .*? is non-greedy, so each match stops at the first >
match = re.findall(r'<.*?>', str)
if match:
    print match
# result
# ['<b>', '</b>', '<i>', '</i>']
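Another way to get the same result, which is a different technique from the non-greedy ?, is a negated character set: [^>]* means "any run of characters that are not >". Here is a sketch comparing all three patterns side by side.

```python
import re

text = '<b>boldman</b> and <i>italicman</i>'

# Greedy: one match that runs to the last >
greedy = re.findall(r'<.*>', text)
print(greedy)  # ['<b>boldman</b> and <i>italicman</i>']

# Non-greedy: each match stops at the first >
lazy = re.findall(r'<.*?>', text)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Negated set: the repetition cannot cross a > in the first place
negated = re.findall(r'<[^>]*>', text)
print(negated)  # ['<b>', '</b>', '<i>', '</i>']
```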
That’s all in this post. Thanks for reading 🙂