from lec_utils import *
Lecture 9¶
Regular Expressions¶
EECS 398: Practical Data Science, Winter 2025¶
practicaldsc.org • github.com/practicaldsc/wn25 • 📣 See latest announcements here on Ed
Agenda 📆¶
Today's lecture will mostly be about regular expressions. Good resources:
- regex101.com, a helpful site to have open while writing regular expressions.
- Python `re` library documentation and how-to.
The "how-to" is great, read it!
- regex "cheat sheet".
- These are all on the resources tab of the course website as well.
Motivation¶
Sending emails ✉️¶
- Suppose you run a club and have a list of members' names and emails, like so:
Sarah Mitchell (lew.bras2@gmail.com)
David Chen (chend5@umich.edu)
Julia Patel (sung.pat4@icloud.com)
Michael Torres (torrmik1@umich.edu)
Rebecca Nash (nash.reb3@hotmail.com)
Thomas Wright (wright.t5@icloud.com)
Amira Hassan (hassa.mra@umich.edu)
Kevin Zhang (zhang.k9@hotmail.com)
Lauren Cooper (coop.l14@icloud.com)
Daniel Park (parkde12@umich.edu)
Maria Rodriguez (rod.mar18@hotmail.com)
Andrew Lee (lee.and7@icloud.com)
Sophia Kim (spk1999@umich.edu)
Brandon Wu (wu.bran22@hotmail.com)
Rachel Thompson (thom.r11@icloud.com)
- How do you extract just their emails?
- What does `re.findall(r'\(([\w.]+@[\w.]+)\)', s)` do?
Basic regular expressions¶
Regular expressions¶
- A regular expression, or regex for short, is a sequence of characters used to match patterns in strings.
- For example, `\(\d{3}\) \d{3}-\d{4}` describes a pattern that matches US phone numbers of the form `'(XXX) XXX-XXXX'`.
- Think of regex as a "mini-language".
Formally, they are a grammar for describing a language.
- Pros ✅: They are very powerful and are widely used – virtually every programming language has a module for working with them.
- Cons ❌: They can be hard to read and have many different "dialects."
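As a quick illustration of the phone number pattern above, here is a minimal sketch using Python's `re` module. The sample string `text` is made up for this example.

```python
import re

# Pattern from above: '(XXX) XXX-XXXX', where each X is a digit.
# The parentheses are escaped because ( and ) are special characters.
phone_pattern = r'\(\d{3}\) \d{3}-\d{4}'

# Hypothetical text, invented for illustration.
text = 'Call (734) 555-0123 or (313) 555-0199, but not 734-555-0123.'

phone_numbers = re.findall(phone_pattern, text)
print(phone_numbers)  # ['(734) 555-0123', '(313) 555-0199']
```

The last number is skipped because it lacks the literal parentheses the pattern requires.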
Writing regular expressions¶
- You will ultimately write most of your regular expressions in Python, using the `re` module. We will see how to do so shortly.
- However, a useful tool for designing regular expressions is regex101.com.
- We will use it heavily during lecture; you should have it open as we work through examples. If you're trying to revisit this lecture in the future, you'll likely want to watch the recording; just looking at the notebook won't give you enough context.
Literals¶
A literal is a character that has no special meaning.
Letters, numbers, and some symbols are all literals.
Some symbols, like `.`, `*`, `(`, and `)`, are special characters.
*Example*: The regex `hey` matches the string `'hey'`. The regex `he.` also matches the string `'hey'`.
Regex building blocks 🧱¶
The four main building blocks for all regexes are shown below.
table source, inspiration.
operation | order of op. | example | matches ✅ | does not match ❌ |
---|---|---|---|---|
concatenation | 3 | `AABAAB` | 'AABAAB' | every other string |
or | 4 | `AA|BAAB` | 'AA', 'BAAB' | every other string |
closure (zero or more) | 2 | `AB*A` | 'AA', 'ABBBBBBA' | 'AB', 'ABABA' |
parentheses | 1 | `A(A|B)AAB` | 'AAAAB', 'ABAAB' | every other string |
parentheses | 1 | `(AB)*A` | 'A', 'ABABABABA' | 'AA', 'ABBA' |
Note that `|`, `(`, `)`, and `*` are special characters, not literals. They manipulate the characters around them.
*Example (or, parentheses)*:
- What does `EECS 280|398` match?
- What does `EECS (280|398)` match?
*Example (closure, parentheses)*:
- What does `eecs*` match?
- What does `(eecs)*` match?
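One way to check your answers to the questions above is Python's `re.fullmatch`, which returns a match object only when the entire string matches the pattern, and `None` otherwise. The helper name `fully_matches` is hypothetical, introduced just for this sketch.

```python
import re

def fully_matches(pattern, string):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# "or" has the lowest precedence, so this means 'EECS 280' OR '398'.
print(fully_matches(r'EECS 280|398', 'EECS 280'))    # True
print(fully_matches(r'EECS 280|398', '398'))         # True
print(fully_matches(r'EECS 280|398', 'EECS 398'))    # False

# Parentheses limit the "or" to the course number.
print(fully_matches(r'EECS (280|398)', 'EECS 398'))  # True

# Closure applies only to the character or group right before it.
print(fully_matches(r'eecs*', 'eecsss'))             # True
print(fully_matches(r'eecs*', 'eecseecs'))           # False
print(fully_matches(r'(eecs)*', 'eecseecs'))         # True
print(fully_matches(r'(eecs)*', ''))                 # True (zero repetitions)
```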
Activity
Write a regular expression that matches `'billy'`, `'billlly'`, `'billlllly'`, etc.
- First, think about how to match strings with any even number of `'l'`s, including zero `'l'`s (i.e. `'biy'`).
- Then, think about how to match only strings with a positive even number of `'l'`s.
✅ Click here to see the answer after you've tried it yourself at regex101.com.
`bi(ll)*y` will match any even number of `'l'`s, including 0.
To match only a positive even number of `'l'`s, we'd need to first "fix into place" two `'l'`s, and then follow that up with zero or more pairs of `'l'`s. This specifies the regular expression `bill(ll)*y`.
Activity
Write a regular expression that matches `'billy'`, `'billlly'`, `'biggy'`, `'biggggy'`, etc.
Specifically, it should match any string with a positive even number of `'l'`s in the middle, or a positive even number of `'g'`s in the middle.
✅ Click here to see the answer after you've tried it yourself at regex101.com.
Possible answers: `bi(ll(ll)*|gg(gg)*)y` or `bill(ll)*y|bigg(gg)*y`.
Note, `bill(ll)*|gg(gg)*y` is not a valid answer! This is because "concatenation" comes before "or" in the order of operations. This regular expression would match strings that match `bill(ll)*`, like `'billll'`, OR strings that match `gg(gg)*y`, like `'ggy'`.
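The precedence pitfall in this activity can be checked directly in Python. This is a minimal sketch using `re.fullmatch`, which requires the entire string to match; the variable names `good` and `bad` and the test words are made up for illustration.

```python
import re

good = r'bi(ll(ll)*|gg(gg)*)y'   # one of the valid answers
bad = r'bill(ll)*|gg(gg)*y'      # the invalid attempt

# The valid answer accepts a positive even number of l's or g's...
good_accepts = [w for w in ['billy', 'billlly', 'biggy', 'biggggy']
                if re.fullmatch(good, w)]
# ...and rejects everything else.
good_rejects = [w for w in ['biy', 'bily', 'bigy', 'billgy']
                if re.fullmatch(good, w) is None]

# The invalid attempt splits into 'bill(ll)*' OR 'gg(gg)*y'.
bad_accepts = [w for w in ['billll', 'ggy', 'billy', 'biggy']
               if re.fullmatch(bad, w)]

print(good_accepts)  # ['billy', 'billlly', 'biggy', 'biggggy']
print(good_rejects)  # ['biy', 'bily', 'bigy', 'billgy']
print(bad_accepts)   # ['billll', 'ggy']
```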
Intermediate regex¶
More regex syntax¶
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
wildcard | `.U.U.U.` | 'CUMULUS', 'JUGULUM' | 'SUCCUBUS', 'TUMULTUOUS' |
character class | `[A-Za-z][a-z]*` | 'word', 'Capitalized' | 'camelCase', '4illegal' |
at least one | `bi(ll)+y` | 'billy', 'billlllly' | 'biy', 'bily' |
between $i$ and $j$ occurrences | `m[aeiou]{1,2}m` | 'mem', 'maam', 'miem' | 'mm', 'mooom', 'meme' |
`.`, `[`, `]`, `+`, `{`, and `}` are also special characters, in addition to `|`, `(`, `)`, and `*`.
*Example (character classes, at least one)*: `[A-E]+` is just shortform for `(A|B|C|D|E)(A|B|C|D|E)*`.
*Example (wildcard)*:
- What does `.` match?
- What does `he.` match?
- What does `...` match?
*Example (at least one, closure)*:
- What does `123+` match?
- What does `123*` match?
*Example (number of occurrences)*: What does `tri{3, 5}` match? Does it match `'triiiii'`?
*Example (character classes, number of occurrences)*:
What does `[1-6a-f]{3}-[7-9E-S]{2}` match?
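To experiment with this syntax outside of regex101, here is a minimal Python sketch that tests whole strings against a few of the patterns from the table and examples above. The helper name `fully_matches` is hypothetical.

```python
import re

def fully_matches(pattern, string):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, string) is not None

# Character class + "at least one": one or more of the letters A through E.
print(fully_matches(r'[A-E]+', 'ABBA'))  # True
print(fully_matches(r'[A-E]+', ''))      # False (+ needs at least one)
print(fully_matches(r'[A-E]+', 'ABF'))   # False ('F' is outside A-E)

# {1,2}: one or two vowels between the two m's.
vowel_pattern = r'm[aeiou]{1,2}m'
matched = [w for w in ['mem', 'maam', 'miem', 'mm', 'mooom', 'meme']
           if fully_matches(vowel_pattern, w)]
print(matched)                           # ['mem', 'maam', 'miem']

# + and * apply only to the '3', not to all of '123'.
print(fully_matches(r'123+', '12'))      # False
print(fully_matches(r'123*', '12'))      # True
print(fully_matches(r'123+', '12333'))   # True
print(fully_matches(r'123+', '123123'))  # False
```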
Activity
Write a regular expression that matches any lowercase string that has a repeated vowel, such as `'noon'`, `'peel'`, `'festoon'`, or `'zeebraa'`.
✅ Click here to see the answer after you've tried it yourself at regex101.com.
One answer: `[a-z]*(aa|ee|ii|oo|uu)[a-z]*`
This regular expression matches strings of lowercase characters that have `'aa'`, `'ee'`, `'ii'`, `'oo'`, or `'uu'` in them anywhere. `[a-z]*` means "zero or more of any lowercase characters"; essentially we are saying it doesn't matter what letters come before or after the double vowels, as long as the double vowels exist somewhere.
Activity
Write a regular expression that matches any string that contains both a lowercase letter and a number, in any order. Examples include `'billy398'`, `'398!!billy'`, and `'bil3ly98'`.
✅ Click here to see the answer after you've tried it yourself at regex101.com.
One answer: `(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)`
We can break the above regex into two parts – everything before the `|`, and everything after the `|`.
The first part, `.*[a-z].*[0-9].*`, matches strings in which there is at least one lowercase character and at least one digit, with the lowercase character coming first.
The second part, `.*[0-9].*[a-z].*`, matches strings in which there is at least one lowercase character and at least one digit, with the digit coming first.
Note, the `.*` between the digit and letter classes is needed in the event the string has non-digit and non-letter characters.
This is the kind of task that would be easier to accomplish with regular Python string methods.
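For comparison, here is one possible string-method version of the same check, as a rough sketch rather than the only way to do it. The function name `has_lower_and_digit` and the test strings are made up for illustration; "lowercase letter" is taken to mean `a` through `z`, matching the `[a-z]` class in the regex.

```python
import re
import string

def has_lower_and_digit(s):
    """Return True if s contains at least one a-z letter and one 0-9 digit."""
    has_lower = any(ch in string.ascii_lowercase for ch in s)
    has_digit = any(ch in string.digits for ch in s)
    return has_lower and has_digit

pattern = r'(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)'
examples = ['billy398', '398!!billy', 'bil3ly98', 'BILLY398', 'billy', '']

# Both approaches should agree on every example.
regex_results = [re.fullmatch(pattern, s) is not None for s in examples]
method_results = [has_lower_and_digit(s) for s in examples]
print(regex_results)   # [True, True, True, False, False, False]
print(method_results)  # [True, True, True, False, False, False]
```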
Even more regex syntax¶
operation | example | matches ✅ | does not match ❌ |
---|---|---|---|
escape character | `umich\.edu` | 'umich.edu' | 'umich!edu' |
beginning of line | `^ark` | 'ark two', 'ark o ark' | 'dark' |
end of line | `ark$` | 'dark', 'ark o ark' | 'ark two' |
zero or one | `cat?` | 'ca', 'cat' | 'cart' (matches 'ca' only) |
built-in character classes* | `\w+` | 'billy' | 'this person' |
built-in character classes* | `\d+` | '231231' | '858 people' |
character class negation | `[^a-z]+` | 'WOLVERINE551', '1721$$' | 'porch', 'billy.edu' |
**Note*: in Python's implementation of regex,
- `\d` refers to digits.
- `\w` refers to alphanumeric characters (`[A-Za-z0-9_]`). Whenever we say "alphanumeric" in an assignment, we're referring to `\w`!
- `\s` refers to whitespace.
- `\b` is a word boundary.
*Example (escaping)*:
- What does `he.` match?
- What does `he\.` match?
- What does `(734)` match?
- What does `\(734\)` match?
*Example (anchors)*:
- What does `734-764` match?
- What does `^734-764` match?
- What does `734-764$` match?
*Example (built-in character classes)*:
- What does `\d{3} \d{3}-\d{4}` match?
- What does `\bcat\b` match? Does it find a match in `'my cat is hungry'`? What about `'concatenate'`, `'kitty cat'`, or `'in-the-cat-hat'`?
Remember, in Python's implementation of regex,
- `\d` refers to digits.
- `\w` refers to alphanumeric characters (`[A-Za-z0-9_]`). Whenever we say "alphanumeric" in an assignment, we're referring to `\w`!
- `\s` refers to whitespace.
- `\b` is a word boundary.
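The anchor and word-boundary examples above can be tried out in Python with `re.findall`, which returns every non-overlapping match it finds inside a string. This is a minimal sketch; the variable names are made up for illustration.

```python
import re

# ^ only matches at the start of the string, $ only at the end.
start_hits = re.findall(r'^ark', 'ark o ark')
start_in_dark = re.findall(r'^ark', 'dark')
end_in_dark = re.findall(r'ark$', 'dark')
print(start_hits)     # ['ark'] (only the first 'ark' is at the start)
print(start_in_dark)  # []
print(end_in_dark)    # ['ark']

# \b sits between a word character (\w) and a non-word character
# (or the start/end of the string). A hyphen is a non-word character.
texts = ['my cat is hungry', 'concatenate', 'kitty cat', 'in-the-cat-hat']
boundary_hits = {text: re.findall(r'\bcat\b', text) for text in texts}
for text, hits in boundary_hits.items():
    print(text, hits)
```

Only `'concatenate'` produces no match, because its `cat` is surrounded by other letters.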
Activity
Write a regular expression that matches any string that:
- is between 5 and 10 characters long, and
- is made up of only vowels (either uppercase or lowercase, including `'Y'` and `'y'`), periods, and spaces.
Examples include `'yoo.ee.IOU'` and `'AI.I oey'`.
✅ Click here to see the answer after you've tried it yourself at regex101.com.
One answer: `^[aeiouyAEIOUY. ]{5,10}$`
Key idea: Within a character class (i.e. `[...]`), special characters do not generally need to be escaped.
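The answer above, and the "key idea" about character classes, can be checked with a short Python sketch. The extra test strings (`'aeio'`, `'hello'`, and so on) are made up for illustration.

```python
import re

pattern = r'^[aeiouyAEIOUY. ]{5,10}$'

candidates = ['yoo.ee.IOU', 'AI.I oey', 'aeio', 'aeiouaeioua', 'hello']
accepted = [c for c in candidates if re.fullmatch(pattern, c)]
print(accepted)  # ['yoo.ee.IOU', 'AI.I oey']
# 'aeio' is too short, 'aeiouaeioua' is too long, 'hello' has consonants.

# Outside a character class, . is a wildcard; inside [...], it is a literal period.
wildcard_matches_x = re.fullmatch(r'.', 'x') is not None
class_matches_x = re.fullmatch(r'[.]', 'x') is not None
print(wildcard_matches_x)  # True
print(class_matches_x)     # False
```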
Regex in Python¶
re in Python¶
- The `re` module is built into Python. It allows us to use regular expressions to find, extract, and replace strings.
import re
- `re.findall` takes in a string `regex` and a string `text` and returns a list of all matches of `regex` in `text`. You'll use this most often.
re.findall('AB*A',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
['ABBBA', 'ABBBBBBBA']
- `re.sub` takes in a string `regex`, a string `repl`, and a string `text`, and replaces all matches of `regex` in `text` with `repl`.
re.sub('AB*A',
'billy',
'here is a string for you: ABBBA. here is another: ABBBBBBBA')
'here is a string for you: billy. here is another: billy'
Raw strings¶
When using regular expressions in Python, it's a good idea to use raw strings, denoted by an `r` before the quotes, e.g. `r'exp'`.
re.findall('\bcat\b', 'my cat is hungry')
[]
re.findall(r'\bcat\b', 'my cat is hungry')
['cat']
# Huh?
print('\bcat\b')
cat
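The reason for the difference: in a regular Python string, `'\b'` is an escape sequence for a single backspace character, so the regex engine never sees a backslash followed by `b`. A raw string leaves the backslash alone. A minimal sketch comparing the two (the variable names are made up for illustration):

```python
regular = '\bcat\b'   # Python converts each \b into one backspace character
raw = r'\bcat\b'      # backslashes are kept, so re sees the \b word boundary

print(len(regular))               # 5: backspace, c, a, t, backspace
print(len(raw))                   # 7: \, b, c, a, t, \, b
print(regular == '\x08cat\x08')   # True (\x08 is the backspace character)
print(raw == '\\bcat\\b')         # True (same as doubling every backslash)
```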
Capturing and non-capturing groups¶
- Surround a regex with `(` and `)` to define a capture group within a pattern. Capture groups are useful for extracting relevant parts of a string.
re.findall(r'\w+@(\w+)\.edu',
'my old email was billy@notumich.edu, my new email is notbilly@umich.edu')
['notumich', 'umich']
- Notice what happens if we remove the `(` and `)`!
re.findall(r'\w+@\w+\.edu',
'my old email was billy@notumich.edu, my new email is notbilly@umich.edu')
['billy@notumich.edu', 'notbilly@umich.edu']
- Earlier, we also saw that parentheses can be used to group parts of a regex together. When using `re.findall`, all groups are treated as capturing groups.
# A regex that matches strings with two of the same vowel followed by 3 digits.
# We only want to capture the digits, but...
re.findall(r'(aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
[('oo', '124')]
- To specify that we don't want to capture a particular group, use `?:` inside the parentheses at the start. `?:` specifies a non-capturing group.
re.findall(r'(?:aa|ee|ii|oo|uu)(\d{3})', 'eeoo124')
['124']
Example: Extracting hashtags¶
- The dataset `'data/ira.csv'` contains tweets tagged by Twitter as likely being posted by the Internet Research Agency, the tweet factory facing allegations for attempting to influence US political elections.
For more context, read this Wikipedia article.
tweets = pd.read_csv('data/ira.csv', names=['id', 'user', 'time', 'text'])
tweets.head()
 | id | user | time | text |
---|---|---|---|---|
0 | 3906258 | ea85ac8be1e8ab479064ca4c0fe3ac6587f76b1ef97452... | 2016-11-16 09:04 | The Best Exercise To Lose Belly Fat In 2 weeks... |
1 | 1051443 | 8e58ab0f46d273103d9e71aa92cdaffb6e330ec7d15ae5... | 2016-12-24 04:31 | RT @Philanthropy: Dozens of ‘hate groups’ have... |
2 | 2823399 | Room Of Rumor | 2016-08-18 20:26 | Artificial intelligence can find, map poverty,... |
3 | 272878 | San Francisco Daily | 2016-03-18 19:28 | Uber balks at rules proposed by world’s busies... |
4 | 7697802 | 41bb9ae5991f53996752a0ab8dd36b543821abca8d5aed... | 2016-07-30 15:44 | RT @dirtroaddiva1: #IHatePokemonGoBecause he ... |
tweets.shape
(90000, 4)
- Question: What are the most common hashtags among all 90,000 tweets?
A hashtag is any alphanumeric string beginning with `'#'`, e.g. `'#GoBlue'`.
Extracting hashtags¶
- Most Series `.str` operations support regular expressions.
We can use `re.findall` to find all of the hashtags in a particular string.
example_tweet = tweets['text'].iloc[0]
example_tweet
'The Best Exercise To Lose Belly Fat In 2 weeks https://t.co/oHFToG7rh6 #Exercise #LoseBellyFat #CatTV #TeenWolf… https://t.co/b4pr9gEx38'
re.findall(r'#(\w+)', example_tweet)
['Exercise', 'LoseBellyFat', 'CatTV', 'TeenWolf']
re.findall(r'#(\w+)', 'hey there, no hashtags here')
[]
- We can use the Series `str.findall` method, with the regular expression above, to extract hashtags out of each tweet in `tweets['text']`.
tags = tweets['text'].str.findall(r'#(\w+)')
tags.head()
0    [Exercise, LoseBellyFat, CatTV, TeenWolf]
1    []
2    [tech]
3    [news]
4    [IHatePokemonGoBecause, PokesAreJokes]
Name: text, dtype: object
- We can use the `explode` method on the above Series to separate each list into individual elements.
(
tags
.explode()
.value_counts()
.head(15)
.sort_values()
.plot(kind='barh', title='Most Common Hashtags in IRA Tweets')
)
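To see what `str.findall`, `explode`, and `value_counts` each contribute, here is a small sketch on a toy Series. The three tweets in `toy_tweets` are invented for illustration and are not from `data/ira.csv`.

```python
import pandas as pd

# Hypothetical tweets, made up for this example.
toy_tweets = pd.Series([
    'Go team! #GoBlue #Michigan',
    'no hashtags here',
    'Game day #GoBlue',
])

# Step 1: one list of hashtags per tweet (an empty list if there are none).
toy_tags = toy_tweets.str.findall(r'#(\w+)')
print(toy_tags.tolist())  # [['GoBlue', 'Michigan'], [], ['GoBlue']]

# Step 2: explode gives each hashtag its own row; empty lists become NaN.
exploded = toy_tags.explode()

# Step 3: value_counts tallies the hashtags and skips NaN by default.
counts = exploded.value_counts()
print(counts['GoBlue'], counts['Michigan'])  # 2 1
```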
Followup questions¶
- Which accounts were tagged most often?
For example, in the tweet `'I love being a @UMich student'`, user `'UMich'` is tagged.
- Which accounts tweeted most often?
- Which websites were linked most often?
- Why were these hashtags used by these accounts?
Again, read the linked Wikipedia article, and do a bit of your own research! These tweets aren't by a random sample of Twitter users.
- Web servers typically record every request made of them in the "logs".
s = '''132.249.20.188 - - [01/Oct/2024:2:36:15 -0400] "GET /my/home/ HTTP/1.1" 200 2585'''
- Let's use our new regex syntax (including capturing groups) to extract the day, month, year, and time from the log string `s`.
exp = r'\[(.+)\/(.+)\/(.+):(.+):(.+):(.+) .+\]'
re.findall(exp, s)
[('01', 'Oct', '2024', '2', '36', '15')]
- While the above regex works, it is not very specific. It also matches incorrectly formatted log strings.
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(exp, other_s)
[('adr', 'jduy', 'wffsdffs', 'r4s4', '4wsgdfd', 'asdf')]
- Be as specific in your pattern matching as possible – you don't want to match and extract strings that don't fit the pattern you care about.
`.*` matches every possible string, but we don't use it very often.
- A better date extraction regex:
`\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]`
- `\d{2}` matches any two consecutive digits.
- `[A-Z]{1}` matches any single occurrence of any uppercase letter.
- `[a-z]{2}` matches any 2 consecutive occurrences of lowercase letters.
- Remember, special characters like `[` and `]` need to be escaped with `\`. The `/` is escaped here as well; Python's `re` module does not require that, but it is harmless.
s
'132.249.20.188 - - [01/Oct/2024:2:36:15 -0400] "GET /my/home/ HTTP/1.1" 200 2585'
new_exp = r'\[(\d{2})\/([A-Z]{1}[a-z]{2})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) -\d{4}\]'
re.findall(new_exp, s)
[]
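- A benefit of `new_exp` over `exp` is that it doesn't capture anything when the string doesn't follow the format we specified. Note that this includes `s` itself: its hour is written as the single digit `2`, so the `(\d{2})` hour group fails and `re.findall(new_exp, s)` returns `[]`.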
other_s
'[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'
re.findall(new_exp, other_s)
[]
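If single-digit hours like the `2` in `s` should count as valid, one possible adjustment is to let the hour group accept one or two digits. This is a sketch under that assumption about the log format, not a definitive pattern; the name `relaxed_exp` is hypothetical.

```python
import re

s = '''132.249.20.188 - - [01/Oct/2024:2:36:15 -0400] "GET /my/home/ HTTP/1.1" 200 2585'''
other_s = '[adr/jduy/wffsdffs:r4s4:4wsgdfd:asdf 7]'

# Same idea as new_exp, except the hour group is \d{1,2} instead of \d{2}
# (an assumption: hours may be written with one or two digits).
relaxed_exp = r'\[(\d{2})/([A-Z][a-z]{2})/(\d{4}):(\d{1,2}):(\d{2}):(\d{2}) -\d{4}\]'

matches_s = re.findall(relaxed_exp, s)
matches_other = re.findall(relaxed_exp, other_s)
print(matches_s)      # [('01', 'Oct', '2024', '2', '36', '15')]
print(matches_other)  # []
```

The pattern still rejects `other_s`, so it stays specific while tolerating the one variation we chose to allow.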