from lec_utils import *
Agenda¶
- Recap: Handling missing values.
- Introduction to HTTP.
- The structure of HTML.
- Parsing HTML.
- Example: Scraping quotes.
- Example: Scraping the Happening @ Michigan page.
Recap: Handling missing values¶
Summary of imputation techniques¶
- Consider whether values are missing intentionally, or whether there's a default replacement.
- Listwise deletion.
Drop, or ignore, missing values.
- (Conditional) mean imputation.
Fill in missing values with the mean of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with the observed mean separately for each category.
- (Conditional) Probabilistic imputation.
Fill in missing values with a random sample of observed values. If there's a reason to believe the missingness depends on another categorical column, fill in missing values with a random sample drawn separately for each category.
- Regression imputation.
Predict missing values using other features.
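As a rough sketch of the first few techniques above, assume a small made-up DataFrame with the 'child' and 'gender' columns from the heights example. The values, the seed, and the variable names are invented for illustration; regression imputation is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two missing 'child' heights, one per gender.
heights = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'child': [62.0, np.nan, 64.0, 70.0, 72.0, np.nan],
})

# Listwise deletion: drop rows where 'child' is missing.
dropped = heights.dropna(subset=['child'])

# Mean imputation: fill with the overall observed mean.
mean_filled = heights['child'].fillna(heights['child'].mean())

# Conditional mean imputation: fill with the observed mean within each gender.
group_means = heights.groupby('gender')['child'].transform('mean')
cond_filled = heights['child'].fillna(group_means)

# Probabilistic imputation: fill with a random sample of observed values.
rng = np.random.default_rng(23)
observed = heights['child'].dropna().to_numpy()
missing = heights['child'].isna()
prob_filled = heights['child'].copy()
prob_filled.loc[missing] = rng.choice(observed, size=missing.sum())
```

Mean imputation gives both missing rows the same value, while the conditional version gives each gender its own mean. The probabilistic version preserves more of the spread of the observed values.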
Activity
Work on UCSD DSC 80 Spring 2022 Final Exam, Problem 5 (Parts 1-4).
Missingness mechanisms¶
There are three key missingness mechanisms, which describe how data in a column can be missing.
- Missing completely at random (MCAR): Data are MCAR if the chance that a value is missing is completely independent of other columns and the actual missing value.
Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope, meaning that 5 students have missing grades. MCAR is ideal, but rare!
- Missing at random (MAR): Data are MAR if the chance that a value is missing depends on other columns.
Example: Suppose that after the Midterm Exam, I randomly choose 5 scores to delete on Gradescope among sophomore students. Now, scores are missing at random dependent on class standing.
- Not missing at random (NMAR): Data are NMAR if the chance that a value is missing depends on the actual missing value itself.
Example: Suppose that after the Midterm Exam, I randomly delete 5 of the 10 lowest scores on Gradescope. Now, scores are not missing at random, since the chance a value is missing depends on how large it is.
- Statistical imputation packages usually assume data are MAR.
MCAR is usually unrealistic to assume. If data are NMAR, you can't impute missing values, since the other features in your data can't explain the missingness.
How do we know if data are MCAR?¶
- It seems that if our data are MCAR, there is no risk to dropping missing values.
In the MCAR setting, just imagine we're being given a large, random sample of the true dataset.
- If the data are not MCAR, though, then dropping the missing values will introduce bias.
For instance, suppose we asked people "How much do you give to charity?" People who give little are less likely to respond, so the average response is biased high.
- There is no perfect procedure for determining if our data are MCAR, MAR, or NMAR; we mostly have to use our understanding of how the data is generated.
- But, we can try to determine whether $Y_\text{missing}$ is similar to $Y$, using the information we do have in other columns.
We did this earlier, when looking at the proportion of missing 'child' heights for each 'gender'.
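A minimal sketch of that check, assuming a made-up DataFrame with 'child' and 'gender' columns (the values are invented, and child_missing is a helper column introduced just for this example):

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'child' heights are missing more often for one gender.
heights = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'child': [62.0, np.nan, np.nan, 64.0, 70.0, 72.0, np.nan, 69.0],
})

# Proportion of missing 'child' values within each 'gender'.
missing_by_gender = (
    heights
    .assign(child_missing=heights['child'].isna())
    .groupby('gender')['child_missing']
    .mean()
)
```

If these proportions differ a lot across groups, that is evidence against MCAR, since missingness appears to depend on another column.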
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
What lingering questions do we have about how to handle missing values?
Introduction to HTTP¶
Data sources¶
- Often, the data you need doesn't exist in "clean" .csv files.
- Solution: Collect your own data from the internet!
For most questions you can think of, the answer exists somewhere on the internet. If not, you can run your own survey – also on the internet!
Manual copy-pasting¶
- If data is already nicely formatted in a table online, sometimes we can easily copy it and paste it into a .csv or .tsv file.
.tsv stands for "tab-separated values", just like .csv stands for "comma-separated values."
- For example, open the 2024 Michigan Football schedule here and click "Text Only".
This is what you should see.
- Copy the text in the table at the bottom and save it in a file named 2024-schedule.tsv in your data folder.
You may need to do some minor reformatting in the .tsv file before this works.
As a challenge, see if you can find a way to do this entirely within your Terminal, i.e. without opening a text editor!
schedule = pd.read_csv('data/2024-schedule.tsv', sep='\t')
schedule.head()
 | Date | Time | At | Opponent | Location | Tournament | Result |
---|---|---|---|---|---|---|---|
0 | Aug 31 (Sat) | 7:30 PM | Home | Fresno State | Ann Arbor, Mich. | NaN | W 30-10 |
1 | Sep 7 (Sat) | Noon | Home | #3 Texas | Ann Arbor, Mich. | NaN | L 12-31 |
2 | Sep 14 (Sat) | Noon | Home | Arkansas State | Ann Arbor, Mich. | NaN | W 28-18 |
3 | Sep 21 (Sat) | 3:30 PM | Home | #11 USC | Ann Arbor, Mich. | NaN | W 27-24 |
4 | Sep 28 (Sat) | Noon | Home | Minnesota | Ann Arbor, Mich. | NaN | - |
- For Wikipedia specifically, you can use Wikitable2CSV, which converts Wikipedia tables to .csv files for you.
Programmatically accessing data¶
- We won't always be able to copy-paste tables from online, and even when we can, it's not easily reproducible.
What if mgoblue.com didn't have a "Text Only" option? Or what if the schedule changes – how can I prevent myself from having to copy-and-paste again?
- There are two ways to programmatically access data on the internet: either by scraping, or through an API.
By programmatically, we mean by writing code.
- We will discuss the differences between the two approaches, but for now, the important part is that they both use HTTP.
- Note: We'll cover the essentials of how the internet works for the purposes of accessing data, but for more details, take EECS 485.
Goal¶
- Our goal in today's lecture will be to programmatically create a DataFrame containing all of the events at the Happening @ Michigan page, taken from events.umich.edu.
- Somehow, we'll start with the link to a webpage that looks like:
- And end with a DataFrame like this one, all just by writing code:
 | title | time | location |
---|---|---|---|
0 | International Students Career Series: Coffee Chat with the University Career Center | Sep 23, 2024 9:00am | University Career Center, 3200 Student Activities Building, Program Room (3003), 515 E Jefferson St, Ann Arbor, MI, United States |
1 | Alaska Teachers & Personnel Informational Meeting | Sep 23, 2024 10:00am | |
2 | Michigan in Washington Fall 2024 Application Deadline | Sep 23, 2024 10:00am | |
3 | EEB Prelim Seminar Series - Evolution of “Collecting” Behavior in Deep Sea Carrier Snails | Sep 23, 2024 10:30am | Biological Sciences Building |
4 | Huron Affinity Group Overview (iMatter Teams) | Sep 23, 2024 11:00am |
The request-response model¶
- HTTP stands for Hypertext Transfer Protocol.
It was developed in 1989 by Tim Berners-Lee (and friends). The "S" in HTTPS stands for "secure".
- HTTP follows the request-response model.
- A request is made by the client.
- A response is returned by the server.
- Example: YouTube search 🎥.
- Consider the following URL: https://www.youtube.com/results?search_query=iphone+16+pro+unboxing.
- Your web browser, a client, makes an HTTP request with a search query.
- The server, YouTube, is a computer that is sitting somewhere else.
- The server returns a response that contains the search results.
- Note: ?search_query=iphone+16+pro+unboxing is called a "query string."
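To make the query string concrete, here is a small sketch that builds and then takes apart the YouTube URL above using Python's standard urllib.parse module. No request is sent; the variable names are just for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base = 'https://www.youtube.com/results'
params = {'search_query': 'iphone 16 pro unboxing'}

# urlencode turns the dictionary into a query string, replacing spaces with '+'.
query_string = urlencode(params)
url = f'{base}?{query_string}'

# Going the other way: split the URL and recover the parameters.
parts = urlsplit(url)
recovered = parse_qs(parts.query)
```

Here, url matches the URL in the example, and recovered maps 'search_query' to a list containing the original search text.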
Consequences of the request-response model¶
- When a request is sent to view content on a webpage, the server must:
- process your request (i.e. prepare data for the response).
- send content back to the client in its response.
- Remember, servers are computers. Someone has to pay to keep these computers running.
Every time you access a website, someone has to pay.
- If you make too many requests, the server may block your IP address, or you may even take down the website!
A journalist scraped and accidentally took down the Cook County Inmate Locator, and as a result, inmates' families weren't able to contact them while the site was down.
HTTP request methods¶
- The request methods you will use most often are GET and POST; see Mozilla's web docs for a detailed list of request methods.
- GET is used to request data from a specified resource.
Almost all of the requests we'll make in this class are GET requests.
To load websites, your web browser uses a lot of GET requests!
- POST is used to send data to the server.
For example, uploading a photo to Instagram or entering credit card information on Amazon.
- You can make requests directly in your Terminal using the curl command, which you'll learn more about in EECS 485. Here, we'll make requests using the requests Python module!
There are other packages that work similarly (e.g. urllib), but requests is arguably the easiest to use.
import requests
Example: GET requests via requests¶
- For instance, let's access the source code of the events page, https://events.umich.edu.
res = requests.get('https://events.umich.edu')
- res is now a Response object.
res
<Response [200]>
- The text attribute of res is a string containing the entire response.
type(res.text)
str
len(res.text)
219874
print(res.text[:2000])
<!DOCTYPE html> <html lang="en"> <!-- ::: ::: ::: ::: :::::::::: ::: ::: :::::::::: :::: ::: ::::::::::: :::::::: :+: :+: :+:+: :+:+: :+: :+: :+: :+: :+:+: :+: :+: :+: :+: +:+ +:+ +:+ +:+:+ +:+ +:+ +:+ +:+ +:+ :+:+:+ +:+ +:+ +:+ +#+ +:+ +#+ +:+ +#+ +#++:++# +#+ +:+ +#++:++# +#+ +:+ +#+ +#+ +#++:++#++ +#+ +#+ +#+ +#+ +#+ +#+ +#+ +#+ +#+ +#+#+# +#+ +#+ #+# #+# #+# #+# #+# #+#+#+# #+# #+# #+#+# #+# #+# #+# ######## ### ### ########## ### ########## ### #### ### ######## Version: 6.0 - Mustard's Retreat --> <head> <meta name="viewport" content="initial-scale=1, maximum-scale=1"> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/> <title>Happening at the University of Michigan | Happening @ Michigan</title> <link rel="icon" sizes="16x16" href="/favicon.ico" /> <meta property="og:title" content="Happening @ Michigan" /> <meta property="og:image" content="default-image.png" /> <link rel="stylesheet" href="/css/jquery-ui-custom.css" /> <link rel="stylesheet" href="/bundles/umevents/css/jquery-ui.structure.min.css" /> <link rel="stylesheet" href="/bundles/umevents/css/jquery-ui.theme.min.css" /> <link rel="stylesheet" href="/css/main.css" /> <script type="text/javascript" src="/js-dist/jquery.min.js"></script> <script type="text/javascript" src="/js-dist/jquery-ui.min.js"></script> <script type="text/javascript" src="/js/modals.js"></script> <script type="text/javascript" src="/js/infoPoint.js"></script> <script type="text/javascript" src="/js/jquery.unveil.js"></script> <script type="text/javascript" src="/js/jquery.windowaction.js"></script> <script type="text/javas
- The response is a string containing HTML, the markup language used to format information on the internet. The events data we're looking for is in res.text somewhere, but we have to search for it and extract it.
- The following call to requests.post makes a POST request to https://httpbin.org/post, with a 'name' parameter of 'Go Blue'.
post_res = requests.post('https://httpbin.org/post',
data={'name': 'Go Blue'})
post_res
<Response [200]>
post_res.text
'{\n "args": {}, \n "data": "", \n "files": {}, \n "form": {\n "name": "Go Blue"\n }, \n "headers": {\n "Accept": "*/*", \n "Accept-Encoding": "gzip, deflate, br", \n "Content-Length": "12", \n "Content-Type": "application/x-www-form-urlencoded", \n "Host": "httpbin.org", \n "User-Agent": "python-requests/2.32.3", \n "X-Amzn-Trace-Id": "Root=1-66f2db61-2d8211151c1be3905e7cf53c"\n }, \n "json": null, \n "origin": "35.3.45.217", \n "url": "https://httpbin.org/post"\n}\n'
- Now, the response is a string describing a JSON object. We'll learn how to work with these later in the lecture and into Lecture 10, but for now, note that we can use the .json() method to convert it to a Python dictionary.
post_res.json()
{'args': {}, 'data': '', 'files': {}, 'form': {'name': 'Go Blue'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Length': '12', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.3', 'X-Amzn-Trace-Id': 'Root=1-66f2db61-2d8211151c1be3905e7cf53c'}, 'json': None, 'origin': '35.3.45.217', 'url': 'https://httpbin.org/post'}
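For intuition, calling .json() on a response gives roughly what Python's built-in json.loads gives when applied to the response text. The sketch below uses a shortened, hand-copied piece of the response above rather than a live request, so response_text is an assumption standing in for post_res.text.

```python
import json

# A shortened stand-in for post_res.text (only a few of the keys shown above).
response_text = '{"form": {"name": "Go Blue"}, "json": null, "url": "https://httpbin.org/post"}'

# json.loads converts a JSON string into Python dictionaries, lists, strings, etc.
parsed = json.loads(response_text)
name = parsed['form']['name']
```

Note that JSON's null becomes Python's None, which is why the 'json' key maps to None in the dictionary above.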
- What happens when we try and make a POST request somewhere where we're unable to?
yt_res = requests.post('https://youtube.com',
data={'name': 'Go Blue'})
yt_res
<Response [400]>
# This takes the text of yt_res and renders it as an HTML document within our notebook!
HTML(yt_res.text)
HTTP status codes¶
- When we request data from a website, the server includes an HTTP status code in the response.
- The most common status code is 200, which means there were no issues.
- Other times, you will see a different status code, describing some sort of event or error.
- Common examples: 403: forbidden, 404: page not found, 500: internal server error.
- The first digit of a status code describes its general "category."
- For example, the CSE faculty page doesn't let us scrape it.
Nothing is stopping us from opening Chrome, clicking "View Page Source", and manually downloading the HTML, though!
res = requests.get('https://cse.engin.umich.edu/people/faculty/')
res.status_code
403
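As a small sketch of how status codes fall into categories, Python's standard http module knows the official phrase for each code. The CATEGORIES dictionary and the describe_status helper are made up for this example; the category names follow the usual convention for each leading digit.

```python
from http import HTTPStatus

# The first digit of a status code determines its general category.
CATEGORIES = {
    1: 'informational',
    2: 'success',
    3: 'redirection',
    4: 'client error',
    5: 'server error',
}

def describe_status(code):
    """Return a short description like '403 Forbidden (client error)'."""
    phrase = HTTPStatus(code).phrase
    category = CATEGORIES[code // 100]
    return f'{code} {phrase} ({category})'

descriptions = [describe_status(code) for code in [200, 403, 404, 500]]
```

With a real response, you could pass res.status_code to describe_status to get a readable summary of what happened.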
- As an aside, you can render HTML directly in a notebook using the IPython.display.HTML class.
from IPython.display import HTML
HTML(res.text)
403 Forbidden
- Sometimes, websites either don't want you to scrape, or prohibit you from scraping.
It's best practice to check the website's robots.txt file, where they specify who is and isn't allowed to scrape.
As we saw on the previous slide, the CSE website blocks us from scraping it, as we got a 403: Forbidden status code.
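One way to check robots.txt rules in code is Python's standard urllib.robotparser module. The sketch below parses a tiny, made-up robots.txt (the rules and the example.com URLs are assumptions, not taken from any real site) instead of downloading one.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: everyone is asked to stay out of /private/.
robots_lines = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(robots_lines)

# can_fetch(useragent, url) says whether that user agent may access the URL.
allowed_events = parser.can_fetch('*', 'https://example.com/events')
allowed_private = parser.can_fetch('*', 'https://example.com/private/data')
```

For a real site, RobotFileParser can also download the file itself via its set_url and read methods, at the cost of one extra request.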
- Some unsuccessful requests can be re-tried, depending on the issue.
A good first step is to wait a little, then try again.
- A common issue is that you're making too many requests to a particular server at a time. If this is the case, you are being rate-limited; one solution is to increase the time between each request.
You can even do this programmatically, say, using time.sleep.
- See LDS 14 for more examples.
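Here is one possible sketch of "wait a little, then try again" using time.sleep. The fetch_with_retries function and the fake_fetch stub are hypothetical names introduced for this example. The stub pretends to be a server that rate-limits the first two requests (status 429, "Too Many Requests"), so nothing is actually sent over the network.

```python
import time

def fetch_with_retries(fetch, url, max_tries=3, wait_seconds=1.0):
    """Call fetch(url) until it returns status 200 or we run out of tries."""
    for attempt in range(1, max_tries + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if attempt < max_tries:
            # Wait longer after each failed attempt before trying again.
            time.sleep(wait_seconds * attempt)
    raise RuntimeError(f'{url} still failing after {max_tries} tries (last status: {status})')

# A stand-in for a real request: fails twice with 429, then succeeds.
fake_responses = iter([(429, ''), (429, ''), (200, '<html>ok</html>')])

def fake_fetch(url):
    return next(fake_responses)

body = fetch_with_retries(fake_fetch, 'https://example.com', wait_seconds=0)
```

In practice, fetch could be a small wrapper around requests.get that returns the status code and text of the response, and wait_seconds would be greater than zero.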
The structure of HTML¶
Scraping vs. APIs¶
- We mentioned earlier that there are two ways to programmatically access data on the internet: either by scraping, or through an API.
- Scraping is the act of emulating a web browser to access a website's HTML source code.
When scraping, you get back data as HTML and have to parse that HTML to extract the information you want. Parse means to "extract meaning from a sequence of symbols".
✅ Pros | ❌ Cons |
---|---|
If the website exists, you can usually scrape it. This is what Google does! | Scraping and parsing code gets messy, since HTML documents contain lots of content unrelated to the information you're trying to find (advertisements, formatting). When the website's structure changes, your code will need to, too. The site owner may not want you to scrape it! |
- An application programming interface, or API, is a service that makes data directly available to the user in a convenient fashion. Usually, APIs give us data back as JSON objects.
APIs are made by organizations that host data. For example, X (formerly known as Twitter) has an API, as does OpenAI, the creators of ChatGPT.
✅ Pros | ❌ Cons |
---|---|
If an API exists, the data are usually clean, up-to-date, and ready to use. The presence of an API signals that the data provider is okay with you using their data. The data provider can plan and regulate data usage. Sometimes, you may need to create an API "key", which is like an account for using the API. APIs can often give you access to data that isn't publicly available. | APIs don't always exist for the data you want! |
- We're learning how to scrape now, and will see how to use APIs after.
What is HTML?¶
- HTML (Hypertext Markup Language) is the basic building block of the internet.
- It is a markup language, not a programming language.
Markup languages specify what something should look like, while programming languages specify what something should calculate or do.
- Specifically, it defines the content and layout of a webpage, and as such, it is what you get back when you scrape a webpage.
- We're only going to learn enough HTML to help us scrape information.
See this tutorial for more details on HTML.
An example webpage¶
- For instance, here's the source code of a very basic webpage.
!cat data/lec09_ex1.html
<html>
  <head>
    <title>Page title</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a paragraph.</p>
    <p>This is <b>another</b> paragraph.</p>
  </body>
</html>
- Here's what that webpage actually looks like:
HTML('data/lec09_ex1.html')
This is a heading
This is a paragraph.
This is another paragraph.
The anatomy of HTML documents¶
- HTML document: The totality of markup that makes up a webpage.
- Document Object Model (DOM): The internal representation of an HTML document as a hierarchical tree structure.
- HTML element: An object in the DOM, such as a paragraph, header, or title.
- HTML tags: Markers that denote the start and end of an element, such as <p> and </p>.
- Attributes: Some tags have attributes, which further specify how to display information.
<p style="color: red">Look at my red text!</p>
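To see the tree structure behind these definitions, the sketch below prints an indented outline of the basic example webpage from earlier. It uses Python's built-in html.parser module purely for illustration; this is not the parsing tool used in the rest of the lecture, and the TreePrinter class is a hypothetical helper.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each tag and piece of text, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + f'<{tag}>')
        self.depth += 1  # children of this element sit one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # closing a tag moves us back up the tree

    def handle_data(self, data):
        if data.strip():  # ignore whitespace between tags
            self.lines.append('  ' * self.depth + repr(data.strip()))

page = ('<html> <head> <title>Page title</title> </head> <body> '
        '<h1>This is a heading</h1> <p>This is a paragraph.</p> '
        '<p>This is <b>another</b> paragraph.</p> </body> </html>')

tree = TreePrinter()
tree.feed(page)
tree.close()
outline = '\n'.join(tree.lines)
```

Printing outline shows <head> and <body> nested one level under <html>, with the <b> element nested inside the second paragraph.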
Example: Pages and trees¶
- For the following webpage:
- The DOM tree would look like:
- Why do we care? Extracting information out of an HTML document will involve traversing this tree.
- See the attached reference slides for examples of common tags.
- Often, the information we're looking for is nestled in one of these tags:
Element | Description |
---|---|
<html> | the document |
<head> | the header |
<body> | the body |
<div> | a logical division of the document |
<span> | an inline logical division |
<p> | a paragraph |
<a> | an anchor (hyperlink) |
<h1>, <h2>, ... | header(s) |
<img> | an image |
- There are many, many more. See this article for examples.
- Tags can have attributes, which further specify how to display information on a webpage.
- For instance, <img> tags have src and alt attributes, among others:
<img src="cool-visualization.png" alt="My box plot that I'm super proud of." width=500>
- Hyperlinks have href attributes:
Click <a href="https://study.practicaldsc.org">this link</a> to access past exams.
- The <div> tag is one of the more common tags. It defines a "section" of an HTML document, and is often used as a container for other HTML elements.
Think of <div>s like cells in Jupyter Notebooks.
<div class="background">
<h3>This is a heading</h3>
<p>This is a paragraph.</p>
</div>
- Often, the information we're looking for is stored in an attribute!
You can imagine a situation where we want to get the URL behind a button, for example.
Question 🤔 (Answer at practicaldsc.org/q)
Remember that you can always ask questions anonymously at the link above!
What lingering questions do we have about the Document Object Model and the structure of HTML?
Parsing HTML¶
Beautiful Soup 🍜¶
- Beautiful Soup 4 is a Python HTML parser.
Remember, parse means to "extract meaning from a sequence of symbols".
- Warning: Beautiful Soup 4 and Beautiful Soup 3 work differently, so make sure you are using and looking at documentation for Beautiful Soup 4.
Rest assured, the pds conda environment already has Beautiful Soup 4 installed.
Example HTML document¶
- To start, we'll work with the source code for an HTML page with the DOM tree shown below:
- The string html_string contains an HTML "document".
html_string = '''
<html>
<body>
<div id="content">
<h1>Heading here</h1>
<p>My First paragraph</p>
<p>My <em>second</em> paragraph</p>
<hr>
</div>
<div id="nav">
<ul>
<li>item 1</li>
<li>item 2</li>
<li>item 3</li>
</ul>
</div>
</body>
</html>
'''.strip()
HTML(html_string)
Heading here
My First paragraph
My second paragraph
Instantiating BeautifulSoup objects¶
- bs4's BeautifulSoup constructor takes in a string or file-like object representing HTML and returns a parsed document.
# We also could have used:
# import bs4
# But, then we'd need to use bs4.BeautifulSoup every time.
from bs4 import BeautifulSoup
BeautifulSoup?
- Normally, we pass the text of the response to a GET request to BeautifulSoup, but here we will pass our hand-crafted html_string.
soup = BeautifulSoup(html_string)
soup
<html><head></head><body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body></html>
type(soup)
bs4.BeautifulSoup
- BeautifulSoup objects have several useful attributes, e.g. text:
print(soup.text)
Heading here My First paragraph My second paragraph item 1 item 2 item 3
Finding elements in a BeautifulSoup object¶
- The two main methods you will use to extract information from a BeautifulSoup object are find and find_all.
- soup.find(tag) finds the first instance of a tag (the first one on the page, i.e. the first one that a depth-first search of the DOM tree sees), and returns just that tag, or None if there is no match.
It has several optional arguments: look at the documentation!
- soup.find_all(tag) will find all instances of a tag, and returns a list of tags.
- Remember: find finds tags!
Using find¶
- Let's try and extract the first <div> subtree.
soup.find('div')
<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>
- Let's try and find the <div> element that has an id attribute equal to 'nav'.
soup.find('div', attrs={'id': 'nav'})
<div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div>
- find will return the first occurrence of a tag, regardless of its depth in the tree.
# The ul child is not at the top of the tree, but we can still find it.
soup.find('ul')
<ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul>
Using find_all¶
- find_all returns a list of all matching tags.
soup
<html><head></head><body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body></html>
soup.find_all('div')
[<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>, <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div>]
soup.find_all('li')
[<li>item 1</li>, <li>item 2</li>, <li>item 3</li>]
- We often use the find_all method in conjunction with a for-loop or list comprehension, to perform some operation on every matching tag.
[x.text for x in soup.find_all('li')]
['item 1', 'item 2', 'item 3']
Node attributes¶
- The text attribute of a tag element gets the text between the opening and closing tags.
soup.find('p')
<p>My First paragraph</p>
soup.find('p').text
'My First paragraph'
- The attrs attribute of a tag element lists all of its attributes.
soup.find('div')
<div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div>
soup.find('div').text
'\n Heading here\n My First paragraph\n My second paragraph\n \n '
soup.find('div').attrs
{'id': 'content'}
- The get method of a tag element gets the value of an attribute.
find and get are easy to get confused, but you'll use them both a lot.
soup.find('div').get('id')
'content'
- The get method must be called directly on the node that contains the attribute you're looking for.
soup
<html><head></head><body> <div id="content"> <h1>Heading here</h1> <p>My First paragraph</p> <p>My <em>second</em> paragraph</p> <hr/> </div> <div id="nav"> <ul> <li>item 1</li> <li>item 2</li> <li>item 3</li> </ul> </div> </body></html>
# While there are multiple 'id' attributes, none of them are in the <html> tag at the top.
soup.get('id')
soup.find('div').get('id')
'content'
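To connect find and get back to the earlier point that useful information is often stored in attributes, here is a small sketch that pulls the URL out of the hyperlink and the file name out of the image tag shown earlier. The variable names are made up, and 'html.parser' is passed explicitly as the parser, which is an assumption that differs slightly from the BeautifulSoup(html_string) call above.

```python
from bs4 import BeautifulSoup

snippet = '''
<img src="cool-visualization.png" alt="My box plot that I'm super proud of." width=500>
Click <a href="https://study.practicaldsc.org">this link</a> to access past exams.
'''

attr_soup = BeautifulSoup(snippet, 'html.parser')

# find locates the tags; get reads an attribute off of a tag.
img = attr_soup.find('img')
link = attr_soup.find('a')

img_src = img.get('src')
link_url = link.get('href')
link_text = link.text

# Asking for an attribute the tag doesn't have returns None, not an error.
missing = link.get('id')
```

Here, link_text is what a reader sees on the page, while link_url is where the link actually points.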
Activity
Consider the following HTML document, which represents a webpage containing the top few songs with the most streams on Spotify today in Canada.
<head>
<title>3*Canada-2022-06-04</title>
</head>
<body>
<h1>Spotify Top 3 - Canada</h1>
<table>
<tr class='heading'>
<th>Rank</th>
<th>Artist(s)</th>
<th>Song</th>
</tr>
<tr class=1>
<td>1</td>
<td>Harry Styles</td>
<td>As It Was</td>
</tr>
<tr class=2>
<td>2</td>
<td>Jack Harlow</td>
<td>First Class</td>
</tr>
<tr class=3>
<td>3</td>
<td>Kendrick Lamar</td>
<td>N95</td>
</tr>
</table>
</body>
- Part 1: How many leaf nodes are there in the DOM tree of the previous document — that is, how many nodes have no children?
- Part 2: What does the following line of code evaluate to?
len(soup.find_all("td"))
- Part 3: What does the following line of code evaluate to?
soup.find("tr").get("class")
Example: Scraping quotes¶
- Navigate to quotes.toscrape.com.
- Goal: Extract quotes, and relevant metadata, into a DataFrame.
- Specifically, let's try to make a DataFrame that looks like the one below:
 | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | change,deep-thoughts,thinking,world |
1 | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
2 | “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.” | Albert Einstein | https://quotes.toscrape.com/author/Albert-Einstein | inspirational,life,live,miracle,miracles |
Organizing our work¶
- Eventually, we will implement a single function, make_quote_df, which takes in an integer n and returns a DataFrame with the quotes on the first n pages of quotes.toscrape.com.
- Along the way, we'll implement several helper functions, with the goal of separating our logic: each function should either request information, OR parse, but not both!
- This makes it easier to debug and catch errors.
- It also avoids unnecessary requests.
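To illustrate the "parse, but don't request" half of that separation, here is a sketch of a parsing-only helper. The parse_quote name is hypothetical, and the HTML is a hand-trimmed snippet in the style of a quote on quotes.toscrape.com, so the function can be tested without touching the network. The keys mirror the columns of the target DataFrame above.

```python
from bs4 import BeautifulSoup

# A hand-trimmed example of one quote's HTML (an assumption about the page's structure).
quote_html = '''
<div class="quote">
  <span class="text">“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”</span>
  <span>by <small class="author">J.K. Rowling</small> <a href="/author/J-K-Rowling">(about)</a></span>
  <div class="tags">
    <meta class="keywords" content="courage,friends"/>
  </div>
</div>
'''

def parse_quote(div):
    """Turn one quote <div> into a dictionary. Makes no requests."""
    return {
        'quote': div.find('span', attrs={'class': 'text'}).text,
        'author': div.find('small', attrs={'class': 'author'}).text,
        'author_url': 'https://quotes.toscrape.com' + div.find('a').get('href'),
        'tags': div.find('meta', attrs={'class': 'keywords'}).get('content'),
    }

quote_div = BeautifulSoup(quote_html, 'html.parser').find('div', attrs={'class': 'quote'})
row = parse_quote(quote_div)
```

Because parse_quote only touches HTML it's handed, a bug in the parsing logic can be fixed and re-tested without re-downloading any pages.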
Downloading a single page¶
- First, let's figure out how to download a single page from quotes.toscrape.com.
- The URLs seem to be formatted in a very particular way:
https://quotes.toscrape.com/page/2
def download_page(i):
    # Request page i of the site and parse the returned HTML.
    url = f'https://quotes.toscrape.com/page/{i}'
    res = requests.get(url)
    return BeautifulSoup(res.text)
- Let's test our function on a single page, like Page 2.
There's nothing special about Page 2; we chose it arbitrarily.
soup = download_page(2)
soup
<!DOCTYPE html> <html lang="en"><head> <meta charset="utf-8"/> <title>Quotes to Scrape</title> <link href="/static/bootstrap.min.css" rel="stylesheet"/> <link href="/static/main.css" rel="stylesheet"/> </head> <body> <div class="container"> <div class="row header-box"> <div class="col-md-8"> <h1> <a href="/" style="text-decoration: none">Quotes to Scrape</a> </h1> </div> <div class="col-md-4"> <p> <a href="/login">Login</a> </p> </div> </div> <div class="row"> <div class="col-md-8"> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? 
So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”</span> <span>by <small class="author" itemprop="author">Marilyn Monroe</small> <a href="/author/Marilyn-Monroe">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="friends,heartbreak,inspirational,life,love,sisters" itemprop="keywords"/> <a class="tag" href="/tag/friends/page/1/">friends</a> <a class="tag" href="/tag/heartbreak/page/1/">heartbreak</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/love/page/1/">love</a> <a class="tag" href="/tag/sisters/page/1/">sisters</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”</span> <span>by <small class="author" itemprop="author">J.K. 
Rowling</small> <a href="/author/J-K-Rowling">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="courage,friends" itemprop="keywords"/> <a class="tag" href="/tag/courage/page/1/">courage</a> <a class="tag" href="/tag/friends/page/1/">friends</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“If you can't explain it to a six year old, you don't understand it yourself.”</span> <span>by <small class="author" itemprop="author">Albert Einstein</small> <a href="/author/Albert-Einstein">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="simplicity,understand" itemprop="keywords"/> <a class="tag" href="/tag/simplicity/page/1/">simplicity</a> <a class="tag" href="/tag/understand/page/1/">understand</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. 
Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”</span> <span>by <small class="author" itemprop="author">Bob Marley</small> <a href="/author/Bob-Marley">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="love" itemprop="keywords"/> <a class="tag" href="/tag/love/page/1/">love</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”</span> <span>by <small class="author" itemprop="author">Dr. Seuss</small> <a href="/author/Dr-Seuss">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="fantasy" itemprop="keywords"/> <a class="tag" href="/tag/fantasy/page/1/">fantasy</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”</span> <span>by <small class="author" itemprop="author">Douglas Adams</small> <a href="/author/Douglas-Adams">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="life,navigation" itemprop="keywords"/> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/navigation/page/1/">navigation</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. 
And the opposite of life is not death, it's indifference.”</span> <span>by <small class="author" itemprop="author">Elie Wiesel</small> <a href="/author/Elie-Wiesel">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="activism,apathy,hate,indifference,inspirational,love,opposite,philosophy" itemprop="keywords"/> <a class="tag" href="/tag/activism/page/1/">activism</a> <a class="tag" href="/tag/apathy/page/1/">apathy</a> <a class="tag" href="/tag/hate/page/1/">hate</a> <a class="tag" href="/tag/indifference/page/1/">indifference</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/love/page/1/">love</a> <a class="tag" href="/tag/opposite/page/1/">opposite</a> <a class="tag" href="/tag/philosophy/page/1/">philosophy</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”</span> <span>by <small class="author" itemprop="author">Friedrich Nietzsche</small> <a href="/author/Friedrich-Nietzsche">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="friendship,lack-of-friendship,lack-of-love,love,marriage,unhappy-marriage" itemprop="keywords"/> <a class="tag" href="/tag/friendship/page/1/">friendship</a> <a class="tag" href="/tag/lack-of-friendship/page/1/">lack-of-friendship</a> <a class="tag" href="/tag/lack-of-love/page/1/">lack-of-love</a> <a class="tag" href="/tag/love/page/1/">love</a> <a class="tag" href="/tag/marriage/page/1/">marriage</a> <a class="tag" href="/tag/unhappy-marriage/page/1/">unhappy-marriage</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Good friends, good books, and a sleepy conscience: this is the ideal life.”</span> <span>by <small class="author" itemprop="author">Mark Twain</small> <a href="/author/Mark-Twain">(about)</a> 
</span> <div class="tags"> Tags: <meta class="keywords" content="books,contentment,friends,friendship,life" itemprop="keywords"/> <a class="tag" href="/tag/books/page/1/">books</a> <a class="tag" href="/tag/contentment/page/1/">contentment</a> <a class="tag" href="/tag/friends/page/1/">friends</a> <a class="tag" href="/tag/friendship/page/1/">friendship</a> <a class="tag" href="/tag/life/page/1/">life</a> </div> </div> <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“Life is what happens to us while we are making other plans.”</span> <span>by <small class="author" itemprop="author">Allen Saunders</small> <a href="/author/Allen-Saunders">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="fate,life,misattributed-john-lennon,planning,plans" itemprop="keywords"/> <a class="tag" href="/tag/fate/page/1/">fate</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/misattributed-john-lennon/page/1/">misattributed-john-lennon</a> <a class="tag" href="/tag/planning/page/1/">planning</a> <a class="tag" href="/tag/plans/page/1/">plans</a> </div> </div> <nav> <ul class="pager"> <li class="previous"> <a href="/page/1/"><span aria-hidden="true">←</span> Previous</a> </li> <li class="next"> <a href="/page/3/">Next <span aria-hidden="true">→</span></a> </li> </ul> </nav> </div> <div class="col-md-4 tags-box"> <h2>Top Ten tags</h2> <span class="tag-item"> <a class="tag" href="/tag/love/" style="font-size: 28px">love</a> </span> <span class="tag-item"> <a class="tag" href="/tag/inspirational/" style="font-size: 26px">inspirational</a> </span> <span class="tag-item"> <a class="tag" href="/tag/life/" style="font-size: 26px">life</a> </span> <span class="tag-item"> <a class="tag" href="/tag/humor/" style="font-size: 24px">humor</a> </span> <span class="tag-item"> <a class="tag" href="/tag/books/" style="font-size: 22px">books</a> </span> <span class="tag-item"> <a class="tag" 
href="/tag/reading/" style="font-size: 14px">reading</a> </span> <span class="tag-item"> <a class="tag" href="/tag/friendship/" style="font-size: 10px">friendship</a> </span> <span class="tag-item"> <a class="tag" href="/tag/friends/" style="font-size: 8px">friends</a> </span> <span class="tag-item"> <a class="tag" href="/tag/truth/" style="font-size: 8px">truth</a> </span> <span class="tag-item"> <a class="tag" href="/tag/simile/" style="font-size: 6px">simile</a> </span> </div> </div> </div> <footer class="footer"> <div class="container"> <p class="text-muted"> Quotes by: <a href="https://www.goodreads.com/quotes">GoodReads.com</a> </p> <p class="copyright"> Made with <span class="zyte">❤</span> by <a class="zyte" href="https://www.zyte.com">Zyte</a> </p> </div> </footer> </body></html>
- Now that this works, later on, we can call download_page(1), download_page(2), download_page(3), ..., download_page(n).
Parsing a single page¶
- Now, let's try to extract the relevant information out of the soup object for Page 2.
- Open quotes.toscrape.com/page/2 in Chrome, right click the page, and click "Inspect"!
This will help us find where each quote is located in the HTML.
divs = soup.find_all('div', class_='quote')
# The above is a shortcut for the following, just for when the attribute key is class:
# divs = soup.find_all('div', attrs={'class': 'quote'})
divs[0]
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork"> <span class="text" itemprop="text">“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”</span> <span>by <small class="author" itemprop="author">Marilyn Monroe</small> <a href="/author/Marilyn-Monroe">(about)</a> </span> <div class="tags"> Tags: <meta class="keywords" content="friends,heartbreak,inspirational,life,love,sisters" itemprop="keywords"/> <a class="tag" href="/tag/friends/page/1/">friends</a> <a class="tag" href="/tag/heartbreak/page/1/">heartbreak</a> <a class="tag" href="/tag/inspirational/page/1/">inspirational</a> <a class="tag" href="/tag/life/page/1/">life</a> <a class="tag" href="/tag/love/page/1/">love</a> <a class="tag" href="/tag/sisters/page/1/">sisters</a> </div> </div>
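To see the class_ shortcut in isolation, here is a minimal sketch on a tiny, made-up HTML string (not the real page). It assumes the built-in 'html.parser' backend; demo_soup and the snippet's contents are hypothetical names chosen for illustration. It also shows select, Beautiful Soup's CSS-selector method, as a third way to ask the same question.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet that mimics the structure of the quotes page.
html = """
<div class="quote"><span class="text">First quote</span></div>
<div class="quote"><span class="text">Second quote</span></div>
<div class="tags-box"><span class="text">Not a quote</span></div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# Three ways of asking for every <div> whose class is "quote".
by_keyword = demo_soup.find_all('div', class_='quote')
by_attrs = demo_soup.find_all('div', attrs={'class': 'quote'})
by_css = demo_soup.select('div.quote')  # CSS selector syntax.

# Each approach finds the same two elements, in document order.
texts = [div.find('span', class_='text').text for div in by_keyword]
print(texts)  # ['First quote', 'Second quote']
print(len(by_keyword), len(by_attrs), len(by_css))  # 2 2 2
```

Which form to use is mostly a matter of taste; the keyword form is the one used throughout this lecture.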
- From this <div>, we can extract the quote, author name, author's URL, and tags.
Strategy: Figure out how to process one <div>, then put that logic in a function to use on other <div>s.
# The quote.
divs[0].find('span', class_='text').text
"“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”"
# The author.
divs[0].find('small', class_='author').text
'Marilyn Monroe'
# The URL for the author.
divs[0].find('a').get('href')
'/author/Marilyn-Monroe'
# The quote's tags.
divs[0].find('meta', class_='keywords').get('content')
'friends,heartbreak,inspirational,life,love,sisters'
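The last two values come back as a relative link and a comma-separated string. As an optional sketch (not something the lecture's functions do), the standard library can turn them into a full URL and a list. The variable names and base_url below are assumptions for illustration; the two input strings are copied from the outputs above.

```python
from urllib.parse import urljoin

# Values copied from the outputs above.
author_href = '/author/Marilyn-Monroe'
tags_string = 'friends,heartbreak,inspirational,life,love,sisters'

# urljoin resolves a relative link against the page it was found on.
base_url = 'https://quotes.toscrape.com/page/2/'
author_url = urljoin(base_url, author_href)

# Splitting on commas gives one list element per tag.
tag_list = tags_string.split(',')

print(author_url)  # https://quotes.toscrape.com/author/Marilyn-Monroe
print(tag_list)    # ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']
```

One caveat: ''.split(',') returns [''] rather than an empty list, so a quote with no tags would need a special case.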
Parsing a single quote, and then a single page¶
- Let's implement a function that takes in a <div> corresponding to a single quote and returns a dictionary containing the quote's information.
Why use a dictionary? Passing pd.DataFrame a list of dictionaries is an easy way to create a DataFrame.
def process_quote(div):
    quote = div.find('span', class_='text').text
    author = div.find('small', class_='author').text
    author_url = 'https://quotes.toscrape.com' + div.find('a').get('href')
    tags = div.find('meta', class_='keywords').get('content')
    return {'quote': quote, 'author': author, 'author_url': author_url, 'tags': tags}
# Make sure everything here looks correct based on what's on the webpage!
process_quote(divs[4])
{'quote': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'author_url': 'https://quotes.toscrape.com/author/Dr-Seuss', 'tags': 'fantasy'}
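To illustrate why a list of dictionaries is convenient, here is a small sketch with two hand-written dictionaries shaped like the output of process_quote. The second dictionary is made up and deliberately lacks a 'tags' key, to show how pandas handles a missing key; rows and demo_df are hypothetical names.

```python
import pandas as pd

# Two hand-written dictionaries shaped like the output of process_quote.
rows = [
    {'quote': 'I like nonsense, it wakes up the brain cells.', 'author': 'Dr. Seuss', 'tags': 'fantasy'},
    {'quote': 'A made-up quote with no tags recorded.', 'author': 'Anonymous'},
]

# Each dictionary becomes a row; the keys become the column names.
demo_df = pd.DataFrame(rows)

print(demo_df.columns.tolist())         # ['quote', 'author', 'tags']
print(demo_df['tags'].isna().tolist())  # [False, True]
```

A key that is absent from one dictionary shows up as a missing value in that row, rather than raising an error.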
- Now, we can implement a function that takes in a list of <div>s, calls process_quote on each <div> in the list, and returns a DataFrame.
def process_page(divs):
    return pd.DataFrame([process_quote(div) for div in divs])
process_page(divs)
| | quote | author | author_url | tags |
---|---|---|---|---|
0 | “This life is what you make it. No matter what... | Marilyn Monroe | https://quotes.toscrape.com/author/Marilyn-Monroe | friends,heartbreak,inspirational,life,love,sis... |
1 | “It takes a great deal of bravery to stand up ... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | courage,friends |
2 | “If you can't explain it to a six year old, yo... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | simplicity,understand |
... | ... | ... | ... | ... |
7 | “It is not a lack of love, but a lack of frien... | Friedrich Nietzsche | https://quotes.toscrape.com/author/Friedrich-N... | friendship,lack-of-friendship,lack-of-love,lov... |
8 | “Good friends, good books, and a sleepy consci... | Mark Twain | https://quotes.toscrape.com/author/Mark-Twain | books,contentment,friends,friendship,life |
9 | “Life is what happens to us while we are makin... | Allen Saunders | https://quotes.toscrape.com/author/Allen-Saunders | fate,life,misattributed-john-lennon,planning,p... |
10 rows × 4 columns
Putting it all together¶
- Now, we can implement make_quote_df.
def make_quote_df(n):
    '''Returns a DataFrame containing the quotes on the first n pages of https://quotes.toscrape.com/.''' # This is called a docstring!
    dfs = []
    for i in range(1, n+1):
        # Download page i and create a BeautifulSoup object.
        soup = download_page(i)
        # Create DataFrame using the information in that page.
        divs = soup.find_all('div', class_='quote')
        df = process_page(divs)
        # Append DataFrame to dfs.
        dfs.append(df)
    # Stitch all DataFrames together.
    return pd.concat(dfs).reset_index(drop=True)
quotes = make_quote_df(3)
quotes.head()
| | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | https://quotes.toscrape.com/author/J-K-Rowling | abilities,choices |
2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
3 | “The person, be it gentleman or lady, who has ... | Jane Austen | https://quotes.toscrape.com/author/Jane-Austen | aliteracy,books,classic,humor |
4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | https://quotes.toscrape.com/author/Marilyn-Monroe | be-yourself,inspirational |
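The reset_index(drop=True) at the end of make_quote_df matters because each page's DataFrame is numbered from 0 on its own. A minimal sketch with two made-up, two-row "pages" (page_1 and page_2 are hypothetical stand-ins for the output of process_page):

```python
import pandas as pd

# Stand-ins for the DataFrames produced by process_page on two pages.
page_1 = pd.DataFrame({'author': ['Albert Einstein', 'J.K. Rowling']})
page_2 = pd.DataFrame({'author': ['Marilyn Monroe', 'Dr. Seuss']})

# Concatenating keeps each DataFrame's own index, so labels repeat.
stacked = pd.concat([page_1, page_2])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# reset_index(drop=True) renumbers the rows and discards the old index.
renumbered = stacked.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

pd.concat([page_1, page_2], ignore_index=True) should give the same renumbered result in one step.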
- Now, quotes is a DataFrame, like any other!
quotes[quotes['author'] == 'Albert Einstein']
| | quote | author | author_url | tags |
---|---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | change,deep-thoughts,thinking,world |
2 | “There are only two ways to live your life. On... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | inspirational,life,live,miracle,miracles |
5 | “Try not to become a man of success. Rather be... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | adulthood,success,value |
12 | “If you can't explain it to a six year old, yo... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | simplicity,understand |
26 | “If you want your children to be intelligent, ... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | children,fairy-tales |
28 | “Logic will get you from A to Z; imagination w... | Albert Einstein | https://quotes.toscrape.com/author/Albert-Eins... | imagination |
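Since the tags column stores comma-separated strings, one possible follow-up is counting how often each tag appears. The sketch below uses a small hand-built DataFrame (demo_quotes, a hypothetical stand-in for quotes) with tag strings copied from the page output shown earlier.

```python
import pandas as pd

# A small stand-in for the quotes DataFrame.
demo_quotes = pd.DataFrame({
    'author': ['Albert Einstein', 'Mark Twain', 'Allen Saunders'],
    'tags': [
        'simplicity,understand',
        'books,contentment,friends,friendship,life',
        'fate,life,misattributed-john-lennon,planning,plans',
    ],
})

tag_counts = (
    demo_quotes['tags']
    .str.split(',')   # Each string becomes a list of tags.
    .explode()        # Each tag gets its own row.
    .value_counts()   # Count how many times each tag appears.
)
print(tag_counts.head())
```

Here 'life' appears twice and every other tag once; on the real quotes DataFrame the same chain should work unchanged.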
- Implement download_page(i), which downloads a single page (page i) and returns a BeautifulSoup object of the response.
- Implement process_quote(div), which takes in a <div> tree corresponding to a single quote and returns a dictionary containing all of the relevant information for that quote.
- Implement process_page(divs), which takes in a list of <div> trees corresponding to a single page and returns a DataFrame containing all of the relevant information for all quotes on that page.
- Implement make_quote_df(n).
Example: Scraping the Happening @ Michigan page¶
- As we stated earlier, our ultimate goal in today's lecture is to create a DataFrame with the information about each event at events.umich.edu.
res = requests.get('https://events.umich.edu')
res
<Response [200]>
soup = BeautifulSoup(res.text)
- Let's start by opening the page in Chrome, right clicking on the page, and clicking "Inspect".
As we can see, the HTML is much more complicated this time – this is usually the case for real websites!
Identifying <div>s¶
- It's not easy identifying which <div>s we want. The Inspect tool makes this easier, but it's good to verify that find_all is finding the right number of elements.
divs = soup.find_all(class_='col-xs-12')
len(divs)
89
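One thing to keep in mind when checking that count: an element can have several classes, and class_ matches if any one of them equals the string. The sketch below uses made-up HTML (the footer-links block is hypothetical, not taken from the real page) to show how a broad class can pick up elements that are not events, and how a CSS selector requiring two classes narrows the match.

```python
from bs4 import BeautifulSoup

# Made-up HTML: two event-like blocks and one unrelated block sharing a class.
html = """
<div class="col-xs-12 col-sm-4 flex no-pad"><h3>Event A</h3></div>
<div class="col-xs-12 col-sm-4 flex no-pad"><h3>Event B</h3></div>
<div class="col-xs-12 footer-links">Contact us</div>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# class_ matches if ANY one of the element's classes equals the string.
broad = demo_soup.find_all(class_='col-xs-12')

# A CSS selector that requires two classes at once is more specific.
narrow = demo_soup.select('div.col-xs-12.no-pad')

print(len(broad), len(narrow))  # 3 2
```

Whether the real page has such extra blocks is something to confirm with the Inspect tool; this only shows the matching rule.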
- Again, let's deal with one <div> at a time. First, we should extract the title of the event.
divs[0]
<div class="col-xs-12 col-sm-4 col-md-4 col-lg-2 flex no-pad"> <div class="event-listing-grid event-single"> <time class="time-banner" datetime="2024-09-24 8:30"><i class="fa fa-clock-o"></i> Sep 24, 2024 8:30am</time> <div class="list-image"> <a href="/event/124208"> <img alt="livestream" class="icon" height="128" src="/images/umicons_livestream.svg" width="128"/> <h5> Livestream / Virtual </h5> </a> </div> <div class="event-info"> <div class="event-title"><h3> <a href="/event/124208" title="Rush Orthopedics Live Surgery Q&A with Dr. Verma"> Rush Orthopedics Live Surgery Q&A with Dr.... </a></h3> </div> <ul class="event-details"> <li class="item"> <a href="/list?filter=locations:1" title="Virtual"><i class="fa fa-location-arrow fa-fw"></i><span> Virtual</span></a> </li> <li class="item"><a href="/group/3815" title="LSA Opportunity Hub"><i class="fa fa-group fa-fw"></i><span> LSA Opportunity Hub </span></a></li> <li class="item"><a href="/group/4442" title="LSA Transfer Student Center"><i class="fa fa-group fa-fw"></i><span> LSA Transfer Student Center </span></a></li> <li class="item"><a href="/list?filter=alltypes:24"><i class="fa fa-list fa-fw"></i><span> Livestream / Virtual </span></a></li> <li class="item"><a href="https://lsa-umich.12twenty.com/events/30006101217151"> <i class="fa fa-link fa-fw"></i> <span>RSVP Here</span> </a></li> </ul> <!-- <p> Get views from the operating room through a live-streamed surgery with Dr. Nikhil Verma, a surgeon who specializes in the treatment of the shoulder,... ( 2024-09-24 8:30am ) </p> --> </div> </div> </div>
divs[0].find('div', class_='event-title').find('a').get('title')
'Rush Orthopedics Live Surgery Q&A with Dr. Verma'
- The time and location, too.
divs[0].find('time').get('datetime')
'2024-09-24 8:30'
divs[0].find('ul').find('a').get('title')
'Virtual'
Parsing a single event, and then every event¶
- As before, we'll implement a function that takes in a BeautifulSoup object corresponding to a single <div> and returns a dictionary with the relevant information about that event.
def process_event(div):
    title = div.find('div', class_='event-title').find('a').get('title')
    location = div.find('ul').find('a').get('title')
    time = pd.to_datetime(div.find('time').get('datetime')) # Good idea!
    return {'title': title, 'time': time, 'location': location}
process_event(divs[12])
{'title': 'CommuniTea', 'time': Timestamp('2024-09-24 12:00:00'), 'location': 'Trotter Multicultural Center-Sankofa Lounge'}
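To see why converting the time with pd.to_datetime is a good idea, here is a small sketch with three time strings in the same format as the datetime attribute (the first is copied from the output above; the other two are made up in that format). As plain strings they sort incorrectly, because '8' comes after '1' alphabetically.

```python
import pandas as pd

# Time strings in the same format as the <time> tag's datetime attribute.
raw_times = pd.Series(['2024-09-24 8:30', '2024-09-24 12:00', '2024-09-24 19:30'])

# As strings, the 8:30 event sorts last.
print(sorted(raw_times))

# As Timestamps, sorting and extracting components behave as expected.
times = pd.to_datetime(raw_times)
print(times.sort_values().dt.strftime('%H:%M').tolist())  # ['08:30', '12:00', '19:30']
print(times.dt.hour.tolist())                             # [8, 12, 19]
```

Depending on the pandas version, to_datetime may warn that it could not infer a single format for strings like these; the parsed values should be the same either way.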
- Now, we can call it on every <div> in divs.
Remember, we already ran divs = soup.find_all(class_='col-xs-12').
row_list = []
for div in divs:
    try:
        row_list.append(process_event(div))
    except Exception as e:
        print(e)
'NoneType' object has no attribute 'find'
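The message printed above is what Python reports when .find is called on None, and find returns None whenever nothing matches. An alternative to try/except is to check for None explicitly. The sketch below is one hedged way to do that; get_title is a hypothetical helper and the HTML is made up, with the second block deliberately containing no event.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one block with an event title, one without.
html = """
<div class="col-xs-12"><div class="event-title"><h3><a title="CommuniTea">CommuniTea</a></h3></div></div>
<div class="col-xs-12"><p>Made-up block with no event inside.</p></div>
"""
demo_divs = BeautifulSoup(html, 'html.parser').find_all(class_='col-xs-12')

def get_title(div):
    '''Returns the event title in div, or None if div contains no event title.'''
    title_div = div.find('div', class_='event-title')
    if title_div is None:
        return None
    link = title_div.find('a')
    return link.get('title') if link is not None else None

titles = [get_title(div) for div in demo_divs]
print(titles)  # ['CommuniTea', None]
```

The try/except loop skips a problematic <div> entirely, while this approach keeps a placeholder for it; which is preferable depends on what you want the final DataFrame to contain.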
events = pd.DataFrame(row_list)
events.head()
| | title | time | location |
---|---|---|---|
0 | Rush Orthopedics Live Surgery Q&A with Dr. Verma | 2024-09-24 08:30:00 | Virtual |
1 | 2024 Investment Banking Coffee Chats at Univer... | 2024-09-24 09:00:00 | |
2 | 2024 Morgan Stanley Global Capital Markets Cof... | 2024-09-24 09:00:00 | |
3 | 2024 Morgan Stanley Institutional Equity Coffe... | 2024-09-24 09:00:00 | |
4 | Framing & Facilitating High Stakes Discussions... | 2024-09-24 10:00:00 | Virtual |
- Now,
events
is a DataFrame, like any other!
# Which events are in-person today?
events[~events['location'].isin(['Virtual', ''])]
| | title | time | location |
---|---|---|---|
6 | Materials Science and Engineering Career Fair | 2024-09-24 10:00:00 | Pierpont Commons |
8 | Macro Seminar: Tuesday, September 24 | 2024-09-24 11:30:00 | Lorch Hall |
9 | 2024 Fall Job & Internship Fair: In Person | 2024-09-24 12:00:00 | 530 South State Street, Ann Arbor, Michigan 48... |
... | ... | ... | ... |
85 | OrgLead 24-25 | 2024-09-24 19:30:00 | Michigan Union - Pendleton (2nd Floor) |
86 | Mosher-Jordan (2024-2025) (Housing) | 2024-09-24 20:00:00 | Cesar Chavez Lounge |
87 | Symphony Band | 2024-09-24 20:00:00 | Hill Auditorium |
51 rows × 3 columns
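The filter above combines isin with ~ (logical "not"). A minimal sketch on a three-row, hand-made DataFrame (demo_events is a hypothetical stand-in for events, with made-up titles) shows each step separately:

```python
import pandas as pd

# A tiny stand-in for the events DataFrame.
demo_events = pd.DataFrame({
    'title': ['Online talk', 'Coffee chat', 'Symphony Band'],
    'location': ['Virtual', '', 'Hill Auditorium'],
})

# True where the location is 'Virtual' or blank.
is_remote_or_unknown = demo_events['location'].isin(['Virtual', ''])
print(is_remote_or_unknown.tolist())  # [True, True, False]

# ~ flips each Boolean, keeping only rows with a physical location.
in_person = demo_events[~is_remote_or_unknown]
print(in_person['title'].tolist())    # ['Symphony Band']
```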
Web data in practice¶
- The spread of true and false news online by Vosoughi et al. compared how true and false news spreads via X (Twitter):
There is worldwide concern over false news and the possibility that it can influence political, economic, and social well-being. To understand how false news spreads, Vosoughi et al. used a data set of rumor cascades on Twitter from 2006 to 2017. About 126,000 rumors were spread by ∼3 million people. False news reached more people than the truth; the top 1% of false news cascades diffused to between 1000 and 100,000 people, whereas the truth rarely diffused to more than 1000 people. Falsehood also diffused faster than the truth. The degree of novelty and the emotional reactions of recipients may be responsible for the differences observed.
- To conduct this study, the authors used the X API for accessing tweets and web-scraped fact-checking websites to verify whether news was false or not.
Summary, next time¶
- Beautiful Soup is an HTML parser that allows us to (somewhat) easily extract information from HTML documents.
soup.find and soup.find_all are the methods you will use most often.
- When writing scraping code:
- Use "inspect element" to identify the names of tags and attributes that are relevant to the information you want to extract.
- Separate your logic for making requests and for parsing.
- Next time: Accessing data (and other services) through APIs.
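As a rough sketch of what "separate your logic for making requests and for parsing" can look like, the two functions below split the work. fetch_html and parse_authors are hypothetical names, not functions from this lecture, and the timeout value is an arbitrary assumption. The benefit is that the parsing function can be tried on a saved or hand-written HTML string without making any request.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url):
    '''Makes the request. Returns the page's HTML as a string.'''
    res = requests.get(url, timeout=10)
    res.raise_for_status()  # Raise an error on a 4xx or 5xx response.
    return res.text

def parse_authors(html):
    '''Does the parsing. Returns the author names found in html.'''
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.text for tag in soup.find_all('small', class_='author')]

# parse_authors can be checked without touching the network.
sample_html = '<div class="quote"><small class="author">Mark Twain</small></div>'
print(parse_authors(sample_html))  # ['Mark Twain']
```

With this split, a page only needs to be downloaded once while you iterate on the parsing code, e.g. html = fetch_html('https://quotes.toscrape.com/page/2') followed by repeated calls to parse_authors(html).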