HTML
This is the language that web page content is written in. Some basic facts about HTML:
- Tags enclose content like this: `<tagname>Contents</tagname>`
- Example with `h1`: `<h1>This is the largest heading</h1>`
- Tags can have attributes, like the `href` attribute of the `a` tag (`a` stands for "anchor" and really means "link"): `<a href="https://example.com">Click here to go to example dot com</a>` becomes Click here to go to example dot com.
- Comments look `<!-- like this -->`
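As a quick preview of the parsing tool we'll use later in these notes (bs4, introduced properly below), here's how that anchor tag looks from Python:

import bs4
snippet = '<a href="https://example.com">Click here to go to example dot com</a>'
tag = bs4.BeautifulSoup(snippet, 'html.parser').a
tag.text     # 'Click here to go to example dot com'
tag['href']  # 'https://example.com'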
Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:
Ways to view HTML:
- Your browser's "View Page Source" feature
- Your browser's developer tools ("Inspect Element")
Some common HTML elements to know about:
Note: If Colab sees HTML amongst your Markdown, it will render it like HTML - that's why I'm able to show you both the code and how it renders in the examples below.
- `html` tag
- `body` tag
- `h1` ... `h6` are headings
- `p` is for paragraph: `<p>Paragraph 1</p> <p>Paragraph 2</p>` renders as:

  Paragraph 1

  Paragraph 2

- `div` is a general-purpose (and by default invisible) container for blocks of page content
- `span` is a general-purpose container for snippets of text content
- `table` (with `tr` for rows and `th`/`td` for cells) is for tables:

<table>
<tr> <!-- begin header (first) row -->
  <th>Heading 1</th> <!-- column 1 heading -->
  <th>Heading 2</th> <!-- column 2 heading -->
</tr>
<tr> <!-- begin second row -->
<td>Row 1, Column 1</td>
<td>Row 1, Column 2</td>
</tr>
<tr> <!-- begin third row -->
<td>Row 2, Column 1</td>
<td>Row 2, Column 2</td>
</tr>
</table>
renders to:

| Heading 1 | Heading 2 |
| --- | --- |
| Row 1, Column 1 | Row 1, Column 2 |
| Row 2, Column 1 | Row 2, Column 2 |
Packages you'll need to pip install for this all to work (and for Lab 5):
- requests
- beautifulsoup4
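In a Colab or Jupyter cell you can install both at once (Colab already has them, so there this is a no-op):

!pip install requests beautifulsoup4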
Game plan:
- `requests` to get the HTML code for a webpage given its URL
- `beautifulsoup4` to parse the resulting HTML and extract the data we want from it

import requests
import bs4 # pip install beautifulsoup4 if needed (colab has it installed by default)
url = "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/"
response = requests.get(url)
print(response.text)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
soup
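One thing worth doing before parsing (not shown in the cell above) is checking that the request actually succeeded:

print(response.status_code)  # 200 means OK; 404 means not found, etc.
response.raise_for_status()  # or: raise an exception on any 4xx/5xx response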
Things to demo:
- Finding tags (`soup.a`); extracting text (`.text`) and attributes (`['href']`)
- Finding by class (the `class_` kwarg)
- Finding by arbitrary attributes (`attrs=attr_dict`)
- `find_all` to get all matching tags
soup.find('a')['href']
'#course-overview'
soup.find("tr", class_="header")
<tr class="header"> <th>Date</th> <th>Topics</th> <th>Assignments</th> <th>References</th> </tr>
soup.find("h2", attrs={"id": "course-policies"})
<h2 id="course-policies">Course Policies</h2>
soup.find_all("h2")
[<h2 id="course-overview">Course Overview</h2>, <h2 id="assessment">Assessment</h2>, <h2 id="resources-for-getting-help-and-support">Resources for Getting Help and Support</h2>, <h2 id="logistics">Logistics</h2>, <h2 id="schedule">Schedule</h2>, <h2 id="course-policies">Course Policies</h2>]
first_list = soup.find('ul')
[el.text for el in first_list.find_all('li')]
['Course\nOverview', 'Assessment', 'Resources for Getting\nHelp and Support', 'Logistics', 'Schedule', 'Course\nPolicies']
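Putting a few of these together, here's one way to pull just the column names out of the schedule table's header row we found above:

header = soup.find("tr", class_="header")
[th.text for th in header.find_all("th")]
['Date', 'Topics', 'Assignments', 'References']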
What does it cost you to scrape a website?
What does it cost the person/company/entity that runs the website being scraped?
Should anyone be able to scrape anything?
Timely news article: Meta Was Scraping Sites for Years While Fighting the Practice; archived link
Things to keep in mind:
- Scraping public data from websites is generally OK (but I'm not a lawyer and this is not legal advice).
- Most websites have Terms of Service or Terms of Use. Violating these may be illegal (but I am not a lawyer and this is not legal advice).
- Most websites have a robots.txt file which specifies how and what non-human users may access. Example: https://www.wwu.edu/robots.txt
- If the service provides downloadable datasets or an API, use those instead of scraping.
- Don't redistribute data without permission.
- Rate limit your scraping requests - leaving at least 1 second between requests (for a typical webpage) is usually reasonable, though 5-10 seconds is better.
- Always save results instead of re-requesting. (The sketch after this list shows these last two points in code.)
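Here's a minimal sketch of what the robots.txt, rate-limiting, and caching points might look like in practice. polite_get is a hypothetical helper (not part of requests), and the 5-second delay is just one reasonable choice:

import time
import urllib.robotparser
from pathlib import Path
from urllib.parse import urljoin

import requests

def polite_get(url, cache_file, delay=5):
    """Hypothetical helper: fetch url at most once, respecting robots.txt and rate limits."""
    cache = Path(cache_file)
    if cache.exists():                        # already fetched: reuse the saved copy
        return cache.read_text()

    # Consult robots.txt before requesting (urllib.robotparser is in the stdlib)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(url, "/robots.txt"))
    rp.read()
    if not rp.can_fetch("*", url):
        raise RuntimeError(f"robots.txt disallows fetching {url}")

    time.sleep(delay)                         # rate limit: pause before every request
    response = requests.get(url)
    response.raise_for_status()
    cache.write_text(response.text)           # save the result so we never re-request
    return response.text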
url = "https://cs.wwu.edu/faculty"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
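From here the same tools from the demos apply. For instance, without assuming anything about how the faculty page is structured, one way to peek at every link on it (a.get('href') returns None for anchors with no href, so we filter those out):

links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
links[:10]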