L14 - HTML and Web Scraping; Scraping Ethics

Announcements:

  • Data Ethics 2 out today, due next Wednesday.
  • Lab 4: how's it going?

Quiz 4 FMQs:

  • $z$-scores and quartiles
  • The goal of tokenization and lemmatization

Goals:

  • Know the basics of how to read HTML
  • Know how to get the HTML of a webpage in Python using requests
  • Know how to get information out of raw HTML by parsing it using the Beautiful Soup library.
  • Know how to follow proper web scraping etiquette.

HTML - HyperText Markup Language

This is the language that web page content is written in. Some basic facts about HTML:

  • It's not really a "programming language" per se - more like Markdown than like Python.
  • It's basically a way to assign structure to the webpage content.
  • Basic units of HTML:
    • tags: <tagname>Contents</tagname>
      • Example: h1: <h1>This is the largest heading</h1>
    • attributes, such as href attribute of the a tag:
      • <a> stands for "anchor" and really means "link"
      • <a href="https://example.com">Click here to go to example dot com</a> renders as a clickable link whose text reads "Click here to go to example dot com".
    • Comments are written <!-- like this -->
  • HTML is (delightfully) "boring":
    • HTML does not specify the appearance, formatting, or even layout of the page elements. This is done using a different language called Cascading Style Sheets (CSS).
    • Much of the fancy dynamic and interactive page content you encounter on real websites is implemented using JavaScript (a "real" programming language).

Basic Elements

Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:

  • https://fw.cs.wwu.edu/~wehrwes/
  • https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/

Ways to view HTML:

  • Right click > View Page Source
  • Open developer tools (varies by browser)

Some common HTML elements to know about:

Note: If Colab sees HTML amongst your Markdown, it will render it as HTML - that's why I'm able to show you both the code and how it renders in the examples below.

  • The whole page is enclosed in an html tag
  • The body of the page is enclosed in a body tag
  • h1...h6 are headings
    • <h1>Heading 1</h1>
    • <h2>Heading 2</h2>
    • ...
    • <h6>Heading 6</h6>
  • p is for paragraph
    • <p>Paragraph 1</p>
      <p>Paragraph 2</p>
  • div is a general-purpose (and by default invisible) container for blocks of page content
    • <div>This stuff lives in a div.</div>
  • span is a general-purpose container for snippets of text content
    • <span>This stuff lives in a span,</span> but this stuff does not.
  • Tables allow you to lay out information in tabular format.
    <table>
          <tr> <!-- begin header (first) row -->
            <th>Heading 1</th> <!-- column 1 heading -->
            <th>Heading 2</th> <!-- column 2 heading -->
          </tr>
          <tr> <!-- begin second row -->
            <td>Row 1, Column 1</td>
            <td>Row 1, Column 2</td>
          </tr>
          <tr> <!-- begin third row -->
            <td>Row 2, Column 1</td>
            <td>Row 2, Column 2</td>
          </tr>
    </table>
    

renders to:

Heading 1       | Heading 2
Row 1, Column 1 | Row 1, Column 2
Row 2, Column 1 | Row 2, Column 2
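
Putting several of these elements together, a minimal complete page might look like this (an illustrative sketch, not the source of any particular page):

    <!DOCTYPE html>
    <html>
      <body>
        <h1>My Page Title</h1>
        <p>A paragraph with a <a href="https://example.com">link</a>.</p>
        <div>
          <span>Some text in a span,</span> inside a div.
        </div>
      </body>
    </html>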

Web Scraping

So you want some data, but you can only find it buried in some webpage.

Packages you'll need to pip install for all of this to work (and for Lab 5):

  • requests
  • beautifulsoup4

Game plan:

  • Use requests to get the HTML code for a webpage given its URL
  • Use beautifulsoup4 to parse the resulting HTML and extract the data we want from it.
In [1]:
import requests
import bs4 # pip install beautifulsoup4 if needed (colab has it installed by default)
In [ ]:
url = "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/"
response = requests.get(url)
print(response.text)
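
It's worth checking that the request actually succeeded before parsing; a quick sketch using requests' built-in status handling:

In [ ]:
print(response.status_code)   # 200 means OK
response.raise_for_status()   # raises an HTTPError for 4xx/5xx responses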
In [ ]:
soup = bs4.BeautifulSoup(response.text, 'html.parser')
soup

Things to demo:

  • first instance of tag via attribute access (soup.a); extract text (.text) and attributes (['href']) - see the sketch right after this list
  • find first instance of tag with class (class_ kwarg)
  • find with general attributes using attrs=attr_dict
  • find_all
  • Returned elements support the same find/find_all interface, so you can search within results
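
A quick sketch of attribute access, which is equivalent to the find call in the next cell:

In [ ]:
soup.a           # first <a> tag on the page, same as soup.find('a')
soup.a.text      # its visible link text
soup.a['href']   # the value of its href attribute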
In [12]:
soup.find('a')['href']
Out[12]:
'#course-overview'
In [15]:
soup.find("tr", class_="header")
Out[15]:
<tr class="header">
<th>Date</th>
<th>Topics</th>
<th>Assignments</th>
<th>References</th>
</tr>
In [18]:
soup.find("h2", attrs={"id": "course-policies"})
Out[18]:
<h2 id="course-policies">Course Policies</h2>
In [20]:
soup.find_all("h2")
Out[20]:
[<h2 id="course-overview">Course Overview</h2>,
 <h2 id="assessment">Assessment</h2>,
 <h2 id="resources-for-getting-help-and-support">Resources for Getting
 Help and Support</h2>,
 <h2 id="logistics">Logistics</h2>,
 <h2 id="schedule">Schedule</h2>,
 <h2 id="course-policies">Course Policies</h2>]
In [24]:
first_list = soup.find('ul')
In [27]:
[el.text for el in first_list.find_all('li')]
Out[27]:
['Course\nOverview',
 'Assessment',
 'Resources for Getting\nHelp and Support',
 'Logistics',
 'Schedule',
 'Course\nPolicies']
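
Since the demos above found the schedule table's header row, we could go one step further and pull out the cell text of every row (a sketch, assuming the schedule is the first table on the page):

In [ ]:
table = soup.find("table")  # assumes the schedule is the first table
rows = [[cell.text.strip() for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")]
rows[0]  # the header row: ['Date', 'Topics', 'Assignments', 'References']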

Scraping Ethics

  • What does it cost you to scrape a website?

  • What does it cost the person/company/entity that runs the website being scraped?

  • Should anyone be able to scrape anything?

  • Timely news article: Meta Was Scraping Sites for Years While Fighting the Practice; archived link

Scraping Etiquette

Things to keep in mind:

  • Scraping public data from websites is generally OK (but I'm not a lawyer and this is not legal advice).

  • Most websites will have Terms of Service or Terms of Use. Violating these may be illegal (but I am not a lawyer and this is not legal advice).

    • Most sites will also have a robots.txt which specifies how and what non-human users may access. Example: https://www.wwu.edu/robots.txt
  • If the service provides downloadable datasets or an API, use these instead of scraping.

  • Don't redistribute data without permission.

  • Rate limit your scraping requests - waiting at least 1 second between requests (for a typical webpage) is usually reasonable, though 5-10 seconds is better (see the sketch after this list).

    • If you don't rate limit, you are indistinguishable from a denial-of-service attack.
    • If pages are large or involve database queries on the backend, it may be polite to wait longer between queries.
  • Always save results instead of re-requesting.

    • Save the raw results early, before doing much analysis, so you can change your analysis later without re-scraping.
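
A sketch pulling these habits together - checking robots.txt, rate limiting, and saving results (the page URLs below are hypothetical placeholders):

In [ ]:
import json
import time
import urllib.robotparser

# 1. Check robots.txt before scraping.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.wwu.edu/robots.txt")
rp.read()
print(rp.can_fetch("*", "https://www.wwu.edu/some/page"))  # True if allowed

# 2. Rate limit, and 3. save everything so you never re-request.
urls = ["https://example.com/page1", "https://example.com/page2"]  # placeholders
pages = {}
for u in urls:
    pages[u] = requests.get(u).text
    time.sleep(5)  # 5-10 seconds between requests is polite

with open("pages.json", "w") as f:
    json.dump(pages, f)  # next time, load from disk instead of re-scraping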

Okay, let's do something with this!

Live coding problem: collect the names and office numbers of all WWU CS faculty from the department directory.

In [ ]:
url = "https://cs.wwu.edu/faculty"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
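
One way to proceed, once we've inspected the page's HTML with developer tools to see where names and offices live. The tag and class names below are hypothetical stand-ins - replace them with whatever the real page uses:

In [ ]:
# Hypothetical structure: adjust the tag/class names to match what you find
# when you inspect https://cs.wwu.edu/faculty in your browser's dev tools.
for card in soup.find_all("div", class_="faculty-listing"):      # hypothetical class
    name = card.find("h3").text.strip()                          # hypothetical tag
    office = card.find("span", class_="office").text.strip()     # hypothetical class
    print(name, office)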