Announcements:
- Feel free to code along with today's lecture - and there are a couple exercises interspersed!
Goals:
- Know the basics of how to read HTML
- Understand the basic purpose and structure of XML
- Know how to get the HTML of a webpage in Python using requests
- Know how to get information out of raw HTML by parsing it using the Beautiful Soup library
HTML - HyperText Markup Language
HTML is one pervasive example of structured data.
This is the language that web page content is written in. Some basic facts about HTML:
- It's not really a "programming language" per se - more like Markdown than like Python.
- It's basically a way to assign structure to the webpage content.
- Basic units of HTML:
  - tags: <tagname>Contents</tagname>
    - Example: h1: <h1>This is the largest heading</h1>
  - attributes, such as the href attribute of the a tag:
    - <a> stands for "anchor" and really means "link"
    - Example: <a href="https://example.com">Click here to go to example dot com</a> becomes a link that reads "Click here to go to example dot com".
  - Comments are written <!-- like this -->
- HTML is (delightfully) "boring":
  - HTML does not specify the appearance, formatting, or even layout of the page elements. This is done using a different language called Cascading Style Sheets (CSS).
  - Much of the fancy dynamic and interactive page content you encounter on real websites is implemented using JavaScript (a "real" programming language).
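To see the basic units (tags, attributes, comments) as a program sees them, here is a minimal sketch using Python's built-in html.parser module. The class name UnitLister and the snippet string are made up for illustration; the snippet just reuses the examples from the list above.

```python
from html.parser import HTMLParser


class UnitLister(HTMLParser):
    """Record each tag, attribute dict, text run, and comment the parser sees."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text between tags
            self.events.append(("text", data.strip()))

    def handle_comment(self, data):
        self.events.append(("comment", data.strip()))


# Hypothetical snippet built from the examples in the list above
snippet = """
<h1>This is the largest heading</h1>
<a href="https://example.com">Click here to go to example dot com</a>
<!-- like this -->
"""

parser = UnitLister()
parser.feed(snippet)
parser.close()
for event in parser.events:
    print(event)
```

Each printed tuple is one unit: an opening tag with its attributes, a run of text, a closing tag, or a comment.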
Basic Elements
Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:
Ways to view HTML:
- Right click > View Page Source
- Open developer tools (varies by browser)
Some common HTML elements to know about:
Note: If Jupyter sees HTML amongst your Markdown, it will render it as HTML - that's why I'm able to show you both the code and how it renders in the examples below.
- The whole page is enclosed in an html tag
- The body of the page is enclosed in a body tag
- h1...h6 are headings
  - Heading 1
  - Heading 2
  - ...
  - Heading 6
- p is for paragraph
  - Paragraph 1
  - Paragraph 2
- div is a general-purpose (and by default invisible) container for blocks of page content
  - This stuff lives in a div.
- span is a general-purpose container for snippets of text content
  - This stuff lives in a span, but this stuff does not.
- Tables allow you to lay out information in tabular format.
<table>
  <tr> <!-- begin header (first) row -->
    <th>Heading 1</th> <!-- column 1 heading -->
    <th>Heading 2</th> <!-- column 2 heading -->
  </tr>
  <tr> <!-- begin second row -->
    <td>Row 1, Column 1</td>
    <td>Row 1, Column 2</td>
  </tr>
  <tr> <!-- begin third row -->
    <td>Row 2, Column 1</td>
    <td>Row 2, Column 2</td>
  </tr>
</table>
renders to:
| Heading 1 | Heading 2 |
| --- | --- |
| Row 1, Column 1 | Row 1, Column 2 |
| Row 2, Column 1 | Row 2, Column 2 |
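Going the other direction, from table markup back to rows of data, might look like the sketch below. It uses the Beautiful Soup library that the Demo section introduces; the variable names are made up, and the table string is the example table with its comments removed.

```python
import bs4

table_html = """
<table>
  <tr>
    <th>Heading 1</th>
    <th>Heading 2</th>
  </tr>
  <tr>
    <td>Row 1, Column 1</td>
    <td>Row 1, Column 2</td>
  </tr>
  <tr>
    <td>Row 2, Column 1</td>
    <td>Row 2, Column 2</td>
  </tr>
</table>
"""

table_soup = bs4.BeautifulSoup(table_html, "html.parser")

# One inner list per <tr>; header cells (<th>) and data cells (<td>) are both kept.
rows = [
    [cell.text for cell in tr.find_all(["th", "td"])]
    for tr in table_soup.find_all("tr")
]
header, *body = rows
print(header)
print(body)
```

The first inner list holds the column headings and the rest hold the data rows, which is usually the shape you want before handing the data to something like a DataFrame.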
XML
Example XML document:
<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="b001" status="available">
    <title>The Great Gatsby</title>
    <author>
      <firstName>F. Scott</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <publicationYear>1925</publicationYear>
    <isbn>978-0-7432-7356-5</isbn>
    <genres>
      <genre>Fiction</genre>
      <genre>Classic</genre>
    </genres>
  </book>
  <book id="b002" status="checked-out">
    <title>1984</title>
    <author>
      <firstName>George</firstName>
      <lastName>Orwell</lastName>
    </author>
    <publicationYear>1949</publicationYear>
    <isbn>978-0-452-28423-4</isbn>
    <genres>
      <genre>Dystopian</genre>
      <genre>Science Fiction</genre>
    </genres>
  </book>
</library>
- XML is HTML's more general (and not particularly well-liked) cousin.
- HTML-like, but not document-specific
- Meant to represent whatever structured data you like
- Doesn't even have the "default" presentation that HTML has via browsers
- You can define a "schema" that narrows down what a "valid" XML document looks like for your particular use case.
  - For example, for the document above: maybe you insist a <library> can contain only a sequence of <book>s
  - Documents can then be automatically error checked against the schema, then parsed much more easily
One major (and largely successful) competitor to XML is JSON - we'll see this next time.
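For a feel of what reading XML looks like in Python, here is a small sketch using the standard library's xml.etree.ElementTree on a shortened copy of the library document. The "only <book> children" check is a hand-rolled stand-in for a real schema, not actual schema validation (the standard library does not validate against schemas), and the dictionary keys are just names picked for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the library document above (isbn and genres omitted)
xml_text = """<?xml version="1.0" encoding="UTF-8"?>
<library>
  <book id="b001" status="available">
    <title>The Great Gatsby</title>
    <author>
      <firstName>F. Scott</firstName>
      <lastName>Fitzgerald</lastName>
    </author>
    <publicationYear>1925</publicationYear>
  </book>
  <book id="b002" status="checked-out">
    <title>1984</title>
    <author>
      <firstName>George</firstName>
      <lastName>Orwell</lastName>
    </author>
    <publicationYear>1949</publicationYear>
  </book>
</library>
"""

root = ET.fromstring(xml_text.encode("utf-8"))

# Hand-rolled stand-in for a schema rule: <library> may contain only <book> elements.
unexpected = [child.tag for child in root if child.tag != "book"]
if unexpected:
    raise ValueError(f"unexpected children of <library>: {unexpected}")

books = []
for book in root.findall("book"):
    first = book.findtext("author/firstName")
    last = book.findtext("author/lastName")
    books.append({
        "id": book.get("id"),              # attributes are read with .get()
        "status": book.get("status"),
        "title": book.findtext("title"),   # text of the first matching child
        "author": f"{first} {last}",
        "year": int(book.findtext("publicationYear")),
    })

for entry in books:
    print(entry)
```

Because the structure is known in advance, pulling out each field is a one-liner; that predictability is what a schema buys you.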
Demo
import requests
import bs4
url = "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/"
response = requests.get(url)
print(response.text[:500])
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
<meta charset="utf-8" />
<meta name="generator" content="pandoc" />
<meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
<meta name="author" content="Scott Wehrwein" />
<title>DATA 311 - Fundamentals of Data Science</title>
<style>
code{white-space: pre-wrap;}
span.smallcaps{font-variant: small-caps;}
div.columns{display: flex; gap: min(4vw, 1.5em)
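A bare requests.get hands back whatever the server sends, including error pages, and can wait a long time on a slow server. A slightly more defensive sketch is below; fetch_html is a made-up helper name (not part of requests) and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or raise if the request fails.

    fetch_html is a hypothetical helper for these notes, not a requests function.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx responses rather than
    # returning an error page that we would then try to parse.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the course URL above should give the same string as response.text, assuming the request succeeds.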
soup = bs4.BeautifulSoup(response.text, 'html.parser')
To find the first instance of a tag, you can access the tag name as a property of the soup object:
# get the first link in the document (<a> tag)
soup.a
<a href="#course-overview" id="toc-course-overview">Course Overview</a>
# get the first h1 tag
soup.h1
<h1 class="title">DATA 311 - Fundamentals of Data Science</h1>
To get the text inside a tag, use the tag's text property:
# get the text inside the h1 element we found above
soup.h1.text
'DATA 311 - Fundamentals of Data Science'
Attributes of tags can be accessed using dictionary-like indexing:
# get the href attribute of the a element we found above
soup.a["href"]
'#course-overview'
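One thing to watch for: square-bracket indexing raises a KeyError when the attribute is missing. A small sketch on a made-up one-link snippet (copied from the first link above) shows the alternatives; the variable names are just for illustration.

```python
import bs4

# Made-up snippet containing the first link from the page above
snippet = '<a href="#course-overview" id="toc-course-overview">Course Overview</a>'
link = bs4.BeautifulSoup(snippet, "html.parser").a

print(link.attrs)         # every attribute of the tag, as a dict
print(link["href"])       # dictionary-style indexing, as above
print(link.get("title"))  # .get() returns None instead of raising

try:
    link["title"]         # this tag has no title attribute
except KeyError:
    print("no title attribute on this link")
```

When scraping real pages, where some links lack the attribute you want, .get() is often the safer choice.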
We can also use the find method to search for the first instance of a tag:
# use find to get the first a, equivalent to soup.a
soup.find("a")
<a href="#course-overview" id="toc-course-overview">Course Overview</a>
The objects returned by find and friends are Tag objects, which support the same search methods as the soup itself, meaning we can call find and friends on them too:
# find the first table (which is the Schedule table)
schedule = soup.find("table")
# search the table for the first row with class = "odd"
schedule.find("tr", class_="odd")
<tr class="odd"> <td>09/24 (0)</td> <td>Introduction and overview<br/>What is data science? What is data? <br/><a href="lectures/L00/L00_slides.pdf">slides</a><br/><a href="lectures/L00/L00.html">typed notes</a><br/><a href="lectures/L00/W00.html">worksheet</a><br/><a href="lectures/L00/L00.pdf">whiteboard</a></td> <td>Start of Quarter Survey (Canvas)</td> <td>1.1, 1.3</td> </tr>
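Chaining like this fails if an earlier find comes up empty, because find returns None when nothing matches. Here is a sketch on a tiny made-up page (the page string and variable names are invented for illustration):

```python
import bs4

# Tiny made-up page: one heading and a two-row table
page = """
<h1 class="title">Demo page</h1>
<table>
  <tr class="header"><th>Date</th><th>Topic</th></tr>
  <tr class="odd"><td>09/24 (0)</td><td>Introduction and overview</td></tr>
</table>
"""
demo = bs4.BeautifulSoup(page, "html.parser")

table = demo.find("table")
print(type(table))                     # the result of find is a Tag

row = table.find("tr", class_="odd")   # searches only inside the table
cells = [td.text for td in row.find_all("td")]
print(cells)

# find returns None when nothing matches, so check before chaining further.
missing = demo.find("nav")
if missing is None:
    print("no <nav> element on this page")
```

Without the None check, something like demo.find("nav").find("a") would raise an AttributeError on this page.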
Exercise 0: Find the text of the first link that's inside an unordered list (<ul>) element.
soup.ul.a.text
'Course\nOverview'
It's often useful to search for a tag with a given id attribute:
# find the h2 with id="course-policies" by passing the `id` keyword to the find method
soup.find("h2", id="course-policies")
<h2 id="course-policies">Course Policies</h2>
You can also search arbitrary attributes (or combinations thereof) with the attrs kwarg:
# use find with a dict of attributes passed to the attrs kwarg
# we can even do this without specifying the type of tag!
soup.find(attrs={"href": "#course-policies"})
<a href="#course-policies" id="toc-course-policies">Course Policies</a>
Sometimes you want more than one tag, or the first one fitting a description doesn't narrow it down enough. Suppose we want all the rows of the Schedule table - we can get a list with find_all:
# count all tr elements in the document (this searches the whole page, not just the first table)
len(soup.find_all("tr"))
30
Let's eliminate the header row by getting only rows with class even or odd:
len(soup.find_all("tr", attrs={"class": ["odd", "even"]}))
29
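To restrict the search to one table rather than the whole page, call find_all on the table's Tag. The sketch below uses a small made-up table (dates and topics are invented) and also shows the class_ keyword as an alternative to attrs.

```python
import bs4

# Made-up schedule-like table: one header row plus three body rows
table_html = """
<table>
  <tr class="header"><th>Date</th><th>Topic</th></tr>
  <tr class="odd"><td>Day 1</td><td>Topic A</td></tr>
  <tr class="even"><td>Day 2</td><td>Topic B</td></tr>
  <tr class="odd"><td>Day 3</td><td>Topic C</td></tr>
</table>
"""
demo = bs4.BeautifulSoup(table_html, "html.parser")
table = demo.find("table")

all_rows = table.find_all("tr")
# A list of class values matches rows that have any one of them.
body_rows = table.find_all("tr", attrs={"class": ["odd", "even"]})
# class_ is the keyword form (class itself is a reserved word in Python).
same_rows = table.find_all("tr", class_=["odd", "even"])

print(len(all_rows), len(body_rows), len(same_rows))
```

Scoping to the table matters on pages with more than one table, where a page-wide find_all("tr") would mix their rows together.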
Exercise 1: Get a list containing the text of all the navigation links on the course webpage. Hint: the navigation buttons are all <a> elements that live inside a <nav> element.
[tag.text for tag in soup.nav.find_all("a")]
['Course\nOverview', 'Assessment', 'Resources', 'Logistics', 'Schedule', 'Course\nPolicies']
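Notice the stray \n in a couple of those labels: the link text is split across two lines in the page source. One way to tidy that up is sketched below on a made-up nav snippet that mimics the same line breaks; the variable names are just for illustration.

```python
import bs4

# Made-up nav snippet with line breaks inside two of the link labels
nav_html = """
<nav>
  <ul>
    <li><a href="#course-overview">Course
Overview</a></li>
    <li><a href="#assessment">Assessment</a></li>
    <li><a href="#course-policies">Course
Policies</a></li>
  </ul>
</nav>
"""
demo = bs4.BeautifulSoup(nav_html, "html.parser")

raw = [tag.text for tag in demo.nav.find_all("a")]
# str.split() with no arguments splits on any run of whitespace (newlines
# included), so re-joining with single spaces flattens each label.
clean = [" ".join(text.split()) for text in raw]
print(raw)
print(clean)
```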
Exercise 2: Collect the names and office numbers of all WWU CS faculty from the department directory.
url = "https://cs.wwu.edu/faculty"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')