# L17 - HTML and Web Scraping

#### Announcements:
* Lab 4 Part 2 - outlier detection: hint?
> My favorite way to find outliers is to make histograms. I'd like to histogram all the images, but they're not single numbers, so I can't. One approach is to work around this quite directly: find a way to convert each image into a single number (a statistic, you could say), then plot a histogram of all the single numbers. You may want to try out a few of these - if you find weirdness in a histogram, you're probably onto something.

#### Goals:
* Know the basics of how to read HTML
* Know how to get the HTML of a webpage in Python using `requests`
* Know how to get information out of raw HTML by parsing it using `beautifulsoup4`.

## HTML - HyperText Markup Language

This is the language that web page content is written in. Some basic facts about HTML:
* It's not really a "programming language" per se - more like Markdown than like Python.
* It's basically a way to assign structure to the webpage content.
* Basic units of HTML:
    * **tags**:  `<tagname>Contents</tagname>`
        * Example: `<h1>Biggest Possible Heading</h1>`
    * **attributes**, such as the `href` attribute of the `a` tag:
        * `<a>` stands for "anchor" and really means "link"
        * `<a href="https://example.com">Click here to go to example dot com</a>` becomes [Click here to go to example dot com](https://example.com).
    * Comments are written `<!-- like this -->`
* Many (most?) tags can have other tags **nested** inside them.
* HTML is (delightfully) "boring":    
    * HTML does *not* specify the appearance, formatting, or even layout of the page elements. This is done using a different language called Cascading Style Sheets (CSS).
    * Much of the fancy dynamic and interactive page content you encounter on real websites is implemented using JavaScript (an actual programming language).
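
To make tags, attributes, and nesting concrete, here is a minimal sketch that walks a small HTML snippet with Python's built-in `html.parser` module and records each opening tag with its nesting depth. The `OutlineParser` class and the snippet are made up for illustration, and the depth counter assumes every tag in the snippet is explicitly closed.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record each opening tag with its nesting depth and attributes."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag name, attribute dict)

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.outline.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input where every tag is closed
        self.depth -= 1


# Hypothetical snippet built from the examples in the list above
html = """
<div>
  <h1>Biggest Possible Heading</h1>
  <!-- comments are skipped by this parser -->
  <p>Go to <a href="https://example.com">example dot com</a>.</p>
</div>
"""

parser = OutlineParser()
parser.feed(html)
for depth, tag, attrs in parser.outline:
    print("  " * depth + tag, attrs)
```

The indentation of the printed outline mirrors the nesting: `a` sits inside `p`, which sits inside `div`.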

#### Basic Elements

Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:
* https://fw.cs.wwu.edu/~wehrwes/
* https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/

*Note*: If Jupyter sees HTML in your Markdown, it renders it as HTML. That's why the examples below can show both the code and how it renders.

Some common HTML elements to know about:
* The whole page is enclosed in an `html` tag
* The body of the page is enclosed in a `body` tag
* `h1`...`h6` are headings
    * <h1>Heading 1</h1>
    * <h2>Heading 2</h2>
    * ...
    * <h6>Heading 6</h6>
* `p` is for paragraph
    * <p>Paragraph 1</p><p>Paragraph 2</p>
* `div` is a general-purpose (and by default invisible) container for blocks of page content
    * <div>This stuff lives in a div.</div>
* `span` is a general-purpose container for snippets of text content
    * <span>This stuff lives in a span</span>, but this stuff does not.
* Tables allow you to lay out information in tabular format.
    ```html
    <table>
      <tr> <!-- begin header (first) row -->
        <th>Heading 1</th> <!-- column 1 heading -->
        <th>Heading 2</th> <!-- column 2 heading -->
      </tr>
      <tr> <!-- begin second row -->
        <td>Row 1, Column 1</td>
        <td>Row 1, Column 2</td>
      </tr>
      <tr> <!-- begin third row -->
        <td>Row 2, Column 1</td>
        <td>Row 2, Column 2</td>
      </tr>
    </table>
    ```
renders to:
<table>
  <tr> <!-- begin header (first) row -->
    <th>Heading 1</th> <!-- column 1 heading -->
    <th>Heading 2</th> <!-- column 2 heading -->
  </tr>
  <tr> <!-- begin second row -->
    <td>Row 1, Column 1</td>
    <td>Row 1, Column 2</td>
  </tr>
  <tr> <!-- begin third row -->
    <td>Row 2, Column 1</td>
    <td>Row 2, Column 2</td>
  </tr>
</table>
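
Since scraped data often lives in tables, it may help to see how the `tr`/`th`/`td` structure maps onto rows and cells in code. The sketch below uses only the standard library's `html.parser`. The `TableParser` class is a made-up helper for illustration, and it assumes a single, non-nested table.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each th/td cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per <tr>
        self.in_cell = False  # True while inside a <th> or <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self.in_cell = False

    def handle_data(self, data):
        # Text outside of cells (e.g. whitespace between tags) is ignored
        if self.in_cell:
            self.rows[-1][-1] += data


table_html = """
<table>
  <tr> <th>Heading 1</th> <th>Heading 2</th> </tr>
  <tr> <td>Row 1, Column 1</td> <td>Row 1, Column 2</td> </tr>
  <tr> <td>Row 2, Column 1</td> <td>Row 2, Column 2</td> </tr>
</table>
"""

parser = TableParser()
parser.feed(table_html)

# Treat the first row as the header and pair it with each data row
header, *body = parser.rows
records = [dict(zip(header, row)) for row in body]
print(records)
```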

## Web Scraping
#### So you want some data, but you can only find it buried in some webpage.

Packages you'll need to `pip install` for this all to work (and for Lab 5):
* `requests`
* `beautifulsoup4`

Game plan:
* Use `requests` to get the HTML code for a webpage given its URL
* Use `beautifulsoup4` to parse the resulting HTML and extract the data we want from it.
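
The first step of the game plan can be sketched with a little error handling, since real requests can fail (bad URL, timeout, 404). The helper name `fetch_html` and the 10-second default timeout are assumptions made for this illustration, not part of the lab.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page's HTML as a string, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn HTTP error codes (404, 500, ...) into exceptions
        response.raise_for_status()
    except requests.exceptions.RequestException as err:
        print(f"Could not fetch {url!r}: {err}")
        return None
    return response.text
```

Calling `fetch_html("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/")` should return the same text as `response.text` in the cells below, while a malformed or unreachable URL yields `None` instead of a traceback.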



In [2]:
import requests
import bs4

In [5]:
url = "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/"
response = requests.get(url)
print(response.text)

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
<head>
  <meta charset="utf-8" />
  <meta name="generator" content="pandoc" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" />
  <meta name="author" content="Scott Wehrwein" />
  <title>DATA 311 - Fundamentals of Data Science</title>
  <style>
    code{white-space: pre-wrap;}
    span.smallcaps{font-variant: small-caps;}
    span.underline{text-decoration: underline;}
    div.column{display: inline-block; vertical-align: top; width: 50%;}
    div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;}
    ul.task-list{list-style: none;}
    .display.math{display: block; text-align: center; margin: 0.5rem auto;}
  </style>
  <link rel="stylesheet" href="md.css" />
  <!--[if lt IE 9]>
    <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script>
  <![endif]-->
</head>
<body>
<header id="title-block-header">
<h1 class="tit

In [9]:
soup = bs4.BeautifulSoup(response.text, 'html.parser')


Things to demo:
* first instance of a tag via attribute access (`soup.a`); extract its text (`.text`) and attributes (`['href']`)
* find the first instance of a tag with a given class (`class_` kwarg)
* find by a general attribute (`attrs` kwarg)
* `find_all` to get every match as a list
* Returned elements are `Tag` objects that support the same search methods, so you can search within results
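
For trying these calls without a network connection, here is a small offline sketch. The HTML string is a hand-written stand-in modeled on the outputs shown in the cells below, not the full course page.

```python
import bs4

# Hand-written stand-in for a fragment of the course page (an assumption)
html = """
<html><body>
  <p class="author">Scott Wehrwein</p>
  <ul>
    <li><a href="#course-overview">Course Overview</a></li>
    <li><a href="#assessment">Assessment</a></li>
  </ul>
</body></html>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

# Attribute access returns the first matching tag
first_link = soup.a
first_text = first_link.text
first_href = first_link["href"]

# Search by class, then by an arbitrary attribute
author = soup.find(class_="author").text
assessment = soup.find("a", attrs={"href": "#assessment"})

# Results can be searched again: find the list, then every link inside it
links = [(a.text, a["href"]) for a in soup.find("ul").find_all("a")]
print(links)
```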

In [11]:
soup.a

<a href="#course-overview">Course Overview</a>

In [12]:
soup.a.text

'Course Overview'

In [15]:
soup.a['href']

'#course-overview'

In [16]:
soup.find('a', attrs={'href': "#assessment"})

<a href="#assessment">Assessment</a>

In [18]:
soup.find(class_="author")

<p class="author">Scott Wehrwein</p>

In [25]:
soup.find_all('p', class_="author")

[<p class="author">Scott Wehrwein</p>]

In [29]:
items = soup.find('ul').find_all('li')

In [30]:
[it.text for it in items]

['Course Overview',
 'Assessment',
 'Resources for Getting Help and Support',
 'Logistics',
 'Schedule',
 'Course Policies']

In [31]:
str(items[0])

'<li><a href="#course-overview">Course Overview</a></li>'

#### Okay, let's do something with this!

Live coding problem: collect the **Names** and **Office Numbers** of all WWU CS faculty from the department [directory](https://cs.wwu.edu/faculty).

In [32]:
url = "https://cs.wwu.edu/faculty"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')

In [42]:
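# Each faculty member's entry lives in a div with class "card".
cards = soup.find_all('div', class_="card")

# Prototype on a single card before looping over all of them.
wes = cards[5]
name = wes.h3.a.text  # the name is the link text inside the card's h3
office = wes.find_all('div')[4].text  # on this card, the fifth div holds the office
office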

'CF 479'

In [45]:
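import pandas as pd
import numpy as np

data = {
    "Name": [],
    "Office": []
}
for card in cards:
    data["Name"].append(card.h3.a.text)
    # Rather than rely on a fixed div position, look for the div
    # whose text starts with the building code "CF".
    office = np.nan  # placeholder for cards with no office listed
    for d in card.find_all('div'):
        if d.text.startswith("CF"):
            office = d.text
            break  # keep one office per card so both columns stay the same length
    data["Office"].append(office)
df = pd.DataFrame(data)
df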

| | Name | Office |
|---|---|---|
| 0 | Shameem Ahmed, PhD | CF 491 |
| 1 | Selina Akter, PhD | CF 409 |
| 2 | Justice Banson | CF 413 |
| 3 | Aran Clauson, PhD | CF 411 |
| 4 | Kameron Decker Harris, PhD | CF 461 |
| 5 | Wesley Deneke, PhD | CF 479 |
| 6 | Abdul Derwish, MS | CF 409 |
| 7 | Marie Deschene, PhD | CF 413 |
| 8 | Yasmine Elglaly, PhD | CF 465 |
| 9 | Perry Fizzano, PhD | CF 469 |
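
One way to make the loop above easier to test is to wrap it in a function that takes an HTML string, so it can be tried on a small hand-written sample before pointing it at the live page. The following is a sketch under assumptions: `parse_cards`, the sample markup, the "Example Person" entry, and the `CF <number>` office pattern are all made up for illustration, and the real directory's markup may differ.

```python
import re

import bs4

# Assumed office format: building code "CF" followed by a room number
OFFICE_PATTERN = re.compile(r"CF\s*\d+")


def parse_cards(html):
    """Return a list of {'Name': ..., 'Office': ...} dicts, one per card."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    records = []
    for card in soup.find_all("div", class_="card"):
        name = card.h3.a.text.strip()
        office = None  # stays None when no div looks like an office
        for div in card.find_all("div"):
            text = div.text.strip()
            if OFFICE_PATTERN.match(text):
                office = text
                break
        records.append({"Name": name, "Office": office})
    return records


# Hypothetical markup that mimics the card structure used above
sample_html = """
<div class="card">
  <h3><a href="#">Wesley Deneke, PhD</a></h3>
  <div>Faculty</div>
  <div>CF 479</div>
</div>
<div class="card">
  <h3><a href="#">Example Person</a></h3>
  <div>No office listed</div>
</div>
"""

records = parse_cards(sample_html)
print(records)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(records)` to get the same kind of table as above.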
