{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1ad710c2",
   "metadata": {
    "id": "1ad710c2"
   },
   "source": [
    "# L11 - Data Collection and Structured Data 1\n",
    "## HTML, XML, and Web Scraping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1d52ae3b",
   "metadata": {
    "id": "1d52ae3b"
   },
   "source": [
    "#### Announcements:\n",
    "\n",
    "* Ethics 2 out\n",
    "  * Read an article and answer some questions by class Monday 5/4\n",
    "  * In-class data analysis activity, done in pairs; submit by 10pm Monday 5/11\n",
    "* (draft) Project writeup is posted. Start thinking about topics to propose! You're encouraged to run ideas by me and/or Nick before writing up your proposal.\n",
    "* Feel free to code along with today's lecture - and there are a couple exercises interspersed!"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "EFgHfIcpLvQZ",
   "metadata": {
    "id": "EFgHfIcpLvQZ"
   },
   "source": [
    "#### Goals:\n",
    "* Know the basics of how to read HTML\n",
    "* Understand the basic purpose and structure of XML\n",
    "* Know how to get the HTML of a webpage in Python using `requests`\n",
    "* Know how to get information out of raw HTML by parsing it using the Beautiful Soup library."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "646f31cc",
   "metadata": {
    "id": "646f31cc"
   },
   "source": [
    "## HTML - HyperText Markup Language\n",
    "\n",
    "HTML is one pervasive example of **structured data**.\n",
    "\n",
    "This is the language that web page content is written in. Some basic facts about HTML:\n",
    "* It's not really a \"programming language\" per se - more like Markdown than like Python.\n",
    "* It's basically a way to assign structure to the webpage content.\n",
    "* Basic units of HTML:\n",
    "    * **tags**:  `<tagname>Contents</tagname>`\n",
    "        * Example: `h1`: `<h1>This is the largest heading</h1>`\n",
    "    * **attributes**, such as the `href` attribute of the `a` tag:\n",
    "        * `<a>` stands for \"anchor\" and really means \"link\"\n",
    "        * `<a href=\"https://example.com\">Click here to go to example dot com</a>` becomes [Click here to go to example dot com](https://example.com).\n",
    "    * Comments are written `<!-- like this -->` <!-- hush -->\n",
    "* HTML is (delightfully) \"boring\":    \n",
    "    * HTML does *not* specify the appearance, formatting, or even layout of the page elements. This is done using a different language called Cascading Style Sheets (CSS).\n",
    "    * Much of the fancy dynamic and interactive page content you encounter on real websites is implemented using JavaScript (a \"real\" programming language)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "10ddbd95",
   "metadata": {
    "id": "10ddbd95"
   },
   "source": [
    "#### Basic Elements\n",
    "\n",
    "Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:\n",
    "* https://fw.cs.wwu.edu/~wehrwes/\n",
    "* https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/\n",
    "\n",
    "Ways to view HTML:\n",
    "* Right click > View Page Source\n",
    "* Open developer tools (varies by browser)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f6018ecc",
   "metadata": {
    "id": "f6018ecc"
   },
   "source": [
    "Some common HTML elements to know about:\n",
    "\n",
    "*Note*: If Jupyter sees HTML amongst your Markdown, it will render it as HTML - that's why I'm able to show you both the code and how it renders in the examples below.\n",
    "* The whole page is enclosed in an `html` tag\n",
    "* The body of the page is enclosed in a `body` tag\n",
    "* `h1`...`h6` are headings\n",
    "    * <h1>Heading 1</h1>\n",
    "    * <h2>Heading 2</h2>\n",
    "    * ...\n",
    "    * <h6>Heading 6</h6>\n",
    "* `p` is for paragraph\n",
    "    * <p>Paragraph 1</p><p>Paragraph 2</p>\n",
    "* `div` is a general-purpose (and by default invisible) container for blocks of page content\n",
    "    * <div>This stuff lives in a div.</div>\n",
    "* `span` is a general-purpose container for snippets of text content\n",
    "    * <span>This stuff lives in a span</span>, but this stuff does not.\n",
    "* Tables allow you to lay out information in tabular format.\n",
    "    ```html\n",
    "    <table>\n",
    "          <tr> <!-- begin header (first) row -->\n",
    "            <th>Heading 1</th> <!-- column 1 heading -->\n",
    "            <th>Heading 2</th> <!-- column 2 heading -->\n",
    "          </tr>\n",
    "          <tr> <!-- begin second row -->\n",
    "            <td>Row 1, Column 1</td>\n",
    "            <td>Row 1, Column 2</td>\n",
    "          </tr>\n",
    "          <tr> <!-- begin third row -->\n",
    "            <td>Row 2, Column 1</td>\n",
    "            <td>Row 2, Column 2</td>\n",
    "          </tr>\n",
    "    </table>\n",
    "    ```\n",
    "renders to:\n",
    "<table>\n",
    "          <tr> <!-- begin header (first) row -->\n",
    "            <th>Heading 1</th> <!-- column 1 heading -->\n",
    "            <th>Heading 2</th> <!-- column 2 heading -->\n",
    "          </tr>\n",
    "          <tr> <!-- begin second row -->\n",
    "            <td>Row 1, Column 1</td>\n",
    "            <td>Row 1, Column 2</td>\n",
    "          </tr>\n",
    "          <tr> <!-- begin third row -->\n",
    "            <td>Row 2, Column 1</td>\n",
    "            <td>Row 2, Column 2</td>\n",
    "          </tr>\n",
    "    </table>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "52547e5c-bdbd-4b69-83c4-4d966171efc4",
   "metadata": {},
   "source": [
    "## XML\n",
    "\n",
    "Example XML document:\n",
    "```xml\n",
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n",
    "<library>\n",
    "    <book id=\"b001\" status=\"available\">\n",
    "        <title>The Great Gatsby</title>\n",
    "        <author>\n",
    "            <firstName>F. Scott</firstName>\n",
    "            <lastName>Fitzgerald</lastName>\n",
    "        </author>\n",
    "        <publicationYear>1925</publicationYear>\n",
    "        <isbn>978-0-7432-7356-5</isbn>\n",
    "        <genres>\n",
    "            <genre>Fiction</genre>\n",
    "            <genre>Classic</genre>\n",
    "        </genres>\n",
    "    </book>\n",
    "    <book id=\"b002\" status=\"checked-out\">\n",
    "        <title>1984</title>\n",
    "        <author>\n",
    "            <firstName>George</firstName>\n",
    "            <lastName>Orwell</lastName>\n",
    "        </author>\n",
    "        <publicationYear>1949</publicationYear>\n",
    "        <isbn>978-0-452-28423-4</isbn>\n",
    "        <genres>\n",
    "            <genre>Dystopian</genre>\n",
    "            <genre>Science Fiction</genre>\n",
    "        </genres>\n",
    "    </book>\n",
    "</library>\n",
    "```\n",
    "\n",
    "* XML is HTML's more general (and not particularly well-liked) cousin.\n",
    "* HTML-like, but not document-specific\n",
    "* Meant to represent whatever structured data you like\n",
    "* Doesn't even have the \"default\" presentation that HTML has via browsers\n",
    "* You can define a \"schema\" that narrows down what a \"valid\" XML document looks like for your particular use case.\n",
    "  * For example, for the document above: maybe you insist a `<library>` can contain only a sequence of `<book>`s\n",
    "  * Documents can then be automatically error checked against the schema, then parsed much more easily\n",
    "\n",
    "One major (and largely successful) competitor to XML is JSON - we'll see this next time."
   ]
  },
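  {
   "cell_type": "markdown",
   "id": "xml-etree-sketch-note",
   "metadata": {},
   "source": [
    "As a rough sketch of how a program might read XML like this, the cell below parses a shortened copy of the library document using `xml.etree.ElementTree` from Python's standard library. The variable names (`xml_doc`, `root`) are made up for illustration, and this is only one of several ways to parse XML in Python. Notice that attributes are read with dictionary-style indexing and nested elements are reached by searching for their tag names."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "xml-etree-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import xml.etree.ElementTree as ET\n",
    "\n",
    "# shortened copy of the library document above (illustration only)\n",
    "xml_doc = \"\"\"<library>\n",
    "    <book id=\"b001\" status=\"available\">\n",
    "        <title>The Great Gatsby</title>\n",
    "        <publicationYear>1925</publicationYear>\n",
    "    </book>\n",
    "    <book id=\"b002\" status=\"checked-out\">\n",
    "        <title>1984</title>\n",
    "        <publicationYear>1949</publicationYear>\n",
    "    </book>\n",
    "</library>\"\"\"\n",
    "\n",
    "# parse the string; root is the <library> element\n",
    "root = ET.fromstring(xml_doc)\n",
    "\n",
    "# loop over the <book> children, reading an attribute and two child elements\n",
    "for book in root.findall(\"book\"):\n",
    "    year = book.find(\"publicationYear\").text\n",
    "    print(book.attrib[\"id\"], book.find(\"title\").text, year)"
   ]
  },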
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48b3be23-8490-4f95-8233-e323cf5997ab",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "id": "62236313",
   "metadata": {
    "id": "62236313"
   },
   "source": [
    "## Web Scraping\n",
    "#### So you want some data, but you can only find it buried in some webpage.\n",
    "\n",
    "Game plan:\n",
    "* Use `requests` to get the HTML code for a webpage given its URL\n",
    "* Use `beautifulsoup4` to parse the resulting HTML and extract the data we want from it.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8888e109-2a6a-4155-8f13-ab0d560c7db1",
   "metadata": {},
   "source": [
    "### Demo"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "febba039",
   "metadata": {
    "executionInfo": {
     "elapsed": 164,
     "status": "ok",
     "timestamp": 1675881161960,
     "user": {
      "displayName": "Scott Wehrwein",
      "userId": "11327482518794216604"
     },
     "user_tz": 480
    },
    "id": "febba039"
   },
   "outputs": [],
   "source": [
    "import requests\n",
    "import bs4\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "00c342bf",
   "metadata": {
    "id": "00c342bf"
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE html>\n",
      "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"en-US\" xml:lang=\"en-US\">\n",
      "<head>\n",
      "  <meta charset=\"utf-8\" />\n",
      "  <meta name=\"generator\" content=\"pandoc\" />\n",
      "  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0, user-scalable=yes\" />\n",
      "  <meta name=\"author\" content=\"Scott Wehrwein\" />\n",
      "  <title>DATA 311 - Fundamentals of Data Science</title>\n",
      "  <style>\n",
      "    /* Default styles provided by pandoc.\n",
      "    ** See https://pandoc.org/MANUAL.html#variables-for-html for config info.\n",
      "\n"
     ]
    }
   ],
   "source": [
    "url = \"https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/\"\n",
    "response = requests.get(url)\n",
    "print(response.text[:500])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "bcc3c418",
   "metadata": {
    "id": "bcc3c418"
   },
   "outputs": [],
   "source": [
    "soup = bs4.BeautifulSoup(response.text, 'html.parser')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0025b13-706b-4c40-9791-a32efe28afdc",
   "metadata": {
    "id": "0b973f90"
   },
   "source": [
    "To find the first instance of a tag, you can access the tag name as a property of the soup object:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "dcb6dd43-ff7c-4846-bac3-5c9a2a5c5984",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<a href=\"#course-overview\" id=\"toc-course-overview\">Course\n",
       "Overview</a>"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the first link in the document (<a> tag)\n",
    "soup.a"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "b577f085-9e31-4f65-96cc-7fb2ec848d1a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<h1 class=\"title\">DATA 311 - Fundamentals of Data Science</h1>"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the first h1 tag\n",
    "soup.h1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "914e98f7-4749-4e5f-8402-72f1195663ae",
   "metadata": {},
   "source": [
    "To get the text inside a tag, use the tag's `text` property:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "0f0391a7-322d-4862-89a0-f81ac05c3975",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'DATA 311 - Fundamentals of Data Science'"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the text inside the h1 element we found above\n",
    "soup.h1.text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2243e7e2-2d1c-4d39-ae14-d0c46904a116",
   "metadata": {},
   "source": [
    "Attributes of tags can be accessed using dictionary-like indexing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "24d60068-d8ce-4899-b27f-e18cdecaa122",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'#course-overview'"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# get the href attribute of the a element we found above\n",
    "soup.a[\"href\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f35b20a3-1978-4692-a991-8c4cf20f986b",
   "metadata": {},
   "source": [
    "We can also use the `find` method to search for the first instance of a tag:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "b70e07a1-8d9f-47d3-8010-12a678a50ffd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<a href=\"#course-overview\" id=\"toc-course-overview\">Course\n",
       "Overview</a>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# use find to get the first a, equivalent to soup.a\n",
    "soup.find('a')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f25a68ef-6b60-49e7-bc96-a7e2180f7299",
   "metadata": {},
   "source": [
    "The objects returned by `find` and friends are `Tag` objects, which support the same search methods as the soup itself, so we can call methods on them too:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "8f72df37-c8e6-4af0-8ed4-48c39e8046e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the first table (which is the Schedule table)\n",
    "schedule = soup.find(\"table\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "141af557-5743-4051-950a-78c3446d9341",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tr>\n",
       "<th>Date</th>\n",
       "<th>Topics</th>\n",
       "<th>Assignments</th>\n",
       "<th>References</th>\n",
       "</tr>"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# search the table for its first row (the header row)\n",
    "schedule.find(\"tr\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ade63747-a307-4430-8f88-e87a1425368d",
   "metadata": {},
   "source": [
    "**Exercise 0**: Find the text of the first link that's inside the first unordered list (`<ul>`) element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "52f83fa5-2877-4b95-b29c-ffa6e26fe831",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Course\\nOverview'"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "soup.ul.a.text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "911ffe72-4481-47f9-be4d-8eda3df96b9a",
   "metadata": {
    "id": "0b973f90"
   },
   "source": [
    "It's often useful to search for a tag with a given `id` attribute:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "adbdb612-5f44-4700-b7c8-470361595b20",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Course Policies'"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# find the h2 with id=\"course-policies\" by passing `id` as a keyword argument to find, then get its text\n",
    "soup.find(\"h2\", id=\"course-policies\").text"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f149ae3-4f33-481b-babe-7c67badfd9f5",
   "metadata": {},
   "source": [
    "You can also search arbitrary attributes (or combinations thereof) with the `attrs` kwarg:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f927c6cc-1514-4502-90fd-538f9b3741eb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# use find with a dict of attributes passed to the attrs kwarg\n",
    "# we can even do this without specifying the type of tag!\n",
    "soup.find(attrs={\"id\": \"course-policies\"})"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dbdb15d8-b741-441e-b539-473d83edf3cd",
   "metadata": {},
   "source": [
    "Sometimes you want more than one tag, or the first one fitting a description doesn't narrow it down enough. Suppose we want all the rows of the Schedule table - we can get a list with `find_all`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "id": "5ec7c85d-cd32-4098-883f-c885227748e7",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[<tr>\n",
       " <th>Date</th>\n",
       " <th>Topics</th>\n",
       " <th>Assignments</th>\n",
       " <th>References</th>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/1 (0)</td>\n",
       " <td>Introduction and overview<br/>What is data science? What is\n",
       " data?<br/><a href=\"lectures/L00/L00.pdf\">wb</a>, <a href=\"lectures/L00/L00.html\">typed</a>, <a href=\"lectures/L00/W00.html\">ws</a>, <a href=\"lectures/L00/L00_slides.pdf\">slides</a></td>\n",
       " <td>Start of Quarter Survey (Canvas)</td>\n",
       " <td>1.1, 1.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/3</td>\n",
       " <td>Data types<br/>numerical data<br/>Jupyter<br/>Data science tools\n",
       " overview<br/><a href=\"lectures/L01/L01.html\">notes</a>, <a href=\"lectures/L01/W01.html\">ws</a>, <a href=\"lectures/L01/L01.pdf\">wb</a></td>\n",
       " <td></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/6 (1)</td>\n",
       " <td>Multidimensional arrays; numpy<br/><a href=\"lectures/L02/L02.ipynb\">ipynb</a>, <a href=\"lectures/L02/L02.html\">html</a>, <a href=\"lectures/L02/L02.html\">notes</a></td>\n",
       " <td><a href=\"lab1\">Lab 1: numpy</a></td>\n",
       " <td><a href=\"https://medium.com/better-programming/numpy-illustrated-the-visual-guide-to-numpy-3b1d4976de1d?source=friends_link&amp;sk=57b908a77aa44075a49293fa1631dd9b\">Numpy\n",
       " Illustrated</a><br/>McKinney 4<br/><a href=\"https://numpy.org/doc/stable/user/quickstart.html\">Numpy\n",
       " Quickstart</a></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/8</td>\n",
       " <td>Tabular data and Dataframes<br/>pandas basics<br/>Notebook: <a href=\"lectures/L03/L03.ipynb\">ipynb</a>, <a href=\"lectures/L03/L03.html\">html</a></td>\n",
       " <td><a href=\"ethics1\">Data Ethics 1</a> out</td>\n",
       " <td>McKinney 5<br/><a href=\"https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html\">10\n",
       " mins to Pandas</a><br/><a href=\"https://pandas.pydata.org/docs/user_guide/dsintro.html#dsintro\">Intro\n",
       " to Pandas Data structures</a></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/10</td>\n",
       " <td>Minimum viable prob/stat<br/>pandas - basic stats and\n",
       " histograms<br/><a href=\"lectures/L04/L04.ipynb\">ipynb</a>, <a href=\"lectures/L04/L04.html\">html</a>, <a href=\"lectures/L04/L04.pdf\">wb</a>, <a href=\"lectures/L04/W04.html\">ws</a></td>\n",
       " <td>Quiz 1</td>\n",
       " <td>Skiena 2.1-2.2<br/>McKinney 5.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/13 (2)</td>\n",
       " <td>Formulating data questions<br/>Conditional Probability and\n",
       " Independence<br/><a href=\"lectures/L05/L05.ipynb\">ipynb</a>, <a href=\"lectures/L05/L05.html\">html</a>, <a href=\"lectures/L05/W05.html\">ws</a>, <a href=\"lectures/L05/L05.pdf\">wb</a></td>\n",
       " <td><a href=\"lab2\">Lab 2: pandas</a></td>\n",
       " <td>Skiena 1.2<br/>Skiena 2.1</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/15</td>\n",
       " <td>Data Ethics 1 Discussion</td>\n",
       " <td><a href=\"ethics1\">Ethics 1</a> due</td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/17</td>\n",
       " <td>Visualization: Principles<br/><a href=\"lectures/L06/L06.ipynb\">ipynb</a>, <a href=\"lectures/L06/L06.html\">html</a><br/><a href=\"lectures/L06/W06.html\">exit ticket</a></td>\n",
       " <td>Quiz 2</td>\n",
       " <td><a href=\"https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lab3/vdqa_excerpt.pdf\">Tufte\n",
       " excerpt</a><br/>Skiena 6<br/>McKinney 9<br/></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/20 (3)</td>\n",
       " <td>Visualization: Practice<br/>Exit ticket data/practice: <a href=\"lectures/L07/L06_exit.ipynb\">ipynb</a>, <a href=\"lectures/L07/L06_exit.html\">html</a><br/>Notebook: <a href=\"lectures/L07/L07.ipynb\">ipynb</a>, <a href=\"lectures/L07/L07.html\">html</a><br/><a href=\"lectures/L07/W07.html\">ws</a></td>\n",
       " <td><a href=\"lab3/\">Lab 3: visualization</a></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/22</td>\n",
       " <td>Processing: outliers and missing data, numerical\n",
       " normalization<br/><a href=\"lectures/L08/L08.ipynb\">ipynb</a>, <a href=\"lectures/L08/L08.html\">html</a>, <a href=\"lectures/L08/cleaning_scenarios.pdf\">ws</a></td>\n",
       " <td></td>\n",
       " <td>McKinney 7<br/>Skiena 3.3, 4.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/24</td>\n",
       " <td>Processing: text normalization, NLP basics<br/>notebook: <a href=\"lectures/L09/L09.ipynb\">ipynb</a>, <a href=\"lectures/L09/L09.html\">html</a><br/>exercise: <a href=\"lectures/L09/W09.ipynb\">ipynb</a>, <a href=\"lectures/L09/W09.html\">html</a></td>\n",
       " <td>Quiz 3</td>\n",
       " <td>See Lab 4 Pre-Lab<br/>McKinney 7.4</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/27 (4)</td>\n",
       " <td>Text normalization and NLP, continued<br/>Notebook: <a href=\"lectures/L10/L10.ipynb\">ipynb</a>, <a href=\"lectures/L10/L10.html\">html</a><br/>L09 exercise: <a href=\"lectures/L10/W09.ipynb\">ipynb</a>, <a href=\"lectures/L10/W09.html\">html</a><br/>L10 exercise: <a href=\"lectures/L10/W10.ipynb\">ipynb</a>, <a href=\"lectures/L10/W10.html\">html</a></td>\n",
       " <td><a href=\"lab4\">Lab 4: text normalization and NLP</a><br/></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>4/29</td>\n",
       " <td>Sick day</td>\n",
       " <td></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/1</td>\n",
       " <td>(Responsible) Data collection and Structured Data 1:<br/>HTML, XML,\n",
       " and Web Scraping<br/>Notebook/exercises: <a href=\"lectures/L11/L11.ipynb\">ipynb</a>, <a href=\"lectures/L11/L11.html\">html</a></td>\n",
       " <td>Quiz 4</td>\n",
       " <td>Skiena 3.1-3.2</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/4 (5)</td>\n",
       " <td>Data Ethics 2 - Activity<br/>Notebook: <a href=\"ethics2/allocative_bias.ipynb\">ipynb</a></td>\n",
       " <td>Lab 5: Data Collection<br/><a href=\"ethics2\">Data Ethics 2</a></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/6</td>\n",
       " <td>Data collection, continued<br/>APIs; merging and joining\n",
       " data<br/>Notebook/exercises: <a href=\"lectures/L12/L12.ipynb\">ipynb</a>, <a href=\"lectures/L12/L12.html\">html</a></td>\n",
       " <td></td>\n",
       " <td>Skiena 3.2<br/>McKinney 8.2-8.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/8</td>\n",
       " <td>Exploratory Data Analysis</td>\n",
       " <td>Quiz 5</td>\n",
       " <td>Skiena 6.1<br/>McKinney 13</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/11 (6)</td>\n",
       " <td>Correlation (does not imply causation)</td>\n",
       " <td><a href=\"project\">Project - Collection</a></td>\n",
       " <td>Skiena 2.3<br/>McKinney 5.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/13</td>\n",
       " <td>Data Ethics 3 Discussion</td>\n",
       " <td>Data Ethics 3</td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/15</td>\n",
       " <td>ML intro and taxonomy</td>\n",
       " <td>Quiz 6</td>\n",
       " <td>Skiena 7.1</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/18 (7)</td>\n",
       " <td>ML for Data Analysis: clustering and dimensionality reduction\n",
       " overview, distance metrics</td>\n",
       " <td><a href=\"project\">Project - Analysis</a></td>\n",
       " <td>Skiena 10.1</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/20</td>\n",
       " <td>ML for Data Analysis:<br/>Feature preprocessing<br/>K-Means\n",
       " Clustering</td>\n",
       " <td></td>\n",
       " <td>Skiena 10.5</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/22</td>\n",
       " <td>Supervised ML:<br/>Classification and regression; KNN</td>\n",
       " <td>Quiz 7</td>\n",
       " <td>Skiena 10.2</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/25 (8)</td>\n",
       " <td><strong>No Class - Memorial Day</strong></td>\n",
       " <td>Lab 7 - Machine Learning</td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/27</td>\n",
       " <td>ML Generalization: bias, variance, risk</td>\n",
       " <td></td>\n",
       " <td>Skiena 7.3</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>5/29</td>\n",
       " <td>Generalization, Continued<br/>ML Experimental setup and\n",
       " evaluation</td>\n",
       " <td>Quiz 8</td>\n",
       " <td>Skiena 7.5</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>6/1 (9)</td>\n",
       " <td>Machine Learning Example</td>\n",
       " <td></td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>6/3</td>\n",
       " <td>Evaluating ML: Classification and Regression metrics</td>\n",
       " <td></td>\n",
       " <td>Skiena 7.4</td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>6/5</td>\n",
       " <td><em>Ask Me Anything</em></td>\n",
       " <td>(practice) Quiz 9</td>\n",
       " <td></td>\n",
       " </tr>,\n",
       " <tr>\n",
       " <td>Thursday, 6/11</td>\n",
       " <td><strong>Final Exam - 3:30 pm - 5:30 pm</strong></td>\n",
       " <td></td>\n",
       " <td></td>\n",
       " </tr>]"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# find all tr elements of the first table element in the document\n",
    "soup.table.find_all(\"tr\")"
   ]
  },
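  {
   "cell_type": "markdown",
   "id": "rows-to-dataframe-note",
   "metadata": {},
   "source": [
    "Once you have a list of rows, a common next step is to turn it into a DataFrame. Here is a minimal sketch on a small made-up table; the HTML string and the names `demo_html`, `demo_soup`, and `demo_df` are invented for illustration and are not taken from the course page. It assumes the first row holds `<th>` header cells and every other row holds `<td>` cells, which real pages don't always guarantee."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "rows-to-dataframe-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import bs4\n",
    "import pandas as pd\n",
    "\n",
    "# small made-up table standing in for a real page (illustration only)\n",
    "demo_html = \"\"\"\n",
    "<table>\n",
    "  <tr><th>Date</th><th>Topic</th></tr>\n",
    "  <tr><td>4/1</td><td>Introduction</td></tr>\n",
    "  <tr><td>4/3</td><td>Data types</td></tr>\n",
    "</table>\n",
    "\"\"\"\n",
    "demo_soup = bs4.BeautifulSoup(demo_html, \"html.parser\")\n",
    "rows = demo_soup.table.find_all(\"tr\")\n",
    "\n",
    "# the first row holds the <th> header cells; the rest hold <td> data cells\n",
    "columns = [th.text for th in rows[0].find_all(\"th\")]\n",
    "records = [[td.text for td in row.find_all(\"td\")] for row in rows[1:]]\n",
    "\n",
    "demo_df = pd.DataFrame(records, columns=columns)\n",
    "demo_df"
   ]
  },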
  {
   "cell_type": "markdown",
   "id": "f4a185f5-afaa-44c2-bd27-11860bfceefc",
   "metadata": {
    "id": "0b973f90"
   },
   "source": [
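    "Passing a list as an attribute value matches any of the listed values, so we could drop the header row by keeping only rows with class `even` or `odd`. (In the output above the rows carry no `class` attribute, which is why the count here is 0.)"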
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "id": "abac47ee-11e6-42d5-8f1f-320f79c97024",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(soup.find_all(\"tr\", attrs={\"class\": [\"odd\", \"even\"]}))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb4fed1c-e7c2-427a-88b5-09dfa0028bd4",
   "metadata": {},
   "source": [
    "**Exercise 1:** Get a list containing the text of all the navigation links on the course webpage. Hint: the navigation buttons are all `<a>` elements that live inside a `<nav>` element."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "id": "3bfafbf1-390d-48c1-be82-f2aacb16d525",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Course\\nOverview',\n",
       " 'Assessment',\n",
       " 'Resources',\n",
       " 'Logistics',\n",
       " 'Schedule',\n",
       " 'Course\\nPolicies']"
      ]
     },
     "execution_count": 30,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "link_texts = []\n",
    "\n",
    "for link in soup.find(\"nav\").find_all(\"a\"):\n",
    "    link_texts.append(link.text)\n",
    "link_texts\n",
    "\n",
    "# or, with a list comprehension\n",
    "[link.text for link in soup.find(\"nav\").find_all(\"a\")]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3cf19cde",
   "metadata": {
    "id": "3cf19cde"
   },
   "source": [
    "**Exercise 2**: Collect the **Names** and **Office Numbers** of all WWU CS faculty from the department directory at this URL: <https://cs.wwu.edu/faculty>. Collect the data in a DataFrame with two columns, Name and Office."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "85f4d183-7083-4dd4-a985-fae0673fc159",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Shameem Ahmed, PhD\n",
      "Selina Akter, PhD\n",
      "Justice Banson, MS\n",
      "Kameron Decker Harris, PhD\n",
      "Marie Deschene, PhD\n",
      "Hanxiang Du, PhD\n",
      "Yasmine Elglaly, PhD\n",
      "Linda Epps, PhD\n",
      "Qiang Hao, PhD\n",
      "Caroline Hardin, PhD\n",
      "Hsiang-Jen Hong, PhD\n",
      "Fuqun Huang, PhD\n",
      "Brian Hutchinson, PhD\n",
      "Tarek Idriss, PhD\n",
      "Filip Jagodzinski, PhD\n",
      "Michael Koepp, MBA\n",
      "Yudong Liu, PhD\n",
      "Namita Mahajan, MS Cybersecurity\n",
      "Shri Mare, PhD\n",
      "Mubarek Mohammed, PhD\n",
      "John Mower, MS\n",
      "Phil Nelson, PhD\n",
      "Alexandra (Alex) Nilles, PhD\n",
      "Dustin O'Hara, PhD\n",
      "Blake Pedrini, BS\n",
      "Manoj Prasad, PhD\n",
      "Moushumi Sharmin, PhD\n",
      "See-Mong Tan, PhD\n",
      "Michael Tsikerdekis, PhD\n",
      "Scott Wehrwein, PhD\n",
      "Piper Wolters, MS\n"
     ]
    }
   ],
   "source": [
    "response = requests.get(\"https://cs.wwu.edu/faculty\")\n",
    "soup = bs4.BeautifulSoup(response.text, 'html.parser')\n",
    "\n",
    "for card in soup.find_all(\"div\", class_=\"card\"):\n",
    "    print(card.h3.a.text)"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}