My favorite way to find outliers is to make histograms. I'd like to histogram all the images, but they're not single numbers, so I can't. One approach is to work around this quite directly: find a way to convert each image into a single number (a statistic, you could say), then plot a histogram of all the single numbers. You may want to try out a few of these - if you find weirdness in a histogram, you're probably onto something.
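As a rough sketch of this idea, the snippet below reduces each image to its mean pixel value and bins the results into a text histogram. The tiny nested-list "images", the `mean_brightness` helper, and the bin width are all made up for illustration; any other per-image statistic could be swapped in.

```python
from collections import Counter
from statistics import mean

# Hypothetical 2x2 grayscale "images" with pixel values in 0-255.
images = [
    [[10, 12], [11, 13]],
    [[14, 15], [13, 12]],
    [[12, 11], [10, 14]],
    [[250, 251], [249, 252]],  # a suspiciously bright image
]

def mean_brightness(image):
    """Collapse an image (a list of pixel rows) into a single number."""
    return mean(pixel for row in image for pixel in row)

stats = [mean_brightness(img) for img in images]

# Histogram: count how many images fall into each 50-wide bin.
bin_width = 50
counts = Counter(int(s // bin_width) * bin_width for s in stats)

for bin_start in sorted(counts):
    bin_end = bin_start + bin_width - 1
    print(f"{bin_start:3d}-{bin_end:3d}: {'#' * counts[bin_start]}")
```

A bar sitting far away from the rest, like the lone image in the 250 bin here, is the kind of weirdness worth investigating.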
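HTML is the language that web page content is written in. Some basic facts about HTML:

- Elements are written as a pair of tags around their contents: `<tagname>Contents</tagname>`, for example `<h1>Biggest Possible Heading</h1>`.
- Tags can carry attributes, such as the `href` attribute of the `a` tag. `<a>` stands for "anchor" and really means "link": `<a href="https://example.com">Click here to go to example dot com</a>` becomes [Click here to go to example dot com](https://example.com).
- Comments are written `<!-- like this -->`.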
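To make the tag, attribute, and comment facts concrete, here is a minimal sketch that uses Python's built-in `html.parser` module to pull the `href` attribute and link text out of the example anchor. The `LinkCollector` class and the `snippet` string are hypothetical names introduced for illustration; this is not the tool used later in these notes.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> we are currently inside, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears between <a ...> and </a>.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None

snippet = (
    '<h1>Biggest Possible Heading</h1>'
    '<a href="https://example.com">Click here to go to example dot com</a>'
    '<!-- like this -->'
)
collector = LinkCollector()
collector.feed(snippet)
print(collector.links)
```

The heading text and the comment are skipped because the collector only records text while it is inside an `a` element.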
Instead of reading the following list, let's look at the source for a couple of webpages and see what we find:
Some common HTML elements to know about:
Note: If Jupyter sees HTML amongst your Markdown, it will render it like HTML - that's why I'm able to show you both the code and how it renders in the examples below.
- `html` tag
- `body` tag
- `h1` ... `h6` are headings
- `p` is for paragraphs:

  Paragraph 1

  Paragraph 2

- `div` is a general-purpose (and by default invisible) container for blocks of page content
- `span` is a general-purpose container for snippets of text content
- `table` holds rows (`tr`) made of heading cells (`th`) and data cells (`td`):

<table>
<tr> <!-- begin header (first) row -->
<th>Heading 1</th> <!-- column 1 heading -->
<th>Heading 2</th> <!-- column 2 heading -->
</tr>
<tr> <!-- begin second row -->
<td>Row 1, Column 1</td>
<td>Row 1, Column 2</td>
</tr>
<tr> <!-- begin third row -->
<td>Row 2, Column 1</td>
<td>Row 2, Column 2</td>
</tr>
</table>
Heading 1 | Heading 2 |
---|---|
Row 1, Column 1 | Row 1, Column 2 |
Row 2, Column 1 | Row 2, Column 2 |
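Since tables are a common scraping target, the following sketch shows one way to turn the `tr`/`th`/`td` structure into a list of rows using only the standard library. The `TableRows` class is a hypothetical helper written for this example, and `table_html` is a copy of the example table written with `</th>` closing tags.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Gather the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])        # start a new row
        elif tag in ("th", "td"):
            self._in_cell = True
            self.rows[-1].append("")    # start a new cell in the current row

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._in_cell = False

    def handle_data(self, data):
        # Whitespace between tags is ignored because it is outside any cell.
        if self._in_cell:
            self.rows[-1][-1] += data

table_html = """
<table>
  <tr> <th>Heading 1</th> <th>Heading 2</th> </tr>
  <tr> <td>Row 1, Column 1</td> <td>Row 1, Column 2</td> </tr>
  <tr> <td>Row 2, Column 1</td> <td>Row 2, Column 2</td> </tr>
</table>
"""
parser = TableRows()
parser.feed(table_html)
header, *body = parser.rows
print(header)
print(body)
```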
Packages you'll need to `pip install` for this all to work (and for Lab 5):

- `requests`
- `beautifulsoup4`
Game plan:

- Use `requests` to get the HTML code for a webpage given its URL.
- Use `beautifulsoup4` to parse the resulting HTML and extract the data we want from it.

import requests
import bs4
url = "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/"
response = requests.get(url)
print(response.text)
<!DOCTYPE html> <html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang=""> <head> <meta charset="utf-8" /> <meta name="generator" content="pandoc" /> <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=yes" /> <meta name="author" content="Scott Wehrwein" /> <title>DATA 311 - Fundamentals of Data Science</title> <style> code{white-space: pre-wrap;} span.smallcaps{font-variant: small-caps;} span.underline{text-decoration: underline;} div.column{display: inline-block; vertical-align: top; width: 50%;} div.hanging-indent{margin-left: 1.5em; text-indent: -1.5em;} ul.task-list{list-style: none;} .display.math{display: block; text-align: center; margin: 0.5rem auto;} </style> <link rel="stylesheet" href="md.css" /> <!--[if lt IE 9]> <script src="//cdnjs.cloudflare.com/ajax/libs/html5shiv/3.7.3/html5shiv-printshiv.min.js"></script> <![endif]--> </head> <body> <header id="title-block-header"> <h1 class="title">DATA 311 - Fundamentals of Data Science</h1> <p class="author">Scott Wehrwein</p> <p class="date">Fall 2021</p> </header> <nav id="TOC" role="doc-toc"> <ul> <li><a href="#course-overview">Course Overview</a></li> <li><a href="#assessment">Assessment</a></li> <li><a href="#resources-for-getting-help-and-support">Resources for Getting Help and Support</a></li> <li><a href="#logistics">Logistics</a></li> <li><a href="#schedule">Schedule</a></li> <li><a href="#course-policies">Course Policies</a></li> </ul> </nav> <h2 id="course-overview">Course Overview</h2> <ul> <li>Syllabus and Course Website (you are here): <a href="https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f">https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f</a></li> <li>Instructor: Scott Wehrwein (<a href="mailto:scott.wehrwein@wwu.edu">scott.wehrwein@wwu.edu</a>)</li> <li>CRN: 44224</li> <li>Class Meetings: <ul> <li>Lecture: MTW 1:00 - 1:50 in AW 303</li> <li>Lab: F 1:00 - 1:50 in CF 165 / CF 167</li> </ul></li> <li>Office Hours (CF 471) - <em>Subject to 
chage as the quarter gets underway:</em> <ul> <li>Monday 3-4pm</li> <li>Wednesday 10:30-11:30am</li> <li>Thursday 12-1pm</li> </ul></li> </ul> <h3 id="what-is-this-course-about">What is this course about?</h3> <h6 id="synopsis-from-the-wwu-course-catalog">Synopsis from the WWU Course Catalog</h6> <p>Introduction to the fundamentals of data science, focusing on techniques for collecting, processing, visualizing and organizing data. Applied machine learning concepts will also be covered, including fundamentals of machine learning experimentation and the use of libraries to perform clustering, classification and regression. Includes lab.</p> <h6 id="official-course-outcomes">Official Course Outcomes</h6> <p>On completion of this course students will demonstrate:</p> <ul> <li>The ability to recognize and develop questions that can be answered using quantitative analysis of data.</li> <li>The ability to find, acquire, and transform datasets into analysis-ready format.</li> <li>The ability to apply exploratory data analysis techniques such as sampling, summary statistics, and basic visualization.</li> <li>A basic understanding of common experimental and evaluation techniques for data science.</li> <li>The ability to answer data science questions by selecting, applying, and analyzing the results of statistical and machine learning tools such as clustering, classification, regression, and dimensionality reduction.</li> </ul> <h3 id="textbook">Textbook</h3> <p>The following book is <strong>recommended</strong>, but not required:</p> <p>â <a href="https://www.data-manual.com/">The Data Science Design Manual, First Edition</a> by Steven S. Skiena. ISBN 9783319554433</p> <p>I try to avoid requiring expensive textbooks unless truly necessary, so Iâve decided not to require it. That said, I think this is a good, worthwhile book that is simultaneously very readable and packed with wisdom. 
I think if you are going into a data science major and possibly a data science career, this is a worthwhile book to have on your shelf. I will be drawing on material from this book throughout the quarter, and the Schedule table will be updated to include references to relevant readings.</p> <h2 id="assessment">Assessment</h2> <p>Data science is a practical pursuit, and this course takes a particularly practical-minded approach to it. We will focus less on the mathematical underpinnings of the tools of data science and more on strategies for successfully using those tools to extract insights from data. As such, the assessment in this course is entirely project-based. Grades will be calculated as a weighted average of scores on the following course components, each of which is described in more detail below:</p> <ul> <li>50% Weekly Lab Assignments</li> <li>30% Final Project</li> <li>10% Weekly Quizzes</li> <li>10% Participation in in-class activities and reading responses/discussions.</li> </ul> <p>The standard letter grade ranges apply (i.e., 90â100% is an A, 80â90% is a B, and so on). The calculated raw percentages <strong>may be curved</strong> at the instructorâs discretion, but any such curve used will not lower anyoneâs grade. â+â or â-â cutoffs will be decided at the instructorâs discretion.</p> <p>Students who demonstrate mastery of the material will get grades in the A range, and it is my goal to give as many Aâs as possible.</p> <h4 id="assessment-philosophy">Assessment Philosophy</h4> <p>In the labs and final project you will put the skills, concepts and processes discussed in lecture into practice. My approach to grading is inspired by what might be expected of you in a professional setting. A data scientistâs job is to extract and clearly and convincingly present insights from the messy realities of real data. This has several implications for how I think about assessment:</p> <ul> <li>Correctness is necessary, but not sufficient. 
In particular, you will be expected to present your results clearly and concisely, making effective use of the Jupyter Notebook environment to provide explanatory text alongside code and outputs. Your notebook should tell a compelling story about the data, and should not leave the reader to guess at why a given bit of analysis is done, or what it means.</li> <li>In any real-world setting such as a data science job or academic research, reality does not always align well what your boss or advisor has asked of you. If youâre unable to achieve the goals set out in the assignment due to complications you encountered with the data, you should (1) explain those complications; and (2) propose a path forward. A path forward might take the form of a plan for how the original goals could be achieved given more time or a different approach, or a proposal for an alternative set of goals that are achievable and align with the spirit of the original goals.</li> </ul> <h3 id="lab-assignments">Lab Assignments</h3> <p>Each full week of the course consists of three 50-minute âlecturesâ and one 50-minute lab. Lab periods will be spent introducing and getting started on the lab assignment for the week, which will be released on Friday at the start of the lab period and due the following Thursday night at 10:00pm. These labs, along with the final project, comprise the bulk of the workload for this course, so you should plan to allow significant time to complete them outside of class. Some labs will be done individually, while others may optionally be completed in pairs.</p> <h3 id="final-project">Final Project</h3> <p>A final project will be completed in groups of 2 or 3 students. The project will have multiple deliverables, including a proposal, milestone reports, a final report, and a presentation. 
Presentations will take place during our scheduled final exam slot, which for Fall 2021 is on <strong>Tuesday, December 7th from 3:30pm-5:30pm</strong> in our regular lecture classroom, AW 303.</p> <h3 id="quizzes">Quizzes</h3> <p>Weekly quizzes will be given to help you make sure youâre keeping up with the course content. These will be taken on Gradescope (see Logistics, below). They will be tentatively given asynrchronously, to be taken with a 15-minute timeline anytime between the end of class Friday and the beginning of class Monday. I reserve the right to switch to synchronous quizzes, in whcih case you would likely take the quiz in the first 10 or 15 minutes of each Fridayâs lab period.</p> <h3 id="in-class-activities-and-reading-responses">In-Class Activities and Reading Responses</h3> <p>My goal is to make the lecture component of this course as interactive as possible. Activities may range from simple class discussions and group work to quick writing prompts that will be handed in. Anything handed in will be graded for participation only (i.e., if you make an honest effort, you will receive full credit).</p> <p>I may assign a small number (certainly less than one per week) of required reading assignments that touch on the interactions between data science and society, often with a focus on ethical considerations. These will be evaluated by some combination of in-class discussions and short written responses.</p> <h2 id="resources-for-getting-help-and-support">Resources for Getting Help and Support</h2> <h3 id="help-with-course-content">Help with Course Content</h3> <p>If you are stuck, struggling, or need help on any aspect of the course, you have several avenues for seeking help:</p> <ul> <li>Come to my office hours (listed at the top of this page). 
My office hours are time specifically devoted to helping you, and I get lonely when nobody comes by, so please visit me!</li> <li>I am hoping that this class can become a supportive community of folks who are excited to help each other out. Whether through in-person interactions or chatting via the course Discord server, I encourage you to ask questions and exchange ideas with your peers.</li> <li>The CS Tutors (previously called the CS Mentors) are available to provide one-on-one help with CS premajor course material. Although you are welcome to visit the CS Tutors and they may be able to help, none of them will have taken this class previously because this is the first time this class has ever been offered! If you wish to try your luck, you can find information on how to get help on the <a href="https://tutorq.cs.wwu.edu/">CS Tutors Website</a>.</li> </ul> <p>If you are have concerns that go beyond the course material you are welcome to talk to me, but the following resources are also available to support you.</p> <h3 id="other-resources">Other Resources</h3> <h5 id="community-ambassadors">Community Ambassadors</h5> <p>The Computer Science department has both Faculty and Student community ambassadors who hold regular office hours:</p> <blockquote> <p>These hours are a time for students, staff and faculty to bring concerns, feedback or questions as it related to equity, inclusion and diversity within STEM. 
We hope that we, the Community Ambassadors and the STEM Inclusion and Outreach Specialist, can advise and also guide people to college, university or external resources.</p> </blockquote> <p>You can find information on Commnity Office Hours and contact details for both at the following link: https://cs.wwu.edu/equity-and-inclusion-ambassadors</p> <h3 id="university-resources">University Resources</h3> <p>As a reminder, the following University resources are always available:</p> <ul> <li>Student Health Center: http://www.wwu.edu/chw/student_health/</li> <li>Counseling Center http://www.wwu.edu/counseling/</li> <li>Disability Access Center: https://disability.wwu.edu/</li> <li>Equal Opportunity Office: http://www.wwu.edu/eoo/</li> </ul> <h2 id="logistics">Logistics</h2> <h3 id="course-webpage-syllabus">Course Webpage / Syllabus</h3> <p>The <a href="#schedule">Schedule</a> section of this page will be kept up-to-date as the quarter progresses with topics, links to all lecture materials (videos, slides, exercises, problems, and readings), as well as links to assignment and lab handouts. I suggest bookmarking this page; if you forget the URL and need to find your way back here, you can find the link on the Syllabus page in Canvas.</p> <h3 id="canvas">Canvas</h3> <p>I generally minimize the use of Canvas in favor of sharing materials via the course webpage. However, we will use Canvas for announcements, grades, and submission of assignments. Lab and assignment writeups will be linked from both the course webpage and the corresponding assignment on Canvas. Lecture materials, readings, etc. will only be posted on the course webpage. </p> <h3 id="gradescope">Gradescope</h3> <p>Quizzes will be taken, graded, and returned to you via an online tool called Gradescope. You will receive an email before the first quiz with instructions on how to set a password for the account that has been created for you. 
Thereafter, you can access quizzes and exams by logging into your account on <a href="gradescope.com">http://www.gradescope.com</a>.</p> <h3 id="discord">Discord</h3> <p>Discord is a popular communication platform that enables text, voice, and video chats to take place in a dedicated server. Iâve found Discord very helpful during remote instruction, and indeed have taught some of my courses entirely on Discord. Although we are in-person, I think having a central online platform for communicating about the course is a great way to build community, so Iâve created a Discord server for the class. The invitation link to join the server is on the Syllabus page of Canvas.</p> <p>If for some reason we need to switch to remote instruction, Discord may become a central part of how we conduct class; in the meantime, you are not required to join or participate, but I hope that you will join and chat with your classmates, ask questions in the Q&A channel, and post all the data science memes. I will, however ask that you (1) make sure that your nickname in our server is your real (or preferred) name, and (2) you keep in mind that our Discord server is an extension of our classroom environment, and everyoneâs conduct therefore needs to be as professional and respectful as it would be in an in-person classroom or lab.</p> <h3 id="computer-labs">Computer Labs</h3> <p>The CS department maintains a set of Computer Science computer labs separate from the general university labs. These systems are all set up with the software that you need to complete the work for this class.</p> <p><strong>CS Accounts</strong><br/> To log into the machines in these labs, you will need a separate Computer Science account, which youâll need to create unless youâve taken another CS course already. Your username will be the same as your WWU username, but you will need to activate your account and set a new password by visiting <a href="http://password.cs.wwu.edu">http://password.cs.wwu.edu</a>. 
Note that youâll need to do this <em>before</em> your first lab, since youâll be unable to log into the computers to access a web browser until youâve done this.</p> <p>If you didnât already have a CS account, you may not be able to log in during the first lab since accounts may not be created until the first Monday of the quarter. If this is the case, let me know and I will try to pair you up with someone who is able to log in so you can work on the lab together.</p> <p><strong>Lab Locations and Access</strong><br/> The following rooms in Communications Facility are CS Department labs: 162, 164, 165, 167, 405, 418, 420. These labs are open to all CS students (thatâs you!) any time except when scheduled for a class or other activity. The complete of CS labs and their schedules can be found on the <a href="https://gitlab.cs.wwu.edu/cs-support/public/-/wikis/home/survival_guide/resources/Labs#main-labs-supported">CS Support Wiki</a>. CF 405 is never booked, so itâs always available. Labs are open 24/7, although the building locks at 11pm so you wonât be able to enter later than that.</p> <p><strong>Fall 2021 Details:</strong></p> <ul> <li>Please make sure you are aware of the <a href="https://cs.wwu.edu/fall-2021-cs-lab-policies">Fall 2021 CS Lab policies</a>.</li> <li>Note that every other computer will be limited to remote access. You can see a list of remote-only lab hostnames <a href="https://gitlab.cs.wwu.edu/cs-support/public/-/wikis/home/survival_guide/day_to_day/Remotely_Accessing_Resources#systems-for-remote-access-only-new-for-fall-2021">here</a>.</li> </ul> <h5 id="remote-access">Remote Access</h5> <p>In this course, weâll be doing most of our work inside Jupyter notebooks, which are edited and run in a web browser. 
While you are welcome to set up the software needed for this class on your own computer, CS Support and I canât promise to troubleshoot any technical issues you encounter - Iâll help if I can, but I canât promise Iâll be able to solve your issues.</p> <p>The officially supported environment is the CS labs. It is possible to work on Jupyter notebooks remotely on the lab computers, but it requires some setup. Iâve written up detailed instructions for this, which you can find here: <a href="https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/remote/remote.html">remote access instructions</a>. If you have any trouble with these instructions, please check the Troubleshooting section at the end; if nothing there solves your problem, get in touch with me and Iâll do my best to help as well as make corrections or clarifications in the instructions.</p> <h3 id="feedback">Feedback</h3> <p>If thereâs something I can improve about the course, I sincerely want to know about it. I take student feedback seriously, and I believe itâs especially important this quarter given that this is a new course and the pandemic continues to drag on. Any feedback youâre willing to give is greatly appreciated, and I will do my best to act on constructive feedback whenever possible. I will solicit feedback through surveys periodically throughout the course, but you are welcome and encouraged to provide feedback anytime in my office hours, by email, or if you desire anonymity you can fill out <a href="https://forms.gle/RB6FNaStiN23PsGx5">this Google Form</a>.</p> <h3 id="flexibility">Flexibility</h3> <p>A great deal of uncertainty remains surrounding our return to in-person instruction. If weâve learned anything since March of 2020, itâs that we canât be sure what the world will look like a month into the future. 
I plan to be as flexible and forgiving as I can, and I ask that you do the same for each other and for me.</p> <p>The Universityâs policies for Fall quarter give me the option to switch to remote instruction, either temporarily or for the remainder of the quarter, for any reason. I hope that we can stay face-to-face, but if safety or pedagogy would be better served by remote instruction, we may need to make such a switch.</p> <h2 id="schedule">Schedule</h2> <p>This table contains a rough outline of a schedule for the quarter. As the quarter progresses, I will update it with more detail on past and upcoming topics. You will also find links to all course materials I post. Unless otherwise noted, References refer to chapters/sections in the Skiena book.</p> <table> <colgroup> <col style="width: 8%" /> <col style="width: 41%" /> <col style="width: 41%" /> <col style="width: 8%" /> </colgroup> <thead> <tr class="header"> <th>Date</th> <th>Topics</th> <th>Assignments</th> <th>References</th> </tr> </thead> <tbody> <tr class="odd"> <td>9/22 (0)</td> <td>What is data science? 
What is data?<br /><a href="lectures/L01/L01_written.pdf">written notes</a>; <a href="lectures/L01/L01.pdf">typed notes</a></td> <td></td> <td>1.1, 1.3</td> </tr> <tr class="even"> <td>9/24</td> <td>Lab 1: Jupyter and Pandas - The Basics<br /><a href="lectures/L02/L02_written.pdf">written notes</a>; <a href="lectures/L02/L02.pdf">typed notes</a></td> <td><a href="lab1/">Lab 1</a><br /><a href="https://www.gradescope.com/courses/316497">Quiz 1</a><br />Start of Quarter Survey (Canvas)</td> <td></td> </tr> <tr class="odd"> <td>9/27 (1)</td> <td>Ideas: Finding data, Asking questions, Data formats<br />Notes: <a href="lectures/L03/L03.ipynb">ipynb</a>; <a href="lectures/L03/L03.html">html</a></td> <td></td> <td>1.2, 3.1-3.2</td> </tr> <tr class="even"> <td>9/28</td> <td>Math: Basic probability; summary statistics<br />Notes: <a href="lectures/L04/L04.ipynb">ipynb</a>; <a href="lectures/L04/L04.html">html</a></td> <td></td> <td>2.1</td> </tr> <tr class="odd"> <td>9/29</td> <td>Programming: python, jupyter, pandas<br />Notes: <a href="lectures/L05/L05.ipynb">ipynb</a>; <a href="lectures/L05/L05.html">html</a><br />Live code: <a href="lectures/L05/L05_1.ipynb">ipynb</a>; <a href="lectures/L05/L05_1.html">html</a></td> <td></td> <td></td> </tr> <tr class="even"> <td>10/1</td> <td>Lab 2: Answering weather questions<br />Notes: <a href="lectures/L06/L06.ipynb">ipynb</a>; <a href="lectures/L06/L06.html">html</a></td> <td><a href="lab2">Lab 2</a><br /><a href="http://www.gradescope.com">Quiz 2</a></td> <td></td> </tr> <tr class="odd"> <td>10/4 (2)</td> <td>Ideas: exploratory analysis intro<br />Notes: <a href="lectures/L07/L07.ipynb">ipynb</a>; <a href="lectures/L07/L07.html">html</a><br />Preprocessing: <a href="lectures/L07/L07_NHANES_preprocessing.ipynb">ipynb</a></td> <td></td> <td>6.1</td> </tr> <tr class="even"> <td>10/5</td> <td>Math: Conditional Probability; Independence; Variance<br />Notes: <a href="lectures/L08/L08.ipynb">ipynb</a>; <a 
href="lectures/L08/L08.html">html</a><br />Written notes: <a href="lectures/L08/L08_written.pdf">pdf</a></td> <td></td> <td>2.1, 2.2</td> </tr> <tr class="odd"> <td>10/6</td> <td>Programming: cold open<br />Notebook: <a href="lectures/L09/L09.ipynb">ipynb</a>; <a href="lectures/L09/L09.html">html</a></td> <td></td> <td></td> </tr> <tr class="even"> <td>10/8</td> <td>Lab: COVID not COVID<br />Notes: <a href="lectures/L10/L10.ipynb">ipynb</a>; <a href="lectures/L10/L10.html">html</a></td> <td><a href="lab3">Lab 3</a><br /><a href="http://www.gradescope.com">Quiz 3</a><br /><a href="ethics1">Ethics 1</a></td> <td></td> </tr> <tr class="odd"> <td>10/11 (3)</td> <td>Ideas: cleaning; missing data; preprocessing<br />Notes: <a href="lectures/L11/L11.ipynb">ipynb</a>; <a href="lectures/L11/L11.html">html</a></td> <td></td> <td>3.3</td> </tr> <tr class="even"> <td>10/12</td> <td>Math: probability distributions; z scores; logs and normalization<br />Notes: <a href="lectures/L12/L12.ipynb">ipynb</a>; <a href="lectures/L12/L12.html">html</a></td> <td></td> <td>5.1</td> </tr> <tr class="odd"> <td>10/13</td> <td>Programming: numpy<br />Notes: <a href="lectures/L13/L13.ipynb">ipynb</a>; <a href="lectures/L13/L13.html">html</a></td> <td></td> <td></td> </tr> <tr class="even"> <td>10/15</td> <td>Lab: numpy/cleaning/preprocessing exercises<br />Notes: <a href="lectures/L14/L14.ipynb">ipynb</a>; <a href="lectures/L14/L14.html">html</a></td> <td><a href="lab4">Lab 4</a><br /><a href="http://www.gradescope.com">Quiz 4</a><br /></td> <td></td> </tr> <tr class="odd"> <td>10/18 (4)</td> <td>Data Ethics 1 discussion<br />Notes: <a href="lectures/L14/L14.ipynb">ipynb</a>; <a href="lectures/L15/L15.html">html</a><br />Resample example: <a href="lectures/L15/resample_example.ipynb">ipynb</a>, <a href="lectures/L15/resample_example.html">html</a></td> <td></td> <td></td> </tr> <tr class="even"> <td></td> <td>Ideas: Correlation (does not imply causation)<br />Math: measuring correlation</td> 
<td></td> <td></td> </tr> <tr class="odd"> <td></td> <td>Programming: HTML markup and web scraping</td> <td></td> <td></td> </tr> <tr class="even"> <td></td> <td>Lab: Beautiful Soup</td> <td></td> <td></td> </tr> <tr class="odd"> <td>10/25 (5)</td> <td>Ideas: visualization types; principles of visualization aesthetics<br />Math: matrices and ML setup<br />Programming: visualization (seaborn; a bit of matplotlib)<br />Lab: visualization with seaborn and matplotlib</td> <td></td> <td></td> </tr> <tr class="even"> <td>11/1 (6)</td> <td>Ideas: prediction + ML overview (supervised vs un, clas vs reg)<br />Math: clustering and dim red<br />Programming: scikit-learn basics<br />Lab: discovering structure by clustering / dim red w/ scikit learn</td> <td></td> <td></td> </tr> <tr class="odd"> <td>11/8 (7)</td> <td>Ideas: overfitting; data splits; validation and cross-validation<br />Math: linear regression plus tricks<br />11/11: Veterans Day - No Class<br />Lab: making predictions with linear regression</td> <td></td> <td></td> </tr> <tr class="even"> <td>11/15 (8)</td> <td>Ideas: evaluating ML systems<br />Math: distance measures and accuracy metrics<br />Programming/ideas: classification<br />Lab: making predictions with classification</td> <td></td> <td></td> </tr> <tr class="odd"> <td>11/22 (9)</td> <td><br />W, F: Thanksgiving Week</td> <td></td> <td></td> </tr> <tr class="even"> <td>11/29 (10)</td> <td>Prep week</td> <td></td> <td></td> </tr> <tr class="odd"> <td>Tuesday, 12/7</td> <td>Final Project Presentations (3:30pm - 5:30pm)</td> <td></td> <td></td> </tr> </tbody> </table> <h2 id="course-policies">Course Policies</h2> <h3 id="inclusive-classroom-environment">Inclusive Classroom Environment</h3> <p>It is expected that everyone will promote a friendly, supportive, and respectful environment in the classroom, labs, and project groups. 
Everyoneâs participation will be equally welcomed and valued.</p> <h3 id="attendance">Attendance</h3> <p>Hopefully it goes without saying at this point: <strong>if you feel sick, donât come to class.</strong></p> <p>I will not explicitly track attendance. However, in-class activities (generally graded on completion) cannot be made up after the fact. These assessments will be sufficeintly low-stakes that missing a handful of days will not affect your grade at all. If you will be missing more than an occasional class here and there, or if you have any concerns about the effect of absences on your grade, please have a conversation with me about it.</p> <h3 id="communication">Communication</h3> <p>It is your responsibility to make sure that you promptly become aware of Canvas Announcements as they are posted; Canvas should be configured to send you an email notification by default, but if you are unsure, please come see me in office hours.</p> <h3 id="late-work">Late Work</h3> <p>You have three âslip daysâ that you may use at your discretion to submit labs late. Slip days apply only to labs and can not be applied to any other deadline. You may use slip days one at a time or together - for example, you might submit each of three labs one day late, or submit one lab three days late. A slip day moves the deadline by exactly 24 hours from the original deadline; if you go beyond this, you will need to use a second slip day, if available.</p> <p>After your slip days are exhausted, a penalty of 10% * floor(hours_late/24 + 1) - that is, 10% per day late, will be applied. This is calculated as a percentage of the total points possible, not of the points earned.</p> <p>The time of your submission will be recorded when you submit it on Canvas, so other than submitting your assignment and corresponding survey late, you do not need to take any action to use a slip day. 
Your grading feedback will include a note of how many slip days have been applied.</p> <h3 id="academic-honesty">Academic Honesty</h3> <p>The academic honesty guidelines for this course differ somewhat from those of a typical CS course. Much of the code you write will be written in chunks of a few lines at a time. The challenge will more often be knowing which library functions to use and how to correctly apply them, rather than solving complex algorithmic problems.</p> <p>Some labs will be done individually, while others may be done in pairs. For all lab assignments, you are welcome and encouraged to discuss the lab with your classmates. You should feel free to exchange ideas for how to solve pieces of an assignment; this collaboration may be as detailed as suggesting which library function to use and an English description of what you might use it for. You may <strong>not</strong> copy anyone elseâs code, nor should you allow anyone else to copy your code. Finally, most tasks of most labs will ask you to intersperse descriptive text with your code, to explain what the code is doing. This text must be your own and cannot be copied from, or even âinspired byâ anyone elseâs text. If you did get help on how to code up a task, you can prove that you understand the solution well by explaining it in your notebook.</p> <p>For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, <strong>both members must understand and be able to explain in detail all aspects of their submission</strong>. For this reason, âpair programmingâ is highly recommended - you should not split the tasks up for each group member complete independently. 
I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.</p> <p>If you are collaborating with someone outside your pair and looking at their code, you may accidentally write identical code into your solution, even if you didnât do it by copy/pasting. For this reason, the safest way to avoid being flagged for academic dishonesty is to either (a) avoid looking at someone elseâs code in the first place or (b) wait at least 30 minutes after seeing someone elseâs code to write your own.</p> <h3 id="university-policies">University Policies</h3> <p>All University-wide policies apply to this course, including those outlined at <a href="syllabi.wwu.edu">http://syllabi.wwu.edu</a>. These policies cover issues including:</p> <ul> <li>COVID-19 Safety</li> <li>Academic Honesty</li> <li>Accommodations</li> <li>Ethical Conduct with WWU Network and Computing Resources</li> <li>Equal Opportunity</li> <li>Finals</li> <li>Medical Excuse Policy</li> <li>Student Conduct Code</li> </ul> </body> </html>
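The stray `â` characters in the output above (for example in `Iâve`) look like UTF-8 text that was decoded as Latin-1. One likely cause, though this is an assumption about the server's headers and not something verified here, is that `requests` falls back to ISO-8859-1 when a text response does not declare a charset. The sketch below reproduces the effect with a plain string; if that is indeed the cause, setting `response.encoding = "utf-8"` before reading `response.text` should give clean text.

```python
# "I've decided" written with a curly apostrophe (U+2019), as on the course page.
original = "I\u2019ve decided"

# Encode as UTF-8, which is what the page's <meta charset> declares...
raw_bytes = original.encode("utf-8")

# ...then decode with the wrong codec, as a mis-guessed encoding would.
garbled = raw_bytes.decode("latin-1")

# Decoding the same bytes as UTF-8 recovers the text.
recovered = raw_bytes.decode("utf-8")

print(repr(garbled))
print(recovered)
```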
soup = bs4.BeautifulSoup(response.text, 'html.parser')
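Things to demo:

- Get the first element with a given tag (`soup.a`); extract text (`.text`) and attributes (`['href']`)
- `find`, matching on attributes or on class (the `class_` kwarg)
- `find_all`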
soup.a
<a href="#course-overview">Course Overview</a>
soup.a.text
'Course Overview'
soup.a['href']
'#course-overview'
soup.find('a', attrs={'href': "#assessment"})
<a href="#assessment">Assessment</a>
soup.find(class_="author")
<p class="author">Scott Wehrwein</p>
soup.find_all('p', class_="author")
[<p class="author">Scott Wehrwein</p>]
items = soup.find('ul').find_all('li')
[it.text for it in items]
['Course Overview', 'Assessment', 'Resources for Getting Help and Support', 'Logistics', 'Schedule', 'Course Policies']
str(items[0])
'<li><a href="#course-overview">Course Overview</a></li>'
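For experimenting without hitting the network, the same calls can be tried on a small inline string. This is a sketch under the assumption that `beautifulsoup4` is installed; the `html` snippet is a hypothetical cut-down imitation of the course page, not its real source. It also shows that `find` returns `None` when nothing matches, which matters when scraping pages where some items lack a field.

```python
import bs4

# Hypothetical snippet modeled loosely on the course page's header and contents list.
html = """
<header>
  <h1 class="title">DATA 311</h1>
  <p class="author">Scott Wehrwein</p>
</header>
<ul>
  <li><a href="#course-overview">Course Overview</a></li>
  <li><a href="#assessment">Assessment</a></li>
</ul>
"""
soup = bs4.BeautifulSoup(html, "html.parser")

first_link = soup.a                       # first <a> element in the document
link_text = first_link.text               # text between the tags
link_target = first_link["href"]          # attribute lookup, dict-style

author = soup.find("p", class_="author").text
all_targets = [a["href"] for a in soup.find_all("a")]

missing = soup.find("table")              # no <table> here, so this is None

print(link_text, link_target, author, all_targets, missing)
```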
url = "https://cs.wwu.edu/faculty"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, 'html.parser')
cards = soup.find_all('div', class_="card")
wes = cards[5]
name = wes.h3.a.text
office = wes.find_all('div')[4].text
office
'CF 479'
import pandas as pd
import numpy as np
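# Collect one name and one office per faculty card.
data = {
    "Name": [],
    "Office": []
}
for card in cards:
    data["Name"].append(card.h3.a.text)
    divs = card.find_all('div')
    appended = False
    for d in divs:
        # Office locations look like "CF 479".
        if d.text[:2] == "CF":
            appended = True
            data["Office"].append(d.text)
            break  # one office per card keeps both columns the same length
    if not appended:
        # No office listed on this card: record a missing value.
        data["Office"].append(np.nan)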
df = pd.DataFrame(data)
df
 | Name | Office |
---|---|---|
0 | Shameem Ahmed, PhD | CF 491 |
1 | Selina Akter, PhD | CF 409 |
2 | Justice Banson | CF 413 |
3 | Aran Clauson, PhD | CF 411 |
4 | Kameron Decker Harris, PhD | CF 461 |
5 | Wesley Deneke, PhD | CF 479 |
6 | Abdul Derwish, MS | CF 409 |
7 | Marie Deschene, PhD | CF 413 |
8 | Yasmine Elglaly, PhD | CF 465 |
9 | Perry Fizzano, PhD | CF 469 |
10 | Erik Fretheim, PhD | CF 487 |
11 | Qiang Hao, PhD | CF 457 |
12 | Caroline Hardin, PhD | CF 463 |
13 | James Hearne, PhD | CF 455A |
14 | Brian Hutchinson, PhD | CF 475 |
15 | Tarek Idriss, PhD | CF 485 |
16 | Filip Jagodzinski, PhD | CF 493 |
17 | Yudong Liu, PhD | CF 483 |
18 | Shri Mare, PhD | CF 481 |
19 | Michael Meehan, PhD | CF 473 |
20 | Phil Nelson, PhD | CF 467 |
21 | Dustin O'Hara, PhD | CF 411 |
22 | Moushumi Sharmin, PhD | CF 477 |
23 | See-Mong Tan, PhD | CF 409 |
24 | Austin Thind | CF 411 |
25 | Michael Tsikerdekis, PhD | CF 489 |
26 | Scott Wehrwein, PhD | CF 471 |
27 | Jeff Woodcock, MS | NaN |
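The office lookup in the scraping loop boils down to "take the first div whose text starts with `CF`, otherwise record a missing value". As one possible way to isolate that logic, here is a sketch that works on plain strings so it can be tried without fetching the page. The `first_office` helper and the sample div texts are hypothetical.

```python
import math

def first_office(div_texts, prefix="CF"):
    """Return the first text starting with `prefix`, or NaN if there is none."""
    return next((t for t in div_texts if t.startswith(prefix)), math.nan)

# Hypothetical div texts for two cards: one with an office listed, one without.
card_with_office = ["Associate Professor", "CF 479", "someone@example.edu"]
card_without_office = ["Instructor", "someone@example.edu"]

offices = [first_office(card_with_office), first_office(card_without_office)]
print(offices)
```

Returning exactly one value per card guarantees that the name and office columns end up the same length, which `pd.DataFrame` requires when it is built from a dict of lists.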