DATA 311 - Fundamentals of Data Science

Scott Wehrwein

Fall 2021

Course Overview

What is this course about?

Synopsis from the WWU Course Catalog

Introduction to the fundamentals of data science, focusing on techniques for collecting, processing, visualizing and organizing data. Applied machine learning concepts will also be covered, including fundamentals of machine learning experimentation and the use of libraries to perform clustering, classification and regression. Includes lab.

Official Course Outcomes

On completion of this course students will demonstrate:

Textbook

The following book is recommended, but not required:

The Data Science Design Manual, First Edition by Steven S. Skiena. ISBN 9783319554433

I try to avoid requiring expensive textbooks unless truly necessary, so I’ve decided not to require it. That said, I think this is a good, worthwhile book that is simultaneously very readable and packed with wisdom. I think if you are going into a data science major and possibly a data science career, this is a worthwhile book to have on your shelf. I will be drawing on material from this book throughout the quarter, and the Schedule table will be updated to include references to relevant readings.

Assessment

Data science is a practical pursuit, and this course takes a particularly practical-minded approach to it. We will focus less on the mathematical underpinnings of the tools of data science and more on strategies for successfully using those tools to extract insights from data. As such, the assessment in this course is entirely project-based. Grades will be calculated as a weighted average of scores on the following course components, each of which is described in more detail below:

The standard letter grade ranges apply (i.e., 90–100% is an A, 80–90% is a B, and so on). The calculated raw percentages may be curved at the instructor’s discretion, but any such curve used will not lower anyone’s grade. “+” or “-” cutoffs will be decided at the instructor’s discretion.

Students who demonstrate mastery of the material will get grades in the A range, and it is my goal to give as many A’s as possible.

Assessment Philosophy

In the labs and final project you will put the skills, concepts and processes discussed in lecture into practice. My approach to grading is inspired by what might be expected of you in a professional setting. A data scientist’s job is to extract and clearly and convincingly present insights from the messy realities of real data. This has several implications for how I think about assessment:

Lab Assignments

Each full week of the course consists of three 50-minute “lectures” and one 50-minute lab. Lab periods will be spent introducing and getting started on the lab assignment for the week, which will be released on Friday at the start of the lab period and due the following Thursday night at 10:00pm. These labs, along with the final project, comprise the bulk of the workload for this course, so you should plan to allow significant time to complete them outside of class. Some labs will be done individually, while others may optionally be completed in pairs.

Final Project

A final project will be completed in groups of 2 or 3 students. The project will have multiple deliverables, including a proposal, milestone reports, a final report, and a presentation. Presentations will take place during our scheduled final exam slot, which for Fall 2021 is on Tuesday, December 7th from 1:00pm-3:00pm in our regular lecture classroom, AW 303.

Quizzes

Weekly quizzes will be given to help you make sure you’re keeping up with the course content. These will be taken on Gradescope (see Logistics, below). Tentatively, they will be given asynchronously, to be taken with a 15-minute time limit anytime between the end of class Friday and the beginning of class Monday. I reserve the right to switch to synchronous quizzes, in which case you would likely take the quiz in the first 10 or 15 minutes of each Friday’s lab period.

In-Class Activities and Reading Responses

My goal is to make the lecture component of this course as interactive as possible. Activities may range from simple class discussions and group work to quick writing prompts that will be handed in. Anything handed in will be graded for participation only (i.e., if you make an honest effort, you will receive full credit).

I may assign a small number (certainly fewer than one per week) of required reading assignments that touch on the interactions between data science and society, often with a focus on ethical considerations. These will be evaluated by some combination of in-class discussions and short written responses.

Resources for Getting Help and Support

Help with Course Content

If you are stuck, struggling, or need help on any aspect of the course, you have several avenues for seeking help:

If you have concerns that go beyond the course material, you are welcome to talk to me, but the following resources are also available to support you.

Other Resources

Community Ambassadors

The Computer Science department has both Faculty and Student community ambassadors who hold regular office hours:

These hours are a time for students, staff and faculty to bring concerns, feedback or questions as they relate to equity, inclusion and diversity within STEM. We hope that we, the Community Ambassadors and the STEM Inclusion and Outreach Specialist, can advise and also guide people to college, university or external resources.

You can find information on Community Office Hours and contact details for both at the following link: https://cs.wwu.edu/equity-and-inclusion-ambassadors

University Resources

As a reminder, the following University resources are always available:

Logistics

Course Webpage / Syllabus

The Schedule section of this page will be kept up-to-date as the quarter progresses with topics, links to all lecture materials (videos, slides, exercises, problems, and readings), as well as links to assignment and lab handouts. I suggest bookmarking this page; if you forget the URL and need to find your way back here, you can find the link on the Syllabus page in Canvas.

Canvas

I generally minimize the use of Canvas in favor of sharing materials via the course webpage. However, we will use Canvas for announcements, grades, and submission of assignments. Lab and assignment writeups will be linked from both the course webpage and the corresponding assignment on Canvas. Lecture materials, readings, etc. will only be posted on the course webpage. 

Gradescope

Quizzes will be taken, graded, and returned to you via an online tool called Gradescope. You will receive an email before the first quiz with instructions on how to set a password for the account that has been created for you. Thereafter, you can access quizzes and exams by logging into your account on http://www.gradescope.com.

Discord

Discord is a popular communication platform that enables text, voice, and video chats to take place in a dedicated server. I’ve found Discord very helpful during remote instruction, and indeed have taught some of my courses entirely on Discord. Although we are in-person, I think having a central online platform for communicating about the course is a great way to build community, so I’ve created a Discord server for the class. The invitation link to join the server is on the Syllabus page of Canvas.

If for some reason we need to switch to remote instruction, Discord may become a central part of how we conduct class; in the meantime, you are not required to join or participate, but I hope that you will join and chat with your classmates, ask questions in the Q&A channel, and post all the data science memes. I will, however, ask that you (1) make sure that your nickname in our server is your real (or preferred) name, and (2) keep in mind that our Discord server is an extension of our classroom environment, and everyone’s conduct therefore needs to be as professional and respectful as it would be in an in-person classroom or lab.

Computer Labs

The CS department maintains a set of Computer Science computer labs separate from the general university labs. These systems are all set up with the software that you need to complete the work for this class.

CS Accounts
To log into the machines in these labs, you will need a separate Computer Science account, which you’ll need to create unless you’ve taken another CS course already. Your username will be the same as your WWU username, but you will need to activate your account and set a new password by visiting http://password.cs.wwu.edu. Note that you’ll need to do this before your first lab, since you’ll be unable to log into the computers to access a web browser until you’ve done this.

If you didn’t already have a CS account, you may not be able to log in during the first lab since accounts may not be created until the first Monday of the quarter. If this is the case, let me know and I will try to pair you up with someone who is able to log in so you can work on the lab together.

Lab Locations and Access
The following rooms in Communications Facility are CS Department labs: 162, 164, 165, 167, 405, 418, 420. These labs are open to all CS students (that’s you!) any time except when scheduled for a class or other activity. The complete list of CS labs and their schedules can be found on the CS Support Wiki. CF 405 is never booked, so it’s always available. Labs are open 24/7, although the building locks at 11pm so you won’t be able to enter later than that.

Fall 2021 Details:

Remote Access

In this course, we’ll be doing most of our work inside Jupyter notebooks, which are edited and run in a web browser. While you are welcome to set up the software needed for this class on your own computer, CS Support and I can’t promise to troubleshoot any technical issues you encounter - I’ll help if I can, but I can’t promise I’ll be able to solve your issues.

The officially supported environment is the CS labs. It is possible to work on Jupyter notebooks remotely on the lab computers, but it requires some setup. I’ve written up detailed instructions for this, which you can find here: remote access instructions. If you have any trouble with these instructions, please check the Troubleshooting section at the end; if nothing there solves your problem, get in touch with me and I’ll do my best to help as well as make corrections or clarifications in the instructions.

Feedback

If there’s something I can improve about the course, I sincerely want to know about it. I take student feedback seriously, and I believe it’s especially important this quarter given that this is a new course and the pandemic continues to drag on. Any feedback you’re willing to give is greatly appreciated, and I will do my best to act on constructive feedback whenever possible. I will solicit feedback through surveys periodically throughout the course, but you are welcome and encouraged to provide feedback anytime in my office hours, by email, or, if you desire anonymity, by filling out this Google Form.

Flexibility

A great deal of uncertainty remains surrounding our return to in-person instruction. If we’ve learned anything since March of 2020, it’s that we can’t be sure what the world will look like a month into the future. I plan to be as flexible and forgiving as I can, and I ask that you do the same for each other and for me.

The University’s policies for Fall quarter give me the option to switch to remote instruction, either temporarily or for the remainder of the quarter, for any reason. I hope that we can stay face-to-face, but if safety or pedagogy would be better served by remote instruction, we may need to make such a switch.

Schedule

This table contains a rough outline of a schedule for the quarter. As the quarter progresses, I will update it with more detail on past and upcoming topics. You will also find links to all course materials I post. Unless otherwise noted, References refer to chapters/sections in the Skiena book.

Date | Topics | Assignments | References
9/22 (0) | What is data science? What is data? [written notes; typed notes] | - | 1.1, 1.3
9/24 | Lab 1: Jupyter and Pandas - The Basics [written notes; typed notes] | Lab 1; Quiz 1; Start of Quarter Survey (Canvas) | -
9/27 (1) | Ideas: Finding data, Asking questions, Data formats [Notes: ipynb; html] | - | 1.2, 3.1-3.2
9/28 | Math: Basic probability; summary statistics [Notes: ipynb; html] | - | 2.1
9/29 | Programming: python, jupyter, pandas [Notes: ipynb; html] [Live code: ipynb; html] | - | -
10/1 | Lab 2: Answering weather questions [Notes: ipynb; html] | Lab 2; Quiz 2 | -
10/4 (2) | Ideas: exploratory analysis intro [Notes: ipynb; html] [Preprocessing: ipynb] | - | 6.1
10/5 | Math: Conditional Probability; Independence; Variance [Notes: ipynb; html] [Written notes: pdf] | - | 2.1, 2.2
10/6 | Programming: cold open [Notebook: ipynb; html] | - | -
10/8 | Lab: COVID not COVID [Notes: ipynb; html] | Lab 3; Quiz 3; Ethics 1 | -
10/11 (3) | Ideas: cleaning; missing data; preprocessing [Notes: ipynb; html] | - | 3.3
10/12 | Math: probability distributions; z scores; logs and normalization [Notes: ipynb; html] | - | 5.1
10/13 | Programming: numpy [Notes: ipynb; html] | - | -
10/15 | Lab: numpy/cleaning/preprocessing exercises [Notes: ipynb; html] | Lab 4; Quiz 4 | -
10/18 (4) | Data Ethics 1 discussion [Notes: ipynb; html] [Resample example: ipynb, html] | - | -
10/19 | Ideas: Correlation (does not imply causation); Math: measuring correlation [Notes: ipynb, html] | - | 2.3
10/20 | Programming: HTML markup and web scraping [Notes: ipynb, html] | - | 3.2.2
10/22 | Lab: Movie Soup [Notes: ipynb, html] | Lab 5; Quiz 5 | -
10/25 (5) | Ideas: principles of visualization aesthetics [Notes: ipynb, html; Examples] | - | 6.2, 6.4
10/26 | Programming: plot types; seaborn, a bit of matplotlib [Notes: ipynb, html] | - | 6.3
10/27 | Math: vectors and matrices; classification and regression [Notes: ipynb, html] | - | 1.4; 8.1-8.2
10/29 | Lab: visualization: the bad and the good [Notes: ipynb, html] | Lab 6; Quiz 6 | -
11/1 (6) | Ideas: prediction + ML overview (supervised vs un, clas vs reg) [Notes: ipynb, html] | Ethics 2 | 11.5
11/2 | Math: distances; dimensionality reduction [Notes: ipynb, html] [Scribbles: pdf] | - | 10.1; 8.5; 10.5
11/3 | Programming: scikit-learn basics; clustering [Notes: ipynb, html] | - | -
11/5 | Lab: discovering structure by clustering / dim red w/ scikit learn [Notes: ipynb, html] | Lab 7; Quiz 7; FP proposal out | -
11/8 (7) | Ideas: overfitting; data splits; validation and cross-validation; regularization? [Notes: ipynb, html, pdf] [Exercises: pdf] | - | 7.1, 7.4-7.5
11/9 | Math: linear regression plus tricks [Notes: ipynb, html, pdf] | - | 9.1-9.2, 9.5
11/10 | Data Ethics 2 discussion [Announcements: ipynb, html] | - | -
11/12 | Lab: Linear Regression - YMMV [Announcements: ipynb, html] | Lab 8; Quiz 8; FP proposal due | -
11/15 (8) | Ideas: Evaluation - Baselines, Regression metrics [Notes: ipynb, html] | - | 7.3-7.5
11/16 | Math: Evaluation: Classification Metrics [Notes: ipynb, html] | - | 7.3
11/17 | Programming: Classification; sklearn classifier zoo [Notes: ipynb, html, pdf] | - | 10.2, 9.6, 11.2, 11.4
11/19 | Lab: enter an ML competition! [Announcements: html] | Lab 9; Quiz 9 | -
11/22 (9) | Scaling and Cross-validation [Notes: ipynb, html, pdf] | FP milestone 1 | -
11/23 | Model selection hackery: Hyperparameter Search [Notes: ipynb, html] | - | -
11/24 | Thanksgiving Break | Ethics 3 | -
11/26 | Thanksgiving Break | - | -
11/29 (10) | Optional class: Q&A / special topics [Notes: ipynb, html] | FP milestone 2 | -
11/30 | Bias in ML [Notes: ipynb, html] [Racist sentiment classifier: ipynb, html] | Ethics 3 due | -
12/1 | Data Ethics 3 Discussion [Notes: ipynb, html] | - | -
12/3 | AMA (meet in AW 303) | - | -
Monday, 12/6 | (finals week, no class) | FP due | -
Tuesday, 12/7 | Final Project Presentations (1:00pm - 3:00pm) | - | -

Course Policies

Inclusive Classroom Environment

It is expected that everyone will promote a friendly, supportive, and respectful environment in the classroom, labs, and project groups. Everyone’s participation will be equally welcomed and valued.

Attendance

Hopefully it goes without saying at this point: if you feel sick, don’t come to class.

I will not explicitly track attendance. However, in-class activities (generally graded on completion) cannot be made up after the fact. These assessments will be sufficiently low-stakes that missing a handful of days will not affect your grade at all. If you will be missing more than an occasional class here and there, or if you have any concerns about the effect of absences on your grade, please have a conversation with me about it.

Communication

It is your responsibility to make sure that you promptly become aware of Canvas Announcements as they are posted; Canvas should be configured to send you an email notification by default, but if you are unsure, please come see me in office hours.

Late Work

You have three “slip days” that you may use at your discretion to submit labs late. Slip days apply only to labs and cannot be applied to any other deadline. You may use slip days one at a time or together - for example, you might submit each of three labs one day late, or submit one lab three days late. A slip day moves the deadline by exactly 24 hours from the original deadline; if you go beyond this, you will need to use a second slip day, if available.

After your slip days are exhausted, a penalty of 10% * floor(hours_late/24 + 1) (that is, 10% per day late) will be applied. This is calculated as a percentage of the total points possible, not of the points earned.

The time of your submission will be recorded when you submit it on Canvas, so other than submitting your assignment and corresponding survey late, you do not need to take any action to use a slip day. Your grading feedback will include a note of how many slip days have been applied.
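To make the arithmetic above concrete, here is a rough sketch of the slip-day count and the late penalty in Python. This is only an illustration, not the grading code actually used: the function names are made up, hours_late is assumed to be measured from the deadline that applies once any slip days are accounted for, and clamping the final score at zero is my own assumption since the policy above doesn't state a floor or cap.

```python
import math


def slip_days_needed(hours_late: float) -> int:
    """Slip days needed to cover a lab submitted hours_late past the original deadline.

    Each slip day moves the deadline by exactly 24 hours, so any part of a
    further 24-hour window costs another slip day.
    """
    if hours_late <= 0:
        return 0
    return math.ceil(hours_late / 24)


def late_penalty_percent(hours_late: float) -> int:
    """Penalty from the syllabus formula: 10% * floor(hours_late/24 + 1)."""
    if hours_late <= 0:
        return 0
    return 10 * math.floor(hours_late / 24 + 1)


def score_after_penalty(points_earned: float, points_possible: float,
                        hours_late: float) -> float:
    """Deduct the penalty as a percentage of points possible, not points earned.

    Assumption: the result is clamped at zero (not stated in the policy).
    """
    deduction = points_possible * late_penalty_percent(hours_late) / 100
    return max(0.0, points_earned - deduction)
```

For example, a lab worth 50 points that earned 45 and came in 30 hours late (with no slip days left) would lose 20% of 50, or 10 points, under this reading.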

Academic Honesty

The academic honesty guidelines for this course differ somewhat from those of a typical CS course. Much of the code you write will be written in chunks of a few lines at a time. The challenge will more often be knowing which library functions to use and how to correctly apply them, rather than solving complex algorithmic problems.

Some labs will be done individually, while others may be done in pairs. For all lab assignments, you are welcome and encouraged to discuss the lab with your classmates. You should feel free to exchange ideas for how to solve pieces of an assignment; this collaboration may be as detailed as suggesting which library function to use and an English description of what you might use it for. You may not copy anyone else’s code, nor should you allow anyone else to copy your code. Finally, most tasks of most labs will ask you to intersperse descriptive text with your code, to explain what the code is doing. This text must be your own and cannot be copied from, or even “inspired by” anyone else’s text. If you did get help on how to code up a task, you can prove that you understand the solution well by explaining it in your notebook.

For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.

If you are collaborating with someone outside your pair and looking at their code, you may accidentally write identical code into your solution, even if you didn’t do it by copy/pasting. For this reason, the safest way to avoid being flagged for academic dishonesty is to either (a) avoid looking at someone else’s code in the first place or (b) wait at least 30 minutes after seeing someone else’s code to write your own.

University Policies

All University-wide policies apply to this course, including those outlined at http://syllabi.wwu.edu. These policies cover issues including: