Fall 2021
To practice our exploratory data analysis skills as well as gain experience with the data-hunting process, this lab asks you to find and analyze a dataset that does not relate directly to COVID, but clearly shows COVID’s impact.
You have a lot of freedom to choose your dataset, which is both fun and dangerous. I encourage you to spend some time looking around for datasets and exploring more than one, but don’t spend all week: you should be searching with a focus on finding one that will allow you to satisfy the requirements of the assignment.
You are required to complete this lab in pairs. I highly recommend collaborating synchronously, as each partner will be responsible for understanding (and being able to independently explain) every aspect of your submission. As a reminder, here’s the collaboration policy for labs done in pairs from the syllabus:
For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended: you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.
For this project, you get to find your own data. You can pick any dataset(s) you’d like, with the following stipulations:
If you’re not sure whether a dataset qualifies, or if you think you have a dataset that’s so interesting it should be exempted from rules #1 and/or #2, feel free to talk to me; I may grant case-by-case exceptions if your dataset and analysis seem particularly interesting or promising.
You can look anywhere you like for data. Here are a few ideas for possible data sources to help get your creativity flowing:
Please refer to Lab 1 for instructions on how to get the notebook server running, in case you’ve forgotten how; instructions for running it remotely on the lab machines are here.
A big part of this lab is finding a dataset. I expect that you’ll need to spend some time searching around and finding a dataset that (a) satisfies our requirements, (b) has the necessary time extent, and (c) turns out to have visible effects that can be attributed to COVID.
Your lab will be contained in a single Jupyter notebook. Create the notebook and include a title and the names of each group member in a Markdown cell at the top.
Write a short description of your dataset, giving answers to some of the preliminary questions we talked about in lecture; it’s up to you to decide which questions are relevant to your dataset, but you should at least include:
You may also want to address questions such as:
Load the dataset into your notebook and perform whatever preprocessing, cleaning, analysis, and visualization is needed to convincingly show how the effects of the pandemic manifest in your data. Specifics will vary by dataset, but line graphs or scatter plots showing how a variable changed over time seem likely to be useful. A convincing analysis will probably need to compare 2020 to at least one or two prior years to show that the effects you find are not simply a seasonal trend.
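As a rough illustration of the year-over-year comparison described above, here is a minimal sketch using pandas. Everything in it is an assumption made for the example: the data is synthetic (a seasonal cycle with an artificial drop starting in mid-March 2020), and the column names `date` and `trips` and the helper `monthly_by_year` are hypothetical. Your own dataset will need its own loading and cleaning steps.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a real dataset (an assumption for illustration only):
# three years of daily values with a yearly seasonal cycle, and an artificial
# 50% drop from 2020-03-15 onward.
dates = pd.date_range("2018-01-01", "2020-12-31", freq="D")
day_of_year = np.asarray(dates.dayofyear)
seasonal = 100 + 20 * np.sin(2 * np.pi * day_of_year / 365.25)
after_cutoff = np.asarray(dates >= pd.Timestamp("2020-03-15"))
df = pd.DataFrame({
    "date": dates,
    "trips": np.where(after_cutoff, 0.5 * seasonal, seasonal),
})


def monthly_by_year(frame, date_col, value_col):
    """Sum value_col per calendar month, with one column per year.

    Rows are months 1-12 and columns are years, so the same month can be
    compared across years and a seasonal pattern is not mistaken for a
    pandemic effect.
    """
    years = frame[date_col].dt.year.rename("year")
    months = frame[date_col].dt.month.rename("month")
    return frame.groupby([years, months])[value_col].sum().unstack("year")


monthly = monthly_by_year(df, "date", "trips")

# Compare each month of 2020 to the average of the same month in prior years.
baseline = monthly[[2018, 2019]].mean(axis=1)
ratio = monthly[2020] / baseline
```

In a notebook, `monthly.plot()` would then draw one line per year over the months, and `ratio` shows how far each month of 2020 sits from the prior-year average. Whether a monthly sum is the right aggregation depends on your data.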
In a Discussion section at the end of your notebook, write a short retrospective on the process you went through. Did you try any datasets before this one and find that they didn’t work out for one reason or another? Did your findings from your chosen dataset differ from what you expected to see going in? Are there any limitations that might cast doubt on the results of your analysis?
There is no set Extra Credit task, but if you go above and beyond the requirements of the assignment and impress me, I will award up to 5 points of extra credit. If you have an idea for extra credit, feel free to run it by me and I can let you know whether I think it would receive credit if executed well. As usual, each extra credit point is exponentially more difficult to get and I will need to be blown away to award all 5 points.
As usual, your analysis should tell a story clearly and convincingly. All the general guidelines from Lab 2 apply: assumptions, preprocessing, cleaning, and analysis should be clearly documented.
Also as usual, your notebook should start from the raw, unmodified data from wherever you sourced the data; make sure to include a reference to where the data can be obtained.
Double check that you are correctly grouped with your partner on Canvas. Each group needs to make only one submission.
Make sure that all your cells have been run and have the output you intended, then download a copy of your notebook in .ipynb format and submit the resulting file to the Lab 3 assignment on Canvas.
If your source data file(s) total under 100MB, upload them to Canvas in a zip file (please zip the data even if it’s only one file; this will preserve the data filename so your notebook can load it without modification). In this case, your notebook should assume the data is in the same directory as the notebook.
If your source data file(s) total more than 100MB, upload them to Google Drive (you can log into Google with your WWU credentials and access Google Drive with an unlimited storage quota). Share each data file publicly via link, and set up your notebook so that it loads the data from the Google Drive URL(s).
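The two data-submission options above could be handled in a notebook roughly as follows. This is a sketch under stated assumptions, not a required approach: the helper names `local_data_path` and `drive_download_url`, the filename `my_dataset.csv`, and the file ID in the share link are all hypothetical placeholders. The `uc?export=download&id=...` URL form is a commonly used pattern for direct downloads from Google Drive, but for large files Google may return a confirmation page instead of the data, so check what your notebook actually receives.

```python
import re
from pathlib import Path


def local_data_path(filename: str) -> Path:
    """Return a bare relative path for a data file stored next to the notebook.

    Rejecting absolute paths and subfolders keeps the notebook loadable
    without modification once the zip is unpacked beside it.
    """
    path = Path(filename)
    if path.is_absolute() or len(path.parts) != 1:
        raise ValueError("use a bare filename, e.g. 'my_dataset.csv'")
    return path


def drive_download_url(share_link: str) -> str:
    """Turn a Google Drive share link into a direct-download URL.

    Handles links of the form .../file/d/<ID>/view... and ...?id=<ID>.
    """
    match = (re.search(r"/d/([A-Za-z0-9_-]+)", share_link)
             or re.search(r"[?&]id=([A-Za-z0-9_-]+)", share_link))
    if match is None:
        raise ValueError("could not find a file ID in the share link")
    return f"https://drive.google.com/uc?export=download&id={match.group(1)}"


# Option 1: data under 100MB, zipped and uploaded alongside the notebook.
DATA_PATH = local_data_path("my_dataset.csv")  # hypothetical filename

# Option 2: data over 100MB, shared publicly from Google Drive.
SHARE_LINK = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
DATA_URL = drive_download_url(SHARE_LINK)

# In the notebook, either location can be passed to a loader, for example:
#   df = pd.read_csv(DATA_PATH)   or   df = pd.read_csv(DATA_URL)
```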
Finally, each individual must fill out the Week 3 Survey on Canvas. Your submission will not be considered complete until both partners have submitted the survey.
Part 1 is worth 10 points; Part 3 is worth 5; these are graded for completeness and clarity as follows:
Part 2 is worth 50 points, and is graded on the correct, convincing, and clear criteria.
Extra Credit
Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.