DATA 311 - Lab 3: COVID Not COVID

Scott Wehrwein

Fall 2021

Introduction

To practice our exploratory data analysis skills as well as gain experience with the data-hunting process, this lab asks you to find and analyze a dataset that does not relate directly to COVID, but clearly shows COVID’s impact.

You have a lot of freedom to choose your dataset, which is both fun and dangerous. I encourage you to spend some time looking around for datasets and exploring more than one, but don’t spend all week: you should be searching with a focus on finding one that will allow you to satisfy the requirements of the assignment.

Collaboration Policy

You are required to complete this lab in pairs. I highly recommend collaborating synchronously, as each partner will be responsible for understanding (and being able to independently explain) every aspect of your submission. As a reminder, here’s the collaboration policy for labs done in pairs from the syllabus:

For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.

The Data

For this project, you get to find your own data. You can pick any dataset(s) you’d like, with the following stipulations:

1. The subject of the dataset can’t have anything directly to do with COVID. Hospital admissions, infections, vaccinations, lockdowns, mandates, etc. are all off limits. I’m interested in looking at the effects of the global pandemic on wider society; since there have been plenty of such effects, I don’t think this is a big limitation.
2. The data should be as raw as possible. If you find and download a time series of some variable aggregated per month and show a sharp dip (or rise) in March 2020, you’ll receive very little credit. For example, if you want to look at air travel, find the raw data on departures and arrivals by day and by airport, rather than an already-aggregated listing of total daily flights. A good heuristic for this might be that your dataset should have (and your analysis should consider) at least thousands, not tens or hundreds, of rows.
3. The phenomena you find in the dataset must be convincingly attributable to the pandemic. A good example of what wouldn’t satisfy this rule is looking at stock market data: sure, the market tanked briefly in March 2020, but the stock market goes up and down all the time without help from global pandemics.

If you’re not sure whether a dataset qualifies, or if you think you have a dataset that’s so interesting it should be exempted from rules #1 and/or #2, feel free to talk to me; I may grant case-by-case exceptions if your dataset and analysis seem particularly interesting or promising.

You can look anywhere you like for data. Here are a few ideas for possible data sources to help get your creativity flowing:

Want to look at some aspect of transportation?
- Try starting with the Bureau of Transportation Statistics https://www.bts.gov/.
- Maybe you can detect changes due to COVID in bike share usage?
Business, employment, etc.?
- The Bureau of Labor Statistics might have something for you: https://www.bls.gov/data/
Trade?
- Maybe try https://www.trade.gov
- It looks like the Census Bureau’s foreign trade page (https://www.census.gov/foreign-trade/index.html) may have some interesting data
- https://marinecadastre.gov/ais/ looks pretty cool
- Search for a major city’s port for data on container and tanker ship traffic?
Financial / economic data?
Get creative and search around for whatever seems interesting to you!

Getting Started

Please refer to Lab 1 for instructions on how to get the notebook server running, in case you’ve forgotten how; instructions for running on the labs remotely are here.

A big part of this lab is finding a dataset. I expect that you’ll need to spend some time searching around and finding a dataset that (a) satisfies our requirements, (b) has the necessary time extent, and (c) turns out to have visible effects that can be attributed to COVID.

Your lab will be contained in a single Jupyter notebook. Create the notebook and include a title and the names of each group member in a Markdown cell at the top.

Your Tasks

Part 1 - Dataset

Write a short description of your dataset, giving answers to some of the preliminary questions we talked about in lecture; it’s up to you to decide which questions are relevant to your dataset, but you should at least include:

what is the dataset about?
how/from where did you acquire it?
what are the meanings of the columns that your analysis focuses on?
in a nutshell, what effects did you find?

You may also want to address questions such as:

how big is the dataset?
what is its temporal extent and how frequently is data sampled/recorded?
why was the data collected and by whom?
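The size and temporal-extent questions above can usually be answered with a few lines of pandas. The following is only a sketch: the helper name describe_extent and the column name "timestamp" are made up for illustration, and the four-row table stands in for your real data.

```python
import pandas as pd

def describe_extent(df, time_col):
    """Summarize row count, date range, and typical spacing between records."""
    # Parse the time column and sort it so consecutive differences make sense.
    ts = pd.to_datetime(df[time_col]).sort_values()
    return {
        "rows": len(df),
        "start": ts.min(),
        "end": ts.max(),
        # Median gap between consecutive records: a rough sampling interval.
        "median_gap": ts.diff().median(),
    }

# Tiny stand-in table; "timestamp" is a hypothetical column name.
demo = pd.DataFrame(
    {"timestamp": ["2019-01-01", "2019-01-02", "2019-01-04", "2019-01-05"]}
)
summary = describe_extent(demo, "timestamp")
print(summary)
```

For event-style data (one row per trip, flight, etc.) the median gap is less meaningful than the number of rows per day, so adapt the summary to what a row represents in your dataset.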

Part 2 - Analysis

Load the dataset into your notebook and perform whatever preprocessing, cleaning, analysis and visualization is needed to convincingly show how the effects of the pandemic manifest in your data. Specifics will vary by dataset, but line graphs or scatter plots showing how a variable changed over time seem likely to be useful. A convincing analysis will probably need to compare 2020 to at least one or two prior years to show that the effects you find are not simply a seasonal trend.
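One possible way to set up the year-over-year comparison is to aggregate the raw rows yourself into a month-by-year table. This is a sketch under stated assumptions, not a required approach: the data below is synthetic (a made-up event log with an artificial drop starting in mid-March 2020), and the names raw, "timestamp", and monthly_counts_by_year are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a raw dataset with one row per event.
# Assumption: roughly 100 events/day, falling to roughly 20/day from 2020-03-15.
rng = np.random.default_rng(0)
days = pd.date_range("2018-01-01", "2020-12-31", freq="D")
rate = np.where(days >= pd.Timestamp("2020-03-15"), 20, 100)
counts = rng.poisson(rate)
raw = pd.DataFrame({"timestamp": days.repeat(counts)})

def monthly_counts_by_year(df, time_col):
    """Count rows per calendar month, with one column per year."""
    ts = pd.to_datetime(df[time_col])
    return (
        df.groupby([ts.dt.month.rename("month"), ts.dt.year.rename("year")])
        .size()
        .unstack("year")
    )

monthly = monthly_counts_by_year(raw, "timestamp")

# Ratio of 2020 to the average of the two prior years, month by month.
# Values near 1 mean "same as usual for that month"; seasonal effects cancel out.
ratio = monthly[2020] / monthly[[2018, 2019]].mean(axis=1)
print(monthly)
print(ratio.round(2))
```

With matplotlib installed, monthly.plot() draws one line per year on a shared month axis, which makes a seasonal pattern (present in every year) easy to tell apart from a 2020-only change.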

Part 3 - Discussion

In a Discussion section at the end of your notebook, write a short retrospective on the process you went through. Did you try any datasets before this one and find that they didn’t work out for one reason or another? Did your findings from your chosen dataset differ from what you expected to see going in? Are there any limitations that might cast doubt on the results of your analysis?

Extra Credit

There is no set Extra Credit task, but if you go above and beyond the requirements of the assignment and impress me, I will award up to 5 points of extra credit. If you have an idea for extra credit, feel free to run it by me and I can let you know whether I think it would receive credit if executed well. As usual, each extra credit point is exponentially more difficult to get and I will need to be blown away to award all 5 points.

Guidelines

As usual, your analysis should tell a story clearly and convincingly. All the general guidelines from Lab 2 apply: assumptions, preprocessing, cleaning, and analysis should be clearly documented.

Also as usual, your notebook should start from the raw, unmodified data from wherever you sourced the data; make sure to include a reference to where the data can be obtained.

Submitting your work

Double check that you are correctly grouped with your partner on Canvas. Each group needs to make only one submission.

Code

Make sure that all your cells have been run and have the output you intended, then download a copy of your notebook in .ipynb format and submit the resulting file to the Lab 3 assignment on Canvas.

Data

If your source data file(s) total under 100MB, upload them to Canvas in a zip file (please zip the data even if it’s only one file - this will preserve the data filename so your notebook can load it without modification); in this case, your notebook should assume the data is in the same directory as the notebook.

If your source data file(s) total more than 100MB, upload them to Google Drive (you can log into Google with your WWU credentials and access Google Drive with an unlimited storage quota). Share each data file publicly via link, and set up your notebook so that it loads the data from the Google Drive URL(s).
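A publicly shared Drive link points at a preview page rather than the file itself, so the notebook needs a direct-download URL. The sketch below rewrites a share link into the commonly used uc?export=download form. Treat this as an assumption rather than a guarantee: the URL pattern is a widely used convention that Google may change, the helper name drive_download_url is made up, and the link shown is a placeholder, not a real file.

```python
import re

def drive_download_url(share_url):
    """Turn a Google Drive share link into a direct-download URL (by convention)."""
    # Share links usually carry the file ID either as .../d/<ID>/... or as ?id=<ID>.
    match = re.search(r"/d/([\w-]+)|[?&]id=([\w-]+)", share_url)
    if match is None:
        raise ValueError(f"no Drive file ID found in {share_url!r}")
    file_id = match.group(1) or match.group(2)
    return f"https://drive.google.com/uc?export=download&id={file_id}"

# Placeholder link for illustration only.
share = "https://drive.google.com/file/d/FILE_ID_PLACEHOLDER/view?usp=sharing"
url = drive_download_url(share)
print(url)
```

The resulting URL can then be passed to a loader such as pd.read_csv(url). For large files, Drive may return a confirmation page instead of the data; if the load fails with a parsing error, check what the URL returns in a browser and consider a dedicated downloader such as the third-party gdown package.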

Survey

Finally, each individual must fill out the Week 3 Survey on Canvas. Your submission will not be considered complete until both partners have submitted the survey.

Rubric

Part 1 is worth 10 points; Part 3 is worth 5; these are graded for completeness and clarity as follows:

60% - Completeness
- 6/6 - complete description of the relevant information about the dataset
- 4/6 - mostly complete
- 2/6 - critical information is missing
- 0/6 - no effort / no submission
40% - Clarity
- 4/4 - well-organized, concise, and easy to read
- 2/4 - overly verbose or terse but generally comprehensible
- 0/4 - no effort / no submission

Part 2 is worth 50 points, and is graded on the correct, convincing, and clear criteria.

50% - Correct
- 5/5 - analysis is completely technically sound
- 4/5 - small technical issues, but no apparent effect on the outcome of the analysis
- 3/5 - some techniques are applied or interpreted incorrectly in a way that affects the analysis
- 2/5 - an honest attempt was made, but it’s seriously incomplete or flawed
- 1/5 - some effort was made
- 0/5 - no effort / no submission
30% - Convincing
- 3/3 - Sensible preprocessing decisions are made; assumptions are reasonable
- 2/3 - Some questionable assumptions are made or important information ignored, weakening the conclusion of the analysis
- 1/3 - The analysis attempts to support the conclusion or provide insight into the issue in question, but does not succeed.
- 0/3 - no effort / no submission
20% - Clear
- 2/2 - code and analysis are clearly and concisely explained and justified
- 1/2 - code and analysis are partly explained/justified, or explanations are difficult to understand
- 0/2 - no effort was made / no submission

Extra Credit

Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.