DATA 311 - Lab 6: Data Collection and Exploratory Data Analysis

Scott Wehrwein

Fall 2025

Overview

To practice our exploratory data analysis skills as well as gain experience with searching for, acquiring, and preprocessing datasets, this lab asks you to (a) collect and curate a dataset and (b) do some exploratory data analysis on it. This lab assignment spans two weeks, with the data collection part (a) due at the midway point, and the analysis (b) due the following week.

Collaboration Policy

This lab will be done individually. You can brainstorm with any classmates about ideas for datasets and data curation methodology. However, your collection methodology, preprocessing, and analysis should be uniquely your own. Your submission must acknowledge ideas or suggestions you received from other classmates in an acknowledgments section at the end of the notebook you submit.

The Data

Your task here is to collect, clean, curate, explore, and present an interesting story from a dataset of your choosing. You have a lot of freedom to choose your dataset, which is both fun and dangerous. For this reason, you will come up with at least two candidate data collection plans for the pre-lab, and pursue the most promising one beginning in lab. I encourage you to put some time and thought into your data topic and source; certain choices up front may have a large impact on the difficulty of satisfying the lab’s requirements.

Collection and Curation Guidelines

There must be some substantial effort involved. Downloading a pre-built dataset from Kaggle or any other source is generally not sufficient, unless you will be doing significant cleaning, curation, or other processing of the data. Examples of the kind of collection I have in mind include:

Merging and unifying more than one dataset from different sources
Building a dataset from a public API
Ethically scraping content from the web
Performing nontrivial cleaning or preprocessing to get the data into a form that supports the analysis you’re interested in doing
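
As one hedged illustration of the "ethically scraping" item above, the sketch below uses Python's standard urllib.robotparser module to check whether a robots.txt policy permits fetching a page before any request is made. The robots.txt text, the user-agent string, and the example.com URLs are made-up placeholders, not a real site's policy. Honoring robots.txt is only one part of scraping responsibly; a site's terms of service and a reasonable request rate matter as well.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents. In practice you would instead call
# parser.set_url("https://<site>/robots.txt") followed by parser.read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "data311-lab6-student"  # hypothetical name for your scraper


def make_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, urls: list[str]) -> list[str]:
    """Keep only the URLs that robots.txt allows USER_AGENT to fetch."""
    return [u for u in urls if parser.can_fetch(USER_AGENT, u)]


parser = make_parser(EXAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/listings/page1.html",
    "https://example.com/private/accounts.html",
]
to_fetch = allowed_urls(parser, candidates)

# Seconds to sleep between requests; fall back to 1 if none is specified.
delay_seconds = parser.crawl_delay(USER_AGENT) or 1
```

In a real collection loop, each URL in to_fetch would be requested with a pause of delay_seconds between requests.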

Analysis Guidelines

You should perform some analysis on your data and write up a polished presentation of your results. You may choose to go into the analysis (and collection) with a specific question in mind, or you may choose to explore and see what you find. The goal is to make sure your dataset of choice is rich enough that there’s at least some story to tell.

Pre-Lab

In a text editor of your choice that is capable of exporting to PDF, answer these questions for each of at least two distinct dataset ideas:

What is the subject of your dataset? Specifically what data (e.g., columns) would you plan to include in your final, analysis-ready dataset?
How would you collect the data, and from what source(s)? What preprocessing, cleaning, or curation will be required?
What, if anything, are you hoping to find in the data? Is there a specific question you’d like to answer? If so, explain (as needed) how the data will allow you to answer it. If not, explain why you’re convinced that you will find a story to tell using the analysis techniques we’ve discussed.

Lab

Part A: Collection and Curation

You’ll perform your collection and any necessary curation and cleaning in a notebook called lab6a.ipynb. Any preprocessing decisions should be justified or explained, and the notebook should ultimately save a CSV file (named dataset.csv, as listed under Submission) containing your final dataset in analysis-ready form.

The notebook should be self-contained and the dataset should be exactly reproducible by running all cells in the notebook, with no manual intervention.
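
One way to read "exactly reproducible" is that running the notebook twice produces identical CSV text. The following is a minimal sketch under that reading, not a required approach. The column names, the sample rows, the seed value, and the helper names are all hypothetical. It seeds any random sampling step, sorts rows so the output does not depend on the order in which records were collected, and writes with the standard csv module.

```python
import csv
import io
import random

COLUMNS = ["city", "year", "value"]  # hypothetical column names


def row_key(record: dict) -> tuple:
    """Sort key that uniquely identifies a row in this made-up schema."""
    return (record["city"], record["year"])


def build_csv_text(records: list[dict], sample_size: int, seed: int = 311) -> str:
    """Return CSV text for a seeded random sample of records, in sorted order."""
    rng = random.Random(seed)               # local generator with a fixed seed
    ordered = sorted(records, key=row_key)  # remove dependence on input order
    sample = rng.sample(ordered, sample_size)
    sample.sort(key=row_key)                # deterministic row order in the file
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(sample)
    return buffer.getvalue()


# Made-up rows standing in for collected data.
records = [
    {"city": "Bellingham", "year": 2023, "value": 1.5},
    {"city": "Seattle", "year": 2023, "value": 2.0},
    {"city": "Bellingham", "year": 2024, "value": 1.7},
    {"city": "Tacoma", "year": 2024, "value": 0.9},
]

first = build_csv_text(records, sample_size=3)
second = build_csv_text(list(reversed(records)), sample_size=3)
```

In the notebook, the resulting text could then be written with open("dataset.csv", "w", newline=""). Comparing first and second is a quick check that the output does not change between runs or with the order of the input.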

Part B: Analysis

Your analysis should be done in a notebook called lab6b.ipynb, and it should begin by loading the CSV created by your lab6a.ipynb notebook. Although you will likely do some amount of exploratory analysis and “scratch work” even once the dataset is in analysis-ready form, the goal of this part is to present some findings from the data in a somewhat polished form.
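
A minimal sketch of that opening step, using only the standard library (pandas' read_csv would serve the same purpose). The expected column names, the load_dataset helper, and the inline sample data are hypothetical. The idea is to fail early if the file from Part A does not have the shape the analysis expects.

```python
import csv
import io

EXPECTED_COLUMNS = ["city", "year", "value"]  # hypothetical schema from Part A


def load_dataset(file_obj) -> list[dict]:
    """Read CSV rows and check that the header matches the expected schema."""
    reader = csv.DictReader(file_obj)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    rows = []
    for row in reader:
        # The csv module yields strings, so convert numeric fields explicitly.
        rows.append(
            {"city": row["city"], "year": int(row["year"]), "value": float(row["value"])}
        )
    if not rows:
        raise ValueError("dataset is empty")
    return rows


# Stand-in for:
#     with open("dataset.csv", newline="") as f:
#         rows = load_dataset(f)
sample_file = io.StringIO("city,year,value\nBellingham,2023,1.5\nSeattle,2023,2.0\n")
rows = load_dataset(sample_file)
```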

Your final notebook should contain only the code, analyses, and visualizations that are related to the story you’re telling. You should include the code to perform your analysis interspersed with concise Markdown cells explaining the analysis and presenting its results to a reader. As part of this presentation, the notebook should include at least one polished plot or visualization, and ideally two.

Reflection

In a Discussion section at the bottom of lab6b.ipynb, write a short retrospective on the process you went through. Did you encounter any unexpected issues or hurdles in the collection, curation, or analysis of your data? Did the findings from your dataset differ from what you expected to see going in? Are there any limitations that might cast doubt on the results of your analysis? Is there any further data collection, curation, or analysis you’d perform next given what you know now?

Submission

For Part A, due the week after the lab is assigned, submit a single .zip file Firstname_Lastname_Lab6A.zip containing the following files:

lab6a.ipynb
dataset.csv

Upon submitting Part A, please fill out the Lab 6A survey on Canvas.

For Part B, due the week after Part A, submit a single .zip file Firstname_Lastname_Lab6B.zip containing the following files:

lab6a.ipynb
dataset.csv
lab6b.ipynb

I’m asking for all three files to account for the possibility that you needed to change your collection/preprocessing based on lessons learned while doing your analysis. Even if the Part A files are unchanged from your Part A submission, please submit all three files so we can look at them all together.

Upon submitting Part B, please fill out the Lab 6B survey on Canvas.
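
The two archives described in this section can be assembled by hand, or with a short script such as the sketch below. The build_submission helper is hypothetical and not part of the lab; it assumes the required files sit together in one folder, and it refuses to build the archive if any of them is missing.

```python
import zipfile
from pathlib import Path

# Files required for each part, as listed in this Submission section.
REQUIRED_FILES = {
    "A": ["lab6a.ipynb", "dataset.csv"],
    "B": ["lab6a.ipynb", "dataset.csv", "lab6b.ipynb"],
}


def build_submission(part: str, first: str, last: str, folder: Path) -> Path:
    """Zip the required files for the given part; fail if any are missing."""
    names = REQUIRED_FILES[part]
    missing = [n for n in names if not (folder / n).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    archive = folder / f"{first}_{last}_Lab6{part}.zip"
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for n in names:
            zf.write(folder / n, arcname=n)  # store at the top level of the zip
    return archive
```

For example, build_submission("B", "Firstname", "Lastname", Path(".")) would write Firstname_Lastname_Lab6B.zip in the current folder.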

Rubric

Pre-Lab (10 points)

5 points per candidate dataset

Lab - Part A (50 points)

The grade for Part A will be based on the following criteria:

The data collection, curation, and cleaning are nontrivial and substantial
The data collection process is well-documented and reproducible
The data is collected responsibly and ethically

Lab - Part B (40 points)

Analysis (35 points)

The grade for Part B will be based on:

Soundness and interestingness of the analysis
Clarity of exposition

Reflection (5 points)

Your reflection will be graded on thoughtfulness and clarity.