Fall 2025
To practice our exploratory data analysis skills as well as gain experience with searching for, acquiring, and preprocessing datasets, this lab asks you to (a) collect and curate a dataset and (b) do some exploratory data analysis on it. This lab assignment spans two weeks, with the data collection part (a) due at the midway point, and the analysis (b) due the following week.
This lab will be done individually. You can brainstorm with any classmates about ideas for datasets and data curation methodology. However, your collection methodology, preprocessing, and analysis should be uniquely your own. Your submission must acknowledge ideas or suggestions you received from other classmates in an acknowledgments section at the end of the notebook you submit.
Your task here is to collect, clean, curate, explore, and present an interesting story from a dataset of your choosing. You have a lot of freedom to choose your dataset, which is both fun and dangerous. For this reason, you will come up with at least two candidate data collection plans for the pre-lab, and pursue the most promising one beginning in lab. I encourage you to put some time and thought into your data topic and source; certain choices up front may have a large impact on the difficulty of satisfying the lab’s requirements.
There must be some substantial effort involved. Downloading a pre-built dataset from Kaggle or any other source is generally not sufficient, unless you will be doing significant cleaning, curation, or other processing to the data. Examples of the kind of collection I have in mind include:
You should perform some analysis on your data and write up a polished presentation of your results. You may choose to go into the analysis (and collection) with a specific question in mind, or you may choose to explore and see what you find. The goal is to make sure your dataset of choice is rich enough that there’s at least some story to tell.
In a text editor of your choice that is capable of exporting to pdf, answer these questions for each of at least two distinct dataset ideas:
You’ll perform your collection and any necessary curation and cleaning in a notebook called lab6a.ipynb. Any preprocessing decisions should be justified or explained, and the notebook should ultimately save out a CSV file containing your final dataset in analysis-ready form.
The notebook should be self-contained and the dataset should be exactly reproducible by running all cells in the notebook, with no manual intervention.
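As a rough illustration of the final save step, the sketch below writes a DataFrame to CSV in a fixed row order so that rerunning the notebook yields the same file. It assumes pandas; the function name `save_dataset`, the column names, and the toy values are hypothetical, and the filename `dataset.csv` is taken from the submission list.

```python
import pandas as pd


def save_dataset(df: pd.DataFrame, path: str = "dataset.csv") -> pd.DataFrame:
    """Write an analysis-ready DataFrame to CSV with a stable row order."""
    ordered = (
        df.sort_values(list(df.columns))  # same row order on every run
        .reset_index(drop=True)
    )
    # index=False keeps the pandas row index out of the saved file
    ordered.to_csv(path, index=False)
    return ordered


# Hypothetical toy data; a real notebook would build this from its source.
raw = pd.DataFrame({"city": ["Cody", "Akron", "Bend"], "visits": [3, 1, 2]})
final = save_dataset(raw, "dataset.csv")
```

If your collection involves any random sampling, fixing the random seed in the notebook is another way to keep the output repeatable.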
Your analysis should be done in a notebook called lab6b.ipynb, and it should begin by loading the CSV created by your lab6a.ipynb notebook. Although you will likely do some amount of exploratory analysis and “scratch work” even once the dataset is in analysis-ready form, the goal of this part is to present some findings from the data in a somewhat polished form.
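A first cell for the analysis notebook might look something like the following sketch. It assumes pandas and a file named `dataset.csv` next to the notebook; the helper name `load_dataset` and the empty-file check are illustrative choices, not requirements.

```python
import pandas as pd


def load_dataset(path: str = "dataset.csv") -> pd.DataFrame:
    """Load the CSV produced by the collection notebook."""
    df = pd.read_csv(path)
    # Fail early if the file has a header but no data rows.
    if df.empty:
        raise ValueError(f"{path} contains no rows")
    return df


# Typical first steps once the file exists:
# df = load_dataset("dataset.csv")
# df.info()      # column names, dtypes, and non-null counts
# df.describe()  # summary statistics for numeric columns
```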
Your final notebook should contain only the code, analyses, and visualizations that are related to the story you’re telling. You should include the code to perform your analysis interspersed with concise Markdown cells explaining the analysis and presenting its results to a reader. As part of this presentation, the notebook should include at least one polished plot or visualization, and preferably two.
In a Discussion section at the bottom of lab6b.ipynb, write a short retrospective on the process you went through. Did you encounter any unexpected issues or hurdles in the collection, curation, or analysis of your data? Did the findings from your dataset differ from what you expected to see going in? Are there any limitations that might cast doubt on the results of your analysis? Is there any further data collection, curation, or analysis you’d perform next given what you know now?
For Part A, due the week after the lab is assigned, submit a single .zip file Firstname_Lastname_Lab6A.zip containing the following files:
lab6a.ipynb
dataset.csv
Upon submitting Part A, please fill out the Lab 6A survey on Canvas.
For Part B, due the week after Part A, submit a single .zip file Firstname_Lastname_Lab6B.zip containing the following files:
lab6a.ipynb
dataset.csv
lab6b.ipynb
I’m asking for all 3 files to account for the possibility that you needed to change your collection/preprocessing based on lessons learned while doing your analysis. Even if the Part A files are unchanged from your Part A submission, please submit all three files so we can look at them all together.
Upon submitting Part B, please fill out the Lab 6B survey on Canvas.
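If you would rather build the submission archives from code than by hand, a minimal sketch using Python's standard zipfile module is shown below. The helper name `make_submission` is hypothetical; the file and archive names in the usage comments come from the submission instructions.

```python
import zipfile
from pathlib import Path


def make_submission(zip_name: str, files: list[str]) -> Path:
    """Bundle the listed files into a single .zip archive."""
    missing = [f for f in files if not Path(f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # arcname stores only the filename, not the full path
            zf.write(f, arcname=Path(f).name)
    return Path(zip_name)


# Usage, with your own name substituted in:
# make_submission("Firstname_Lastname_Lab6A.zip", ["lab6a.ipynb", "dataset.csv"])
# make_submission("Firstname_Lastname_Lab6B.zip",
#                 ["lab6a.ipynb", "dataset.csv", "lab6b.ipynb"])
```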
The grade for Part A will be based on the following criteria:
The data collection, curation, and cleaning is nontrivial and substantial
The data collection process is well-documented and reproducible
The data is collected responsibly and ethically
The grade for Part B will be based on:
Your reflection will be graded on thoughtfulness and clarity.