Spring 2026
This project gives you the opportunity to put together several of the course learning outcomes, each reflected in the project’s three deliverables:
The project spans about two and a half weeks, including two labs, with three major deliverables: a Proposal, a Milestone, and a Final Report.
This project will be done individually. You can brainstorm with any classmates about ideas for datasets, data curation methodology, and analysis. However, your collection methodology, preprocessing, and analysis should be performed entirely by you. Your submission must acknowledge ideas or suggestions you received from other sources in an acknowledgments section at the end of the notebook you submit.
You have a lot of freedom to choose your dataset, which is both fun and dangerous. I encourage you to put some time and thought into your data topic and source; certain choices up front may have a large impact on the difficulty of satisfying the project requirements.
First, read this entire document to ensure that you understand what is expected.
In a text editor of your choice that is capable of exporting to PDF, answer these questions:
You may be asked to submit a revised proposal, and iterate until we can agree that the project is on track for a successful completion.
Perform your data collection and any necessary curation and cleaning
in a notebook called collection.ipynb. Any preprocessing
decisions should be justified or explained, and the notebook should
ultimately save out a CSV file called dataset.csv
containing your final dataset in analysis-ready form.
The notebook should be self-contained and the dataset should be exactly reproducible by running all cells in the notebook, with no manual intervention.
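One way to approach the "exactly reproducible" requirement is sketched below. This is only an illustration under stated assumptions: `build_rows` is a hypothetical stand-in for your real collection and cleaning steps, the `id` and `value` columns are made up, and the fixed seed only matters if your pipeline involves random sampling. The idea is to pin down anything that could vary between runs (random state, row order, column order) before writing dataset.csv. If you use pandas, `DataFrame.to_csv(index=False)` plays the same role as the writer here.

```python
import csv
import random

SEED = 0  # hypothetical fixed seed so any sampling step repeats exactly


def build_rows(seed=SEED):
    """Stand-in for the real collection and cleaning steps (made-up data)."""
    rng = random.Random(seed)
    raw = [{"id": i, "value": rng.randint(0, 100)} for i in range(5)]
    # Sort on a stable key so row order does not depend on collection order.
    return sorted(raw, key=lambda row: row["id"])


def save_dataset(rows, path="dataset.csv"):
    """Write rows with a fixed column order and a fixed line terminator."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "value"], lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)


# Last cell of the notebook: regenerate the file from scratch on every run.
save_dataset(build_rows())
```

Running all cells twice and comparing the two output files is a quick way to check that nothing in the pipeline depends on manual steps or hidden state.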
Additionally, your milestone submission should include a file
progress.pdf that describes:
There must be some substantial effort involved in the collection, curation, or organization of your data. Downloading a pre-built, analysis-ready dataset from Kaggle or any other source is generally not sufficient, unless you will be doing significant cleaning, curation, or other processing of the data. Examples of the kind of collection I have in mind include:
It is understood that you will learn new things as you begin data collection, and your project’s goals or approach may need to change. I encourage you to talk to me or your TA if you find that you won’t be able to complete what you proposed; we will be happy to help you overcome obstacles and/or pivot towards something more achievable.
Perform your data analysis in a notebook called
analysis.ipynb. This notebook should begin by loading the
CSV created by your collection.ipynb notebook. Although you
will likely do some amount of exploratory analysis and “scratch work”
even once the dataset is in analysis-ready form, the goal of this part
is to present some findings from the data in a somewhat polished
form.
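As a rough sketch of what the opening cell of analysis.ipynb might look like, the snippet below loads the CSV and runs a quick sanity check before any analysis begins. The helper names `load_dataset` and `missing_counts` are hypothetical, and the standard-library `csv` module is used only to keep the example dependency-free; a pandas-based version would work just as well.

```python
import csv
from collections import Counter


def load_dataset(path="dataset.csv"):
    """Read the analysis-ready CSV into a list of dicts, one per row."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        raise ValueError(f"{path} contains no data rows")
    return rows


def missing_counts(rows):
    """Count empty or absent cells per column as a quick sanity check."""
    counts = Counter()
    for row in rows:
        for column, value in row.items():
            # DictReader yields None for cells missing from a short row.
            if value is None or (isinstance(value, str) and not value.strip()):
                counts[column] += 1
    return dict(counts)
```

Calling `rows = load_dataset()` followed by `missing_counts(rows)` at the top of the notebook makes it obvious to a reader (and to you) whether the file produced by collection.ipynb is in the shape the analysis expects.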
Your final notebook should contain only the code, analyses, and visualizations that are related to the story you’re telling. You should include the code to perform your analysis interspersed with concise Markdown cells explaining the analysis and presenting its results to a reader. As part of this presentation, the notebook should include at least two highly polished plots or visualizations.
In a Discussion section at the bottom of analysis.ipynb,
write a short retrospective on the process you went through. Did you
encounter any unexpected issues or hurdles in the collection, curation,
or analysis of your data? Did the findings from your dataset differ from
what you expected to see going in? Are there any limitations that might
cast doubt on the results of your analysis? Is there any further data
collection, curation, or analysis you’d perform next given what you know
now?
Your analysis must be nontrivial and tell an interesting story. You may choose to go into the analysis (and collection) with a specific question in mind, or you may choose to explore and see what you find. The goal is to make sure your dataset of choice is rich enough that there’s at least some story to tell.
There is no strict length requirement; higher quality is more important than higher quantity. However, your explanations and written analyses should be thorough and substantive. You should think of the final report as a blog post showcasing your analysis, results, and conclusions.
Submit your proposal in PDF format to the Project Proposal assignment on Canvas.
Submit a .zip file Firstname_Lastname_Milestone.zip to
the Project Milestone assignment on Canvas. Your zip file should contain
the following files:
collection.ipynb
dataset.csv
progress.md
For the Final Report, due the week after the Milestone, submit a
single .zip file Firstname_Lastname_Project.zip containing
the following files:
collection.ipynb
dataset.csv
analysis.ipynb
I’m asking for all 3 files to account for the possibility that you needed to change your collection/preprocessing based on lessons learned while doing your analysis. Even if the Milestone files are unchanged from your Milestone submission, please submit all three files so we can look at them all together.
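If you would rather build the submission archive from code than by hand, a minimal sketch follows. The function name `make_submission` and the `FINAL_FILES` constant are hypothetical; the file names and the zip name pattern come from the instructions above. The function refuses to create the archive if any required file is missing, which helps catch an incomplete submission early.

```python
import zipfile
from pathlib import Path

# Files required for the Final Report, per the instructions above.
FINAL_FILES = ["collection.ipynb", "dataset.csv", "analysis.ipynb"]


def make_submission(zip_name, filenames, folder="."):
    """Bundle the required files into a zip, failing loudly if any is missing."""
    folder = Path(folder)
    missing = [name for name in filenames if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing required files: {missing}")
    zip_path = folder / zip_name
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in filenames:
            # arcname keeps the files at the top level of the archive.
            zf.write(folder / name, arcname=name)
    return zip_path
```

For example, `make_submission("Firstname_Lastname_Project.zip", FINAL_FILES)` run from the project folder would produce the Final Report archive, with your own name substituted in the file name.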
Upon submitting the Final Report, please fill out the Project Survey on Canvas.
Your proposal should be well-considered and complete. We will provide feedback and iterate until we have agreed on a satisfactory path. The 10 points for this deliverable will be assigned for completeness and responsiveness to feedback, if any.
Your milestone deliverable should demonstrate that you have completed data collection, curation, cleaning, etc. Any changes in scope or plan should be clearly documented. The 10 points for this deliverable will be assigned for completeness and responsiveness to feedback, if any.
The bulk of the project grade will be assigned holistically, considering all deliverables. This allows for some flexibility for you to put more emphasis on collection or analysis. I think of the grade for a project like this as a product of ambition and execution: to get a high score, you need to both take on a significant project and execute it well.
Data collection will be evaluated based on the following criteria:
The data collection, curation, and cleaning is nontrivial and substantial
The data collection process is well-documented and reproducible
The data is collected responsibly and ethically
Data analysis will be evaluated based on the following criteria:
Your reflection will be assessed on thoughtfulness and clarity.