DATA 311 - Project: Data Collection and Exploratory Data Analysis

Scott Wehrwein

Spring 2026

Overview

Objectives

This project gives you the opportunity to put together several of the course learning outcomes, each reflected in the project’s three deliverables:

Deliverables

The project spans about two and a half weeks, including two labs, with three major deliverables:

Collaboration Policy

This project will be done individually. You can brainstorm with any classmates about ideas for datasets, data curation methodology, and analysis. However, your collection methodology, preprocessing, and analysis should be performed entirely by you. Your submission must acknowledge ideas or suggestions you received from other sources in an acknowledgments section at the end of the notebook you submit.

Project Description

1. Proposal

You have a lot of freedom to choose your dataset, which is both fun and dangerous. I encourage you to put some time and thought into your data topic and source; certain choices up front may have a large impact on the difficulty of satisfying the project requirements.

First, read this entire document to ensure that you understand what is expected.

In a text editor of your choice that can export to PDF, answer these questions:

  1. What is the subject of your dataset? Specifically what data (e.g., columns) would you plan to include in your final, analysis-ready dataset?
  2. How would you collect the data, and from what source(s)? What preprocessing, cleaning, or curation will be required?
  3. What, if anything, are you hoping to find in the data? Is there a specific question you’d like to answer? If so, explain (as needed) how the data will allow you to answer it. If not, explain why you’re convinced that you will find a story to tell using the analysis techniques we’ve discussed.

You may be asked to submit a revised proposal, and iterate until we can agree that the project is on track for a successful completion.

2. Milestone - Data Collection

Perform your data collection and any necessary curation and cleaning in a notebook called collection.ipynb. Any preprocessing decisions should be justified or explained, and the notebook should ultimately save out a CSV file called dataset.csv containing your final dataset in analysis-ready form.

The notebook should be self-contained and the dataset should be exactly reproducible by running all cells in the notebook, with no manual intervention.
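As a rough illustration of what "exactly reproducible" can look like, here is a minimal sketch of a final notebook cell that writes dataset.csv deterministically. The column names, the seed value, and the `collect_rows` stand-in are all hypothetical; your own collection code would replace them. The idea is to fix any random seed, write columns in a fixed order, and sort rows by a stable key so that two runs produce identical files. It uses only the standard library, though pandas' `DataFrame.to_csv(path, index=False)` would serve the same purpose.

```python
import csv
import random

SEED = 311  # hypothetical fixed seed so any random sampling is repeatable
COLUMNS = ["id", "name", "value"]  # hypothetical column names, in a fixed order


def collect_rows(seed=SEED):
    """Stand-in for real collection code: returns a list of dict rows."""
    rng = random.Random(seed)  # local generator, unaffected by other cells
    return [
        {"id": i, "name": f"item{i}", "value": rng.randint(0, 100)}
        for i in range(5)
    ]


def save_dataset(rows, path="dataset.csv"):
    """Write rows with a fixed column order and a stable row order."""
    ordered = sorted(rows, key=lambda r: r["id"])
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS)
        writer.writeheader()
        writer.writerows(ordered)
    return path
```

Calling `save_dataset(collect_rows())` as the last cell would then produce the same dataset.csv on every "Run All". If your data comes from a live source that changes over time, this sketch alone does not guarantee identical output; that is a limitation worth noting in your write-up.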

Additionally, your milestone submission should include a file progress.pdf that describes:

Guidelines

There must be some substantial effort involved in the collection, curation, or organization of your data. Downloading a pre-built, analysis-ready dataset from Kaggle or any other source is generally not sufficient, unless you will be doing significant cleaning, curation, or other processing of the data. Examples of the kind of collection I have in mind include:

It is understood that you will learn new things as you begin data collection, and your project’s goals or approach may need to change. I encourage you to talk to me or your TA if you find that you won’t be able to complete what you proposed - we will be happy to help you overcome obstacles and/or pivot towards something more achievable.

3. Final Report - Data Analysis

Perform your data analysis in a notebook called analysis.ipynb. This notebook should begin by loading the CSV created by your collection.ipynb notebook. Although you will likely do some amount of exploratory analysis and “scratch work” even once the dataset is in analysis-ready form, the goal of this part is to present some findings from the data in a somewhat polished form.
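One possible shape for the opening cell of analysis.ipynb is sketched below. The `load_dataset` helper is a hypothetical name, not a required interface; it simply reads the CSV and fails early if the file has no header or no data rows, so that problems with the collection step surface before any analysis runs. In practice you might load the file with pandas instead; this version sticks to the standard library.

```python
import csv


def load_dataset(path="dataset.csv"):
    """Read the CSV into a list of dicts and run basic sanity checks."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = reader.fieldnames or []  # None when the file is empty
        rows = list(reader)
    if not columns:
        raise ValueError(f"{path} has no header row")
    if not rows:
        raise ValueError(f"{path} has a header but no data rows")
    return columns, rows
```

A first cell could then be `columns, rows = load_dataset()`, followed by a quick look at `columns` and `len(rows)`. Note that every value read this way is a string; converting types is left to your analysis code.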

Your final notebook should contain only the code, analyses, and visualizations that are related to the story you’re telling. You should include the code to perform your analysis interspersed with concise Markdown cells explaining the analysis and presenting its results to a reader. As part of this presentation, the notebook should include at least two highly polished plots or visualizations.

Reflection

In a Discussion section at the bottom of analysis.ipynb, write a short retrospective on the process you went through. Did you encounter any unexpected issues or hurdles in the collection, curation, or analysis of your data? Did the findings from your dataset differ from what you expected to see going in? Are there any limitations that might cast doubt on the results of your analysis? Is there any further data collection, curation, or analysis you’d perform next given what you know now?

Guidelines

Your analysis must be nontrivial and tell an interesting story. You may choose to go into the analysis (and collection) with a specific question in mind, or you may choose to explore and see what you find. The goal is to make sure your dataset of choice is rich enough that there’s at least some story to tell.

There is no strict length requirement; quality matters more than quantity. However, your explanations and written analyses should be thorough and substantive. You should think of the final report as a blog post showcasing your analysis, results, and conclusions.

Submission

Proposal

Submit your proposal in PDF format to the Project Proposal assignment on Canvas.

Milestone

Submit a .zip file Firstname_Lastname_Milestone.zip to the Project Milestone assignment on Canvas. Your zip file should contain the following files:

  1. collection.ipynb
  2. dataset.csv
  3. progress.md
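If you prefer to build the archive from Python rather than by hand, the following is a minimal sketch using the standard `zipfile` module. The `build_zip` helper and its parameters are hypothetical, and placing the files flat at the root of the archive is an assumption, not a stated requirement. The file list is passed in by the caller, so the helper checks only that each listed file exists before zipping.

```python
import zipfile
from pathlib import Path


def build_zip(first, last, suffix, files, out_dir="."):
    """Bundle files into Firstname_Lastname_<suffix>.zip at the archive root."""
    missing = [str(f) for f in files if not Path(f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files: {missing}")
    zip_path = Path(out_dir) / f"{first}_{last}_{suffix}.zip"
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # arcname drops any directory prefix (assumed flat layout)
            zf.write(f, arcname=Path(f).name)
    return zip_path
```

For example, `build_zip("Firstname", "Lastname", "Milestone", [...])` with your three file paths would write Firstname_Lastname_Milestone.zip to the current directory.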

Final Report

For the Final Report, due the week after the Milestone, submit a single .zip file Firstname_Lastname_Project.zip containing the following files:

  1. collection.ipynb
  2. dataset.csv
  3. analysis.ipynb

I’m asking for all 3 files to account for the possibility that you needed to change your collection/preprocessing based on lessons learned while doing your analysis. Even if the Milestone files are unchanged from your Milestone submission, please submit all three files so we can look at them all together.

Upon submitting the Final Report, please fill out the Project Survey on Canvas.

Rubric

Proposal (10 points)

Your proposal should be well-considered and complete. We will provide feedback and iterate until we have agreed on a satisfactory path. The 10 points for this deliverable will be assigned for completeness and responsiveness to feedback, if any.

Milestone (10 points)

Your milestone deliverable should demonstrate that you have completed data collection, curation, cleaning, etc. Any changes in scope or plan should be clearly documented. The 10 points for this deliverable will be assigned for completeness and responsiveness to feedback, if any.

Final Report (100 points)

The bulk of the project grade will be assigned holistically, considering all deliverables. This allows for some flexibility for you to put more emphasis on collection or analysis. I think of the grade for a project like this as a product of ambition and execution: to get a high score, you need to both take on a significant project and execute it well.

Data collection will be evaluated based on the following criteria:

Data analysis will be evaluated based on the following criteria:

Reflection

Your reflection will be assessed on thoughtfulness and clarity.