Winter 2023
In the final project, you will bring together just about everything we’ve covered in this course, working through the full data science pipeline from coming up with a question to presenting your results.
You will work in groups of 3; other group sizes (2 or 4) will be allowed only by request and only to satisfy divisibility constraints. You are free to choose your groups.
The topic for the project is left open-ended so you can explore something of interest to you. The requirements are:
It’s pretty easy to scope a project incorrectly or go down a path where good data isn’t available, so we may need to iterate a little on the proposals before settling on a topic. It’s also possible that you’ll encounter unexpected roadblocks; in this case, well-justified pivots will be allowed at the milestone, provided that you end up with a quality project at the end. Pivots because you didn’t start early enough and ran out of time are not considered well-justified.
Due Date: Friday, 2/24
Write a proposal document for your final project. One person from each group will submit a single proposal as a PDF by the deadline above. Make it as short or long as needed (ideally no longer than needed), but I expect these to be about 1-2 pages. The document should include the following sections:
Due Date: Sunday, 3/5 (originally Friday, 3/3)
Submit a milestone report describing your progress, and supporting notebooks, code, etc. Submit a PDF file with the following information:
random_state argument to train_test_split.
Alongside your report, you should submit a zip file containing artifacts showing your progress. This will likely take the form of Jupyter notebooks, but if something else makes sense, go for it. I expect to see at least evidence of exploratory analysis results and baseline/evaluation setup. You do not need to submit the data necessary to run the notebooks, but you should submit the notebooks with up-to-date outputs that show me what I’d see if I did run them.
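As a rough illustration of what a baseline/evaluation setup might look like, here is a minimal sketch that fixes the random_state argument to train_test_split so the split is repeatable, then scores a trivial baseline. The iris dataset, the seed value, the 80/20 split, and the choice of accuracy as the metric are all placeholder assumptions, not requirements of the assignment; substitute your own data and a metric that suits your question.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

SEED = 0  # arbitrary choice (assumption); any fixed integer gives a repeatable split

# Stand-in dataset (assumption); replace with your project's features and labels.
X, y = load_iris(return_X_y=True)

# Fixing random_state means the same rows land in train and test on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

# Trivial baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train, y_train)

baseline_accuracy = accuracy_score(y_test, baseline.predict(X_test))
print(f"Baseline accuracy: {baseline_accuracy:.3f}")
```

Any model you train later can be scored on the same held-out split and compared against this number.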
Due Date: Friday, 3/10
Your final submission must present interesting results on your original motivating question. The exploratory analysis component must be Correct, Convincing, and Clear, and should include at least some polished visualizations in the spirit of Lab 3.
Expectations for the machine learning component depend on the style of approach.
For unsupervised learning tasks, your analysis should demonstrate interesting and non-trivial structure in your data. The criteria for this more closely resemble the criteria for exploratory analysis, and in fact such a component may be a direct extension of your exploratory component.
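One possible way to argue that structure is non-trivial is to compare a clustering quality score on the real data against the same score on a scrambled copy. The sketch below is only an illustration under several assumptions: synthetic blob data stands in for a real dataset, k-means with three clusters is the method, the silhouette score is the quality measure, and the helper name clustering_quality is hypothetical. Your project may call for entirely different techniques.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): three well-separated 2-D blobs.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)


def clustering_quality(data, n_clusters=3):
    """Fit k-means and return the silhouette score of the resulting labels."""
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    labels = model.fit_predict(data)
    return silhouette_score(data, labels)


# Null comparison: permute each column independently, which keeps every
# feature's marginal distribution but breaks the relationships between features.
X_shuffled = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)

real_score = clustering_quality(X)
null_score = clustering_quality(X_shuffled)
print(f"silhouette on real data:     {real_score:.2f}")
print(f"silhouette on shuffled data: {null_score:.2f}")
```

A clear gap between the two scores suggests the clusters reflect joint structure in the features and not just the shape of each feature on its own.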
For supervised prediction tasks, you should achieve the best performance you can on your prediction task and demonstrate that your results outperform one or more sensible baselines using well-motivated evaluation metrics. This likely will involve preprocessing, model selection, and hyperparameter tuning; you should try multiple different classification or regression models provided by sklearn and do at least some hyperparameter tuning to get the best possible performance you can on each. This process should be documented in a supporting notebook.
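The workflow described above (several sklearn models, some tuning on each, comparison against a baseline) could be sketched roughly as follows. Everything specific here is an assumption chosen for illustration: the breast cancer dataset, the two candidate models, the small parameter grids, the F1 metric, and 5-fold cross-validation. Pick models, grids, and metrics that are well motivated for your own task.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption); replace with your project's data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each candidate pairs an estimator with a small, illustrative parameter grid.
candidates = {
    "logistic_regression": (
        make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    ),
    "random_forest": (
        RandomForestClassifier(random_state=0),
        {"n_estimators": [50, 100], "max_depth": [None, 5]},
    ),
}

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
scores = {"baseline": f1_score(y_test, baseline.predict(X_test))}

for name, (model, grid) in candidates.items():
    # Tune on the training set only, using cross-validation.
    search = GridSearchCV(model, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    # Score the refit best model once on the held-out test set.
    scores[name] = f1_score(y_test, search.predict(X_test))
    print(f"{name}: best params {search.best_params_}")

for name, score in scores.items():
    print(f"{name}: test F1 = {score:.3f}")
```

Tuning happens inside cross-validation on the training data, so the test set is touched only once per model and the comparison with the baseline stays fair.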
This is a writeup of your project for a general audience. It will be a notebook (titled blog.ipynb), but you should think of it as a blog post. It should talk not only about your results, but also provide background on what you set out to do, why it is interesting and worth reading about, and the size, source, etc. of the data you used. It should walk the reader through your most interesting findings and provide discussion and interpretation of what the results mean and their implications. The blog post should address both the exploratory analysis you did and the predictions you made, though it can focus more on one or the other if one of them turned out to be more interesting. The blog post should include at least some Lab 3-quality visualizations to support the exposition; that is, the visualizations should be highly polished and adhere to the principles of visualization aesthetics, though you don’t need to explain your designs here as you did in Lab 3.
Your blog notebook should also refer to one or more additional notebooks (in .ipynb format) containing the details of your analysis that you will submit alongside the blog post. These notebooks should be correct, clear, and convincing according to the guidelines used for many labs in this course. The audience here is an interested reader of your blog post who wants to dive into the details of your analysis, and potentially reproduce it - as such, the notebooks together should contain everything needed to reproduce your results.
Submit a single zip file to Canvas containing:
blog.ipynb file containing your blog post.
data.txt file with instructions for downloading the data; ideally this would either be from the original source or a shared Google Drive or OneDrive link.
Submissions will be posted on the course webpage and linked from a final project showcase page for posterity.
In class on 3/7 and 3/8
We will use Tuesday, Wednesday (and if needed, Friday) of the last week of classes to give brief (5-7 minute) presentations of the final projects. These are relatively informal presentations that will give you a chance to see the fun and interesting results found by other groups. You may make separate presentation slides, or just show and talk about your blog post, but make sure that you’re keeping it short and talking about the highlights of what was interesting about your findings. If you have slides or any other visuals to show, you will need to send them to me at least one hour before the beginning of class so we can present from a single computer. The presentation schedule will be posted as we approach the last week.