Fall 2021
Now that we have the language to talk about, and the tools to create, effective visualizations, this lab asks you to spend some effort producing some nice visualizations.
This lab will be done individually. As usual, you may spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
If you haven’t already, you’ll probably want to install Seaborn: pip install seaborn
Having been introduced to some of the principles of good visualization, you are now qualified to cultivate your visualization snobbery. A great way to do this is to peruse some of the various websites devoted to collecting and ridiculing weird and bad graphs and data visualizations.
If you want to revisit or dig a little deeper into some of the visualization aesthetics ideas we talked about in class, check out Chapter 6.2 in the textbook for a quick summary of the Tufte principles. I also recommend perusing this excerpt from Tufte’s The Visual Display of Quantitative Information.
For Part 1 of this lab, we’re going to dig into one particularly bad plot, talk about why it’s bad, then make it better. I’d like you to consider the graphs in this random blog post. The author was trying to do a bit of amateur data science back in 2015, and came up with some rather unusual ways to present the results. In particular, let’s focus on the third figure (the one with the “parents” and “all users” lines).
Your tasks are the following:
When making your plot, just read the numbers off of the old plots as best as you can. It’s a little tedious, but I don’t really want to go ask the author for his spreadsheet…
Write up your discussions and the code for your plot in a Jupyter notebook that you will submit to Canvas.
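One possible way to get hand-read numbers into your notebook is sketched below. Everything in it is a placeholder: the x-axis categories, the column names, and the 0.0 values are hypothetical and need to be replaced with what you actually read off the figure. Only the two series names (“parents” and “all users”) come from the plot itself.

```python
import pandas as pd

# Placeholder table: replace the categories and every 0.0 with the
# values you estimate from the original figure.
estimates = pd.DataFrame(
    {
        "category": ["A", "B", "C"],   # hypothetical x-axis labels
        "parents": [0.0, 0.0, 0.0],    # values read off the "parents" line
        "all_users": [0.0, 0.0, 0.0],  # values read off the "all users" line
    }
)

# Reshape to long ("tidy") form: one row per (category, group) pair.
# This is the layout Seaborn expects when you pass hue="group".
tidy = estimates.melt(id_vars="category", var_name="group", value_name="value")
print(tidy)
```

Typing the estimates in wide form keeps them easy to check against the figure, and the melt step gives you a frame that works directly with most Seaborn plotting functions.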
In this part, you will produce three really nice visualizations. The data you choose to visualize is mostly up to you - you are free to revisit any of the datasets we’ve worked with in class, in prior labs, or the datasets built into Seaborn (accessible via sns.load_dataset, discoverable via sns.get_dataset_names()). You can use a different dataset for each plot, or make multiple plots with a given dataset - this is up to you.
Though many nice plots can be created with simple calls to Seaborn functions, you will probably need to drop down to Matplotlib functions to fine-tune and customize your plots. We haven’t covered much on how to do this, so I expect you’ll need to do some searching around for the functionality you need.
I’m looking for not only a faithful representation of the data, but a high degree of polish. Notice that on the rubric entry for the plot itself, a basic df.plot.____() would probably score around 2, since those plots get the point across but aren’t very nice beyond that; a basic, completely uncustomized Seaborn plot might get a 3 thanks to Seaborn’s better defaults.
For each of your plots, include:
Your plots should meet the following guidelines:
Make the worst graph of the data from Part 1 that you can! Break all the rules. Aim for something that still technically represents all of the right numbers while actually being totally misleading or unreadable. Prefer being misleading to merely unreadable. Plots exhibiting a superior degree of badness can receive up to 2 points of extra credit, and will be showcased for the class.
For Part 1, submit a single Jupyter notebook as described. For Part 2, submit one notebook per dataset that you use for your plots - that is, if you made 3 plots from one dataset, you’ll submit only one notebook. If you’re creating a plot based on data from a prior lab and the data requires significant cleaning, preprocessing, etc., then you may submit an updated version of that lab’s notebook; if you do this, include a note at the top letting me know where in the notebook I can find your plot and discussions.
If your plots require data that is not built into Seaborn or accessible directly via URL, submit the necessary data in CSV format.
Finally, fill out the Week 6 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.
Note: A more detailed rubric will be posted soon.
Part 1 is worth 15 points, based on the quality and clarity of your discussions and plot.
Part 2 is worth 10 points per plot, for a total of 30 points.
Grading of plots will be done by critique based on the principles of visualization.
Plot
Justification
Extra Credit
Up to 2 points as detailed above.