DATA 311 - Lab 6: Visualize!

Scott Wehrwein

Fall 2021

Introduction

Now that we have the language to talk about effective visualizations and the tools to create them, this lab asks you to put real effort into producing some polished ones.

Collaboration Policy

This lab will be done individually. As usual, you may spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

Getting Started

If you haven’t already, you’ll probably want to install Seaborn: pip install seaborn.

Part 1: A really bad plot

Background

Having been introduced to some of the principles of good visualization, you are now qualified to cultivate your visualization snobbery. A great way to do this is to peruse some of the various websites devoted to collecting and ridiculing weird and bad graphs and data visualizations.

If you want to revisit or dig a little deeper into the visualization aesthetics ideas we talked about in class, check out Chapter 6.2 in the textbook for a quick summary of the Tufte principles. I also recommend perusing this excerpt from Tufte’s The Visual Display of Quantitative Information.

Your Tasks

For Part 1 of this lab, we’re going to dig into one particularly bad plot, talk about why it’s bad, then make it better. I’d like you to consider the graphs in this random blog post. The author was trying to do a bit of amateur data science back in 2015, and came up with some rather unusual ways to present the results. In particular, let’s focus on the third figure (the one with the “parents” and “all users” lines).

Your tasks are the following:

Write a paragraph or two discussing all of the merits and deficiencies you can find in the original plot. Try to find some of both.
Make a new plot of the same data that communicates the data as clearly as possible.
Write another paragraph or two justifying your design choices.

When making your plot, just read the numbers off of the old plots as best as you can. It’s a little tedious, but I don’t really want to go ask the author for his spreadsheet…
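If eyeballing the values feels too imprecise, one optional approach is to note the pixel position of each point in an image viewer and interpolate between two axis ticks whose values you can read. The sketch below is only an illustration: the function name pixel_to_value and all of the pixel positions are hypothetical, and it assumes the axis is linear (not logarithmic).

```python
def pixel_to_value(pixel, pixel_a, value_a, pixel_b, value_b):
    """Linearly map a pixel coordinate to a data value using two known axis ticks."""
    if pixel_a == pixel_b:
        raise ValueError("the two reference ticks must be at different pixel positions")
    # Fraction of the way from tick A to tick B, measured in pixels.
    fraction = (pixel - pixel_a) / (pixel_b - pixel_a)
    return value_a + fraction * (value_b - value_a)


# Hypothetical reference ticks: y = 0 sits at pixel row 400 and y = 100 at pixel row 100.
# Image rows grow downward, so larger data values sit at smaller pixel rows.
estimate = pixel_to_value(250, pixel_a=400, value_a=0, pixel_b=100, value_b=100)
print(estimate)  # 50.0
```

Reading the numbers by eye is still perfectly acceptable for this lab; a helper like this just makes the estimates more consistent from point to point.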

Write up your discussions and the code for your plot in a Jupyter notebook that you will submit to Canvas.

Part 2: Some really nice plots

In this part, you will produce three really nice visualizations. The data you choose to visualize is mostly up to you - you are free to revisit any of the datasets we’ve worked with in class, in prior labs, or the datasets built into Seaborn (accessible via sns.load_dataset, discoverable via sns.get_dataset_names()). You can use one dataset for all three plots, a different dataset for each plot, or anything in between - this is up to you.

Though many nice plots can be created with simple calls to Seaborn functions, you will probably need to drop down to Matplotlib functions to fine-tune and customize your plots. We haven’t covered much on how to do this, so I expect you’ll need to do some searching around for the functionality you need.
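As a starting point for that searching, here is a minimal sketch of the "Seaborn first, Matplotlib for polish" workflow. It is one possible pattern rather than a required one, and the DataFrame, its column names, and the title text are all invented placeholders. The key idea is that axes-level Seaborn functions such as sns.scatterplot draw onto a Matplotlib Axes object, which you can then customize with ordinary Matplotlib methods.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic placeholder data, invented for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "exam_score": [52, 58, 65, 70, 78, 85],
    "section": ["A", "B", "A", "B", "A", "B"],
})

# Create the figure and axes ourselves so we control the size.
fig, ax = plt.subplots(figsize=(6, 4))

# Seaborn does the heavy lifting and draws onto the axes we pass in.
sns.scatterplot(data=df, x="hours_studied", y="exam_score", hue="section", ax=ax)

# From here on, everything is plain Matplotlib (plus Seaborn's despine helper).
ax.set_title("Exam score rises with study time (synthetic data)")
ax.set_xlabel("Hours studied")
ax.set_ylabel("Exam score (%)")
ax.set_ylim(0, 100)
ax.legend(title="Section", frameon=False)
sns.despine(ax=ax)  # remove the top and right spines
fig.tight_layout()
```

In a Jupyter notebook the figure is displayed at the end of the cell. Which customizations are worth making depends on the story your plot is telling, so treat the specific calls above as examples of what is possible, not as a checklist.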

I’m looking for not only a faithful representation of the data, but a high degree of polish. Notice that on the rubric entry for the plot itself, a basic df.plot.____() would probably score around 2, since those plots get the point across but aren’t very nice beyond that; a basic, completely uncustomized Seaborn plot might get a 3 thanks to Seaborn’s better defaults.

For each of your plots, include:

Any data processing code needed
The code to produce the plot and plot itself
A caption describing what the plot shows and what the reader should focus on
A discussion of the design decisions you made when creating your plot

Your plots should meet the following guidelines:

Your plots should be carefully designed to tell a specific story. This does not preclude you from making rich, data-dense plots (indeed, this is encouraged!), but the effect that motivated you to make the plot should be easy to see.
Two, or ideally all three, of your plots must be more complex than the basic form of the given type of visualization. For example, you should do something more interesting than a basic scatterplot showing that two variables are correlated.
Your design discussion should justify the choices you made in terms of the principles we talked about in class. You need not address every principle, but any that apply to your plot should be discussed. If there are important design elements where you stayed with the default settings of the plotting library you used, you should justify this too.

Extra Credit

Make the worst graph of the data from Part 1 that you can! Break all the rules. Aim for something that still technically represents all of the right numbers while actually being totally misleading or unreadable. Prefer being misleading to merely unreadable. Plots exhibiting a superior degree of badness can receive up to 2 points of extra credit, and will be showcased for the class.

Submitting your work

Notebooks and Data

For Part 1, submit a single Jupyter notebook as described. For Part 2, submit one notebook per dataset that you use for your plots - that is, if you made 3 plots from one dataset, you’ll submit only one notebook. If you’re creating a plot based on data from a prior lab and the data requires significant cleaning, preprocessing, etc., then you may submit an updated version of that lab’s notebook; if you do this, include a note at the top letting me know where in the notebook I can find your plot and discussions.

If your plots require data that is not built into Seaborn or accessible directly via URL, submit the necessary data in CSV format.

Survey

Finally, fill out the Week 6 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.

Rubric

Note: A more detailed rubric will be posted soon.

Part 1 is worth 15 points, based on the quality and clarity of your discussions and plot.

5 points: discussion of the blog post plot’s merits and deficiencies
- 5/5 thoughtful, observant, and detailed; identifies both strengths and weaknesses
- 3/5 identifies only one or two glaring issues with the plot
- 1/5 an attempt was made
10 points: your plot will be graded using the same scheme as the plots from Part 2.

Part 2 is worth 10 points per plot, for a total of 30 points.

Grading of plots will be done by critique based on the principles of visualization.

Plot

5/5 Plot is excellent
4/5 A few small improvements could be made to the plot
3/5 Multiple uncontroversial improvements could be made
2/5 Minimal thought was put into the design of the plot
1/5 A plot exists
0/5 No plot exists

Justification

5/5 All decisions are carefully considered and well justified
4/5 A few important design decisions are not well justified
3/5 Many design decisions are not well justified
2/5 Most design decisions are not well justified
0/5 No attempt was made to justify plot design

Extra Credit

Up to 2 points as detailed above.