Lecture 7 - Plots

Types, How to Make Them, and When to Use Them

Goals

  • Solidify understanding of the six visualization principles introduced last class
  • Know how to produce, interpret, and choose when to use several of the most commonly used types of data visualizations:
    • Tables
    • Dot and line plots
    • Box and whisker plots
    • Scatter plots
    • Bar/column plots and (usually not) pie charts
    • Histograms
In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Visualization Principles - Discussion

Some Datasets to Play With

In [ ]:
penguins = sns.load_dataset("penguins")
fmri = sns.load_dataset("fmri")
mpg = sns.load_dataset("mpg")
In [ ]:
penguins
In [ ]:
fmri
In [ ]:
fmri.sort_values(by=["subject", "timepoint"])
In [ ]:
mpg

Matplotlib

In [ ]:
colors = {"Adelie": "red", "Gentoo": "green", "Chinstrap": "blue"}
# One scatter call per species, so each species gets a labeled legend entry.
for species, group in penguins.groupby("species"):
    plt.scatter(group["body_mass_g"], group["flipper_length_mm"],
                c=colors[species], label=species,
                s=(group["bill_depth_mm"]/4)**2)  # marker area scales with bill depth
plt.legend()
plt.xlabel("Body Mass (g)")
plt.ylabel("Flipper Length (mm)");

Seaborn

In [ ]:
sns.relplot(x="body_mass_g", y="flipper_length_mm",
            hue="species", size="bill_depth_mm", data=penguins)

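Key distinction: figure-level functions (such as relplot, displot, and catplot) create their own figure and can facet it into subplots, while axes-level functions (such as scatterplot, histplot, and boxplot) draw onto a single matplotlib Axes. See https://seaborn.pydata.org/tutorial/function_overview.html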
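To make the figure-level vs. axes-level distinction concrete, here is a minimal sketch. The `toy` table and its column names are made up for illustration and are not one of the lecture datasets.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Tiny made-up table standing in for a real dataset.
toy = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "y": [2.0, 3.5, 3.0, 5.0],
    "group": ["a", "a", "b", "b"],
})

# Axes-level: draws onto an Axes that we create and control ourselves,
# and hands that same Axes back.
fig, ax = plt.subplots()
axes_result = sns.scatterplot(data=toy, x="x", y="y", hue="group", ax=ax)

# Figure-level: creates its own figure and returns a FacetGrid,
# which is what makes col= / row= faceting possible.
grid = sns.relplot(data=toy, x="x", y="y", col="group")

print(type(axes_result).__name__, type(grid).__name__)
```

In practice this means an axes-level plot can be combined with other matplotlib drawing on the same Axes, while a figure-level plot is customized through the returned grid object.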

Common Data Visualizations

Tables

Suppose you want to see the 5 biggest penguins.

In [ ]:
penguins.sort_values("body_mass_g", ascending=False).iloc[:5,:]

Table Tips:

  • Think about row and column ordering.
  • Label columns and rows well (clear but concise).
  • Use uniform precision and right-justify numbers.
  • Sometimes: bold or otherwise emphasize the max or min value in a column.
In [ ]:
p = penguins.rename(columns={"species": "Species", "island": "Island",
                 "bill_length_mm": "Bill Length (mm)","bill_depth_mm": "Bill Depth (mm)",
                 "flipper_length_mm": "Flipper Length (mm)", "body_mass_g": "Body Mass (g)",
                 "sex": "Sex"})
p = p[["Species", "Island", "Sex", "Body Mass (g)", "Bill Length (mm)", "Bill Depth (mm)", "Flipper Length (mm)"]]
p.sort_values("Body Mass (g)", ascending=False).iloc[:5,:]
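The precision and justification tips can be sketched with plain pandas. The rows below are made-up values for illustration, not rows from the penguins dataset, and `to_string` is only one of several ways to control number formatting.

```python
import pandas as pd

# Made-up measurements with inconsistent precision.
table = pd.DataFrame({
    "Species": ["Gentoo", "Gentoo", "Chinstrap"],
    "Body Mass (g)": [6050.0, 6300.0, 4800.0],
    "Bill Length (mm)": [59.6, 49.2, 58.04],
})

# Order rows by the quantity the reader cares about.
top = table.sort_values("Body Mass (g)", ascending=False)

# One decimal place everywhere; to_string right-justifies numeric columns.
formatted = top.to_string(index=False, float_format="{:.1f}".format)
print(formatted)
```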

Dot and line plots

A dot plot is conceptually (but not technically) different from a scatter plot, in that the $x$ values are assumed to be ordered.

In [ ]:
mpg
In [ ]:
mpg_year = mpg.groupby("model_year")[["mpg"]].mean()
mpg_year

Without connecting lines, a dot plot is technically the same as a scatter plot:

In [ ]:
sns.relplot(x="model_year", y="mpg", kind="scatter", data=mpg_year)

Connect the dots: now you have a line plot:

In [ ]:
sns.relplot(x="model_year", y="mpg", kind="line", data=mpg_year)

Seaborn does sensible things if you have multiple datapoints per $x$ value: by default, it plots the mean at each $x$ and shades a 95% confidence interval around it.

In [ ]:
sns.relplot(x="model_year", y="mpg", kind="line", data=mpg)

Exercise: when should you connect the dots?

Box and whisker plots

In [ ]:
sns.boxplot(x="species", y="body_mass_g", data=penguins)
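To help with reading a box plot: the box spans the first to third quartile, with a line at the median, and by default the whiskers reach to the most extreme points within 1.5 times the interquartile range of the box. The sketch below computes the quartiles directly for a made-up `toy` table (not the penguins data) so the numbers can be compared against the drawn boxes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values for two groups.
toy = pd.DataFrame({
    "group": ["A"] * 5 + ["B"] * 5,
    "value": [1, 2, 3, 4, 5, 2, 4, 6, 8, 10],
})

# Quartiles per group: rows are groups, columns are the quantile levels.
summary = toy.groupby("group")["value"].quantile([0.25, 0.5, 0.75]).unstack()
summary["IQR"] = summary[0.75] - summary[0.25]
print(summary)

# The same information, drawn as boxes.
fig, ax = plt.subplots()
sns.boxplot(data=toy, x="group", y="value", ax=ax)
```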

Exercise: Of the ones we've discussed so far (table, dot/line, box and whisker), which kind of visualization would you use to illustrate each of the following?

  1. The number of cars per model year in the MPG dataset
  2. The distribution of each penguin body measurement, independent of species.
  3. The centrality and variability of each penguin body measurement per species.

Scatter plots

In [ ]:
sns.relplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species")

Bar/column plots and (usually not) pie charts

In [ ]:
sns.catplot(x="species", data=penguins, kind="count")
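A count plot is a bar plot of how many rows fall in each category, so the bar heights should agree with `value_counts`. The sketch below checks this on a made-up `toy` column, using the axes-level `sns.countplot` so the bars can be read back from the Axes.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up categorical column.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie", "Gentoo"],
})

# Tabulate the categories, most frequent first.
counts = toy["species"].value_counts()
order = counts.index.tolist()

# Draw the bars in the same order as the table.
fig, ax = plt.subplots()
sns.countplot(data=toy, x="species", order=order, ax=ax)

# Each bar is a rectangle whose height is the category count.
bar_heights = [patch.get_height() for patch in ax.patches]
print(counts.to_dict(), bar_heights)
```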
In [ ]:
sns.catplot(x="species", data=penguins, kind="count", col="island")

Histograms

In [ ]:
sns.displot(penguins, x="flipper_length_mm")
In [ ]:
sns.displot(penguins, x="flipper_length_mm", stat='density')
In [ ]:
sns.displot(penguins, x="flipper_length_mm", col='species')
In [ ]:
sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density")
In [ ]:
sns.displot(penguins, x="flipper_length_mm", hue="species", col="island")
In [ ]:
sns.displot(penguins, x="flipper_length_mm", hue="species", col="island", kde=True)
In [ ]:
sns.jointplot(x="bill_length_mm", y="bill_depth_mm", data=penguins, kind='hex')
In [ ]:
fmri[fmri["subject"]=="s0"].sort_values(by="timepoint")

Exercise: Of the ones we've discussed so far (table, dot/line, box and whisker, scatter, bar/column, histogram), which kind of visualization would you use to illustrate each of the following?

  1. Average signal per subject in the fmri dataset.
  2. The signal over time for each event type in subject s0, regardless of region.
  3. The distribution of bill lengths for Adelie penguins.

A helpful figure from the book:

Additional Practice: Download L06_exit.ipynb and add code to make one or more plots visualizing the response data.