Lecture 13 - Introduction to Exploratory Data Analysis¶

Announcements:¶

  • Project proposal due Sunday night
  • Ethics 2 due Monday night

Goals:¶

  • Know some strategies for approaching a new dataset:
    • Know what basic (meta-)information you should find out before looking at the data itself.
    • Have a toolbox of first steps in summarizing, visualizing, and observing interesting or odd features about the data.

Hypothesis-Driven vs Exploratory Analysis¶

  • Previously: Hypothesis- or question-driven analysis (e.g., Labs 2, 4)

    Hypothesis or Question -> Dataset -> Insight

  • Today: Exploratory Data Analysis:

    Dataset -> ?? -> Insight

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Activity¶

In Pairs (one computer per pair!): Grab the activity notebook from the course webpage and explore one of the two datasets.

While you do so, think about the following questions:

  • What do you want to know about a dataset before you even load it?
  • What are your first steps for exploring a dataset after loading it?
  • What were the most interesting findings from your exploration?

Spend 20 minutes getting as much exploration done as you can.

Then, you'll spend 5 minutes distilling your answers to the three questions above in preparation for sharing with the class.

Then we'll spend about 10 minutes sharing out, and I'll show you some fun EDA-relevant plotting tricks.

What do you want to know about a dataset before you even look at it?

Brainstorm:

  • What kind of data is it? What is the subject of the data?
  • How big is it? Rows, columns?
  • What are the data types involved?
  • Where did it come from? Who collected it, and why?
  • How much of the population was sampled?
  • How can it be related to other datasets?

What are your first steps for exploring a dataset after loading it?

Brainstorm:

  • .describe() - summary statistics
  • make scatterplots!
  • make histograms
  • bar, line plots
  • think about questions
  • .info()
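As a minimal sketch of these first steps (using a tiny made-up DataFrame, since we haven't loaded a real dataset yet):

```python
import pandas as pd

# A tiny stand-in dataset; the real data gets loaded later
toy = pd.DataFrame({"age": [12, 34, 56, 23],
                    "height_cm": [150.0, 170.5, 162.3, None]})

toy.info()                # column dtypes and non-null counts
summary = toy.describe()  # count, mean, std, quartiles of numeric columns
print(summary)
```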

Pre-open ideas:

  • How big is it?
  • What is it called, and what is it about?
  • Is it legal / ethical?
  • How was it collected?
  • Where did it come from? Can you trust the source?
  • Who collected it? Why?
  • Why are we looking at it?

Post-open ideas:

  • look at a few rows. Which ones?
    • Maybe the first few, or a few random ones, to get a look at some arbitrary data.
    • Maybe some extremes - what does the tallest person look like? Longest-armed?
  • compute summary statistics (of numerical columns)
  • Look at distribution of each column
  • Look at pairwise scatter plots of all pairs of (numerical) columns

Example - NHANES¶

In [2]:
data_url = "/cluster/academic/DATA311/202620/NHANES/NHANES.csv"
In [3]:
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 8704 entries, 0 to 8703
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   SEQN      8704 non-null   float64
 1   RIAGENDR  8704 non-null   float64
 2   RIDAGEYR  8704 non-null   float64
 3   BMXWT     8580 non-null   float64
 4   BMXHT     8016 non-null   float64
 5   BMXLEG    6703 non-null   float64
 6   BMXARML   8177 non-null   float64
 7   BMXARMC   8173 non-null   float64
 8   BMXWAIST  7601 non-null   float64
dtypes: float64(9)
memory usage: 612.1 KB

Let's rename those columns:

In [4]:
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns') # we don't care about sequence # - it's just an ID
df
Out[4]:
Gender Age Weight Height Leg Arm Arm Cir Waist Cir
0 2.0 2.0 13.7 88.6 NaN 18.0 16.2 48.2
1 1.0 2.0 13.9 94.2 NaN 18.6 15.2 50.0
2 2.0 66.0 79.5 158.3 37.0 36.0 32.0 101.8
3 1.0 18.0 66.3 175.7 46.6 38.8 27.0 79.3
4 1.0 13.0 45.4 158.4 38.1 33.8 21.5 64.1
... ... ... ... ... ... ... ... ...
8699 2.0 70.0 49.0 156.5 34.4 32.6 25.1 82.2
8700 1.0 42.0 97.4 164.9 38.2 36.6 40.6 114.8
8701 2.0 41.0 69.1 162.6 39.2 35.2 26.8 86.4
8702 2.0 14.0 111.9 156.6 39.2 35.0 44.5 113.5
8703 1.0 38.0 111.5 175.8 42.5 38.0 40.0 122.0

8704 rows × 8 columns
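Since Gender is coded numerically (1 = M, 2 = F, per the codebook comment above), mapping the codes to labels can make later plots and tables easier to read. A sketch on a stand-in Series (the real data lives on the course cluster):

```python
import pandas as pd

# Stand-in for df["Gender"]; in NHANES, 1.0 = Male and 2.0 = Female
gender_codes = pd.Series([2.0, 1.0, 2.0, 1.0, 1.0])
gender_labels = gender_codes.map({1.0: "M", 2.0: "F"})
print(gender_labels.value_counts())
```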

In [5]:
# A few random rows:
df.sample(n=3)
Out[5]:
Gender Age Weight Height Leg Arm Arm Cir Waist Cir
1954 2.0 19.0 92.6 163.9 40.9 36.5 37.4 99.8
1004 1.0 43.0 125.3 187.4 45.3 43.3 42.6 120.8
6795 1.0 10.0 29.4 141.2 32.2 28.4 17.7 58.0
In [6]:
# Two tallest people:
df.sort_values("Height", ascending=False).iloc[:2,:]
Out[6]:
Gender Age Weight Height Leg Arm Arm Cir Waist Cir
2614 1.0 65.0 97.5 197.7 44.0 44.3 30.9 100.4
8247 1.0 34.0 89.9 195.8 46.0 49.9 32.5 88.2
In [7]:
# What's that in feet?
df.sort_values("Height", ascending=False).iloc[:2,:]["Height"] / 2.54 / 12
Out[7]:
2614    6.486220
8247    6.423885
Name: Height, dtype: float64
In [8]:
# How many standard deviations above the mean are those? Are they outliers?
# compute z-scores
ht = df["Height"]
ht_z = (ht - ht.mean()) / ht.std()
ht_z.sort_values(ascending=False)
Out[8]:
2614    1.846835
8247    1.761472
8130    1.752487
3828    1.747994
2455    1.747994
          ...   
8649         NaN
8670         NaN
8678         NaN
8685         NaN
8690         NaN
Name: Height, Length: 8704, dtype: float64
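One common (if arbitrary) rule of thumb flags |z| > 3 as a potential outlier; by that standard, the tallest people here at z ≈ 1.85 are not outliers. The check can be sketched on stand-in data (the real heights live on the course cluster):

```python
import pandas as pd

# Stand-in heights (cm) with one extreme value appended
ht = pd.Series(list(range(155, 175)) + [250])

# z-score: how many standard deviations each value is from the mean
ht_z = (ht - ht.mean()) / ht.std()

# Flag anything more than 3 standard deviations from the mean
outliers = ht[ht_z.abs() > 3]
print(outliers)
```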

Compute summary statistics (of numerical columns)¶

Tukey's 5-number summary: minimum, 25th percentile, median (50th percentile), 75th percentile, maximum

Mean and standard deviation are nice too.

In [9]:
# summary statistics
df.describe()
Out[9]:
Gender Age Weight Height Leg Arm Arm Cir Waist Cir
count 8704.000000 8.704000e+03 8580.000000 8016.000000 6703.000000 8177.000000 8173.000000 7601.000000
mean 1.509076 3.443865e+01 65.138508 156.593401 38.643980 33.667996 29.193589 89.928851
std 0.499946 2.537904e+01 32.890754 22.257858 4.158013 7.229185 7.970648 22.805093
min 1.000000 5.397605e-79 3.200000 78.300000 24.800000 9.400000 11.200000 40.000000
25% 1.000000 1.100000e+01 43.100000 151.400000 35.800000 32.000000 23.800000 73.900000
50% 2.000000 3.100000e+01 67.750000 161.900000 38.800000 35.800000 30.100000 91.200000
75% 2.000000 5.800000e+01 85.600000 171.200000 41.500000 38.400000 34.700000 105.300000
max 2.000000 8.000000e+01 242.600000 197.700000 55.000000 49.900000 56.300000 169.500000
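The five-number summary can also be pulled directly with `.quantile()`; a sketch on a made-up Series (the NHANES file lives on the course cluster):

```python
import pandas as pd

# Stand-in numeric column
x = pd.Series([3.2, 12.0, 43.1, 67.75, 85.6, 90.0, 242.6])

# Tukey's five-number summary: min, Q1, median, Q3, max
five_num = x.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
```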

Maybe take stock of missing data?¶

In [10]:
# how many rows and columns does the table have?
df.shape
Out[10]:
(8704, 8)
In [11]:
# what percent of each column is missing?
df.isna().sum() / df.shape[0] * 100
Out[11]:
Gender        0.000000
Age           0.000000
Weight        1.424632
Height        7.904412
Leg          22.989430
Arm           6.054688
Arm Cir       6.100643
Waist Cir    12.672335
dtype: float64
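A related check: how many rows are complete, with no missing values at all? A sketch on a toy frame (the `.isna().mean()` shortcut is equivalent to dividing the null counts by the row count, as above):

```python
import pandas as pd
import numpy as np

toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                    "b": [np.nan, 2.0, 3.0, 4.0]})

# Percent missing per column
pct_missing = toy.isna().mean() * 100
print(pct_missing)

# Rows with no missing values in any column ("complete cases")
complete = toy.dropna()
print(len(complete), "of", len(toy), "rows are complete")
```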
In [12]:
# what does the pattern of missing data look like?
plt.imshow(df.isna().iloc[:500,:].T.to_numpy(), aspect='auto', interpolation='none')
plt.gca().set_yticks(range(8))
plt.gca().set_yticklabels([str(c) for c in df.columns]);
[Figure: image of the missing-data mask for the first 500 rows, one row per column]

Look at distribution of each column¶

Histograms! I love histograms!

In [13]:
print(len(df.columns))
8
In [14]:
# make a figure with a 4-row, 2-column grid of axes:
fig, axes = plt.subplots(4, 2, figsize=(8, 16))

for col, ax in zip(df.columns, axes.flatten()):
    sns.histplot(data=df[col].values, ax=ax)
    ax.set_title(col)
[Figure: 4-row by 2-column grid of histograms, one per column]

Look at pairwise scatter plots of all pairs of (numerical) columns¶

In [15]:
sns.pairplot(data=df);
[Figure: pairwise scatter-plot matrix (sns.pairplot) of all numerical columns]

What if we consider only adults (21+)?

In [ ]:
sns.pairplot(data=df[df["Age"]>=21])

To-do list - did we do all of this?

  • Basic meta info:
    • who/what/why/when was the data collected?
    • how much data is there? if too much, can you sensibly subsample?
    • what are the columns and what do they mean?
  • Diving into the data:
    • look at a few rows
    • summary stats; Tukey's five-number summary
    • Distributions of each column
    • Pairwise scatter plots