Lecture 13 - Introduction to Exploratory Data Analysis¶
Announcements:¶
- Project proposal due Sunday night
- Ethics 2 due Monday night
Goals:¶
- Know some strategies for approaching a new dataset:
- Know what basic (meta-)information you should find out before looking at the data itself.
- Have a toolbox of first steps in summarizing, visualizing, and observing interesting or odd features about the data.
Hypothesis-Driven vs Exploratory Analysis¶
- Previously: Hypothesis- or question-driven analysis (e.g., Labs 2, 4)
Hypothesis or Question -> Dataset -> Insight
- Today: Exploratory Data Analysis:
Dataset -> ?? -> Insight
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Activity¶
In Pairs (one computer per pair!): Grab the activity notebook from the course webpage and explore one of the two datasets.
While you do so, think about the following questions:
- What do you want to know about a dataset before you even load it?
- What are your first steps for exploring a dataset after loading it?
- What were the most interesting findings from your exploration?
Spend 20 minutes getting as much exploration done as you can.
Then, you'll spend 5 minutes distilling your answers to the above three questions in preparation to share with the class.
Then we'll spend about 10 minutes sharing out, and I'll show you some fun EDA-relevant plotting tricks.
What do you want to know about a dataset before you even look at it?
Brainstorm:
- What kind of data is it? What is the subject of the data?
- How big is it? Rows, columns?
- What are the data types involved?
- Where did it come from? Who collected it? Why?
- How much of the population was sampled?
- How can it be related to other datasets?
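One way to answer the size and format questions before committing to a full `pd.read_csv` is to peek at the raw file. A minimal sketch, assuming a plain-text CSV on disk (`peek_csv` is a made-up helper name, not a pandas function):

```python
import os
from itertools import islice

def peek_csv(path, n_lines=5):
    """File size in bytes plus the first n_lines lines, without loading it all."""
    size = os.path.getsize(path)
    with open(path) as f:
        head = [line.rstrip("\n") for line in islice(f, n_lines)]
    return size, head
```

The first line usually reveals whether there's a header row and what the delimiter is, which tells you what options `read_csv` will need.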
What are your first steps for exploring a dataset after loading it?
Brainstorm:
- .describe() - summary statistics
- make scatterplots!
- make histograms
- bar, line plots
- think about questions
- .info()
Pre-open ideas:
- How big is it?
- What is it called, and what is it about?
- Is it legal / ethical?
- How was it collected?
- Where did it come from? Can you trust the source?
- Who collected it? Why?
- Why are we looking at it?
Post-open ideas:
- look at a few rows. Which ones?
- Maybe the first few, or a few random ones, to get a look at some arbitrary data.
- Maybe some extremes - what does the tallest person look like? Longest-armed?
- compute summary statistics (of numerical columns)
- Look at distribution of each column
- Look at pairwise scatter plots of all pairs of (numerical) columns
Example - NHANES¶
data_url = "/cluster/academic/DATA311/202620/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
"RIAGENDR": "Gender", # 1 = M, 2 = F
"RIDAGEYR": "Age", # years
"BMXWT": "Weight", # kg
"BMXHT": "Height", # cm
"BMXLEG": "Leg", # cm
"BMXARML": "Arm", # cm
"BMXARMC": "Arm Cir", # cm
"BMXWAIST": "Waist Cir"} # cm
df = pd.read_csv(data_url)
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 8704 entries, 0 to 8703
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   SEQN      8704 non-null   float64
 1   RIAGENDR  8704 non-null   float64
 2   RIDAGEYR  8704 non-null   float64
 3   BMXWT     8580 non-null   float64
 4   BMXHT     8016 non-null   float64
 5   BMXLEG    6703 non-null   float64
 6   BMXARML   8177 non-null   float64
 7   BMXARMC   8173 non-null   float64
 8   BMXWAIST  7601 non-null   float64
dtypes: float64(9)
memory usage: 612.1 KB
Let's rename those columns:
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns') # we don't care about sequence # - it's just an ID
df
| | Gender | Age | Weight | Height | Leg | Arm | Arm Cir | Waist Cir |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 13.7 | 88.6 | NaN | 18.0 | 16.2 | 48.2 |
| 1 | 1.0 | 2.0 | 13.9 | 94.2 | NaN | 18.6 | 15.2 | 50.0 |
| 2 | 2.0 | 66.0 | 79.5 | 158.3 | 37.0 | 36.0 | 32.0 | 101.8 |
| 3 | 1.0 | 18.0 | 66.3 | 175.7 | 46.6 | 38.8 | 27.0 | 79.3 |
| 4 | 1.0 | 13.0 | 45.4 | 158.4 | 38.1 | 33.8 | 21.5 | 64.1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8699 | 2.0 | 70.0 | 49.0 | 156.5 | 34.4 | 32.6 | 25.1 | 82.2 |
| 8700 | 1.0 | 42.0 | 97.4 | 164.9 | 38.2 | 36.6 | 40.6 | 114.8 |
| 8701 | 2.0 | 41.0 | 69.1 | 162.6 | 39.2 | 35.2 | 26.8 | 86.4 |
| 8702 | 2.0 | 14.0 | 111.9 | 156.6 | 39.2 | 35.0 | 44.5 | 113.5 |
| 8703 | 1.0 | 38.0 | 111.5 | 175.8 | 42.5 | 38.0 | 40.0 | 122.0 |
8704 rows × 8 columns
# A few random rows:
df.sample(n=3)
| | Gender | Age | Weight | Height | Leg | Arm | Arm Cir | Waist Cir |
|---|---|---|---|---|---|---|---|---|
| 1954 | 2.0 | 19.0 | 92.6 | 163.9 | 40.9 | 36.5 | 37.4 | 99.8 |
| 1004 | 1.0 | 43.0 | 125.3 | 187.4 | 45.3 | 43.3 | 42.6 | 120.8 |
| 6795 | 1.0 | 10.0 | 29.4 | 141.2 | 32.2 | 28.4 | 17.7 | 58.0 |
# Two tallest people:
df.sort_values("Height", ascending=False).iloc[:2,:]
| | Gender | Age | Weight | Height | Leg | Arm | Arm Cir | Waist Cir |
|---|---|---|---|---|---|---|---|---|
| 2614 | 1.0 | 65.0 | 97.5 | 197.7 | 44.0 | 44.3 | 30.9 | 100.4 |
| 8247 | 1.0 | 34.0 | 89.9 | 195.8 | 46.0 | 49.9 | 32.5 | 88.2 |
# What's that in feet?
df.sort_values("Height", ascending=False).iloc[:2,:]["Height"] / 2.54 / 12
2614    6.486220
8247    6.423885
Name: Height, dtype: float64
# How many standard deviations above the mean are those? Are they outliers?
# compute z-scores
ht = df["Height"]
ht_z = (ht - ht.mean()) / ht.std()
ht_z.sort_values(ascending=False)
2614 1.846835
8247 1.761472
8130 1.752487
3828 1.747994
2455 1.747994
...
8649 NaN
8670 NaN
8678 NaN
8685 NaN
8690 NaN
Name: Height, Length: 8704, dtype: float64
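At roughly 1.85 standard deviations above the mean, the tallest people here wouldn't count as outliers by the common |z| > 3 rule of thumb. A sketch of that check (`zscore_outliers` is a hypothetical helper, and |z| > 3 is just one conventional cutoff):

```python
import pandas as pd

def zscore_outliers(s, threshold=3.0):
    """Entries of s whose |z-score| exceeds threshold (NaNs drop out of the comparison)."""
    z = (s - s.mean()) / s.std()
    return s[z.abs() > threshold]
```

Running this on `df["Height"]` would return an empty result, consistent with the z-scores above.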
Compute summary statistics (of numerical columns)¶
Tukey's 5-number summary: minimum, 25th percentile, median (50th percentile), 75th percentile, maximum
Mean and standard deviation are nice too.
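For reference, the five-number summary can also be pulled directly with `quantile` rather than read off of `describe()`'s output. A small sketch (`five_number_summary` is a hypothetical helper name):

```python
import pandas as pd

def five_number_summary(s):
    """Tukey's five-number summary: min, Q1, median, Q3, max (NaNs excluded)."""
    return s.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
```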
# summary statistics
df.describe()
| | Gender | Age | Weight | Height | Leg | Arm | Arm Cir | Waist Cir |
|---|---|---|---|---|---|---|---|---|
| count | 8704.000000 | 8.704000e+03 | 8580.000000 | 8016.000000 | 6703.000000 | 8177.000000 | 8173.000000 | 7601.000000 |
| mean | 1.509076 | 3.443865e+01 | 65.138508 | 156.593401 | 38.643980 | 33.667996 | 29.193589 | 89.928851 |
| std | 0.499946 | 2.537904e+01 | 32.890754 | 22.257858 | 4.158013 | 7.229185 | 7.970648 | 22.805093 |
| min | 1.000000 | 5.397605e-79 | 3.200000 | 78.300000 | 24.800000 | 9.400000 | 11.200000 | 40.000000 |
| 25% | 1.000000 | 1.100000e+01 | 43.100000 | 151.400000 | 35.800000 | 32.000000 | 23.800000 | 73.900000 |
| 50% | 2.000000 | 3.100000e+01 | 67.750000 | 161.900000 | 38.800000 | 35.800000 | 30.100000 | 91.200000 |
| 75% | 2.000000 | 5.800000e+01 | 85.600000 | 171.200000 | 41.500000 | 38.400000 | 34.700000 | 105.300000 |
| max | 2.000000 | 8.000000e+01 | 242.600000 | 197.700000 | 55.000000 | 49.900000 | 56.300000 | 169.500000 |
Maybe take stock of missing data?¶
# how many rows and columns does the table have?
df.shape
(8704, 8)
# what percent of each column is missing?
df.isna().sum() / df.shape[0] * 100
Gender        0.000000
Age           0.000000
Weight        1.424632
Height        7.904412
Leg          22.989430
Arm           6.054688
Arm Cir       6.100643
Waist Cir    12.672335
dtype: float64
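Once you know the missingness rates, a common next decision is complete-case analysis versus handling each column separately. A toy sketch (the small frame below is a stand-in for `df`, not NHANES data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: one fully observed row, two with gaps.
toy = pd.DataFrame({"Height": [160.0, np.nan, 175.0],
                    "Leg":    [np.nan, np.nan, 42.0]})

complete_cases = toy.dropna()          # rows observed in every column
heights_only = toy["Height"].dropna()  # per-analysis: each column keeps its own non-missing values
```

Note the trade-off: complete cases keep only 1 of 3 rows here, while the per-column view keeps 2 of the 3 heights.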
# what does the pattern of missing data look like?
plt.imshow(df.isna().iloc[:500,:].T.to_numpy(), aspect='auto', interpolation='none')
plt.gca().set_yticks(range(8))
plt.gca().set_yticklabels([str(c) for c in df.columns]);
Look at distribution of each column¶
Histograms! I love histograms!
print(len(df.columns))
8
# make a figure with a 4-row, 2-column grid of axes:
fig, axes = plt.subplots(4, 2, figsize=(8, 16))
for col, ax in zip(df.columns, axes.flatten()):
    sns.histplot(df[col], ax=ax)
    ax.set_title(col)
fig.tight_layout()  # keep titles from overlapping neighboring axes
Look at pairwise scatter plots of all pairs of (numerical) columns¶
sns.pairplot(data=df);
What if we consider only adults (21+)?
sns.pairplot(data=df[df["Age"]>=21])
To-do list - did we do all of this?
- Basic meta info:
- who/what/why/when was the data collected?
- how much data is there? if too much, can you sensibly subsample?
- what are the columns and what do they mean?
- Diving into the data:
- look at a few rows
- summary stats; Tukey's five # summary
- Distributions of each column
- Pairwise scatter plots