import pandas as pd
import seaborn as sns
(a brief topic I ran out of time for on Tuesday)
In the NHANES dataset, heights and other length measurements are given in centimeters. I don't have good intuition for what a normal height is in centimeters: if you're 160 cm tall, are you short? Tall? Average? One thing I could do is convert to feet and inches, which I do have intuition for. But sometimes none of the available units are intuitive.
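For the unit-conversion route mentioned above, here is a minimal sketch. The helper name `cm_to_feet_inches` is made up for illustration and is not part of the notebook; it relies only on the facts that 1 inch is 2.54 cm and 1 foot is 12 inches.

```python
CM_PER_INCH = 2.54
INCHES_PER_FOOT = 12

def cm_to_feet_inches(cm):
    """Convert a length in centimeters to (whole feet, remaining inches)."""
    total_inches = cm / CM_PER_INCH
    # divmod splits the total into whole feet and leftover inches
    feet, inches = divmod(total_inches, INCHES_PER_FOOT)
    return int(feet), inches

feet, inches = cm_to_feet_inches(160)
print(f"160 cm is about {feet} ft {inches:.1f} in")
```

This only helps when a familiar unit exists, which is the limitation the rest of this section works around.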
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {
    "SEQN": "SEQN",
    "RIAGENDR": "Gender",   # 1 = M, 2 = F
    "RIDAGEYR": "Age",      # years
    "BMXWT": "Weight",      # kg
    "BMXHT": "Height",      # cm
    "BMXLEG": "Leg",        # cm
    "BMXARML": "Arm",       # cm
    "BMXARMC": "Arm Cir",   # cm
    "BMXWAIST": "Waist Cir",  # cm
}
df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')
df = df[df["Age"] >= 21]
ht_col = df["Height"]
ht_col
2 158.3
5 150.2
6 151.1
8 170.6
10 178.6
...
8697 180.1
8699 156.5
8700 164.9
8701 162.6
8703 175.8
Name: Height, Length: 5193, dtype: float64
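To compute a $z$-score, subtract the column's mean from each value and divide by the column's standard deviation, $z = (x - \bar{x}) / s$: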
df["Height-z"] = (ht_col - ht_col.mean()) / ht_col.std()
df["Height-z"]
2 -0.787712
5 -1.589290
6 -1.500226
8 0.429499
10 1.221180
...
8697 1.369621
8699 -0.965840
8700 -0.134575
8701 -0.362183
8703 0.944092
Name: Height-z, Length: 5193, dtype: float64
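To see what the computation above does on a small scale, here is a sketch using only the standard library. The function `z_scores` and the list `toy_heights` are made up for illustration (the values are not from NHANES). One detail worth knowing: `statistics.stdev` uses the $n - 1$ denominator, which is also the default for pandas' `.std()`.

```python
import statistics

def z_scores(values):
    """Return the z-score of each value: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    # Sample standard deviation (n - 1 denominator), same convention as pandas' .std() default
    std = statistics.stdev(values)
    return [(x - mean) / std for x in values]

# Made-up heights in cm, for illustration only
toy_heights = [150.0, 160.0, 170.0, 180.0, 190.0]
toy_z = z_scores(toy_heights)

for h, z in zip(toy_heights, toy_z):
    print(f"{h:6.1f} cm -> z = {z:+.3f}")
```

After standardizing, the values have mean 0 and standard deviation 1 regardless of the original units, which is why the result is comparable across columns measured in cm, kg, or anything else.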
sns.histplot(x="Height-z", data=df)
<AxesSubplot:xlabel='Height-z', ylabel='Count'>
df["Height-z"].plot.hist()
<AxesSubplot:ylabel='Frequency'>
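Now instead of the raw data value, you have an interpretable measure of how far each point is from the mean, in units of standard deviations. If the distribution is approximately Gaussian, you also have a good idea of how unusual that point is: about 68% of values fall within 1 standard deviation of the mean, and about 95% within 2.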
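As a rough sketch of turning a $z$-score into "how unusual", the standard library's `statistics.NormalDist` gives the cumulative distribution of a standard normal. The helpers `fraction_within` and `percentile_of` are hypothetical names introduced here, and the results are only meaningful to the extent that the data really is close to Gaussian.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def fraction_within(z):
    """Fraction of a standard normal lying within |z| standard deviations of the mean."""
    z = abs(z)
    return STD_NORMAL.cdf(z) - STD_NORMAL.cdf(-z)

def percentile_of(z):
    """Percentage of a standard normal that falls below the given z-score."""
    return 100 * STD_NORMAL.cdf(z)

for z in (1, 2, 3):
    print(f"within {z} std: {fraction_within(z):.1%}")

# A z-score like the first one in the Height-z column above
print(f"z = -0.79 sits near the {percentile_of(-0.79):.0f}th percentile")
```

Under the Gaussian assumption, a height with $z \approx -0.79$ would be shorter than roughly four out of five adults in the sample. For a skewed or multi-modal column, computing percentiles directly from the data would be the safer choice.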