import numpy as np
import pandas as pd
import seaborn as sns
Last talks (for a little while anyway) this week:
Data Ethics 1 due tomorrow night
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
— Charles Babbage, Passages from the Life of a Philosopher
What do we need to watch out for when approaching a new dataset?
Data types and units
Numerical representations
Unification and general apples-to-apples issues
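The first two items can be checked mechanically. Here is a minimal sketch on a made-up toy frame (the column names `station`, `temp_f`, and `temp_c` are hypothetical and do not come from any dataset used in this course): inspect whether a column is actually numeric, make unit conversions explicit, and remember that floating-point values are approximate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column that arrived as strings.
toy = pd.DataFrame({"station": ["A", "B", "C"],
                    "temp_f": ["32", "50", "bad"]})

# Strings are stored with a non-numeric dtype, so arithmetic would not behave numerically.
raw_is_numeric = pd.api.types.is_numeric_dtype(toy["temp_f"])

# Coerce to numbers; entries that cannot be parsed become NaN instead of raising.
toy["temp_f"] = pd.to_numeric(toy["temp_f"], errors="coerce")

# Unit conversion made explicit in the column name (Fahrenheit -> Celsius).
toy["temp_c"] = (toy["temp_f"] - 32) * 5 / 9

# Floating point is approximate: compare with a tolerance, not with ==.
exact_equal = (0.1 + 0.2 == 0.3)
close_enough = np.isclose(0.1 + 0.2, 0.3)
```

Note that coercing with `errors="coerce"` quietly creates missing values, so it is worth counting them afterwards with `toy["temp_f"].isna().sum()`.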
A potentially insidious example: In the LCD data, there are two types of Hourly reports: FM-15 and FM-16. The latter appears to be taken more frequently than hourly, only when aviators need more frequent updates due to some interesting weather. What might this mean if:
Sometimes data should be there but isn't. What would you do here?
Complete the worksheet in groups of three.
What general strategies can we extract from these examples? Replace missing values with:
What strategies for handling missing data can we extract from the above (and some others that may not have come up)?
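Some common options, sketched on a made-up series (the values and the name `s` are illustrative, and this is not necessarily the list the worksheet arrives at):

```python
import numpy as np
import pandas as pd

# Hypothetical measurements with two missing entries.
s = pd.Series([10.0, np.nan, 14.0, 16.0, np.nan, 30.0])

n_missing = s.isna().sum()             # count the gaps first

dropped = s.dropna()                   # option 1: discard the missing rows
mean_filled = s.fillna(s.mean())       # option 2: fill with the mean of the observed values
median_filled = s.fillna(s.median())   # option 3: fill with the median (less sensitive to outliers)
interpolated = s.interpolate()         # option 4: linear interpolation (only sensible for ordered data)
```

Each choice has a cost. For example, mean-filling leaves the mean unchanged but shrinks the spread: here `mean_filled.std()` is smaller than `dropped.std()`.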
Strategies for dealing with outliers that you've decided are erroneous - treat as missing data.
Be careful - could you be wrong? How would this affect the outcomes of your analysis?
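A minimal sketch of that caution, using made-up heights in centimeters and an arbitrary plausibility range (both are assumptions for illustration): mask the suspect values as missing, then compare a summary statistic with and without them to see how much the decision matters.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm; 1650.0 looks like a data-entry error (perhaps mm instead of cm).
heights = pd.Series([158.0, 162.0, 171.0, 1650.0, 180.0])

# Assumed plausibility range for adult height in cm; the threshold is a judgment call.
plausible = heights.between(100, 250)

# Keep plausible values and turn everything else into NaN, i.e. treat it as missing.
cleaned = heights.where(plausible)

mean_raw = heights.mean()        # includes the suspect value
mean_cleaned = cleaned.mean()    # pandas skips NaN by default
```

If the two means differ a lot, as they do here, the conclusion depends heavily on whether the value really was an error, so that judgment deserves a second look.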
Let's load up the NHANES body measurement dataset.
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
"RIAGENDR": "Gender", # 1 = M, 2 = F
"RIDAGEYR": "Age", # years
"BMXWT": "Weight", # kg
"BMXHT": "Height", # cm
"BMXLEG": "Leg", # cm
"BMXARML": "Arm", # cm
"BMXARMC": "Arm Cir", # cm
"BMXWAIST": "Waist Cir"} # cm
df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')
df = df[df["Age"] >= 21].copy()  # adults only; copy so later column assignments don't trigger SettingWithCopyWarning
In the NHANES dataset, heights and other length measurements are given in centimeters.
ht_col = df["Height"]
ht_col
Question: If you're 160cm tall, are you short? tall? average? Answer:
To compute a $z$-score:
In math: $$ \hat{x}_i = \frac{x_i -\mu}{\sigma}$$
In pandas:
df["Height-z"] = (ht_col - ht_col.mean()) / ht_col.std()
df["Height-z"]
sns.histplot(x="Height-z", data=df)
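The same computation on a small made-up sample (not NHANES values) makes the arithmetic easy to follow by hand. One detail worth knowing: pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `np.std` defaults to the population version (`ddof=0`), so the two give slightly different $z$-scores on small samples.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm (illustrative only, not taken from NHANES).
heights = pd.Series([150.0, 160.0, 170.0, 180.0, 190.0])

mu = heights.mean()
sigma = heights.std()            # pandas default: sample std, ddof=1

z = (heights - mu) / sigma       # one z-score per person

# z-score of a 160 cm person relative to this toy sample.
z_160 = (160.0 - mu) / sigma

# NumPy's default is the population std (ddof=0), which is slightly smaller.
sigma_pop = np.std(heights.to_numpy())
```

In this toy sample a 160 cm person sits about 0.63 standard deviations below the mean; the answer for the real data depends on the NHANES mean and standard deviation.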
Nice properties of $z$-scores:
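Some properties that are easy to confirm numerically on a made-up sample (the values and the helper name `zscore` are illustrative): the standardized column has mean 0 and standard deviation 1, and the scores do not change when the units do.

```python
import numpy as np
import pandas as pd

def zscore(col: pd.Series) -> pd.Series:
    """Standardize a column using its own mean and (sample) standard deviation."""
    return (col - col.mean()) / col.std()

# Hypothetical heights in cm.
cm = pd.Series([152.0, 160.0, 168.0, 175.0, 183.0])
inches = cm / 2.54               # same people, different unit

z_cm = zscore(cm)
z_in = zscore(inches)

mean_is_zero = np.isclose(z_cm.mean(), 0.0)
std_is_one = np.isclose(z_cm.std(), 1.0)
unit_free = np.allclose(z_cm, z_in)   # z-scores agree regardless of unit
```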
If we need to make values non-negative, we can exponentiate: $$ \hat{x}_i = e^{x_i}$$
x = np.linspace(-5,5,num=10000)
sns.lineplot(x=x, y = np.exp(x))
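A small check of what exponentiation does and does not preserve, on made-up evenly spaced values: every output is positive and the ordering is kept, but the spacing is not (equal gaps in the input become equal ratios in the output).

```python
import numpy as np

# Hypothetical z-scores, evenly spaced.
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.exp(z)

# Strictly positive here; mathematically always, though very negative inputs
# can underflow to 0.0 in floating point.
all_positive = bool(np.all(y > 0))

order_kept = bool(np.all(np.diff(y) > 0))   # monotonic: ranking is unchanged
ratios = y[1:] / y[:-1]                     # each unit step multiplies the output by e
recovered = np.log(y)                       # log undoes the transform
```

Because the transform is invertible with `np.log`, no information is lost, but statistics such as the mean are computed on a very different scale afterwards.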