Lecture 9 - Preprocessing and Cleaning: Missing Data; Outliers; Numerical Normalization¶

https://imgs.xkcd.com/comics/every_data_table.png

In [8]:
import numpy as np
import pandas as pd
import seaborn as sns

Announcements:¶

Last talks (for a little while anyway) this week:

  • Mon 1/30 Hanxiang Du, 4pm CF 105 - Research (Educational Data Mining, Artificial Intelligence and Machine Learning in Education)
  • Tue 1/31 Hanxiang Du, 4pm CF 025 - Teaching Demo
  • Thu 2/2 Czilard Vajda, 4pm CF 105 - Research (machine learning/data science/etc.; wine quality estimation?)
  • Fri 2/3 Czilard Vajda, 4pm CF 316 - Teaching Demo

Data Ethics 1 due tomorrow night

Goals:¶

  • Be aware of, and get practice deciding how to handle, common issues that arise before analysis:
    • Data types and units
    • Missing Data
    • Outliers
  • Know how and why to compute a few different numerical normalizations:
    • $z$-scores
    • 0-1 normalization
    • Exponential normalization

Rule #1 of Data Science: GIGO¶


On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage, Passages from the Life of a Philosopher

Data Pitfalls¶

What do we need to watch out for when approaching a new dataset?

  • Data types and units

    • Kilograms vs pounds
  • Numerical representations

    • 1.6e1 vs 16 vs 16.0 vs sixteen vs 32/2
    • Generally, do processing on floats unless (and perhaps even when) the underlying data is integral
  • Unification and general apples-to-apples issues

    • Money - inflation, economic context
    • Time zones
    • Multiple data sources - name differences (Wehrwein, S. vs Scott Wehrwein)
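
Two of these fixes sketched in pandas, with made-up values (all measurements and names here are hypothetical): converting mixed units to a common one, and unifying name formats across sources with an explicit mapping.

In [ ]:
import pandas as pd

# Hypothetical measurements recorded in mixed units
weights = pd.Series([150.0, 68.0, 172.0])
units = pd.Series(["lb", "kg", "lb"])

# Convert everything to kilograms before comparing
LB_PER_KG = 2.20462
weights_kg = weights.where(units == "kg", weights / LB_PER_KG)

# Unifying name formats across sources usually needs an explicit mapping
names = pd.Series(["Wehrwein, S.", "Scott Wehrwein"])
canonical = names.replace({"Wehrwein, S.": "Scott Wehrwein"})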

A potentially insidious example: in the LCD data, there are two types of hourly reports, FM-15 and FM-16. The latter is taken more often than hourly, and only when aviators need more frequent updates due to interesting weather. What might this mean if:

  • you investigate how often thunderstorms happen by counting hourly measurements with thunderstorms in the weather column?
  • you compare average wind speeds in two cities, one of which has very gusty (i.e., variable) winds often, while the other has high sustained winds more often?
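
One defense, sketched below under the assumption that the LCD table has a REPORT_TYPE column distinguishing the two report types (check your copy of the data; the filename here is hypothetical): filter to the regularly scheduled FM-15 reports before counting or averaging, so the weather-triggered FM-16 specials don't over-represent interesting weather.

In [ ]:
import pandas as pd

# Assumption: REPORT_TYPE distinguishes FM-15 (scheduled hourly) from
# FM-16 (special, weather-triggered) reports.
lcd = pd.read_csv("lcd_station.csv")  # hypothetical filename

# Keep only scheduled hourly reports so stormy/gusty conditions aren't
# over-represented in counts and averages.
lcd_fm15 = lcd[lcd["REPORT_TYPE"].str.strip() == "FM-15"]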

Missing Data Worksheet¶

Sometimes data should be there but isn't. What would you do here?

Complete the worksheet in groups of three.

What general strategies can we extract from these examples?

  • Replace missing values with zero
  • Replace with a plausible average
  • Derive the missing quantity from other quantities
  • Drop the affected rows entirely
  • Interpolate (average nearby values)
  • Pull the value from another data source
  • Pull the value from another column

What strategies for handling missing data can we extract from the above (and some others that may not have come up)?

  • Heuristics (birth + life expectancy for Year of Death)
  • Mean value (fill in the average)
  • Random value (fill in garbage)
  • Nearest Neighbor (in time, or most-similar datapoint)
  • Interpolation / imputation
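
A minimal sketch of a few of these strategies in pandas, on a made-up Series with gaps:

In [ ]:
import numpy as np
import pandas as pd

s = pd.Series([12.0, np.nan, 15.0, np.nan, 20.0])

s.dropna()          # drop the missing entries entirely
s.fillna(s.mean())  # mean value: fill gaps with the column average
s.fillna(0.0)       # fill with a constant (e.g., zero -- use with care!)
s.interpolate()     # interpolation: average the nearest non-missing neighbors
s.ffill()           # nearest neighbor in time: carry the last value forward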

Outliers¶

  • An outlier is a datapoint that is significantly separated from the main body of observations/data
  • Several causes:
    • They can be actual, valid observations/measurements.
      • The heavier the tail of the distribution the data comes from, the more likely such points are to appear: a heavy-tailed distribution puts more probability on values far from the mean.
    • Data entry errors; e.g., punching in the wrong numbers
    • Fraud; e.g., tampering with the data
    • Instrument error; e.g., malfunctioning sensor
    • Imputation gone awry

Strategy for dealing with outliers that you've decided are erroneous: treat them as missing data and apply the strategies above.

Be careful: could you be wrong? How would that affect the outcomes of your analysis?
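
A minimal sketch using the common 1.5 * IQR fence (one heuristic among many, with made-up values) to flag suspect points and convert them to NaN so they can be handled as missing data:

In [ ]:
import pandas as pd

# 142.0 looks suspiciously like a typo for 14.2
s = pd.Series([14.0, 15.0, 13.5, 16.0, 142.0])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
is_outlier = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

cleaned = s.mask(is_outlier)  # flagged values become NaN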

Numerical Normalization¶

Let's load up the NHANES body measurement dataset.

In [ ]:
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')  # drop the respondent ID
df = df[df["Age"] >= 21]              # keep adults only

In the NHANES dataset, heights and other length measurements are given in centimeters.

In [ ]:
ht_col = df["Height"]
ht_col

Question: If you're 160cm tall, are you short? tall? average? Answer:
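
One way to ground an answer empirically before formalizing it: check where 160cm falls in this sample.

In [ ]:
# Fraction of adults in the sample shorter than 160cm
(ht_col.dropna() < 160).mean()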

$z$-scores¶

To compute a $z$-score:

  1. Subtract the mean
  2. Divide by the standard deviation.

In math: $$ \hat{x}_i = \frac{x_i -\mu}{\sigma}$$

In pandas:

In [ ]:
df["Height-z"] = (ht_col - ht_col.mean()) / ht_col.std()
df["Height-z"]
In [ ]:
sns.histplot(x="Height-z", data=df)

Nice properties of $z$-scores:

  • $\hat{x}_i < 0$, smaller than average
  • $\hat{x}_i > 0$, greater than average
  • $\hat{x}_i > 1$, more than one standard deviation above average
  • etc.
  • Can give context to how normal or anomalous a datapoint is
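
A quick sanity check that the transformation behaves as advertised: z-scores should have mean approximately 0 and standard deviation approximately 1.

In [ ]:
df["Height-z"].mean(), df["Height-z"].std()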

0-1 normalization¶

$$ \hat{x}_i = \frac{x_i - x_{min}}{x_{max}-x_{min}}$$
  • Here $x_{max}$ and $x_{min}$ are the max/min values observed in the dataset -- *or* a theoretical min or max.
  • Warning: if a new datapoint comes along and you use the same mapping, can get values that are $<0$ or $>1$.
  • We did this with images!
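
In pandas, mirroring the z-score cell above (Height-01 is a column name introduced here, not part of the original dataset):

In [ ]:
ht_min, ht_max = ht_col.min(), ht_col.max()
df["Height-01"] = (ht_col - ht_min) / (ht_max - ht_min)
df["Height-01"].describe()  # min should be 0, max should be 1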

Exponentiation¶

If we need to make values non-negative, we can exponentiate: $$ \hat{x}_i = e^{x_i}$$

  • $x_i \to -\infty$, normalized value approaches 0
  • $x_i \to \infty$, normalized value gets large quickly!
In [ ]:
x = np.linspace(-5, 5, num=10000)

sns.lineplot(x=x, y=np.exp(x))