Lecture 11 - Data Cleaning and Missing Data

https://imgs.xkcd.com/comics/every_data_table.png

Announcements:

Goals:

Rule #1 of Data Science: GIGO (garbage in, garbage out)

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage, Passages from the Life of a Philosopher

Lab 3: What real-data problems have you encountered so far?

Or, in this course so far?

Ideas from the class:

What do we need to watch out for when approaching a new dataset?

A potentially insidious example: in the LCD data, there are two types of hourly reports, FM-15 and FM-16. The latter appears to be taken more often than hourly, and only when aviators need more frequent updates because of some interesting weather. What might this mean if:
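One way to see why the mix of report types matters is a small sketch in pandas. The data below is made up for illustration, and the column names (DATE, REPORT_TYPE, temperature) are assumptions rather than the exact LCD schema. It compares a naive average over all rows with an average over the routine FM-15 reports only.

```python
import pandas as pd

# Made-up records; column names are assumptions, not the real LCD schema.
obs = pd.DataFrame(
    {
        "DATE": pd.to_datetime(
            [
                "2020-01-01 00:53",
                "2020-01-01 01:53",
                "2020-01-01 02:53",
                "2020-01-01 03:10",  # extra reports during interesting weather
                "2020-01-01 03:25",
                "2020-01-01 03:40",
                "2020-01-01 03:53",
            ]
        ),
        "REPORT_TYPE": ["FM-15", "FM-15", "FM-15", "FM-16", "FM-16", "FM-16", "FM-15"],
        "temperature": [40.0, 40.0, 40.0, 30.0, 30.0, 30.0, 30.0],
    }
)

# A quick first check: how many rows of each report type are there?
counts = obs["REPORT_TYPE"].value_counts()

# Naive mean: every row counts equally, so the hour with extra
# reports is weighted four times as heavily as the other hours.
naive_mean = obs["temperature"].mean()

# Mean over routine reports only: one row per hour in this toy data.
routine = obs[obs["REPORT_TYPE"] == "FM-15"]
routine_mean = routine["temperature"].mean()

print(counts)
print(f"naive mean:   {naive_mean:.2f}")
print(f"routine mean: {routine_mean:.2f}")
```

In this toy data the two averages differ by about three degrees, purely because of how often each hour was sampled. Whether dropping the FM-16 rows is the right fix depends on the question being asked; the sketch only shows that the choice changes the answer.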

What would you do here?

LCD weather data:

NHANES or similar survey:

Assignment survey:

Avengers:

What general strategies can we extract from the above?

Suggestions from the class:

What strategies for handling missing data can we extract from the above (and some others that may not have come up)?
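As a starting point for that discussion, here is a minimal pandas sketch of a few commonly used options: counting what is missing, dropping incomplete rows, and filling with a summary statistic while keeping a flag. The toy table and its column names are hypothetical, and these are not necessarily the strategies the class came up with.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names; NaN marks a missing value.
df = pd.DataFrame(
    {
        "age": [25.0, np.nan, 40.0, 35.0],
        "income": [50.0, 60.0, np.nan, 70.0],
    }
)

# 1. Count what is missing before deciding anything.
missing_per_column = df.isna().sum()

# 2. Drop every row that has any missing value (simple, but loses data).
dropped = df.dropna()

# 3. Fill with a summary statistic, keeping a flag column so the
#    imputation stays visible to later analysis.
imputed = df.copy()
imputed["age_was_missing"] = imputed["age"].isna()
imputed["age"] = imputed["age"].fillna(imputed["age"].mean())

print(missing_per_column)
print(dropped)
print(imputed)
```

Each option has a cost. Dropping rows halves this toy dataset, while mean imputation keeps every row but shrinks the variance of the filled column. Which trade-off is acceptable depends on why the values are missing in the first place.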