import pandas as pd
survey_results = pd.read_csv("~/311/students/lab2_hours/Week 2 Survey Survey Student Analysis Report.csv")
# Column 8 holds the answers to "Approximately how many hours did you spend on the lab?"
hrs = survey_results.iloc[:, 8]
hrs.describe()
count    35.000000
mean     11.328571
std       8.682253
min       2.000000
25%       5.000000
50%       8.000000
75%      15.000000
max      42.000000
Name: 28088758: Approximately how many hours did you spend on the lab?, dtype: float64
hrs.plot.hist(bins=range(0, 50, 5))
<AxesSubplot:ylabel='Frequency'>
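The summary above already hints at a long right tail: the median is 8 hours but the maximum is 42. As a rough sketch, the common 1.5 × IQR rule of thumb can be applied directly to the quartiles reported by describe(). The 1.5 multiplier is a convention assumed here, not something the survey or the course prescribes, and a flagged value is only a candidate for a closer look, not necessarily an error.

```python
# Quartiles and maximum copied from the hrs.describe() output above.
q1, q3, reported_max = 5.0, 15.0, 42.0

# Interquartile range and the conventional 1.5 * IQR fences (an assumed rule of thumb).
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# True when the largest response lies beyond the upper fence.
max_is_flagged = reported_max > upper_fence

print(f"fences: [{lower_fence}, {upper_fence}], max flagged: {max_is_flagged}")
```

Whether a 42-hour response is a typo, an honest answer, or a different reading of the question cannot be decided from the number alone.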
On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
— Charles Babbage, Passages from the Life of a Philosopher
Or, in this course so far?
Ideas from the class:
What do we need to watch out for when approaching a new dataset?
Outliers
Unification and general apples-to-apples issues
A potentially insidious example: in the LCD data, there are two types of hourly reports, FM-15 and FM-16. The latter appears to be issued more often than hourly, but only when aviators need more frequent updates because of some interesting weather. What might this mean if:
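One way to see why the mix of report types matters is a small sketch with made-up numbers. The rows below are hypothetical (report type, wind speed) pairs, not real LCD records, and the tuple layout is an assumption for illustration only. Because the extra FM-16 reports cluster in the stormy hour, averaging over every row weights that hour far more heavily than the calm ones.

```python
from statistics import mean

# Hypothetical rows: one routine FM-15 report per hour for 24 hours,
# plus six extra FM-16 reports that all fall in the single stormy hour.
rows = [("FM-15", 10.0)] * 23 + [("FM-15", 40.0)] + [("FM-16", 40.0)] * 6

# Averaging every row treats each report as if it covered the same span of time.
naive_mean = mean(speed for _, speed in rows)

# Restricting to the routine reports gives each hour equal weight.
hourly_only_mean = mean(speed for kind, speed in rows if kind == "FM-15")

print(naive_mean, hourly_only_mean)
```

Filtering to one report type is only one possible choice; which rows to keep depends on the question being asked of the data.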
What would you do here?
LCD Weather data:
NHANES or similar survey:
Assignment survey:
Avengers:
What general strategies can we extract from the above? Suggestions from the class:
What strategies for handling missing data can we extract from the above (and some others that may not have come up)?
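As a starting point for that discussion, here is a minimal pandas sketch of two options that often come up: dropping incomplete rows, and imputing a value while keeping a flag for which rows were filled. The data frame and the column names `hours` and `hours_missing` are made up for illustration; they are not the survey responses, and neither option is presented as the right answer for any of the datasets above.

```python
import numpy as np
import pandas as pd

# Made-up responses for illustration only; NaN marks a skipped question.
df = pd.DataFrame({"hours": [2.0, 5.0, np.nan, 8.0, 15.0, np.nan, 42.0]})

# Option 1: drop rows with a missing value (simple, but shrinks the sample).
dropped = df.dropna(subset=["hours"])

# Option 2: keep every row, record which ones were missing, then fill
# the gaps with the median of the observed values.
imputed = df.copy()
imputed["hours_missing"] = imputed["hours"].isna()
imputed["hours"] = imputed["hours"].fillna(imputed["hours"].median())

print(len(dropped), imputed["hours"].tolist())
```

Keeping the indicator column makes it possible to check later whether the rows with missing answers differ systematically from the rest.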