Lecture 14 - Exploratory Data Analysis: "Cold Open"¶
Names:¶
Announcements¶
- Happy Halloween!
- Lab 5: don't worry about scraping Genre - it's harder than it used to be!
- Data Ethics 3, on crawling/scraping/data collection for AI training, is out!
- You'll read one of 4 articles for Wednesday; we'll share what we read in class, then discuss.
Goals¶
- Get practice running exploratory data analysis to find interesting trends or insights in an unfamiliar dataset.
Your Task¶
- Find a partner and log into a single (shared) computer.
- Download this notebook from the course webpage and fill in your names in the cell above.
- Choose one of the following two datasets to explore.
- Explore it!
Find and document interesting trends or insights; things that match up with your expectations are good, but things that are surprising or lead you to some sort of greater understanding of the world are even better!
By the end of class, create a brief writeup of at least one such finding at the bottom of this notebook. Exploratory "scratch work" can be left above your writeup, but the writeup itself should be a little more polished. Your analysis should include:
- The code that got you to the insight, in code cells, with Markdown cells interspersed that explain your methodology.
- At least one well-produced plot that illustrates the finding.
Submit your notebook in HTML format (File -> Save and export as -> HTML) to the 10/31 EDA assignment on Canvas.
The Data¶
Option 1: Yellow Cab Data¶
yellow_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/yellow_tripdata_2018-06_small.csv"
The data came from here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
It was preprocessed by a friend using this notebook. I think he told me he pulled out a subset of columns and subsampled the rows, but I don't know any more than that.
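If you pick this one, a minimal first look might be something like the sketch below; since I don't know exactly which columns survived the preprocessing, it avoids assuming any particular column names.
import pandas as pd
taxi = pd.read_csv(yellow_url)
taxi.info()       # column names, dtypes, and non-null counts
taxi.describe()   # quick numeric summaries to spot outliers or odd encodings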
Option 2: Flight Data¶
flights_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/nycflights.csv"
The data came from https://github.com/tidyverse/nycflights13/tree/main.
If you want some more data to go with it, you can check out https://github.com/tidyverse/nycflights13/tree/main/data-raw, which has tables for weather, planes, airports, and airlines.
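If you want to go that route, here's a rough sketch of pulling in one of those extra tables and joining it onto the flights data. The exact CSV path inside data-raw and the airlines column names (carrier, name) are assumptions on my part, so double-check them against the repo.
import pandas as pd
flights = pd.read_csv(flights_url)
# Assumed location of the raw airlines table -- verify against the data-raw folder
airlines_url = "https://raw.githubusercontent.com/tidyverse/nycflights13/main/data-raw/airlines.csv"
airlines = pd.read_csv(airlines_url)   # expected columns: carrier, name
# Attach full airline names to each flight using the shared carrier code
flights_named = flights.merge(airlines, on="carrier", how="left")
flights_named[["carrier", "name", "origin", "dest", "dep_delay"]].head()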
...and go.¶
import pandas as pd
import seaborn as sns
Note: I spent most of class chatting with folks about their analyses, but I spent a couple minutes doing a tiny bit of exploration, shown below.
flights = pd.read_csv(flights_url)
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32735 entries, 0 to 32734
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   year       32735 non-null  int64
 1   month      32735 non-null  int64
 2   day        32735 non-null  int64
 3   dep_time   32735 non-null  int64
 4   dep_delay  32735 non-null  int64
 5   arr_time   32735 non-null  int64
 6   arr_delay  32735 non-null  int64
 7   carrier    32735 non-null  object
 8   tailnum    32735 non-null  object
 9   flight     32735 non-null  int64
 10  origin     32735 non-null  object
 11  dest       32735 non-null  object
 12  air_time   32735 non-null  int64
 13  distance   32735 non-null  int64
 14  hour       32735 non-null  int64
 15  minute     32735 non-null  int64
dtypes: int64(12), object(4)
memory usage: 4.0+ MB
flights.head()
| | year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2013 | 6 | 30 | 940 | 15 | 1216 | -4 | VX | N626VA | 407 | JFK | LAX | 313 | 2475 | 9 | 40 |
| 1 | 2013 | 5 | 7 | 1657 | -3 | 2104 | 10 | DL | N3760C | 329 | JFK | SJU | 216 | 1598 | 16 | 57 |
| 2 | 2013 | 12 | 8 | 859 | -1 | 1238 | 11 | DL | N712TW | 422 | JFK | LAX | 376 | 2475 | 8 | 59 |
| 3 | 2013 | 5 | 14 | 1841 | -4 | 2122 | -34 | DL | N914DL | 2391 | JFK | TPA | 135 | 1005 | 18 | 41 |
| 4 | 2013 | 7 | 21 | 1102 | -3 | 1230 | -8 | 9E | N823AY | 3652 | LGA | ORF | 50 | 296 | 11 | 2 |
sns.displot(flights.dep_time % 100)  # dep_time is in HHMM format, so % 100 isolates the minutes
[Plot: histogram of dep_time % 100, i.e. the minutes part of the departure time.]
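Purely as an illustrative sketch of where this could go next (we didn't do this in class), you could check whether delays look different at different times of day; here I'm assuming the hour column is the departure hour.
# Mean departure delay for each departure hour (seaborn adds error bars by default)
sns.barplot(data=flights, x="hour", y="dep_delay")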