Lecture 14 - Exploratory Data Analysis: "Cold Open"¶

Names:¶

Announcements¶

  • Happy Halloween!
  • Lab 5: don't worry about scraping Genre - it's harder than it used to be!
  • Data Ethics 3, on crawling/scraping/data collection for AI training, is out!
    • You'll read one of 4 articles for Wednesday; we'll share what we read in class, then discuss.

Goals¶

  • Get practice running exploratory data analysis to find interesting trends or insights in an unfamiliar dataset.

Your Task¶

  1. Find a partner and log into a single (shared) computer.
  2. Download this notebook from the course webpage and fill in your names in the cell above.
  3. Choose one of the following two datasets to explore.
  4. Explore it!

Find and document interesting trends or insights; things that match up with your expectations are good, but things that are surprising or lead you to some sort of greater understanding of the world are even better!

By the end of class, create a brief writeup of at least one such finding at the bottom of this notebook. Exploratory "scratch work" can be left above your writeup, but the writeup should be a little more polished. Your analysis should include:

  • The code that got you to the insight in code blocks, with Markdown cells interspersed explaining your methodology.
  • At least one well-produced plot that illustrates the finding.

Submit your notebook in HTML format (File -> Save and export as -> HTML) to the 10/31 EDA assignment on Canvas.

The Data¶

Option 1: Yellow Cab Data¶

In [ ]:
yellow_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/yellow_tripdata_2018-06_small.csv"

The data came from here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

It was preprocessed by a friend using this notebook. I think he told me he pulled out a subset of columns and subsampled the rows, but I don't know any more than that.

Option 2: Flight Data¶

In [3]:
flights_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/nycflights.csv"

The data came from https://github.com/tidyverse/nycflights13/tree/main.

If you want some more data to go with it, you can check out https://github.com/tidyverse/nycflights13/tree/main/data-raw, which has tables for weather, planes, airports, and airlines.
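If you do pull in one of those extra tables, pandas' merge will join it onto the flights data on a shared key column. A minimal sketch, using made-up miniature frames (the `year_built` column is hypothetical; check the real tables for the actual column names and keys):

```python
import pandas as pd

# Made-up miniature versions of the flights table and a planes table
flights_toy = pd.DataFrame({
    "tailnum": ["N626VA", "N3760C"],
    "dep_delay": [15, -3],
})
planes_toy = pd.DataFrame({
    "tailnum": ["N626VA", "N3760C"],
    "year_built": [2011, 1998],  # hypothetical column name for illustration
})

# Left join keeps every flight even if its plane is missing from planes_toy
joined = flights_toy.merge(planes_toy, on="tailnum", how="left")
```

A left join is usually the safe default here: you don't want to silently drop flights just because a tail number is missing from the other table.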

...and go.¶

In [12]:
import pandas as pd
import seaborn as sns

Note: I spent most of class chatting with folks about their analyses, but I spent a couple minutes doing a tiny bit of exploration, shown below.

In [7]:
flights = pd.read_csv(flights_url)
In [9]:
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32735 entries, 0 to 32734
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       32735 non-null  int64 
 1   month      32735 non-null  int64 
 2   day        32735 non-null  int64 
 3   dep_time   32735 non-null  int64 
 4   dep_delay  32735 non-null  int64 
 5   arr_time   32735 non-null  int64 
 6   arr_delay  32735 non-null  int64 
 7   carrier    32735 non-null  object
 8   tailnum    32735 non-null  object
 9   flight     32735 non-null  int64 
 10  origin     32735 non-null  object
 11  dest       32735 non-null  object
 12  air_time   32735 non-null  int64 
 13  distance   32735 non-null  int64 
 14  hour       32735 non-null  int64 
 15  minute     32735 non-null  int64 
dtypes: int64(12), object(4)
memory usage: 4.0+ MB
In [10]:
flights.head()
Out[10]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
0 2013 6 30 940 15 1216 -4 VX N626VA 407 JFK LAX 313 2475 9 40
1 2013 5 7 1657 -3 2104 10 DL N3760C 329 JFK SJU 216 1598 16 57
2 2013 12 8 859 -1 1238 11 DL N712TW 422 JFK LAX 376 2475 8 59
3 2013 5 14 1841 -4 2122 -34 DL N914DL 2391 JFK TPA 135 1005 18 41
4 2013 7 21 1102 -3 1230 -8 9E N823AY 3652 LGA ORF 50 296 11 2
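With columns like these, one natural first question is how departure delay varies by carrier; a `groupby` plus `mean` answers it. A small sketch on a toy frame with the same column names (the values here are made up for illustration, not from the dataset):

```python
import pandas as pd

# Toy frame mimicking the carrier and dep_delay columns (made-up values)
toy = pd.DataFrame({
    "carrier":   ["DL", "DL", "VX", "9E"],
    "dep_delay": [15,   -3,   -1,   7],
})

# Mean departure delay per carrier, worst first
mean_delay = (toy.groupby("carrier")["dep_delay"]
                 .mean()
                 .sort_values(ascending=False))
print(mean_delay)
```

The same pattern works for `origin` or `dest`, and swapping `.mean()` for `.median()` is worth trying since delays are heavily right-skewed.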
In [15]:
sns.displot(flights.dep_time % 100)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x14c6a90ca060>
[Histogram: distribution of dep_time % 100, the departure minute within the hour]
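The `% 100` in that cell works because `dep_time` is stored as an integer in HHMM form (e.g. 940 means 9:40), so the remainder mod 100 is the minute and integer division by 100 is the hour, matching the `hour` and `minute` columns in the head above:

```python
import pandas as pd

# dep_time values from the head() above, in integer HHMM form
dep_time = pd.Series([940, 1657, 859, 1841, 1102])

minute = dep_time % 100   # 40, 57, 59, 41, 2
hour = dep_time // 100    # 9, 16, 8, 18, 11
```

A spike at round minutes (e.g. :00 or :30) in that histogram would suggest scheduled times leaking into reported departure times, which is exactly the kind of quirk EDA is meant to surface.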