Lecture 14 - Exploratory Data Analysis: "Cold Open"¶

Names:¶

Announcements¶

  • Happy Halloween!
  • Lab 5: don't worry about scraping Genre - it's harder than it used to be!
  • Data Ethics 3, on crawling/scraping/data collection for AI training, is out!
    • You'll read one of 4 articles for Wednesday; we'll share what we read in class, then discuss.

Goals¶

  • Get practice running exploratory data analysis to find interesting trends or insights in an unfamiliar dataset.

Your Task¶

  1. Find a partner and log into a single (shared) computer.
  2. Download this notebook from the course webpage and fill in your names in the cell above.
  3. Choose one of the following two datasets to explore.
  4. Explore it!

Find and document interesting trends or insights; things that match up with your expectations are good, but things that are surprising or lead you to some sort of greater understanding of the world are even better!

By the end of class, create a brief writeup of at least one such finding at the bottom of this notebook. Exploratory "scratch work" can be left above your writeup, but the writeup should be a little more polished. Your analysis should include:

  • The code that got you to the insight in code blocks, with Markdown cells interspersed explaining your methodology.
  • At least one well-produced plot that illustrates the finding.

Submit your notebook in HTML format (File -> Save and export as -> HTML) to the 10/31 EDA assignment on Canvas.

The Data¶

Option 1: Yellow Cab Data¶

In [ ]:
yellow_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/yellow_tripdata_2018-06_small.csv"

The data came from here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

It was preprocessed by a friend using this notebook. I think he told me he pulled out a subset of columns and subsampled the rows, but I don't know any more than that.

Option 2: Flight Data¶

In [3]:
flights_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_25f/data/nycflights.csv"

The data came from https://github.com/tidyverse/nycflights13/tree/main.

If you want some more data to go with it, you can check out https://github.com/tidyverse/nycflights13/tree/main/data-raw, which has tables for weather, planes, airports, and airlines.
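If you do pull in one of those extra tables, pandas' merge will join it onto the flights data on a shared key column. A minimal sketch, using made-up miniature frames (the `year_built` column is hypothetical; check the real tables for the actual column names and keys):

```python
import pandas as pd

# Made-up miniature versions of the flights table and a planes table
flights_toy = pd.DataFrame({
    "tailnum": ["N626VA", "N3760C"],
    "dep_delay": [15, -3],
})
planes_toy = pd.DataFrame({
    "tailnum": ["N626VA", "N3760C"],
    "year_built": [2011, 1998],  # hypothetical column name for illustration
})

# Left join keeps every flight even if its plane is missing from planes_toy
joined = flights_toy.merge(planes_toy, on="tailnum", how="left")
```

A left join is usually the safe default here: you don't want to silently drop flights just because a tail number is missing from the other table.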

...and go.¶

In [12]:
import pandas as pd
import seaborn as sns

Note: I spent most of class chatting with folks about their analyses, but I spent a couple minutes doing a tiny bit of exploration, shown below.

In [7]:
flights = pd.read_csv(flights_url)
In [9]:
flights.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32735 entries, 0 to 32734
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   year       32735 non-null  int64 
 1   month      32735 non-null  int64 
 2   day        32735 non-null  int64 
 3   dep_time   32735 non-null  int64 
 4   dep_delay  32735 non-null  int64 
 5   arr_time   32735 non-null  int64 
 6   arr_delay  32735 non-null  int64 
 7   carrier    32735 non-null  object
 8   tailnum    32735 non-null  object
 9   flight     32735 non-null  int64 
 10  origin     32735 non-null  object
 11  dest       32735 non-null  object
 12  air_time   32735 non-null  int64 
 13  distance   32735 non-null  int64 
 14  hour       32735 non-null  int64 
 15  minute     32735 non-null  int64 
dtypes: int64(12), object(4)
memory usage: 4.0+ MB
In [10]:
flights.head()
Out[10]:
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
0 2013 6 30 940 15 1216 -4 VX N626VA 407 JFK LAX 313 2475 9 40
1 2013 5 7 1657 -3 2104 10 DL N3760C 329 JFK SJU 216 1598 16 57
2 2013 12 8 859 -1 1238 11 DL N712TW 422 JFK LAX 376 2475 8 59
3 2013 5 14 1841 -4 2122 -34 DL N914DL 2391 JFK TPA 135 1005 18 41
4 2013 7 21 1102 -3 1230 -8 9E N823AY 3652 LGA ORF 50 296 11 2
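With columns like these, one natural first question is how departure delay varies by carrier; a `groupby` plus `mean` answers it. A small sketch on a toy frame with the same column names (the values here are made up for illustration, not from the dataset):

```python
import pandas as pd

# Toy frame mimicking the carrier and dep_delay columns (made-up values)
toy = pd.DataFrame({
    "carrier":   ["DL", "DL", "VX", "9E"],
    "dep_delay": [15,   -3,   -1,   7],
})

# Mean departure delay per carrier, worst first
mean_delay = (toy.groupby("carrier")["dep_delay"]
                 .mean()
                 .sort_values(ascending=False))
print(mean_delay)
```

The same pattern works for `origin` or `dest`, and swapping `.mean()` for `.median()` is worth trying since delays are heavily right-skewed.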
In [15]:
sns.displot(flights.dep_time % 100)
Out[15]:
<seaborn.axisgrid.FacetGrid at 0x14c6a90ca060>
[Histogram: distribution of dep_time % 100, the departure minute within the hour]
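The `% 100` in that cell works because `dep_time` is stored as an integer in HHMM form (e.g. 940 means 9:40), so the remainder mod 100 is the minute and integer division by 100 is the hour, matching the `hour` and `minute` columns in the head above:

```python
import pandas as pd

# dep_time values from the head() above, in integer HHMM form
dep_time = pd.Series([940, 1657, 859, 1841, 1102])

minute = dep_time % 100   # 40, 57, 59, 41, 2
hour = dep_time // 100    # 9, 16, 8, 18, 11
```

A spike at round minutes (e.g. :00 or :30) in that histogram would suggest scheduled times leaking into reported departure times, which is exactly the kind of quirk EDA is meant to surface.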