Lecture 13 - Exploratory Data Analysis: "Cold Open"¶

The data:

In [ ]:
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_23w/data/yellow_tripdata_2018-06_small.csv"

The data came from here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.

It was preprocessed by a friend using this notebook. I think he told me he pulled out a subset of columns and subsampled the rows, but I don't know any more than that.

...and go.¶

In [2]:
import pandas as pd
df =pd.read_csv(data_url)
df
Out[2]:
Unnamed: 0 time_elapsed_min passenger_count trip_distance payment_type fare_amount tip_amount
0 4218845 12 1 2.70 1 11.5 1.00
1 7051345 13 1 4.42 2 15.5 0.00
2 3944882 9 1 1.20 1 8.0 2.60
3 6523355 3 1 0.75 1 5.0 0.63
4 3004564 14 2 1.46 1 10.5 2.26
... ... ... ... ... ... ... ...
435687 5549269 19 5 4.39 1 17.5 3.76
435688 5389974 6 1 0.40 2 5.5 0.00
435689 3129538 36 2 3.69 1 22.5 4.86
435690 1558533 3 1 0.46 2 4.0 0.00
435691 3486394 5 1 0.84 2 5.5 0.00

435692 rows × 7 columns

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 435692 entries, 0 to 435691
Data columns (total 7 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   Unnamed: 0        435692 non-null  int64  
 1   time_elapsed_min  435692 non-null  int64  
 2   passenger_count   435692 non-null  int64  
 3   trip_distance     435692 non-null  float64
 4   payment_type      435692 non-null  int64  
 5   fare_amount       435692 non-null  float64
 6   tip_amount        435692 non-null  float64
dtypes: float64(3), int64(4)
memory usage: 23.3 MB
In [6]:
df.groupby(by="passenger_count")["time_elapsed_min"].describe()
Out[6]:
count mean std min 25% 50% 75% max
passenger_count
0 3493.0 14.262525 12.573817 0.0 6.00 11.0 19.00 108.0
1 309968.0 16.693594 62.360731 0.0 6.00 11.0 18.00 1439.0
2 63663.0 18.335894 69.706152 0.0 6.00 11.0 19.00 1439.0
3 18253.0 18.218704 69.287559 0.0 7.00 11.0 19.00 1439.0
4 8787.0 17.295436 52.822602 0.0 7.00 12.0 19.00 1438.0
5 19441.0 19.091302 82.836403 0.0 6.00 11.0 18.00 1439.0
6 12083.0 18.782835 79.826336 0.0 6.00 11.0 18.00 1439.0
7 2.0 10.500000 14.849242 0.0 5.25 10.5 15.75 21.0
8 1.0 0.000000 NaN 0.0 0.00 0.0 0.00 0.0
9 1.0 42.000000 NaN 42.0 42.00 42.0 42.00 42.0
In [7]:
df.groupby(by="passenger_count")["payment_type"].describe()
Out[7]:
count mean std min 25% 50% 75% max
passenger_count
0 3493.0 1.347839 0.547403 1.0 1.00 1.0 2.00 4.0
1 309968.0 1.308974 0.486133 1.0 1.00 1.0 2.00 4.0
2 63663.0 1.324678 0.482142 1.0 1.00 1.0 2.00 4.0
3 18253.0 1.336602 0.488298 1.0 1.00 1.0 2.00 4.0
4 8787.0 1.389894 0.513224 1.0 1.00 1.0 2.00 4.0
5 19441.0 1.301682 0.463461 1.0 1.00 1.0 2.00 4.0
6 12083.0 1.313829 0.468505 1.0 1.00 1.0 2.00 4.0
7 2.0 1.500000 0.707107 1.0 1.25 1.5 1.75 2.0
8 1.0 1.000000 NaN 1.0 1.00 1.0 1.00 1.0
9 1.0 1.000000 NaN 1.0 1.00 1.0 1.00 1.0
In [11]:
df[["trip_distance", "fare_amount"]].sort_values(by="trip_distance")
Out[11]:
trip_distance fare_amount
135540 0.00 30.0
213675 0.00 2.5
95160 0.00 52.0
141490 0.00 58.0
213694 0.00 2.5
... ... ...
410333 56.32 176.0
426199 63.25 320.0
99504 66.30 300.0
313173 66.71 267.0
311198 69.46 225.0

435692 rows × 2 columns

In [16]:
import seaborn as sns
sns.relplot(x="trip_distance", y="fare_amount", data=df.sample(n=6000, axis=0))
Out[16]:
<seaborn.axisgrid.FacetGrid at 0x7f364b554760>
In [21]:
sns.relplot(x="trip_distance", y="tip_amount", col="passenger_count", data=df.sample(n=6000, axis=0))
Out[21]:
<seaborn.axisgrid.FacetGrid at 0x7f364b334c40>
In [22]:
sns.relplot(x="fare_amount", y="tip_amount", data=df.sample(n=6000, axis=0))
Out[22]:
<seaborn.axisgrid.FacetGrid at 0x7f364b466f10>
In [24]:
neg_fares = df[df["fare_amount"] < 0]
In [26]:
sns.relplot(x="trip_distance", y="fare_amount", data=neg_fares)
Out[26]:
<seaborn.axisgrid.FacetGrid at 0x7f3647efc700>
In [31]:
(df["fare_amount"] + df["tip_amount"]).sum()
Out[31]:
6611198.3599999985
In [30]:
(df["fare_amount"] + df["tip_amount"]).mean()
Out[30]:
15.174018251425315
In [33]:
(df["fare_amount"] + df["tip_amount"]).sum() / df["trip_distance"].sum()
Out[33]:
5.037009795215846
In [34]:
neg_fares["tip_amount"].plot.hist()
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3647ce5c10>