The data:
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_23w/data/yellow_tripdata_2018-06_small.csv"
The data came from here: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
It was preprocessed by a friend using this notebook. I think he told me he pulled out a subset of columns and subsampled the rows, but I don't know any more than that.
import pandas as pd
df =pd.read_csv(data_url)
df
Unnamed: 0 | time_elapsed_min | passenger_count | trip_distance | payment_type | fare_amount | tip_amount | |
---|---|---|---|---|---|---|---|
0 | 4218845 | 12 | 1 | 2.70 | 1 | 11.5 | 1.00 |
1 | 7051345 | 13 | 1 | 4.42 | 2 | 15.5 | 0.00 |
2 | 3944882 | 9 | 1 | 1.20 | 1 | 8.0 | 2.60 |
3 | 6523355 | 3 | 1 | 0.75 | 1 | 5.0 | 0.63 |
4 | 3004564 | 14 | 2 | 1.46 | 1 | 10.5 | 2.26 |
... | ... | ... | ... | ... | ... | ... | ... |
435687 | 5549269 | 19 | 5 | 4.39 | 1 | 17.5 | 3.76 |
435688 | 5389974 | 6 | 1 | 0.40 | 2 | 5.5 | 0.00 |
435689 | 3129538 | 36 | 2 | 3.69 | 1 | 22.5 | 4.86 |
435690 | 1558533 | 3 | 1 | 0.46 | 2 | 4.0 | 0.00 |
435691 | 3486394 | 5 | 1 | 0.84 | 2 | 5.5 | 0.00 |
435692 rows × 7 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 435692 entries, 0 to 435691 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Unnamed: 0 435692 non-null int64 1 time_elapsed_min 435692 non-null int64 2 passenger_count 435692 non-null int64 3 trip_distance 435692 non-null float64 4 payment_type 435692 non-null int64 5 fare_amount 435692 non-null float64 6 tip_amount 435692 non-null float64 dtypes: float64(3), int64(4) memory usage: 23.3 MB
df.groupby(by="passenger_count")["time_elapsed_min"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
passenger_count | ||||||||
0 | 3493.0 | 14.262525 | 12.573817 | 0.0 | 6.00 | 11.0 | 19.00 | 108.0 |
1 | 309968.0 | 16.693594 | 62.360731 | 0.0 | 6.00 | 11.0 | 18.00 | 1439.0 |
2 | 63663.0 | 18.335894 | 69.706152 | 0.0 | 6.00 | 11.0 | 19.00 | 1439.0 |
3 | 18253.0 | 18.218704 | 69.287559 | 0.0 | 7.00 | 11.0 | 19.00 | 1439.0 |
4 | 8787.0 | 17.295436 | 52.822602 | 0.0 | 7.00 | 12.0 | 19.00 | 1438.0 |
5 | 19441.0 | 19.091302 | 82.836403 | 0.0 | 6.00 | 11.0 | 18.00 | 1439.0 |
6 | 12083.0 | 18.782835 | 79.826336 | 0.0 | 6.00 | 11.0 | 18.00 | 1439.0 |
7 | 2.0 | 10.500000 | 14.849242 | 0.0 | 5.25 | 10.5 | 15.75 | 21.0 |
8 | 1.0 | 0.000000 | NaN | 0.0 | 0.00 | 0.0 | 0.00 | 0.0 |
9 | 1.0 | 42.000000 | NaN | 42.0 | 42.00 | 42.0 | 42.00 | 42.0 |
df.groupby(by="passenger_count")["payment_type"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
passenger_count | ||||||||
0 | 3493.0 | 1.347839 | 0.547403 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
1 | 309968.0 | 1.308974 | 0.486133 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
2 | 63663.0 | 1.324678 | 0.482142 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
3 | 18253.0 | 1.336602 | 0.488298 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
4 | 8787.0 | 1.389894 | 0.513224 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
5 | 19441.0 | 1.301682 | 0.463461 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
6 | 12083.0 | 1.313829 | 0.468505 | 1.0 | 1.00 | 1.0 | 2.00 | 4.0 |
7 | 2.0 | 1.500000 | 0.707107 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 |
8 | 1.0 | 1.000000 | NaN | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
9 | 1.0 | 1.000000 | NaN | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
df[["trip_distance", "fare_amount"]].sort_values(by="trip_distance")
trip_distance | fare_amount | |
---|---|---|
135540 | 0.00 | 30.0 |
213675 | 0.00 | 2.5 |
95160 | 0.00 | 52.0 |
141490 | 0.00 | 58.0 |
213694 | 0.00 | 2.5 |
... | ... | ... |
410333 | 56.32 | 176.0 |
426199 | 63.25 | 320.0 |
99504 | 66.30 | 300.0 |
313173 | 66.71 | 267.0 |
311198 | 69.46 | 225.0 |
435692 rows × 2 columns
import seaborn as sns
sns.relplot(x="trip_distance", y="fare_amount", data=df.sample(n=6000, axis=0))
<seaborn.axisgrid.FacetGrid at 0x7f364b554760>
sns.relplot(x="trip_distance", y="tip_amount", col="passenger_count", data=df.sample(n=6000, axis=0))
<seaborn.axisgrid.FacetGrid at 0x7f364b334c40>
sns.relplot(x="fare_amount", y="tip_amount", data=df.sample(n=6000, axis=0))
<seaborn.axisgrid.FacetGrid at 0x7f364b466f10>
neg_fares = df[df["fare_amount"] < 0]
sns.relplot(x="trip_distance", y="fare_amount", data=neg_fares)
<seaborn.axisgrid.FacetGrid at 0x7f3647efc700>
(df["fare_amount"] + df["tip_amount"]).sum()
6611198.3599999985
(df["fare_amount"] + df["tip_amount"]).mean()
15.174018251425315
(df["fare_amount"] + df["tip_amount"]).sum() / df["trip_distance"].sum()
5.037009795215846
neg_fares["tip_amount"].plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x7f3647ce5c10>