Lecture 5 - Probability, Conditional Probability, Independence, and Prediction¶

Announcements¶

Talks:

  • Thu 1/19 4pm CF 105 Ryan Bockmon - Research (CS education)
  • Fri 1/20 4pm CF 316 Ryan Bockmon - Teaching

Don't forget to take Quiz 2 between Friday's class and Monday's class.

Quiz 1 Frequently Missed Questions:

  • Statistics is concerned with deriving models from observed data, whereas probability is concerned with modeling how data is generated in theory.
  • A pandas Series is most similar to which basic python data structure?
In [1]:
import pandas as pd
pd.Series([1, 2, 3, 4])
Out[1]:
0    1
1    2
2    3
3    4
dtype: int64

Goals:¶

  • Think about what makes a good data science question and practice formulating some of your own. (Skiena 1.2)

  • Know the terminology and properties of basic probability (Skiena 2.1):

    • Experiment; Outcome; Sample Space; Event; Probability; Random Variable; Expected Value
  • Know the definition and implications of independence

    • Know the definition of, and have intuition for, conditional probability
    • Wrongly assuming independence can lead to poor modeling/predictions
    • A lack of independence leads to correlations, which is where modeling/predictive power comes from

Asking interesting data questions¶

  • Consider a couple example datasets:

  • IMDB: All things movies. https://www.imdb.com/interfaces/

    • Films: title, duration, genre tags, date, cast, crew, user ratings, critic ratings, ...
    • People (actors, directors, writers, producers, crew): appearances/credits, birth/death dates, height, awards, ...
  • Boston Bike Share data: https://www.bluebikes.com/system-data

    • Trips: Trip Duration, Start Time and Date, Stop Time and Date, Start Station Name & ID, End Station Name & ID, Bike ID, User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member), Birth Year, Gender (self-reported by member)
    • Stations: ID, Name, GPS coordinates, # docks
  • Think-pair-share: Come up with one interesting question you might want to answer with each of these datasets.

Ideas from the class:¶

IMDB

  • Age of director when film released vs critical ratings and user ratings - do older directors make "better" films?
    • Broken out by genre
  • Do tall actors get more awards?
  • Do films with certain words in the title get higher ratings?
  • Are producers/directors racist?
  • Average rating for movies per actor

Bikes

  • What is the average trip duration per birth year?
  • Are younger riders more likely to be casual users or members?
  • What neighborhood of Boston has the longest trips coming from it?
  • What are the most common start and endpoints for rides? What user types use which stations?

Insight: Datasets can often answer questions that the data isn't directly about.

  • Example: baseball stats (http://www.baseball-reference.com) has details about ~20,000 major league baseball players over the last 150 years. This data includes handedness and birth/death dates, so we can study a possible link between handedness and lifespan.

Probability - Basics¶

  • Probability is actually hard to define! But it's easy to have intuition about.
  • Also straightforward to write down its properties (i.e., how it behaves)

First, need some terminology.

  • An experiment is a process that results in one of a set of possible outcomes.

  • The sample space ($S$) of the experiment is the set of all possible outcomes.

  • An event ($E$) is a subset of the outcomes.

  • The probability of an outcome $s$ is written $P(s)$ and has these properties:

    • $P(s)$ is between 0 and 1: $0 \le P(s) \le 1$.
    • The sum of probabilities of all outcomes is exactly 1: $$\sum_{s \in S} P(s) = 1$$
  • A random variable ($V$) is a function that maps an outcome to a number.

  • The expected value $E(V)$ of a random variable $V$ is the sum of the probability of each outcome times the random variable's value at that outcome: $$E(V) = \sum_{s \in S} P(s) \cdot V(s)$$

If we run an experiment where we toss a fair coin, the sample space contains the outcomes $\{H, T\}$ representing heads and tails. The coin is fair, so the probability of each outcome is 0.5, which satisfies both of the properties above.

Suppose you made a deal with a friend to toss a coin, and if it comes up heads, your friend gives you a dollar. If it comes up tails, no money changes hands. The random variable $V$ that's relevant to your wallet is $V(H) = 1, V(T) = 0$. The expected value of this random variable is $V(H) * P(H) + V(T) * P(T) = 0.5$, which you can think of as the amount of money you would expect to earn per flip, on average, if you repeated this experiment many, many times.
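
To make this concrete, here's a quick sketch (my own illustration, not part of the deal above) that computes $E(V)$ directly from the definition and compares it to the average winnings over many simulated flips:

In [ ]:
import random

# Probabilities and random-variable values for the coin-flip deal above
P = {"H": 0.5, "T": 0.5}
V = {"H": 1, "T": 0}

# Expected value: sum over outcomes of P(s) * V(s)
expected = sum(P[s] * V[s] for s in P)
print("E(V) =", expected)

# Simulate many flips; the average winnings should land near E(V)
flips = [random.choice(["H", "T"]) for _ in range(100_000)]
print("average winnings over simulated flips:", sum(V[f] for f in flips) / len(flips))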

Exercise: describe the rolling of a six-sided die using the same terminology as above. For a random variable, use the number on the die itself; find the expected value.

Probability Distributions¶

The expected value is one important property of a random variable, but if we want the whole story, we need to look at its probability density function (PDF): a graph with the random variable's values on the $x$ axis and the probability of each value occurring on the $y$ axis.

Here's the PDF of the random variable described above:

In [ ]:
import matplotlib.pyplot as plt
plt.bar(["0", "1"], [0.5, 0.5])
plt.xlabel("V(s)")
plt.ylabel("P(s)")
Out[ ]:
Text(0, 0.5, 'P(s)')

Exercise: Draw the PDF for the random variable that gives the value of a loaded five-sided die that comes up 1 with probability 0.6 and has an equal chance of each of the remaining four faces.

Conditional Probability and Independence¶

Simple probability experiments like rolling two dice produce math that's friendly and easy to work with. The outcome of one die doesn't affect the outcome of the other.

Real life is rarely so simple.

A joint probability distribution on two random variables $P(X, Y)$ is the probability of each possible combination of values of $X$ and $Y$.

If $X$ is the number on the first die and $Y$ is the number on the second die, $P(X,Y)$ has a friendly property: $$P(X, Y) = P(X) P(Y)$$

Let's see why with our physically-improbable three-sided dice (written notes).
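
To complement the written notes, here's a small sketch (my own illustration, assuming two fair three-sided dice) that enumerates the joint distribution and checks that every joint probability equals the product of the marginals:

In [ ]:
from itertools import product

faces = [1, 2, 3]
P_X = {x: 1/3 for x in faces}  # marginal distribution of die 1
P_Y = {y: 1/3 for y in faces}  # marginal distribution of die 2

# Each of the 9 equally likely (x, y) combinations has probability 1/9
P_XY = {(x, y): 1/9 for x, y in product(faces, faces)}

# Independence: P(X=x, Y=y) == P(X=x) * P(Y=y) for every combination
all(abs(P_XY[(x, y)] - P_X[x] * P_Y[y]) < 1e-12 for (x, y) in P_XY)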

In convincing ourselves of the above property, we sneakily started talking about conditional probability: the probability of something given that you know something else. In our dice example, the probability of die 2 being 1 given that die 1 was a 1 was $1/3$. We write this in math as: $$P(Y=1 | X = 1) = 1/3$$ where the vertical bar $|$ is read as "given" or "conditioned on".

Independence is the property we described above: two events are independent if the outcome of one event doesn't affect, or change our understanding of the probability of another.

Another way to view independence is that the conditional probability is equal to the unconditional probability. For example, $$P(Y = 1 | X = 1) = 1/3 = P(Y = 1)$$ The information that $X = 1$ doesn't add anything to our understanding of the situation.
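
Continuing that sketch (again, just an illustration with two fair three-sided dice), we can compute the conditional probability from the joint distribution and confirm that it equals the unconditional probability:

In [ ]:
faces = [1, 2, 3]
P_XY = {(x, y): 1/9 for x in faces for y in faces}  # joint dist. of two fair dice

# P(Y=1 | X=1) = P(X=1, Y=1) / P(X=1)
P_X1 = sum(P_XY[(1, y)] for y in faces)   # P(X=1) = 1/3
P_Y1_given_X1 = P_XY[(1, 1)] / P_X1       # = 1/3
P_Y1 = sum(P_XY[(x, 1)] for x in faces)   # P(Y=1) = 1/3

P_Y1_given_X1, P_Y1  # equal, as independence predicts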

When are events not independent? Most of the time.

An abstract, probability-theory-type example might be: you flip a fair coin; if the coin comes up heads, you roll a fair three-sided die, but if it comes up tails you roll a weighted three-sided die whose odds of coming up 1 are 0.6, while the odds of coming up 2 or 3 are 0.2 each.

Exercise 1: Let $C$ be the outcome of the coin flip and $D$ be the outcome of the die roll. write down the full joint distribution $P(C, D)$ for this experiment. I've given you the first one: $$P(C=H, D=1) = 1/6\\ P(C=H, D=2) = \hspace{1.6em}\\ P(C=H, D=3) = \hspace{1.6em}\\ P(C=T, D=1) = \hspace{1.6em}\\ P(C=T, D=2) = \hspace{1.6em}\\ P(C=T, D=3) = \hspace{1.6em}$$

Exercise 2: What is $P(D=1 | C=T)$?

Exercise 3: What is $P(D=1)$?

Less abstractly...¶

A fundamental assumption that data scientists implicitly make is that their data is generated by some process that follows the laws of probability. Let's look at a dataset and think about these concepts in terms of some of the columns therein.

This dataset is called NHANES and it consists of a bunch of different body measurements from a population of people. The code below loads the dataset, renames the columns with more sensible labels, and drops the unique identifier for each person (SEQN), which we don't need.

In [3]:
import pandas as pd
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')

Remember that histograms are like empirical estimates of probability distributions. If we think of two columns as random variables giving the outcomes of an experiment in which "a human grows to be an adult", we can similarly think about the empirical estimate of the joint probability distribution, whether the columns are independent, and the conditional probability of one column's value given the other.

First, let's filter out children:

In [4]:
df = df[df["Age"] >= 21]

Now let's consider the age and height columns. I'm going to use a nifty visualization from the Seaborn library, which we'll use when we dig deeper into visualization:

In [5]:
import seaborn as sns

sns.jointplot(x="Age", y="Height", data=df)
Out[5]:
<seaborn.axisgrid.JointGrid at 0x7f4967eff760>

From the scatter plot, it doesn't appear that these variables have much to do with each other. This matches our intuition - once they reach adulthood, people don't grow or shrink (much) in height as they age. This leads to a hypothesis that these variables are independent. We can think about this in terms of conditional probability by seeing if the distribution of heights conditioned on each age is the same as the unconditional distribution.

The following plot shows our empirical estimate of $P($Height$)$ alongside an empirical estimate of $P($Height $|$ the person is in their 20s$)$:

In [6]:
df["Height"].plot.hist(legend=True, label="P(Height)", density=True)
twenties = df[(df["Age"] >= 21) & (df["Age"] < 30)]
twenties["Height"].plot.hist(legend=True, label="P(height|20s)", density=True)
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4950cc82b0>

The density=True argument tells the plotting library to normalize the histogram so that, instead of raw counts, the bar heights represent probability densities whose total bar area is one. This makes the $y$ axis scale comparable across histograms built from different numbers of rows.
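
If you want to convince yourself of this, here's a quick sketch (using the df loaded above, plus numpy, which isn't imported elsewhere in these notes) that computes the same normalized histogram and checks that the bar areas total one:

In [ ]:
import numpy as np

# With density=True, bar heights are densities; heights times bin widths sum to 1
heights, edges = np.histogram(df["Height"].dropna(), bins=10, density=True)
(heights * np.diff(edges)).sum()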

Let's pick a different pair of columns and do a similar analysis:

In [7]:
sns.jointplot(x="Height", y="Leg", data=df)
Out[7]:
<seaborn.axisgrid.JointGrid at 0x7f495299f4c0>
In [8]:
df["Leg"].plot.hist(legend=True, label="P(Leg)", density=True)
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4950a2f310>
In [9]:
for height in [140, 160, 180]:
    mask = (df["Height"] >= height) & (df["Height"] < height+20)
    label = f"P(Leg | {height} <= Ht < {height+20})"
    df[mask]["Leg"].plot.hist(legend=True, label=label,  density=True)

These columns are decidedly not independent! There is a strong correlation between them (we'll come back to that word and use it more formally later on). In other words, $P(Leg | Height) \ne P(Leg)$. That means that if we know Height, we have a better idea of what to expect from Leg. This forms the basis of our ability to make predictions from data!
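
As a preview of that more formal treatment (just a sketch on the DataFrame loaded above; the exact number depends on the data, so I won't claim a value here), pandas can compute the correlation between the two columns directly:

In [ ]:
# Pearson correlation matrix for Height and Leg; values near +1 indicate a
# strong positive linear relationship, values near 0 indicate little relationship
df[["Height", "Leg"]].corr()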

The key insight I want you to take away from this is that the presence of correlations - aka a lack of independence - in the data yields predictive power. Two implications:

  • If there are no correlations, we're unlikely to be able to make predictions.
  • The correlation between leg length and height is clearly due to a real, underlying relationship between height and leg length. Not all correlations are, though! It's important to keep in mind that when we find a correlation, we have only failed to rule out the possibility of a causal relationship between the two variables; we have not confirmed its existence. In other words: correlation does not imply causation.