Lecture 5 - Probability, Conditional Probability, Independence, and Prediction¶

Announcements¶

  • Quiz 1 graded; you should have gotten an email from Gradescope; about half of you have viewed your quiz.

    • Grades will be entered into Canvas about a week after they're released on Gradescope
    • Please check for grading errors this week and let me know if you find any.
  • Data Ethics 1 due Wednesday; discussion in class.

  • Tomorrow from 11 a.m. to 3 p.m. in KB 122, the CS department is hosting an AI Panel

    • 10 experts from industry will answer questions about AI in the workplace, present and future
    • Some impressive panelists - including senior people from Microsoft, Dell, Snowflake, and a VP of engineering at Meta
    • Schedule:
      • 11 a.m. - noon: Panel 1: AI in the Workplace - the present
      • Noon - 1 p.m.: Networking lunch for students and panelists
      • 1 - 2 p.m.: Panel 2: AI in the Workplace - the future
      • 2 - 3 p.m.: Networking open session

Goals:¶

  • Know how to compute and interpret basic summary statistics (Skiena 2.2):

    • Centrality measures: Arithmetic mean, Geometric Mean, Median, (Mode)
    • Variability measures: Standard Deviation, Variance
  • Know how to compute summary statistics in pandas.

  • Know the definition and implications of independence

    • Know the definition of, and have intuition for, conditional probability
    • Wrongly assuming independence can lead to poor modeling/predictions
    • A lack of independence leads to correlations, which is where modeling/predictive power comes from
  • Think about what makes a good data science question and practice formulating some of your own. (Skiena 1.2)

Probability and Statistics, Continued¶

As a reminder,

  • Probability: how we model real-world data generating processes
  • Statistics: how we infer things about those processes (e.g., estimate the parameters of the model)

There are many dual concepts:

  • We saw last time that we can use a histogram as an estimate of the probability density function.
In [1]:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(6,3))
# probability density function of a fair coin toss:
axs[0].bar(["0", "1"], [0.5, 0.5])
axs[0].set_xlabel("V(s)")
axs[0].set_ylabel("P(s)")

# histogram of 10,000 fair coin tosses:
import random
N = 10000
outcomes = []
for i in range(N):
    outcomes.append(random.choice(("H", "T")))

n_heads = 0
n_tails = 0
for out in outcomes:
    if out == "H":
        n_heads += 1
    if out == "T":
        n_tails += 1

axs[1].bar(["0", "1"], [n_tails, n_heads])
axs[1].set_xlabel("Outcome")
axs[1].set_ylabel("Frequency")
plt.tight_layout()
[Figure: bar chart of the fair-coin distribution (left) and a frequency histogram of 10,000 simulated tosses (right)]
  • The mean of a dataset gives an estimate of the expected value of the random variable being measured.
  • (and many more)

Summary Statistics¶

Real-world experiments of interest are more complicated than coin flips: the processes are messier and the sets of outcomes are larger. It's often useful to summarize the salient properties of an observed distribution. For this, we use summary statistics.

Central Tendency Measures¶

These tell you something about where the data is "centered".

(Arithmetic) Mean, aka "average": The sum of the values divided by the number of values: $$\mu_X = \frac{1}{n} \sum_{i=1}^n x_i$$.

This works well for data sets where there aren't many outliers; for example: the average height of a female American is 5 feet 4 inches.

Geometric Mean: The $n$th root of the product of $n$ values: $$\left(\prod_{i=1}^n a_i\right)^\frac{1}{n}$$

This is a weird one, and not as often applicable. If you have a single zero, the geometric mean is zero. But it's useful for measuring the central tendency of a collection of ratios.
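For example (a sketch with made-up numbers): suppose an investment grows by factors of 1.10, 0.90, and 1.05 over three years. The geometric mean of those ratios gives the constant yearly growth factor that would produce the same overall result:

In [ ]:
# Hypothetical yearly growth factors (ratios) - made-up numbers:
growth = [1.10, 0.90, 1.05]

# Geometric mean: the nth root of the product of n values.
product = 1.0
for g in growth:
    product *= g
print(product ** (1 / len(growth)))  # ~1.013: equivalent constant yearly growth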

Median: The middle value - the element appearing exactly in the middle when the data are sorted. This is useful in the presence of outliers or more generally when the distribution is weirdly-shaped.

*-iles

These generalize the median to fractions other than one half. For example, the five quartiles of a dataset are the minimum, the value that is larger than one quarter of the data, the median, the value that is larger than three quarters of the data, and the maximum.

Common examples aside from quartiles include percentiles (divide the data into 100ths), deciles (10ths), and quintiles (fifths).
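In pandas, all of these come from the quantile method; a quick sketch with made-up values:

In [ ]:
import pandas as pd

s = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34])  # made-up data
print(s.median())                            # 5.0, same as s.quantile(0.5)
print(s.quantile([0, 0.25, 0.5, 0.75, 1]))   # the five quartiles described above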

Variability Measures¶

These tell you something about the spread of the data, i.e., how far measurements tend to be from the center.

Standard Deviation ($\sigma$): The square root of the average squared difference between the elements and the mean: $$\sqrt{\frac{\sum_{i=1}^n (a_i - \bar{a})^2}{n-1}}$$

Variance: the square of the Standard Deviation (i.e., same thing without the square root).

Variance is easier to intuit: it's the average squared distance from the mean, with the small caveat that we divide by $n-1$ rather than by $n$.
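To make that caveat concrete, here's a sketch that computes the variance and standard deviation by hand and checks them against pandas (whose var and std also divide by $n-1$ by default):

In [ ]:
import pandas as pd

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean, divided by n-1:
var = sum((x - mean) ** 2 for x in data) / (n - 1)
std = var ** 0.5

s = pd.Series(data)
print(var, s.var())  # both ~4.571
print(std, s.std())  # both ~2.138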

Summary Statistics in Pandas¶

There are built-in functions that do all of the above for us. To demo, we'll use a dataset of body measurements from a sample of humans.

In [3]:
import pandas as pd

Load the data and do a little tidying:

In [4]:
url = '/cluster/academic/DATA311/202620/NHANES/NHANES.csv'
df = pd.read_csv(url).rename(
   columns={"SEQN": "SEQN",
            "RIAGENDR": "Gender", # 1 = M, 2 = F
            "RIDAGEYR": "Age", # years
            "BMXWT": "Weight", # kg
            "BMXHT": "Height", # cm
            "BMXLEG": "Leg", # cm
            "BMXARML": "Arm", # cm
            "BMXARMC": "Arm Cir", # cm
            "BMXWAIST": "Waist Cir"} # cm
)
df
Out[4]:
SEQN Gender Age Weight Height Leg Arm Arm Cir Waist Cir
0 93703.0 2.0 2.0 13.7 88.6 NaN 18.0 16.2 48.2
1 93704.0 1.0 2.0 13.9 94.2 NaN 18.6 15.2 50.0
2 93705.0 2.0 66.0 79.5 158.3 37.0 36.0 32.0 101.8
3 93706.0 1.0 18.0 66.3 175.7 46.6 38.8 27.0 79.3
4 93707.0 1.0 13.0 45.4 158.4 38.1 33.8 21.5 64.1
... ... ... ... ... ... ... ... ... ...
8699 102952.0 2.0 70.0 49.0 156.5 34.4 32.6 25.1 82.2
8700 102953.0 1.0 42.0 97.4 164.9 38.2 36.6 40.6 114.8
8701 102954.0 2.0 41.0 69.1 162.6 39.2 35.2 26.8 86.4
8702 102955.0 2.0 14.0 111.9 156.6 39.2 35.0 44.5 113.5
8703 102956.0 1.0 38.0 111.5 175.8 42.5 38.0 40.0 122.0

8704 rows × 9 columns

In [5]:
df["Height"].plot.hist()
Out[5]:
<Axes: ylabel='Frequency'>
[Figure: histogram of the Height column]

We can see the names and datatypes of all the columns with info:

In [6]:
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 8704 entries, 0 to 8703
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   SEQN       8704 non-null   float64
 1   Gender     8704 non-null   float64
 2   Age        8704 non-null   float64
 3   Weight     8580 non-null   float64
 4   Height     8016 non-null   float64
 5   Leg        6703 non-null   float64
 6   Arm        8177 non-null   float64
 7   Arm Cir    8173 non-null   float64
 8   Waist Cir  7601 non-null   float64
dtypes: float64(9)
memory usage: 612.1 KB

To compute a useful collection of summary statistics on each column, we can use describe:

In [7]:
df.describe()
Out[7]:
SEQN Gender Age Weight Height Leg Arm Arm Cir Waist Cir
count 8704.000000 8704.000000 8.704000e+03 8580.000000 8016.000000 6703.000000 8177.000000 8173.000000 7601.000000
mean 98315.452091 1.509076 3.443865e+01 65.138508 156.593401 38.643980 33.667996 29.193589 89.928851
std 2669.112899 0.499946 2.537904e+01 32.890754 22.257858 4.158013 7.229185 7.970648 22.805093
min 93703.000000 1.000000 5.397605e-79 3.200000 78.300000 24.800000 9.400000 11.200000 40.000000
25% 96000.750000 1.000000 1.100000e+01 43.100000 151.400000 35.800000 32.000000 23.800000 73.900000
50% 98308.500000 2.000000 3.100000e+01 67.750000 161.900000 38.800000 35.800000 30.100000 91.200000
75% 100625.250000 2.000000 5.800000e+01 85.600000 171.200000 41.500000 38.400000 34.700000 105.300000
max 102956.000000 2.000000 8.000000e+01 242.600000 197.700000 55.000000 49.900000 56.300000 169.500000

Poke around! Some ideas (a few are sketched below):

  • mean, median, min, max per column
  • .plot.hist per column
  • Perform some unit conversions
  • Extract a subset of rows meeting a criterion
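For instance, a few of those ideas might look like this (a sketch; Height_in is a hypothetical new column name):

In [ ]:
# Mean and median of one column:
print(df["Height"].mean(), df["Height"].median())

# Histogram of one column:
df["Weight"].plot.hist()

# Unit conversion: centimeters to inches
df["Height_in"] = df["Height"] / 2.54

# Rows meeting a criterion: people taller than 180 cm
tall = df[df["Height"] > 180]
print(len(tall))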

Exercise 1: Standard deviation is related to the mean, and thus sensitive to outliers. Can you devise a variability measure that would be more robust to outliers?


Conditional Probability and Independence¶

Simple probability experiments like rolling two dice produce math that's friendly and easy to work with. The outcome of one die doesn't affect the outcome of the other.

Real life is rarely so simple.

A joint probability distribution on two random variables $P(X, Y)$ is the probability of each possible combination of values of $X$ and $Y$.

If $X$ is the number on the first die and $Y$ is the number on the second die, $P(X,Y)$ has a friendly property: $$P(X, Y) = P(X) P(Y)$$

Let's see why with our physically-improbable three-sided dice (written notes).

In convincing ourselves of the above property, we sneakily started talking about conditional probability: the probability of something given that you know something else. In our dice example, the probability of die 2 being 1 given that die 1 was a 1 was $1/3$. We write this in math as: $$P(Y=1 | X = 1) = 1/3$$ where the vertical bar $|$ is read as "given", or "conditioned on".

Independence is the property we described above: two events are independent if the outcome of one doesn't affect, or change our understanding of, the probability of the other.

Another way to view independence is that the conditional probability is equal to the unconditional probability. For example, $$P(Y = 1 | X = 1) = 1/3 = P(Y = 1)$$ The information that $X = 1$ doesn't add anything to our understanding of the situation.
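We can check this empirically with a quick simulation of two fair three-sided dice. This is a sketch; since the outcomes are random, expect both printed values to be near $1/3$ rather than exactly $1/3$:

In [ ]:
import random

N = 100_000
x = [random.randint(1, 3) for _ in range(N)]  # die 1
y = [random.randint(1, 3) for _ in range(N)]  # die 2

# Unconditional: P(Y = 1)
p_y1 = y.count(1) / N

# Conditional: P(Y = 1 | X = 1), estimated from only the trials where X = 1
y_when_x1 = [yi for xi, yi in zip(x, y) if xi == 1]
p_y1_given_x1 = y_when_x1.count(1) / len(y_when_x1)

print(p_y1, p_y1_given_x1)  # both close to 1/3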

When are events not independent? Most of the time.

An abstract, probability-theory-type example might be: you flip a fair coin; if it comes up heads, you roll a fair three-sided die, but if it's tails you roll a weighted three-sided die whose probability of coming up 1 is 0.6, while the probabilities of coming up 2 or 3 are 0.2 each.

Exercise 1: Let $C$ be the outcome of the coin flip and $D$ be the outcome of the die roll. Write down the full joint distribution $P(C, D)$ for this experiment. I've given you the first one:

  • $P(C=H, D=1) = 1/6$
  • $P(C=H, D=2) = \hspace{1.6em}$
  • $P(C=H, D=3) = \hspace{1.6em}$
  • $P(C=T, D=1) = \hspace{1.6em}$
  • $P(C=T, D=2) = \hspace{1.6em}$
  • $P(C=T, D=3) = \hspace{1.6em}$

Exercise 2: What is $P(D=1 | C=T)$?

Exercise 3: What is $P(D=1)$?
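Once you've worked these out on paper, you can sanity-check your answers with a simulation (a sketch; the empirical frequencies should approach your exact probabilities as N grows):

In [ ]:
import random
from collections import Counter

N = 100_000
counts = Counter()
for _ in range(N):
    coin = random.choice("HT")
    if coin == "H":
        die = random.choice([1, 2, 3])  # fair die
    else:
        # weighted die: P(1) = 0.6, P(2) = P(3) = 0.2
        die = random.choices([1, 2, 3], weights=[0.6, 0.2, 0.2])[0]
    counts[(coin, die)] += 1

for outcome in sorted(counts):
    print(outcome, counts[outcome] / N)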

Less abstractly...¶

A fundamental assumption that data scientists implicitly make is that their data is generated by some process that follows the laws of probability. Let's look at a dataset and think about these concepts in terms of some of the columns therein.

This dataset is called NHANES and it consists of a bunch of different body measurements from a population of people. The code below loads the dataset, renames the columns with more sensible labels, and drops the unique identifier for each person (SEQN), which we don't need.

In [8]:
import pandas as pd
data_url = "/cluster/academic/DATA311/202620/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')

Remember that histograms are like empirical estimates of probability distributions. If we think of two columns as random variables measured in an experiment in which "a human grows to be an adult", we can similarly think about the empirical estimate of the joint probability distribution, whether the columns are independent, and the conditional probability of one column's value given the other.

First, let's filter out children:

In [10]:
df = df[df["Age"] >= 21]

Now let's consider the age and height columns. I'm going to use a nifty visualization from the Seaborn library, which we'll use when we dig deeper into visualization:

In [11]:
import seaborn as sns

sns.jointplot(x="Age", y="Height", data=df)
Out[11]:
<seaborn.axisgrid.JointGrid at 0x14611b2ff1a0>
[Figure: Seaborn jointplot of Age vs. Height: a scatter plot with marginal histograms]

From the scatter plot, it doesn't appear that these variables have much to do with each other. This matches our intuition - once they reach adulthood, people don't grow or shrink (much) in height as they age. This leads to a hypothesis that these variables are independent. We can think about this in terms of conditional probability by seeing if the distribution of heights conditioned on each age is the same as the unconditional distribution.

The following plot shows our empirical estimate of $P($Height$)$ alongside an empirical estimate of $P($Height $|$ the person is in their 20s$)$:

In [12]:
df["Height"].plot.hist(legend=True, label="P(Height)", density=True)
twenties = df[(df["Age"] >= 21) & (df["Age"] < 30)]
twenties["Height"].plot.hist(legend=True, label="P(height|20s)", density=True)
Out[12]:
<Axes: ylabel='Frequency'>
[Figure: overlaid density-normalized histograms of Height and of Height given age in the 20s]

The density=True argument tells the plotting library to normalize the histogram so that the total area of the bars is one. Instead of raw counts, we get probability-density-like values, which makes the $y$ axis scale comparable across two histograms with different numbers of observations.
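A quick way to see what's going on under the hood (a sketch using numpy's histogram function directly):

In [ ]:
import numpy as np

heights = df["Height"].dropna()
density, edges = np.histogram(heights, bins=10, density=True)

# The bar *areas* (height times bin width) sum to one, not the bar heights:
widths = np.diff(edges)
print((density * widths).sum())  # very close to 1.0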

Let's pick a different pair of columns and do a similar analysis:

In [13]:
sns.jointplot(x="Height", y="Leg", data=df)
Out[13]:
<seaborn.axisgrid.JointGrid at 0x14611aa08aa0>
[Figure: Seaborn jointplot of Height vs. Leg]
In [14]:
df["Leg"].plot.hist(legend=True, label="P(Leg)", density=True)
Out[14]:
<Axes: ylabel='Frequency'>
[Figure: density-normalized histogram of Leg]
In [15]:
for height in [140, 160, 180]:
    mask = (df["Height"] >= height) & (df["Height"] < height+20)
    label = f"P(Leg | {height} <= Ht < {height+20})"
    df[mask]["Leg"].plot.hist(legend=True, label=label,  density=True)
[Figure: density-normalized histograms of Leg conditioned on three Height ranges]

These columns are decidedly not independent! There is a strong correlation between them (we'll come back to that word and use it more formally later on). In other words, $P(Leg | Height) \ne P(Leg)$. That means that if we know Height, we have a better idea of what to expect from Leg. This forms the basis of our ability to make predictions from data!
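As a tiny illustration of what "prediction" means here (a sketch, not a real model): if we know nothing, our best guess for Leg is its overall mean; if we know the person's Height, the conditional mean within that height range is a better guess:

In [ ]:
# Unconditional best guess for Leg:
print(df["Leg"].mean())

# A better guess if we know Height is between 180 and 200 cm:
mask = (df["Height"] >= 180) & (df["Height"] < 200)
print(df.loc[mask, "Leg"].mean())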

The key insight I want you to take away from this is that the presence of correlations - aka a lack of independence - in the data yields predictive power. Two implications:

  • If there are no correlations, we're unlikely to be able to make predictions.
  • The correlation between leg length and height is clearly due to a real, underlying relationship between the two. Not all correlations are, though! It's important to keep in mind that when we find a correlation, we have only failed to rule out the possibility of a causal relationship between the two variables; we have not confirmed its existence. In other words: correlation does not imply causation.

Asking interesting data questions¶

  • Consider a couple of example datasets:

  • IMDB: All things movies. https://developer.imdb.com/non-commercial-datasets/

    • Films: title, duration, genre tags, date, cast, crew, user ratings, critic ratings, ...
    • People (actors, directors, writers, producers, crew): appearances/credits, birth/death dates, height, awards, ...
  • Boston Bike Share data: https://www.bluebikes.com/system-data

    • Trips: Trip Duration, Start Time and Date, Stop Time and Date, Start Station Name & ID, End Station Name & ID, Bike ID, User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member), Birth Year, Gender (self-reported by member)
    • Stations: ID, Name, GPS coordinates, # docks
  • Exercise, as a table: Come up with at least one interesting question you might want to answer with each of these datasets.

Ideas from the class:¶

IMDB

Bikes

Insight: Datasets can often answer questions that the data isn't directly about.

  • Example: baseball stats (http://www.baseball-reference.com) has details about ~20,000 major league baseball players over the last 150 years. This data includes handedness and birth/death dates, so we can study a possible link between handedness and lifespan.