Lecture 5 - Probability, Conditional Probability, Independence, and Prediction¶

Announcements¶

  • Quiz 1 graded; you should have gotten an email from Gradescope; about half of you have viewed your quiz.

    • Grades will be entered into Canvas about a week after they're released on Gradescope
    • Please check for grading errors this week and let me know if you find any.
  • Data Ethics 1 due Wednesday; discussion in class.

  • Tomorrow from 11 a.m. to 3 p.m. in KB 122, the CS department is hosting an AI Panel

    • 10 experts from industry will answer questions about AI in the workplace, present and future
    • Some impressive panelists - including senior people from Microsoft, Dell, Snowflake, and a VP of engineering at Meta
    • Schedule:
      • 11 a.m. - noon: Panel 1: AI in the Workplace - the present
      • Noon - 1 p.m.: Networking lunch for students and panelists
      • 1 - 2 p.m.: Panel 2: AI in the Workplace - the future
      • 2 - 3 p.m.: Networking open session

Goals:¶

  • Know how to compute and interpret basic summary statistics (Skiena 2.2):

    • Centrality measures: Arithmetic mean, Geometric Mean, Median, (Mode)
    • Variability measures: Standard Deviation, Variance
  • Know how to compute summary statistics in pandas.

  • Know the definition and implications of independence

    • Know the definition of, and have intuition for, conditional probability
    • Wrongly assuming independence can lead to poor modeling/predictions
    • A lack of independence leads to correlations, which is where modeling/predictive power comes from
  • Think about what makes a good data science question and practice formulating some of your own. (Skiena 1.2)

Probability and Statistics, Continued¶

As a reminder,

  • Probability: how we model real-world data generating processes
  • Statistics: how we infer things about those processes (e.g., estimate the parameters of the model)

There are many dual concepts:

  • We saw last time that we can use a histogram as an estimate of the probability density function.
In [1]:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(1, 2, figsize=(6,3))
# probability density function of a fair coin toss:
axs[0].bar(["0", "1"], [0.5, 0.5])
axs[0].set_xlabel("V(s)")
axs[0].set_ylabel("P(s)")

# histogram of 10,000 fair coin tosses:
import random
N = 10000
outcomes = []
for i in range(N):
    outcomes.append(random.choice(("H", "T")))

n_heads = 0
n_tails = 0
for out in outcomes:
    if out == "H":
        n_heads += 1
    if out == "T":
        n_tails += 1

axs[1].bar(["0", "1"], [n_tails, n_heads])
axs[1].set_xlabel("Outcome")
axs[1].set_ylabel("Frequency")
plt.tight_layout()
[Figure: bar chart of the fair-coin distribution (left) and a frequency histogram of 10,000 simulated tosses (right)]
  • The mean of a dataset gives an estimate of the expected value of the random variable being measured.
  • (and many more)

Summary Statistics¶

Real-world experiments of interest are more complicated than coin flips: the processes are messier and the sets of outcomes are larger. It's often useful to summarize the salient properties of an observed distribution. For this, we use summary statistics.

Central Tendency Measures¶

These tell you something about where the data is "centered".

(Arithmetic) Mean, aka "average": The sum of the values divided by the number of values: $$\mu_X = \frac{1}{n} \sum_{i=1}^n x_i$$.

This works well for data sets where there aren't many outliers; for example: the average height of a female American is 5 feet 4 inches.

Geometric Mean: The $n$th root of the product of $n$ values: $$\left(\prod_{i=1}^n a_i\right)^\frac{1}{n}$$

This is a weird one, and not as often applicable. If you have a single zero, the geometric mean is zero. But it's useful for measuring the central tendency of a collection of ratios.
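For example (a sketch with made-up numbers): suppose an investment grows by factors of 1.10, 0.90, and 1.05 over three years. The geometric mean of those ratios gives the constant yearly growth factor that would produce the same overall result:

In [ ]:
# Hypothetical yearly growth factors (ratios) - made-up numbers:
growth = [1.10, 0.90, 1.05]

# Geometric mean: the nth root of the product of n values.
product = 1.0
for g in growth:
    product *= g
print(product ** (1 / len(growth)))  # ~1.013: equivalent constant yearly growth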

Median: The middle value - the element appearing exactly in the middle when the data are sorted. This is useful in the presence of outliers or more generally when the distribution is weirdly-shaped.

*-iles

These generalize the median to fractions other than one half. For example, the five quartiles of a dataset are the minimum, the value that is larger than one quarter of the data, the median, the value that is larger than three quarters of the data, and the maximum.

Common examples aside from quartiles include percentiles (divide the data into 100ths), deciles (10ths), and quintiles (fifths).
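In pandas, all of these come from the quantile method; a quick sketch with made-up values:

In [ ]:
import pandas as pd

s = pd.Series([1, 2, 2, 3, 5, 8, 13, 21, 34])  # made-up data
print(s.median())                            # 5.0, same as s.quantile(0.5)
print(s.quantile([0, 0.25, 0.5, 0.75, 1]))   # the five quartiles described above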

Variability Measures¶

These tell you something about the spread of the data, i.e., how far measurements tend to be from the center.

Standard Deviation ($\sigma$): The square root of the average squared difference between the elements and the mean: $$\sqrt{\frac{\sum_{i=1}^n (a_i - \bar{a})^2}{n-1}}$$

Variance: the square of the Standard Deviation (i.e., same thing without the square root).

Variance is easier to intuit: it's the average squared distance from the mean, with the small caveat that we divide by $n-1$ rather than by $n$.
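To make that caveat concrete, here's a sketch that computes the variance and standard deviation by hand and checks them against pandas (whose var and std also divide by $n-1$ by default):

In [ ]:
import pandas as pd

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data
n = len(data)
mean = sum(data) / n

# Sum of squared differences from the mean, divided by n-1:
var = sum((x - mean) ** 2 for x in data) / (n - 1)
std = var ** 0.5

s = pd.Series(data)
print(var, s.var())  # both ~4.571
print(std, s.std())  # both ~2.138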

Summary Statistics in Pandas¶

There are built-in functions that do all of the above for us. To demo, we'll use a dataset of body measurements from a sample of humans.

In [3]:
import pandas as pd

Load the data and do a little tidying:

In [4]:
url = '/cluster/academic/DATA311/202620/NHANES/NHANES.csv'
df = pd.read_csv(url).rename(
   columns={"SEQN": "SEQN",
            "RIAGENDR": "Gender", # 1 = M, 2 = F
            "RIDAGEYR": "Age", # years
            "BMXWT": "Weight", # kg
            "BMXHT": "Height", # cm
            "BMXLEG": "Leg", # cm
            "BMXARML": "Arm", # cm
            "BMXARMC": "Arm Cir", # cm
            "BMXWAIST": "Waist Cir"} # cm
)
df
Out[4]:
SEQN Gender Age Weight Height Leg Arm Arm Cir Waist Cir
0 93703.0 2.0 2.0 13.7 88.6 NaN 18.0 16.2 48.2
1 93704.0 1.0 2.0 13.9 94.2 NaN 18.6 15.2 50.0
2 93705.0 2.0 66.0 79.5 158.3 37.0 36.0 32.0 101.8
3 93706.0 1.0 18.0 66.3 175.7 46.6 38.8 27.0 79.3
4 93707.0 1.0 13.0 45.4 158.4 38.1 33.8 21.5 64.1
... ... ... ... ... ... ... ... ... ...
8699 102952.0 2.0 70.0 49.0 156.5 34.4 32.6 25.1 82.2
8700 102953.0 1.0 42.0 97.4 164.9 38.2 36.6 40.6 114.8
8701 102954.0 2.0 41.0 69.1 162.6 39.2 35.2 26.8 86.4
8702 102955.0 2.0 14.0 111.9 156.6 39.2 35.0 44.5 113.5
8703 102956.0 1.0 38.0 111.5 175.8 42.5 38.0 40.0 122.0

8704 rows × 9 columns

In [5]:
df["Height"].plot.hist()
Out[5]:
<Axes: ylabel='Frequency'>
[Figure: histogram of the Height column]

We can see the names and datatypes of all the columns with info:

In [6]:
df.info()
<class 'pandas.DataFrame'>
RangeIndex: 8704 entries, 0 to 8703
Data columns (total 9 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   SEQN       8704 non-null   float64
 1   Gender     8704 non-null   float64
 2   Age        8704 non-null   float64
 3   Weight     8580 non-null   float64
 4   Height     8016 non-null   float64
 5   Leg        6703 non-null   float64
 6   Arm        8177 non-null   float64
 7   Arm Cir    8173 non-null   float64
 8   Waist Cir  7601 non-null   float64
dtypes: float64(9)
memory usage: 612.1 KB

To compute a useful collection of summary statistics on each column, we can use describe:

In [7]:
df.describe()
Out[7]:
SEQN Gender Age Weight Height Leg Arm Arm Cir Waist Cir
count 8704.000000 8704.000000 8.704000e+03 8580.000000 8016.000000 6703.000000 8177.000000 8173.000000 7601.000000
mean 98315.452091 1.509076 3.443865e+01 65.138508 156.593401 38.643980 33.667996 29.193589 89.928851
std 2669.112899 0.499946 2.537904e+01 32.890754 22.257858 4.158013 7.229185 7.970648 22.805093
min 93703.000000 1.000000 5.397605e-79 3.200000 78.300000 24.800000 9.400000 11.200000 40.000000
25% 96000.750000 1.000000 1.100000e+01 43.100000 151.400000 35.800000 32.000000 23.800000 73.900000
50% 98308.500000 2.000000 3.100000e+01 67.750000 161.900000 38.800000 35.800000 30.100000 91.200000
75% 100625.250000 2.000000 5.800000e+01 85.600000 171.200000 41.500000 38.400000 34.700000 105.300000
max 102956.000000 2.000000 8.000000e+01 242.600000 197.700000 55.000000 49.900000 56.300000 169.500000

Poke around! Some ideas (a few are sketched below):

  • mean, median, min, max per column
  • .plot.hist per column
  • Perform some unit conversions
  • Extract a subset of rows meeting a criterion
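For instance, a few of those ideas might look like this (a sketch; Height_in is a hypothetical new column name):

In [ ]:
# Mean and median of one column:
print(df["Height"].mean(), df["Height"].median())

# Histogram of one column:
df["Weight"].plot.hist()

# Unit conversion: centimeters to inches
df["Height_in"] = df["Height"] / 2.54

# Rows meeting a criterion: people taller than 180 cm
tall = df[df["Height"] > 180]
print(len(tall))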

Exercise 1: Standard deviation is related to the mean, and thus sensitive to outliers. Can you devise a variability measure that would be more robust to outliers?


Conditional Probability and Independence¶

Simple probability experiments like rolling two dice produce math that's friendly and easy to work with. The outcome of one die doesn't affect the outcome of the other.

Real life is rarely so simple.

A joint probability distribution on two random variables $P(X, Y)$ is the probability of each possible combination of values of $X$ and $Y$.

If $X$ is the number on the first die and $Y$ is the number on the second die, $P(X,Y)$ has a friendly property: $$P(X, Y) = P(X) P(Y)$$

Let's see why with our physically-improbable three-sided dice (written notes).

In convincing ourselves of the above property, we sneakily started talking about conditional probability: the probability of something given that you know something else. In our dice example, the probability of die 2 being 1 given that die 1 was a 1 was $1/3$. We write this in math as: $$P(Y=1 | X = 1) = 1/3$$ where the vertical bar $|$ is read as "given", or "conditioned on".

Independence is the property we described above: two events are independent if the outcome of one doesn't affect, or change our understanding of, the probability of the other.

Another way to view independence is that the conditional probability is equal to the unconditional probability. For example, $$P(Y = 1 | X = 1) = 1/3 = P(Y = 1)$$ The information that $X = 1$ doesn't add anything to our understanding of the situation.
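We can check this empirically with a quick simulation of two fair three-sided dice. This is a sketch; since the outcomes are random, expect both printed values to be near $1/3$ rather than exactly $1/3$:

In [ ]:
import random

N = 100_000
x = [random.randint(1, 3) for _ in range(N)]  # die 1
y = [random.randint(1, 3) for _ in range(N)]  # die 2

# Unconditional: P(Y = 1)
p_y1 = y.count(1) / N

# Conditional: P(Y = 1 | X = 1), estimated from only the trials where X = 1
y_when_x1 = [yi for xi, yi in zip(x, y) if xi == 1]
p_y1_given_x1 = y_when_x1.count(1) / len(y_when_x1)

print(p_y1, p_y1_given_x1)  # both close to 1/3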

When are events not independent? Most of the time.

An abstract, probability-theory-type example might be: you flip a fair coin; if it comes up heads, you roll a fair three-sided die, but if it's tails you roll a weighted three-sided die whose probability of coming up 1 is 0.6, while the probabilities of coming up 2 or 3 are 0.2 each.

Exercise 1: Let $C$ be the outcome of the coin flip and $D$ be the outcome of the die roll. Write down the full joint distribution $P(C, D)$ for this experiment. I've given you the first one:

  • $P(C=H, D=1) = 1/6$
  • $P(C=H, D=2) = \hspace{1.6em}$
  • $P(C=H, D=3) = \hspace{1.6em}$
  • $P(C=T, D=1) = \hspace{1.6em}$
  • $P(C=T, D=2) = \hspace{1.6em}$
  • $P(C=T, D=3) = \hspace{1.6em}$

Exercise 2: What is $P(D=1 | C=T)$?

Exercise 3: What is $P(D=1)$?
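Once you've worked these out on paper, you can sanity-check your answers with a simulation (a sketch; the empirical frequencies should approach your exact probabilities as N grows):

In [ ]:
import random
from collections import Counter

N = 100_000
counts = Counter()
for _ in range(N):
    coin = random.choice("HT")
    if coin == "H":
        die = random.choice([1, 2, 3])  # fair die
    else:
        # weighted die: P(1) = 0.6, P(2) = P(3) = 0.2
        die = random.choices([1, 2, 3], weights=[0.6, 0.2, 0.2])[0]
    counts[(coin, die)] += 1

for outcome in sorted(counts):
    print(outcome, counts[outcome] / N)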

Less abstractly...¶

A fundamental assumption that data scientists implicitly make is that their data is generated by some process that follows the laws of probability. Let's look at a dataset and think about these concepts in terms of some of the columns therein.

This dataset is called NHANES and it consists of a bunch of different body measurements from a population of people. The code below loads the dataset, renames the columns with more sensible labels, and drops the unique identifier for each person (SEQN), which we don't need.

In [8]:
import pandas as pd
data_url = "/cluster/academic/DATA311/202620/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm

df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')

Remember that histograms are like empirical estimates of probability distributions. If we think of two columns as random variables measured in an experiment in which "a human grows to be an adult", we can similarly think about the empirical estimate of the joint probability distribution, whether the columns are independent, and the conditional probability of one column's value given the other.

First, let's filter out children:

In [10]:
df = df[df["Age"] >= 21]

Now let's consider the age and height columns. I'm going to use a nifty visualization from the Seaborn library, which we'll use when we dig deeper into visualization:

In [11]:
import seaborn as sns

sns.jointplot(x="Age", y="Height", data=df)
Out[11]:
<seaborn.axisgrid.JointGrid at 0x14611b2ff1a0>
[Figure: Seaborn jointplot of Age vs. Height: a scatter plot with marginal histograms]

From the scatter plot, it doesn't appear that these variables have much to do with each other. This matches our intuition - once they reach adulthood, people don't grow or shrink (much) in height as they age. This leads to a hypothesis that these variables are independent. We can think about this in terms of conditional probability by seeing if the distribution of heights conditioned on each age is the same as the unconditional distribution.

The following plot shows our empirical estimate of $P($Height$)$ alongside an empirical estimate of $P($Height $|$ the person is in their 20s$)$:

In [12]:
df["Height"].plot.hist(legend=True, label="P(Height)", density=True)
twenties = df[(df["Age"] >= 21) & (df["Age"] < 30)]
twenties["Height"].plot.hist(legend=True, label="P(height|20s)", density=True)
Out[12]:
<Axes: ylabel='Frequency'>
[Figure: overlaid density-normalized histograms of Height and of Height given age in the 20s]

The density=True argument tells the plotting library to normalize the histogram so that the total area of the bars is one. Instead of raw counts, we get probability-density-like values, which makes the $y$ axis scale comparable across two histograms with different numbers of observations.
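A quick way to see what's going on under the hood (a sketch using numpy's histogram function directly):

In [ ]:
import numpy as np

heights = df["Height"].dropna()
density, edges = np.histogram(heights, bins=10, density=True)

# The bar *areas* (height times bin width) sum to one, not the bar heights:
widths = np.diff(edges)
print((density * widths).sum())  # very close to 1.0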

Let's pick a different pair of columns and do a similar analysis:

In [13]:
sns.jointplot(x="Height", y="Leg", data=df)
Out[13]:
<seaborn.axisgrid.JointGrid at 0x14611aa08aa0>
[Figure: Seaborn jointplot of Height vs. Leg]
In [14]:
df["Leg"].plot.hist(legend=True, label="P(Leg)", density=True)
Out[14]:
<Axes: ylabel='Frequency'>
[Figure: density-normalized histogram of Leg]
In [15]:
for height in [140, 160, 180]:
    mask = (df["Height"] >= height) & (df["Height"] < height+20)
    label = f"P(Leg | {height} <= Ht < {height+20})"
    df[mask]["Leg"].plot.hist(legend=True, label=label,  density=True)
[Figure: density-normalized histograms of Leg conditioned on three Height ranges]

These columns are decidedly not independent! There is a strong correlation between them (we'll come back to that word and use it more formally later on). In other words, $P(Leg | Height) \ne P(Leg)$. That means that if we know Height, we have a better idea of what to expect from Leg. This forms the basis of our ability to make predictions from data!
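As a tiny illustration of what "prediction" means here (a sketch, not a real model): if we know nothing, our best guess for Leg is its overall mean; if we know the person's Height, the conditional mean within that height range is a better guess:

In [ ]:
# Unconditional best guess for Leg:
print(df["Leg"].mean())

# A better guess if we know Height is between 180 and 200 cm:
mask = (df["Height"] >= 180) & (df["Height"] < 200)
print(df.loc[mask, "Leg"].mean())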

The key insight I want you to take away from this is that the presence of correlations - aka a lack of independence - in the data yields predictive power. Two implications:

  • If there are no correlations, we're unlikely to be able to make predictions.
  • The correlation between leg length and height is clearly due to a real, underlying relationship between the two. Not all correlations are, though! It's important to keep in mind that when we find a correlation, we have only failed to rule out the possibility of a causal relationship between the two variables; we have not confirmed its existence. In other words: correlation does not imply causation.

Asking interesting data questions¶

  • Consider a couple of example datasets:

  • IMDB: All things movies. https://developer.imdb.com/non-commercial-datasets/

    • Films: title, duration, genre tags, date, cast, crew, user ratings, critic ratings, ...
    • People (actors, directors, writers, producers, crew): appearances/credits, birth/death dates, height, awards, ...
  • Boston Bike Share data: https://www.bluebikes.com/system-data

    • Trips: Trip Duration, Start Time and Date, Stop Time and Date, Start Station Name & ID, End Station Name & ID, Bike ID, User Type (Casual = Single Trip or Day Pass user; Member = Annual or Monthly Member), Birth Year, Gender (self-reported by member)
    • Stations: ID, Name, GPS coordinates, # docks
  • Exercise, as a table: Come up with at least one interesting question you might want to answer with each of these datasets.

Ideas from the class:¶

IMDB

Bikes

Insight: Datasets can often answer questions that the data isn't directly about.

  • Example: baseball stats (http://www.baseball-reference.com) has details about ~20,000 major league baseball players over the last 150 years. This data includes handedness and birth/death dates, so we can study a possible link between handedness and lifespan.