import pandas as pd
zero_two = [0 for i in range(10)] + [2 for i in range(100)]
df = pd.DataFrame(zero_two)
df.std(), df.mean()
(0 0.577591 dtype: float64, 0 1.818182 dtype: float64)
$P(Y=H | X=T) + P(Y=H | X=H) =1$
# [value for loop_var in collection]
[i**2 for i in range(1,10)]
[]
for _ in range(10):
    print("*", end="")
[0 for _ in range(10)]
**********
zero_two = [0 for i in range(10)] + [2 for i in range(10)]
df = pd.DataFrame(zero_two)
df.std()
import random
import seaborn as sns
def flip_coins(N, bias=0.5):
    """ Flip N coins, return a list of the results; bias is the probability of heads """
    return random.choices(["H", "T"], weights=[bias, 1-bias], k=N)
N = 100000
sns.histplot(flip_coins(N))
def roll_dice(N):
    """ Roll N fair 6-sided dice """
    return random.choices(range(1,7), k=N)
N = 1000000
sns.histplot(roll_dice(N), bins=6)
Probability mass function:
$P(X = x) = \frac{1}{K}$ where $K$ is the number of possible outcomes.
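As a quick check of the formula, here is a minimal sketch that simulates die rolls and compares each face's empirical frequency with the predicted $\frac{1}{K}$. The names `K`, `N`, `rolls`, and `freqs`, the sample size, and the fixed seed are all choices made for this illustration.

```python
import random
from collections import Counter

K = 6        # number of possible outcomes for a six-sided die
N = 60_000   # number of simulated rolls (arbitrary choice for illustration)

random.seed(0)  # fixed seed so the sketch is reproducible
rolls = random.choices(range(1, K + 1), k=N)

# Empirical frequency of each face; each should be close to 1/K
counts = Counter(rolls)
freqs = {face: counts[face] / N for face in range(1, K + 1)}

for face, freq in freqs.items():
    print(f"face {face}: empirical {freq:.4f}, predicted {1 / K:.4f}")
```

The frequencies will not match $\frac{1}{K}$ exactly, but the gap should shrink as `N` grows.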
Properties
Exercise: Can you think of something in real life that would be well modeled by a uniform distribution?
def n_heads(flips):
    """ Count the number of heads ("H") in a list of flips """
    return sum([1 if x == "H" else 0 for x in flips])
n = 2000
N = 10000
results = [n_heads(flip_coins(n, bias=.1)) for _ in range(N)]
sns.histplot(results, bins=10)
$P(X = x) = {n \choose x} p^x (1 - p)^{n-x}$ where:
Properties of the binomial distribution:
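Two standard properties, a mean of $np$ and a variance of $np(1-p)$, can be checked numerically. The sketch below evaluates the binomial formula directly; the helper `binomial_pmf` and the small values of `n` and `p` are assumptions made for illustration and are not the values used in the simulation.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials, each succeeding with probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 20, 0.1  # small illustrative values (hypothetical)
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

total = sum(pmf)                                   # should be 1
mean = sum(x * q for x, q in enumerate(pmf))       # should equal n * p
variance = sum((x - mean)**2 * q for x, q in enumerate(pmf))  # should equal n * p * (1 - p)

print(f"total={total:.6f}, mean={mean:.6f}, variance={variance:.6f}")
```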
Examples of things that are Gaussian distributed in practice*:
* This requires conditions because reality is never simple: for heights, we'd need to:
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm
df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')
df = df[df["Age"] >= 21]
sns.histplot(x="Height", data=df)
ax = sns.histplot(x="Arm", data=df, stat="density", bins=20)
sns.histplot(x="Leg", data=df, bins=20)
$P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$ where:
Properties:
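To make the density formula concrete, here is a minimal sketch that implements it and checks three basic facts numerically: the area under the curve is 1, the curve is symmetric about $\mu$, and the peak sits at $\mu$. The function name `gaussian_pdf` and the values of `mu` and `sigma` are hypothetical, chosen only for illustration.

```python
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

mu, sigma = 170.0, 10.0  # hypothetical values, not fitted to any data

# Approximate the area under the curve with a Riemann sum over mu +/- 6 sigma
step = 0.01
n_steps = round(12 * sigma / step)
xs = [mu - 6 * sigma + i * step for i in range(n_steps + 1)]
area = sum(gaussian_pdf(x, mu, sigma) for x in xs) * step

peak = gaussian_pdf(mu, mu, sigma)        # density at the mean
left = gaussian_pdf(mu - 5, mu, sigma)    # 5 units below the mean
right = gaussian_pdf(mu + 5, mu, sigma)   # 5 units above the mean

print(f"area={area:.5f}, peak={peak:.5f}, left={left:.5f}, right={right:.5f}")
```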
You might have heard of this if you've ever heard of the "80/20" principle, "the long tail", or "rich get richer" systems. This is when things are distributed according to a power law: the probability of a value is proportional to that value raised to a negative power.
Examples from real life:
$P(X=x) = cx^{-\alpha}$, where
worth_millions = 190000
count = 1
worths = []
counts = []
for i in range(10):
    worths.append(worth_millions)
    counts.append(count)
    worth_millions = worth_millions / 2
    count = count * 4
df = pd.DataFrame({"Worth": worths, "Count": counts})
df
| | Worth | Count |
|---|---|---|
| 0 | 190000.00000 | 1 |
| 1 | 95000.00000 | 4 |
| 2 | 47500.00000 | 16 |
| 3 | 23750.00000 | 64 |
| 4 | 11875.00000 | 256 |
| 5 | 5937.50000 | 1024 |
| 6 | 2968.75000 | 4096 |
| 7 | 1484.37500 | 16384 |
| 8 | 742.18750 | 65536 |
| 9 | 371.09375 | 262144 |
df.plot.scatter(x="Worth", y="Count")
Properties of power law distributions:
df.plot.scatter(x="Worth", y="Count", loglog=True)
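On log-log axes a power law $cx^{-\alpha}$ becomes a straight line with slope $-\alpha$, because $\log(cx^{-\alpha}) = \log c - \alpha \log x$. The sketch below rebuilds the same illustrative worth/count table and estimates $\alpha$ with an ordinary least-squares slope written out by hand; the variable names are assumptions made for this example.

```python
from math import log

# Same illustrative data as the table: worth halves while count quadruples
worths = [190000 / 2**i for i in range(10)]
counts = [4**i for i in range(10)]

log_w = [log(w) for w in worths]
log_c = [log(c) for c in counts]

# Ordinary least-squares slope of log(count) against log(worth)
mean_w = sum(log_w) / len(log_w)
mean_c = sum(log_c) / len(log_c)
slope = (sum((w - mean_w) * (c - mean_c) for w, c in zip(log_w, log_c))
         / sum((w - mean_w)**2 for w in log_w))

alpha = -slope  # exponent of the power law
print(f"estimated alpha = {alpha:.3f}")
```

Halving the worth multiplies the count by 4, so the count scales as worth to the power $-2$, and the estimated exponent comes out at 2. Real data would scatter around the line rather than fall exactly on it.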
In the NHANES dataset, heights and other length measurements are given in centimeters. I don't have a good intuition for what counts as a normal height in centimeters: if you're 160 cm tall, are you short, tall, or average? One thing I could do is convert to feet and inches, which I do have a feel for. But sometimes no intuitive units are available.
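For reference, the unit conversion itself is short. This is a minimal sketch; the helper `cm_to_feet_inches` is a hypothetical name introduced here, and it relies only on the definition of one inch as exactly 2.54 cm.

```python
CM_PER_INCH = 2.54  # exact by definition

def cm_to_feet_inches(cm):
    """Convert centimeters to (whole feet, remaining inches)."""
    total_inches = cm / CM_PER_INCH
    feet, inches = divmod(total_inches, 12)
    return int(feet), inches

feet, inches = cm_to_feet_inches(160)
print(f"160 cm is about {feet} ft {inches:.1f} in")
```

This only helps when a familiar unit exists.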
data_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv"
cols_renamed = {"SEQN": "SEQN",
                "RIAGENDR": "Gender", # 1 = M, 2 = F
                "RIDAGEYR": "Age", # years
                "BMXWT": "Weight", # kg
                "BMXHT": "Height", # cm
                "BMXLEG": "Leg", # cm
                "BMXARML": "Arm", # cm
                "BMXARMC": "Arm Cir", # cm
                "BMXWAIST": "Waist Cir"} # cm
df = pd.read_csv(data_url)
df = df.rename(cols_renamed, axis='columns')
df = df.drop("SEQN", axis='columns')
df = df[df["Age"] >= 21]
ht_col = df["Height"]
ht_col
To compute a $z$-score:
df["Height-z"] = (ht_col - ht_col.mean()) / ht_col.std()
df["Height-z"]
sns.histplot(x="Height-z", data=df)
Now instead of the raw data value, you have an interpretable measure of how close each point is to the mean. If you have an approximately-Gaussian distribution, you also have a good idea of how unusual that point is!
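To put a number on "how unusual", here is a minimal sketch that converts a $z$-score into the fraction of an exactly Gaussian distribution lying within that many standard deviations of the mean. The helper `fraction_within` is a hypothetical name. The result is only as good as the Gaussian approximation is for the data at hand.

```python
from math import erf, sqrt

def fraction_within(z):
    """Fraction of a Gaussian lying within z standard deviations of its mean."""
    return erf(z / sqrt(2))

for z in (1, 2, 3):
    inside = fraction_within(z)
    print(f"|z| <= {z}: {inside:.4f} inside, {1 - inside:.4f} outside")
```

Under this assumption, roughly 68% of points have $|z| \le 1$, about 95% have $|z| \le 2$, and about 99.7% have $|z| \le 3$, so a height with a $z$-score beyond 3 is rare.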