Lecture 12 - Probability Distributions, Z-Scores and Normalization (Logs)

Announcements:

$P(Y=H | X=T) + P(Y=H | X=H) =1$

Goals:

Uniform Distribution

Uniform Distribution - Analytically:

Probability mass function (the outcomes here are discrete):

$P(X = x) = \frac{1}{K}$ where $K$ is the number of possible outcomes.
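As a minimal sketch of the formula above, the function below assigns probability $\frac{1}{K}$ to each of $K$ outcomes. The name `uniform_pmf` and the six-sided die are illustrative assumptions, not part of the lecture materials.

```python
from fractions import Fraction


def uniform_pmf(outcomes):
    """Map each outcome to probability 1/K, where K = len(outcomes)."""
    k = len(outcomes)
    return {outcome: Fraction(1, k) for outcome in outcomes}


# Hypothetical example: a fair six-sided die, so K = 6.
die = uniform_pmf([1, 2, 3, 4, 5, 6])
print(die[3])             # 1/6
print(sum(die.values()))  # 1
```

Exact fractions are used so that the probabilities visibly sum to 1 without floating-point rounding.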

Properties

Exercise: Can you think of something in real life that would be well modeled by a uniform distribution?

Binomial Distribution

Binomial Distribution - Analytically:

$P(X = x) = {n \choose x} p^x (1 - p)^{n-x}$ where:
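Reading $n$ as the number of independent trials, $p$ as the probability of success on each trial, and $x$ as the number of successes (the usual reading, taken as an assumption here), the formula can be sketched as follows. The function name `binomial_pmf` and the coin-flip numbers are illustrative.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Hypothetical example: exactly 5 heads in 10 flips of a fair coin.
print(binomial_pmf(5, 10, 0.5))  # 0.24609375

# Summing over every possible x should give (approximately) 1.
total = sum(binomial_pmf(x, 10, 0.3) for x in range(11))
```

`math.comb` computes ${n \choose x}$ exactly as an integer, so the only rounding comes from the floating-point powers.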

Properties of the binomial distribution:

Gaussian Distribution

Examples of things that are Gaussian distributed in practice*:

* This requires conditions because reality is never simple: for heights, we'd need to:

Gaussian Distribution, Analytically

$P(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{\frac{-(x-\mu)^2}{2\sigma^2}}$ where:
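Taking $\mu$ as the mean and $\sigma$ as the standard deviation (the usual reading, taken as an assumption here), the density can be sketched directly from the formula. The name `gaussian_pdf` is illustrative.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Gaussian density at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coeff * math.exp(exponent)


# The density peaks at the mean and is symmetric around it.
peak = gaussian_pdf(0.0, mu=0.0, sigma=1.0)
left = gaussian_pdf(-1.5, mu=0.0, sigma=1.0)
right = gaussian_pdf(1.5, mu=0.0, sigma=1.0)
print(peak, left, right)
```

Because this is a density rather than a probability, the value at a single point can exceed 1 when $\sigma$ is small; probabilities come from areas under the curve.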

Properties:

Power Law Distributions

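You might have heard of this if you've ever heard of the "80/20" principle, or "the long tail", or "rich get richer" systems. This is when the probability of a value falls off as a power of that value, $x^{-\alpha}$, which decays much more slowly than an exponential curve and so leaves a heavy tail of very large values.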

Examples from real-life:

Power Law Distribution, Analytically:

$P(X=x) = cx^{-\alpha}$, where
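Treating $c$ as the normalizing constant and $\alpha$ as the exponent (the usual reading, taken as an assumption here), a rough sketch is below. To make $c$ computable with a finite sum, it restricts $X$ to the integers $1, \dots, x_{max}$; that truncation, the name `power_law_pmf`, and the parameter values are all illustrative choices.

```python
def power_law_pmf(alpha, x_max):
    """Return {x: c * x**(-alpha)} for x = 1..x_max, with c chosen so it sums to 1."""
    weights = {x: x ** (-alpha) for x in range(1, x_max + 1)}
    c = 1.0 / sum(weights.values())  # normalizing constant
    return {x: c * w for x, w in weights.items()}


pmf = power_law_pmf(alpha=2.0, x_max=1000)

# Doubling x always divides the probability by 2**alpha, whatever c is.
ratio = pmf[10] / pmf[20]
print(ratio)
```

The constant ratio under doubling is what makes the distribution look like a straight line on a log-log plot.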

Properties of power law distributions:

(image source: https://capitalaspower.com/2019/10/visualizing-power-law-distributions/)

Z-Scores and Normalization

In the NHANES dataset, heights and other length measurements are given in centimeters. I don't have intuition for what a normal height is in centimeters: if you're 160 cm tall, are you short? Tall? Average? One thing I could do is convert to feet and inches, which I do know. But sometimes you don't have any units that are intuitive.

To compute a $z$-score:

  1. Subtract the mean
  2. Divide by the standard deviation.

Now instead of the raw data value, you have an interpretable measure of how far each point is from the mean, in units of standard deviations. If you have an approximately Gaussian distribution, you also have a good idea of how unusual that point is!
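In symbols, the two steps amount to $z = \frac{x - \mu}{\sigma}$. A minimal sketch follows; the heights are made-up numbers rather than NHANES values, and the choice of the population standard deviation (`pstdev`) over the sample version is an assumption.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed choice)
    return [(v - mu) / sigma for v in values]


# Made-up heights in centimeters, for illustration only.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
z = z_scores(heights_cm)
print(z[2])  # 0.0, because 170 is exactly the mean of this list
```

After the transformation the values have mean 0 and standard deviation 1, so a score of about 2 reads as "two standard deviations above average" regardless of the original units.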