Lecture 16

Correlation (does not imply causation)

Measuring Correlation

Announcements:

Goals:

Correlation (does not imply causation)

What do we mean by correlation?

So far, an informal definition: Variables are correlated if one appears to predict the other, or if an increase or decrease in one is consistently associated with an increase or decrease in the other.

What does correlation imply?

Suppose you open up a dataset, make a jointplot and find that two variables A and B are highly correlated. Without knowing any more than that, what might have given rise to this correlation?

A is liquid precip per minute, B is number of people wearing raincoats.

What happened? A caused B.


A is a measure of lung health; B is number of cigarettes smoked per day.

What happened? B caused A.


A is the ranking of the college a person graduated from; B is their score on the GRE (an SAT-like exam sometimes used for graduate school admissions).

What happened? C caused B and A.


A is height in centimeters; B is height in feet

What happened? A is B.


A is total firefox downloads and B is membership in the Wicca religion.

What happened? Most likely nothing: with enough pairs of variables, some are correlated by pure coincidence.

More (real) examples: https://www.tylervigen.com/spurious-correlations

In seriousness:

It's easy to find ridiculous correlations that are obviously not causal and have a good laugh about it.

It's still easy to see a correlation and jump to the conclusion that causation is at work (even if you understand intellectually that correlation does not imply causation!).

It's also vastly easier to find correlations than to prove causation - this is a big part of what makes science hard.

A fun activity: google "study finds link" and look at some of the news articles that result.

From my second result, way down the page:

[The scientists] added that randomized intervention trials, as well as long-term objective measurements of physical activity in prospective studies, are also needed to assess the validity and causality of the association they reported.

Title text of the xkcd comic "Correlation" (https://xkcd.com/552/): "Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'."

Measuring Correlation

We'd like a number (a statistic, really) that captures the degree of correlation between two variables.

Here's one such statistic, called the Pearson Correlation Coefficient: $$r(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})} {\sqrt{\sum_{i=1}^n (X_i-\bar{X})^2}\sqrt{\sum_{i=1}^n(Y_i-\bar{Y})^2}}$$

where $\bar{X}$ and $\bar{Y}$ are the means of $X$ and $Y$, respectively.

On the one hand, yikes! On the other hand, this is made mostly of things we've seen before. Notice that the denominator is the product of the two variables' standard deviations! $$r(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{X}) (Y_i - \bar{Y})} {\sigma(X) \sigma(Y)}$$

And the numerator is... well, it's sort of like a variance, except instead of squaring the distance of one variable to its mean, we multiply $X_i$'s distance from the mean by $Y_i$'s distance from the mean. This is a nifty enough thing that it also has its own name: covariance. So we can equivalently write

$$r(X, Y) = \frac{\mathrm{Cov}(X,Y)}{\sigma(X) \sigma(Y)}$$

Note: You might notice that the factors in the denominator are not quite standard deviations: they're missing the $1/(n-1)$ normalizer inside the square root. Covariance alone is also usually defined with a $1/(n-1)$, but by leaving it out on both top and bottom we get the same result, since the $1/(n-1)$s cancel each other out.
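To see that nothing is hiding in the algebra, here's a minimal sketch (with made-up data) that computes $r$ straight from the definition and checks it against numpy's built-in `np.corrcoef`:

```python
import numpy as np

# made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# numerator: sum of products of deviations from the means (an un-normalized covariance)
num = np.sum((x - x.mean()) * (y - y.mean()))

# denominator: product of the (equally un-normalized) standard deviations
den = np.sqrt(np.sum((x - x.mean()) ** 2)) * np.sqrt(np.sum((y - y.mean()) ** 2))

print(num / den)                 # r from the definition
print(np.corrcoef(x, y)[0, 1])   # numpy's r -- should agree
```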

Of course pandas has a function to calculate $r$ for you:
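For instance (the DataFrame and column names here are made up for illustration):

```python
import pandas as pd

# hypothetical stand-in for a real dataset
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 5, 4, 6]})

# Pearson r between two columns (method='pearson' is the default)
print(df['A'].corr(df['B']))
```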

You can also just compute all the correlations among all columns of a dataframe:
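Continuing with the made-up `df` above, something like:

```python
# Pearson r for every pair of numeric columns, returned as a matrix
print(df.corr())
```

(If the frame has non-numeric columns, `df.corr(numeric_only=True)` restricts the computation to the numeric ones.)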

Let's look at a synthetic example to make a couple points here:
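The plots themselves aren't reproduced here, but a sketch along these lines (with made-up parameters) generates the kinds of cases shown:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 100)

# 1. noisy linear relationship: r is close to 1
y_linear = 2 * x + rng.normal(0, 1, 100)

# 2. strong but nonlinear (quadratic) relationship: r is near 0
y_quad = x ** 2 + rng.normal(0, 0.5, 100)

# 3. the linear data plus one extreme outlier
x_out = np.append(x, 30)
y_out = np.append(y_linear, -60)

print(np.corrcoef(x, y_linear)[0, 1])   # high
print(np.corrcoef(x, y_quad)[0, 1])     # near zero despite a clear pattern
print(np.corrcoef(x_out, y_out)[0, 1])  # one point drags r way down
```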

Notice that Pearson correlation ($r$) measures how well the variables are modeled by a linear relationship. Deviations from linearity result in lower $r$. Also, outliers have an outsized effect on $r$.

Another measure of correlation that is less susceptible to these issues is the Spearman Rank Correlation Coefficient.

The intuition is to look at the relative positioning of a value of $x$ within its fellow $x$s and see how that compares to the corresponding $y$'s position among the $y$s. Let $\operatorname{rank}(x_i)$ be the position of $x_i$ in sorted order (i.e., 1 for the smallest, $n$ for the largest). If the $x$s and $y$s are perfectly "rank-correlated", then $\operatorname{rank}(x_i)$ will equal $\operatorname{rank}(y_i)$ for all $i$. A measure of how far the dataset deviates from this ideal is: $$\rho = 1 - \frac{6 \sum_{i=1}^n (\operatorname{rank}(x_i) - \operatorname{rank}(y_i))^2}{n(n^2-1)}$$

The details of this formula aren't too interesting; the factor of 6 in the numerator and the entire denominator are just there to make sure the result lies in the range from -1 to 1.
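As a sanity check on the formula, here's a sketch (made-up, tie-free data) that computes $\rho$ by hand and compares it against scipy's `spearmanr`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = np.exp(x) + rng.normal(0, 0.1, size=50)  # monotonic but very nonlinear in x

# rank 1 for the smallest value, rank n for the largest
rank_x = stats.rankdata(x)
rank_y = stats.rankdata(y)

n = len(x)
rho = 1 - 6 * np.sum((rank_x - rank_y) ** 2) / (n * (n ** 2 - 1))

print(rho)
print(stats.spearmanr(x, y)[0])  # should agree (the formula assumes no ties)
```

pandas can compute the same thing with `df.corr(method='spearman')`.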

TODO: Add the Spearman Correlation Coefficient to the plots above and see how the two measures compare for our synthetic examples.