{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "fb0c41dd",
   "metadata": {},
   "source": [
    "# Lecture 15 - Correlation\n",
    "## (does not imply causation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c3063ce5",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d853eb04",
   "metadata": {},
   "source": [
    "#### Announcements:\n",
    "\n",
    "\n",
    "#### Goals:\n",
    "* Understand the distinction between correlation and causation, and be wary of the dangers of assuming that correlation implies causation.\n",
    "* Know the definition and have an intuition for the meaning of the Pearson Correlation Coefficient and the Spearman Rank Correlation Coefficient, the differences between them, and when to use each."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1e274b68",
   "metadata": {},
   "source": [
    "## Correlation (does not imply causation)\n",
    "\n",
    "What do we mean by correlation?\n",
    "\n",
    "So far, an informal definition: Variables are correlated if one **appears to predict** the other, or if an increase or decrease in one is **consistently associated with** an increase or decrease in the other."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "14495c28",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns\n",
    "data_url = \"https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/NHANES/NHANES.csv\"\n",
    "cols_renamed = {\"SEQN\": \"SEQN\",\n",
    "                \"RIAGENDR\": \"Gender\", # 1 = M, 2 = F\n",
    "                \"RIDAGEYR\": \"Age\", # years\n",
    "                \"BMXWT\": \"Weight\", # kg\n",
    "                \"BMXHT\": \"Height\", # cm\n",
    "                 \"BMXLEG\": \"Leg\", # cm\n",
    "                \"BMXARML\": \"Arm\", # cm\n",
    "                \"BMXARMC\": \"Arm Cir\", # cm\n",
    "                \"BMXWAIST\": \"Waist Cir\"} # cm\n",
    "\n",
    "bm = pd.read_csv(data_url)\n",
    "bm = bm.rename(cols_renamed, axis='columns')\n",
    "bm = bm.drop(\"SEQN\", axis='columns')\n",
    "bm = bm[bm[\"Age\"] >= 21]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4c901cc-b261-475c-95e9-45eabf99a2df",
   "metadata": {},
   "source": [
    "There seems to be a pretty strong correlation between height and leg length:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a9fd5947",
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"Height\", y=\"Leg\", data=bm, s=3, height=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "92e40410-3705-448b-81ec-c09628dece0f",
   "metadata": {},
   "source": [
    "Not as much between height and arm circumference - maybe a little upward trend, if you squint?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee7fe66f",
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"Height\", y=\"Arm Cir\", data=bm, s=3, height=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "764c3096-7133-4302-909c-602a07d41e46",
   "metadata": {},
   "source": [
    "Really hard to see much of a correlation between age and weight:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e8c6445",
   "metadata": {},
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"Age\", y=\"Weight\", data=bm, s=3, height=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "828dfe5c-a2d1-4053-924c-ed37778cebb0",
   "metadata": {},
   "source": [
    "It kind of looks like there's a slight negative correlation here between age and arm circumference?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01c5802b",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "sns.jointplot(x=\"Age\", y=\"Arm Cir\", data=bm[bm[\"Age\"] >= 60], s=3, height=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9c5ab901",
   "metadata": {},
   "source": [
    "Variables are correlated if one **appears to predict** the other, or if an increase or decrease in one is **consistently associated with** an increase or decrease in the other."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aad55a58",
   "metadata": {},
   "source": [
    "# Measuring Correlation\n",
    "\n",
    "We'd like a number (a statistic, really) that captures the *degree* of correlation between two variables.\n",
    "\n",
    "Here's one such statistic, called the **Pearson Correlation Coefficient**:\n",
    "$$r(X, Y) = \\frac{\\sum_{i=1}^n (X_i - \\bar{X}) (Y_i - \\bar{Y})}\n",
    "                  {\\sqrt{\\sum_{i=1}^n (X_i-\\bar{X})^2}\\sqrt{\\sum_{i=1}^n(Y_i-\\bar{Y})^2}}$$\n",
    "                  \n",
    "where $\\bar{X}$ and $\\bar{Y}$ are the means $X$ and $Y$, respectively.\n",
    "\n",
    "On the one hand, yikes! On the other hand, this is made of mostly things we've seen before. Notice that the denominator is the product of the two variables' standard deviations!\n",
    "$$r(X, Y) = \\frac{\\sum_{i=1}^n (X_i - \\bar{X}) (Y_i - \\bar{Y})}\n",
    "                  {\\sigma(X) \\sigma(Y)}$$\n",
    "                 \n",
    "And the numerator is... well, it's sort of like a variance, except instead of squaring the distance of one varaible to its mean, we multiply $X_i$'s distance from the mean by $Y_i$'s distance from the mean. This is a nifty enough thing that it also has its own name: *covariance*. So we can equivalently write \n",
    "\n",
    "$$r(X, Y) = \\frac{\\mathrm{Cov}(X,Y)}{\\sigma(X) \\sigma(Y)}$$\n",
    "\n",
    ">*Note*: You might notice that the denominators are not quite standard deviations: they're missing the $1/(n-1)$ normalizer. Covariance alone is also usually defined with a $1/(n-1)$, but by leaving it out on the top and bottom we get the same result because the $1/(n-1)$s would cancel each other out.\n",
    "\n",
    "Of course pandas has a function to calculate $r$ for you:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "53276fab",
   "metadata": {},
   "outputs": [],
   "source": [
    "bm[\"Height\"].corr(bm[\"Leg\"]) # pandas"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5efe789b",
   "metadata": {},
   "source": [
    "You can also just compute *all* the correlations among all columns of a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aee83af0",
   "metadata": {},
   "outputs": [],
   "source": [
    "bm.corr() # pandas-er"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0d084693",
   "metadata": {},
   "outputs": [],
   "source": [
    "def corrplot(c1, c2, df, sz=1):\n",
    "    r = df[c1].corr(df[c2]) # Pearson\n",
    "    sns.lmplot(x=c1, y=c2, data=df, height=4, scatter_kws={'s': sz}).set(title=f\"Pearson r= {r:.02}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b20abca",
   "metadata": {},
   "outputs": [],
   "source": [
    "corrplot(\"Height\", \"Leg\", bm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7345e92c",
   "metadata": {},
   "outputs": [],
   "source": [
    "corrplot(\"Height\", \"Arm Cir\", bm)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "278b95e7",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "corrplot(\"Age\", \"Height\", bm)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0ddc6c3e",
   "metadata": {},
   "source": [
    "### Two variables are correlated. What does that tell us?\n",
    "\n",
    "Suppose you open up a dataset, make a `jointplot` and find that two variables A and B are highly correlated. Without knowing any more than that, what might have given rise to this correlation?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cd3d1e2",
   "metadata": {},
   "source": [
    "A is liquid precip per minute, B is number of people wearing raincoats.\n",
    "\n",
    "What happened? \n",
    "\n",
    "**A caused B**\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25b8e012",
   "metadata": {},
   "source": [
    "A is a measure of lung health; B is number of cigarettes smoked per day.\n",
    "\n",
    "What happened? \n",
    "\n",
    "**B caused A**\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb3cbff7",
   "metadata": {},
   "source": [
    "**A** is the ranking of the college a person graduated from; **B** is their score on the GRE (an SAT-like exam sometimes used for graduate school admissions).\n",
    "\n",
    "What happened? \n",
    "\n",
    "(C, D, E, F, Z) caused A and B\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c3b24c46",
   "metadata": {},
   "source": [
    "A is height in centimeters; B is height in feet\n",
    "\n",
    "What happened? \n",
    "\n",
    "**A is B**\n",
    "\n",
    "\n",
    "-----"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2e5c280f",
   "metadata": {},
   "source": [
    "**A** is the cost to send a letter via the USPS; **B** is the number of Google searches for 'i am dizzy'.\n",
    "\n",
    "What happened?\n",
    "\n",
    "If you look at how enough things vary, many of them will line up.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "817a51e8-e20d-4003-9e9f-c23404f1be58",
   "metadata": {},
   "source": [
    "![](https://www.tylervigen.com/spurious/correlation/image/2607_cost-to-send-a-letter-via-the-usps_correlates-with_google-searches-for-i-am-dizzy.svg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee3f2468-8471-445a-8653-e3f9e9f2d45a",
   "metadata": {},
   "source": [
    "![](https://imgs.xkcd.com/comics/firefox_wicca.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ebf0ece2",
   "metadata": {},
   "source": [
    "More (real) examples: https://www.tylervigen.com/spurious-correlations"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "df1f7dff",
   "metadata": {},
   "source": [
    "### In seriousness:\n",
    "\n",
    "It's easy to find ridiculous correlations that are obviously not causal and have a good laugh about it.\n",
    "\n",
    "It's **still** easy to see correlation and jump to the conclusion that there is a causation at work, (even if you understand intellectually that correlation does not imply causation!)\n",
    "\n",
    "It's also vastly easier to find correlations than to prove causation - this is a big part of what makes science hard.\n",
    "\n",
    "A fun activity: google \"[study finds link](https://www.google.com/search?q=study+finds+link)\" and look at some of the news articles that result.\n",
    "\n",
    "From my [second result](https://www.news-medical.net/news/20210910/Study-finds-link-between-physically-active-lifestyle-and-lower-risk-of-anxiety.aspx), way down the page:\n",
    "> \\[The scientists\\] added that randomized intervention trials, as well as long-term objective measurements of physical activity in prospective studies, are also needed to **assess the validity and causality of the association they reported**."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6bca3d2e",
   "metadata": {},
   "source": [
    "![](https://imgs.xkcd.com/comics/correlation.png)\n",
    "Title text: \"Correlation doesn't imply causation, but it does waggle its eyebrows suggestively and gesture furtively while mouthing 'look over there'.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "daa34423-4b23-4601-89f1-42f4c5af3cc8",
   "metadata": {},
   "source": [
    "### Some notable properties of Pearson's $r$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "459f29ef",
   "metadata": {},
   "source": [
    "Let's look at a synthetic example to make a couple points here:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5320751e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "df = pd.DataFrame()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dbc16554-0b95-4336-9f11-038d138ee088",
   "metadata": {},
   "outputs": [],
   "source": [
    "def corrplot(c1, c2, df, sz=1):\n",
    "    r = df[c1].corr(df[c2]) # Pearson\n",
    "    p = df[c1].corr(df[c2], method=\"spearman\")\n",
    "    sns.lmplot(x=c1, y=c2, data=df, height=4, scatter_kws={'s': sz}).set(title=f\"Pearson r= {r:.02}; Spearman p = {p:.02}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "94df5f61",
   "metadata": {},
   "outputs": [],
   "source": [
    "x1 = np.arange(0, 10, .1)\n",
    "y_ideal = np.arange(6, 10, 0.04)\n",
    "noise = np.random.randn(*y_ideal.shape) * .2\n",
    "y1 = y_ideal + noise\n",
    "df[\"x1\"] = x1\n",
    "df[\"y1\"] = y1\n",
    "corrplot(\"x1\", \"y1\", df, sz=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "111303b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "x2 = np.arange(0, 10, .1)\n",
    "y2_ideal = np.arange(6, 10, 0.04)\n",
    "y2_ideal[50] = 38\n",
    "noise2 = np.random.randn(*y2_ideal.shape) * .5\n",
    "y2 = y2_ideal + noise2\n",
    "df[\"x2\"] = x2\n",
    "df[\"y2\"] = y2\n",
    "corrplot(\"x2\", \"y2\", df, sz=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dae29a25",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "x3 = np.concatenate((np.arange(0, 5, .1), np.arange(5, 6, 0.02)))\n",
    "y3 = np.concatenate((np.arange(6, 8, 0.04), np.arange(8, 12, 0.08)))\n",
    "df[\"x3\"] = x3\n",
    "df[\"y3\"] = y3\n",
    "corrplot(\"x3\", \"y3\", df, sz=4)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "315ad317",
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "x4 = np.arange(0, 100)\n",
    "y4 = np.arange(6, 106)**6\n",
    "df[\"x4\"] = x4\n",
    "df[\"y4\"] = y4\n",
    "corrplot(\"x4\", \"y4\", df, sz=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a23c3cc9",
   "metadata": {},
   "source": [
    "#### Two observations:\n",
    "* Pearson measures the strength of a **linear** relationship\n",
    "* Outliers have an outsided effect on $r$.\n",
    "\n",
    "### Spearman \n",
    "Another measure of correlation that is less susceptible to these is called the **Spearman Rank Correlation Coefficient**.\n",
    "\n",
    "The intuition is to look at the *relative* positioning of a value of $x$ within its fellow $x$s and see how that compares to the corresponding $y$'s position among $y$s. Let $rank(x_i)$ be the position of $x_i$ in sorted order (i.e., 1st for the smallest, $n$ for the largest). Then if the $x$'s and $y$'s are perfectly \"rank-correlated\", then $rank(x_i)$ will equal $rank(y_i)$ for all $i$. A measure of how far the dataset deviates from this ideal is:\n",
    "$$\\rho = 1 - \\frac{6 \\sum_{i=1}^n (rank(x_i) - rank(y_i))^2}{n(n^2-1)}$$\n",
    "\n",
    "The details of this formula aren't too interesting; the factor of 6 in the numerator and the entire denominator are just there to make the sure the reusult lies in the range from -1 to 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc708e8c-d7f7-4335-8952-350dc1c9727f",
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c417558a-84db-4a6c-9ca4-71543e4aba8b",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
