{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "4e860ae3",
   "metadata": {
    "id": "4e860ae3"
   },
   "source": [
    "# Lecture 18 - Machine Learning Fundamentals: Generalization"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "37302a22-c756-4acf-8a89-89e8214fe903",
   "metadata": {},
   "source": [
    "### Announcements\n",
    "\n",
    "  * New seats, new friends!\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "5d64c663-ad2d-405e-ace7-4385b5d40ac4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['Malik', 'Josh', 'Alli', 'Erika']\n",
      "['Marcus', 'Keira', 'Maven', 'Dylan', 'Zach']\n",
      "['Narina', 'Finnley', 'Haden', 'Sebastian']\n"
     ]
    }
   ],
   "source": [
    "import random\n",
    "random.seed(518)\n",
    "datafolk = \"Alli Keira Malik Erika Narina Sebastian Josh Dylan Haden Zach Maven Marcus Finnley\".split()\n",
    "random.shuffle(datafolk)\n",
    "print(datafolk[:4])\n",
    "print(datafolk[4:9])\n",
    "print(datafolk[9:])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a21c895c",
   "metadata": {
    "id": "a21c895c"
   },
   "source": [
    "#### Goals\n",
    "\n",
    "\n",
    "* Understand the near-universal **in-distribution** assumption and its implications\n",
    "* Know how **true risk** differs from **empirical risk**.\n",
    "* Know how to define **bias**, **variance**, **irreducible error**.\n",
    "* Be able to identify the most common causes of the above types of error, and explain how they relate to generalization, risk, **overfitting**, **underfitting**.\n",
    "* Know why and how to create separate **validation** and **test** sets to evaluate a model\n",
    "* Know why and how to subdivide datasets into **training**, **validation**, and **test** sets\n",
    "* Understand what **hyperparameters** are and how to tune them using a validation set\n",
    "* Know how cross-validation works and why you might want to use it."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "927f77fe-e54e-44ae-9d2d-245148c5bd45",
   "metadata": {
    "id": "fa59edc0"
   },
   "source": [
    "## Machine Learning: Foundational Assumptions\n",
    "\n",
    "### The In-Distribution Assumption\n",
    "\n",
    "Generally: **unseen data is drawn from the same distribution as your dataset.**\n",
    "\n",
    "*Consequence:* We don't assume correlation is causation, but we do assume that observed correlations will hold in unseen data."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "cbc2cf81-45cb-417e-8b80-9ef2d4580d31",
   "metadata": {
    "id": "A2HrQzl-3NON"
   },
   "source": [
    "## Big Idea: Generalization\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9e0eb67e-78eb-4e31-836d-e6c6f845ab6e",
   "metadata": {
    "id": "gLlnqEIDCIHZ"
   },
   "source": [
    "**Generalization** is the ability of a model to perform well on **unseen** data (i.e., data that was not in the training set).\n",
    "* As discussed above: we're usually hoping to perform well on unseen data that is drawn from the **same** distribution as the training set."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "604a0f7d-f125-42e5-b9b6-0985b2ec0877",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ddd49938-4aad-45ec-a21c-7dfa557d50f2",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 297
    },
    "executionInfo": {
     "elapsed": 1241,
     "status": "ok",
     "timestamp": 1676484839023,
     "user": {
      "displayName": "Scott Wehrwein",
      "userId": "11327482518794216604"
     },
     "user_tz": 480
    },
    "id": "ZIJVIfQk3ShB",
    "outputId": "0aa0daaf-5342-40bd-e4fe-fa40765f8526"
   },
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "  \"X\": [0.49, 0.18, 0.31, 0.40, 0.24],\n",
    "  \"Y\": [0.09, 0.45, 0.23, 0.19, 0.48]\n",
    "})\n",
    "fig = sns.scatterplot(data=df,x=\"X\",y=\"Y\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c94ddab2-4fe9-4286-bfe6-49cbfd1e3639",
   "metadata": {
    "id": "I93ihiH43Vqd"
   },
   "source": [
    "Consider the following possible ways to draw a line that fits the data:\n",
    "* Linear functions (degree-1 polynomials)\n",
    "* Quadratic functions (degree-2 polynomials)\n",
    "* Degree-10 polynomials"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02341176-4e84-4b4b-b401-625e3091c20d",
   "metadata": {
    "id": "Yd7x029o3wcM"
   },
   "source": [
    "**Question 1**: Which of these chioces of model will result in the best fit on the training data?\n",
    "\n",
    "**Question 2**: Which of these will result in the best fit on a *different* batch of data drawn from the same distribution as $\\mathcal{X}$? In other words, which of these will **generalize** best?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7ac86b91-cba4-4319-b041-f03b0e948c9e",
   "metadata": {
    "id": "x1nOLd7mdFE0"
   },
   "source": [
    "### Empirical Risk vs True Risk\n",
    "Let's formalize the above distinction.\n",
    "\n",
    "We use the word **risk** (or sometimes **loss**, or **cost**) to measure how \"badly\" a model fits the data.\n",
    "\n",
    "In the case of a regression problem like the above, we might measure the sum of squared distances from each $y$ value to the line's value at that $x$.\n",
    "\n",
    "When fitting a model, what we *truly* care about is a quantity known as (true) *risk*: $R(h; {\\cal X})$.\n",
    "- True risk is the expected loss \"in the wild\"\n",
    "- Depends on a probability distribution that we don't know: $P(x,y)$ -- the joint distribution of inputs and outputs.\n",
    "  - If we knew $P$, there's nothing left to \"learn\": let $\\hat{y} = \\arg\\max_y P(y | x)$.\n"
   ]
  },
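  {
   "cell_type": "markdown",
   "id": "b7e1c2d4-risk-definitions",
   "metadata": {},
   "source": [
    "As a sketch of the formal definitions, true risk and empirical risk are commonly written as below. The per-example loss $\\ell$ and the training set $\\{(x_i, y_i)\\}_{i=1}^N$ are notation assumed here for illustration, and $R(h)$ stands for the quantity written $R(h; {\\cal X})$ above.\n",
    "\n",
    "$$R(h) = \\mathbb{E}_{(x,y) \\sim P}\\big[\\ell(h(x), y)\\big], \\qquad \\hat{R}(h) = \\frac{1}{N} \\sum_{i=1}^{N} \\ell(h(x_i), y_i)$$\n",
    "\n",
    "For the regression example above, one choice is $\\ell(h(x), y) = (h(x) - y)^2$. We can compute $\\hat{R}$ from the data we have; $R$ can only be estimated, because it depends on the unknown $P$."
   ]
  },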
  {
   "cell_type": "markdown",
   "id": "08de8aa0-0fbe-4fcc-8a54-0818fd67722e",
   "metadata": {
    "id": "eAHiJZICCkp5"
   },
   "source": [
    "### Where does risk come from?\n",
    "There are three contributors to risk:\n",
    "1. Bias (not the same bias as the $b$ in our linear model)\n",
    "2. Variance\n",
    "3. Irrereducible error\n",
    "\n",
    "To understand bias and variance, we need to consider hypothetical:\n",
    "  - There is some underlying distribution/source generating input-output pairs\n",
    "    - The probabily of a pair is denoted $P(x,y)$\n",
    "    - The probability of the output given the input is denoted $P(y|x)$\n",
    "    - Why a distribution? Because the same input (x) can have different ouputs (y).\n",
    "      - Example: x contains home features: square feet, # bedrooms. Many houses are 2400 square feet with 3 bedrooms, and they're not all priced the same.\n",
    "  - for i in 1..K\n",
    "    - Get a random training set with $N$ points sampled from $P$\n",
    "    - Train a model on that training set, call that $h_i(x)$.\n",
    "  - Define $\\bar{h}(x) = \\frac{1}{K} \\sum_{i=1}^K h_i(x)$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d346f2b0-c3e5-4f3f-be5e-1c77286d815d",
   "metadata": {
    "id": "pnYFTt04gH8s"
   },
   "source": [
    "#### Bias\n",
    "- The **bias** of the training process is how far $\\bar{h}(x)$ is from the mean of $P(y|x)$.\n",
    "- High bias implies something is keeping you from capturing true behavior of the source.\n",
    "- Most common cause of bias? The model class is too restrictive aka too simple aka not powerful enough aka not expressive enough.\n",
    "  - E.g., if the true relationship is quadratic, using linear functions will have high bias.\n",
    "- Training processes with high bias are prone to **underfitting**.\n",
    "  - Underfitting is when you fail to capture important phenomena in the input-output relationship, leading to higher risk."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e3aab6ed-58e5-4687-b59f-6ae0719b6b85",
   "metadata": {
    "id": "4jg6kt9KhUzp"
   },
   "source": [
    "#### Variance\n",
    "- The **variance** of a training process is the variance of the individual models $h_i(x)$; that is, how spread they are around $\\bar{h}(x)$.\n",
    "- This is a problem, because we only have one $h_i$, not $\\bar{h}(x)$, so our model might be way off even if the average is good.\n",
    "- Most common causes of variance?\n",
    "  - Too powerful/expressive of a model, which is capable of **overfitting** the the training. Overfitting means memorizing or being overly influenced by noise in the training set.\n",
    "  - Small training set sizes ($N$).\n",
    "  - Higher irreducible error (noisier training set)."
   ]
  },
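  {
   "cell_type": "markdown",
   "id": "c3f9a6e1-bias-variance-sim",
   "metadata": {},
   "source": [
    "The hypothetical described above (draw $K$ training sets, train $h_i$ on each, average them into $\\bar{h}$) can be simulated if we invent the source ourselves. The sketch below is only an illustration under made-up assumptions: the quadratic `true_mean`, the noise level, the helper names, and the choices of `K`, `N`, and polynomial degrees are all hypothetical and are not part of the lecture's dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "\n",
    "def true_mean(x):\n",
    "    # Hypothetical mean of P(y|x): a quadratic chosen purely for illustration.\n",
    "    return 1.0 - 4.0 * (x - 0.5) ** 2\n",
    "\n",
    "def sample_training_set(n, noise_sd=0.1):\n",
    "    # Draw n input-output pairs from the made-up source P(x, y).\n",
    "    x = rng.uniform(0.0, 1.0, size=n)\n",
    "    y = true_mean(x) + rng.normal(0.0, noise_sd, size=n)\n",
    "    return x, y\n",
    "\n",
    "def bias_and_variance(degree, K=200, N=20):\n",
    "    # Evaluate every trained model on a fixed grid of inputs.\n",
    "    x_grid = np.linspace(0.05, 0.95, 50)\n",
    "    preds = np.empty((K, x_grid.size))\n",
    "    for i in range(K):\n",
    "        x, y = sample_training_set(N)\n",
    "        coeffs = np.polyfit(x, y, degree)      # train h_i on its own training set\n",
    "        preds[i] = np.polyval(coeffs, x_grid)  # h_i(x) on the grid\n",
    "    h_bar = preds.mean(axis=0)                 # average model, h-bar(x)\n",
    "    bias_sq = np.mean((h_bar - true_mean(x_grid)) ** 2)\n",
    "    variance = np.mean(preds.var(axis=0))      # spread of the h_i around h-bar\n",
    "    return bias_sq, variance\n",
    "\n",
    "for degree in (1, 2, 8):\n",
    "    b, v = bias_and_variance(degree)\n",
    "    print(f'degree {degree}: bias^2 = {b:.4f}, variance = {v:.4f}')\n",
    "```\n",
    "\n",
    "Because the invented source is quadratic, the linear model class should show the largest squared bias here, while the degree-8 fits should show the largest variance. Changing `N` or `noise_sd` is a quick way to see the other two causes of variance listed above."
   ]
  },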
  {
   "cell_type": "markdown",
   "id": "04026a20-0138-46a8-9d50-732d6a9bf77a",
   "metadata": {
    "id": "6xdpSAYfkrP-",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "#### Irreducible Error\n",
    "- Even if you have a zero bias, zero variance training process, you then predict the mean $P(y|x)$, which is almost never right.\n",
    "  - Because the truth is non-deterministic.\n",
    "  - This error that remains is the *irreducible error*.\n",
    "- Source of irreducible error?\n",
    "  - Not having enough, or enough relevant features in $x$.\n",
    "- Note: this error is only irreducible for a given feature set (the information in $x$). *If you change the problem to include more features, you can reduce irreducible error.*"
   ]
  },
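  {
   "cell_type": "markdown",
   "id": "d5a2e8f7-bv-decomposition",
   "metadata": {},
   "source": [
    "For squared loss, the three contributors add up exactly. The identity below is a sketch under stated assumptions: $\\bar{y}(x)$ is shorthand introduced here for the mean of $P(y|x)$, $\\bar{h}(x)$ is taken in the limit of many training sets (large $K$), and the expectations are over random training sets and over $y \\sim P(y|x)$ drawn independently of them.\n",
    "\n",
    "$$\\underbrace{\\mathbb{E}\\big[(h_i(x) - y)^2\\big]}_{\\text{expected squared error at } x} = \\underbrace{\\big(\\bar{h}(x) - \\bar{y}(x)\\big)^2}_{\\text{bias}^2} + \\underbrace{\\mathbb{E}\\big[(h_i(x) - \\bar{h}(x))^2\\big]}_{\\text{variance}} + \\underbrace{\\mathbb{E}\\big[(y - \\bar{y}(x))^2\\big]}_{\\text{irreducible error}}$$\n",
    "\n",
    "The last term does not involve the model at all, which is why no choice of training process can reduce it for a fixed feature set."
   ]
  },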
  {
   "cell_type": "markdown",
   "id": "472434fd-9744-4ce4-ae10-0c76831e1782",
   "metadata": {
    "id": "-5_EXGsAHahr"
   },
   "source": [
    "### Worksheet: Problems 1 - 4"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02cd04cc-afe6-4b2a-891f-5f1a9b6bdf53",
   "metadata": {
    "id": "Zipg06JrmEQR",
    "jp-MarkdownHeadingCollapsed": true
   },
   "source": [
    "### Identifying the model with the best generalization (i.e. lowest true risk)\n",
    "- Answer: hold out a *test set*. Use this to estimate (true) risk.\n",
    "- So we need a training set and a test set. But that is not enough in practice. Why?\n",
    "  - The more times you see results on the test set, the less representative it is as an estimate for $R$.\n",
    "  - Example: 10k random \"models\"\n",
    "- We need a training set, a test set, and ideally a *development* or *validation* set.\n",
    "  - This set is a surrogate for the test set."
   ]
  }
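  ,
  {
   "cell_type": "markdown",
   "id": "e8b4f0a3-random-models-demo",
   "metadata": {},
   "source": [
    "One plausible reading of the 10k random \"models\" example can be simulated directly. This is a sketch under assumptions made up for illustration: the test labels are fair coin flips (so no model can truly beat 50% accuracy), each \"model\" guesses at random, and the sizes `n_test` and `n_models` are arbitrary.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "n_test, n_models = 100, 10_000\n",
    "\n",
    "# Hypothetical test labels: fair coin flips, so nothing can truly beat 50%.\n",
    "y_test = rng.integers(0, 2, size=n_test)\n",
    "\n",
    "# Each row is one random 'model': its guesses ignore the inputs entirely.\n",
    "guesses = rng.integers(0, 2, size=(n_models, n_test))\n",
    "\n",
    "# Accuracy of every random model on the same test set.\n",
    "test_acc = (guesses == y_test).mean(axis=1)\n",
    "best_acc = test_acc.max()\n",
    "\n",
    "print(f'average test accuracy: {test_acc.mean():.3f}')\n",
    "print(f'best test accuracy:    {best_acc:.3f}')\n",
    "```\n",
    "\n",
    "The average accuracy stays near chance, but the best of the 10k scores is noticeably higher even though every model is pure noise. Selecting on the test set is what inflates the number, which is the motivation for doing selection on a validation set instead."
   ]
  }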
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}