{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3703334c-7691-4d04-841c-5286458a0830",
   "metadata": {
    "id": "3703334c-7691-4d04-841c-5286458a0830"
   },
   "source": [
    "# Activity: Auditing Bias in Medical Risk Prediction\n",
    "\n",
    "## Part B: Reproducing Findings From Obermeyer et al.\n",
    "\n",
    "The authors of this study collected sensitive medical and demographic data for a large set of hospital patients, along with the risk scores assigned to those patients by the recommendation system. They then released a *synthetic version* of the data set, in which many features were randomized and anonymized but important correlations in the data were preserved. This means that we can reproduce many of the analyses of the paper for ourselves. The remainder of this activity will guide us through a partial reproduction of the study’s findings.\n",
    "\n",
    "The code below will import several packages, download the synthetic data set, and load it into a pandas DataFrame:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "7b9decb8",
   "metadata": {
    "id": "7b9decb8"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from matplotlib import pyplot as plt\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "sns.set_theme(style=\"whitegrid\", context=\"notebook\")\n",
    "sns.set_palette(\"viridis\", n_colors=2)\n",
    "\n",
    "url = \"https://gitlab.com/labsysmed/dissecting-bias/-/raw/master/data/data_new.csv?inline=false\"\n",
    "df = pd.read_csv(url).copy()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "25a99bec-0b19-4545-a26f-a481bb71ce8c",
   "metadata": {
    "id": "25a99bec-0b19-4545-a26f-a481bb71ce8c"
   },
   "source": [
    "In this data, patients status in the intensive care program was informed by the value of the risk score. Risk scores in the 97th percentile and above (so, the top 3% of patients with the most predicted risk) were *automatically enrolled* in the program, while patients with risk scores in the 55th to 96th percentiles were *screened* for possible enrollment. Patients with risk scores below the 55th percentile were not considered for enrollment.\n",
    "\n",
    "For the meaning of the columns in the data frame, please refer to the [data dictionary](https://gitlab.com/labsysmed/dissecting-bias/-/blob/master/data/data_dictionary.md?ref_type=heads) supplied by the authors.\n",
    "\n",
    "### Exercise B1\n",
    "\n",
    "Please write a quick one-liner which adds the *percentile* risk score as an integer column in the data frame named “percentile_risk_score_t”. If a given patient has a risk score higher than 70% of all patients, their percentile risk score should be 70. While there are many good ways to do this, it’s possible to achieve this in one long-ish line using the `rank` method of pandas Series objects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "bd84151f",
   "metadata": {
    "id": "bd84151f"
   },
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61130a19-7bd9-4053-ada4-b5b864c5f56b",
   "metadata": {
    "id": "61130a19-7bd9-4053-ada4-b5b864c5f56b"
   },
   "source": [
    "### Exercise B2\n",
    "\n",
    "Now write code which adds three new boolean columns to the data frame named “not_enrolled”, “screened”, “auto_enrolled”, indicating whether a patient was (a) not enrolled in intensive care, (b) screened for possible enrollment, or (c) automatically enrolled. Use the percentile risk score to determine these values. Recall that you can use the `&` (and), `|` (or), and `~` (not) operators to combine boolean conditions in Pandas, but we be careful about the parentheses to ensure the desired order of operations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "22d08984-219d-4175-b811-c5cac66e6beb",
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f5410016-e069-499d-9545-fd264dfe0d69",
   "metadata": {
    "id": "f5410016-e069-499d-9545-fd264dfe0d69"
   },
   "source": [
    "So, how likely is a patient with some given medical characteristics and demographic profile to receive a high risk score? Rather than work with the many medical variables in the data set, we’ll follow the authors and focus on the number of active chronic illnesses a patient has in the year preceding the experiment, which are summarized in the variable `gagne_sum_tm1`. One might reasonably expect that patients with more chronic illnesses would be more likely to receive high risk scores.\n",
    "\n",
    "### Exercise B3\n",
    "\n",
    "Write code which produces line plots showing the proportion of patients recommended for intensive care (either screened or automatically enrolled) as a function of the number of chronic illnesses they have. Produce separate lines for Black and white patients, and separate panels for male and female patients. The vast majority of patients have 10 or fewer chronic illnesses, so for visualization purposes it’s fine to restrict the x-axis to this range.\n",
    "\n",
    "Some suggestions:\n",
    "\n",
    "-   Recall that Seaborn figure level functions have `row` and `col` arguments which create subplots (along the row or column, respectively) for the values of provided variables. The `hue` argument is similar, but puts multiple plots on one axes with different colors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6c7bd57f-32ba-448a-a06d-f9275823a577",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 500
    },
    "id": "5948efcf",
    "outputId": "05430d47-79fa-40a9-bb86-0a569a395987"
   },
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db3ae752-cd77-4a19-8422-70bbdaed95c7",
   "metadata": {
    "id": "db3ae752-cd77-4a19-8422-70bbdaed95c7"
   },
   "source": [
    "## Part C: Sources of Bias\n",
    "\n",
    "It is a common trope that bias in an automated decision system is a consequence of “biased data.” This is not necessarily false, but it is important to be specific about the details. It is also important to consider the *design decisions* that go into building and training these systems.\n",
    "\n",
    "The model used to assign risk scores to patients in this case was trained to predict *future healthcare costs* based on past medical and demographic data. The idea here is that “health risk” isn’t a well-defined, measurable concept, but healthcare *costs* are. So, an algorithm that can predict the healthcare *costs* incurred by a patient might be a useful proxy for predicting their *health risk*.\n",
    "\n",
    "### Exercise C1\n",
    "\n",
    "Let’s first check whether the total healthcare cost is truly correlated with risk score. Make a plot showing the average healthcare cost incurred by patients at each percentile of risk score. Use a logarithmic scale for the y-axis, since healthcare costs vary widely across patients."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e020b6f-1936-464c-8472-023eed15c5a6",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 501
    },
    "id": "b39f360f",
    "outputId": "e92b1f52-bfb7-428d-af72-3ffb6ddf32b9"
   },
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c217ad20-fc82-4e18-8637-68f16042ee00",
   "metadata": {
    "id": "c217ad20-fc82-4e18-8637-68f16042ee00"
   },
   "source": [
    "### Exercise C2\n",
    "\n",
    "The assumption of training a model on healthcare *costs* is that these costs should be correlated with their health *risks*. Make a plot showing the average healthcare cost incurred by patients as a function of the number of chronic illnesses they had in the previous year. You may wish to restrict the horizontal axis to patients with 5 or fewer chronic illnesses, since very few patients have more than this number."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f71ff486-063a-4958-b402-f5b58821d817",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 501
    },
    "id": "7517ce3e",
    "outputId": "aeb88016-d3f8-47b2-d736-3ce34423fd48"
   },
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "223206d1-18be-480c-b862-dc54fa2373b5",
   "metadata": {
    "id": "223206d1-18be-480c-b862-dc54fa2373b5"
   },
   "source": [
    "### Exercise C3\n",
    "\n",
    "Now make a similar line plot showing the average healthcare cost incurred by patients as a function of the number of chronic illnesses they had in the previous year, but this time produce separate lines for Black and white patients. Then a new text cell immediately below the plot (or edit the placeholder text) to briefly comment on your findings. What implications do these findings have for using healthcare costs as a proxy for health risk?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "84050d6b",
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 501
    },
    "id": "84050d6b",
    "outputId": "4ce01efa-fdac-451d-d4d9-395c019abe75"
   },
   "outputs": [],
   "source": [
    "# TODO: Your code here"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7a7eb56-d682-429c-8335-1d84afd99352",
   "metadata": {
    "id": "d7a7eb56-d682-429c-8335-1d84afd99352"
   },
   "source": [
    "*[TODO: Your response here]*\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5f580f16-6c80-4922-bd4e-8c4035fc3a45",
   "metadata": {
    "id": "5f580f16-6c80-4922-bd4e-8c4035fc3a45"
   },
   "source": [
    "## Collaboration statement\n",
    "\n",
    "In a new text cell immediately below this paragraph (or by editing this text cell to add a paragraph), briefly list who or what you collaborated with and how. Cite any sources here or with relevant inline comments in your code. Acknowledge all contributors, both people and AI, and what portions of this notebook they contributed. You do not need to cite or acknowledge your partner, nor any material provided in this starter file, lecture materials, etc.\n",
    "\n",
    "## Submitting your notebook\n",
    "\n",
    "Submit your completed notebook to the Data Ethics 2 - Activity assignment on Canvas."
   ]
  }
 ],
 "metadata": {
  "colab": {
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.13"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}