# Lab 6: Measuring Disability Bias in BERT

## Introduction
### What is BERT?
BERT (Bidirectional Encoder Representations from Transformers) is a family of powerful language models first developed in 2018. BERT represented a leap forward in NLP technology, offering large performance increases over the previous state-of-the-art techniques. BERT models lend themselves to a variety of applications and have become widely used in tasks such as search, text summarization, sentence classification, and translation.

### How does BERT relate to the machine learning concepts we've discussed so far?

BERT is a **generative** model that is trained to fit the distribution of natural language. One way to think of this is that BERT is modeling $P(X)$, where $X$ is all possible combinations of English words, phrases, sentences, paragraphs, etc.

BERT is trained in what's called a **self-supervised** manner. This really just means that it's **supervised**, but the ground truth "labels" come directly from the data itself rather than from a separate labeling effort. Another way to think of this is that self-supervised means ground truth that comes for "free".

In BERT's case, the model is trained to predict missing words in text sequences. If you have a sentence, you can "mask out" any word and then ask the model to predict the missing (masked) word. Because you chose to mask it out, you know the correct answer (i.e., ground truth).
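
To make the "free labels" idea concrete, here is a minimal sketch of how a single training example could be built from a plain sentence. The helper name `make_masked_example` is made up for illustration, and the sketch simplifies things: it masks one whole word chosen at random, whereas BERT's actual pre-training masks a fraction of sub-word tokens.

```python
import random

MASK = "[MASK]"  # the placeholder BERT uses for a hidden word


def make_masked_example(sentence, rng):
    """Hide one randomly chosen word; return (masked sentence, hidden word)."""
    words = sentence.split()
    i = rng.randrange(len(words))
    label = words[i]  # the ground truth comes from the text itself
    masked = words[:i] + [MASK] + words[i + 1:]
    return " ".join(masked), label


original = "the cat sat on the mat"
rng = random.Random(0)  # seeded so the example is repeatable
masked, label = make_masked_example(original, rng)
print(masked, "->", label)
```

No human ever had to annotate anything here: the "label" is just the word we chose to hide.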

### How does BERT work?
For the purposes of this lab, all you need to know is that if you give BERT a sentence with a missing word, it is capable of (indeed, very good at) predicting the missing word.

If you're curious to peek under the hood a little bit, you can learn more about BERT from these resources.

 - A 6-minute video introduction to what BERT is and how it can be used: <https://www.youtube.com/watch?v=ioGry-89gqE>.
 - [Illustrated guide to BERT](https://jalammar.github.io/illustrated-bert/)
 - [How do transformers create embeddings](https://www.baeldung.com/cs/transformer-text-embeddings)

### Statistical Bias vs Social Bias

It's important to note that the bias we're measuring in this lab is **not** the same as the statistical bias we discussed as a source of error in machine learning systems. That bias is measured under the assumption that the ground truth is correct and infallible; if statistical bias is high, it indicates that a model is unable to accurately fit the training data.

The bias we're looking at here might be called *social* bias, which arises in machine learning models not because they can't fit their training data, but actually because they *can*. The problem is that the training data itself originated from humans, who have their own biases (prejudices).

The social bias we're measuring here is defined as a prejudice in favor of or against one thing compared to another. As data scientists, we should be aware of the biases in our models - both statistical and social - and how they might affect use cases for those models. In this lab, we will recreate a published analysis of BERT's biases related to disability.

**Further Reading**
 - [Social Biases in NLP Models as Barriers for Persons with Disabilities](https://aclanthology.org/2020.acl-main.487/)
 - [Nakamura, Karen - "My Algorithms Have Determined You're Not Human: AI-ML, Reverse Turing-Tests, and the Disability Experience."](https://dl.acm.org/doi/10.1145/3308561.3353812)


## Setup


### Package Installations

We'll be using the [HuggingFace](https://huggingface.co/) `transformers` library, which makes BERT easy to use. The `datasets` library (also from HuggingFace) provides tools for loading and preparing datasets so we can run our models more efficiently, and we will use `nltk` for sentiment analysis.

In [None]:
!pip install datasets
!pip install transformers
!pip install nltk

### A Dataset to Probe for Bias

For this assignment, we will use a set of 5 datasets prepared for investigating biases within NLP systems. Each set is made up of sentences that follow the pattern: "The [identifying information] person [connecting verb] [MASK]". Each dataset uses different types of identifying information to test biases.

 - A: The person [connecting verb] [MASK]
 - B: The [disability referent] person [connecting verb] [MASK]
 - C: The [gender referent] [disability referent] person [connecting verb] [MASK]
 - D: The [race referent] [disability referent] person [connecting verb] [MASK]
 - E: The [race referent] [gender referent] [disability referent] person [connecting verb] [MASK]

For each sentence, we'll use BERT to predict what word should be in the location of [MASK]. Having done this, we will use a second model to determine the sentiment of the completed sentence, quantifying how positive or negative its meaning is. Our pessimistic hypothesis is that the sentences with disability-, gender-, and/or race-related referents will have more negative sentiment than those without.
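
To get a feel for how template sentences like these are put together, here is a small sketch that fills in the pattern for every combination of referent and verb. The word lists and the `build_sentences` helper are made up for illustration; the actual referents and verbs are whatever appears in the dataset's CSV files.

```python
from itertools import product

# Hypothetical word lists, for illustration only. The real referents and
# connecting verbs come from the dataset files.
disability_referents = ["deaf", "blind"]
connecting_verbs = ["does", "has"]


def build_sentences(referents, verbs):
    """Fill 'The [referent] person [verb] [MASK]' for every combination."""
    return [f"The {r} person {v} [MASK]" for r, v in product(referents, verbs)]


set_b_like = build_sentences(disability_referents, connecting_verbs)
for sentence in set_b_like:
    print(sentence)
```

Adding another referent slot multiplies the number of combinations, which is one reason a set with more referent types is so much larger than one with fewer.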

The dataset is provided in this git repository:

In [None]:
!git clone https://github.com/saadhassan96/ableist-bias.git

In [None]:
import pandas as pd
from datasets import load_dataset

# Load each of our datasets into a huggingface dataset class
A = load_dataset('csv', data_files='ableist-bias/A.csv')['train']
B = load_dataset('csv', data_files='ableist-bias/B.csv')['train']
C = load_dataset('csv', data_files='ableist-bias/C.csv')['train']
D = load_dataset('csv', data_files='ableist-bias/D.csv')['train']
E = load_dataset('csv', data_files='ableist-bias/E.csv')['train']

Now that we have the datasets loaded, we can take a peek at what the data looks like:

In [None]:
B.to_pandas().iloc[:5]

Notice that the sentence we'll be feeding into BERT is in the "Sentence" column.

### BERT Setup

Next up we need to get BERT ready. Thanks to HuggingFace's [Pipeline](https://huggingface.co/docs/transformers/main_classes/pipelines) this is pretty simple.

In [None]:
# Import pipeline method
from transformers import pipeline

# Set Bert as the predictive model
bert = pipeline(
    # The pipeline's task
    "fill-mask",
    # Which model the pipeline should download and use
    model="distilbert-base-uncased",
    # Run on the first GPU (use device=-1 to run on the CPU instead)
    device=0,
    # How many predictions it should return
    top_k=10
    )

Don't worry about the details here; the result of this setup is that we have a variable `bert` that's callable and can predict masked words as follows:

In [None]:
bert("Hi! How are you [MASK]", top_k=5)

I just asked for the 5 most likely things to fill in for [MASK], and got some sensible answers.

## Predict masked words



We'll now use BERT to predict what words belong in place of the mask token in the dataset sentences. Let's try using it to see what kind of results we get.

In [None]:
B[2]["Sentence"]

In [None]:
B_pred = bert(B[2]["Sentence"], top_k = 10)
for sentence in B_pred:
  print(sentence['sequence'])

In [None]:
B_pred[0]

Here's a function that uses some slightly hairy pandas munging to get a DataFrame with the top 10 predictions for one of the datasets (A, B, C, D, E):

In [None]:
def bert_predictions(data, k):
    """ Predict k sentence completions using bert for each sentence in data.
    Returns a DataFrame with Sentence and Predictions column, one row per prediction."""
    preds = pd.DataFrame({
          "Input Sentence": data["Sentence"],
          "Prediction": bert(data["Sentence"], top_k=k)
        })
    preds = preds.explode("Prediction")
    preds["Word"] = preds["Prediction"].apply(lambda x: x['token_str'])
    preds["Prediction"] = preds["Prediction"].apply(lambda x: x['sequence'])
    return preds
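
If the munging above is hard to follow, this toy version may help. It replaces the call to `bert` with a hand-written stand-in (`fake_output`, with made-up words) that has the same shape as the pipeline's output for a list of sentences: one list of prediction dicts per sentence. Only the `token_str` and `sequence` keys are included, since those are the ones the function reads.

```python
import pandas as pd

# Hand-written stand-in for the pipeline output; the predictions are made up.
fake_output = [
    [{"token_str": "well", "sequence": "the person does well"},
     {"token_str": ".", "sequence": "the person does ."}],
    [{"token_str": "fun", "sequence": "the person has fun"},
     {"token_str": "it", "sequence": "the person has it"}],
]

toy = pd.DataFrame({
    "Input Sentence": ["The person does [MASK]", "The person has [MASK]"],
    "Prediction": fake_output,  # each cell holds a list of dicts
})
toy = toy.explode("Prediction")  # now one row per (sentence, prediction) pair
toy["Word"] = toy["Prediction"].apply(lambda p: p["token_str"])
toy["Prediction"] = toy["Prediction"].apply(lambda p: p["sequence"])
print(toy)
```

Note that `explode` repeats the original row index for every prediction, so rows that came from the same input sentence share an index value.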

Running this on the smallest dataset, A, gives us the following:

In [None]:
bert_predictions(A, 10)

There are 10 rows in this DataFrame for each of the 14 sentences in A - one for each of the top 10 mask predictions from BERT.

Now let's run this on all 5 datasets. Note that each dataset is significantly larger than the last, so this will take a few minutes. Feel free to read on while this is running.

In [None]:
all_preds = [bert_predictions(dataset, 10) for dataset in (A, B, C, D, E)]

**TODO 1** Concatenate the resulting dataframes into a single DataFrame, keeping track of which dataset each row originated from. Depending on how you do this, you might end up with a new column for the original dataset or a MultiIndex, which represents a hierarchical indexing structure.

Call your new dataframe `preds` so that future cells can refer to it.

In [None]:
# TODO 1

Since the predictions were expensive to compute, let's save them to a CSV file so we don't have to keep running the predictions if we leave the session.

In [None]:
preds.to_csv('preds.csv', index_label=False)

If we need to read this back later, we should be able to just:

In [None]:
# preds = pd.read_csv('preds.csv')
# preds

### Remove Punctuation and Stopword results

Recall from our early test run of BERT that we got several results that were either punctuation or a [stop word](https://en.wikipedia.org/wiki/Stop_word). These results aren't very useful to us since they result in incomplete sentences. To remedy this, we'll write a function to filter them out of our predictions.

**TODO 2** Write a function called `is_useful` that takes the predicted word and returns `False` if either:
* The predicted word is a stopword, or
* The predicted word is a punctuation character,

and returns `True` otherwise.

You'll probably find it useful to use `spacy` or `nltk` for stopwords. For punctuation, Python's builtin `string.punctuation` gets all the ASCII punctuation characters, but BERT seems to predict other Unicode punctuation as well. This [stackoverflow post](https://stackoverflow.com/questions/60983836/complete-set-of-punctuation-marks-for-python-not-just-ascii) seems to give the standard solution for checking if a character is punctuation.
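
As a hint for the punctuation half only, here is a minimal sketch of a Unicode-aware check based on character categories from the standard library's `unicodedata` module. The helper name `is_punctuation_char` is made up for this example, and this is one possible approach rather than the required one.

```python
import string
import unicodedata


def is_punctuation_char(ch):
    """True if the single character ch is in a Unicode punctuation category."""
    # All Unicode punctuation categories start with "P" (Po, Pd, Ps, Pe, ...).
    return unicodedata.category(ch).startswith("P")


# Compare the ASCII-only check with the Unicode-aware one.
for ch in ["!", "\u201d", "\u00bf", "$", "a"]:
    print(repr(ch), ch in string.punctuation, is_punctuation_char(ch))
```

The two checks disagree in both directions: the curly quote and inverted question mark are missing from `string.punctuation`, while `$` is in `string.punctuation` but counts as a symbol rather than punctuation in Unicode. Decide which behavior you want before building your filter.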

In [None]:
# TODO 2

**TODO 3** Using your `is_useful` function, filter out rows that correspond to stopword or punctuation predictions. I'd recommend looking through at least a few tens of results to make sure the filter matches your expectations. After my own filtering, the number of rows drops from about 212k to about 79k.

In [None]:
# TODO 3

## Sentiment Analysis

Now that we have our filtered BERT predictions, we'll get set up to do sentiment analysis on the resulting sentences. Here we'll use a separate model called [VADER](https://ojs.aaai.org/index.php/ICWSM/article/view/14550) that can determine the sentiment of a sentence. Given a sentence, it returns a "polarity" score that represents how positive or negative the sentence is. This score is in the range [-1.0, 1.0], with negative scores representing negative sentiment and positive scores representing positive sentiment.

**Further Reading**
 - [VADER Sentiment Analysis Explained](https://medium.com/@piocalderon/vader-sentiment-analysis-explained-f1c4f9101cd9)

In [None]:
import nltk
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()

That's it - let's try it out:

In [None]:
vader.polarity_scores("Today is a good day.")

VADER gives us several numbers, but the `compound` score is the single number that attempts to summarize the overall polarity on a scale from -1 to 1.

In [None]:
vader.polarity_scores("Today is a good day.")["compound"]

In [None]:
vader.polarity_scores("Today is a terrible day.")["compound"]

In [None]:
def get_polarity(sentence):
  return vader.polarity_scores(sentence)["compound"]
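
If you're wondering why the compound score always lands between -1 and 1, here is a sketch of the squashing step as I understand it from the VADER reference code: the word-level valence scores are summed, and the sum is passed through x / sqrt(x² + α) with α = 15. Treat the details as an assumption and check the VADER source if they matter for your analysis.

```python
import math


def normalize(raw_score, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return raw_score / math.sqrt(raw_score * raw_score + alpha)


# Larger raw sums get pushed toward -1 or 1 but never reach them.
for raw in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(raw, round(normalize(raw), 4))
```

One consequence is that compound scores are not linear in the underlying valence: sentences with many strongly charged words saturate near the ends of the range.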

**TODO 4** Apply sentiment analysis to every prediction in our table and add a new column "Sentiment" with the predicted polarity.

In [None]:
# TODO 4

## Statistical Analysis

Now that we've gotten all of our predictions on the data, we can do some basic analysis to quantify bias.

**TODO 5** Start by computing summary statistics of the sentiment predictions (at least mean and standard deviation) per dataset.

In [None]:
# TODO 5

It looks like there is a notable difference in average sentiment when ability-related language is added to the sentences.

**TODO 6** Make a nice (in the spirit of Lab 3) plot illustrating the polarity scores present in the five different datasets. The type and design of the plot is up to you. Provide your interpretation of what the plot shows.

In [None]:
# TODO 6


**TODO 7** Finally, find and display the 15 most commonly predicted words for each of the five datasets.

In [None]:
# TODO 7

## Discussion

**TODO 8** Please write a brief (1 or 2 paragraphs at the most) discussion of each of the following questions.

 * Q1. Please look through the cards on [Tarot Cards of Tech](https://tarotcardsoftech.artefactgroup.com/). Pick any two (such as "The Smash Hit" and "The Service Dog") and write about how they each might apply to BERT.

 * Q2. With the work we've done now, where do you think the biases in BERT come from? What caused these biases to form?
 
 * Q3. Now that you've seen examples of bias in an NLP model, what kind of biases or ethical problems do you think other machine learning models or AI applications could have? For example other language models such as the one used in ChatGPT, or other models entirely such as those relating to image recognition/generation, social media analysis, speech recognition, etc.

 * Q4. Based on your answer from Q3, how might you show that these biases exist in the model/application?

## Reflection
**TODO 9** Now that you've worked through this assignment, please write a 1-2 paragraph reflection on what you've learned. Has your view on the ethics of machine learning models changed? What technical knowledge have you gained?

## Acknowledgements
 - This lab is heavily based on a notebook developed by Pax Newman under the direction of Yasmine Elglaly and Yudong Liu. Any awesomeness is due to them; any errors are due to my adaptations.
 - The analysis here is inspired by the paper [Unpacking the Interdependent Systems of Discrimination: Ableist Bias in NLP Systems through an Intersectional Lens](https://arxiv.org/abs/2110.00521)
 - The sentence datasets (introduced in the above paper) can be found here: [ableist-bias dataset](https://github.com/saadhassan96/ableist-bias)



## Extra Credit

Extend the analysis in some interesting way. This might be something like looking at the effects of specific intersectional categories, or addressing shortcomings of the existing analysis to make it more convincing. You can earn up to 3 points of extra credit, and as usual each point is exponentially more difficult to earn.