Lecture 20 - ML Evaluation and Experimentation: Baselines, Generalization 1¶
Goals - Generalization¶
The italicized ones are things we covered last time; today we'll cover the remaining parts:
- Understand the near-universal in-distribution assumption and its implications
- Know how true risk differs from empirical risk.
- *Know how to define bias, variance*, **irreducible error**.
- Be able to identify the most common causes of the above types of error, and explain how they relate to generalization, risk, overfitting, underfitting.
- Know why and how to create separate **validation** and **test** sets to evaluate a model
- Know how cross-validation works and why you might want to use it.
Empirical Risk vs True Risk¶
We use the word risk (or sometimes loss, or cost) to measure how "badly" a model fits the data.
In the case of a regression problem, we might measure the sum of squared distances from each $y$ value to the line's value at that $x$.
When fitting a model, what we truly care about is a quantity known as (true) risk: $R(h; {\cal X})$.
- True risk is the expected loss "in the wild"
- Depends on a probability distribution that we don't know: $P(x,y)$ -- the joint distribution of inputs and outputs.
- If we knew $P$, there's nothing left to "learn": let $\hat{y} = \arg\max_y P(y | x)$.
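To make the distinction concrete, here is a small simulation. The data source (a hypothetical linear relationship with Gaussian noise) is invented for illustration; in practice we cannot sample from $P(x,y)$ at will, which is exactly why empirical risk on a finite sample is all we ever get.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source: y = 2x + noise. Normally P(x, y) is unknown;
# here we simulate it so we can compare empirical and true risk.
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.5, n)
    return x, y

h = lambda x: 2 * x  # a fixed hypothesis (here, the true mean function)

# Empirical risk: average squared loss on a finite sample.
x, y = sample(50)
empirical_risk = np.mean((y - h(x)) ** 2)

# "True" risk approximated by a huge sample (normally unavailable).
x_big, y_big = sample(1_000_000)
true_risk = np.mean((y_big - h(x_big)) ** 2)  # close to 0.25, the noise variance

print(empirical_risk, true_risk)
```

The empirical risk fluctuates from sample to sample; the true risk is a fixed property of $h$ and $P$.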
Reminder of Bias and Variance from last time:
Bias¶
- The bias of the training process is how far $\bar{h}(x)$ is from the mean of $P(y|x)$.
- High bias implies something is keeping you from capturing true behavior of the source.
- Most common cause of bias? The model class is too restrictive aka too simple aka not powerful enough aka not expressive enough.
- E.g., if the true relationship is quadratic, using linear functions will have high bias.
- Training processes with high bias are prone to underfitting.
- Underfitting is when you fail to capture important phenomena in the input-output relationship, leading to higher risk.
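The quadratic-vs-linear example above can be simulated directly. The data source below is hypothetical; the point is that the linear model's error stays high no matter how much data it sees, because no line can capture the curvature. That persistent gap is bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source: the true relationship is quadratic, y = x^2 + noise.
x = rng.uniform(-2, 2, 200)
y = x**2 + rng.normal(0, 0.1, 200)

# Fit degree-1 (linear) and degree-2 models via least squares.
lin = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

# The linear model underfits: its error is dominated by bias,
# while the quadratic model's error is near the noise floor.
mse_lin = np.mean((np.polyval(lin, x) - y) ** 2)
mse_quad = np.mean((np.polyval(quad, x) - y) ** 2)

print(mse_lin, mse_quad)
```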
Variance¶
- The variance of a training process is the variance of the individual models $h_i(x)$; that is, how spread they are around $\bar{h}(x)$.
- This is a problem, because we only have one $h_i$, not $\bar{h}(x)$, so our model might be way off even if the average is good.
- Most common causes of variance?
- Too powerful/expressive a model, which is capable of overfitting the training set. Overfitting means memorizing, or being overly influenced by, noise in the training set.
- Small training set sizes ($N$).
- Higher irreducible error (noisier training set).
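We can watch variance happen by retraining the same model class many times on small, independently drawn training sets (a hypothetical sinusoidal source, chosen just for illustration) and looking at how spread out the resulting predictions are at one test point:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_once(deg, n=15):
    # A small, noisy training set from y = sin(pi*x) + noise (hypothetical source).
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return np.polyfit(x, y, deg)

# Predictions of many independently trained models at a fixed point x0.
x0 = 0.5
preds_simple = [np.polyval(train_once(deg=1), x0) for _ in range(200)]
preds_complex = [np.polyval(train_once(deg=9), x0) for _ in range(200)]

# The expressive model's predictions are far more spread out around their
# mean: that spread is the variance of the training process.
print(np.var(preds_simple), np.var(preds_complex))
```

In real life we only get one of those 200 models, so high spread means our single model may be far from the (possibly good) average model $\bar{h}(x)$.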
A third source of error/risk:
Irreducible Error¶
- Even with a zero-bias, zero-variance training process, you are still predicting the mean of $P(y|x)$, which is almost never exactly right.
- Because the truth is non-deterministic.
- This error that remains is the irreducible error.
- Source of irreducible error?
- Not having enough features, or not enough *relevant* features, in $x$.
- Note: this error is only irreducible for a given feature set (the information in $x$). If you change the problem to include more features, you can reduce irreducible error.
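A quick simulation (with an invented nondeterministic source) shows that even the *ideal* predictor, the true conditional mean $E[y|x]$, has nonzero risk, and that this floor equals the noise variance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical nondeterministic source: y = sin(x) + noise with std 0.4.
n = 200_000
x = rng.uniform(0, 3, n)
y = np.sin(x) + rng.normal(0, 0.4, n)

# The ideal predictor -- the true conditional mean E[y|x] -- still incurs loss:
ideal_pred = np.sin(x)
mse_ideal = np.mean((y - ideal_pred) ** 2)

# The residual risk is the irreducible error: here the noise variance, 0.4**2 = 0.16.
print(mse_ideal)
```

No model improvement can push the risk below this floor; only adding more informative features to $x$ (i.e., changing the problem) can.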
Identifying the model with the best generalization (i.e. lowest true risk)¶
- Answer: hold out a test set. Use this to estimate (true) risk.
- So we need a training set and a test set. But that is not enough in practice. Why?
- The more times you see results on the test set, the less representative it is as an estimate for $R$.
- Example: evaluate 10k random "models" on the test set; the best one will appear to beat chance purely by luck.
- We need a training set, a test set, and ideally a development or validation set.
- The validation set is a surrogate for the test set: use it for model selection and tuning, and touch the test set only once, for the final estimate of $R$.
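The "10k random models" point can be simulated. Each "model" below is a coin-flip classifier, so its true accuracy is exactly 50%; yet after enough peeks at the same test set, the best-looking one seems far better than chance. (The setup is an illustrative sketch, not a real experiment from the lecture.)

```python
import numpy as np

rng = np.random.default_rng(4)

# A small binary test set with 100 examples.
n_test = 100
test_labels = rng.integers(0, 2, n_test)

# 10,000 "models" that each guess labels uniformly at random:
# every one of them has true accuracy exactly 0.5.
best_acc = 0.0
for _ in range(10_000):
    preds = rng.integers(0, 2, n_test)
    best_acc = max(best_acc, np.mean(preds == test_labels))

# After 10k evaluations on the same test set, the best model *looks* well
# above chance, but its true risk is unchanged. This is why model selection
# should happen on a validation set, with the test set reserved for one
# final, untainted estimate of true risk.
print(best_acc)
```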