Lecture 20 - ML Evaluation and Experimentation: Baselines, Generalization 1¶
Goals - Generalization¶
The italicized ones are things we covered last time; today we'll cover the remaining parts:
- Understand the near-universal in-distribution assumption and its implications
- Know how true risk differs from empirical risk.
- *Know how to define bias, variance*, **irreducible error**.
- Be able to identify the most common causes of the above types of error, and explain how they relate to generalization, risk, overfitting, underfitting.
- Know why and how to create separate **validation** and **test** sets to evaluate a model
- Know how cross-validation works and why you might want to use it.
Empirical Risk vs True Risk¶
We use the word risk (or sometimes loss, or cost) to measure how "badly" a model fits the data.
In the case of a regression problem, we might measure the sum of squared distances from each $y$ value to the line's value at that $x$.
When fitting a model, what we truly care about is a quantity known as (true) risk: $R(h; {\cal X})$.
- True risk is the expected loss "in the wild"
- Depends on a probability distribution that we don't know: $P(x,y)$ -- the joint distribution of inputs and outputs.
- If we knew $P$, there's nothing left to "learn": let $\hat{y} = \arg\max_y P(y | x)$.
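To make the distinction concrete, here is a small simulation. The data source (a hypothetical linear relationship with Gaussian noise) is invented for illustration; in practice we cannot sample from $P(x,y)$ at will, which is exactly why empirical risk on a finite sample is all we ever get.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical source: y = 2x + noise. Normally P(x, y) is unknown;
# here we simulate it so we can compare empirical and true risk.
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.normal(0, 0.5, n)
    return x, y

h = lambda x: 2 * x  # a fixed hypothesis (here, the true mean function)

# Empirical risk: average squared loss on a finite sample.
x, y = sample(50)
empirical_risk = np.mean((y - h(x)) ** 2)

# "True" risk approximated by a huge sample (normally unavailable).
x_big, y_big = sample(1_000_000)
true_risk = np.mean((y_big - h(x_big)) ** 2)  # close to 0.25, the noise variance

print(empirical_risk, true_risk)
```

The empirical risk fluctuates from sample to sample; the true risk is a fixed property of $h$ and $P$.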
Reminder of Bias and Variance from last time:
Bias¶
- The bias of the training process is how far $\bar{h}(x)$ is from the mean of $P(y|x)$.
- High bias implies something is keeping you from capturing true behavior of the source.
- Most common cause of bias? The model class is too restrictive aka too simple aka not powerful enough aka not expressive enough.
- E.g., if the true relationship is quadratic, using linear functions will have high bias.
- Training processes with high bias are prone to underfitting.
- Underfitting is when you fail to capture important phenomena in the input-output relationship, leading to higher risk.
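The quadratic-vs-linear example above can be simulated directly. The data source below is hypothetical; the point is that the linear model's error stays high no matter how much data it sees, because no line can capture the curvature. That persistent gap is bias.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source: the true relationship is quadratic, y = x^2 + noise.
x = rng.uniform(-2, 2, 200)
y = x**2 + rng.normal(0, 0.1, 200)

# Fit degree-1 (linear) and degree-2 models via least squares.
lin = np.polyfit(x, y, deg=1)
quad = np.polyfit(x, y, deg=2)

# The linear model underfits: its error is dominated by bias,
# while the quadratic model's error is near the noise floor.
mse_lin = np.mean((np.polyval(lin, x) - y) ** 2)
mse_quad = np.mean((np.polyval(quad, x) - y) ** 2)

print(mse_lin, mse_quad)
```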
Variance¶
- The variance of a training process is the variance of the individual models $h_i(x)$; that is, how spread they are around $\bar{h}(x)$.
- This is a problem, because we only have one $h_i$, not $\bar{h}(x)$, so our model might be way off even if the average is good.
- Most common causes of variance?
- Too powerful/expressive a model, which is capable of overfitting the training set. Overfitting means memorizing, or being overly influenced by, noise in the training set.
- Small training set sizes ($N$).
- Higher irreducible error (noisier training set).
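We can watch variance happen by retraining the same model class many times on small, independently drawn training sets (a hypothetical sinusoidal source, chosen just for illustration) and looking at how spread out the resulting predictions are at one test point:

```python
import numpy as np

rng = np.random.default_rng(2)

def train_once(deg, n=15):
    # A small, noisy training set from y = sin(pi*x) + noise (hypothetical source).
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return np.polyfit(x, y, deg)

# Predictions of many independently trained models at a fixed point x0.
x0 = 0.5
preds_simple = [np.polyval(train_once(deg=1), x0) for _ in range(200)]
preds_complex = [np.polyval(train_once(deg=9), x0) for _ in range(200)]

# The expressive model's predictions are far more spread out around their
# mean: that spread is the variance of the training process.
print(np.var(preds_simple), np.var(preds_complex))
```

In real life we only get one of those 200 models, so high spread means our single model may be far from the (possibly good) average model $\bar{h}(x)$.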
A third source of error/risk:
Irreducible Error¶
- Even with a zero-bias, zero-variance training process, you are still predicting the mean of $P(y|x)$, which is almost never exactly right.
- Because the truth is non-deterministic.
- This error that remains is the irreducible error.
- Source of irreducible error?
- Not having enough features, or not enough *relevant* features, in $x$.
- Note: this error is only irreducible for a given feature set (the information in $x$). If you change the problem to include more features, you can reduce irreducible error.
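A quick simulation (with an invented nondeterministic source) shows that even the *ideal* predictor, the true conditional mean $E[y|x]$, has nonzero risk, and that this floor equals the noise variance:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical nondeterministic source: y = sin(x) + noise with std 0.4.
n = 200_000
x = rng.uniform(0, 3, n)
y = np.sin(x) + rng.normal(0, 0.4, n)

# The ideal predictor -- the true conditional mean E[y|x] -- still incurs loss:
ideal_pred = np.sin(x)
mse_ideal = np.mean((y - ideal_pred) ** 2)

# The residual risk is the irreducible error: here the noise variance, 0.4**2 = 0.16.
print(mse_ideal)
```

No model improvement can push the risk below this floor; only adding more informative features to $x$ (i.e., changing the problem) can.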
Identifying the model with the best generalization (i.e. lowest true risk)¶
- Answer: hold out a test set. Use this to estimate (true) risk.
- So we need a training set and a test set. But that is not enough in practice. Why?
- The more times you see results on the test set, the less representative it is as an estimate for $R$.
- Example: evaluate 10k random "models" on the test set; the best one will appear to beat chance purely by luck.
- We need a training set, a test set, and ideally a development or validation set.
- The validation set is a surrogate for the test set: use it for model selection and tuning, and touch the test set only once, for the final estimate of $R$.
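The "10k random models" point can be simulated. Each "model" below is a coin-flip classifier, so its true accuracy is exactly 50%; yet after enough peeks at the same test set, the best-looking one seems far better than chance. (The setup is an illustrative sketch, not a real experiment from the lecture.)

```python
import numpy as np

rng = np.random.default_rng(4)

# A small binary test set with 100 examples.
n_test = 100
test_labels = rng.integers(0, 2, n_test)

# 10,000 "models" that each guess labels uniformly at random:
# every one of them has true accuracy exactly 0.5.
best_acc = 0.0
for _ in range(10_000):
    preds = rng.integers(0, 2, n_test)
    best_acc = max(best_acc, np.mean(preds == test_labels))

# After 10k evaluations on the same test set, the best model *looks* well
# above chance, but its true risk is unchanged. This is why model selection
# should happen on a validation set, with the test set reserved for one
# final, untainted estimate of true risk.
print(best_acc)
```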