Lecture 20 - ML Evaluation and Experimentation: Baselines, Generalization 1¶

Goals - Generalization¶

The italicized ones are things we covered last time; today we'll cover the remaining parts:

  • Understand the near-universal in-distribution assumption and its implications
  • Know how true risk differs from empirical risk.
  • Know how to define bias, variance, and **irreducible error**.
  • Be able to identify the most common causes of the above types of error, and explain how they relate to generalization, risk, overfitting, and underfitting.
  • Know why and how to create separate **validation** and **test** sets to evaluate a model.
  • Know how cross-validation works and why you might want to use it.

Empirical Risk vs True Risk¶

We use the word risk (or sometimes loss, or cost) to measure how "badly" a model fits the data.

In the case of a regression problem, we might measure the sum of squared distances from each $y$ value to the line's value at that $x$.
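
To make this concrete, here is a minimal sketch (the function and toy data are my own, not from the lecture) of empirical risk under squared loss for a line:

```python
def empirical_risk(w, b, xs, ys):
    """Mean squared error of the line y = w*x + b on a dataset."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Toy data roughly following y = x
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]
risk = empirical_risk(1.0, 0.0, xs, ys)  # risk of the line y = x on this sample
```

Note this is *empirical* risk: an average over a finite sample, not an expectation over the true distribution $P(x, y)$.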

When fitting a model, what we truly care about is a quantity known as (true) risk: $R(h; {\cal X})$.

  • True risk is the expected loss "in the wild"
  • Depends on a probability distribution that we don't know: $P(x,y)$ -- the joint distribution of inputs and outputs.
    • If we knew $P$, there's nothing left to "learn": let $\hat{y} = \arg\max_y P(y | x)$.
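
To illustrate the last bullet with a toy binary-classification sketch (the distribution here is made up): if $P(y|x)$ were known, prediction would be a lookup, not learning.

```python
# Hypothetical known conditional distribution P(y=1 | x) for a binary label
def p1_given_x(x):
    return 0.2 if x < 0.5 else 0.9

def bayes_predict(x):
    # Nothing left to learn: pick the label with the highest probability under P
    return 1 if p1_given_x(x) >= 0.5 else 0
```

Here `bayes_predict(0.3)` returns 0 and `bayes_predict(0.8)` returns 1; the error this predictor still makes is exactly the irreducible error discussed below.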

Reminder of Bias and Variance from last time:

Bias¶

  • The bias of the training process is how far $\bar{h}(x)$ is from the mean of $P(y|x)$.
  • High bias implies something is keeping you from capturing true behavior of the source.
  • Most common cause of bias? The model class is too restrictive: too simple, not powerful or expressive enough to represent the true relationship.
    • E.g., if the true relationship is quadratic, using linear functions will have high bias.
  • Training processes with high bias are prone to underfitting.
    • Underfitting is when you fail to capture important phenomena in the input-output relationship, leading to higher risk.
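
The quadratic-vs-linear example above can be simulated (my own sketch, using numpy): even with lots of data, a linear fit to quadratic data keeps a large residual risk.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2 + rng.normal(0, 0.05, x.size)  # true relationship is quadratic

# Linear model class: too restrictive, so bias stays high even with 10k points
w, b = np.polyfit(x, y, deg=1)
lin_risk = np.mean((y - (w * x + b)) ** 2)

# Quadratic model class: expressive enough to capture the truth
a2, a1, a0 = np.polyfit(x, y, deg=2)
quad_risk = np.mean((y - (a2 * x ** 2 + a1 * x + a0)) ** 2)
```

The linear fit's risk is dominated by bias (a line cannot bend), while the quadratic fit's risk is close to the noise level.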

Variance¶

  • The variance of a training process is the variance of the individual models $h_i(x)$; that is, how spread they are around $\bar{h}(x)$.
  • This is a problem, because we only have one $h_i$, not $\bar{h}(x)$, so our model might be way off even if the average is good.
  • Most common causes of variance?
    • Too powerful/expressive a model class, which is capable of overfitting the training set. Overfitting means memorizing, or being overly influenced by, noise in the training set.
    • Small training set sizes ($N$).
    • Higher irreducible error (noisier training set).
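
Variance can be measured directly in simulation (my own sketch): train the same model class on many fresh, small training sets and look at how much the predictions at one fixed input spread.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_and_predict(degree, n_train=15, x_query=0.5):
    """Fit a polynomial of the given degree to a fresh noisy sample of
    sin(2*pi*x), then predict at x_query."""
    x = rng.uniform(0, 1, n_train)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n_train)
    return np.polyval(np.polyfit(x, y, degree), x_query)

# Spread of predictions across 200 independent training sets
var_simple = np.var([train_and_predict(degree=1) for _ in range(200)])
var_complex = np.var([train_and_predict(degree=9) for _ in range(200)])
# The more expressive model's predictions spread far more on small samples
```

This shows both causes at once: the degree-9 model (more expressive) trained on only 15 points has much higher variance than the degree-1 model.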

A third source of error/risk:

Irreducible Error¶

  • Even if you have a zero-bias, zero-variance training process, you then predict the mean of $P(y|x)$, which is almost never exactly right.
    • Because the truth is non-deterministic.
    • This error that remains is the irreducible error.
  • Source of irreducible error?
    • Not having enough features, or not having relevant enough features, in $x$.
  • Note: this error is only irreducible for a given feature set (the information in $x$). If you change the problem to include more features, you can reduce irreducible error.
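
A sketch of the irreducible floor (my own toy distribution): even the ideal predictor, the mean of $P(y|x)$, is left with exactly the noise variance.

```python
import random

random.seed(0)
n = 100_000
total = 0.0
for _ in range(n):
    x = random.uniform(0, 1)
    y = 3 * x + random.gauss(0, 0.2)  # truth is non-deterministic given x
    y_hat = 3 * x                     # zero-bias, zero-variance predictor: the mean of P(y|x)
    total += (y - y_hat) ** 2

irreducible = total / n  # close to 0.2**2 = 0.04, the noise variance
```

No predictor that sees only $x$ can beat roughly 0.04 here; only adding a feature that explains the noise would lower the floor.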

Identifying the model with the best generalization (i.e. lowest true risk)¶

  • Answer: hold out a test set. Use this to estimate (true) risk.
  • So we need a training set and a test set. But that is not enough in practice. Why?
    • The more times you look at results on the test set, the less representative it becomes as an estimate of $R$: you start selecting for models that happen to do well on that particular sample.
    • Example: evaluate 10k random "models" and pick the best; it will look good on the test set by chance alone.
  • We need a training set, a test set, and ideally a development or validation set.
    • This set is a surrogate for the test set: use it for model selection and hyperparameter tuning, so the test set is only looked at once, at the end.
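
The "10k random models" example can be simulated (my own sketch): evaluate many random guessers on one fixed test set, and the best one looks far better than chance.

```python
import random

random.seed(0)
n_test = 100
y_test = [random.randint(0, 1) for _ in range(n_test)]

best_acc = 0.0
for _ in range(10_000):
    # Each "model" guesses labels uniformly at random: true accuracy is 0.5
    guesses = [random.randint(0, 1) for _ in range(n_test)]
    acc = sum(g == t for g, t in zip(guesses, y_test)) / n_test
    best_acc = max(best_acc, acc)
# best_acc lands well above 0.5 purely by chance, so the test-set score of the
# selected "model" no longer estimates its true risk
```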
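
A minimal three-way split sketch (the proportions and helper name are my own choices, not a prescribed recipe):

```python
import random

def train_val_test_split(data, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle, then split into train / validation / test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]                # look at this only once, at the very end
    val = data[n_test:n_test + n_val]   # reuse this for model selection / tuning
    train = data[n_test + n_val:]       # fit models on this
    return train, val, test

train, val, test = train_val_test_split(range(1000))  # sizes 700 / 150 / 150
```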