Lecture 21 - Generalization 2; Evaluating ML models: Baselines, Regression Metrics¶

Announcements¶

  • Last candidate talk today:
    • 4pm CF 025 - Teaching Demo: SSL/TLS: From Basics to Best Practices
  • FP Milestone now due Sunday night
  • A quick refresher/retry on "explained variance"?

Goals¶

Generalization, Continued:

  • Know why and how to create separate validation and test sets to evaluate a model
  • Know how cross-validation works and why you might want to use it.

ML Evaluation 1:

  • Know the basic pieces necessary to evaluate a prediction system.
  • Know how to come up with good baselines for a variety of prediction tasks.
  • Know how to think about errors in regression problems, and a few ways they can be measured.
    • Regression: absolute, relative, squared; MSE, RMSE, MAE, coefficient of determination
    • Later: classification

Generalization, Continued¶

Reminder:

  • Underfitting - high bias, model does not fit data well
  • Overfitting - high variance, model fits specifics of the training set too well

Tools in the fight against overfitting:

  • Keep your model simple even if a more powerful model appears to give better performance.
  • Bias your model towards simpler solutions (called regularization; no time to cover this here)
  • Key idea: hold back some data so it's artificially "unseen"

(Whiteboard)
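A minimal sketch of the "hold back some data" idea taken one step further: k-fold cross-validation, where each fold takes a turn as the held-out set. This assumes scikit-learn; the data and model below are synthetic placeholders, not the examples from class.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic placeholder data and a simple model
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# 5-fold cross-validation: each fold takes a turn as the held-out set
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print(scores)         # one score per fold (R^2 by default for regressors)
print(scores.mean())  # average performance across folds
```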

So you've made some predictions...¶

How good are they? Assume we're in a supervised setting, so we have some ground truth labels for our training and validation data. Should you call it good and present your results, or keep tweaking your model?

You need an evaluation environment. What do you need to make this?

  • Data splits: train, val[idation], and test (terminology varies; the book confusingly calls these train, test, and evaluation)
  • Evaluation metrics: hard numbers that you can compare from one run to the next
  • Baselines: simple approaches that hint at how hard the problem is, and how well you can expect to do
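One hedged way to carve out the three splits, assuming scikit-learn; the split sizes and synthetic data are just illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data; in practice X and y are your features and labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.normal(size=100)

# Peel off a test set first, then split the remainder into train and validation
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25 of 80% = 20% overall

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```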

Make it convenient; make it informative¶

With good reason, the book recommends that you package all your evaluation machinery into a single-command program (this could also be a single notebook or sequence of cells in a notebook).

You should output your candidate model's performance:

  • on all the relevant performance metrics
  • in comparison with your baselines and other candidate models

It's also a good idea to output:

  • Statistics and/or distributions of errors - Do you have lots of small errors and a few big ones? All medium-sized errors? One giant outlier?
  • If your data has natural categories or segments, break the errors out by categories:
    • Looking at data over 10 years? Check if your errors are getting better or worse with time.
    • Multiclass classification? Look at your accuracy on each class.
    • etc.
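A rough sketch of what one piece of such a single-command evaluation might look like. The function name, the segment grouping, and the tiny arrays are my own placeholders, not the book's.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error

def evaluate(y_true, y_pred, segments=None, name="model"):
    """Print overall metrics, the error distribution, and an optional per-segment breakdown."""
    errors = np.asarray(y_true) - np.asarray(y_pred)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    print(f"{name}: MAE={mae:.3f}  RMSE={rmse:.3f}")
    # Distribution of errors: lots of small ones? one giant outlier?
    print("  |error| quantiles (50/90/99%):", np.quantile(np.abs(errors), [0.5, 0.9, 0.99]))
    # Break errors out by category/segment (e.g., year or class) if provided
    if segments is not None:
        print(pd.Series(np.abs(errors)).groupby(np.asarray(segments)).mean())

# Tiny made-up example: a candidate model vs. a mean-prediction baseline
y_true = np.array([2.0, 3.0, 5.0, 10.0])
evaluate(y_true, np.array([2.5, 3.0, 4.0, 9.0]), name="candidate")
evaluate(y_true, np.full(4, y_true.mean()), name="mean baseline")
```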

Baselines¶

The first rule of machine learning is to start without machine learning. (Google says so, so it must be true.)

Why?

  • If you aren't learning from data, you can't overfit to it.
  • It gives you hints about how hard your problem is, putting your model's performance in perspective.

Example: you are a computer vision expert working on biomedical image classification, trying to predict whether an MRI scan shows a tumor or not. Your training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).

Baseline Brainstorm¶

Example prediction problems:

  • Biomedical image classification: predict whether an MRI scan shows a tumor or not.

    • Training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).
  • Spam email classification: predict whether a message is spam.

    • Training data contains equal numbers of spam (positive) and non-spam (negative) examples.
  • Weather prediction: given all weather measurements from today and prior,

    • Predict whether it will rain tomorrow
    • Predict the amount of rainfall tomorrow
  • Body measurements: predict leg length given height

Ideas?

  • Weather: average rain chance over all time
  • Always say "no"
  • Randomly choose a label

What generic strategies can we extract from the above?


My ideas:

  • Guess randomly
  • Guess the mean/median/mode
  • Single-feature model
  • Linear regression
  • History repeats itself
  • Special mention: upper-bound baselines
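Several of these generic strategies are available off the shelf. A brief sketch using scikit-learn's dummy estimators, with synthetic data standing in for the examples above:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

rng = np.random.default_rng(0)
X = np.zeros((100, 1))                                # dummy models ignore the features
y_class = rng.choice([0, 1], size=100, p=[0.9, 0.1])  # imbalanced labels, like the tumor example
y_reg = rng.normal(loc=10.0, size=100)

# "Always say no" / guess the mode
majority = DummyClassifier(strategy="most_frequent").fit(X, y_class)
print(majority.score(X, y_class))  # ~0.9 accuracy without learning anything

# Guess randomly, matching the training label distribution
DummyClassifier(strategy="stratified").fit(X, y_class)

# Guess the mean (or median) for regression
DummyRegressor(strategy="mean").fit(X, y_reg)
```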

Evaluation Metrics¶

So you've made some predictions... how good are they?

Let's consider regression first. Our model is some function that maps an input datapoint to a numerical value:

$y_i^\mathrm{pred} = f(x_i)$

and we have a ground-truth value $y_i^\mathrm{true}$ for $x_i$.

How do we measure how wrong we are?

  • Error is pretty simple to define:

    $y_i^\mathrm{true} - y_i^\mathrm{pred}$

  • But we want to evaluate our model on the whole train or val set. Average error is a bad idea, because positive and negative errors cancel each other out:

    $\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)$

  • Absolute error solves this problem:

    $|y_i^\mathrm{true} - y_i^\mathrm{pred}|$

  • Mean absolute error measures performance on a whole train or val set:

    $\frac{1}{n} \sum_i |y_i^\mathrm{true} - y_i^\mathrm{pred}|$

  • Squared error disproportionately punishes larger errors. This may or may not be desirable.

    $\left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$

  • Mean squared error (MSE) does the same over a collection of training examples:

    $\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$

  • MSE becomes more interpretable if you take its square root, because the result is in the same units as the target values. This gives us Root Mean Squared Error (RMSE):

    $\sqrt{ \frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2}$
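All of these are one-liners to compute. A quick sketch with numpy; the arrays here are made up:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # made-up ground truth
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # made-up predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, back in the target's units
print(mae, mse, rmse)
```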

Problem with any of the above:

You can make your error metric go as small as you want! Just scale the data: $$ X \leftarrow X / k $$ $$ \mathbf{y}^\mathrm{true} \leftarrow \mathbf{y}^\mathrm{true} / k $$ $$ \mathbf{y}^\mathrm{pred} \leftarrow \mathbf{y}^\mathrm{pred} / k $$
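A quick check that this really happens; rescaling by an arbitrary factor $k$ shrinks RMSE by that same factor, even though the predictions are no better:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 8.0])
y_pred = np.array([11.0, 11.0, 9.0])
rmse = lambda t, p: np.sqrt(np.mean((t - p) ** 2))

k = 100.0
print(rmse(y_true, y_pred))          # original RMSE
print(rmse(y_true / k, y_pred / k))  # 100x smaller, from the "same" predictions
```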

Also: is predicting 10 when the true value is 12 a bigger error than predicting 1 when the true value is 2?

Solutions:

  • Relative error:

    $\frac{|y_i^\mathrm{true} - y_i^\mathrm{pred}|}{|y_i^\mathrm{true}|}$

  • Coefficient of determination ($R^2$):

    • Let $\bar{y}$ be the mean of $\mathbf{y}^\mathrm{true}$.
    • Let $SS_\mathrm{tot} = \sum_i \left(y_i^\mathrm{true} - \bar{y}\right)^2$.
    • Let $SS_\mathrm{res} = \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$.
    • Then the coefficient of determination is:

    $1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}$

    • This is:
      • 0 if you predict the mean
      • 1 if you're perfect
      • negative if you do worse than the mean-prediction baseline!
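A sketch of computing the coefficient of determination directly from the definitions above, using the same made-up arrays as before, and checking it against scikit-learn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])  # made-up ground truth
y_pred = np.array([2.5, 5.0, 4.0, 8.0])  # made-up predictions

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
print(1 - ss_res / ss_tot)                      # coefficient of determination
print(r2_score(y_true, y_pred))                 # scikit-learn agrees
```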