How good are they? Assume we're in a supervised setting, so we have some ground truth labels for our training and validation data. Should you call it good and present your results, or keep tweaking your model?
You need an evaluation environment. What do you need to make this?
With good reason, the book recommends that you package all your evaluation machinery into a single-command program (this could also be a single notebook or sequence of cells in a notebook).
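A single-command evaluation script might look like this minimal sketch (the metric and the names `evaluate`, `y_true`, `y_pred` are placeholders, not from the book):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute metrics for one train/val split. Swap in whatever metric suits your task."""
    return {"accuracy": float((y_true == y_pred).mean())}

if __name__ == "__main__":
    # In a real script you would load your trained model and validation split here.
    y_true = np.array([1, 0, 1, 1])
    y_pred = np.array([1, 0, 0, 1])
    print(evaluate(y_true, y_pred))
```

The point is that one command reruns the whole evaluation, so every model candidate gets judged the same way.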
You should output your candidate model's performance:
It's also a good idea to output:
The first rule of machine learning is to start without machine learning. (Google says so, so it must be true.)
Why?
Example: you are a computer vision expert working on biomedical image classification, trying to predict whether an MRI scan shows a tumor or not. Your training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).
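With that class imbalance, a "model" that never predicts a tumor already looks impressive. A minimal sketch (labels are made up to mirror the 90/10 split above):

```python
import numpy as np

# Hypothetical labels with the 90/10 imbalance described above: 0 = no tumor, 1 = tumor.
y_true = np.array([0] * 90 + [1] * 10)

# The no-ML baseline: always predict the majority class (no tumor).
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.9 — 90% accuracy without learning anything
```

Any real model has to beat this baseline to be worth anything, which is why you compute it first.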
Example prediction problems:
Ideas?
My ideas:
So you've made some predictions... how good are they?
Let's consider regression first. Our model is some function that maps an input datapoint to a numerical value:
$y_i^\mathrm{pred} = f(x_i)$
and we have a ground-truth value $y_i^\mathrm{true}$ for $x_i$.
How do we measure how wrong we are?
Error is pretty simple to define:
$y_i^\mathrm{true} - y_i^\mathrm{pred}$
But we want to evaluate our model on the whole train or val set. Average error is a bad idea, because positive and negative errors cancel out:

$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)$
Absolute error solves this problem:
$|y_i^\mathrm{true} - y_i^\mathrm{pred}|$
Mean absolute error measures performance on a whole train or val set:
$\frac{1}{n} \sum_i |y_i^\mathrm{true} - y_i^\mathrm{pred}|$
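MAE is a one-liner with numpy (the values here are arbitrary illustrative data):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean absolute error: average magnitude of the per-example errors.
mae = np.abs(y_true - y_pred).mean()
print(mae)  # 0.5
```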
Squared error disproportionately punishes larger errors. This may be desirable or not.
$\left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
Mean squared error (MSE) averages this over a collection of training examples:
$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
MSE becomes more interpretable if you square-root it, because now it's in the units of the target variable. This gives us Root Mean Squared Error (RMSE):
$\sqrt{ \frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2}$
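MSE and RMSE, on the same illustrative data as before:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: large errors dominate because they are squared.
mse = ((y_true - y_pred) ** 2).mean()

# RMSE: back in the units of the target, so it's easier to interpret.
rmse = np.sqrt(mse)
print(mse, rmse)
```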
Problem with any of the above:
You can make your error metric go as small as you want! Just scale: $$ X \leftarrow X / k $$ $$ \mathbf{y}^\mathrm{true} \leftarrow \mathbf{y}^\mathrm{true} / k $$ $$ \mathbf{y}^\mathrm{pred} \leftarrow \mathbf{y}^\mathrm{pred} / k $$
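You can see this scaling trick directly (toy numbers, any choice of $k$ works):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

def mae(a, b):
    return np.abs(a - b).mean()

# Rescale both targets and predictions by k: the model is exactly as good,
# but the reported MAE shrinks by a factor of k.
k = 1000.0
print(mae(y_true, y_pred))
print(mae(y_true / k, y_pred / k))  # same predictions, 1000x smaller "error"
```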
Also: is 10 vs. 12 a bigger error than 1 vs. 2?
Solutions:
Relative error:
$\frac{|y_i^\mathrm{true} - y_i^\mathrm{pred}|}{|y_i^\mathrm{true}|}$

(Note this is undefined when $y_i^\mathrm{true} = 0$.)
Coefficient of determination ($R^2$):

Define the residual sum of squares $SS_\mathrm{res} = \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$ and the total sum of squares $SS_\mathrm{tot} = \sum_i \left(y_i^\mathrm{true} - \bar{y}\right)^2$, where $\bar{y}$ is the mean of the true values. Then the coefficient of determination is: $R^2 = 1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}$
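Computing this from scratch (same illustrative data as the earlier metrics):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

A value of 1 means perfect predictions; 0 means you did no better than always predicting the mean, and negative values mean you did worse than that.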
This is: