How good are they? Assume we're in a supervised setting, so we have some ground truth labels for our training and validation data. Should you call it good and present your results, or keep tweaking your model?
You need an evaluation environment. What do you need to make this?
With good reason, the book recommends that you package all your evaluation machinery into a single-command program (this could also be a single notebook or sequence of cells in a notebook).
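A single-command evaluation script might look like this minimal sketch (the metric and the names `evaluate`, `y_true`, `y_pred` are placeholders, not from the book):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute metrics for one train/val split. Swap in whatever metric suits your task."""
    return {"accuracy": float((y_true == y_pred).mean())}

if __name__ == "__main__":
    # In a real script you would load your trained model and validation split here.
    y_true = np.array([1, 0, 1, 1])
    y_pred = np.array([1, 0, 0, 1])
    print(evaluate(y_true, y_pred))
```

The point is that one command reruns the whole evaluation, so every model candidate gets judged the same way.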
You should output your candidate model's performance:
It's also a good idea to output:
The first rule of machine learning is to start without machine learning. (Google says so, so it must be true.)
Why?
Example: you are a computer vision expert working on biomedical image classification, trying to predict whether an MRI scan shows a tumor or not. Your training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).
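With that class imbalance, a "model" that never predicts a tumor already looks impressive. A minimal sketch (labels are made up to mirror the 90/10 split above):

```python
import numpy as np

# Hypothetical labels with the 90/10 imbalance described above: 0 = no tumor, 1 = tumor.
y_true = np.array([0] * 90 + [1] * 10)

# The no-ML baseline: always predict the majority class (no tumor).
y_pred = np.zeros_like(y_true)

accuracy = (y_true == y_pred).mean()
print(accuracy)  # 0.9 — 90% accuracy without learning anything
```

Any real model has to beat this baseline to be worth anything, which is why you compute it first.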
Example prediction problems:
Ideas?
My ideas:
So you've made some predictions... how good are they?
Let's consider regression first. Our model is some function that maps an input datapoint to a numerical value:
$y_i^\mathrm{pred} = f(x_i)$
and we have a ground-truth value $y_i^\mathrm{true}$ for $x_i$.
How do we measure how wrong we are?
Error is pretty simple to define:
$y_i^\mathrm{true} - y_i^\mathrm{pred}$
But we want to evaluate our model on the whole train or val set. Average error is a bad idea, because positive and negative errors cancel out:

$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)$
Absolute error solves this problem:
$|y_i^\mathrm{true} - y_i^\mathrm{pred}|$
Mean absolute error measures performance on a whole train or val set:
$\frac{1}{n} \sum_i |y_i^\mathrm{true} - y_i^\mathrm{pred}|$
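MAE is a one-liner with numpy (the values here are arbitrary illustrative data):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean absolute error: average magnitude of the per-example errors.
mae = np.abs(y_true - y_pred).mean()
print(mae)  # 0.5
```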
Squared error disproportionately punishes larger errors. This may be desirable or not.
$\left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
Mean squared error (MSE) averages this over a collection of training examples:
$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
MSE becomes more interpretable if you square-root it, because now it's in the units of the target variable. This gives us Root Mean Squared Error (RMSE):
$\sqrt{ \frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2}$
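MSE and RMSE, on the same illustrative data as before:

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Mean squared error: large errors dominate because they are squared.
mse = ((y_true - y_pred) ** 2).mean()

# RMSE: back in the units of the target, so it's easier to interpret.
rmse = np.sqrt(mse)
print(mse, rmse)
```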
Problem with any of the above:
You can make your error metric go as small as you want! Just scale: $$ X \leftarrow X / k $$ $$ \mathbf{y}^\mathrm{true} \leftarrow \mathbf{y}^\mathrm{true} / k $$ $$ \mathbf{y}^\mathrm{pred} \leftarrow \mathbf{y}^\mathrm{pred} / k $$
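You can see this scaling trick directly (toy numbers, any choice of $k$ works):

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 18.0, 33.0])

def mae(a, b):
    return np.abs(a - b).mean()

# Rescale both targets and predictions by k: the model is exactly as good,
# but the reported MAE shrinks by a factor of k.
k = 1000.0
print(mae(y_true, y_pred))
print(mae(y_true / k, y_pred / k))  # same predictions, 1000x smaller "error"
```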
Also: is 10 vs. 12 a bigger error than 1 vs. 2?
Solutions:
Relative error:
$\frac{|y_i^\mathrm{true} - y_i^\mathrm{pred}|}{|y_i^\mathrm{true}|}$

(Note this is undefined when $y_i^\mathrm{true} = 0$.)
Coefficient of determination ($R^2$):

Define the residual sum of squares $SS_\mathrm{res} = \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$ and the total sum of squares $SS_\mathrm{tot} = \sum_i \left(y_i^\mathrm{true} - \bar{y}\right)^2$, where $\bar{y}$ is the mean of the true values. Then the coefficient of determination is: $R^2 = 1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}$
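Computing this from scratch (same illustrative data as the earlier metrics):

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

ss_res = ((y_true - y_pred) ** 2).sum()          # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)
```

A value of 1 means perfect predictions; 0 means you did no better than always predicting the mean, and negative values mean you did worse than that.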
This is: