Lecture 22 - Baselines, Regression metrics, Linear Classifiers¶

Announcements¶

  • (none?)

Goals¶

  • Know how to come up with good baselines for a variety of prediction tasks.
  • Know how to think about errors in regression problems, and a few ways they can be measured.
    • Regression: absolute, relative, squared; MSE, RMSE, MAE, coefficient of determination
    • Later: classification
  • Know the basics of how to do linear regression using sklearn
  • Understand the basic formulation for a linear classifier
  • Know how binary classifiers can be generalized to multiclass classifiers

Baselines¶

The first rule of machine learning is to start without machine learning. (Google says so, so it must be true.)

Why?

  • If you aren't learning from data, you can't overfit to it.
  • It gives you hints about how hard your problem is, putting your model's performance in perspective.

Example: you are a computer vision expert working on biomedical image classification, trying to predict whether an MRI scan shows a tumor or not. Your training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).

Baseline Brainstorm¶

Example prediction problems:

  • Biomedical image classification: predict whether an MRI scan shows a tumor or not.

    • Training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).
  • Spam email classification: predict whether a message is spam.

    • Training data contains equal numbers of spam (positive) and non-spam (negative) examples.
  • Weather prediction: given all weather measurements from today and prior,

    • Predict whether it will rain tomorrow
    • Predict the amount of rainfall tomorrow
  • Body measurements: predict leg length given height

Ideas?

  • Weather: average rain chance over all time
  • Tumors: Always say "no"
  • Spam: Randomly choose a label
  • Leg length: Binned population averages
  • Rain: average daily rainfall

What generic strategies can we extract from the above?

  • Predict the mean/median/mode
  • Randomly guess

My ideas:

  • Guess randomly
  • Guess the mean/median/mode
  • Single-feature model
  • Linear regression
  • History repeats itself
  • Special mention: upper-bound baselines
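
As a concrete sketch of a few of these strategies (not part of the lecture code): sklearn ships ready-made baseline models in sklearn.dummy. The toy data below is made up to loosely mirror the tumor and rainfall examples.

import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Toy data: 90% negative / 10% positive, like the tumor example.
X = np.zeros((100, 1))                    # features are ignored by dummy models
y_class = np.array([0] * 90 + [1] * 10)   # imbalanced binary labels
y_reg = np.random.default_rng(0).normal(5.0, 2.0, size=100)  # made-up rainfall amounts

# "Always say no": predict the most frequent class.
most_frequent = DummyClassifier(strategy="most_frequent").fit(X, y_class)

# "Randomly choose a label", respecting the training label frequencies.
random_guess = DummyClassifier(strategy="stratified", random_state=0).fit(X, y_class)

# "Average daily rainfall": always predict the training mean.
mean_baseline = DummyRegressor(strategy="mean").fit(X, y_reg)

print(most_frequent.score(X, y_class))   # accuracy ~0.9 just by saying "no"
print(mean_baseline.predict(X[:3]))      # the same mean value for every input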

Evaluation Metrics¶

So you've made some predictions... how good are they?

Let's consider regression first. Our model is some function that maps an input datapoint to a numerical value:

$y_i^\mathrm{pred} = f(x_i)$

and we have a ground-truth value $y_i^\mathrm{true} $for $x_i$.

How do we measure how wrong we are?

  • Error is pretty simple to define:

    $y_i^\mathrm{true} - y_i^\mathrm{pred}$

  • But we want to evaluate our model on the whole train or val set. Averaging the raw errors is a bad idea, because positive and negative errors cancel each other out:

    $\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)$

  • Absolute error solves this problem:

    $|y_i^\mathrm{true} - y_i^\mathrm{pred}|$

  • Mean absolute error measures performance on a whole train or val set:

    $\frac{1}{n} \sum_i |y_i^\mathrm{true} - y_i^\mathrm{pred}|$

  • Squared error disproportionately punishes larger errors. This may or may not be desirable.

    $\left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$

  • Mean squared error (MSE) does the same over a collection of training examples:

    $\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$

  • MSE becomes more interpretable if you take its square root, because the result is in the same units as the target. This gives us Root Mean Squared Error (RMSE):

    $\sqrt{ \frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2}$

Problem with any of the above:

You can make your error metric go as small as you want! Just scale the data: $$ X \leftarrow X / k $$ $$ \mathbf{y}^\mathrm{true} \leftarrow \mathbf{y}^\mathrm{true} / k $$ $$ \mathbf{y}^\mathrm{pred} \leftarrow \mathbf{y}^\mathrm{pred} / k $$

Also: Is 10 vs. 12 a bigger error than 1 vs. 2?

Solutions:

  • Relative error:

    $\frac{|y_i^\mathrm{true} - y_i^\mathrm{pred}|}{y_i^\mathrm{true}}$

  • Coefficient of determination:

    • Let $\bar{y}$ be the mean of $\mathbf{y}^\mathrm{true}$.
    • Let $SS_\mathrm{tot} = \sum_i \left(y_i^\mathrm{true} - \bar{y}\right)^2$.
    • Let $SS_\mathrm{res} = \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$.
    • Then the coefficient of determination is:

    $1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}$

    • This is:
      • 0 if you predict the mean
      • 1 if you're perfect
      • negative if you do worse than the mean-prediction baseline!
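
As a quick sketch, each of the metrics above can be computed directly with numpy; the y_true and y_pred arrays below are made-up values, not outputs of any real model.

import numpy as np

y_true = np.array([181.0, 186.0, 195.0, 193.0, 190.0])   # hypothetical ground truth
y_pred = np.array([183.0, 184.0, 199.0, 188.0, 191.0])   # hypothetical predictions

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                      # mean absolute error
mse = np.mean(errors ** 2)                         # mean squared error
rmse = np.sqrt(mse)                                # RMSE, in the units of the target

ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                           # coefficient of determination

print(mae, rmse, r2)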

Linear Regression in sklearn¶

In [51]:
import seaborn as sns
import sklearn
import numpy as np
import pandas as pd

penguins = sns.load_dataset('penguins')

penguins.info()
penguins = penguins.dropna()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
In [52]:
Xall = penguins[["bill_length_mm", "bill_depth_mm", "body_mass_g"]]
yall = penguins["flipper_length_mm"]

Xall
Out[52]:
bill_length_mm bill_depth_mm body_mass_g
0 39.1 18.7 3750.0
1 39.5 17.4 3800.0
2 40.3 18.0 3250.0
4 36.7 19.3 3450.0
5 39.3 20.6 3650.0
... ... ... ...
338 47.2 13.7 4925.0
340 46.8 14.3 4850.0
341 50.4 15.7 5750.0
342 45.2 14.8 5200.0
343 49.9 16.1 5400.0

333 rows × 3 columns

In [53]:
yall
Out[53]:
0      181.0
1      186.0
2      195.0
4      193.0
5      190.0
       ...  
338    214.0
340    215.0
341    222.0
342    212.0
343    213.0
Name: flipper_length_mm, Length: 333, dtype: float64
In [55]:
from sklearn.model_selection import train_test_split
Xtrain, Xtemp, ytrain, ytemp = train_test_split(Xall, yall, train_size=0.6, random_state=16)
Xval, Xtest, yval, ytest = train_test_split(Xtemp, ytemp, test_size=0.5, random_state=16)

[len(d) for d in [Xtrain, Xval, Xtest]]
Out[55]:
[199, 67, 67]
In [56]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

reg = LinearRegression()

reg.fit(Xtrain, ytrain)

ytrain_pred = reg.predict(Xtrain)
yval_pred = reg.predict(Xval)

metrics = {
  "Coeff. of determination":
      [reg.score(Xtrain, ytrain),
       reg.score(Xval, yval)],
  "MAE": 
      [mean_absolute_error(ytrain, ytrain_pred),
       mean_absolute_error(yval, yval_pred)],
  "RMSE": 
      [mean_squared_error(ytrain, ytrain_pred, squared=False),
       mean_squared_error(yval, yval_pred, squared=False)]
}
pd.DataFrame(metrics, index=["Train", "Val"])
           
Out[56]:
Coeff. of determination MAE RMSE
Train 0.815466 4.565753 5.937327
Val 0.853758 4.215489 5.197380
In [59]:
pd.DataFrame([Xall.columns, reg.coef_])
Out[59]:
0 1 2
0 bill_length_mm bill_depth_mm body_mass_g
1 0.504554 -1.498074 0.011291
In [57]:
sns.relplot(x=yval, y=yval_pred)
sns.displot(np.abs(yval - yval_pred))
Out[57]:
<seaborn.axisgrid.FacetGrid at 0x7f90cc90f100>

Things to try:

  • Use categorical columns? Encode numerically using ordinal or one-hot encoding.
    • sklearn.preprocessing.{OrdinalEncoder,OneHotEncoder}
  • Scale and center the data?
    • sklearn.preprocessing.{StandardScaler,RobustScaler}
  • Feature expansion?
    • sklearn.preprocessing.PolynomialFeatures
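
One possible sketch of the first two ideas (not part of the lecture code): one-hot encode the categorical penguin columns and scale the numeric ones with a ColumnTransformer, then feed the result into the same LinearRegression. It reuses penguins and yall from the cells above and makes a fresh train/held-out split just for illustration.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

numeric_cols = ["bill_length_mm", "bill_depth_mm", "body_mass_g"]
categorical_cols = ["species", "island", "sex"]

preprocess = ColumnTransformer([
    ("scale", StandardScaler(), numeric_cols),        # center and scale numeric features
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # encode categories
])

pipe = make_pipeline(preprocess, LinearRegression())

Xall_full = penguins[numeric_cols + categorical_cols]
Xtr, Xheld, ytr, yheld = train_test_split(Xall_full, yall, train_size=0.6, random_state=16)
pipe.fit(Xtr, ytr)
print(pipe.score(Xheld, yheld))   # coefficient of determination on the held-out rows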

Linear Regression, But For Classification?¶

The simplest regression model we could come up with was linear regression, which assumes that $y$ is a linear function of the input vector $\mathbf{x}$.

What's the corresponding model for classification?

  • Still assume a linear model $\hat{y} = \mathbf{x} \cdot \mathbf{w}$.
  • However, $\hat{y}$ should not be directly interpreted as a class label (it is continuous, after all).
  • Instead, map $\hat{y}$ to $y$ using some function $h$ that encodes our intuition about how classifiers should work.

Multiple possible choices for $h$:

  • Sign function: $y = \mathrm{sign}({\hat{y}})$
    • "Which side of the decision boundary is it on?"
  • Sigmoid function: $y = \frac{1}{1+e^{-\hat{y}}}$
    • "What's the probability that it's positive?"
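
A tiny sketch of these two choices of $h$, using made-up weights $\mathbf{w}$ and a single input $\mathbf{x}$ (neither comes from a trained model):

import numpy as np

w = np.array([0.4, -1.2, 2.0])       # hypothetical learned weights
x = np.array([1.0, 0.5, 0.3])        # one input datapoint

y_hat = x @ w                        # the continuous linear score

label = np.sign(y_hat)               # sign: which side of the decision boundary?
prob = 1 / (1 + np.exp(-y_hat))      # sigmoid: probability of the positive class

print(y_hat, label, prob)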

Multiple variants of linear classifiers share these same modeling assumptions; they differ in how they find the linear weights $\mathbf{w}$.

How do you do multiclass classification?¶

Some classifiers handle this naturally by outputting a score or probability for each class.

Other classifiers don't; you can adapt them to multiclass classifiers using a one-vs-rest strategy:

  • Train one binary classifier per class, treating that class as positive and the rest as negative.
  • Return the label with the highest probability or best score.
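
One possible sketch of the one-vs-rest strategy, using sklearn's OneVsRestClassifier wrapper around a scaled logistic regression (one choice of linear classifier among the variants mentioned above; the lecture doesn't prescribe this one). The penguin species column from the cells above serves as a stand-in 3-class target.

from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Predict penguin species (3 classes) from the same numeric features.
Xc = penguins[["bill_length_mm", "bill_depth_mm", "body_mass_g"]]
yc = penguins["species"]

ovr = OneVsRestClassifier(make_pipeline(StandardScaler(), LogisticRegression()))
ovr.fit(Xc, yc)

print(len(ovr.estimators_))   # one binary classifier per class
print(ovr.predict(Xc[:5]))    # each input gets the label whose classifier scores highest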