sklearn
The first rule of machine learning is to start without machine learning. (Google says so, so it must be true.)
Why?
Example: you are a computer vision expert working on biomedical image classification, trying to predict whether an MRI scan shows a tumor or not. Your training data contains 90% non-tumor images (negative examples) and 10% tumor images (positive examples).
Example prediction problems:
Biomedical image classification: predict whether an MRI scan shows a tumor or not.
Spam email classification: predict whether a message is spam.
Weather prediction: given all weather measurements from today and prior, predict tomorrow's weather.
Body measurements: predict leg length given height.
Ideas?
What generic strategies can we extract from the above?
My ideas:
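One generic non-ML baseline worth naming here: always predict the most frequent class. A minimal sketch with invented data standing in for the 90%/10% tumor example (the array shapes and numbers are made up; DummyClassifier is sklearn's built-in baseline model):
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Invented stand-in for the 90% non-tumor / 10% tumor dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))             # features are ignored by this baseline
y = (rng.random(1000) < 0.1).astype(int)   # roughly 10% positive labels

# "No machine learning": always predict the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(accuracy_score(y, baseline.predict(X)))  # ~0.9 without learning anything
Any real model has to beat this number before it's worth anything.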
So you've made some predictions... how good are they?
Let's consider regression first. Our model is some function that maps an input datapoint to a numerical value:
$y_i^\mathrm{pred} = f(x_i)$
and we have a ground-truth value $y_i^\mathrm{true}$ for $x_i$.
How do we measure how wrong we are?
Error is pretty simple to define:
$y_i^\mathrm{true} - y_i^\mathrm{pred}$
But we want to evaluate our model on the whole train or val set. Average error is a bad idea, because positive and negative errors cancel out:
$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)$
Absolute error solves this problem:
$|y_i^\mathrm{true} - y_i^\mathrm{pred}|$
Mean absolute error measures performance on a whole train or val set:
$\frac{1}{n} \sum_i |y_i^\mathrm{true} - y_i^\mathrm{pred}|$
Squared error disproportionately punishes larger errors. This may be desirable or not.
$\left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
Mean squared error (MSE) does the same over a collection of training examples:
$\frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$
MSE becomes more interpretable if you square-root it, because the result is in the units of the target. This gives us Root Mean Squared Error (RMSE):
$\sqrt{ \frac{1}{n} \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2}$
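A minimal numpy sketch of these three metrics (the values below are invented, just to show the formulas in code):
import numpy as np

y_true = np.array([181.0, 186.0, 195.0, 193.0])   # invented ground-truth values
y_pred = np.array([183.0, 184.0, 199.0, 188.0])   # invented predictions

err = y_true - y_pred
mae = np.mean(np.abs(err))    # mean absolute error
mse = np.mean(err ** 2)       # mean squared error
rmse = np.sqrt(mse)           # root mean squared error, same units as y
print(mae, mse, rmse)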
Problem with any of the above:
You can make your error metric go as small as you want! Just scale the data: $$ X \leftarrow X / k $$ $$ \mathbf{y}^\mathrm{true} \leftarrow \mathbf{y}^\mathrm{true} / k $$ $$ \mathbf{y}^\mathrm{pred} \leftarrow \mathbf{y}^\mathrm{pred} / k $$
Also: is 10 vs. 12 a bigger error than 1 vs. 2?
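A quick numeric check of the scaling claim above (invented numbers): dividing both the targets and the predictions by $k$ shrinks RMSE by exactly a factor of $k$, even though the predictions are no better.
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 19.0, 33.0])

k = 1000.0
print(rmse(y_true, y_pred))          # original RMSE
print(rmse(y_true / k, y_pred / k))  # 1000x smaller, same quality of predictions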
Solutions:
Relative error:
$\frac{|y_i^\mathrm{true} - y_i^\mathrm{pred}|}{y_i^\mathrm{true}}$
Coefficient of determination:
$1 - \frac{SS_\mathrm{res}}{SS_\mathrm{tot}}$, where $SS_\mathrm{res} = \sum_i \left(y_i^\mathrm{true} - y_i^\mathrm{pred}\right)^2$ is the residual sum of squares and $SS_\mathrm{tot} = \sum_i \left(y_i^\mathrm{true} - \bar{y}^\mathrm{true}\right)^2$ is the total sum of squares.
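A sketch of both fixes (invented numbers; r2_score is sklearn's implementation of the coefficient of determination). Unlike MAE/MSE/RMSE, neither metric changes when you rescale the data.
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 19.0, 33.0, 38.0])

rel_err = np.abs(y_true - y_pred) / y_true   # per-point relative error
print(rel_err.mean())                        # mean relative error

print(r2_score(y_true, y_pred))              # R^2 = 1 - SS_res / SS_tot

k = 1000.0
print(r2_score(y_true / k, y_pred / k))      # same R^2 after rescaling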
sklearn
import seaborn as sns
import sklearn
import numpy as np
import pandas as pd
penguins = sns.load_dataset('penguins')
penguins.info()
penguins = penguins.dropna()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
dtypes: float64(4), object(3)
memory usage: 18.9+ KB
Xall = penguins[["bill_length_mm", "bill_depth_mm", "body_mass_g"]]
yall = penguins["flipper_length_mm"]
Xall
| | bill_length_mm | bill_depth_mm | body_mass_g |
|---|---|---|---|
| 0 | 39.1 | 18.7 | 3750.0 |
| 1 | 39.5 | 17.4 | 3800.0 |
| 2 | 40.3 | 18.0 | 3250.0 |
| 4 | 36.7 | 19.3 | 3450.0 |
| 5 | 39.3 | 20.6 | 3650.0 |
| ... | ... | ... | ... |
| 338 | 47.2 | 13.7 | 4925.0 |
| 340 | 46.8 | 14.3 | 4850.0 |
| 341 | 50.4 | 15.7 | 5750.0 |
| 342 | 45.2 | 14.8 | 5200.0 |
| 343 | 49.9 | 16.1 | 5400.0 |

333 rows × 3 columns
yall
0      181.0
1      186.0
2      195.0
4      193.0
5      190.0
       ...  
338    214.0
340    215.0
341    222.0
342    212.0
343    213.0
Name: flipper_length_mm, Length: 333, dtype: float64
from sklearn.model_selection import train_test_split
Xtrain, Xtemp, ytrain, ytemp = train_test_split(Xall, yall, train_size=0.6, random_state=16)
Xval, Xtest, yval, ytest = train_test_split(Xtemp, ytemp, test_size=0.5, random_state=16)
[len(d) for d in [Xtrain, Xval, Xtest]]
[199, 67, 67]
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
reg = LinearRegression()
reg.fit(Xtrain, ytrain)
ytrain_pred = reg.predict(Xtrain)
yval_pred = reg.predict(Xval)
metrics = {
"Coeff. of determination":
[reg.score(Xtrain, ytrain),
reg.score(Xval, yval)],
"MAE":
[mean_absolute_error(ytrain, ytrain_pred),
mean_absolute_error(yval, yval_pred)],
"RMSE":
[mean_squared_error(ytrain, ytrain_pred, squared=False),
mean_squared_error(yval, yval_pred, squared=False)]
}
pd.DataFrame(metrics, index=["Train", "Val"])
| | Coeff. of determination | MAE | RMSE |
|---|---|---|---|
| Train | 0.815466 | 4.565753 | 5.937327 |
| Val | 0.853758 | 4.215489 | 5.197380 |
pd.DataFrame([Xall.columns, reg.coef_])
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | bill_length_mm | bill_depth_mm | body_mass_g |
| 1 | 0.504554 | -1.498074 | 0.011291 |
sns.relplot(x=yval, y=yval_pred)
sns.displot(np.abs(yval - yval_pred))
Things to try (see the sketch after this list):
sklearn.preprocessing.{OrdinalEncoder,OneHotEncoder}
sklearn.preprocessing.{StandardScaler,RobustScaler}
sklearn.preprocessing.PolynomialFeatures
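A sketch combining all three (assuming the penguins, yall, and train_test_split objects defined above; the column groupings and pipeline layout are my own choices, not the only way to do this):
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

numeric = ["bill_length_mm", "bill_depth_mm", "body_mass_g"]
categorical = ["species", "island", "sex"]

# Scale the numeric columns and add degree-2 polynomial features;
# one-hot encode the categorical columns.
preprocess = make_column_transformer(
    (make_pipeline(StandardScaler(), PolynomialFeatures(degree=2)), numeric),
    (OneHotEncoder(handle_unknown="ignore"), categorical),
)
model = make_pipeline(preprocess, LinearRegression())

Xfull = penguins[numeric + categorical]
Xtr, Xv, ytr, yv = train_test_split(Xfull, yall, train_size=0.6, random_state=16)
model.fit(Xtr, ytr)
print(model.score(Xv, yv))   # validation R^2 with the richer feature set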
The simplest regression model we could come up with was linear regression, which assumes that $y$ is a linear function of the input vector $\mathbf{x}$.
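In symbols (one common way to write it, with weights $\mathbf{w}$ and intercept $b$):
$y_i^\mathrm{pred} = \mathbf{w} \cdot \mathbf{x}_i + b$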
What's the corresponding model for classification?
A linear classifier: $y_i^\mathrm{pred} = h(\mathbf{w} \cdot \mathbf{x}_i)$, where $h$ maps the linear score to a class. Multiple possible choices for $h$:
Multiple variants of linear classifiers exist with the same modeling assumptions; they differ in how the weights $\mathbf{w}$ are found.
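For example (a sketch on the penguins data loaded above, predicting sex from the four numeric measurements; the task and models are my choices): Perceptron and LogisticRegression share the same linear form but choose $\mathbf{w}$ by different procedures.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Perceptron, LogisticRegression

Xc = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
yc = penguins["sex"]

# Same linear model family; different procedures for choosing the weights w.
for linear_clf in [Perceptron(), LogisticRegression()]:
    clf = make_pipeline(StandardScaler(), linear_clf)
    clf.fit(Xc, yc)
    print(type(linear_clf).__name__, clf.score(Xc, yc))   # training accuracy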
What if there are more than two classes? Some classifiers handle this naturally by outputting a score or probability for each class.
Other classifiers don't; you can adapt them to the multiclass setting with a one-vs-rest strategy:
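A sketch of one-vs-rest on the three penguin species, using sklearn's OneVsRestClassifier around a binary linear classifier (the wrapper and base model are my choices):
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

Xc = penguins[["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]]
yc = penguins["species"]   # Adelie, Chinstrap, Gentoo

# One "this species vs. everything else" classifier per class;
# each point goes to the class whose classifier is most confident.
ovr = make_pipeline(StandardScaler(), OneVsRestClassifier(LinearSVC()))
ovr.fit(Xc, yc)
print(ovr.score(Xc, yc))   # training accuracy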