Lecture 21 - Multiclass Classification Example, end-to-end

Goals:

  • Cover no new concepts
  • Put a whole supervised learning system together end-to-end from scratch.

[Image: the beans to be classified]

Our goal: predict a dry bean's species given various geometric measures which have been computed from images of the bean via computer vision.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
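import sklearn
# Import the submodules used below explicitly so that sklearn.<submodule> resolves:
import sklearn.model_selection
import sklearn.preprocessing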
In [2]:
RANDOM_SEED = 42

1. Load and Split the Data

In [3]:
beans = pd.read_csv('/cluster/academic/DATA311/202620/drybean.csv')
In [4]:
beans
Out[4]:
Area Perimeter MajorAxisLength MinorAxisLength AspectRatio Eccentricity ConvexArea EquivDiameter Extent Solidity Roundness Compactness ShapeFactor1 ShapeFactor2 ShapeFactor3 ShapeFactor4 Class
0 28395 610.291 208.178117 173.888747 1.197191 0.549812 28715 190.141097 0.763923 0.988856 0.958027 0.913358 0.007332 0.003147 0.834222 0.998724 SEKER
1 28734 638.018 200.524796 182.734419 1.097356 0.411785 29172 191.272751 0.783968 0.984986 0.887034 0.953861 0.006979 0.003564 0.909851 0.998430 SEKER
2 29380 624.110 212.826130 175.931143 1.209713 0.562727 29690 193.410904 0.778113 0.989559 0.947849 0.908774 0.007244 0.003048 0.825871 0.999066 SEKER
3 30008 645.884 210.557999 182.516516 1.153638 0.498616 30724 195.467062 0.782681 0.976696 0.903936 0.928329 0.007017 0.003215 0.861794 0.994199 SEKER
4 30140 620.134 201.847882 190.279279 1.060798 0.333680 30417 195.896503 0.773098 0.990893 0.984877 0.970516 0.006697 0.003665 0.941900 0.999166 SEKER
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
13606 42097 759.696 288.721612 185.944705 1.552728 0.765002 42508 231.515799 0.714574 0.990331 0.916603 0.801865 0.006858 0.001749 0.642988 0.998385 DERMASON
13607 42101 757.499 281.576392 190.713136 1.476439 0.735702 42494 231.526798 0.799943 0.990752 0.922015 0.822252 0.006688 0.001886 0.676099 0.998219 DERMASON
13608 42139 759.321 281.539928 191.187979 1.472582 0.734065 42569 231.631261 0.729932 0.989899 0.918424 0.822730 0.006681 0.001888 0.676884 0.996767 DERMASON
13609 42147 763.779 283.382636 190.275731 1.489326 0.741055 42667 231.653247 0.705389 0.987813 0.907906 0.817457 0.006724 0.001852 0.668237 0.995222 DERMASON
13610 42159 772.237 295.142741 182.204716 1.619841 0.786693 42600 231.686223 0.788962 0.989648 0.888380 0.784997 0.007001 0.001640 0.616221 0.998180 DERMASON

13611 rows × 17 columns

In [5]:
# separate features from labels
features = beans.drop(columns="Class")
labels = beans["Class"]
In [6]:
labels.value_counts()
Out[6]:
Class
DERMASON    3546
SIRA        2636
SEKER       2027
HOROZ       1928
CALI        1630
BARBUNYA    1322
BOMBAY       522
Name: count, dtype: int64
In [7]:
from sklearn.model_selection import train_test_split

TEST_FRAC = 0.2  # fraction of all rows held out for testing
VAL_FRAC  = 0.2  # fraction of the remaining (non-test) rows, i.e. about 16% of all rows

# First split: hold out the test set, stratified by class.
features_rest, features_test, labels_rest, labels_test = train_test_split(
    features, labels,
    test_size=TEST_FRAC,
    shuffle=True, # default
    stratify=labels,
    random_state=RANDOM_SEED,
)

# Second split: divide what is left into training and validation sets.
features_train, features_val, labels_train, labels_val = train_test_split(
    features_rest, labels_rest,
    test_size=VAL_FRAC,
    shuffle=True,
    stratify=labels_rest,
    random_state=RANDOM_SEED,
)
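
As a sanity check on the two-stage split, here is a minimal sketch that reproduces the split sizes without needing the CSV. The class counts are copied from the `value_counts()` output above; the single dummy feature column is an assumption made only so the example runs on its own. Because `VAL_FRAC` is applied to the rows left over after the test set is removed, the validation set should come out at roughly 16% of all rows rather than 20%.

```python
import numpy as np
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
TEST_FRAC = 0.2
VAL_FRAC = 0.2

# Class counts copied from labels.value_counts() above.
counts = {"DERMASON": 3546, "SIRA": 2636, "SEKER": 2027, "HOROZ": 1928,
          "CALI": 1630, "BARBUNYA": 1322, "BOMBAY": 522}
labels = np.repeat(list(counts.keys()), list(counts.values()))
features = np.arange(labels.shape[0]).reshape(-1, 1)  # dummy stand-in feature

f_rest, f_test, l_rest, l_test = train_test_split(
    features, labels, test_size=TEST_FRAC, stratify=labels,
    random_state=RANDOM_SEED)
f_train, f_val, l_train, l_val = train_test_split(
    f_rest, l_rest, test_size=VAL_FRAC, stratify=l_rest,
    random_state=RANDOM_SEED)

# Report each split as a fraction of the full dataset.
n = labels.shape[0]
split_fracs = {"train": len(l_train) / n,
               "val": len(l_val) / n,
               "test": len(l_test) / n}
for name, frac in split_fracs.items():
    print(f"{name}: {frac:.3f}")
```

If a validation set of exactly 20% of all rows were wanted instead, the second call would need `test_size = 0.2 / 0.8 = 0.25`.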

2. Baselines

In [8]:
labels_train.value_counts()
Out[8]:
Class
DERMASON    2269
SIRA        1687
SEKER       1297
HOROZ       1234
CALI        1043
BARBUNYA     846
BOMBAY       334
Name: count, dtype: int64

If we always predict the most common label ("DERMASON"), we'll get 2269 of the 8710 training examples correct. Our accuracy will be:

In [9]:
labels_train.value_counts().iloc[0] / labels_train.shape[0]
Out[9]:
np.float64(0.26050516647531574)

I could also calculate the accuracy of random guessing (or of random guessing with probabilities weighted by the class balance). I don't expect any other baseline to beat the above.
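
For reference, the expected accuracy of those two random baselines can be worked out directly from the class proportions. This is a small sketch, not part of the original notebook: with $k$ classes and proportions $p_c$, uniform guessing is right with probability $1/k$, and guessing class $c$ with probability $p_c$ is right with probability $\sum_c p_c^2$. Since $\sum_c p_c^2 \le \max_c p_c$, neither can beat the majority-class baseline.

```python
import numpy as np

# Training-set class counts copied from labels_train.value_counts() above.
train_counts = np.array([2269, 1687, 1297, 1234, 1043, 846, 334])
p = train_counts / train_counts.sum()  # class proportions

majority_acc = p.max()          # always predict the most common class
uniform_acc = 1.0 / p.shape[0]  # guess every class with equal probability
weighted_acc = np.sum(p ** 2)   # guess class c with probability p[c]

print(f"majority class:  {majority_acc:.4f}")
print(f"uniform random:  {uniform_acc:.4f}")
print(f"weighted random: {weighted_acc:.4f}")
```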

3. Feature Extraction

In [31]:
# Naively take features as they are (these arrays are overwritten by the z-scored versions below):
X_train = features_train.to_numpy()
X_val   = features_val.to_numpy()
X_test  = features_test.to_numpy()

# Convert numerical features to z-scores, using statistics fit on the training set only:
scaler = sklearn.preprocessing.StandardScaler().fit(features_train)
X_train = scaler.transform(features_train)
X_val   = scaler.transform(features_val)
X_test  = scaler.transform(features_test)

# Convert categorical labels to ordinal integers
encoder = sklearn.preprocessing.LabelEncoder().fit(labels_train)
y_train = encoder.transform(labels_train)
y_val   = encoder.transform(labels_val)
y_test  = encoder.transform(labels_test)
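
To make the z-score step concrete, the sketch below checks that `StandardScaler` does the same thing as subtracting the training mean and dividing by the training standard deviation. The data here is a hypothetical stand-in (two random features on very different scales), not the bean dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in data: one large-scale and one small-scale feature.
train = rng.normal(loc=[50000.0, 0.8], scale=[10000.0, 0.05], size=(200, 2))
val = rng.normal(loc=[50000.0, 0.8], scale=[10000.0, 0.05], size=(50, 2))

scaler = StandardScaler().fit(train)  # statistics come from the training set only
train_z = scaler.transform(train)
val_z = scaler.transform(val)

# The same computation by hand.
mu = train.mean(axis=0)
sigma = train.std(axis=0)  # population std (ddof=0), which StandardScaler uses
manual_val_z = (val - mu) / sigma

print(np.allclose(val_z, manual_val_z))
print(train_z.mean(axis=0).round(6), train_z.std(axis=0).round(6))
```

The validation features are transformed with the training mean and standard deviation, so their z-scores will generally not have mean exactly 0 or standard deviation exactly 1.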

4. Learn the Machine!

In [42]:
# Hyperparameters

K = 15

# simple metric choices: 'l1', 'l2'
METRIC = 'l2'
In [43]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=K, metric=METRIC)

knn.fit(X_train, y_train);

5. Evaluate Performance

In [44]:
def evaluate(classifier, X_train, y_train, X_val, y_val):
    # Classification accuracy on training and validation sets.
    # Use the classifier passed in, not the global knn.

    y_train_pred = classifier.predict(X_train)
    y_val_pred   = classifier.predict(X_val)

    # Accuracy = fraction of predictions that match the true labels.
    train_acc = (y_train_pred == y_train).sum() / y_train.shape[0]
    val_acc   = (y_val_pred == y_val).sum() / y_val.shape[0]

    print(f"Training accuracy: {train_acc:.4f} ({train_acc*100:.2f}%)")
    print(f"Validation accuracy: {val_acc:.4f} ({val_acc*100:.2f}%)")
In [45]:
evaluate(knn, X_train, y_train, X_val, y_val)
Training accuracy: 0.9323 (93.23%)
Validation accuracy: 0.9311 (93.11%)
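
The hand-computed accuracy above should agree with scikit-learn's built-in helpers. A quick sketch, using hypothetical synthetic data from `make_classification` in place of the bean features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data, not the bean dataset.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
clf = KNeighborsClassifier(n_neighbors=15).fit(X, y)
y_pred = clf.predict(X)

manual_acc = (y_pred == y).sum() / y.shape[0]  # same formula as in evaluate()
library_acc = accuracy_score(y, y_pred)        # sklearn.metrics helper
score_acc = clf.score(X, y)                    # classifier's own mean accuracy

print(manual_acc, library_acc, score_acc)
```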

Where could we go from here?

  • Tune $K$
  • Try different distance metrics
  • Feature selection: does a subset of the features work better than using all of them?
  • Different models: go beyond KNN
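
The first two items could be sketched as a small grid search scored on the validation set. This is one possible approach under stated assumptions, not the course's prescribed method: the data is a hypothetical synthetic stand-in from `make_classification`, and the candidate values of $K$ are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data with 16 features and 7 classes, like the beans.
X, y = make_classification(n_samples=1000, n_features=16, n_informative=8,
                           n_classes=7, n_clusters_per_class=1,
                           random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Standardize using training-set statistics only.
scaler = StandardScaler().fit(X_train)
X_train, X_val = scaler.transform(X_train), scaler.transform(X_val)

# Try each (K, metric) pair and record validation accuracy.
candidate_ks = [1, 3, 5, 9, 15, 25]  # arbitrary candidate values
val_acc = {}
for k in candidate_ks:
    for metric in ["l1", "l2"]:
        clf = KNeighborsClassifier(n_neighbors=k, metric=metric)
        clf.fit(X_train, y_train)
        val_acc[(k, metric)] = clf.score(X_val, y_val)

best_k, best_metric = max(val_acc, key=val_acc.get)
print(f"best: K={best_k}, metric={best_metric}, "
      f"validation accuracy={val_acc[(best_k, best_metric)]:.4f}")
```

The test set plays no part in this loop; it stays held out until the hyperparameters are fixed.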
In [46]:
X_train.shape
Out[46]:
(8710, 16)