Lecture 17 - Supervised Learning

Classification and Regression; K-Nearest Neighbors; Feature Engineering

Announcements:

Goals:

  • Be able to define supervised learning, classification and regression
  • Understand and be able to implement a k-nearest-neighbors (KNN) classifier or regressor
  • Know the purpose and mechanism of some basic feature preprocessing techniques:
    • Standardization and scaling for numerical features
    • Ordinal and one-hot encoding for categorical features
In [121]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.neighbors  # explicit submodule import so sklearn.neighbors.KNeighborsClassifier resolves
In [124]:
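# load and shuffle the data here so this cell runs top to bottom (the same line appears again below)
penguins = sns.load_dataset("penguins").dropna().sample(frac=1, random_state=42)
penguins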
Out[124]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
30 Adelie Dream 39.5 16.7 178.0 3250.0 Female
317 Gentoo Biscoe 46.9 14.6 222.0 4875.0 Female
79 Adelie Torgersen 42.1 19.1 195.0 4000.0 Male
201 Chinstrap Dream 49.8 17.3 198.0 3675.0 Female
63 Adelie Biscoe 41.1 18.2 192.0 4050.0 Male
... ... ... ... ... ... ... ...
194 Chinstrap Dream 50.9 19.1 196.0 3550.0 Male
77 Adelie Torgersen 37.2 19.4 184.0 3900.0 Male
112 Adelie Biscoe 39.7 17.7 193.0 3200.0 Female
277 Gentoo Biscoe 45.5 15.0 220.0 5000.0 Male
108 Adelie Biscoe 38.1 17.0 181.0 3175.0 Female

333 rows × 7 columns

Today we'll talk more about supervised learning; as a reminder, this is where we have:

  • a dataset $X$ with shape $(n, d)$ that has $n$ datapoints each represented by a $d$-dimensional feature vector.
    • for now, let's use the numerical columns in the penguins dataset as our features:
In [133]:
numerical_features = [
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm',
    'body_mass_g'
]
penguins = sns.load_dataset("penguins").dropna().sample(frac=1, random_state=42)

X = penguins[numerical_features].to_numpy()

print(X.shape)
(333, 4)
  • $y$, a length-$n$ vector of labels representing some aspect we'd like to predict
    • for the penguins example, we'll use the species column as our $y$:
In [134]:
y = penguins["species"]
print(y.shape)
print(y[0])  # label lookup: the row whose original index is 0, not the first row after shuffling
(333,)
Adelie

This is a categorical column; the task of predicting its value is called classification because we are trying to classify the penguin as one of a discrete set of categories or labels.

We could also imagine predicting, say, flipper length from body mass and bill length:

In [135]:
y = penguins["flipper_length_mm"]
print(y.shape)
print(y[0])  # label lookup again: original row 0, not the first row after shuffling
(333,)
181.0

In this case, we are predicting a (continuous) numerical quantity; this is called regression.
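
To make the regression case concrete, here is a minimal sketch using scikit-learn's `KNeighborsRegressor`. The data is a tiny made-up example (not the penguins), chosen so the arithmetic is easy to check by hand: with uniform weights, a KNN regressor predicts the mean of the targets of the k nearest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data (made up for illustration): one feature, one continuous target.
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([10.0, 20.0, 30.0, 100.0, 110.0, 120.0])

# With k=3 and the default uniform weights, each prediction is the mean
# of the targets of the 3 nearest training points.
reg = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
pred = reg.predict(np.array([[2.1], [10.9]]))
print(pred)  # neighbors of 2.1 are 1, 2, 3; neighbors of 10.9 are 10, 11, 12
```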

For now, we'll stick with the species classification problem:

In [136]:
y = penguins["species"].map({"Adelie": 1, "Chinstrap": 2, "Gentoo": 3}).to_numpy()

Because the rows were shuffled when we loaded the data, we can simply take the first 300 rows as "known" training data and keep the remaining 33 as "unseen" test data:

In [137]:
Xtrain = X[:300, :]
ytrain = y[:300]
Xtest = X[300:, :]
ytest = y[300:]
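
Slicing works here only because the rows were already shuffled. As an alternative sketch, scikit-learn's `train_test_split` shuffles and splits in one step. The arrays below are stand-ins with the same shapes as our `X` and `y` (the values are placeholders, not penguin measurements).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same shapes as the lecture's X and y
# (placeholder values, not penguin measurements).
X_demo = np.arange(333 * 4, dtype=float).reshape(333, 4)
y_demo = np.arange(333) % 3 + 1

# Hold out 33 rows for testing; shuffling is on by default, and
# random_state makes the split reproducible.
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=33, random_state=42
)
print(Xtr.shape, Xte.shape)
```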

Now we'll train a KNN classifier on the training set:

In [138]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)

Predict labels for the test set:

In [139]:
ypred = knn.predict(Xtest)

...and evaluate the results:

In [140]:
print(np.sum(ypred == ytest), "/", len(ytest), "predictions were correct")
28 / 33 predictions were correct
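
For reference, here is a from-scratch sketch of what a KNN classifier does at prediction time. It is a simple illustration assuming Euclidean distance and majority vote, not a description of how scikit-learn implements it internally; the function name `knn_predict` and the toy points are made up for this example.

```python
import numpy as np

def knn_predict(Xtrain, ytrain, Xquery, k=1):
    """Predict a label for each row of Xquery by majority vote among
    its k nearest training points (Euclidean distance)."""
    preds = []
    for q in Xquery:
        # distance from q to every training point
        dists = np.sqrt(np.sum((Xtrain - q) ** 2, axis=1))
        # indices of the k closest training points
        nearest = np.argsort(dists)[:k]
        # majority vote; np.unique sorts labels, so ties go to the smallest label
        labels, counts = np.unique(ytrain[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

# Toy check (made-up points): two well-separated clusters.
Xtr_toy = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
ytr_toy = np.array([1, 1, 2, 2])
toy_pred = knn_predict(Xtr_toy, ytr_toy, np.array([[0.2, 0.4], [4.8, 5.5]]), k=1)
print(toy_pred)
```

Notice there is no real "training" step: the classifier just stores the training set and does all the work when asked for a prediction.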

Why isn't this perfect?

In [141]:
sns.pairplot(data=penguins, hue="species");
[Figure: pairplot of the four numerical features, colored by species]

I did some Pandas nonsense to add a column to the penguins df containing each penguin's L2 (Euclidean) distance to penguin 0:

In [142]:
# our friendly L^p distance function:
def L(p, a, b):
    """ Compute the L^p distance between vectors a and b
    Pre: p > 0 and a, b are d-dimensional 1d arrays """
    return np.sum(np.abs(a - b) ** p) ** (1/p)

# grab the 0th penguin (in the whole dataset, not just the training set)
penguin_0 = penguins[numerical_features].iloc[0]

# add an "L2" column with each penguin's distance to penguin_0
# (apply with axis=1 passes one row at a time, already restricted to numerical_features)
penguins["L2"] = penguins[numerical_features].apply(lambda b: L(2, penguin_0, b), axis=1)
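
The `apply` call above runs `L` once per row. As a sketch of an alternative, the same distances can be computed in one shot with NumPy broadcasting. The helper name `lp_distances` is made up for this example, and the third row of the array is a hypothetical penguin.

```python
import numpy as np

def lp_distances(p, a, B):
    """L^p distance from vector a (shape (d,)) to each row of B (shape (n, d)).
    Assumes p > 0."""
    # broadcasting subtracts a from every row of B
    return np.sum(np.abs(B - a) ** p, axis=1) ** (1 / p)

# First two rows are the first two penguins displayed earlier;
# the third row is a hypothetical penguin added for illustration.
B = np.array([
    [39.5, 16.7, 178.0, 3250.0],
    [46.9, 14.6, 222.0, 4875.0],
    [40.0, 17.0, 180.0, 3300.0],
])
dists = lp_distances(2, B[0], B)
print(dists)

# For p = 2 this should agree with NumPy's built-in Euclidean norm.
assert np.allclose(dists, np.linalg.norm(B - B[0], axis=1))
```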

Now I can rank the penguins by similarity to Penguin 0:

In [143]:
penguins.sort_values("L2").head(20)
Out[143]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex L2
30 Adelie Dream 39.5 16.7 178.0 3250.0 Female 0.000000
158 Chinstrap Dream 46.1 18.2 178.0 3250.0 Female 6.768309
208 Chinstrap Dream 45.2 16.6 191.0 3250.0 Female 14.195070
200 Chinstrap Dream 51.5 18.7 187.0 3250.0 Male 15.132746
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female 17.068392
126 Adelie Torgersen 38.8 17.6 191.0 3275.0 Female 28.201064
12 Adelie Torgersen 41.1 17.6 182.0 3200.0 Female 50.193326
38 Adelie Dream 37.6 19.3 181.0 3300.0 Female 50.193326
182 Chinstrap Dream 40.9 16.6 187.0 3200.0 Female 50.822928
27 Adelie Biscoe 40.5 17.9 187.0 3200.0 Female 50.827552
94 Adelie Dream 36.2 17.3 187.0 3300.0 Female 50.914143
32 Adelie Dream 39.5 17.8 188.0 3300.0 Female 51.002059
80 Adelie Torgersen 34.6 17.2 189.0 3200.0 Female 51.432091
112 Adelie Biscoe 39.7 17.7 193.0 3200.0 Female 52.211493
176 Chinstrap Dream 46.7 17.9 195.0 3300.0 Female 53.313038
168 Chinstrap Dream 50.3 20.0 197.0 3300.0 Male 54.667449
108 Adelie Biscoe 38.1 17.0 181.0 3175.0 Female 75.073631
18 Adelie Torgersen 34.4 18.4 184.0 3325.0 Female 75.431426
119 Adelie Torgersen 41.1 18.6 189.0 3325.0 Male 75.843062
130 Adelie Torgersen 38.5 17.9 190.0 3325.0 Female 75.969994

Let's pairplot the penguins, now with these distances included:

In [144]:
sns.pairplot(data=penguins, hue="species");
[Figure: pairplot of the four numerical features plus the L2 column, colored by species]

Notice: the L2 distance and the body_mass_g columns are... basically the same thing.

In [145]:
sns.relplot(data=penguins, x="body_mass_g", y="L2", hue="species")
Out[145]:
<seaborn.axisgrid.FacetGrid at 0x1489eab161b0>
[Figure: scatter plot of L2 against body_mass_g, colored by species]

Discuss at your table: What's happening here? Is this what we wanted? If not, why did this happen and what could we do about it?


Let's look at two penguins and see why this might have happened:

In [146]:
two_penguins = penguins.iloc[:2][numerical_features].to_numpy()
two_penguins
Out[146]:
array([[  39.5,   16.7,  178. , 3250. ],
       [  46.9,   14.6,  222. , 4875. ]])

The difference between these two 4D vectors?

In [147]:
d = two_penguins[0,:] - two_penguins[1,:]
d
Out[147]:
array([   -7.4,     2.1,   -44. , -1625. ])

What does this mean? With our current numerical features, the L2 distance is dominated by body mass, so each penguin's "nearest neighbor" is just going to be the penguin whose weight is closest. The other features won't make a dent!
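
To put a number on that, here is a quick check of how much each feature contributes to the squared L2 distance between these two penguins, using the difference vector from the cell above.

```python
import numpy as np

# difference vector from the cell above
d = np.array([-7.4, 2.1, -44.0, -1625.0])

sq = d ** 2                # per-feature squared differences
share = sq / sq.sum()      # fraction of the squared distance owed to each feature
l2 = np.sqrt(sq.sum())     # the L2 distance itself

print(l2)
print(share)               # the last entry (body mass) is over 99.9% of the total
```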

Feature Extraction, Version 0.1

Previously, we extracted a 4D feature vector by just taking the numerical columns for each penguin.

This time, we'll add a step: convert each numerical column to $z$-scores:

In [148]:
df = penguins.copy(deep=True)
for col in numerical_features:
    data = df[col]
    df[col] = (data - data.mean()) / data.std()
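
scikit-learn's `StandardScaler` performs the same kind of standardization. One detail worth knowing: it divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two results differ by a constant factor. The sketch below checks this on a small illustrative frame (the column names and values are chosen just for this example).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative sample with two columns on very different scales.
toy = pd.DataFrame({
    "length_mm": [39.5, 46.9, 42.1, 49.8],
    "mass_g": [3250.0, 4875.0, 4000.0, 3675.0],
})

# pandas z-scores, as in the loop above (sample std, ddof=1)
z_pandas = (toy - toy.mean()) / toy.std()

# StandardScaler z-scores (population std, ddof=0)
z_sklearn = StandardScaler().fit_transform(toy.to_numpy())

# The two differ only by the constant factor sqrt(n / (n - 1)).
n = len(toy)
assert np.allclose(z_sklearn, z_pandas.to_numpy() * np.sqrt(n / (n - 1)))
```

For KNN the constant factor makes no difference to which neighbor is nearest, since every feature is rescaled by the same amount.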
In [149]:
df
Out[149]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex L2
30 Adelie Dream -0.821552 -0.236064 -1.638652 -1.188572 Female 0.000000
317 Gentoo Biscoe 0.531612 -1.302467 1.500670 0.829520 Female 1625.613783
79 Adelie Torgersen -0.346116 0.982683 -0.425733 -0.257145 Male 750.200986
201 Chinstrap Dream 1.061905 0.068623 -0.211688 -0.660763 Female 425.595406
63 Adelie Biscoe -0.528976 0.525653 -0.639777 -0.195050 Male 800.125496
... ... ... ... ... ... ... ... ...
194 Chinstrap Dream 1.263051 0.982683 -0.354384 -0.816001 Male 300.765224
77 Adelie Torgersen -1.242129 1.135027 -1.210563 -0.381335 Male 650.037368
112 Adelie Biscoe -0.784980 0.271748 -0.568429 -1.250667 Female 52.211493
277 Gentoo Biscoe 0.275608 -1.099343 1.357973 0.984758 Male 1750.515036
108 Adelie Biscoe -1.077555 -0.083720 -1.424608 -1.281715 Female 75.073631

333 rows × 8 columns

Now compute the L2 distance again:

In [150]:
penguin_0 = df[numerical_features].iloc[0]

df["L2_scaled"] = df[numerical_features].apply(lambda b: L(2, penguin_0, b), axis=1)

Rank them:

In [151]:
df.sort_values("L2_scaled").head(20)
Out[151]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex L2 L2_scaled
30 Adelie Dream -0.821552 -0.236064 -1.638652 -1.188572 Female 0.000000 0.000000
122 Adelie Torgersen -0.693550 -0.083720 -1.781349 -0.940192 Female 200.011450 0.348781
108 Adelie Biscoe -1.077555 -0.083720 -1.424608 -1.281715 Female 75.073631 0.378467
12 Adelie Torgersen -0.528976 0.220967 -1.353259 -1.250667 Female 50.193326 0.616265
102 Adelie Biscoe -1.150699 -0.591532 -1.281911 -1.405905 Female 175.082066 0.639682
182 Chinstrap Dream -0.565548 -0.286845 -0.996518 -1.250667 Female 50.822928 0.695923
138 Adelie Dream -1.278701 -0.337626 -1.139215 -1.002287 Female 150.184187 0.709536
44 Adelie Dream -1.278701 -0.134501 -1.139215 -1.499048 Female 250.110556 0.751754
24 Adelie Biscoe -0.949553 0.017842 -1.495956 -0.505525 Male 550.004309 0.753504
62 Adelie Biscoe -1.168985 -0.083720 -1.139215 -0.753906 Female 350.075278 0.763080
6 Adelie Torgersen -0.931267 0.322529 -1.424608 -0.722858 Female 375.014093 0.766007
141 Adelie Dream -0.620406 0.017842 -0.996518 -0.909144 Male 225.183170 0.771585
58 Adelie Biscoe -1.370131 -0.286845 -1.424608 -1.685333 Female 400.022512 0.772078
172 Chinstrap Dream -0.291258 0.068623 -1.424608 -0.753906 Female 350.025385 0.780253
56 Adelie Biscoe -0.912981 0.170185 -1.067867 -0.816001 Female 300.108131 0.798751
50 Adelie Biscoe -0.803266 0.271748 -1.067867 -0.878096 Female 250.129986 0.824863
184 Chinstrap Dream -0.272972 -0.236064 -0.996518 -1.064382 Female 100.448992 0.853639
134 Adelie Dream -1.077555 0.220967 -0.996518 -0.971239 Female 175.239179 0.856729
116 Adelie Torgersen -0.986125 -0.083720 -0.925170 -1.623238 Female 350.144113 0.865034
60 Adelie Biscoe -1.516419 -0.134501 -1.139215 -1.312762 Female 100.316898 0.870642

And pairplot:

In [152]:
sns.pairplot(data=df.drop(columns="L2"), hue="species")
Out[152]:
<seaborn.axisgrid.PairGrid at 0x1489f71922d0>
[Figure: pairplot of the standardized numerical features plus the L2_scaled column, colored by species]

Let's try our KNN classifier again, now with standardized features:

In [153]:
X = df[numerical_features].to_numpy()
y = df["species"].map({"Adelie": 1, "Chinstrap": 2, "Gentoo": 3}).to_numpy()

Xtrain = X[:300, :]
ytrain = y[:300]
Xtest = X[300:, :]
ytest = y[300:]
In [154]:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)
ypred = knn.predict(Xtest)
print(np.sum(ypred == ytest), "/", len(ytest), "predictions were correct")
33 / 33 predictions were correct
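
One caveat about the cells above: the z-scores were computed from all 333 rows, test rows included. A common, somewhat stricter approach is to learn the mean and standard deviation from the training rows only. Here is a sketch of that using a scikit-learn pipeline. The data is synthetic (not penguins) and built so that one feature carries the label on a small scale while the other is pure noise on a large scale, mimicking millimeters next to grams.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300

# Synthetic data (not penguins): feature 0 carries the label on a small
# scale; feature 1 is pure noise on a large scale.
labels = np.arange(n) % 2
informative = labels + rng.normal(0.0, 0.05, size=n)
noise = rng.uniform(0.0, 1000.0, size=n)
X_syn = np.column_stack([informative, noise])

Xtr, ytr = X_syn[:200], labels[:200]
Xte, yte = X_syn[200:], labels[200:]

# 1-NN on raw features: distances are dominated by the noisy feature.
raw = KNeighborsClassifier(n_neighbors=1).fit(Xtr, ytr)
acc_raw = raw.score(Xte, yte)

# Same classifier, but the pipeline standardizes first, using the
# mean and standard deviation of the training rows only.
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
scaled.fit(Xtr, ytr)
acc_scaled = scaled.score(Xte, yte)

print("raw:", acc_raw, "standardized:", acc_scaled)
```

On data like this we'd expect the raw classifier to do little better than guessing and the standardized one to do much better, for the same reason the penguin results improved.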

What about Categorical columns?

In [155]:
dfo = df.copy(deep=True)
In [156]:
categorical_features = ["island", "sex"]
dfo
Out[156]:
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex L2 L2_scaled
30 Adelie Dream -0.821552 -0.236064 -1.638652 -1.188572 Female 0.000000 0.000000
317 Gentoo Biscoe 0.531612 -1.302467 1.500670 0.829520 Female 1625.613783 4.110512
79 Adelie Torgersen -0.346116 0.982683 -0.425733 -0.257145 Male 750.200986 2.012490
201 Chinstrap Dream 1.061905 0.068623 -0.211688 -0.660763 Female 425.595406 2.440298
63 Adelie Biscoe -0.528976 0.525653 -0.639777 -0.195050 Male 800.125496 1.628082
... ... ... ... ... ... ... ... ... ...
194 Chinstrap Dream 1.263051 0.982683 -0.354384 -0.816001 Male 300.765224 2.760266
77 Adelie Torgersen -1.242129 1.135027 -1.210563 -0.381335 Male 650.037368 1.700490
112 Adelie Biscoe -0.784980 0.271748 -0.568429 -1.250667 Female 52.211493 1.186779
277 Gentoo Biscoe 0.275608 -1.099343 1.357973 0.984758 Male 1750.515036 3.956278
108 Adelie Biscoe -1.077555 -0.083720 -1.424608 -1.281715 Female 75.073631 0.378467

333 rows × 9 columns

Numerical Encodings for Categorical Values

In [157]:
dfo["island"].value_counts()
Out[157]:
island
Biscoe       163
Dream        123
Torgersen     47
Name: count, dtype: int64

Ordinal Encoding

In [158]:
dfo["island_ordinal"] = dfo["island"].map({"Biscoe": 1, "Dream": 2, "Torgersen": 3})
dfo[categorical_features + ["island_ordinal"]]
Out[158]:
island sex island_ordinal
30 Dream Female 2
317 Biscoe Female 1
79 Torgersen Male 3
201 Dream Female 2
63 Biscoe Male 1
... ... ... ...
194 Dream Male 2
77 Torgersen Male 3
112 Biscoe Female 1
277 Biscoe Male 1
108 Biscoe Female 1

333 rows × 3 columns
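
Something to keep in mind with ordinal encoding: it places the categories on a number line, so a distance-based method like KNN will treat some pairs of islands as farther apart than others. A small sketch of the distances implied by the mapping above, on a three-row toy frame:

```python
import pandas as pd

# The same mapping as in the cell above, applied to a three-row toy frame.
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Torgersen"]})
toy["island_ordinal"] = toy["island"].map({"Biscoe": 1, "Dream": 2, "Torgersen": 3})

codes = toy.set_index("island")["island_ordinal"]

# Distances implied by the encoding:
print(abs(codes["Biscoe"] - codes["Dream"]))      # 1
print(abs(codes["Biscoe"] - codes["Torgersen"]))  # 2
```

So under this encoding Torgersen is "twice as far" from Biscoe as Dream is, even though nothing about the islands themselves suggests that ordering.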

One-Hot Encoding

In [159]:
islands = list(dfo["island"].unique())
for island_name in islands:
    dfo[island_name] = (dfo["island"] == island_name).astype(int)
In [160]:
dfo[categorical_features + islands]
Out[160]:
island sex Dream Biscoe Torgersen
30 Dream Female 1 0 0
317 Biscoe Female 0 1 0
79 Torgersen Male 0 0 1
201 Dream Female 1 0 0
63 Biscoe Male 0 1 0
... ... ... ... ... ...
194 Dream Male 1 0 0
77 Torgersen Male 0 0 1
112 Biscoe Female 0 1 0
277 Biscoe Male 0 1 0
108 Biscoe Female 0 1 0

333 rows × 5 columns
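
pandas can also build the indicator columns for us with `pd.get_dummies`. The sketch below uses a small toy frame and also checks a useful property of one-hot vectors: every pair of distinct categories ends up the same Euclidean distance apart.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"island": ["Dream", "Biscoe", "Torgersen", "Dream"]})

# pd.get_dummies builds one indicator column per category
# (columns come out in sorted category order).
onehot = pd.get_dummies(toy["island"], dtype=int)
print(onehot)

# Every pair of distinct islands is the same distance apart: sqrt(2).
vecs = onehot.drop_duplicates().to_numpy()
for i in range(len(vecs)):
    for j in range(i + 1, len(vecs)):
        assert np.isclose(np.linalg.norm(vecs[i] - vecs[j]), np.sqrt(2))
```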

Exercise: When would ordinal encoding be the better choice, and when would one-hot encoding be?
