Announcements:
Goals:
- Be able to define supervised learning, classification and regression
- Understand and be able to implement a k-nearest-neighbors (KNN) classifier or regressor
- Know the purpose and mechanism of some basic feature preprocessing techniques:
  - Standardization and scaling for numerical features
  - Ordinal and one-hot encoding for categorical features
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn.neighbors  # binds the name `sklearn` and makes sure the neighbors submodule is loaded
penguins
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 30 | Adelie | Dream | 39.5 | 16.7 | 178.0 | 3250.0 | Female |
| 317 | Gentoo | Biscoe | 46.9 | 14.6 | 222.0 | 4875.0 | Female |
| 79 | Adelie | Torgersen | 42.1 | 19.1 | 195.0 | 4000.0 | Male |
| 201 | Chinstrap | Dream | 49.8 | 17.3 | 198.0 | 3675.0 | Female |
| 63 | Adelie | Biscoe | 41.1 | 18.2 | 192.0 | 4050.0 | Male |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | Chinstrap | Dream | 50.9 | 19.1 | 196.0 | 3550.0 | Male |
| 77 | Adelie | Torgersen | 37.2 | 19.4 | 184.0 | 3900.0 | Male |
| 112 | Adelie | Biscoe | 39.7 | 17.7 | 193.0 | 3200.0 | Female |
| 277 | Gentoo | Biscoe | 45.5 | 15.0 | 220.0 | 5000.0 | Male |
| 108 | Adelie | Biscoe | 38.1 | 17.0 | 181.0 | 3175.0 | Female |
333 rows × 7 columns
Today we'll talk more about supervised learning; as a reminder, this is where we have:
- a dataset $X$ with shape $(n, d)$ that has $n$ datapoints each represented by a $d$-dimensional feature vector.
  - for now, let's use the numerical columns in the penguins dataset as our features:
numerical_features = [
    'bill_length_mm',
    'bill_depth_mm',
    'flipper_length_mm',
    'body_mass_g'
]
penguins = sns.load_dataset("penguins").dropna().sample(frac=1, random_state=42)
X = penguins[numerical_features].to_numpy()
print(X.shape)
(333, 4)
- $y$, a length-$n$ vector of labels representing some aspect we'd like to predict
  - for the penguins example, we'll use the species column as our $y$:
y = penguins["species"]
print(y.shape)
print(y.loc[0])  # label-based lookup: the penguin whose index label is 0, not the first row
(333,)
Adelie
This is a categorical column; the task of predicting its value is called classification because we are trying to classify each penguin as one of a discrete set of categories, or labels.
We could also imagine predicting, say, flipper length from body mass and bill length:
y = penguins["flipper_length_mm"]
print(y.shape)
print(y.loc[0])  # label-based lookup: the penguin whose index label is 0, not the first row
(333,)
181.0
In this case, we are predicting a (continuous) numerical quantity; this is called regression.
For now, we'll stick with the species classification problem:
y = penguins["species"].map({"Adelie": 1, "Chinstrap": 2, "Gentoo": 3}).to_numpy()
We'll split the dataset into some "known" training data and some "unseen" held-out data to test on:
Xtrain = X[:300, :]
ytrain = y[:300]
Xtest = X[300:, :]
ytest = y[300:]
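Slicing off the last 33 rows only gives a representative test set because the rows were shuffled when the data was loaded (`.sample(frac=1, random_state=42)`); the index labels in the table above suggest the raw dataset is stored species by species. Here is a minimal sketch of why that matters, using made-up labels rather than the penguins (`y_sorted` and the seed are assumptions for illustration):

```python
import numpy as np

# Toy labels sorted by class, mimicking a dataset stored class by class.
y_sorted = np.array([1] * 6 + [2] * 3 + [3] * 3)

# Slicing off the tail without shuffling: the "test" set holds a single class.
test_unshuffled = y_sorted[9:]

# Shuffle the row indices once, then slice. The seed is arbitrary.
rng = np.random.default_rng(0)
order = rng.permutation(len(y_sorted))
train_idx, test_idx = order[:9], order[9:]

print("unshuffled test labels:", test_unshuffled)
print("shuffled test labels:  ", y_sorted[test_idx])
```

Every row still lands in exactly one of the two splits; shuffling only changes which rows end up held out.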
Now we'll train a KNN classifier on the training set:
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)
Predict labels for the test set:
ypred = knn.predict(Xtest)
...and evaluate the results:
print(np.sum(ypred == ytest), "/", len(ytest), "predictions were correct")
28 / 33 predictions were correct
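To make the classifier less of a black box, here is a minimal brute-force sketch of the KNN idea in NumPy, assuming Euclidean distance and a plain majority vote (or a plain average, for regression). The helper names `nearest_indices`, `knn_classify`, and `knn_regress` are made up for this sketch, and it is not necessarily how scikit-learn computes its neighbors internally.

```python
import numpy as np

def nearest_indices(Xtrain, x, k):
    """Indices of the k training rows closest to x in Euclidean (L2) distance."""
    dists = np.sqrt(np.sum((Xtrain - x) ** 2, axis=1))
    return np.argsort(dists, kind="stable")[:k]

def knn_classify(Xtrain, ytrain, Xquery, k=1):
    """Predict each query's label by majority vote among its k nearest neighbors."""
    preds = []
    for x in Xquery:
        neighbor_labels = ytrain[nearest_indices(Xtrain, x, k)]
        labels, counts = np.unique(neighbor_labels, return_counts=True)
        preds.append(labels[np.argmax(counts)])  # ties go to the smallest label
    return np.array(preds)

def knn_regress(Xtrain, ytrain, Xquery, k=1):
    """Predict each query's value as the mean target of its k nearest neighbors."""
    return np.array([ytrain[nearest_indices(Xtrain, x, k)].mean() for x in Xquery])

# Tiny made-up dataset: two well-separated clusters in 2D
Xtoy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
ytoy = np.array([1, 1, 1, 2, 2, 2])
queries = np.array([[0.5, 0.5], [5.5, 5.5]])

print(knn_classify(Xtoy, ytoy, queries, k=3))
print(knn_regress(Xtoy, ytoy.astype(float), queries, k=3))
```

There is no real "training" step here: the model is just the stored training set, and all the work happens at prediction time.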
Why isn't this perfect?
sns.pairplot(data=penguins, hue="species");
I did some Pandas nonsense to add a column to the penguins df containing distances to penguin 0:
# our friendly L^p distance function:
def L(p, a, b):
    """ Compute the L^p distance between vectors a and b
    Pre: p > 0 and a, b are d-dimensional 1d arrays """
    return np.sum(np.abs(a - b) ** p) ** (1/p)
# grab the 0th penguin (in the whole dataset, not just the training set)
penguin_0 = penguins.iloc[0][numerical_features]
# add an "L2" column with distance to penguin_0
penguins["L2"] = penguins[numerical_features].apply(lambda b: L(2, penguin_0, b[numerical_features]), axis=1)
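As a quick sanity check on the distance function, here is a small sketch that evaluates it on a pair of made-up 2D vectors where the answers are easy to verify by hand. The function is repeated so the snippet runs on its own; the vectors `a` and `b` are assumptions for illustration.

```python
import numpy as np

def L(p, a, b):
    """L^p distance between 1d arrays a and b (same definition as above)."""
    return np.sum(np.abs(a - b) ** p) ** (1 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(L(1, a, b))  # |3| + |4|
print(L(2, a, b))  # sqrt(3^2 + 4^2)
```

With $p = 1$ the per-coordinate differences simply add up; with $p = 2$ we get the familiar straight-line (Euclidean) distance.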
Now I can rank the penguins by similarity to Penguin 0:
penguins.sort_values("L2").head(20)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | L2 |
|---|---|---|---|---|---|---|---|---|
| 30 | Adelie | Dream | 39.5 | 16.7 | 178.0 | 3250.0 | Female | 0.000000 |
| 158 | Chinstrap | Dream | 46.1 | 18.2 | 178.0 | 3250.0 | Female | 6.768309 |
| 208 | Chinstrap | Dream | 45.2 | 16.6 | 191.0 | 3250.0 | Female | 14.195070 |
| 200 | Chinstrap | Dream | 51.5 | 18.7 | 187.0 | 3250.0 | Male | 15.132746 |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female | 17.068392 |
| 126 | Adelie | Torgersen | 38.8 | 17.6 | 191.0 | 3275.0 | Female | 28.201064 |
| 12 | Adelie | Torgersen | 41.1 | 17.6 | 182.0 | 3200.0 | Female | 50.193326 |
| 38 | Adelie | Dream | 37.6 | 19.3 | 181.0 | 3300.0 | Female | 50.193326 |
| 182 | Chinstrap | Dream | 40.9 | 16.6 | 187.0 | 3200.0 | Female | 50.822928 |
| 27 | Adelie | Biscoe | 40.5 | 17.9 | 187.0 | 3200.0 | Female | 50.827552 |
| 94 | Adelie | Dream | 36.2 | 17.3 | 187.0 | 3300.0 | Female | 50.914143 |
| 32 | Adelie | Dream | 39.5 | 17.8 | 188.0 | 3300.0 | Female | 51.002059 |
| 80 | Adelie | Torgersen | 34.6 | 17.2 | 189.0 | 3200.0 | Female | 51.432091 |
| 112 | Adelie | Biscoe | 39.7 | 17.7 | 193.0 | 3200.0 | Female | 52.211493 |
| 176 | Chinstrap | Dream | 46.7 | 17.9 | 195.0 | 3300.0 | Female | 53.313038 |
| 168 | Chinstrap | Dream | 50.3 | 20.0 | 197.0 | 3300.0 | Male | 54.667449 |
| 108 | Adelie | Biscoe | 38.1 | 17.0 | 181.0 | 3175.0 | Female | 75.073631 |
| 18 | Adelie | Torgersen | 34.4 | 18.4 | 184.0 | 3325.0 | Female | 75.431426 |
| 119 | Adelie | Torgersen | 41.1 | 18.6 | 189.0 | 3325.0 | Male | 75.843062 |
| 130 | Adelie | Torgersen | 38.5 | 17.9 | 190.0 | 3325.0 | Female | 75.969994 |
Let's pairplot the penguins, now with these distances included:
sns.pairplot(data=penguins, hue="species");
Notice: the L2 column and the body_mass_g column carry... basically the same information (L2 is almost exactly the gap in body mass from penguin 0).
sns.relplot(data=penguins, x="body_mass_g", y="L2", hue="species")
Discuss at your table: What's happening here? Is this what we wanted? If not, why did this happen and what could we do about it?
Let's look at two penguins and see why this might have happened:
two_penguins = penguins.iloc[:2][numerical_features].to_numpy()
two_penguins
array([[ 39.5, 16.7, 178. , 3250. ],
[ 46.9, 14.6, 222. , 4875. ]])
The difference between these two 4D vectors?
d = two_penguins[0,:] - two_penguins[1,:]
d
array([ -7.4, 2.1, -44. , -1625. ])
What does this mean? With our current numerical features, the L2 distance is dominated by body mass, so each penguin's "nearest neighbor" is just going to be the penguin whose weight is closest. The other features won't make a dent!
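To put a number on "dominated", here is a small sketch that breaks the squared L2 distance between the two penguins above into per-feature contributions. The variable names are made up for this sketch; the values are copied from the `two_penguins` array.

```python
import numpy as np

# The two penguins shown above: bill length, bill depth, flipper length, body mass
a = np.array([39.5, 16.7, 178.0, 3250.0])
b = np.array([46.9, 14.6, 222.0, 4875.0])

sq = (a - b) ** 2            # per-feature squared differences
dist = np.sqrt(sq.sum())     # L2 distance between the two penguins
share = sq / sq.sum()        # fraction of the squared distance due to each feature

print(np.round(sq, 2))
print(round(float(dist), 2))
print(np.round(share, 4))
```

Body mass accounts for more than 99.9% of the squared distance here, simply because grams produce much bigger numbers than millimeters do.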
Feature Extraction, Version 0.1
Previously, we extracted a 4D feature vector by just taking the numerical columns for each penguin.
This time, we'll add a step: convert each numerical column to $z$-scores (subtract the column's mean, then divide by its standard deviation):
df = penguins.copy(deep=True)
for col in numerical_features:
    data = df[col]
    df[col] = (data - data.mean()) / data.std()
df
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | L2 |
|---|---|---|---|---|---|---|---|---|
| 30 | Adelie | Dream | -0.821552 | -0.236064 | -1.638652 | -1.188572 | Female | 0.000000 |
| 317 | Gentoo | Biscoe | 0.531612 | -1.302467 | 1.500670 | 0.829520 | Female | 1625.613783 |
| 79 | Adelie | Torgersen | -0.346116 | 0.982683 | -0.425733 | -0.257145 | Male | 750.200986 |
| 201 | Chinstrap | Dream | 1.061905 | 0.068623 | -0.211688 | -0.660763 | Female | 425.595406 |
| 63 | Adelie | Biscoe | -0.528976 | 0.525653 | -0.639777 | -0.195050 | Male | 800.125496 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | Chinstrap | Dream | 1.263051 | 0.982683 | -0.354384 | -0.816001 | Male | 300.765224 |
| 77 | Adelie | Torgersen | -1.242129 | 1.135027 | -1.210563 | -0.381335 | Male | 650.037368 |
| 112 | Adelie | Biscoe | -0.784980 | 0.271748 | -0.568429 | -1.250667 | Female | 52.211493 |
| 277 | Gentoo | Biscoe | 0.275608 | -1.099343 | 1.357973 | 0.984758 | Male | 1750.515036 |
| 108 | Adelie | Biscoe | -1.077555 | -0.083720 | -1.424608 | -1.281715 | Female | 75.073631 |
333 rows × 8 columns
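Written out, the loop above replaces entry $x_{ij}$ (penguin $i$, feature $j$) with its $z$-score. The symbols $\bar{x}_j$ and $s_j$ are notation introduced here for the column mean and standard deviation; the $n - 1$ in the denominator reflects the pandas default for `.std()`:

$$
z_{ij} = \frac{x_{ij} - \bar{x}_j}{s_j},
\qquad
\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij},
\qquad
s_j = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x_{ij} - \bar{x}_j\right)^2}
$$

After this step every numerical column has mean 0 and standard deviation 1, so a difference of 1 means "one standard deviation" for every feature.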
Now compute the L2 distance again:
penguin_0 = df.iloc[0][numerical_features]
df["L2_scaled"] = df[numerical_features].apply(lambda b: L(2, penguin_0, b[numerical_features]), axis=1)
Rank them:
df.sort_values("L2_scaled").head(20)
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | L2 | L2_scaled |
|---|---|---|---|---|---|---|---|---|---|
| 30 | Adelie | Dream | -0.821552 | -0.236064 | -1.638652 | -1.188572 | Female | 0.000000 | 0.000000 |
| 122 | Adelie | Torgersen | -0.693550 | -0.083720 | -1.781349 | -0.940192 | Female | 200.011450 | 0.348781 |
| 108 | Adelie | Biscoe | -1.077555 | -0.083720 | -1.424608 | -1.281715 | Female | 75.073631 | 0.378467 |
| 12 | Adelie | Torgersen | -0.528976 | 0.220967 | -1.353259 | -1.250667 | Female | 50.193326 | 0.616265 |
| 102 | Adelie | Biscoe | -1.150699 | -0.591532 | -1.281911 | -1.405905 | Female | 175.082066 | 0.639682 |
| 182 | Chinstrap | Dream | -0.565548 | -0.286845 | -0.996518 | -1.250667 | Female | 50.822928 | 0.695923 |
| 138 | Adelie | Dream | -1.278701 | -0.337626 | -1.139215 | -1.002287 | Female | 150.184187 | 0.709536 |
| 44 | Adelie | Dream | -1.278701 | -0.134501 | -1.139215 | -1.499048 | Female | 250.110556 | 0.751754 |
| 24 | Adelie | Biscoe | -0.949553 | 0.017842 | -1.495956 | -0.505525 | Male | 550.004309 | 0.753504 |
| 62 | Adelie | Biscoe | -1.168985 | -0.083720 | -1.139215 | -0.753906 | Female | 350.075278 | 0.763080 |
| 6 | Adelie | Torgersen | -0.931267 | 0.322529 | -1.424608 | -0.722858 | Female | 375.014093 | 0.766007 |
| 141 | Adelie | Dream | -0.620406 | 0.017842 | -0.996518 | -0.909144 | Male | 225.183170 | 0.771585 |
| 58 | Adelie | Biscoe | -1.370131 | -0.286845 | -1.424608 | -1.685333 | Female | 400.022512 | 0.772078 |
| 172 | Chinstrap | Dream | -0.291258 | 0.068623 | -1.424608 | -0.753906 | Female | 350.025385 | 0.780253 |
| 56 | Adelie | Biscoe | -0.912981 | 0.170185 | -1.067867 | -0.816001 | Female | 300.108131 | 0.798751 |
| 50 | Adelie | Biscoe | -0.803266 | 0.271748 | -1.067867 | -0.878096 | Female | 250.129986 | 0.824863 |
| 184 | Chinstrap | Dream | -0.272972 | -0.236064 | -0.996518 | -1.064382 | Female | 100.448992 | 0.853639 |
| 134 | Adelie | Dream | -1.077555 | 0.220967 | -0.996518 | -0.971239 | Female | 175.239179 | 0.856729 |
| 116 | Adelie | Torgersen | -0.986125 | -0.083720 | -0.925170 | -1.623238 | Female | 350.144113 | 0.865034 |
| 60 | Adelie | Biscoe | -1.516419 | -0.134501 | -1.139215 | -1.312762 | Female | 100.316898 | 0.870642 |
And pairplot:
sns.pairplot(data=df.drop(columns="L2"), hue="species")
Let's try our KNN classifier again, now with standardized features:
X = df[numerical_features].to_numpy()
y = df["species"].map({"Adelie": 1, "Chinstrap": 2, "Gentoo": 3}).to_numpy()
Xtrain = X[:300, :]
ytrain = y[:300]
Xtest = X[300:, :]
ytest = y[300:]
knn = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(Xtrain, ytrain)
ypred = knn.predict(Xtest)
print(np.sum(ypred == ytest), "/", len(ytest), "predictions were correct")
33 / 33 predictions were correct
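One variation worth knowing about: the $z$-scores above use means and standard deviations computed from all 333 rows, including the 33 we later hold out. A common alternative is to compute those statistics from the training rows only and then apply them to both splits. Here is a minimal sketch of that idea on made-up data (`Xdemo` is a random stand-in, not the penguins, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up stand-in for the feature matrix: 4 columns on very different scales
Xdemo = rng.normal(loc=[44.0, 17.0, 200.0, 4200.0],
                   scale=[5.0, 2.0, 14.0, 800.0],
                   size=(333, 4))
Xtr, Xte = Xdemo[:300], Xdemo[300:]

# Statistics come from the training rows only...
mu = Xtr.mean(axis=0)
sigma = Xtr.std(axis=0, ddof=1)   # ddof=1 matches the pandas .std() default

# ...and are then applied to both splits.
Xtr_z = (Xtr - mu) / sigma
Xte_z = (Xte - mu) / sigma

print(np.round(Xtr_z.mean(axis=0), 6), np.round(Xtr_z.std(axis=0, ddof=1), 6))
print(np.round(Xte_z.mean(axis=0), 2))  # near 0, but generally not exactly 0
```

The held-out rows are then treated the way genuinely new data would be: scaled with numbers that were fixed before we ever looked at them.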
What about categorical columns?
dfo = df.copy(deep=True)
categorical_features = ["island", "sex"]
dfo
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | L2 | L2_scaled |
|---|---|---|---|---|---|---|---|---|---|
| 30 | Adelie | Dream | -0.821552 | -0.236064 | -1.638652 | -1.188572 | Female | 0.000000 | 0.000000 |
| 317 | Gentoo | Biscoe | 0.531612 | -1.302467 | 1.500670 | 0.829520 | Female | 1625.613783 | 4.110512 |
| 79 | Adelie | Torgersen | -0.346116 | 0.982683 | -0.425733 | -0.257145 | Male | 750.200986 | 2.012490 |
| 201 | Chinstrap | Dream | 1.061905 | 0.068623 | -0.211688 | -0.660763 | Female | 425.595406 | 2.440298 |
| 63 | Adelie | Biscoe | -0.528976 | 0.525653 | -0.639777 | -0.195050 | Male | 800.125496 | 1.628082 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 194 | Chinstrap | Dream | 1.263051 | 0.982683 | -0.354384 | -0.816001 | Male | 300.765224 | 2.760266 |
| 77 | Adelie | Torgersen | -1.242129 | 1.135027 | -1.210563 | -0.381335 | Male | 650.037368 | 1.700490 |
| 112 | Adelie | Biscoe | -0.784980 | 0.271748 | -0.568429 | -1.250667 | Female | 52.211493 | 1.186779 |
| 277 | Gentoo | Biscoe | 0.275608 | -1.099343 | 1.357973 | 0.984758 | Male | 1750.515036 | 3.956278 |
| 108 | Adelie | Biscoe | -1.077555 | -0.083720 | -1.424608 | -1.281715 | Female | 75.073631 | 0.378467 |
333 rows × 9 columns
Numerical Encodings for Categorical Values
dfo["island"].value_counts()
island
Biscoe       163
Dream        123
Torgersen     47
Name: count, dtype: int64
Ordinal Encoding
dfo["island_ordinal"] = dfo["island"].map({"Biscoe": 1, "Dream": 2, "Torgersen": 3})
dfo[categorical_features + ["island_ordinal"]]
| | island | sex | island_ordinal |
|---|---|---|---|
| 30 | Dream | Female | 2 |
| 317 | Biscoe | Female | 1 |
| 79 | Torgersen | Male | 3 |
| 201 | Dream | Female | 2 |
| 63 | Biscoe | Male | 1 |
| ... | ... | ... | ... |
| 194 | Dream | Male | 2 |
| 77 | Torgersen | Male | 3 |
| 112 | Biscoe | Female | 1 |
| 277 | Biscoe | Male | 1 |
| 108 | Biscoe | Female | 1 |
333 rows × 3 columns
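Since KNN only ever looks at distances, it helps to ask what distances this encoding implies between islands. A small sketch, reusing the same mapping as above (the `codes` and `names` variables are made up for this sketch):

```python
# Same island-to-integer mapping as above
codes = {"Biscoe": 1, "Dream": 2, "Torgersen": 3}
names = list(codes)

# Distance between every pair of islands along the single ordinal feature
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(f"{a} vs {b}: {abs(codes[a] - codes[b])}")
```

Under this encoding Biscoe ends up twice as far from Torgersen as it is from Dream, which is a consequence of the order we happened to pick, not of anything in the data.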
One-Hot Encoding
islands = list(dfo["island"].unique())
for island_name in islands:
    dfo[island_name] = (dfo["island"] == island_name).astype(int)
dfo[categorical_features + islands]
| | island | sex | Dream | Biscoe | Torgersen |
|---|---|---|---|---|---|
| 30 | Dream | Female | 1 | 0 | 0 |
| 317 | Biscoe | Female | 0 | 1 | 0 |
| 79 | Torgersen | Male | 0 | 0 | 1 |
| 201 | Dream | Female | 1 | 0 | 0 |
| 63 | Biscoe | Male | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... |
| 194 | Dream | Male | 1 | 0 | 0 |
| 77 | Torgersen | Male | 0 | 0 | 1 |
| 112 | Biscoe | Female | 0 | 1 | 0 |
| 277 | Biscoe | Male | 0 | 1 | 0 |
| 108 | Biscoe | Female | 0 | 1 | 0 |
333 rows × 5 columns
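The same encoding can be sketched in plain NumPy, which also makes it easy to check the distances it implies. This is an illustrative alternative to the pandas loop above, not a replacement for it; `islands_demo` is a made-up five-row column, and the output columns here follow sorted category order rather than order of first appearance.

```python
import numpy as np

islands_demo = np.array(["Dream", "Biscoe", "Torgersen", "Dream", "Biscoe"])

# categories: sorted unique names; idx: which category each row belongs to
categories, idx = np.unique(islands_demo, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
onehot = np.eye(len(categories), dtype=int)[idx]

print(categories)
print(onehot)

# L2 distance between the one-hot vectors of each pair of different islands
E = np.eye(len(categories))
pair_dists = [np.linalg.norm(E[i] - E[j]) for i in range(3) for j in range(i + 1, 3)]
print(pair_dists)
```

Every pair of distinct islands is the same distance apart ($\sqrt{2}$), so one-hot encoding does not impose an order on the categories, at the cost of one extra column per category.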
Exercise: When would ordinal encoding be preferable to one-hot encoding, and when would one-hot be the better choice?