Announcements:
Goals:
- Be able to define supervised learning, classification and regression
- Understand and be able to implement a k-nearest-neighbors (KNN) classifier or regressor
- Know why and how to subdivide datasets into training, validation, and test sets
- Understand what hyperparameters are and how to tune them using a validation set
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Today we'll talk more about supervised learning; as a reminder, this is where we have:
- (as before) a dataset $X$ with shape $(n, d)$ that has $n$ datapoints each represented by a $d$-dimensional feature vector.
numerical_features = [
'bill_length_mm',
# 'bill_depth_mm',
# 'flipper_length_mm',
'body_mass_g'
]
penguins = sns.load_dataset("penguins").dropna().sample(frac=1, random_state=42)
# standardize the numerical columns to produce X:
X = np.zeros_like(penguins[numerical_features].to_numpy())
for i, col in enumerate(numerical_features):
c = penguins[col]
X[:,i] = (c - c.mean()) / c.std()
print(X.shape)
- (new for supervised learning) $y$, a length-$n$ vector of labels representing some aspect we'd like to predict
In our Penguins example, we could use the species column as our $y$.
y = penguins["species"]
print(y.shape)
print(y.iloc[0])
This is a categorical column; the task of predicting its value is called classification because we are trying to classify the penguin as one of a discrete set of categories or labels.
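Here, the discrete set of labels is the three penguin species:
print(y.unique())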
We could also imagine predicting, say, flipper length from body mass and bill length:
y = penguins["flipper_length_mm"]
print(y.shape)
print(y.iloc[0])
In this case, we are predicting a (continuous) numerical quantity; this is called regression.
For now, we'll stick with the species classification problem:
y = penguins["species"]
We still have our trusty $L^p$ distance function available, and I've followed it with a "vectorized" version that calculates many distances in one go:
def L(p, a, b):
""" Compute the L^p distance between vectors a and b
Pre: p > 0 and a, b are d-dimensional 1d arrays """
return np.sum(np.abs(a - b) ** p) ** (1/p)
def L_vectorized(p, X, b):
""" Compute the L^p distance between each row of X (n, d)
and b (d,). Returns a vector of size (n,). """
n, d = X.shape
    return np.sum(np.abs(X - b.reshape((1, d))) ** p, axis=1) ** (1/p)
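As a quick sanity check, the two versions should agree; here we compare the distance from row 1 of X to row 0 computed both ways:
b = X[0, :]
print(L(2, X[1, :], b))             # loop version, one pair at a time
print(L_vectorized(2, X, b)[1])     # vectorized version, pick out entry 1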
We'll split the dataset into some "known" training data and "unseen" validation data to test on:
Xtrain = X[:300, :]
ytrain = y[:300]
Xval = X[300:, :]
yval = y[300:]
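A quick look at the split sizes:
print(Xtrain.shape, ytrain.shape)
print(Xval.shape, yval.shape)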
Exercise: implement and evaluate the accuracy of a 1-nearest-neighbor classifier by following the pseudocode below. You should find the L_vectorized function from above helpful.
correct_guesses = 0
for i in range(len(yval)):
pass
# calculate the distances between each row of Xtrain and Xval[i,:]
# find the label of the penguin with the smallest distance
    # check whether it matches the true label (yval.iloc[i]), add one to correct_guesses if so
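One possible solution, using the Euclidean (p = 2) distance; try it yourself before peeking:
correct_guesses = 0
for i in range(len(yval)):
    # distances from every training point to the i-th validation point
    dists = L_vectorized(2, Xtrain, Xval[i, :])
    # the label of the closest training point is our guess
    guess = ytrain.iloc[np.argmin(dists)]
    # compare against the true label
    if guess == yval.iloc[i]:
        correct_guesses += 1
print("1-NN accuracy:", correct_guesses / len(yval))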
Next: generalize this to a K-nearest neighbors classifier by taking the most common label from among the K nearest neighbors.
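Here is a sketch of one way to do this (knn_predict is just an illustrative helper name, K = 5 is an arbitrary choice, and ties in the vote go to whichever label Counter encountered first):
from collections import Counter

def knn_predict(K, Xtrain, ytrain, x):
    """ Predict the label of x as the most common label
    among its K nearest training points. """
    dists = L_vectorized(2, Xtrain, x)
    nearest = np.argsort(dists)[:K]        # indices of the K smallest distances
    votes = Counter(ytrain.iloc[nearest])  # count labels among those neighbors
    return votes.most_common(1)[0][0]

correct_guesses = 0
for i in range(len(yval)):
    if knn_predict(5, Xtrain, ytrain, Xval[i, :]) == yval.iloc[i]:
        correct_guesses += 1
print("5-NN accuracy:", correct_guesses / len(yval))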