Lecture 16 - Machine Learning: Intro/Overview/Example; Vectors and Distances¶
Announcements:¶
- Detailed rubric for the Project is available - linked from the project writeup.
Goals:¶
- See a few different perspectives on "what is machine learning?"
- CS/programmer's perspective
- DS/data wrangler's perspective
- Stats/mathematician's perspective
- Know the basic mathematical setup of most machine learning methods.
- Be able to give some example use cases where machine learning is the best, if not only practical, way to solve a computing problem.
- Be able to describe the distinction between, and provide examples of, unsupervised and supervised learning.
- Know how to represent datapoints as vectors, and how to compute $L^p$ distances between them.
A data scientist's perspective:¶
A way to discover underlying structure in data that allows us to:
- Better understand the world that generated the data
- Predict values (columns, rows, subsets thereof) that:
- are missing (imputation), or
- haven't happened yet (prediction, time series forecasting, etc.)
A mathematical / probabilistic perspective:¶
Given a set of data $X$ and (possibly) labels $y$ (i.e., quantities to predict), model one or more of the following:
- $P(x)$
- $P(y \mid x)$
- $P(x, y)$.
Math Fact: If you know $P(x, y)$, you can compute $P(y \mid x)$ and $P(x)$:
$$P(y \mid x) = \frac{P(x, y)}{P(x)}$$ and $$P(x) = \sum_i P(x,y_i)$$
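As a small numeric check of the fact above, here is a sketch that starts from a joint distribution and recovers the marginal and the conditional. The table `P_xy` and its numbers are made up purely for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
# The numbers are invented for illustration; they only need to sum to 1.
P_xy = np.array([[0.10, 0.20, 0.10],
                 [0.30, 0.05, 0.25]])

# Marginal: P(x) = sum_i P(x, y_i)  (sum across the y axis)
P_x = P_xy.sum(axis=1)

# Conditional: P(y | x) = P(x, y) / P(x)  (divide each row by its marginal)
P_y_given_x = P_xy / P_x[:, None]

print(P_x)
print(P_y_given_x.sum(axis=1))  # each row of P(y | x) sums to 1
```

Going the other way is not possible in general: knowing only $P(x)$ or only $P(y \mid x)$ does not determine the joint.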
Machine Learning: When and Why?¶
- We don't know the underlying process that generated the data, but we have examples of its outputs
- Use cases: (so far) data-rich domains with problems we don't know how to solve any other way.
- High-profile examples are from complex domains:
- Speech recognition
- Image recognition / object detection
- Weather forecasting
- Natural Language Processing
- Self-driving cars (robotics)
- Drug design
Side note: Machine Learning vs. Artificial Intelligence¶
- AI is the study of making computer systems behave with "intelligence" (what does that mean?)
- ML is generally considered a subfield of AI
- ML has come the closest so far to exhibiting intelligent behavior
- Actively debated: is ML the way to "general" AI?
- In current usage, "AI" almost exclusively refers to ML-based methods
Supervised vs Unsupervised¶
Supervised Learning, by example:
- $X$ is a collection of $n$ homes with $d$ numbers about each one (square feet, # bedrooms, ...); $\mathbf{y}$ is the list of market values of each home.
Key property: a dataset $X$ with corresponding "ground truth" labels $\mathbf{y}$ is available, and we want to be able to predict a $y$ given a new $\mathbf{x}$.
Unsupervised Learning, by example:
- Same setup, but instead of predicting $y$ for a new $\mathbf{x}$, you want to know which of the $d$ variables most heavily influences home price.
Key property: you aren't given "the right answer" - you're looking to discover structure in the data.
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Machine Learning Hello World¶
We'll work with the classic iris dataset, which includes measurements and species labels for three different kinds of iris flowers.
iris = sns.load_dataset("iris")
iris
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
sns.pairplot(data=iris, hue="species")
<seaborn.axisgrid.PairGrid at 0x1476524b36b0>
Extract Features¶
The vast majority of machine learning methods assume each of your input datapoints is represented by a feature vector.
If it's not, it's usually your job to make it so - this is called feature extraction.
Given a DataFrame, we can treat each row as the feature vector for the thing the row describes. Traditionally, we arrange our dataset in an $N \times D$ matrix, where each row corresponds to a datapoint and each column corresponds to a single feature (variable). This is the same layout as a pandas table.
If your input is an audio signal, a sentence of text, an image, or some other not-obviously-vector-like thing, there may be more work to do.
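For text, one very simple way to get a feature vector is to count how often each word from a fixed vocabulary appears (a "bag of words"). The sketch below is only one of many possible feature extractors; the function name `bag_of_words`, the vocabulary, and the sentence are all hypothetical.

```python
from collections import Counter

def bag_of_words(sentence, vocabulary):
    """Return a list of word counts, one entry per vocabulary word."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary and sentence, chosen only for illustration.
vocabulary = ["the", "cat", "dog", "sat"]
features = bag_of_words("The cat sat near the other cat", vocabulary)
print(features)  # [2, 2, 0, 1]
```

Every sentence, whatever its length, now maps to a vector with one entry per vocabulary word, which is the fixed-length representation most ML methods expect.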
Given a dataframe like the iris dataset, it's pretty easy to get to an ML-style training dataset X, y:
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].to_numpy()
X.shape
(150, 4)
X[:10,:]
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2],
[5.4, 3.9, 1.7, 0.4],
[4.6, 3.4, 1.4, 0.3],
[5. , 3.4, 1.5, 0.2],
[4.4, 2.9, 1.4, 0.2],
[4.9, 3.1, 1.5, 0.1]])
iris["species"].value_counts()
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
y = iris["species"].map({"setosa": 1, "versicolor": 2, "virginica": 3}).to_numpy()
y.shape
(150,)
y
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])
Split the data into two sets: a training set and a test set.
shuffled = np.random.permutation(range(iris.shape[0]))
train_inds = shuffled[:100]
test_inds = shuffled[100:]
Xtrain = X[train_inds,:]
Xtest = X[test_inds, :]
ytrain = y[train_inds]
ytest = y[test_inds]
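The shuffle above is different on every run, so the accuracy reported later will vary slightly. If repeatable results are wanted, one option is to seed the random number generator. The helper `split_indices` below is a hypothetical sketch of that idea, not part of the original notebook.

```python
import numpy as np

def split_indices(n, n_train, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test parts."""
    rng = np.random.default_rng(seed)   # seeded generator: same shuffle every run
    shuffled = rng.permutation(n)
    return shuffled[:n_train], shuffled[n_train:]

train_inds, test_inds = split_indices(150, 100, seed=0)
print(len(train_inds), len(test_inds))  # 100 50
```

The two index arrays never overlap, so no test datapoint is seen during training.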
Train a model¶
(on the training data)
from sklearn.neighbors import KNeighborsClassifier
# Train a model on the training features (Xtrain) and labels (ytrain)
knn = KNeighborsClassifier(n_neighbors=3).fit(Xtrain, ytrain)
Use the Model to Make Predictions¶
(on the test data)
# Predict the test set label (ypred) given the features of the test set
ypred = knn.predict(Xtest)
Evaluate the Quality of the Predictions¶
# Compare the predictions (ypred) to the true labels for the test set (ytest)
(ytest == ypred).sum() / len(ytest)
np.float64(0.96)
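Accuracy is a single number; a table of counts by true and predicted label shows which classes get confused with each other. The sketch below builds such a table by hand on toy labels. The function name `confusion_counts` and the label arrays are made up for illustration and are not the notebook's actual `ytest` and `ypred`.

```python
import numpy as np

def confusion_counts(y_true, y_pred, labels):
    """Count how often each true label (row) received each predicted label (column)."""
    index = {label: i for i, label in enumerate(labels)}
    counts = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true.tolist(), y_pred.tolist()):
        counts[index[t], index[p]] += 1
    return counts

# Toy labels for illustration only.
y_true = np.array([1, 1, 2, 2, 3, 3, 3])
y_pred = np.array([1, 1, 2, 3, 3, 3, 2])
C = confusion_counts(y_true, y_pred, labels=[1, 2, 3])
print(C)
print(np.trace(C) / C.sum())  # accuracy: fraction of counts on the diagonal
```

Off-diagonal entries are the mistakes; in this toy example they all fall between labels 2 and 3.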
Let's visualize this:
results = pd.DataFrame(np.column_stack([Xtest, ytest, ypred, ytest == ypred]),
columns=list(iris.columns)+["Prediction", "Correct?"])
results
| | sepal_length | sepal_width | petal_length | petal_width | species | Prediction | Correct? |
|---|---|---|---|---|---|---|---|
| 0 | 6.3 | 2.9 | 5.6 | 1.8 | 3.0 | 3.0 | 1.0 |
| 1 | 4.8 | 3.4 | 1.6 | 0.2 | 1.0 | 1.0 | 1.0 |
| 2 | 6.9 | 3.1 | 4.9 | 1.5 | 2.0 | 2.0 | 1.0 |
| 3 | 6.3 | 2.7 | 4.9 | 1.8 | 3.0 | 3.0 | 1.0 |
| 4 | 7.3 | 2.9 | 6.3 | 1.8 | 3.0 | 3.0 | 1.0 |
| 5 | 4.5 | 2.3 | 1.3 | 0.3 | 1.0 | 1.0 | 1.0 |
| 6 | 5.8 | 2.8 | 5.1 | 2.4 | 3.0 | 3.0 | 1.0 |
| 7 | 5.2 | 2.7 | 3.9 | 1.4 | 2.0 | 2.0 | 1.0 |
| 8 | 4.4 | 3.2 | 1.3 | 0.2 | 1.0 | 1.0 | 1.0 |
| 9 | 5.9 | 3.0 | 5.1 | 1.8 | 3.0 | 3.0 | 1.0 |
| 10 | 6.3 | 3.4 | 5.6 | 2.4 | 3.0 | 3.0 | 1.0 |
| 11 | 4.6 | 3.6 | 1.0 | 0.2 | 1.0 | 1.0 | 1.0 |
| 12 | 6.1 | 3.0 | 4.6 | 1.4 | 2.0 | 2.0 | 1.0 |
| 13 | 5.6 | 2.8 | 4.9 | 2.0 | 3.0 | 3.0 | 1.0 |
| 14 | 6.2 | 2.2 | 4.5 | 1.5 | 2.0 | 2.0 | 1.0 |
| 15 | 4.9 | 3.1 | 1.5 | 0.1 | 1.0 | 1.0 | 1.0 |
| 16 | 6.7 | 3.3 | 5.7 | 2.5 | 3.0 | 3.0 | 1.0 |
| 17 | 6.9 | 3.1 | 5.4 | 2.1 | 3.0 | 3.0 | 1.0 |
| 18 | 5.2 | 3.4 | 1.4 | 0.2 | 1.0 | 1.0 | 1.0 |
| 19 | 4.9 | 3.6 | 1.4 | 0.1 | 1.0 | 1.0 | 1.0 |
| 20 | 5.4 | 3.9 | 1.3 | 0.4 | 1.0 | 1.0 | 1.0 |
| 21 | 6.7 | 3.3 | 5.7 | 2.1 | 3.0 | 3.0 | 1.0 |
| 22 | 5.0 | 3.6 | 1.4 | 0.2 | 1.0 | 1.0 | 1.0 |
| 23 | 6.6 | 2.9 | 4.6 | 1.3 | 2.0 | 2.0 | 1.0 |
| 24 | 5.1 | 3.5 | 1.4 | 0.2 | 1.0 | 1.0 | 1.0 |
| 25 | 6.2 | 3.4 | 5.4 | 2.3 | 3.0 | 3.0 | 1.0 |
| 26 | 6.1 | 2.9 | 4.7 | 1.4 | 2.0 | 2.0 | 1.0 |
| 27 | 5.5 | 2.5 | 4.0 | 1.3 | 2.0 | 2.0 | 1.0 |
| 28 | 4.9 | 3.1 | 1.5 | 0.2 | 1.0 | 1.0 | 1.0 |
| 29 | 6.5 | 3.2 | 5.1 | 2.0 | 3.0 | 3.0 | 1.0 |
| 30 | 7.7 | 2.8 | 6.7 | 2.0 | 3.0 | 3.0 | 1.0 |
| 31 | 5.2 | 4.1 | 1.5 | 0.1 | 1.0 | 1.0 | 1.0 |
| 32 | 5.6 | 2.9 | 3.6 | 1.3 | 2.0 | 2.0 | 1.0 |
| 33 | 6.3 | 3.3 | 4.7 | 1.6 | 2.0 | 2.0 | 1.0 |
| 34 | 7.7 | 3.8 | 6.7 | 2.2 | 3.0 | 3.0 | 1.0 |
| 35 | 4.9 | 2.5 | 4.5 | 1.7 | 3.0 | 2.0 | 0.0 |
| 36 | 6.1 | 2.6 | 5.6 | 1.4 | 3.0 | 3.0 | 1.0 |
| 37 | 6.7 | 3.0 | 5.0 | 1.7 | 2.0 | 2.0 | 1.0 |
| 38 | 5.9 | 3.0 | 4.2 | 1.5 | 2.0 | 2.0 | 1.0 |
| 39 | 6.7 | 3.1 | 4.7 | 1.5 | 2.0 | 2.0 | 1.0 |
| 40 | 5.1 | 3.8 | 1.9 | 0.4 | 1.0 | 1.0 | 1.0 |
| 41 | 5.0 | 3.4 | 1.5 | 0.2 | 1.0 | 1.0 | 1.0 |
| 42 | 6.3 | 2.8 | 5.1 | 1.5 | 3.0 | 2.0 | 0.0 |
| 43 | 5.4 | 3.4 | 1.5 | 0.4 | 1.0 | 1.0 | 1.0 |
| 44 | 5.4 | 3.4 | 1.7 | 0.2 | 1.0 | 1.0 | 1.0 |
| 45 | 6.0 | 3.0 | 4.8 | 1.8 | 3.0 | 3.0 | 1.0 |
| 46 | 7.4 | 2.8 | 6.1 | 1.9 | 3.0 | 3.0 | 1.0 |
| 47 | 6.9 | 3.1 | 5.1 | 2.3 | 3.0 | 3.0 | 1.0 |
| 48 | 5.7 | 2.6 | 3.5 | 1.0 | 2.0 | 2.0 | 1.0 |
| 49 | 5.0 | 3.5 | 1.6 | 0.6 | 1.0 | 1.0 | 1.0 |
palette = {
1.0: "#4C72B0", # setosa → blue
2.0: "#DD8452", # versicolor → orange
3.0: "#55A868", # virginica → green
}
sns.scatterplot(results, x="sepal_length", y="petal_length", hue="species", palette=palette, style="Correct?")
<Axes: xlabel='sepal_length', ylabel='petal_length'>
Distance Metrics¶
We used a $k$ nearest neighbors classifier, which finds the $k$ training datapoints nearest to a new point and assigns it the most common label among them.
An important question here is: what do we mean by nearest?
This is a fundamental primitive that many, if not all, ML methods rely on: the ability to compare datapoints to determine how similar or different they are.
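To make the role of "nearest" concrete, here is a minimal from-scratch sketch of a $k$ nearest neighbors prediction using Euclidean distance. It is only meant to illustrate the idea; it makes no claim to match how scikit-learn implements it, and the function name `knn_predict` and the toy data are assumptions.

```python
import numpy as np

def knn_predict(Xtrain, ytrain, x, k=3):
    """Predict a label for x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((Xtrain - x) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors (ties go to the smallest label)
    labels, counts = np.unique(ytrain[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2D data for illustration only: two well-separated clusters.
Xtoy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
ytoy = np.array([1, 1, 1, 2, 2, 2])
print(knn_predict(Xtoy, ytoy, np.array([0.3, 0.3])))  # 1
print(knn_predict(Xtoy, ytoy, np.array([4.8, 5.0])))  # 2
```

The only place the notion of distance enters is the line computing `dists`; swapping in a different distance changes which points count as neighbors.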
$L^p$ Distances¶
A common family of distance metrics is the $L^p$ distance:
$$d_p(a, b) = \sqrt[p]{\sum_{i=1}^d |a_i - b_i|^p}$$
When $p = 2$, this is the Euclidean distance we're all used to, based on the Pythagorean theorem; in 2D, it reduces to: $$\sqrt{(b_x - a_x)^2 + (b_y - a_y)^2}$$
Different values of $p$ give different behavior:
- For smaller $p$, small differences spread over many dimensions carry relatively more weight.
- For larger $p$, the single largest per-dimension difference increasingly dominates the distance.
Exercise 1: Write a function to compute $L^p$ distance between two vectors:
def L(p, a, b):
    """ Compute the L^p distance between vectors a and b
    Precondition: p > 0 and a, b are d-dimensional 1d arrays """
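One possible way to complete Exercise 1 is sketched below, for comparison after attempting it yourself. It follows the $d_p$ formula term by term; the name `lp_distance` is used here (rather than `L`) so it does not overwrite your own solution.

```python
import numpy as np

def lp_distance(p, a, b):
    """L^p distance between 1d arrays a and b, assuming p > 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum the p-th powers of the absolute differences, then take the p-th root
    return (np.abs(a - b) ** p).sum() ** (1.0 / p)

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(lp_distance(2, a, b))  # 5.0 (the 3-4-5 right triangle)
print(lp_distance(1, a, b))  # 7.0 (sum of absolute differences)
```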
The following visualizes the $L^p$ distance from the origin of points in a square with side length 2 centered at the origin. To help with visualization, the colormap is set up so that:
- Points a distance of 1 from the origin (i.e., on the unit circle) appear white.
- Points a distance less than 1 (i.e., inside the unit circle) appear blue, and
- Points a distance greater than 1 (i.e., outside the unit circle) appear red
def plot_lp_distance(p):
    """ Visualize the L^p distance from the origin. """
    N = 401
    x = np.linspace(-1, 1, N)
    y = np.linspace(-1, 1, N)
    X, Y = np.meshgrid(x, y)
    Z = (np.abs(X)**p + np.abs(Y)**p) ** (1.0 / p)
    plt.imshow(Z-1, extent=[-1,1,-1,1], origin='lower', cmap='seismic', vmin=-1, vmax=1)
    plt.colorbar()
plot_lp_distance(0.5)
Exercise 2: Run the above cell using the following values of $p$. For each one, describe the shape of the "unit circle".
- $p = 2$
- $p = 1$
- $p = 4$
- What is the shape of the unit circle in the limit as $p \rightarrow \infty$? How could you intuitively interpret this distance in terms of the values of $x$ and $y$?
- $p = 0.5$
- What is the shape of the unit circle in the limit as $p \rightarrow 0$? How could you intuitively interpret this distance in terms of the values of $x$ and $y$?
More Distance Metrics¶
Hamming Distance¶
For vectors of categorical values, Hamming distance is the number of dimensions in which two vectors differ: $$d(a, b) = \sum_i \mathbb{1}(a_i \ne b_i)$$ where $\mathbb{1}(\cdot)$ is an indicator function that has value 1 if its argument is true and 0 otherwise.
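The Hamming distance formula translates almost directly into code. The sketch below is a plain-Python illustration; the function name `hamming_distance` and the example vectors are hypothetical.

```python
def hamming_distance(a, b):
    """Number of positions at which sequences a and b differ."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    # Add 1 for every dimension i where a_i != b_i
    return sum(1 for ai, bi in zip(a, b) if ai != bi)

# Hypothetical categorical vectors, for illustration only.
a = ["red", "small", "round"]
b = ["red", "large", "square"]
print(hamming_distance(a, b))  # 2
```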
Cosine Similarity¶
A similarity (not distance) metric that considers only vector direction, not magnitude:
$$ sim(a, b) = \cos \theta = \frac{a^Tb}{\sqrt{(a^Ta)(b^Tb)}}$$
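A short sketch of the cosine similarity formula in NumPy follows. It assumes neither vector is all zeros (otherwise the denominator is zero); the function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between 1d arrays a and b; assumes neither is the zero vector."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print(cosine_similarity([1, 0], [2, 0]))   # 1.0: same direction, different length
print(cosine_similarity([1, 0], [0, 3]))   # 0.0: perpendicular
print(cosine_similarity([1, 0], [-4, 0]))  # -1.0: opposite direction
```

Scaling either vector by a positive constant leaves the result unchanged, which is what "considers only direction, not magnitude" means in practice.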
