Lecture 16 - Machine Learning: Intro/Overview/Example; Vectors and Distances¶

Announcements:¶

  • Detailed rubric for the Project is available - linked from the project writeup.

Goals:¶

  • See a few different perspectives on "what is machine learning?"
    • CS/programmer's perspective
    • DS/data wrangler's perspective
    • Stats/mathematician's perspective
  • Know the basic mathematical setup of most machine learning methods.
  • Be able to give some example use cases where machine learning is the best, if not the only practical, way to solve a computing problem.
  • Be able to describe the distinction between, and provide examples of, unsupervised and supervised learning.
  • Know how to represent datapoints as vectors, and how to compute $L^p$ distances between them.

What the heck is ~all this hype~ machine learning about?¶

A programmer's perspective:¶

It's a tool that we can use when we don't know how to just write the code to compute the answer.

A data scientist's perspective:¶

A way to discover underlying structure in data that allows us to:

  • Better understand the world that generated the data
  • Predict values (columns, rows, subsets thereof) that are:
    • missing (imputation), or
    • haven't happened yet (prediction, time series forecasting, etc.)

A mathematical / probabilistic perspective:¶

Given a set of data $X$ and (possibly) labels $y$ (i.e., quantities to predict), model one or more of the following:

  • $P(x)$
  • $P(y \mid x)$
  • $P(x, y)$.

Math Fact: If you know $P(x, y)$, you can compute $P(y \mid x)$ and $P(x)$:

$$P(y \mid x) = \frac{P(x, y)}{P(x)}$$ and $$P(x) = \sum_i P(x,y_i)$$
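
As a small illustration of the math fact above, here is a sketch using a made-up joint distribution over two values of $x$ and two values of $y$. The table `P_xy` and its numbers are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x, columns index y.
# The entries are made up for illustration and sum to 1.
P_xy = np.array([[0.10, 0.30],
                 [0.40, 0.20]])

# Marginal: P(x) = sum over y of P(x, y)
P_x = P_xy.sum(axis=1)

# Conditional: P(y | x) = P(x, y) / P(x), dividing each row by its P(x)
P_y_given_x = P_xy / P_x[:, None]

print(P_x)
print(P_y_given_x)
```

Each row of `P_y_given_x` sums to 1, as a conditional distribution over $y$ should.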

Machine Learning: When and Why?¶

  • We don't know the underlying process that generated the data, but we have examples of its outputs
  • Use cases: (so far) data-rich domains with problems we don't know how to solve any other way.
  • High-profile examples are from complex domains:
    • Speech recognition
    • Image recognition / object detection
    • Weather forecasting
    • Natural Language Processing
    • Self-driving cars (robotics)
    • Drug design

Side note: Machine Learning vs. Artificial Intelligence¶

  • AI is the study of making computer systems behave with "intelligence" (what does that mean?)
  • ML is generally considered a subfield of AI
    • ML has come the closest so far to exhibiting intelligent behavior
    • Actively debated: is ML the way to "general" AI?
    • In current usage, "AI" almost exclusively refers to ML-based methods

Supervised vs Unsupervised¶

Supervised Learning, by example:

  • $X$ is a collection of $n$ homes with $d$ numbers about each one (square feet, # bedrooms, ...); $\mathbf{y}$ is the list of market values of each home.

Key property: a dataset $X$ with corresponding "ground truth" labels $\mathbf{y}$ is available, and we want to be able to predict a $y$ given a new $\mathbf{x}$.

Unsupervised Learning, by example:

  • Same setup, but instead of predicting $y$ for a new $\mathbf{x}$, you want to know which of the $d$ variables most heavily influences home price.

Key property: you aren't given "the right answer" - you're looking to discover structure in the data.
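
The difference shows up directly in code: a supervised method consumes both `X` and `y`, while an unsupervised method sees only `X`. The sketch below uses made-up home data, with a least-squares linear fit standing in for "a supervised method" and the direction of greatest variance (found via SVD) standing in for "an unsupervised method". Both choices are illustrative assumptions, not the methods used later in this lecture.

```python
import numpy as np

# Made-up toy data: 5 homes, 2 features each (square feet, bedrooms).
X = np.array([[1000.0, 2.0],
              [1500.0, 3.0],
              [2000.0, 3.0],
              [2500.0, 4.0],
              [3000.0, 5.0]])
# Made-up labels (price in thousands); used only in the supervised part.
y = np.array([200.0, 290.0, 380.0, 470.0, 560.0])

# Supervised: use the labels y to learn a mapping from x to y.
A = np.column_stack([X, np.ones(len(X))])     # append an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)     # least-squares linear fit
x_new = np.array([1800.0, 3.0, 1.0])          # a new home (with intercept term)
y_pred = x_new @ w                            # predicted price for the new home

# Unsupervised: no labels at all; look for structure in X alone.
Xc = X - X.mean(axis=0)                       # center each feature
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
main_direction = Vt[0]                        # unit vector of greatest variance

print(y_pred)
print(main_direction)
```

Note that `y` never appears in the unsupervised half. The direction it finds is dominated by square footage simply because that feature has by far the largest numeric range in this toy data.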

In [1]:
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Machine Learning Hello World¶

We'll work with the classic iris dataset, which includes measurements and species labels for three different kinds of iris flowers.

In [2]:
iris = sns.load_dataset("iris")
iris
Out[2]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns

In [3]:
sns.pairplot(data=iris, hue="species")
Out[3]:
<seaborn.axisgrid.PairGrid at 0x1476524b36b0>
[Figure: pair plot of the four iris measurements, colored by species]

Extract Features¶

The vast majority of machine learning methods assume each of your input datapoints is represented by a feature vector.

If it's not, it's usually your job to make it so - this is called feature extraction.

Given a DataFrame, we can treat each row as the feature vector for the thing the row describes. Traditionally, we arrange our dataset in an $N \times D$ matrix, where each row corresponds to a datapoint and each column corresponds to a single feature (variable). This is the same layout as a pandas table.

If your input is an audio signal, a sentence of text, an image, or some other not-obviously-vector-like thing, there may be more work to do.
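
As one example of feature extraction for a not-obviously-vector-like input, here is a minimal sketch that turns a sentence into a vector of word counts (often called a bag-of-words representation). The function name `bag_of_words` and the tiny vocabulary are hypothetical, made up for this illustration. Real text pipelines handle punctuation, unknown words, and much larger vocabularies.

```python
import numpy as np

def bag_of_words(sentence, vocabulary):
    """Return a vector counting how often each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return np.array([words.count(w) for w in vocabulary])

# Hypothetical vocabulary, chosen only for illustration.
vocab = ["the", "cat", "dog", "sat"]
x = bag_of_words("The cat sat on the mat", vocab)
print(x)  # [2 1 0 1]
```

Every sentence, whatever its length, now maps to a vector with one entry per vocabulary word, so a collection of sentences can be stacked into a matrix with one row per sentence.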

Given a dataframe like the iris dataset, it's pretty easy to get to an ML-style training dataset X, y:

In [4]:
X = iris[["sepal_length", "sepal_width", "petal_length", "petal_width"]].to_numpy()
X.shape
Out[4]:
(150, 4)
In [5]:
X[:10,:]
Out[5]:
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2],
       [5.4, 3.9, 1.7, 0.4],
       [4.6, 3.4, 1.4, 0.3],
       [5. , 3.4, 1.5, 0.2],
       [4.4, 2.9, 1.4, 0.2],
       [4.9, 3.1, 1.5, 0.1]])
In [8]:
iris["species"].value_counts()
Out[8]:
species
setosa        50
versicolor    50
virginica     50
Name: count, dtype: int64
In [10]:
y = iris["species"].map({"setosa": 1, "versicolor": 2, "virginica": 3}).to_numpy()
y.shape
Out[10]:
(150,)
In [11]:
y
Out[11]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

Split the data into two sets: a training set and a test set.

In [12]:
shuffled = np.random.permutation(range(iris.shape[0]))
train_inds = shuffled[:100]
test_inds  = shuffled[100:]

Xtrain = X[train_inds,:]
Xtest  = X[test_inds, :]
ytrain = y[train_inds]
ytest  = y[test_inds]
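
Because the shuffle above is unseeded, every run produces a different split and therefore slightly different accuracy numbers. If you want a repeatable split, one option is a seeded NumPy random generator. This is a sketch only: the seed value `0` is arbitrary, and `n = 150` is hard-coded to match the number of rows in the iris dataset.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # arbitrary seed; any fixed integer works
n = 150                               # number of rows in the iris dataset

shuffled = rng.permutation(n)         # a random ordering of 0..n-1
train_inds = shuffled[:100]           # first 100 shuffled indices for training
test_inds = shuffled[100:]            # remaining 50 indices for testing
```

The resulting index arrays can be used exactly as in the cell above, for example `X[train_inds, :]` and `y[test_inds]`.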

Train a model¶

(on the training data)

In [13]:
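# Import the classifier directly; a bare "import sklearn" does not load
# the neighbors submodule in every scikit-learn version
from sklearn.neighbors import KNeighborsClassifier

# Train a model on the training features (Xtrain) and labels (ytrain)
knn = KNeighborsClassifier(n_neighbors=3).fit(Xtrain, ytrain)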

Use the Model to Make Predictions¶

(on the test data)

In [14]:
# Predict the test set label (ypred) given the features of the test set
ypred = knn.predict(Xtest)

Evaluate the Quality of the Predictions¶

In [15]:
# Compare the predictions (ypred) to the true labels for the test set (ytest)
(ytest == ypred).sum() / len(ytest)
Out[15]:
np.float64(0.96)

Let's visualize this:

In [16]:
results = pd.DataFrame(np.column_stack([Xtest, ytest, ypred, ytest == ypred]), 
             columns=list(iris.columns)+["Prediction", "Correct?"])
results
Out[16]:
sepal_length sepal_width petal_length petal_width species Prediction Correct?
0 6.3 2.9 5.6 1.8 3.0 3.0 1.0
1 4.8 3.4 1.6 0.2 1.0 1.0 1.0
2 6.9 3.1 4.9 1.5 2.0 2.0 1.0
3 6.3 2.7 4.9 1.8 3.0 3.0 1.0
4 7.3 2.9 6.3 1.8 3.0 3.0 1.0
5 4.5 2.3 1.3 0.3 1.0 1.0 1.0
6 5.8 2.8 5.1 2.4 3.0 3.0 1.0
7 5.2 2.7 3.9 1.4 2.0 2.0 1.0
8 4.4 3.2 1.3 0.2 1.0 1.0 1.0
9 5.9 3.0 5.1 1.8 3.0 3.0 1.0
10 6.3 3.4 5.6 2.4 3.0 3.0 1.0
11 4.6 3.6 1.0 0.2 1.0 1.0 1.0
12 6.1 3.0 4.6 1.4 2.0 2.0 1.0
13 5.6 2.8 4.9 2.0 3.0 3.0 1.0
14 6.2 2.2 4.5 1.5 2.0 2.0 1.0
15 4.9 3.1 1.5 0.1 1.0 1.0 1.0
16 6.7 3.3 5.7 2.5 3.0 3.0 1.0
17 6.9 3.1 5.4 2.1 3.0 3.0 1.0
18 5.2 3.4 1.4 0.2 1.0 1.0 1.0
19 4.9 3.6 1.4 0.1 1.0 1.0 1.0
20 5.4 3.9 1.3 0.4 1.0 1.0 1.0
21 6.7 3.3 5.7 2.1 3.0 3.0 1.0
22 5.0 3.6 1.4 0.2 1.0 1.0 1.0
23 6.6 2.9 4.6 1.3 2.0 2.0 1.0
24 5.1 3.5 1.4 0.2 1.0 1.0 1.0
25 6.2 3.4 5.4 2.3 3.0 3.0 1.0
26 6.1 2.9 4.7 1.4 2.0 2.0 1.0
27 5.5 2.5 4.0 1.3 2.0 2.0 1.0
28 4.9 3.1 1.5 0.2 1.0 1.0 1.0
29 6.5 3.2 5.1 2.0 3.0 3.0 1.0
30 7.7 2.8 6.7 2.0 3.0 3.0 1.0
31 5.2 4.1 1.5 0.1 1.0 1.0 1.0
32 5.6 2.9 3.6 1.3 2.0 2.0 1.0
33 6.3 3.3 4.7 1.6 2.0 2.0 1.0
34 7.7 3.8 6.7 2.2 3.0 3.0 1.0
35 4.9 2.5 4.5 1.7 3.0 2.0 0.0
36 6.1 2.6 5.6 1.4 3.0 3.0 1.0
37 6.7 3.0 5.0 1.7 2.0 2.0 1.0
38 5.9 3.0 4.2 1.5 2.0 2.0 1.0
39 6.7 3.1 4.7 1.5 2.0 2.0 1.0
40 5.1 3.8 1.9 0.4 1.0 1.0 1.0
41 5.0 3.4 1.5 0.2 1.0 1.0 1.0
42 6.3 2.8 5.1 1.5 3.0 2.0 0.0
43 5.4 3.4 1.5 0.4 1.0 1.0 1.0
44 5.4 3.4 1.7 0.2 1.0 1.0 1.0
45 6.0 3.0 4.8 1.8 3.0 3.0 1.0
46 7.4 2.8 6.1 1.9 3.0 3.0 1.0
47 6.9 3.1 5.1 2.3 3.0 3.0 1.0
48 5.7 2.6 3.5 1.0 2.0 2.0 1.0
49 5.0 3.5 1.6 0.6 1.0 1.0 1.0
In [17]:
palette = {
    1.0: "#4C72B0",  # setosa    → blue
    2.0: "#DD8452",  # versicolor → orange
    3.0: "#55A868",  # virginica  → green
}

sns.scatterplot(results, x="sepal_length", y="petal_length", hue="species", palette=palette, style="Correct?")
Out[17]:
<Axes: xlabel='sepal_length', ylabel='petal_length'>
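[Figure: scatter plot of sepal_length vs. petal_length for the test set, colored by species, with marker style showing whether each prediction was correct]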
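
A single accuracy number hides which classes get confused with each other. One common way to see that is a confusion matrix, which counts how often each true label received each predicted label. The sketch below uses small made-up label arrays (not the results from this notebook) and assumes the same 1/2/3 label encoding used above.

```python
import numpy as np

# Made-up labels for illustration; these are NOT the notebook's actual results.
ytest = np.array([1, 1, 2, 2, 3, 3, 3])
ypred = np.array([1, 1, 2, 3, 3, 3, 2])

# Overall accuracy: fraction of predictions that match the true label
accuracy = (ytest == ypred).mean()

# confusion[i, j] = number of points with true label i+1 predicted as label j+1
confusion = np.zeros((3, 3), dtype=int)
for true_label, pred_label in zip(ytest, ypred):
    confusion[true_label - 1, pred_label - 1] += 1

print(accuracy)
print(confusion)
```

Correct predictions land on the diagonal; off-diagonal entries show which pairs of classes the model mixes up.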

Distance Metrics¶

We used a $k$-nearest-neighbors classifier, which assigns a new datapoint the most common label among the $k$ nearest datapoints in the training set.

An important question here is: what do we mean by nearest?

This is a fundamental primitive that many, if not all, ML methods rely on: the ability to compare datapoints and determine how similar or different they are.
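
To make the role of "nearest" concrete, here is a minimal from-scratch sketch of the nearest-neighbor voting rule. It assumes ordinary Euclidean distance and breaks ties in favor of the smallest label. The function name `knn_predict` and the toy data are hypothetical, and this is not how scikit-learn implements its classifier internally.

```python
import numpy as np

def knn_predict(Xtrain, ytrain, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    dists = np.sqrt(((Xtrain - x) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(dists)[:k]
    # Count the labels among those neighbors and return the most common one
    # (np.unique sorts labels, so ties go to the smallest label)
    labels, counts = np.unique(ytrain[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2D training set, made up for illustration: two well-separated clusters.
Xtrain = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
ytrain = np.array([1, 1, 1, 2, 2, 2])

print(knn_predict(Xtrain, ytrain, np.array([0.15, 0.15])))  # 1
print(knn_predict(Xtrain, ytrain, np.array([0.95, 1.0])))   # 2
```

The only place the notion of distance enters is the line that computes `dists`; swapping in a different distance function changes which points count as "nearest" and can change the prediction.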

$L^p$ Distances¶

A common family of distance metrics is the $L^p$ distance:

$$d_p(a, b) = \sqrt[p]{\sum_{i=1}^d |a_i - b_i|^p}$$

When $p = 2$, this is the Euclidean distance we're all used to, based on the Pythagorean theorem; in 2D, it reduces to: $$\sqrt{(b_x - a_x)^2 + (b_y - a_y)^2}$$

Different values of $p$ give different behavior:

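  • For smaller $p$, we care less about how different the per-dimension differences are from each other: every dimension's difference contributes to the total, even when one is much larger than the rest.
  • For larger $p$, we care more about how different the per-dimension differences are from each other: the largest per-dimension difference increasingly dominates the total.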
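
As a small worked example (numbers chosen only for illustration), compare two pairs of points whose per-dimension differences are $(1, 1)$ and $(2, 0)$. Both have the same total difference, but one spreads it evenly across dimensions and the other concentrates it in a single dimension:

$$p = 1: \quad 1 + 1 = 2 \quad \text{vs.} \quad 2 + 0 = 2$$

$$p = 2: \quad \sqrt{1^2 + 1^2} \approx 1.41 \quad \text{vs.} \quad \sqrt{2^2 + 0^2} = 2$$

$$p = 0.5: \quad \left(\sqrt{1} + \sqrt{1}\right)^2 = 4 \quad \text{vs.} \quad \left(\sqrt{2} + \sqrt{0}\right)^2 = 2$$

With $p = 1$ the two pairs are equally far apart; with $p = 2$ the concentrated difference counts as farther; with $p = 0.5$ the evenly spread difference counts as farther.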

Exercise 1: Write a function to compute $L^p$ distance between two vectors:

In [ ]:
def L(p, a, b):
    """ Compute the L^p distance between vectors a and b
    Precondition: p > 0 and a, b are d-dimensional 1d arrays """
    

The following visualizes the $L^p$ distance from the origin of points in a square with side length 2 centered at the origin. To help with visualization, the colormap is set up so that:

  • Points a distance of 1 from the origin (i.e., on the unit circle) appear white.
  • Points a distance less than 1 (i.e., inside the unit circle) appear blue, and
  • Points a distance greater than 1 (i.e., outside the unit circle) appear red
In [28]:
def plot_lp_distance(p):
    """ Visualize the L^p distance from the origin. """
    N = 401
    x = np.linspace(-1, 1, N)
    y = np.linspace(-1, 1, N)
    X, Y = np.meshgrid(x, y)
    Z = (np.abs(X)**p + np.abs(Y)**p) ** (1.0 / p)
    plt.imshow(Z-1, extent=[-1,1,-1,1], origin='lower', cmap='seismic', vmin=-1, vmax=1)
    plt.colorbar()

plot_lp_distance(0.5)
[Figure: heatmap of the $L^p$ distance from the origin for $p = 0.5$]

Exercise 2: Run the above cell using the following values of $p$. For each one, describe the shape of the "unit circle".

  • $p = 2$
  • $p = 1$
  • $p = 4$
  • What is the shape of the unit circle in the limit as $p \rightarrow \infty$? How could you intuitively interpret this distance in terms of the values of $x$ and $y$?
  • $p = 0.5$
  • What is the shape of the unit circle in the limit as $p \rightarrow 0$? How could you intuitively interpret this distance in terms of the values of $x$ and $y$?

More Distance Metrics¶

Hamming Distance¶

For vectors of categorical values, Hamming distance is the number of dimensions in which two vectors differ: $$d(a, b) = \sum_i \mathbb{1}(a_i \ne b_i)$$ where $\mathbb{1}(\cdot)$ is an indicator function that has value 1 if its argument is true and 0 otherwise.
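
A minimal sketch of this formula in NumPy, assuming both inputs have the same length. The function name `hamming` and the example feature values are made up for illustration.

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which a and b differ."""
    a, b = np.asarray(a), np.asarray(b)
    # (a != b) is a boolean array; summing it counts the True entries
    return int(np.sum(a != b))

# Two made-up categorical vectors that differ only in the second position
print(hamming(["red", "small", "round"], ["red", "large", "round"]))  # 1
```

The elementwise comparison `a != b` plays the role of the indicator function in the formula above.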

Cosine Similarity¶

A similarity (not distance) metric that considers only vector direction, not magnitude:

$$ sim(a, b) = \cos \theta = \frac{a^Tb}{\sqrt{(a^Ta)(b^Tb)}}$$
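
The formula translates almost directly into NumPy. This is a sketch that assumes neither vector is all zeros, since the similarity is undefined in that case; the function name `cosine_similarity` is chosen here for illustration.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both assumed nonzero)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # a @ b is the dot product a^T b
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

print(cosine_similarity([1, 2], [2, 4]))   # same direction, different magnitude
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular vectors
```

Vectors pointing in the same direction score 1 regardless of their lengths, and perpendicular vectors score 0.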
