Lecture 15 - Machine Learning: What, Why, and an Example¶

Announcements:¶

  • Lab 5: see the new subsection on scraping the-numbers
  • Data ethics 2 due Wednesday!

Goals:¶

  • See a few different perspectives on "what is machine learning?"
    • CS/programmer's perspective
    • DS/data wrangler's perspective
    • Stats/mathematician's perspective
  • Know the basic mathematical setup of most machine learning methods.
  • Understand the meaning of and distinction between classification and regression.
  • Be able to describe the overall machine learning framework and steps to train a model h(x).
  • Be able to give some example use cases where machine learning is the best, if not only practical, way to solve a computing problem.
  • Be able to define model class, loss, and empirical risk, and explain how they relate to each other.

What the heck is ~all this hype~ machine learning about?¶

A programmer's perspective:¶

It's a tool that we can use when we don't know how to just write the code to compute the answer.

A data scientist's perspective:¶

  • A way to predict values (columns, rows, subsets thereof) that we don't have, based on the data we do have.
    • Fill in missing data
    • Forecast future time series data
    • Discover hidden structure in data

A mathematical / probabilistic perspective:¶

Given a set of data $X$ and (possibly) labels $y$ (i.e., quantities to predict), model either $P(y | x)$ or $P(x, y)$.

Recall: If you know $P(x, y)$, you can compute $P(y | x)$:

$$P(y | x) = \frac{P(x, y)}{P(x)}$$

and $$P(x) = \sum_i P(x,y_i)$$
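As a minimal sketch of this computation, here is a small, made-up joint distribution table (the numbers are purely illustrative, not from any dataset):
In [ ]:
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x values, columns index y values
P_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

P_x = P_xy.sum(axis=1)             # marginal: P(x) = sum_i P(x, y_i)
P_y_given_x = P_xy / P_x[:, None]  # conditional: P(y | x) = P(x, y) / P(x)
P_y_given_x                        # each row sums to 1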

Machine Learning: When and Why?¶

  • We don't know the underlying process that generated the data, but we have examples of its outputs
  • Use cases (so far): data-rich domains with problems we don't know how to solve any other way.
  • High-profile examples are from complex domains:
    • Speech recognition
    • Image recognition / object detection
    • Weather forecasting
    • Self-driving cars (robotics)
    • Drug design

Side note: Machine Learning vs. Artificial Intelligence¶

  • AI is the study of making computer systems behave with "intelligence" (what does that mean?)
  • ML is generally considered a subfield of AI
    • ML has come the closest so far to exhibiting intelligent behavior
    • Actively debated: is ML the way to "general" AI?

A Guided Example¶

  • Start with a dataset to learn from (the training set).
    • Inputs / features / independent variable: per capita income
    • Labels / targets / dependent variable: crime rate
In [1]:
import pandas as pd
import seaborn as sns
In [2]:
df = pd.DataFrame({
  "Income": [0.49, 0.18, 0.31, 0.40, 0.24],
  "CrimeRate": [0.09, 0.45, 0.23, 0.19, 0.48]
})
In [3]:
sns.scatterplot(data=df,x="Income",y="CrimeRate")
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5664044f0>

Classification vs. Regression¶

Outputs are real-valued (continuous): this is a regression task.

Predicting categorical outputs is called a classification task.

Ok, how do we do this?¶

Overview: three steps:

  1. Decide on a model class
  2. Decide on a formal definition of how well a model fits the training data.
  3. Optimize (search for the best model in the model class) - aka train, or learn

Step 1: Decide on a model class¶

  • There are a ton of options; since this looks vaguely linear, let's keep it simple (a one-line code sketch follows this list):
    • Model class: linear functions; i.e., $h(x) = mx + b$
      • $x$ is the input
      • $m$ (slope) and $b$ (y-intercept) are the parameters of the model.
      • Equivalent terminology: $m$ and $b$ are weights, or $m$ is a weight and $b$ is a bias.
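Here is that model class as code. The particular values of m and b below are arbitrary; training (Step 3) is what will pick good ones:
In [ ]:
def h(x, m, b):
    # Linear model: prediction for input x given slope m and intercept b
    return m * x + b

h(0.3, m=-1.0, b=0.5)  # one (probably bad) model's prediction for income 0.3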

Step 2: Decide on a formal definition of how well a model fits the training data.¶

  • Break this into two substeps:
    1. Define a loss function that indicates how poorly the model does on a single datapoint. Here, we choose to use squared error loss (lots of alternative options):
    $$ L(h(x),y) = (h(x)-y)^2$$
    • Loss functions are typically 0 when the prediction is correct
    • Loss functions grow as the prediction is more wrong
    • Squared error loss is popular because it is (relatively) easy to optimize
    2. Compute the *empirical risk* (i.e., average loss over the training set)
    $$ \hat{R}(h; {\cal X}) = \frac{1}{N} \sum_{i=1}^N L(h(x_{(i)}),y_{(i)})$$
    • $h$ is the model
    • ${\cal X}$ is the training set
    • $\sum$ is adding up losses for datapoints $1,2,\dots,N$
    • Note that ${\cal X}$ is fixed -- this is a function of $h$, which in our case means it is a function of $m$ and $b$. (Both substeps are sketched in code below.)
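Both substeps translate directly into code against the df and h defined above (a sketch; the parameter values are arbitrary guesses, not the trained ones):
In [ ]:
def squared_error(prediction, y):
    # Loss for a single datapoint: how poorly the prediction matches the true label y
    return (prediction - y) ** 2

def empirical_risk(m, b, X, y):
    # Average loss of the linear model h(x) = m*x + b over the training set
    losses = [squared_error(h(x_i, m, b), y_i) for x_i, y_i in zip(X, y)]
    return sum(losses) / len(losses)

empirical_risk(-1.0, 0.5, df["Income"], df["CrimeRate"])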

Step 3: Optimize¶

Training means solving the following optimization problem: $$ h^* = \arg\min_{h} \hat{R}(h; {\cal X})$$ Let's break this down:

  • $h^*$ is the optimal model (i.e., the line with the best $m$ and $b$)
  • $\arg\min_h$ means to find and return the $h$ that gives the smallest value of the function being minimized
  • $\hat{R}(h; {\cal X})$ is the function being minimized

Solving this problem requires math we won't assume in this class, so let's jump to the solution: $$ h^*(x) \approx -1.29x + 0.71$$

Plot of solution
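For this particular model class and loss we don't have to do the optimization by hand: numpy's least-squares polynomial fit minimizes exactly this squared-error objective, so a quick sketch can confirm the numbers above:
In [ ]:
import numpy as np

# Fit a degree-1 polynomial (a line) by least squares over the training set
m_star, b_star = np.polyfit(df["Income"], df["CrimeRate"], deg=1)
print(m_star, b_star)  # approximately -1.29 and 0.71
print(empirical_risk(m_star, b_star, df["Income"], df["CrimeRate"]))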

Model Selection - Revisited¶

In [4]:
sns.scatterplot(data=df,x="Income",y="CrimeRate")
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5662f2e80>

We limited our model class to linear functions. What other choices might we make?

  • Quadratic?
  • Piecewise linear?
  • Higher-order polynomial?
In [ ]:
 

What effect does this choice have on the optimal value of $\hat{R}(h; \mathcal{X})$?

What would you choose if you needed to deploy a model to make predictions on unseen datapoints?
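One way to get a feel for both questions is a sketch that fits polynomials of increasing degree (using np.polyfit and np.polyval) and compares their training-set risk; with only five training points, a degree-4 polynomial has enough parameters to pass through all of them, driving the empirical risk to essentially zero, which is precisely the worry for unseen data:
In [ ]:
for degree in [1, 2, 3, 4]:
    # Higher-degree model classes can only fit the training set better (or equally well)
    coeffs = np.polyfit(df["Income"], df["CrimeRate"], deg=degree)
    predictions = np.polyval(coeffs, df["Income"])
    risk = ((predictions - df["CrimeRate"]) ** 2).mean()
    print(f"degree {degree}: training risk = {risk:.6f}")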

Definition Recap / Exit Ticket:¶

  • Regression
  • Classification
  • Model Class
  • Loss function
  • Empirical Risk

Taxonomic Considerations¶

Classification vs Regression¶

You know this one!

Supervised vs Unsupervised¶

Supervised Learning, by example:

  • $X$ is a collection of $n$ houses with $d$ numbers about them (square feet, # bedrooms, ...); $\mathbf{y}$ is a vector of their market values.

Key property: a dataset $X$ with corresponding "ground truth" labels $\mathbf{y}$ is available, and we want to be able to predict a $y$ given a new $\mathbf{x}$.

Unsupervised Learning, by example:

  • Same setup, but instead of predicting $y$ for a new $\mathbf{x}$, you want to know which of the $d$ variables are most predictive.

Key property: you aren't given "the right answer" - you're looking to discover structure in the data.

Discriminative vs Generative¶

For the moment, let's focus on supervised classification. In this case, $\mathbf{x}$ is a vector and $y$ is a discrete label indicating one of a set of categories or classes.

Discriminative models try to estimate $P(y | \mathbf{x})$. If you know this, then pick the most likely label, i.e., the $y$ that maximizes $P(y | \mathbf{x})$, and that's your predicted label for $\mathbf{x}$.

  • Example: image classification

Generative models try to estimate $P(\mathbf{x})$ (if there are no labels $y$), or $P(\mathbf{x}, y)$ if labels exist.

  • Example: image generation

Note: it's easy to conclude that classification and discriminative are the same thing; they're not! Classifiers are often discriminative, but not always. How does that work?

Consider two schemes for classifying penguin species given their bill length and depth.

Exercise: Which of these is generative and which is discriminative?

  1. Draw lines through the space and classify based on which area the penguin falls in (figure: penguins2.png).
  2. Fit 2D Gaussian distributions to the three species, and classify based on which Gaussian says a point is most probable (figure: penguins2.png). A code sketch of this scheme follows below.
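A sketch of scheme (2), the generative approach, assuming seaborn's built-in penguins dataset (columns bill_length_mm, bill_depth_mm, species). A full generative classifier would also weight each Gaussian by the prior probability of each species; that is skipped here to keep the sketch short:
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

penguins = sns.load_dataset("penguins").dropna(subset=["bill_length_mm", "bill_depth_mm"])
features = ["bill_length_mm", "bill_depth_mm"]

# Fit one 2D Gaussian per species: an estimate of P(x | species)
gaussians = {}
for species, group in penguins.groupby("species"):
    X = group[features].to_numpy()
    gaussians[species] = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

def classify(bill_length, bill_depth):
    # Predict the species whose Gaussian assigns this point the highest density
    point = np.array([bill_length, bill_depth])
    return max(gaussians, key=lambda s: gaussians[s].pdf(point))

classify(45.0, 17.0)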

Exercises:¶

  1. Automatically detecting outlier images - like lab 4, except fully automatic.

    • Classification, regression, or neither?
    • Supervised or unsupervised?
  2. Using linear regression (i.e., a best-fit line) to predict home value based on square feet.

    • Classification, regression, or neither?
    • Supervised or unsupervised?
  3. Finding "communities" of people who frequently interact with each other on a social network.

    • Classification, regression, or neither?
    • Supervised or unsupervised?

Assumptions¶

Data distribution¶

Generally: unseen data is drawn from the same distribution as your dataset.

Consequence: We don't assume correlation is causation, but we do assume that observed correlations will hold in unseen data.

Specific models: many more! Example:

Linear regression assumes:

  • Columns are linearly independent (i.e., no column can be computed directly from the others)
  • Data is homoscedastic, i.e., the spread of the noise around the trend doesn't change across the range of inputs; the sketch below generates data where this assumption fails
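Since the illustrative plot isn't reproduced here, this sketch generates heteroscedastic data (the noise grows with $x$) so you can see what a violation of the assumption looks like:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
# Heteroscedastic: noise standard deviation grows with x, violating the assumption
y = 2 * x + 1 + rng.normal(scale=0.5 * x)
sns.scatterplot(x=x, y=y)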