Lecture 15 - Machine Learning: What, Why, and an Example¶

Announcements:¶

  • Lab 5: see the new subsection on scraping the-numbers
  • Data ethics 2 due Wednesday!

Goals:¶

  • See a few different perspectives on "what is machine learning?"
    • CS/programmer's perspective
    • DS/data wrangler's perspective
    • Stats/mathematician's perspective
  • Know the basic mathematical setup of most machine learning methods.
  • Understand the meaning of and distinction between classification and regression.
  • Be able to describe the overall machine learning framework and steps to train a model h(x).
  • Be able to give some example use cases where machine learning is the best, if not only practical, way to solve a computing problem.
  • Be able to define model class, loss, and empirical risk, and explain how they relate to each other.

What the heck is ~all this hype~ machine learning about?¶

A programmer's perspective:¶

It's a tool that we can use when we don't know how to just write the code to compute the answer.

A data scientist's perspective:¶

  • A way to predict values (columns, rows, subsets thereof) that we don't have, based on the data we do have.
    • Fill in missing data
    • Forecast future time series data
    • Discover hidden structure in data

A mathematical / probabilistic perspective:¶

Given a set of data $X$ and (possibly) labels $y$ (i.e., quantities to predict), model either $P(y | x)$ or $P(x, y)$.

Recall: If you know $P(x, y)$, you can compute $P(y | x)$:

$$P(y | x) = \frac{P(x, y)}{P(x)}$$

and $$P(x) = \sum_i P(x,y_i)$$
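As a minimal sketch of this computation, here is a small, made-up joint distribution table (the numbers are purely illustrative, not from any dataset):
In [ ]:
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x values, columns index y values
P_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])

P_x = P_xy.sum(axis=1)             # marginal: P(x) = sum_i P(x, y_i)
P_y_given_x = P_xy / P_x[:, None]  # conditional: P(y | x) = P(x, y) / P(x)
P_y_given_x                        # each row sums to 1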

Machine Learning: When and Why?¶

  • We don't know the underlying process that generated the data, but we have examples of its outputs
  • Use cases (so far): data-rich domains with problems we don't know how to solve any other way.
  • High-profile examples are from complex domains:
    • Speech recognition
    • Image recognition / object detection
    • Weather forecasting
    • Self-driving cars (robotics)
    • Drug design

Side note: Machine Learning vs. Artificial Intelligence¶

  • AI is the study of making computer systems behave with "intelligence" (what does that mean?)
  • ML is generally considered a subfield of AI
    • ML has come the closest so far to exhibiting intelligent behavior
    • Actively debated: is ML the way to "general" AI?

A Guided Example¶

  • Start with a dataset to learn from (the training set).
    • Inputs / features / independent variable: per capita income
    • Labels / targets / dependent variable: crime rate
In [1]:
import pandas as pd
import seaborn as sns
In [2]:
df = pd.DataFrame({
  "Income": [0.49, 0.18, 0.31, 0.40, 0.24],
  "CrimeRate": [0.09, 0.45, 0.23, 0.19, 0.48]
})
In [3]:
sns.scatterplot(data=df,x="Income",y="CrimeRate")
Out[3]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5664044f0>

Classification vs. Regression¶

Outputs are real-valued (continuous): this is a regression task.

Predicting categorical outputs is called a classification task.

Ok, how do we do this?¶

Overview: three steps:

  1. Decide on a model class
  2. Decide on a formal definition of how well a model fits the training data.
  3. Optimize (search for the best model in the model class) - aka train, or learn

Step 1: Decide on a model class¶

  • There are a ton of options; since this looks vaguely linear, let's keep it simple (a one-line code sketch follows this list):
    • Model class: linear functions; i.e., $h(x) = mx + b$
      • $x$ is the input
      • $m$ (slope) and $b$ (y-intercept) are the parameters of the model.
      • Equivalent terminology: $m$ and $b$ are weights, or $m$ is a weight and $b$ is a bias.
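Here is that model class as code. The particular values of m and b below are arbitrary; training (Step 3) is what will pick good ones:
In [ ]:
def h(x, m, b):
    # Linear model: prediction for input x given slope m and intercept b
    return m * x + b

h(0.3, m=-1.0, b=0.5)  # one (probably bad) model's prediction for income 0.3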

Step 2: Decide on a formal definition of how well a model fits the training data.¶

  • Break this into two substeps:
    1. Define a loss function that indicates how poorly the model does on a single datapoint. Here, we choose to use squared error loss (lots of alternative options):
    $$ L(h(x),y) = (h(x)-y)^2$$
    • Loss functions are typically 0 when the prediction is correct
    • Loss functions grow as the prediction is more wrong
    • Squared error loss is popular because it is (relatively) easy to optimize
    2. Compute the *empirical risk* (i.e., average loss over the training set)
    $$ \hat{R}(h; {\cal X}) = \frac{1}{N} \sum_{i=1}^N L(h(x_{(i)}),y_{(i)})$$
    • $h$ is the model
    • ${\cal X}$ is the training set
    • $\sum$ is adding up losses for datapoints $1,2,\dots,N$
    • Note that ${\cal X}$ is fixed -- this is a function of $h$, which in our case means it is a function of $m$ and $b$. (Both substeps are sketched in code below.)
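Both substeps translate directly into code against the df and h defined above (a sketch; the parameter values are arbitrary guesses, not the trained ones):
In [ ]:
def squared_error(prediction, y):
    # Loss for a single datapoint: how poorly the prediction matches the true label y
    return (prediction - y) ** 2

def empirical_risk(m, b, X, y):
    # Average loss of the linear model h(x) = m*x + b over the training set
    losses = [squared_error(h(x_i, m, b), y_i) for x_i, y_i in zip(X, y)]
    return sum(losses) / len(losses)

empirical_risk(-1.0, 0.5, df["Income"], df["CrimeRate"])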

Step 3: Optimize¶

Training means solving the following optimization problem: $$ h^* = \arg\min_{h} \hat{R}(h; {\cal X})$$ Let's break this down:

  • $h^*$ is the optimal model (i.e., the line with the best $m$ and $b$)
  • $\arg\min_h$ means to find and return the $h$ that gives the smallest value of the function being minimized
  • $\hat{R}(h; {\cal X})$ is the function being minimized

Solving this problem requires math we won't assume in this class, so let's jump to the solution: $$ h^*(x) \approx -1.29x + 0.71$$

Plot of solution
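For this particular model class and loss we don't have to do the optimization by hand: numpy's least-squares polynomial fit minimizes exactly this squared-error objective, so a quick sketch can confirm the numbers above:
In [ ]:
import numpy as np

# Fit a degree-1 polynomial (a line) by least squares over the training set
m_star, b_star = np.polyfit(df["Income"], df["CrimeRate"], deg=1)
print(m_star, b_star)  # approximately -1.29 and 0.71
print(empirical_risk(m_star, b_star, df["Income"], df["CrimeRate"]))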

Model Selection - Revisited¶

In [4]:
sns.scatterplot(data=df,x="Income",y="CrimeRate")
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fa5662f2e80>

We limited our model class to linear functions. What other choices might we make?

  • Quadratic?
  • Piecewise linear?
  • Higher-order polynomial?
In [ ]:
 

What effect does this choice have on the optimal value of $\hat{R}(h; \mathcal{X})$?

What would you choose if you needed to deploy a model to make predictions on unseen datapoints?
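One way to get a feel for both questions is a sketch that fits polynomials of increasing degree (using np.polyfit and np.polyval) and compares their training-set risk; with only five training points, a degree-4 polynomial has enough parameters to pass through all of them, driving the empirical risk to essentially zero, which is precisely the worry for unseen data:
In [ ]:
for degree in [1, 2, 3, 4]:
    # Higher-degree model classes can only fit the training set better (or equally well)
    coeffs = np.polyfit(df["Income"], df["CrimeRate"], deg=degree)
    predictions = np.polyval(coeffs, df["Income"])
    risk = ((predictions - df["CrimeRate"]) ** 2).mean()
    print(f"degree {degree}: training risk = {risk:.6f}")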

Definition Recap / Exit Ticket:¶

  • Regression
  • Classification
  • Model Class
  • Loss function
  • Empirical Risk

Taxonomic Considerations¶

Classification vs Regression¶

You know this one!

Supervised vs Unsupervised¶

Supervised Learning, by example:

  • $X$ is a collection of $n$ houses with $d$ numbers about them (square feet, # bedrooms, ...); $\mathbf{y}$ is a vector of their market values.

Key property: a dataset $X$ with corresponding "ground truth" labels $\mathbf{y}$ is available, and we want to be able to predict a $y$ given a new $\mathbf{x}$.

Unsupervised Learning, by example:

  • Same setup, but instead of predicting $y$ for a new $\mathbf{x}$, you want to know which of the $d$ variables are most predictive.

Key property: you aren't given "the right answer" - you're looking to discover structure in the data.

Discriminative vs Generative¶

For the moment, let's focus on supervised classification. In this case, $\mathbf{x}$ is a vector and $y$ is a discrete label indicating one of a set of categories or classes.

Discriminative models try to estimate $P(y | \mathbf{x})$. If you know this, then pick the most likely label, i.e., the $y$ that maximizes $P(y | \mathbf{x})$, and that's your predicted label for $\mathbf{x}$.

  • Example: image classification

Generative models try to estimate $P(\mathbf{x})$ (if there are no labels $y$), or $P(\mathbf{x}, y)$ if labels exist.

  • Example: image generation

Note: it's easy to conclude that classification and discriminative are the same thing; they're not! Classifiers are often discriminative, but not always. How does that work?

Consider two schemes for classifying penguin species given their bill length and depth.

Exercise: Which of these is generative and which is discriminative?

  1. Draw lines through the space and classify based on which area the penguin falls in (figure: penguins2.png).
  2. Fit 2D Gaussian distributions to the three species, and classify based on which Gaussian says a point is most probable (figure: penguins2.png). A code sketch of this scheme follows below.
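A sketch of scheme (2), the generative approach, assuming seaborn's built-in penguins dataset (columns bill_length_mm, bill_depth_mm, species). A full generative classifier would also weight each Gaussian by the prior probability of each species; that is skipped here to keep the sketch short:
In [ ]:
import numpy as np
from scipy.stats import multivariate_normal

penguins = sns.load_dataset("penguins").dropna(subset=["bill_length_mm", "bill_depth_mm"])
features = ["bill_length_mm", "bill_depth_mm"]

# Fit one 2D Gaussian per species: an estimate of P(x | species)
gaussians = {}
for species, group in penguins.groupby("species"):
    X = group[features].to_numpy()
    gaussians[species] = multivariate_normal(mean=X.mean(axis=0), cov=np.cov(X, rowvar=False))

def classify(bill_length, bill_depth):
    # Predict the species whose Gaussian assigns this point the highest density
    point = np.array([bill_length, bill_depth])
    return max(gaussians, key=lambda s: gaussians[s].pdf(point))

classify(45.0, 17.0)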

Exercises:¶

  1. Automatically detecting outlier images - like lab 4, except fully automatic.

    • Classification, regression, or neither?
    • Supervised or unsupervised?
  2. Using linear regression (i.e., a best-fit line) to predict home value based on square feet.

    • Classification, regression, or neither?
    • Supervised or unsupervised?
  3. Finding "communities" of people who frequently interact with each other on a social network.

    • Classification, regression, or neither?
    • Supervised or unsupervised?

Assumptions¶

Data distribution¶

Generally: unseen data is drawn from the same distribution as your dataset.

Consequence: We don't assume correlation is causation, but we do assume that observed correlations will hold in unseen data.

Specific models: many more! Example:

Linear regression assumes:

  • Columns are linearly independent (i.e., no column can be computed directly from the others)
  • Data is homoscedastic, i.e., the spread of the noise around the trend doesn't change across the range of inputs; the sketch below generates data where this assumption fails
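Since the illustrative plot isn't reproduced here, this sketch generates heteroscedastic data (the noise grows with $x$) so you can see what a violation of the assumption looks like:
In [ ]:
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
# Heteroscedastic: noise standard deviation grows with x, violating the assumption
y = 2 * x + 1 + rng.normal(scale=0.5 * x)
sns.scatterplot(x=x, y=y)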