Given a set of data $X$ and (possibly) labels $y$ (i.e., quantities to predict), model either $P(y | x)$ or $P(x, y)$.
Recall: If you know $P(x, y)$, you can compute $P(y | x)$:
$$P(y | x) = \frac{P(x, y)}{P(x)}$$
and
$$P(x) = \sum_i P(x, y_i)$$
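To make this concrete, here's a minimal sketch (the joint probabilities are made up) of recovering $P(y | x)$ from a table of $P(x, y)$:

```python
import numpy as np

# Made-up joint distribution P(x, y): rows are 3 values of x,
# columns are 2 labels y; all entries sum to 1.
joint = np.array([
    [0.10, 0.30],
    [0.20, 0.05],
    [0.25, 0.10],
])

p_x = joint.sum(axis=1)             # P(x) = sum_i P(x, y_i)
p_y_given_x = joint / p_x[:, None]  # P(y | x) = P(x, y) / P(x)
print(p_y_given_x)                  # each row sums to 1
```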
Machine learning only works on vectors. If you have something else (text, images, penguins, bananas), you'll need to convert them to vectors first.
These vectors are called feature vectors; each dimension is an individual feature; in this course we've been calling these columns.
The process of going from things to feature vectors is called feature extraction.
Observation: vectors can only contain numbers, so categorical columns usually need to be either dropped or converted to numbers somehow.
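One common conversion is one-hot encoding, sketched here with a made-up table (assuming pandas):

```python
import pandas as pd

# Made-up table with one numeric and one categorical column.
df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 49.9],
    "island": ["Torgersen", "Dream", "Biscoe"],
})

# One-hot encoding: replace the categorical column with one 0/1
# feature per category.
print(pd.get_dummies(df, columns=["island"]))
```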
We'll assume that we have a collection of a bunch of feature vectors (we've been calling it a table), one for each datapoint. This is our dataset.
A collection of vectors can be packaged into a matrix. The traditional way to do this is to call it $X$ and arrange the dimensions like we're used to with our tables, so $X_{n \times d}$ contains $n$ rows, each of which is a $d$-dimensional feature vector for a datapoint.
Sometimes, we also have values for some quantity we'd like to predict, one value per datapoint in our dataset. These are called labels, and traditionally written $\mathbf{y}_{n \times 1}$, so $y_i$ is the label for the $i$th row of $X$.
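As a sketch with made-up penguin measurements, here's a table becoming $X_{n \times d}$ and $\mathbf{y}_{n \times 1}$ (assuming pandas):

```python
import pandas as pd

df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 49.9, 38.2],
    "bill_depth_mm":  [18.7, 17.9, 16.1, 18.1],
    "species":        ["Adelie", "Gentoo", "Gentoo", "Adelie"],
})

X = df[["bill_length_mm", "bill_depth_mm"]].to_numpy()  # shape (n, d) = (4, 2)
y = df["species"].to_numpy()                            # shape (n,)  = (4,)
```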
Note: The word "label" may connote that it is categorical, but this is not necessarily true.
Given $X$ alone: discover structure in the data (unsupervised learning).
Given $X$ and $\mathbf{y}$: learn to predict $y$ from $\mathbf{x}$ (supervised learning).
Examples of both settings appear below.
We generally go about this by training (i.e., fitting) a model to our dataset $X$.
Exercise: You are given a collection of images. Some of them contain cats and some do not, and you have a column in a dataframe that tells you which are which. Without going into specifics of how the steps are accomplished, tell me a story about how you can use machine learning to automatically predict whether an image has a cat or not. Your story should use the following words:
Supervised Learning, by example:
Key property: a dataset $X$ with corresponding "ground truth" labels $\mathbf{y}$ is available, and we want to be able to predict a $y$ given a new $\mathbf{x}$.
Unsupervised Learning, by example:
Key property: you aren't given "the right answer" - you're looking to discover structure in the data.
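As a sketch of discovering structure without labels (made-up data; assuming scikit-learn), clustering finds groups nobody labeled:

```python
from sklearn.cluster import KMeans

# Made-up 2-D points that form two visible groups.
X = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]

# No labels y anywhere: the algorithm discovers the groups itself.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)  # e.g., [0 0 0 1 1 1] (cluster numbering is arbitrary)
```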
For the moment, let's focus on supervised classification. In this case, $\mathbf{x}$ is a vector and $y$ is a discrete label indicating one of a set of categories or classes.
Discriminative models try to estimate $P(y | \mathbf{x})$. If you know this, then pick the most likely label, i.e., the $y$ that maximizes $P(y | \mathbf{x})$, and that's your predicted label for $\mathbf{x}$.
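For instance, logistic regression is a discriminative model: it estimates $P(y | \mathbf{x})$ directly. A minimal sketch with made-up data (assuming scikit-learn):

```python
from sklearn.linear_model import LogisticRegression

# Made-up 2-D feature vectors with binary labels.
X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y = [0, 0, 1, 1]

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[7.5, 8.5]]))  # estimated P(y | x)
print(model.predict([[7.5, 8.5]]))        # the y maximizing P(y | x)
```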
Generative models try to estimate $P(\mathbf{x})$ (if there are no labels $y$), or $P(\mathbf{x}, y)$ if labels exist.
Note: it's easy to conclude that classification and discriminative are the same thing; they're not! Classifiers are often discriminative, but not always. How does that work?
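Here's one way, as a hedged sketch (made-up data; assuming NumPy and SciPy): fit a generative model of $P(\mathbf{x}, y)$ by estimating $P(y)$ and one Gaussian for $P(\mathbf{x} | y)$ per class, then classify by picking the $y$ maximizing $P(\mathbf{x}, y) = P(\mathbf{x} | y)\,P(y)$:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up 2-D training data for two classes.
X = np.array([[1.0, 2.0], [2.0, 1.0], [1.0, 1.0],
              [8.0, 9.0], [9.0, 8.0], [9.0, 9.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit the generative pieces: P(y), and P(x | y) as one Gaussian per class.
classes = np.unique(y)
priors = {c: np.mean(y == c) for c in classes}
gaussians = {c: multivariate_normal(X[y == c].mean(axis=0),
                                    np.cov(X[y == c], rowvar=False))
             for c in classes}

def predict(x):
    # P(x, y) = P(x | y) P(y); pick the class with the largest joint.
    return max(classes, key=lambda c: gaussians[c].pdf(x) * priors[c])

print(predict([7.5, 8.0]))  # -> 1
```

It's a classifier, but nothing in it directly models $P(y | \mathbf{x})$: the prediction falls out of the joint via Bayes' rule.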
Consider two schemes for classifying penguin species given their bill length and depth.
Exercise: Which of these is generative and which is discriminative?
Automatically detecting outlier images - like lab 4, except fully automatic.
Using linear regression (i.e., a best-fit line) to predict home value based on square feet (see the sketch after this list).
Finding "communities" of people who frequently interact with each other on a social network.
Generally: unseen data is drawn from the same distribution as your dataset.
Consequence: We don't assume correlation is causation, but we do assume that observed correlations will hold in unseen data.
Specific models make many more assumptions! Example:
Linear regression assumes that the label is (approximately) a linear function of the features.
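Concretely, one common way to write that assumption (a sketch; $\mathbf{w}$, $b$, and $\epsilon_i$ are names introduced here, not from earlier in these notes):
$$y_i = \mathbf{w}^\top \mathbf{x}_i + b + \epsilon_i$$
where $\mathbf{w}$ is a weight vector, $b$ an intercept, and $\epsilon_i$ noise.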