Lecture 21 - Classification and Regression; Linear Algebra

Announcements:

Goals

Prediction Tasks: Classification and Regression

So far we've mainly talked about exploratory analysis and descriptive techniques: looking for what is apparent in the data we already have. Often, though, it's useful to take the insights from that data and make predictions about other data.

Generally, it can be useful to predict some quantity or property that cannot be measured (or cannot be measured ahead of time):

  1. What is the current market value of a house?
  2. Is a given email spam?
  3. What percent of the vote will a given candidate get in an election?
  4. How will a given person vote in an election?
  5. Will someone who clicks on an ad buy a product?
  6. What will be the market value of a stock one hour from now?
  7. How many customers will eat at a restaurant next week?
  8. Is a given sequence of requests to a website coming from a real human, or a DATA 311 student's Lab 5 (or other bot)?
  9. How could you most accurately fill in the NaN length measurements for each penguin?

Most prediction tasks fall into one of two categories: classification or regression. The basic distinction is whether you're trying to predict a discrete categorical property, or a continuous numerical property.
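
To make the distinction concrete, here is a minimal sketch contrasting the two kinds of output. It is purely illustrative: the scikit-learn calls and the made-up penguin-like numbers are my own choices, not something we need yet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up data: body mass (kg) as the single input variable.
body_mass_kg = np.array([[3.5], [3.8], [4.2], [5.0]])

# Regression: the thing being predicted is a continuous number (flipper length, mm).
flipper_mm = np.array([190.0, 195.0, 210.0, 220.0])
reg = LinearRegression().fit(body_mass_kg, flipper_mm)
print(reg.predict(np.array([[4.5]])))   # prints a number

# Classification: the thing being predicted is a discrete category (species).
species = np.array(["Adelie", "Adelie", "Gentoo", "Gentoo"])
clf = LogisticRegression().fit(body_mass_kg, species)
print(clf.predict(np.array([[4.5]])))   # prints a label
```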

Exercise: Classify each of the above 9 prediction problems by determining whether it is a classification problem or a regression problem.

Exercise: Consider flipper length and body mass. Suppose it's easier to put a penguin on a scale than to pin it down and measure its flipper with a measuring tape. Come up with a scheme to predict flipper length given only body mass. Note: no fancy "linear regression" allowed - tell me a scheme in terms an 8th grader could understand!

Exercise: Suppose we want to predict a penguin's species based on its body mass and flipper length. Describe a scheme for doing this.

Exercise: Suppose we want to predict a penguin's species based on its bill length and depth. Describe a scheme for doing this.

Linear Algebra

Any classification or regression problem can be cast in terms very similar to those above. The number of variables you're basing your prediction on may change, but that doesn't fundamentally change the problem.

While there are many schemes for classification and regression, many of the most commonly used ones are built on top of linear models.

It may not seem so at first, but the natural language to talk about these models (and most of the fancier ones built on top of them) is linear algebra, because linear functions are very naturally represented using matrices.
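
As a preview of that connection, here is a small sketch (assuming numpy; the particular matrix is arbitrary) of a linear function represented as a matrix, where applying the function is just matrix-vector multiplication:

```python
import numpy as np

# A linear function from R^3 to R^2, encoded as a 2x3 matrix.
A = np.array([[2.0, 0.0,  1.0],
              [0.0, 3.0, -1.0]])

def f(x):
    return A @ x          # applying the function = matrix-vector product

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, -1.0])

# The defining property of linearity: f(a*x + b*y) == a*f(x) + b*f(y).
print(np.allclose(f(2 * x + 5 * y), 2 * f(x) + 5 * f(y)))   # True
```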

Linear Algebra - The Basics

Exercise: Compute the dot product: $$\begin{bmatrix}1 \\ 2 \\ 3\end{bmatrix} \cdot \begin{bmatrix}6 \\ 3 \\ 0\end{bmatrix}$$
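
If you'd like to check your arithmetic afterwards, one way (assuming numpy is available; not needed for the exercise itself) is:

```python
import numpy as np

# The two vectors from the exercise.
print(np.dot(np.array([1, 2, 3]), np.array([6, 3, 0])))
```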

Exercise: Given a vector $v$, write an expression that represents the sum of the squares of its elements.

Matrices

Exercise: In non-mathematical terms, what does the following matrix do when multiplied by a given 3-vector? $$\begin{bmatrix}0 & 0 & 1\\ 0 & 1 & 0 \\ 1 & 0 & 0\end{bmatrix}$$
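
If you'd rather experiment than reason it out on paper, the sketch below (assuming numpy; the test vector is arbitrary) applies the matrix to a vector so you can see the effect for yourself:

```python
import numpy as np

M = np.array([[0, 0, 1],
              [0, 1, 0],
              [1, 0, 0]])
v = np.array([10, 20, 30])
print(M @ v)   # compare the order of the entries to v
```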

Exercise: For each of the following, say whether there's a dimension mismatch; if not, give the dimensions of the result.

Exercise: Write the dot product of two column vectors $v_1$ and $v_2$ as a matrix multiplication.

Exercise: In non-mathematical terms, what does the following matrix do when multiplied by a given 3-vector? $$\begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0 \\ 0 & 0 & 1\end{bmatrix}$$

Square matrices only:

Exercise: Find a way to rewrite the following regression model using matrix notation: $$ y = c_0 + c_1 x_1 + \ldots + c_d x_d$$
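
One common convention you can compare your answer to afterwards (a sketch; the vector names here are my own): collect the coefficients into a vector and prepend a 1 to the inputs so the intercept $c_0$ is absorbed into the same product: $$ y = \begin{bmatrix}c_0 & c_1 & \cdots & c_d\end{bmatrix} \begin{bmatrix}1 \\ x_1 \\ \vdots \\ x_d\end{bmatrix} = \mathbf{c}^{\mathsf{T}} \mathbf{x}. $$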