Lecture 23 - Evaluating Classifiers¶

Announcements:¶

FP milestone feedback:

  • 7 groups have feedback
  • Remaining 4 will get feedback by the end of today

Final exam logistics:

  • 8am-10am Monday 3/13, here, on paper
  • You are allowed one double-sided sheet of handwritten notes

FP presentations tomorrow and Wednesday

  • There will be an exit ticket for credit, so please be there.
  • If we get through all presentations by the end of Wednesday, no class/lab Friday.

Goals¶

  • Know how to understand the performance of binary classifiers:
    • The definition of True/False Positives/Negatives
    • The calculation of different performance metrics, including:
      • accuracy
      • precision
      • recall
      • F-score
  • Know how to understand the performance of multiclass classifiers:
    • Top-$k$ accuracy
    • Confusion matrix

Classifiers¶

Last time: linear classifiers, where the line separates the classes instead of hitting the datapoints.

Many different types of classifiers exist; several flavors of linear as well as others.

See the classifier zoo for examples.

Binary Classification¶

Evaluating binary classification is trickier than evaluating regression, because most intuitive metrics can be gamed by a well-chosen baseline.

Simplest metric - accuracy: on what % of the examples were you correct?

There are different kinds of right and wrong:

  • TP - True positives (correctly labeled positive)
  • TN - True negatives (correctly labeled negative)
  • FP - False positives (incorrectly labeled positive; was actually negative)
  • FN - False negatives (incorrectly labeled negative; was actually positive)
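As a minimal sketch of the four outcomes above, the counts can be tallied directly from true and predicted labels. The function name `confusion_counts`, the 1/0 encoding for positive/negative, and the example labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = positive, 0 = negative)."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # correctly labeled positive
        elif pred == 0 and truth == 0:
            tn += 1  # correctly labeled negative
        elif pred == 1 and truth == 0:
            fp += 1  # labeled positive, was actually negative
        else:
            fn += 1  # labeled negative, was actually positive
    return tp, tn, fp, fn


# Made-up labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)
```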

Exercise: let TP be the number of true positives, and so on for the other three. Define accuracy in terms of these quantities.

Accuracy = $\frac{TP + TN}{(TP + TN + FP + FN)}$

Exercise: Game this metric. Hint: suppose the classes are unbalanced (95% no-tumor, 5% tumor).

Problem: if you just say no cancer all the time, you get 95% accuracy.
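A small sketch of this baseline, assuming a made-up dataset of 100 patients with the 95%/5% split from the hint. The labels are fabricated for illustration, with 1 meaning tumor.

```python
# Hypothetical dataset: 5 tumor cases (1) and 95 no-tumor cases (0).
y_true = [1] * 5 + [0] * 95

# The "classifier" ignores its input and always predicts no tumor.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# How many of the actual tumor cases did it find?
tumors_found = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print("accuracy:", accuracy)
print("tumors found:", tumors_found, "of", sum(y_true))
```

The accuracy looks excellent even though not a single tumor is detected.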

Okay, what's really important is how often you're right when you say it's positive:

Precision = $\frac{TP}{(TP + FP)}$

Anything wrong with this?

Problem: this incentivizes saying "yes" only when very sure (or never, in which case precision is undefined because TP + FP = 0).

Okay, what's really important is the fraction of all real cancer cases that you correctly identify.

Recall = $\frac{TP}{(TP + FN)}$

Exercise: Game this metric.

Problem: you get perfect recall if you say everyone has cancer.
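To make the two failure modes concrete, here is a sketch that applies the precision and recall formulas above to two hypothetical classifiers on a made-up dataset of 100 patients, 5 of whom have a tumor. The counts are invented for illustration, and returning `None` for an undefined ratio is a choice made here, not a standard.

```python
def precision(tp, fp):
    """TP / (TP + FP); None when nothing was labeled positive."""
    return tp / (tp + fp) if (tp + fp) > 0 else None


def recall(tp, fn):
    """TP / (TP + FN); None when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) > 0 else None


# Classifier A says "tumor" for everyone: finds all 5, plus 95 false alarms.
a_tp, a_fp, a_fn = 5, 95, 0

# Classifier B says "tumor" only for the one case it is most sure about.
b_tp, b_fp, b_fn = 1, 0, 4

print("always-yes: precision", precision(a_tp, a_fp), "recall", recall(a_tp, a_fn))
print("cautious:   precision", precision(b_tp, b_fp), "recall", recall(b_tp, b_fn))
```

Classifier A has perfect recall but very poor precision. Classifier B has perfect precision but misses most of the real cases.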

Can't we just have one number? Sort of. Here's one that's hard to game:

F-score $= 2 \cdot \frac{\textrm{precision} \cdot \textrm{recall}}{\textrm{precision} + \textrm{recall}}$
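A quick sketch of why this is harder to game: the F-score is the harmonic mean of precision and recall, so it is pulled toward the smaller of the two. The precision/recall pairs below are made up for illustration, and returning 0 when both are 0 is a convention assumed here.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 * precision * recall / (precision + recall)


# Hypothetical (precision, recall) pairs.
always_yes = f_score(0.05, 1.0)  # says "positive" for everyone
cautious = f_score(1.0, 0.2)     # says "positive" only when very sure
balanced = f_score(0.8, 0.8)     # reasonably good at both

print("always-yes:", round(always_yes, 3))
print("cautious:  ", round(cautious, 3))
print("balanced:  ", round(balanced, 3))
```

Both gaming strategies score far below the balanced classifier, even though each has one perfect component.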

Here's a visual summary (source: Wikipedia):

Tuning a Binary Classifier¶

Sometimes your classifier will have a built-in threshold that you can tune. The simplest example is a threshold classifier that says "positive" if a single input feature exceeds some value, and "negative" otherwise.

Consider trying to predict sex (Male or Female) given height:

If you move the line left or right, you can trade off between error types (FP and FN).

The possibilities in this space of trade-offs can be summarized by plotting FP vs TP as the threshold varies (when both are expressed as rates, this is the ROC curve):
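The trade-off can also be tabulated without a plot. The sketch below sweeps a height threshold over a small made-up dataset and reports the counts at each setting. The heights, the labels, the choice of 1 as the "positive" class, and the function name `sweep` are all assumptions for illustration.

```python
def sweep(values, labels, thresholds):
    """For each threshold, predict positive when value > threshold
    and return a list of (threshold, TP, FP, FN) tuples."""
    rows = []
    for thr in thresholds:
        tp = sum(v > thr and y == 1 for v, y in zip(values, labels))
        fp = sum(v > thr and y == 0 for v, y in zip(values, labels))
        fn = sum(v <= thr and y == 1 for v, y in zip(values, labels))
        rows.append((thr, tp, fp, fn))
    return rows


# Made-up heights in cm, with 1 = positive class and 0 = negative class.
heights = [155, 160, 163, 168, 170, 172, 175, 178, 183, 188]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1]

rows = sweep(heights, labels, thresholds=[150, 165, 174, 190])
for thr, tp, fp, fn in rows:
    print(f"threshold {thr}: TP={tp} FP={fp} FN={fn}")
```

Raising the threshold removes false positives at the cost of more false negatives. The (FP, TP) pairs from each row are the points one would plot.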

Multi-Class Classification¶

Usually, a multiclass classifier will output a score or probability for each class; the prediction will then be the one with the highest score or probability.
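As a minimal sketch, picking the prediction amounts to taking the argmax over the per-class scores. The class names and scores below are made up.

```python
# Hypothetical classes and per-class scores for a single input.
class_names = ["cat", "dog", "bird"]
scores = [0.2, 0.7, 0.1]

# Index of the highest score (an argmax written in plain Python).
predicted_index = max(range(len(scores)), key=lambda i: scores[i])
predicted_class = class_names[predicted_index]

print("predicted:", predicted_class)
```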

Metrics:¶

  • Accuracy - still applicable, but the random-guess baseline drops quickly as classes are added (uniform guessing over $n$ classes is right about $1/n$ of the time), and good accuracy is very hard to get with many classes.
  • Top-$k$ accuracy: is the correct class among the $k$ most likely classes? Easier, and gives "partial credit".
  • Precision and recall can be defined for each class:
    • Precision for class $c$: $\frac{\textrm{# correctly labeled } c}{\textrm{# labeled class } c}$
    • Recall for class $c$: $\frac{\textrm{# correctly labeled } c}{\textrm{# with true label } c}$
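Top-$k$ accuracy can be sketched as follows: an example counts as correct if its true class is among the $k$ highest-scoring classes. The function name `top_k_accuracy` and the score table are assumptions for illustration.

```python
def top_k_accuracy(score_rows, true_labels, k):
    """Fraction of examples whose true class is among the k top-scoring classes."""
    hits = 0
    for scores, truth in zip(score_rows, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Made-up scores for 4 examples over 3 classes, with their true labels.
score_rows = [
    [0.60, 0.30, 0.10],
    [0.20, 0.50, 0.30],
    [0.10, 0.20, 0.70],
    [0.40, 0.35, 0.25],
]
true_labels = [0, 2, 0, 1]

for k in (1, 2, 3):
    print(f"top-{k} accuracy:", top_k_accuracy(score_rows, true_labels, k))
```

With $k = 1$ this is ordinary accuracy. With $k$ equal to the number of classes it is always 1.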

The full performance details can be represented using a confusion matrix:

Exercises: Given a confusion matrix, how would you calculate:

  • the precision for a certain class?
  • the recall for a certain class?
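One possible way to work these out, sketched under an assumed layout: rows index the true class and columns index the predicted class. Some sources use the transpose, in which case the two formulas swap. The matrix values are made up.

```python
# Hypothetical 3-class confusion matrix.
# ASSUMED convention: cm[i][j] = number of examples with true class i
# that were predicted as class j.
cm = [
    [5, 1, 0],
    [2, 6, 2],
    [1, 1, 7],
]


def class_precision(cm, c):
    """Correct predictions of c divided by everything predicted as c (column sum)."""
    predicted_c = sum(row[c] for row in cm)
    return cm[c][c] / predicted_c


def class_recall(cm, c):
    """Correct predictions of c divided by everything truly c (row sum)."""
    truly_c = sum(cm[c])
    return cm[c][c] / truly_c


for c in range(len(cm)):
    print(f"class {c}: precision={class_precision(cm, c):.3f} "
          f"recall={class_recall(cm, c):.3f}")

# Overall accuracy is the diagonal sum over the total count.
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))
print("accuracy:", accuracy)
```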