DATA 311 - Lecture 18 - Feature Preprocessing and Encoding

Write the names of the students at your table:

  1. We computed the L2 distance of every penguin to Penguin 0 and ran pairplot to see how the L2 distance relates to each numerical column. Here’s the bottom row of that pairplot:

[Figure: bottom row of the pairplot, showing L2 distance to Penguin 0 against each numerical column]

What’s happening here? Is this what we wanted? If not, why did this happen and what could we do about it?
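One way to probe this question is to compare L2 distances computed on raw columns with distances computed after rescaling each column. The sketch below is only an illustration under stated assumptions: the three rows are made-up values (not the real penguin data), the two columns stand in for a small-range and a large-range measurement, and z-scoring is just one possible rescaling.

```python
import numpy as np

# Made-up measurements (NOT the real penguins data):
# column 0 is a bill length in mm, column 1 is a body mass in g.
X = np.array([
    [39.0, 3750.0],
    [46.0, 3800.0],
    [40.0, 5000.0],
])

def l2_to_first(features):
    """L2 distance from every row to row 0."""
    return np.sqrt(((features - features[0]) ** 2).sum(axis=1))

# Distances on the raw columns: the large-range column dominates.
raw = l2_to_first(X)

# Z-score each column so every feature has mean 0 and standard deviation 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = l2_to_first(Z)

print("raw distances:   ", raw)
print("scaled distances:", scaled)
```

With these made-up numbers, the raw distances track the mass column almost exactly, while the rescaled distances let both columns contribute.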

  2. What is the Hamming distance between each pair of the following three penguins, based only on the categorical features listed? Don’t count the index column as a feature.

    index  island     sex
    1      Torgersen  Female
    338    Biscoe     Female
    33     Dream      Male
  3. Suppose you’re extracting features that will be used for distance comparisons among datapoints using some \(L^p\) distance. When would you want to choose ordinal encoding over one-hot encoding?
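
As a starting point for the last two questions, here is a minimal sketch of a Hamming distance helper and of how the two encodings behave under L2 distance. Everything in it is hypothetical: the records, the feature names, and the ordered levels `small < medium < large` are invented for illustration and are not taken from the penguins data.

```python
import numpy as np

def hamming(a, b):
    """Number of positions at which two equal-length records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical records with two categorical features each (color, size).
r1 = ("red", "small")
r2 = ("red", "large")
r3 = ("blue", "medium")
print(hamming(r1, r2), hamming(r1, r3), hamming(r2, r3))

# Hypothetical feature whose levels have a natural order.
levels = ["small", "medium", "large"]

# Ordinal encoding: each level becomes its rank.
ordinal = {level: float(i) for i, level in enumerate(levels)}
# One-hot encoding: each level becomes its own indicator vector.
one_hot = {level: np.eye(len(levels))[i] for i, level in enumerate(levels)}

def l2(u, v):
    """L2 distance between two encoded values (scalars or vectors)."""
    return float(np.linalg.norm(np.atleast_1d(u) - np.atleast_1d(v)))

for a, b in [("small", "medium"), ("small", "large"), ("medium", "large")]:
    print(a, b, l2(ordinal[a], ordinal[b]), l2(one_hot[a], one_hot[b]))
```

In this sketch the ordinal encoding places `small` farther from `large` than from `medium`, while the one-hot encoding places every pair of distinct levels at the same distance. Which behavior is preferable depends on whether the order of the levels is meaningful for the comparison.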