Lecture 35 - Preprocessing and Cross-Validation

Announcements:

Goals

Act I - A Tale of Scale

This dataset is from Spotify; it has some of their (presumably machine-learning-derived) attributes ("energy", "liveness", etc.), as well as a genre label.

I wanted to do PCA on the attributes, visualize it in 2D, and see if genres were well separated.

Grab the numerical attribute columns only:

I'm not sure what "mode" is, but it only has two values. Let's use only variables that seem truly numerical.

Ok, let's do some PCAing!

Wow, this is great! Practically all the variance is explained by 2 components. This means we won't lose anything if we plot the first 2 components.

Huh. That's a... weird picture.

The fact that there are exactly 12 lines is suspicious.

Less convenient, but also less surprising: the intrinsic dimensionality is about 10.
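To see how feature scale alone can produce a "two components explain everything" result, here is a minimal sketch on synthetic data (not the Spotify dataset). The helper name `explained_variance_ratio`, the ten made-up features, and the factor of 1000 are all assumptions for illustration, and PCA is done directly with NumPy's SVD rather than with whichever library routine was used in lecture.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)                   # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)   # singular values, largest first
    var = s ** 2                              # proportional to per-component variance
    return var / var.sum()

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))   # synthetic: 10 independent unit-variance features
X[:, 0] *= 1000.0                # hypothetical feature recorded in much larger units

raw = explained_variance_ratio(X)

# Standardize each column to zero mean and unit variance, then redo PCA
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = explained_variance_ratio(Z)

print("raw, first 2 components:   ", raw[:2].sum())
print("scaled, first 2 components:", scaled[:2].sum())
```

On this synthetic data, the raw features look almost perfectly low-dimensional only because one column has a huge variance; after standardizing, the variance is spread across all ten components.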

Act II - Cross-Validation, Hyperparameters, and How to Tune Them

Back to the zoo!

The above plot shows the classification results for each of the 3 different "folds" of training. Notice that the accuracy is similar across folds, but the decision boundaries are slightly different because of the particulars of which points ended up in the training vs. validation set.
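The fold-by-fold procedure described above can be sketched as follows. This is an illustrative NumPy-only version under stated assumptions: the data are synthetic two-class points (not the zoo data), the classifier is a simple nearest-centroid rule standing in for whatever model was used in lecture, and the function names are hypothetical.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Return the class labels and the mean point of each class."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(model, X):
    """Assign each point to the class whose centroid is closest."""
    classes, centroids = model
    # Distances from every point to every centroid: shape (n_points, n_classes)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return classes[d.argmin(axis=1)]

def k_fold_accuracies(X, y, k=3, seed=0):
    """Shuffle once, split into k folds, and validate on each fold in turn."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    accs = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = nearest_centroid_fit(X[train_idx], y[train_idx])
        pred = nearest_centroid_predict(model, X[val_idx])
        accs.append(float((pred == y[val_idx]).mean()))
    return accs

# Synthetic two-class data in 2D (an assumption, not the lecture's dataset)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
               rng.normal(3.0, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

accs = k_fold_accuracies(X, y, k=3)
print("per-fold accuracy:", accs)
print("mean accuracy:", np.mean(accs))
```

Each fold trains on a different two-thirds of the points, so the fitted centroids (and hence the decision boundary) shift a little from fold to fold even when the accuracies come out close.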