Creating a Genre Classifier using Spotify's API

Madeline Carter, Ethan Crow, Alex Isbill


Overview

Like most young adults, we are avid music listeners. Madeline knows most, if not all, of The Hot 100; Ethan boogies down to rock and J-pop; Alex taps his cowboy boots to the sweet melodies of good ol' country. Clearly, our tastes in music are wildly varied, and we began to wonder how genres of music could be differentiated analytically. We grew fascinated by the idea of a machine learning model that could predict the genre of a song, and discovered a way to build one using everyone's favorite music streaming platform: Spotify.

Gathering the Data

Spotify provides a developer API that lets you pull data about artists, albums, songs, playlists, and more. Among other things, the API exposes a set of "audio features" that Spotify assigns to each song. Here are some of the features Spotify makes available, with their descriptions from the Spotify API Documentation:

[Table: Spotify audio feature descriptions, from the Spotify API Documentation]

We thought it would be interesting to try to predict a song's genre based on these audio features. Unfortunately, Spotify's API does not let you pull the genre of a song directly. Instead, it provides a "genre seed": an array of genres associated with the song, used by the API's recommendation function. To work around this, we used the API to search for the top 1,000 songs in a given genre, pulled the audio features for each song, and attached the genre as a label.
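A minimal sketch of that collection step using the spotipy client library (the paging scheme and the exact fields kept are illustrative assumptions, not our exact notebook code):

```python
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Authenticates via the SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables.
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials())

genres = ["pop", "rock", "country", "edm", "rap", "classical"]
rows = []
for genre in genres:
    # The search endpoint returns at most 50 tracks per request, so page by offset.
    for offset in range(0, 1000, 50):
        result = sp.search(q=f"genre:{genre}", type="track", limit=50, offset=offset)
        tracks = result["tracks"]["items"]
        # audio_features accepts a batch of track ids and returns one dict per track.
        for track, feats in zip(tracks, sp.audio_features([t["id"] for t in tracks])):
            if feats is not None:  # tracks without an analysis come back as None
                rows.append({**feats, "popularity": track["popularity"], "genre": genre})
```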

After putting the data into a dataframe, we had 6,000 rows: 1,000 songs from each of six genres (pop, rock, country, EDM, rap, and classical). After removing duplicates, 5,381 songs remained. Notably, most duplicates came from overlap between pop and other genres such as EDM and rap, an early hint that pop may be harder to classify.
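In pandas, the deduplication step is a one-liner (assuming the track id field from the audio-features payload):

```python
import pandas as pd

df = pd.DataFrame(rows)  # 6,000 rows: 1,000 songs per genre
# Songs returned by more than one genre search appear as duplicate track ids;
# keep only the first occurrence of each.
df = df.drop_duplicates(subset="id", keep="first")  # left us with 5,381 songs
```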


Exploratory Data Analysis

Before training our genre classifier, we looked at the correlations among our features to gauge which would be useful in making predictions.

[Figure: Correlation heatmap of the audio features]

From the correlation heatmap, we find that acousticness and instrumentalness are highly correlated, as are popularity, danceability, energy, and loudness. The latter four features are also distinctly negatively correlated with the former two.
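Such a heatmap can be sketched with seaborn (the column names follow Spotify's audio-features payload; duration_ms is Spotify's name for duration):

```python
import matplotlib.pyplot as plt
import seaborn as sns

feature_cols = ["popularity", "danceability", "energy", "loudness", "speechiness",
                "acousticness", "instrumentalness", "liveness", "valence",
                "tempo", "duration_ms"]
sns.heatmap(df[feature_cols].corr(), annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation between audio features")
plt.show()
```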

To continue our analysis, we converted each feature into z-scores and grouped the data by genre, computing the mean z-score of each feature per genre. Then, to interpret the genres relative to one another, we plotted the difference in mean z-score between each genre and a baseline genre; here, the baseline is rock.
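A minimal sketch of that computation in pandas:

```python
# Standardize each feature, average by genre, then subtract the baseline genre's means.
z = df[feature_cols].apply(lambda col: (col - col.mean()) / col.std())
genre_means = z.groupby(df["genre"]).mean()
diff_from_rock = genre_means.sub(genre_means.loc["rock"], axis=1)
diff_from_rock.drop(index="rock").T.plot(kind="bar", figsize=(12, 5))
```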

[Figure: Mean z-score differences of each genre's audio features relative to rock]

Our principal observations:

As we saw when removing duplicate songs from our dataset, many pop songs overlapped with other genres. To gain insight into this, we created the same plot with pop as the baseline.

[Figure: Mean z-score differences of each genre's audio features relative to pop]

Let's see which features may be distinguishable:


Creating the Classifier

Our goal was to create a classifier that could identify a song as one of five genres (rock, country, EDM, rap, classical) based on the song's audio characteristics. (Pop was left out because of its significant overlap with the other genres.) Exploratory analysis suggested that songs in different genres have distinguishing audio characteristics that should allow a classifier to identify a song's most probable genre. This is a multi-class classification problem with the following setup:

Setup

Task: Accurately classify a song into one of five broad genres

Labels: The ground-truth genre of a given song

Features: Danceability, Energy, Loudness, Speechiness, Acousticness, Instrumentalness, Liveness, Valence, Tempo, Duration

Initial Testing

After splitting the data into features and labels, scaling the features, and creating training, validation, and test sets, we began testing various classification models on our dataset. Each classifier's parameters were left at their default values so we could see whether one performed significantly better than the others from the start; we could then take the best default classifier and tune its hyperparameters to get the best possible predictions.
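A sketch of that preprocessing with scikit-learn, assuming a 60/20/20 train/validation/test split (the exact proportions and random seed are our assumptions):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

model_df = df[df["genre"] != "pop"]  # pop is excluded from the classifier
X = model_df[["danceability", "energy", "loudness", "speechiness", "acousticness",
              "instrumentalness", "liveness", "valence", "tempo", "duration_ms"]]
y = model_df["genre"]

# First carve off 40% for validation + test, then split that portion in half.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)

# Fit the scaler on the training set only, then apply it to all three sets.
scaler = StandardScaler().fit(X_train)
X_train, X_val, X_test = map(scaler.transform, (X_train, X_val, X_test))
```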

Here were our initial results:

Model                      Validation Accuracy   Validation F1 Score
Random Forest              0.761                 0.760
Neural Network             0.753                 0.752
Linear SVM                 0.741                 0.741
Logistic Regression        0.722                 0.720
K-Nearest-Neighbors        0.667                 0.666
Decision Tree              0.654                 0.653
Naive Bayes                0.610                 0.589
BASELINE: Guess the Mode   0.231                 0.087
BASELINE: Random Guess     0.208                 0.209
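Under the hood, this comparison is just a loop over default-configured estimators. The particular scikit-learn classes below are our reading of the table rows, and the F1 averaging mode is an assumption:

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

models = {
    "Random Forest": RandomForestClassifier(),
    "Neural Network": MLPClassifier(),
    "Linear SVM": LinearSVC(),
    "Logistic Regression": LogisticRegression(),
    "K-Nearest-Neighbors": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Naive Bayes": GaussianNB(),
    "BASELINE: Guess the Mode": DummyClassifier(strategy="most_frequent"),
    "BASELINE: Random Guess": DummyClassifier(strategy="uniform"),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_val)
    print(f"{name}: accuracy={accuracy_score(y_val, preds):.3f}, "
          f"F1={f1_score(y_val, preds, average='macro'):.3f}")
```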

From these results, we see that the Random Forest and Neural Network classifiers perform best on both accuracy and F1 score: the Random Forest correctly classified 76.1% of the validation songs. To see how the classifier performs on each genre, we can look at the confusion matrix.
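A row-normalized confusion matrix can be drawn directly from the fitted model, e.g.:

```python
from sklearn.metrics import ConfusionMatrixDisplay

# Normalize by row so each cell shows the fraction of a true genre's songs
# that received each predicted label.
ConfusionMatrixDisplay.from_estimator(
    models["Random Forest"], X_val, y_val, normalize="true", cmap="Blues")
```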

[Figure: Confusion matrix for the default Random Forest classifier on the validation set]

We see that the classifier performs best on the classical and rap genres. However, it confuses country and rock fairly frequently, likely because these genres have very similar audio feature profiles, as seen in the exploratory analysis.

Now, we will take the Random Forest Classifier and see if we can tune its hyperparameters to achieve better results.

Hyperparameter Tuning

The Random Forest classifier has 18 hyperparameters that control various aspects of how the model makes its predictions. We took a subset of these and tested how the model performed across combinations of values, using GridSearchCV and RandomizedSearchCV to evaluate 68 combinations in total and selecting the model with the highest mean accuracy across three cross-validation folds.
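Here is a sketch of the grid-search half of that process; the search space below is hypothetical, and the grids we actually used are in the notebook:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical search space for illustration only.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
}
# cv=3 averages accuracy over three cross-validation folds, as described above.
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=3, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```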

The following are hyperparameters that were tested, as well as the combination that produced the best results:

Final Performance

After hyperparameter tuning, we achieved 77.1% accuracy on the validation set. While that is only a one-point increase over the default model, improvement is evident and celebrated. The model scored slightly lower on the test set, indicating some overfitting to the training and validation sets. The confusion matrix for the final model also shows slightly improved predictions for the rock and country genres.
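The final check is simply scoring the tuned model on the held-out sets:

```python
best_rf = search.best_estimator_
print("Validation accuracy:", best_rf.score(X_val, y_val))  # 0.771 for our run
print("Test accuracy:", best_rf.score(X_test, y_test))      # slightly lower, hinting at mild overfitting
```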

[Figure: Confusion matrix for the tuned Random Forest classifier]

Further Analysis: For details of our full analysis, please see the Jupyter notebook linked here.