Our project involves books - lots of books. We looked at the synopses of many books along with their primary genre, with the goal of classifying a book's genre based solely on its description. We got our dataset from Kaggle, where someone had scraped data on 100,000 books from Goodreads, including, crucially, their descriptions and genres. We wanted to answer the question: are there words that are significantly more common in some types of books than others? Can these words be used to determine information about a book from its synopsis?
(Marvin) I was initially attracted to this idea while thinking about light novel titles. Light novels are a type of book aimed at young adults and teenagers, and their titles are notorious for being long and extremely descriptive of the book's contents. However, there are only so many of them, and usable data about them is hard to come by, so when we found the Goodreads dataset of 100,000 books we decided to apply our idea to books as a whole.
The genres of books are voted on by the Goodreads community, and we found many of them too niche or redundant for our purposes, so we decided to collapse each book's list of genres down to a single genre, drawn from a list of thirty genres we had hand-selected to be individually distinct and useful. For example, 'Christian' and 'Christianity' were both genres, so we opted to keep only the former. During this step we also did some cleaning and preprocessing to remove non-ASCII characters, and we dropped books with empty descriptions or no usable genre. After cleaning we were left with about 77,000 book descriptions.
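A minimal sketch of this cleaning step is below. The column names, file name, and the shortened genre list are all assumptions for illustration, not our exact code.

```python
import pandas as pd

# Hypothetical column names and a truncated stand-in for our thirty hand-picked genres.
KEPT_GENRES = {"Fantasy", "Romance", "Christian", "Animals", "Biography", "Games"}  # ...30 total

def simplify_genres(genre_list):
    """Return the first community-voted genre that appears in our kept list, else None."""
    for g in genre_list:
        if g in KEPT_GENRES:
            return g
    return None

df = pd.read_csv("goodreads_100k.csv")

# Strip non-ASCII characters from the descriptions.
df["description"] = (df["description"]
                     .str.encode("ascii", errors="ignore")
                     .str.decode("ascii"))

# Collapse each book's genre list to a single kept genre, then drop unusable rows.
df["genre"] = df["genres"].str.split(",").apply(simplify_genres)
df = df.dropna(subset=["description", "genre"])
df = df[df["description"].str.strip() != ""]
```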
Our first step was to figure out which words would be most important for machine learning by finding the words that are most commonly used in each genre relative to the others. To accomplish this, we first needed to preprocess the data so that different forms of a word could be identified as the same word: for example, "book" and "books" refer to the same thing, but they would count as entirely different words if we simply tallied how many times each occurred. To remove the effect of unimportant words, we used a list of stop words, words that appear in most sentences but don't add any meaning to them. These are words like "to", "the", and "a", all of which support the structure and syntax of a sentence but carry no meaning of their own. To reduce the amount of data we would have to process and to make our features more robust, we used the Natural Language Toolkit (NLTK), which can stem words, reducing them to their word roots. A sketch of this step is shown below.
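Here is a small sketch of that preprocessing, using NLTK's English stop word list and the Porter stemmer; the exact tokenization we used may have differed.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stop word list

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase the text, drop stop words, and reduce each remaining word to its stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(stemmer.stem(t) for t in tokens if t not in STOP_WORDS)

print(preprocess("The books were flying over the castle"))  # -> "book fli castl"
```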
After reducing words to their stems, we ran the TF-IDF vectorizer from sklearn, which applies the TF-IDF formula to our corpus of genre blurbs. The formula weighs how often a term appears within a given entry (term frequency) against how many entries in the corpus contain that term at all (inverse document frequency). The result is a vector of TF-IDF values over all the words for every single entry in the corpus. For example, in a corpus of 100 books where one of them is about bats, the word 'bat' would have a high TF-IDF value in that book's vector, while it would have a significantly lower value in the rows for books that are not about bats.
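A toy example of sklearn's vectorizer on a three-document corpus illustrates the idea; the documents here are made up for demonstration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: one entry per "document" (here, three short blurbs).
corpus = [
    "bats live in caves and bats hunt at night",
    "a knight rides across the kingdom",
    "the detective solves the case",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: rows = documents, columns = words

# 'bats' scores highly only in the first row, because it is frequent there
# and absent from the other documents.
scores = pd.DataFrame(tfidf.toarray(), columns=vectorizer.get_feature_names_out())
print(scores["bats"])
```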
For our exploratory analysis we decided to look at the words with the highest TF-IDF values in each genre's corpus relative to the overall corpus of all book descriptions. We grouped all of the books' descriptions by genre into their own corpora, which we then fed to the TF-IDF vectorizer. After applying the vectorizer, we looked at the words with the highest TF-IDF scores in each genre. The results were in line with expectations: three of the top five words in the Animals genre were animal names, and the most important word in the Biography genre was 'life'.
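A sketch of this genre-level step, continuing from the cleaned DataFrame above (column names remain assumptions), might look like this:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Concatenate every description in a genre into one long document, one per genre.
genre_corpus = df.groupby("genre")["description"].apply(" ".join)

vectorizer = TfidfVectorizer()
genre_tfidf = vectorizer.fit_transform(genre_corpus)  # one row per genre

# Print the five highest-scoring words for each genre.
words = vectorizer.get_feature_names_out()
for genre, row in zip(genre_corpus.index, genre_tfidf.toarray()):
    top = words[np.argsort(row)[::-1][:5]]
    print(genre, list(top))
```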
Now that we understood how the TF-IDF vectorizer worked, we began applying it to our corpus of book synopses so that each book could be represented as a vector for use in machine learning. However, we immediately ran into a big problem: running the TF-IDF vectorizer on 77,000 book descriptions with around 430,000 unique words would have required 190GB of memory to vectorize all the descriptions, which was not feasible for us. Instead, we decided to keep only the most important words. From our earlier genre-level step we built a list of every word along with its highest TF-IDF score in any genre, letting us isolate the words most distinctive of a single genre: things like the animal names from the previous step, or 'chess', a clear indicator of a book belonging to the 'Games' genre. The chart below displays the top thirty words ranked by their highest TF-IDF score, but don't be fooled: the vast majority of the 430,000 words have extremely small scores, well below 0.001. Note that the top two words are foreign-language words of some sort that rank highly in the 'European Literature' category, a clear example of how a word that only shows up in one category will have a very high TF-IDF score in that category.
With this list of every word and its highest TF-IDF score, we could whittle it down to only the most important words. Initially we chose a threshold of 0.03, which narrowed the list down to about 1,500 words and got us an accuracy of about 0.50 with our best classifier. Lowering the threshold to 0.02 gave us around 2,500 words and a gain of around 4% in accuracy on our validation set. We did not lower it further because of time and file-size constraints: our saved .csv file was already a whopping 800MB with 2,500 words' TF-IDF scores, which we felt was a reasonable stopping point. A sketch of this selection step follows.
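Continuing from the genre-level matrix above, the selection step amounts to taking each word's best score across the thirty genre rows and keeping words above a threshold; the variable names here are carried over from the earlier sketches.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Each word's highest TF-IDF score in any genre.
max_scores = genre_tfidf.toarray().max(axis=0)
words = vectorizer.get_feature_names_out()

THRESHOLD = 0.02  # 0.03 gave us ~1,500 words; 0.02 gave ~2,500
vocabulary = words[max_scores > THRESHOLD]

# Re-vectorize the individual book descriptions using only the kept words,
# so each book becomes a ~2,500-dimensional vector instead of a 430,000-dimensional one.
book_vectorizer = TfidfVectorizer(vocabulary=vocabulary)
X = book_vectorizer.fit_transform(df["description"])
```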
Our goal was to see whether we could find a book's genre based on just the words in its description. The results were quite interesting, since with thirty genres a classifier guessing at random would be right only about 3% of the time. In our initial analysis, by summing the TF-IDF scores of the words in each description for every genre and predicting the highest-scoring genre, we were able to achieve an accuracy of 43%. This gave us a very ambitious baseline to beat.
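The summing baseline can be sketched as below; `genre_scores` is an assumed structure, a dict mapping each genre to a {word: TF-IDF score} dict built from the genre-level matrix above.

```python
def predict_by_summing(description, genre_scores):
    """Sum each genre's TF-IDF scores over the words in a description and pick the top genre."""
    totals = {
        genre: sum(scores.get(word, 0.0) for word in description.split())
        for genre, scores in genre_scores.items()
    }
    return max(totals, key=totals.get)
```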
When we began using machine learning techniques, we started with a Complement Naive Bayes classifier, which had around a 47% accuracy rate. After much trial and error with different classifiers, we achieved around 54% accuracy at best using a stochastic gradient descent (SGD) classifier. We believe the SGD classifier gave us the best results because, by tweaking its hyperparameters, namely the alpha value, it can act like a more heavily regularized linear model, and it can do so while still handling our large dataset, something linear classifiers like support vector classification struggled with. This classifier gave us the correct genre over half of the time. Considering that with thirty genres random guessing would have netted a 3% accuracy, we're quite happy with 54%, especially given the limitations we'll discuss in a moment.
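A sketch of the two classifiers on the book vectors from earlier follows; the train/validation split and the alpha value shown are illustrative, not our exact tuned settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

# X is the sparse TF-IDF matrix of book descriptions, df["genre"] the labels.
X_train, X_val, y_train, y_val = train_test_split(
    X, df["genre"], test_size=0.2, random_state=0
)

nb = ComplementNB().fit(X_train, y_train)
print("Complement Naive Bayes:", accuracy_score(y_val, nb.predict(X_val)))

# With hinge loss, SGDClassifier behaves like a regularized linear SVM;
# alpha controls the regularization strength.
sgd = SGDClassifier(loss="hinge", alpha=1e-5, random_state=0).fit(X_train, y_train)
print("SGD classifier:", accuracy_score(y_val, sgd.predict(X_val)))
```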
We had a very noisy dataset - it was, after all, a hundred thousand books scraped from Goodreads by some data scientist. When looking through random descriptions we found many that simply listed some facts about the book, like its publication date and author. In retrospect, if we had spent more care cleaning these garbage descriptions out of our dataset, we likely could have improved our accuracy, because they tell us nothing about the book itself. In addition, the system resources required to vectorize all of our words were too immense for us, and we were forced to choose 2,500 words - less than one percent of all the words we had found! Granted, many of the 430,000 words are not strictly English words: they include author names, fantasy names, food names, typos, foreign words, and all sorts of other artifacts. We already saw the effect of this limitation when we increased our set of words and got a 4% gain in prediction accuracy, so it's not unthinkable that we could achieve a much better score if we were able to use the full extent of the words found in our dataset.
In some articles we read online, data scientists were able to achieve above 80% accuracy on problems like basic email classification, so it seems plausible that we could reach at least 60-70% accuracy on book descriptions without these limitations.
In our analysis, we were hoping to find a clear link between the kinds of words in a book's description and that book's genre, finding the most 'important' words that would tell us the most about a book's genre. We were largely successful, showing that with only 2,500 words' TF-IDF scores we could predict a book's genre out of thirty options more than half of the time, even with a very noisy dataset. There are several directions we could explore to further improve our genre accuracy. We could include the title, author, publication year, page count, and of course the text of the book itself in a machine learning model. We could even experiment with features based on a book's cover, for example its primary color and how monochrome or colorful it is. These are all things we as humans use to make snap judgments about a book, but implementing them in a machine learning model takes a lot of work and shows that we still have a long way to go in learning about machine learning.