Final Project DATA 311 - Game Sales

Jason Li, Caleb Ponce, Jeremy Tran

Abstract:

We examine how some features of a videogame affect its sales figures. The data comes from a Kaggle dataset that was itself sourced from a VGChartz web scrape. We added search-popularity data from Google Trends, collected with the gtab library, which works around the limitations of using the Google Trends website on its own.

The Data:

Along with our standard imports (regression models and so on), we import the Google Trends Anchor Bank (GTAB) library, found on GitHub. GTAB lets us pull trend data on a "universal scale" rather than a limited scope. It also lets us run more than 5 searches at a time. On its own, Google Trends scales results relative to both the timeframe AND the search queries in the request, so values from separate requests cannot be compared directly. GTAB avoids this issue.

Our data is stored locally, so we load it into a pandas DataFrame.
The dataset has around 16,000 entries, but many of them are the same title repeated once per console. We take around 400 entries from the start of the dataset because a significant portion of the data is heavily skewed to one side (an example follows soon). To reduce the size of our data, we collapse duplicate titles into a single entry and add up their sales across all platforms.
The dataset had some issues with column alignment. This was likely caused by the method the scraper used. If a game included a ; in its title, the row would break and its data would be offset by one column. We mostly ignore this, as such titles fall outside the data we use (their sales were extremely limited). Our response variable is Global_Sales, which we interpret as the number of physical game copies sold.
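The duplicate-title step described above could be sketched roughly as follows. This is only an illustration on a tiny made-up table; the column names follow the ones mentioned in this report, and keeping the first Platform, Genre and Publisher for each title is an assumption, not necessarily what our notebook did.

```python
import pandas as pd

# Tiny made-up sample that mimics the dataset's columns (values are illustrative only).
raw = pd.DataFrame({
    "Name": ["Game A", "Game A", "Game B", "Game C", "Game C"],
    "Platform": ["PS3", "X360", "Wii", "PC", "PS4"],
    "Genre": ["Action", "Action", "Sports", "Puzzle", "Puzzle"],
    "Publisher": ["Pub1", "Pub1", "Pub2", "Pub3", "Pub3"],
    "Global_Sales": [10.0, 8.5, 30.0, 1.2, 0.8],
})

# Collapse rows that share a title: sum the sales and keep the first value
# of the other columns (an assumption made for this sketch).
combined = (
    raw.groupby("Name", as_index=False)
       .agg({
           "Platform": "first",
           "Genre": "first",
           "Publisher": "first",
           "Global_Sales": "sum",
       })
       .sort_values("Global_Sales", ascending=False)
       .reset_index(drop=True)
)

print(combined)
```

After this step each title appears once, with Global_Sales holding the total across platforms.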

However, for our dataset, we only have the following features:

Not only do we have few features, they are all categorical. As such, we need some quantitative data that could help improve our model. We decided to quantify the "hype" of a game by how often it was searched for on Google. Google Trends provides this as a relative search popularity over a timeframe. GTAB speeds up this process by letting us perform multiple searches on a more accurate timescale. We take the maximum search ratio for each title and use that as the game's "hype".
This takes some time and can hit a rate limit from too many requests, so we store the query results in a DataFrame and write it to a CSV that we can reload whenever needed instead of querying again.
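The save-and-reload idea can be sketched as below. The names fetch_hype and load_or_build_hype are hypothetical helpers written for this illustration, and fetch_hype returns a placeholder number rather than calling GTAB; in practice it would wrap the slow, rate-limited lookup and return the maximum of the returned series.

```python
import os
import pandas as pd


def fetch_hype(title):
    """Hypothetical stand-in for the slow, rate-limited trend lookup."""
    return float(len(title))  # placeholder value, NOT real trend data


def load_or_build_hype(titles, cache_path, fetch=fetch_hype):
    """Return a DataFrame of (Name, Hype), querying only titles missing from the CSV cache."""
    cache = {}
    if os.path.exists(cache_path):
        # Read names as strings so numeric-looking titles are not converted to integers.
        saved = pd.read_csv(cache_path, dtype={"Name": str})
        cache = dict(zip(saved["Name"], saved["Hype"]))

    for title in titles:
        if title not in cache:
            cache[title] = fetch(title)

    result = pd.DataFrame({"Name": list(cache.keys()), "Hype": list(cache.values())})
    result.to_csv(cache_path, index=False)  # persist so later runs skip the queries
    return result
```

Calling load_or_build_hype a second time with the same titles reads everything from the CSV and makes no new requests, which is the point of caching when requests are rate limited.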

We tried cube root, square root, logarithm and inverse transformations.
Given that our data is heavily skewed, we transform the response variable (Global_Sales) by taking its inverse (1 / Global_Sales).
This gives us a more normally distributed response, in the hope of a decent model score (previous iterations showed poor results).
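One way to compare the candidate transformations is to look at the skewness of each transformed response. The sketch below uses synthetic, strictly positive data as a stand-in for Global_Sales, so the numbers it prints say nothing about our actual dataset; which transformation comes out best depends on the data.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed, strictly positive values standing in for Global_Sales.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=400), name="Global_Sales")

# The four transformations mentioned in the text, plus the untransformed values.
candidates = {
    "none": sales,
    "cube_root": np.cbrt(sales),
    "square_root": np.sqrt(sales),
    "log": np.log(sales),
    "inverse": 1.0 / sales,
}

# Sample skewness near 0 suggests a roughly symmetric distribution.
skewness = {name: float(values.skew()) for name, values in candidates.items()}

for name, value in skewness.items():
    print(f"{name:12s} skew = {value:.3f}")
```

Note that a model fitted on 1 / Global_Sales predicts on that inverted scale, so its predictions have to be inverted again before they can be read as sales.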

Now we use an OrdinalEncoder to transform our categorical features of Platform, Genre and Publisher to numerical values the model can understand.
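As a rough illustration of the encoding step, the sketch below applies scikit-learn's OrdinalEncoder to a small made-up table. The rows are invented for the example; only the three column names come from this report.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up rows; only the column names match the dataset.
games = pd.DataFrame({
    "Platform": ["Wii", "PS3", "X360", "Wii"],
    "Genre": ["Sports", "Action", "Shooter", "Racing"],
    "Publisher": ["Pub2", "Pub1", "Pub3", "Pub2"],
    "Global_Sales": [30.0, 18.5, 12.0, 9.0],
})

cat_cols = ["Platform", "Genre", "Publisher"]

# Each category is mapped to an integer code (stored as a float),
# assigned in sorted order of the category labels.
encoder = OrdinalEncoder()
encoded = games.copy()
encoded[cat_cols] = encoder.fit_transform(games[cat_cols])

print(encoder.categories_[0])  # the Platform categories, in code order
print(encoded)
```

The integer codes follow the alphabetical order of the labels, so they carry no real ranking. Tree models can still split on them, but the order itself should not be read as meaningful.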

Now, we drop all variables except Publisher and the Hype values.

Here, we set up multiple regression models.
We start out with two DummyRegressors to define some baselines through the median and mean.
As we can see, the scores are... very poor.
This sets our evaluation baseline, but based on these scores we aren't expecting anything impressive.
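A baseline setup along these lines could look like the sketch below. It uses synthetic data in place of our features and response, and it assumes R² with 5-fold cross-validation as the score, which may differ from the exact evaluation used in our notebook.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: two feature columns (think Publisher code and Hype)
# and a skewed response (think transformed Global_Sales).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = rng.lognormal(size=200)

# A DummyRegressor ignores X and always predicts one constant:
# the training mean or the training median.
baseline = {}
for strategy in ("mean", "median"):
    model = DummyRegressor(strategy=strategy)
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    baseline[strategy] = float(scores.mean())

print(baseline)
```

A constant prediction can never score above 0 on held-out R², so any real model should at least beat these two values.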

Next, we set up a regression model using a Decision Tree.
This results in noticeably better but still very poor scores. Afterwards, we also search for the best hyperparameters to see whether our score can improve further.
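The decision tree and hyperparameter search can be sketched as follows. The data is synthetic, and the parameter grid, the train/test split and the R² scoring are assumptions chosen for the example rather than the settings from our notebook.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a learnable signal, standing in for our features and response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical grid: limit tree depth and leaf size to control overfitting.
param_grid = {
    "max_depth": [2, 4, 6, None],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="r2",
)
search.fit(X_train, y_train)

# Score the best tree on data it has not seen during the search.
test_r2 = search.best_estimator_.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```

Keeping a held-out test set separate from the cross-validated search gives a fairer estimate of how the tuned tree behaves on new games.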

As we can see above, all models provide very poor scores. This is not entirely unexpected given the limited scope of our data and features. Successful games are hard to define. The dataset focuses on "popular" games that are easily found in the mainstream and on very specific platforms. There is an additional layer of issues caused by the growing digital sales market and the sheer number of games available. More and more sales numbers are obscured as time passes.
The original dataset is likely counting only physical units sold. Sometimes this data is not made available or is not easily obtainable.

So... What's with the Scores?

It would be best to take a look at how our supposed features relate to one another. So, let's make a pair plot to visualize the bivariate relationships between our predictor variables and our response variable.

As we can see from the pair plot above, there is a very poor linear relationship between every feature and Global Sales. Hype is heavily affected by how the data is collected, and the way GTAB arrives at its values is opaque to us. The sales spread for most features is roughly even. This may also be a result of our small sample size, as we only took around 400 entries from the top of the list.
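A numeric companion to the pair plot is the Pearson correlation of each predictor with the response. The sketch below runs on synthetic, independently generated columns, so its output does not describe our dataset; it only shows how the check could be done.

```python
import numpy as np
import pandas as pd

# Synthetic columns standing in for the encoded Publisher, Hype and Global_Sales.
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "Publisher": rng.integers(0, 20, size=400).astype(float),
    "Hype": rng.uniform(0, 100, size=400),
    "Global_Sales": rng.lognormal(size=400),
})

# Pearson correlation of each predictor with the response.
# Values near 0 indicate a weak linear relationship.
pearson = frame.corr(method="pearson")["Global_Sales"].drop("Global_Sales")

print(pearson)
```

A correlation near 0 rules out only a linear relationship; a tree model could in principle still pick up a non-linear one, which is why the plots and the model scores are worth reading together.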

Game Features Intuitively

When we first saw the dataset, we made some assumptions about what features would affect the model score the most based on our own experiences with videogame marketing.

Publisher

We thought that Publisher would be one of the stronger features affecting sales, because a given publisher has a certain reputation among consumers. However, our dataset featured many publishers we did not even know, and this threw our perception off, as we may sometimes have confused developers with publishers.

Genre

The genre of a game is very important to how it appeals to an audience. Every genre has its own audience, so each can be successful in its own way. In addition, not every genre is populated equally. We figured that action would be the most "successful" as it is an oversaturated category (e.g. "Call of Duty", "Grand Theft Auto", "Fortnite"). However, just like movies, there are very good action games and very crummy ones.

Hype

We had the least faith in the Hype values, primarily because hype is a very nebulous quantity to measure. Hype could be expressed in different ways, but we chose search volume. Games are often discussed in news articles or reports, but there is an almost endless number of news platforms on the internet, and their coverage may be positive or negative about a given game. Google Trends was difficult to work with because of how the system is built and the library we needed to use. The values actually defied our expectations.

Score Improvements?

There are some things we could have done that may or may not have improved our scores. This includes finding other datasets that relate to videogames. We relied mainly on a single dataset, and its structure limited how good the score could be. We could have found more numerical data rather than having only categorical variables. Some other features we considered are below:

Conclusion

So... What can we pull from all this?

Videogame data is difficult to work with. There are very few good publicly available datasets pertaining to videogames themselves. When they are available, they come in piecemeal form, with the full version requiring payment. Often, much of the data relates to sales, and the aim is to maximize them. Since videogames are also a popular medium of entertainment, it isn't a surprise that companies keep data to themselves.
They want to be the ones on top.