We look at how certain features of a videogame affect its sales figures. The data was obtained from Kaggle, which sourced it from a VGChartz web scrape. Additional data was added from Google Trends using the gtab library, which works around several limitations of the Google Trends website on its own.
Along with our standard imports of regression models and other familiar tools, we import the Google Trends AnchorBank (GTAB) library found on GitHub. Google Trends scales its results relative to both the timeframe AND the set of queries being compared, and it only allows 5 searches in a single comparison. GTAB avoids both issues by calibrating every query onto a "universal scale", so we can grab trend data for any number of searches and still compare them.
import pandas as pd
import seaborn as sns
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
# GoogleTrends AnchorBank Library from github
import gtab
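For reference, here is a rough sketch of how a GTAB query works, based on our understanding of the library; the anchorbank-building step is shown only as a commented-out illustration (we rely on the default anchorbank later on), and the example query "Minecraft" is purely illustrative.
# Rough GTAB sketch (illustrative only; the real queries happen further below)
t_example = gtab.GTAB()
# Restrict queries to a fixed timeframe so results are comparable
t_example.set_options(pytrends_config={"timeframe": "2004-01-01 2021-01-01"})
# t_example.create_anchorbank()  # would build a fresh anchorbank for these options (slow)
res = t_example.new_query("Minecraft")   # calibrated trend data for one title
print(res["max_ratio"].max())            # peak search interest on the universal scale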
Our data is stored locally, so we load it into a pandas DataFrame.
The dataset contains around 16,000 entries, but many of them are duplicate titles spread across multiple consoles.
We keep roughly the first 400 entries, since a significant portion of the remaining data is massively skewed to one side (an example follows shortly). To shrink the data further, we collapse duplicate titles into a single row and sum their sales across all platforms.
The dataset also had some column-alignment issues, likely caused by the web-scraping method used. If a game title included a ";", the row would break and the remaining values would shift over by one column. We mostly ignore this, as the affected titles fall outside the data we actually use (their sales were extremely limited).
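As a quick illustration (not part of our actual pipeline), one way to flag such shifted rows is to look for entries whose Year no longer parses as a number, assuming the offset pushes text into that column:
# Illustrative check for rows shifted by a stray delimiter in the title:
# a Year that fails to parse as a number suggests the row is misaligned
# (pandas is already imported above)
raw = pd.read_csv('vgsales_12_4_2021.csv')
shifted = raw[pd.to_numeric(raw['Year'], errors='coerce').isna() & raw['Year'].notna()]
print(f"{len(shifted)} potentially misaligned rows")
print(shifted[['Rank', 'Name', 'Year']].head())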
Our response variable is Global_Sales, which we interpret as the number of physical game copies sold (in millions of units).
data = pd.read_csv('vgsales_12_4_2021.csv')
data = data.iloc[:400]
# Cast NA_Sales to float; the other sales columns keep whatever dtype read_csv
# inferred, which is likely why a few summed values below look like two numbers
# glued together
data['NA_Sales'] = data["NA_Sales"].astype('float64')
aggregation_functions = {'Rank':'first', 'Name': 'first', 'Platform':'first', 'Year':'first', 'Genre':'first', 'Publisher':'first','NA_Sales': 'sum', 'EU_Sales':'sum', 'JP_Sales':'sum', 'Other_Sales':'sum','Global_Sales': 'sum'}
df_new = data.groupby(data['Name']).aggregate(aggregation_functions)
df_new = df_new.sort_values(by="Rank")
df_new
| Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 32.48 | 3.581.3 | 6.96 | 0.99 | 45.31 |
| 3 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 4 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 |
| 5 | Pokemon Red and Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396 | The Legend of Zelda: Spirit Tracks | PS4 | 2015 | Action | Konami Digital Entertainment | 1.08 | 1.35 | 0.48 | 0.47 | 3.38 |
| 397 | WWF War Zone | DS | 2009 | Action | Nintendo | 1.43 | 0.94 | 0.74 | 0.27 | 3.38 |
| 398 | The Legend of Zelda: Majora's Mask | PS | 1998 | Fighting | Acclaim Entertainment | 2.47 | 0.76 | 0.00 | 0.13 | 3.36 |
| 399 | Professor Layton and the Unwound Future | N64 | 2000 | Action | Nintendo | 1.90 | 0.67 | 0.73 | 0.06 | 3.36 |
| 400 | Rugrats: Search For Reptar | DS | 2008 | Puzzle | Nintendo | 0.65 | 1.61 | 0.82 | 0.28 | 3.36 |

343 rows × 11 columns
sns.displot(data, x = "Global_Sales")
plt.title("Heavily Skewed Data Distribution")
Text(0.5, 1.0, 'Heavily Skewed Data Distribution')
However, for our dataset, we only have the following features: Platform, Year, Genre, and Publisher.
Not only do we have a limited number of features, they are all categorical. So we want some additional quantitative data that could potentially improve our model. We decided to quantify the "hype" of a game by how often it is searched for on Google. Google Trends provides this as relative search popularity over a timeframe, and GTAB expedites the process by letting us run many queries on a single, comparable scale. For each title, we take the maximum ratio of searches (max_ratio) and use that as the game's "hype".
Querying takes some time and may even hit a rate limit from too many requests, so we save the query results into a DataFrame and write them out to a CSV that we can reload whenever we need them.
t = gtab.GTAB()
t.set_options(pytrends_config= {"timeframe": "2004-01-01 2021-01-01"})
## Grab hype values from queries here: Create list of names
# Google has a cap on how many queries per day we can make, so if we
# need to pull a lot frequently for testing we should save it in a csv
names_lst = df_new['Name']
hype_lst = []
# Query GTAB for each game and record its peak search ratio ("hype")
for name in names_lst:
    hype = t.new_query(name)
    hype = hype['max_ratio'].max()
    hype_lst.append(hype)
#Save data to CSV
df_hype = pd.DataFrame(hype_lst)
df_hype.to_csv('Hype_vals.csv',index=False)
# Import hype data that was pulled using GTAB
hype_lst = pd.read_csv('Hype_vals.csv', index_col=False)
hype_lst = hype_lst.set_index(df_new.index)
# Add hype data to DataFrame
df_new['hype'] = hype_lst
df_new
| Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | hype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Wii Sports | Wii | 2006 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 | 5.000000 |
| 2 | Super Mario Bros. | NES | 1985 | Platform | Nintendo | 32.48 | 3.581.3 | 6.96 | 0.99 | 45.31 | 1.250000 |
| 3 | Mario Kart Wii | Wii | 2008 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 | 14.434786 |
| 4 | Wii Sports Resort | Wii | 2009 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.00 | 2.702703 |
| 5 | Pokemon Red and Blue | GB | 1996 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.00 | 31.37 | 0.290000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396 | The Legend of Zelda: Spirit Tracks | PS4 | 2015 | Action | Konami Digital Entertainment | 1.08 | 1.35 | 0.48 | 0.47 | 3.38 | 1.428571 |
| 397 | WWF War Zone | DS | 2009 | Action | Nintendo | 1.43 | 0.94 | 0.74 | 0.27 | 3.38 | 0.062400 |
| 398 | The Legend of Zelda: Majora's Mask | PS | 1998 | Fighting | Acclaim Entertainment | 2.47 | 0.76 | 0.00 | 0.13 | 3.36 | 0.210000 |
| 399 | Professor Layton and the Unwound Future | N64 | 2000 | Action | Nintendo | 1.90 | 0.67 | 0.73 | 0.06 | 3.36 | 0.560000 |
| 400 | Rugrats: Search For Reptar | DS | 2008 | Puzzle | Nintendo | 0.65 | 1.61 | 0.82 | 0.28 | 3.36 | 0.005678 |

343 rows × 12 columns
We tried cube root, square root, logarithm, and inverse transformations.
Given that our data is heavily skewed, we transform the response variable (Global_Sales) by taking its inverse (1/x).
This gives us a more normally distributed response, in the hope of a decent model score (previous iterations showed poor results).
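As a rough sketch of how the candidates could be compared (scipy is an extra dependency used only for this illustration, not elsewhere in the notebook), one can look at the skewness each transformation leaves behind:
# Compare candidate transformations by the skewness of the transformed response.
# All Global_Sales values in our subset are positive, so log and inverse are safe.
from scipy.stats import skew

sales = df_new['Global_Sales']
candidates = {
    'raw':         sales,
    'cube root':   np.cbrt(sales),
    'square root': np.sqrt(sales),
    'log':         np.log(sales),
    'inverse':     1 / sales,
}
for name, vals in candidates.items():
    print(f"{name:12s} skew = {skew(vals):.3f}")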
#Transformations of response variable
df_new["Global_Sales"] = df_new["Global_Sales"].transform(lambda x: 1/x).astype(float)
sns.displot(df_new['Global_Sales'])
<seaborn.axisgrid.FacetGrid at 0x7f75e38a70f0>
Now we use an OrdinalEncoder to transform our categorical features of Platform, Genre and Publisher to numerical values the model can understand.
# Ordinal encoding for our categorical variables platform, genre, and publisher
enc = OrdinalEncoder()
df_new[['Platform']]= enc.fit_transform(df_new[['Platform']])
df_new[['Genre']]= enc.fit_transform(df_new[['Genre']])
df_new[['Publisher']]= enc.fit_transform(df_new[['Publisher']])
df_new.head(10)
| Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales | hype |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Wii Sports | 16.0 | 2006 | 10.0 | 15.0 | 41.49 | 29.02 | 3.77 | 8.46 | 0.012086 | 5.000000 |
| 2 | Super Mario Bros. | 8.0 | 1985 | 4.0 | 15.0 | 32.48 | 3.581.3 | 6.96 | 0.99 | 0.022070 | 1.250000 |
| 3 | Mario Kart Wii | 16.0 | 2008 | 6.0 | 15.0 | 15.85 | 12.88 | 3.79 | 3.31 | 0.027917 | 14.434786 |
| 4 | Wii Sports Resort | 16.0 | 2009 | 10.0 | 15.0 | 15.75 | 11.01 | 3.28 | 2.96 | 0.030303 | 2.702703 |
| 5 | Pokemon Red and Blue | 3.0 | 1996 | 7.0 | 15.0 | 11.27 | 8.89 | 10.22 | 1.00 | 0.031878 | 0.290000 |
| 6 | Tetris | 3.0 | 1989 | 5.0 | 15.0 | 26.17 | 2.260.69 | 6.03 | 0.69 | 0.027902 | 23.620559 |
| 7 | New Super Mario Bros. | 2.0 | 2006 | 4.0 | 15.0 | 11.38 | 9.23 | 6.50 | 2.90 | 0.033322 | 0.500000 |
| 8 | Wii Play | 16.0 | 2006 | 3.0 | 15.0 | 14.03 | 9.2 | 2.93 | 2.85 | 0.034459 | 5.263158 |
| 9 | New Super Mario Bros. Wii | 16.0 | 2009 | 4.0 | 15.0 | 14.59 | 7.06 | 4.70 | 2.26 | 0.034941 | 0.460000 |
| 10 | Duck Hunt | 8.0 | 1984 | 8.0 | 15.0 | 26.93 | 0.63 | 0.28 | 0.47 | 0.035323 | 0.470000 |
Now we drop all columns except Publisher and the hype values, then standardize both the features and the response variable.
scalar = StandardScaler()
x = df_new.drop(columns = ['Name','NA_Sales', 'EU_Sales', 'JP_Sales', 'Other_Sales', 'Rank','Genre','Platform','Year','Global_Sales'])
x = scalar.fit_transform(x)
y = np.asarray(df_new['Global_Sales'])
y = scalar.fit_transform(y.reshape(-1,1))
#Split training and test data into 80/20 split, then split validation off of training data.
XtrainZ, XtestZ, YtrainZ, YtestZ = train_test_split(x, y,train_size=0.8,test_size=0.2)
XtrainZ, XvalZ, YtrainZ, YvalZ = train_test_split(XtrainZ, YtrainZ,test_size=0.2)
Here, we set up multiple regression models.
We start with two DummyRegressors to establish baselines using the median and the mean.
As we can see, the resulting scores are... very poor.
This sets up our evaluation environment, but based on these scores we aren't expecting anything impressive.
# Model baseline that uses median to predict values
baseline_reg = DummyRegressor(strategy="median")
baseline_reg.fit(XtrainZ,YtrainZ)
baseline_reg.score(XvalZ, YvalZ)
-0.06581935060585375
# Model baseline that uses mean to predict values, not as good due to the skewness of the distribution of our response variable
baseline_reg = DummyRegressor(strategy="mean")
baseline_reg.fit(XtrainZ,YtrainZ)
baseline_reg.score(XvalZ, YvalZ)
-0.001925256166505518
# Polynomial expansion of the features (degree 4). Note that these expanded
# matrices are generated here, but the linear regression below is fit on the
# untransformed (standardized) features.
poly = PolynomialFeatures(4)
transTrainx = poly.fit_transform(XtrainZ)
transValx = poly.fit_transform(XvalZ)
transTestx = poly.fit_transform(XtestZ)
# Plain linear regression: print the train and validation R^2 scores,
# then compute the validation RMSE
lr = LinearRegression().fit(XtrainZ, YtrainZ)
print(lr.score(XtrainZ, YtrainZ))
print(lr.score(XvalZ, YvalZ))
pred = lr.predict(XvalZ)
sklearn.metrics.mean_squared_error(YvalZ, pred, squared = False)
0.007907021627495125
0.07877507674375972
0.8857204689307384
Next, we set up a regression model using a decision tree.
This gives somewhat better, though still very poor, scores. We then hunt for the best hyperparameters to see whether the score can improve further.
# Base Decision Tree regressor implementation
DT = DecisionTreeRegressor(max_depth=2)
regr = DT.fit(XtrainZ, YtrainZ)
print(regr.score(XtrainZ,YtrainZ))
print(regr.score(XvalZ, YvalZ))
0.13459630557278024
0.0004551252969434705
# Dictionary of parameters of DecisionTreeRegressor for hyperparameter tuning
parameters={
"splitter":["best","random"],
"max_depth":[1,3,5,7,9,11,12],
"min_samples_leaf":[1,2,3,4,5,6,7,8,9,10],
"min_weight_fraction_leaf":[0.1,0.2,0.3,0.4,0.5],
"max_features":["auto","log2","sqrt",None],
"max_leaf_nodes":[None,10,20,30,40,50,60,70,80,90]}
# Used GridSearchCV for hyperparameter tuning
hypertune = GridSearchCV(DT,param_grid=parameters,scoring='neg_mean_squared_error',cv=5,verbose=3);
hypertune.fit(x,y)
# Implement new DT model with parameters provided by hyperparameter tuning
tuned_DT = DecisionTreeRegressor()
# Automatically set params of DT regressor provided by hypertuning
tuned_DT.set_params(**hypertune.best_params_)
# Refit model and check the score/MSE
tuned_regr = tuned_DT.fit(XtrainZ, YtrainZ)
print(tuned_regr.score(XtrainZ,YtrainZ))
print(tuned_regr.score(XvalZ, YvalZ))
pred = tuned_regr.predict(XvalZ)
sklearn.metrics.mean_squared_error(YvalZ, pred, squared = False)
0.19043501161699483
0.09038778887738508
0.8801201877564524
As we can see above, all models produce very poor scores. This is not entirely unexpected given the limited scope of our data and features. Successful games are hard to define, and the dataset focuses on "popular" games that are easy to find in the mainstream and on a handful of specific platforms. There is an additional layer of issues caused by the growing digital sales market and the sheer number of games available: more and more sales numbers are obscured as time passes.
The original dataset likely counts only physical units sold, and even that data is sometimes not made available or not easily obtainable.
Before wrapping up, there are a few things left to discuss: how our features relate to one another visually, what we could have done to improve the score, the problems we faced with this data, and what this kind of research could be used for in practice.
First, let's make a pair plot to visualize the bivariate relationships between our predictor variables and our response variable.
sns.pairplot(data = df_new, vars = ["Publisher", "Genre", "hype", "Global_Sales"])
<seaborn.axisgrid.PairGrid at 0x7f75d8b25d30>
As we can see from the pair plot above, there is a very weak linear relationship between every feature and Global Sales. Hype is heavily affected by how the data is collected, and GTAB works in mysterious ways. The spread of sales is roughly even across most feature values. This may also be a result of how small our sample size is, since we only took around 400 entries from the top of the list.
When we first saw the dataset, we made some assumptions about which features would affect the model the most, based on our own experiences with videogame marketing.
We thought Publisher would be one of the stronger features affecting sales, since a given publisher carries a certain reputation with consumers. However, our dataset featured many publishers we did not even recognize, which threw our perception off, as we sometimes confuse developers with publishers.
The genre of a game is very important to how it appeals to an audience. Every genre has its own audience, so each can be successful in its own way. In addition, not every genre is equally populated. We figured Action would be the most "successful" since it is an oversaturated category (e.g. "Call of Duty", "Grand Theft Auto", "Fortnite"). However, just like movies, there are very good action games and very crummy ones.
We had the least faith in the hype values, primarily because hype is a very nebulous quantity to measure. Hype could be expressed in different ways, but we chose search volume. Games are often discussed in news articles and reports, but there is an almost infinite number of news outlets on the internet, covering both the positives and the negatives of any given game. Google Trends was also difficult to work with, both because of how the system is built and because of the library we needed to use. In the end, the hype values betrayed our expectations.
There are some things we could have done that may or may not have improved our scores, such as finding other datasets related to videogames. We relied mainly on a single dataset, and its structure limited how good our scores could be. We could also have found more numerical data rather than relying only on categorical variables. Some other features we considered are below:
So... what can we pull from all this?
Videogame data is difficult to work with. There are very few good publicly available datasets about videogames themselves, and when they do exist, they often come in piecemeal form with the full version requiring payment. Much of the available data relates to sales, and the aim is to maximize them. Since videogames are also a popular medium of entertainment, it isn't surprising that companies keep their data to themselves.
They want to be the ones on top.