index

Introduction¶

If you’ve clicked on this blog post you’re likely interested in the tech field, furthermore, joining the tech industry, another related occupation or maybe you’ve already made a career out of tech. Regardless, as with all proper career plans, you’re going to need to know your potential earnings. But predicting your potential earnings can be a really hard thing to do, especially if you’re planning for the long term. It can just get really complicated, different companies offer different wages and benefits, your years at the company and overall experience will change your income, even race and gender can have an unjustified affect. This is what we explored and eventually tried to predict.

As we all know in order to do that we needed data, and preferably lots of it. Now, there is a company named “levels.fyi” that have (and continue to) explore the same question, but they have the data to do so. Thankfully for us they’ve uploaded it in a way where it can be scraped, so we were able to able to get a hold of it on Kaggle from someone of the username “Jack Ogozaly” who did exactly that.

One of the most important things to mention here is how the source data was collected. As mentioned this data was retrieved from levels.fyi, who claim that they get their data from submitted salaries. From then, levels.fyi then verifies the submissions, which requires proof documents (i.e., Offer Letters, W2 Statements, Annual Comp Statements, Promotion Summaries, etc.). From there they then slightly modify them for security and anonymity of the original person, though the exact definition of slightly modified is never described. Considering levels.fyi uses those modified numbers and relies on their accuracies, we can assume it’s probably negligible.

We mentioned it before but it’s best if we state our exact goal explicitly. We want to explore this dataset through the filter of what makes total yearly compensation. What correlates to an increase? What correlates to a decrease? What doesn’t even matter? And are there any reasonable features that we can safely presume to be a causation? After all that, how accurately can we predict someone’s total yearly compensation given the same features available from the dataset?

First, we’ll launch into a discussion of some exploratory analysis we conducted on the data. Second, we’ll discuss the model we created from the data to predict total yearly compensation for a given STEM employee. Our main findings are that total yearly compensation varies widely across different people in different parts of the world, different levels of education, different levels of experience, and in different jobs, and that predicting earnings from these different features of a person’s employment is very with only this amount of data.

Exploratory Analysis: What are the general trends with STEM employee earnings?¶

Looking at the raw data we saw various relationships among different features within the data set. Since the main goal of our analysis was to examine and predict total yearly compensation for the employees in our data set, we simply just compared most of the features in our data to total yearly compensation to gain a strong understanding of the data.

Some of the obvious questions we had were how does education, race, and years of experience relate to total yearly compensation. Within education there is a trend where higher education correlates with higher compensation when looking at a masters or PhD. When comparing just the BA, Highschool, and Some College columns we see that higher education does not exactly mean higher total yearly compensation, as some college seems to make the most out of the three columns with BA degrees having the lowest total yearly compensation.

We found very little discrepancy when comparing races and compensation but noted that African Americans did have a slightly lower yearly compensation when compared among the other races. When looking at gender and only when an employee stated male or female, we find that only 16.39% are females out of all the binary gender data. There are 400 data points that said “Other” for gender but were not used in this comparison. This discrepancy could be due to males reporting more to the dataset but unlikely.

Although useful in determining various correlations, these features, were not exactly the strongest predictors that we would be using within our model to predict total yearly compensation, and still have minimal effect when compared with years of experience which we closely examine now. When looking at the heat map we created for seeing which features correlate the most towards total yearly compensation, we saw that years of experience has a tremendous effect on salary aside from stock grant value and base salary which are not exactly good metrics for predicting total yearly compensation for someone who is looking for a job within the field.

It was found that as years of experience increases so does compensation but only to a point. After about 10 years of experience the compensation begins to taper off having about the same compensation as 5-10 years of experience.

And when looking at the distribution for years of experience, most data points were within 3-6 years.

Overall, it was useful to see exactly how the data was dispersed and to notice what features are going to play a major role in predicting total yearly compensation. Utilizing a heatmap allowed more insight as to how each specific feature was closely related to another. The features talked about above were not the only ones analyzed within our exploratory analysis but felt that these were the big obvious questions when tackling a large dataset. We also looked at job title vs total yearly compensation and comparing how much just a few of the well-known fortune 500 companies are paying their employees. Now we have a much better idea as to what our model may look like, by using the trends we found from exploring the data. We plan to continue doing more exploratory analysis to pinpoint which features we might specifically want within our model, but for now it was good to get an overall understanding of the data set.

How did we predict total yearly compensation?¶

After completing our exploratory analysis, we went on to creating a baseline test for our analysis to be compared to. If the model(s) we created could outperform the baseline test by a noteworthy amount, then we can consider it a more instructive model than some simple analysis. To do so, we first split up our data into three different sets of training data that we would build the model on, validation where we would run our preliminary tests on, and testing data to test our prediction model. We reserved 20% of the data for testing, 20% for validation, and the remaining 60% for training. This is the generic recommended data split for data sets of the medium-small size such as ours (100-10,000,000). The baseline we chose to train and measure our model against was just a simple linear regression of the years of experience that a given employee has on their total yearly compensation (i.e., just using math to estimate a linear relationship between years of experience and earnings). Years of experience was our chosen as the single variable for our baseline because it was found to have the highest correlation with employee earnings. The following are two graphs that show the results of our baseline tests:

This graph provides insight into how wrong the baseline was test was for every number of years of experience that a given employee had. The green lines indicate the baseline model is very good at predicting employee earnings up until about 18 years of experience, where at that point the model starts to (mostly) overestimate the earnings of employees. However, you can clearly see that at a couple of points the residuals spike up to and above $200,000 dollars. This is because the model has no way to predict the earnings of a couple of outliers (those who make way more money than the rest of the sample).

This graphic shows the distribution of the residuals for the model. The reason the X axis is so large when most of the data is around 0 is because there are a couple of observations where the model is vastly underestimating the earnings of a couple people (see the sharp spikes in yearly earnings in the prior graphic). Other than comparing our final results to these two graphs, we took note of a couple of key metrics of performance for our baseline to examine how it performed.

The coefficient of determination (R2) of our baseline on our evaluation was 0.154, meaning that 15.4% of total yearly compensation was predictable from our variable of years of experience.
The mean absolute error (MAE) of our baseline was $80,032.57, meaning that on average, our model missed the true earnings of an employee by that much.
The root mean squared error of our model (RMSE) was $124,982.98. This metric provides insight into how spread out the errors are of the model by putting more weight on outliers.
The mean relative error (MRE) of our model was 70.68%. This means that relative to the true values, the baseline was on average roughly 71% off.

TLDR on our baseline: predicting total yearly compensation based off just years of experience is not very accurate at all.

Next, we moved towards creating an improved linear regression model. First, since we saw from our baseline graphs that the positive relationship between earnings and years of experience looked like it decreased as years of experience increased, we performed a feature transformation to make the relationship quadratic. Next, we added the years an employee had been at a current position to the regression as well. From this point on, since the rest of our variables in our data were categorical, we slowly added more and more dummy variables (which basically just adds to the prediction an amount if a person is apart of that category). As we added these dummy variables, after adding each one we checked if the estimate from that variable was statistically significant. If it was (meaning it helped the model) we kept it, and if it wasn’t, we dropped the variable (some variables that were marginally insignificant we kept however). The variables we tried and dropped were employee race, and a number of different categories for other variables that did not have enough observations to be accurate (for example, we did not include variables for if people worked at companies with less than a few hundred employees).

It is also worth noting that we tried various other regression techniques (support vector regression, decision tree regression, K-nearest neighbor regression, and neural network regression), and scaling our dependent variables in different ways, but the best model consistently was the linear regression, therefore we stuck with that technique that for our final model.

In the end, our linear model included variables for years of experience, years at the company, education, job title, state they are employed in (or in some cases country if they live outside the US), and the name of the company they work for. We also performed a log-transformation on the training y-values because the distribution of wages was found to resemble a power law distribution. Given all this, here are the results of our model when compared to our baselines.

Compared to our baseline, you can see the errors (green lines) hover around 0 much more, and the distribution of our residuals hug very closely to zero. It is still worth noting that while there are a lot of errors are in the 100,000s, but the majority of the errors are between -100,000 and -200,000. That being said, the metrics by which we evaluated our baseline test improved significantly when compared to that baseline, but they were still not the best:

The coefficient of determination (R2) of our final model for predicting our test data was 0.504, compared to the baseline of 0.154, meaning that in our test data the model could accurately predict 50.4% of total yearly compensation.
The mean absolute error (MAE) for our predictions on our test data was $55,817.47 compared to the baseline MAE of $80,032.57. In other words, on average our model was inaccurate by $55,817.47.
Our final model’s prediction’s root mean squared error (RMSE) was $93399.99 compared to the baseline of $124,982.98.
Finally, the mean relative error of our final model (MRE) improved from 70.6% in on our baseline predictions to 30.9% on our final predictions, meaning that relative to the true values the model was trying to predict, the model was on average about 31% off.

All in all, while our model was not incredibly accurate, it was significantly better at the prediction task than a simple model that just estimates employee earnings based on years of experience. More analysis on the errors of our model will be given in a moment, but before diving deeper into that let’s first discuss the coefficients on our model. A nice property about performing a log-transformation on our y-values is that it allows us to explain the predictions in the resulting model in percent terms. Given this, here are a few interesting things to be drawn from our model’s coefficients.

Compared to those who only have an undergraduate degree or less education, the model predicted that those who have master’s degrees have 6% higher earnings, and those who have PhD’s have about 24% higher earnings.
The model found that some job titles were associated with much higher or lower earnings than others. Compared to the software engineers, people who work in sales, product designers, data scientists and management consultants, our model estimated that business analysts, recruiters, marketing workers, mechanical engineers, and human resources employees earn between 20-40% less, regardless of the other variables. On the other hand, product managers were estimated to earn 5% more, and software engineering managers were estimated to earn 25% more.
Significant differences were estimated in the earnings for employees based on where they live. Compared to workers who live in other countries outside the U.S. and a few states, people who work in India and Russia were estimated to earn 95% less. Furthermore, people who work in Spain and Taiwan were expected to earn around 50% less. Conversely, those who work in California, New York and Washington state were estimated to earn between 40-55% more, while those in Colorado, Washington DC, Oregon, New Jersey, Illinois, Texas and Massachusetts were estimated to earn between 20-35% more.
Finally, different companies were associated with different levels of earnings while holding the other variables constant. Compared to workers who are not employed by any of the top 30 companies with the most workers in our dataset, those who work at Netflix were estimated to earn 68% more, workers at Google, Uber, Twitter and Facebook were found to earn about 35-50% more, those who work at Adobe, Nvidia, Amazon were found to earn between 20-30% more, and those who work at Microsoft, Wayfair and Shopify were found to earn about 10-20% more.
If you would like to see the exact coefficients on all the variables in our model, please check out our notebooks showing our work.

While these findings may be interesting, it needs to be said that they are taken from a model that is both not very accurate at predicting wages in general (remember how the R2 of the model was only around 50%) and that the model only considers a handful of companies, states, and countries in these observations. In other words, it comes from a not entirely accurate model that restricts the data in weird ways. For example, our model saying that those who live in Washington earn 44% more is kind of an odd statement to make when the group they are being compared to is a conglomerate of those who live in either the states of California, Colorado, Illinois, Massachusetts, New Jersey, or 6 other chosen states, and those who don’t live in India, Israel, Spain, Sweden, Taiwan, Canada, Switzerland, or Poland. However, the findings are still interesting.

Where Did Our Model Fail¶

As previously mentioned, though our model did beat our baselines, its predictions were still somewhat disappointing. Naturally that lead us to question, what went wrong? From what we found the most apparent cause of large error came from people making a large total yearly compensation. Plotting the relation of our model’s magnitude of residuals over total yearly compensation results in a clear positive correlation…

As total yearly compensation gets larger, typically so does the magnitude of our residuals. The exact cause for this is currently unknown, however a likely cause comes from our lack of data with high total yearly compensation and few outliers heavily effecting our model.

Among our post-model-exploratory-analysis we found yet another correlation between higher total yearly compensation and larger residuals. Comparing the Residuals Over Job Title plot with the Total Yearly Compensation Over Job Title plot we see a striking resemblance between the two of them. Basically, this means that that job titles which earn a higher total yearly compensation are also harder for our model to predict accurately. Looking at highest earners It turns out that our model is significantly worse at predicting the total yearly compensation of Software Engineering Managers than any other job title. This plays right into our theory of the disproportionate effects of outliers with very large total yearly compensation. Because while looking at the “Software Engineering Manager” column in the Total Yearly Compensation Over Job Title plot, we can see that by far Software engineering managers have the highest median values in the dataset, and the highest outliers. Since there’s only a small number of datapoints with the title “Software engineering manager” and yet a large variance among them, it comes to reason that our model just can’t predict accurately in that situation.

Predicting STEM Salary's: Total Yearly Confusion¶

Links To The Notebooks¶

Introduction¶

Exploratory Analysis: What are the general trends with STEM employee earnings?¶

How did we predict total yearly compensation?¶

Where Did Our Model Fail¶

Conclusion: Earnings are all over the place¶