Our project predicts NBA players' first-year salaries from their college stats, covering players who came out of college between 1984 and 2019.
We got our data from several CSV files and combined them into one. The datasets cover NBA salaries, players and the year they joined the league, and players' college stats.
To combine the datasets, we merged them on the name column that appeared in all of them. Because most players appeared multiple times (once per salary), we dropped every row that was not from a player's first year in the league. After that, we went through the columns and dropped the ones we deemed unnecessary, and we removed some outliers whose salaries were so low that they were unusable.
Our dataset started at 14,430 rows by 62 columns. After dropping the columns and outliers we didn't need, we had 1,857 rows and 18 columns.
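The merge and first-year filter described above could look roughly like the sketch below. It is only an illustration: the column names (`name`, `season`, `year_start`, `college_ppg`), the toy rows, and the `MIN_SALARY` cutoff are all assumptions, not the names or values from our notebook.

```python
import pandas as pd

# Toy stand-ins for the three source tables; all names and values are made up.
salaries = pd.DataFrame({
    "name": ["Player A", "Player A", "Player B", "Player B", "Player C"],
    "season": [2011, 2012, 2015, 2016, 2018],
    "salary": [1_200_000, 1_300_000, 900_000, 950_000, 2_500],
})
players = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "year_start": [2011, 2015, 2018],
})
college = pd.DataFrame({
    "name": ["Player A", "Player B", "Player C"],
    "college_ppg": [15.2, 9.8, 12.1],
})

# Merge all three tables on the shared name column (inner join by default).
merged = salaries.merge(players, on="name").merge(college, on="name")

# Keep only the salary row from each player's first season in the league.
rookies = merged[merged["season"] == merged["year_start"]]

# Drop salaries too low to be usable; this cutoff is an illustrative guess.
MIN_SALARY = 10_000
rookies = rookies[rookies["salary"] >= MIN_SALARY].reset_index(drop=True)

print(rookies)
```

Merging on a name alone can collide when two players share a name, so a second key such as the starting year may be worth adding if the data has one.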
Link to preprocessing: notebook
We're planning to predict an NBA player's rookie-year salary from their college statistics. We'll be using features such as the year they played, what college they went to, and average points per game. We don't expect super high accuracy, since there are many other factors that determine a player's salary. A player's statistics don't capture how much value they bring to a team, and factors such as the salary cap, promise, and potential don't show up in stat sheets. Another major concern is how much the game of basketball has changed: 20 points a game in 1980 does not mean the same thing as 20 points a game today. We're hoping to achieve a decent accuracy score, but All-Stars will likely be our biggest issue, because their salaries are so large that there is no real linear relationship between their stats and their pay. The same applies to bench players, who only get a couple of minutes a game but still make a million a year. Overall, we expect to predict the average NBA player accurately, but we expect some trouble with the outliers.
Predicting first-year NBA salaries turned out to be more challenging than we expected. When we looked at our combined dataset, we noticed that some players had NaN in their college stats even though they had a first-year NBA salary. The cause was that these players came from overseas, so they had no college stats. Another catch was that some players before 2006 had no college stats without being from overseas, because they had gone straight from high school to the NBA. Luckily this stopped in 2006, when the commissioner called for a higher age limit. We also came across players with a 0 in some of their college stat columns, which is explained by their positions. For example, a center is unlikely to shoot any threes, since that is not a shot players of that size typically take. This led to most centers having 0s in three-pointers attempted and three-pointers made.
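The difference between a missing stat (NaN) and a real zero matters when cleaning. Below is a minimal sketch of one way to separate the two cases. The column names and rows are hypothetical, and it assumes that players with no college record at all are set aside, which is only one possible choice.

```python
import numpy as np
import pandas as pd

# Made-up rows: a guard, a center with no three-point attempts,
# and an overseas player with no college record.
df = pd.DataFrame({
    "name": ["College guard", "College center", "Overseas player"],
    "position": ["G", "C", "F"],
    "college_ppg": [14.0, 11.5, np.nan],
    "three_pt_attempts": [5.1, 0.0, np.nan],
    "salary": [1_500_000, 1_100_000, 1_800_000],
})

stat_cols = ["college_ppg", "three_pt_attempts"]

# True only when every college stat is missing, i.e. no college record at all.
no_college = df[stat_cols].isna().all(axis=1)

# A zero is a real measurement (a center who never shot threes), so it is kept.
with_college = df[~no_college]

print(with_college)
```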
We also had to factor in inflation over the years; otherwise we would have no explanation for why players are paid more and more as the years go by. Take Steph Curry as an example: players like him change the game of basketball and earn millions, and the NBPA (National Basketball Players Association) represents its members well, which gives players a chance to push for higher salaries. Higher salaries lead to increased productivity, and the cycle continues. Salaries also grew because general managers who saw a player with great potential poured money into that player to keep them in the organization. To keep that player happy, or to attract other players, the first-year salary had to be enticing enough. Other teams then did the same, which pushed salaries higher and higher.
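One simple way to put salaries from different eras on a common scale is to express each salary relative to the typical rookie salary of the same season. The sketch below shows that idea. It is an assumption about how such an adjustment could be done, not a record of what we actually did, and the numbers are toy values.

```python
import pandas as pd

# Toy salaries from two seasons far apart; values are invented for illustration.
df = pd.DataFrame({
    "season": [1990, 1990, 1990, 2015, 2015, 2015],
    "salary": [200_000, 400_000, 600_000, 1_000_000, 2_000_000, 3_000_000],
})

# Median rookie salary of each season, aligned back to every row.
season_median = df.groupby("season")["salary"].transform("median")

# A value of 1.0 means "paid like a typical rookie of that season".
df["relative_salary"] = df["salary"] / season_median

print(df)
```

In this toy data the two seasons end up on the same scale even though the raw dollar amounts differ by a factor of five.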
While working on this project we realized that the NBA is always changing, and that teams value different things as time goes on. Twenty years ago, being big and a dominant post player might have been the most important trait; eight years later, the most important thing was being able to shoot the three, and that playstyle has stayed in the NBA to this day. This caused certain stats to be valued more, and as players put up bigger numbers in those stats they had to be paid more, pushing salaries up.
We tried multiple approaches to training our model. For categorical features such as college, position, and season, we used a LabelEncoder to turn them into numbers. One of our biggest issues was that the distribution of salaries is heavily skewed, roughly exponential. We applied a log transform to the salary, but the distribution was still far from normal, which made a linear regression model a poor fit. Our dataset also had nothing about teams' salary caps, which play a huge role in a player's salary. From there we tried other approaches such as LASSO regression and K-fold cross-validation. We eventually ended up using a neural network.
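The encoding, log transform, and LASSO with K-fold cross-validation steps could be sketched as follows. The data is synthetic, and the column names, the `alpha` value, and the number of folds are assumptions chosen for illustration rather than our actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import LabelEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the cleaned dataset; every value here is made up.
df = pd.DataFrame({
    "college": rng.choice(["College X", "College Y", "College Z"], size=n),
    "position": rng.choice(["G", "F", "C"], size=n),
    "ppg": rng.uniform(2, 25, size=n),
})
# Right-skewed salary, so the log transform has something to fix.
df["salary"] = np.exp(13 + 0.08 * df["ppg"] + rng.normal(0, 0.3, size=n))

# LabelEncoder maps each category to an integer code.
for col in ["college", "position"]:
    df[col + "_code"] = LabelEncoder().fit_transform(df[col])

# Log-transform the target to reduce the skew.
y_log = np.log(df["salary"])
X = df[["college_code", "position_code", "ppg"]]

# Score a LASSO model with 5-fold cross-validation (R^2 per fold).
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(Lasso(alpha=0.01), X, y_log, cv=cv, scoring="r2")

print("mean R^2 across folds:", scores.mean())
```

Integer codes imply an ordering between colleges or positions that does not really exist, so one-hot encoding may be worth comparing for linear models.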
Another issue we've been trying to work out is that scoring hasn't been consistent over the last 40 years. Some positions obviously score more than others, and players who drop 40 points a game shouldn't have their scores squeezed into a normal distribution, which would put them at a disadvantage. We tried multiple scalers on our X vectors, such as StandardScaler, which was very sensitive to outliers, and normalization, but both lowered our overall accuracy.
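To see why a single high scorer distorts scaling, here is a small sketch with made-up points-per-game values. It assumes that "normalization" means min-max scaling (scikit-learn's `MinMaxScaler`), which may not be exactly what we used.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four typical scorers and one 40-point outlier (toy values).
ppg = np.array([[8.0], [10.0], [12.0], [14.0], [40.0]])

# StandardScaler: subtract the mean, divide by the standard deviation.
# The outlier inflates both, which compresses the typical players together.
standard = StandardScaler().fit_transform(ppg)

# MinMaxScaler: map the minimum to 0 and the maximum to 1.
# The outlier becomes 1.0 and everyone else is squeezed near 0.
minmax = MinMaxScaler().fit_transform(ppg)

print("standard:", standard.ravel().round(3))
print("min-max: ", minmax.ravel().round(3))
```

With both scalers, the four typical players end up much closer to each other than to the outlier, which is the sensitivity described above.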
Our findings suggest that you can roughly predict how much a player will earn in their first year in the NBA. For example, our model predicted that Jimmy Butler would get 2.35 million, and he ended up getting 2.6 million in his first year. Another example is Isaiah Thomas: the model predicted 2 million, and he ended up getting 1.6 million. This shows that we can somewhat accurately predict what a player will make in their first year. Our greatest overestimate was for James Collins, and our greatest underestimate was for Brandon Armstrong.
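Over- and underestimates like these can be found by subtracting the actual salary from the prediction and looking at the extremes. The sketch below uses only the two players quoted above, so its "largest" errors apply to this two-row table and not to the full dataset. The column names are hypothetical.

```python
import pandas as pd

# Only the two examples quoted in the text, in millions of dollars.
results = pd.DataFrame({
    "player": ["Jimmy Butler", "Isaiah Thomas"],
    "predicted_millions": [2.35, 2.0],
    "actual_millions": [2.6, 1.6],
})

# Positive error = overestimate, negative error = underestimate.
results["error_millions"] = (
    results["predicted_millions"] - results["actual_millions"]
)

largest_over = results.loc[results["error_millions"].idxmax(), "player"]
largest_under = results.loc[results["error_millions"].idxmin(), "player"]

print(results)
print("largest overestimate in this table:", largest_over)
print("largest underestimate in this table:", largest_under)
```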
One of our biggest issues was that our dataset spans about 40 years, and the game of basketball is forever changing. Another was that our dataset didn't include crucial statistics such as blocks, assists, steals, and rebounds. Some players are more team-oriented and help in more ways than just scoring. If we had confined our dataset to a single year and used NBA statistics, our accuracy would have been a lot better; because there is a huge gap between college stats and NBA salaries, thousands of variables play into how much a player gets paid. We tried our best to account for how the game has changed, but with our limited dataset it was difficult to find a linear relationship between college stats and salary. Outliers gave us a lot of problems, and with more work we could have figured out a way to predict them.
Our project gave us interesting results, since we are able to somewhat predict a player's rookie-year salary. Although many factors came up that we had not thought about beforehand, we were able to work past them. The accuracy we reached on each set was: Train=47.7%, Validate=64.3%, Test=65.0%. These scores are big improvements over where we started: they increased by about 25%, and we brought our Mean Squared Error down from over 1 million to 0.23, which blew our minds. We have definitely exceeded our expectations with this project and learned a lot from working together.
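For reference, a train/validate/test evaluation of a small neural network could be set up as in the sketch below. Several things here are assumptions: the data is synthetic, the 60/20/20 split and the network size are guesses, scikit-learn's `MLPRegressor` stands in for whatever network we used, and "accuracy" is taken to mean the R^2 score.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic features and a log-salary-like target; all values are made up.
X = rng.uniform(0, 25, size=(500, 3))
y = 13 + 0.08 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 0.3, size=500)

# 60% train, then split the remaining 40% evenly into validate and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

# Small network; inputs are scaled because MLPs train poorly on raw ranges.
model = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(32,), solver="lbfgs",
                 max_iter=1000, random_state=0),
)
model.fit(X_train, y_train)

# Report R^2 and mean squared error on each split.
splits = {
    "train": (X_train, y_train),
    "validate": (X_val, y_val),
    "test": (X_test, y_test),
}
scores = {}
for split_name, (X_part, y_part) in splits.items():
    pred = model.predict(X_part)
    scores[split_name] = (r2_score(y_part, pred),
                          mean_squared_error(y_part, pred))
    print(split_name, scores[split_name])
```

Note that a Mean Squared Error computed on log-salary is on a very different scale from one computed on raw dollars, so the two should not be compared directly.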