In our project we set out to find meaningful correlations between basketball statistics and team win rates. We hoped to be able to predict a team's win rate from a variety of stats. We came across basketball-reference.com, which has NBA team statistics going back more than 50 years.
Here is an example of the data that was available to us (this example shows the 2020 season). The columns, in order, are: Wins, Losses, Win/Loss %, Games Behind, Points Per Game, Opponent Points Per Game, and Simple Rating System.
Another example shows the per-game stats. The columns in this table are: Games Played, Minutes Played, Field Goals, Field Goal Attempts, Field Goal %, 3 Pointers, 3 Point Attempts, 3 Point %, 2 Pointers, 2 Point Attempts, 2 Point %, Free Throws, Free Throw Attempts, Free Throw %, Offensive Rebounds, Defensive Rebounds, Total Rebounds, Assists, Steals, Blocks, Turnovers, Personal Fouls, and Points.
The per-game stats could be exported as a CSV file, but we needed more than one year's worth of stats for our project, so we had to scrape the website ourselves to collect our data.
We wrote a Python notebook to scrape 20 years' worth of conference standings and per-game stats and combined them into a single CSV file, which ended up with around 600 rows of data covering 20 seasons of NBA games and roughly 30 columns of different statistics to analyze.
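A minimal sketch of that kind of scraping with pandas is shown below. The season range, table positions, and output filename are illustrative assumptions rather than our exact notebook code, and some tables on basketball-reference.com are embedded in HTML comments that pandas.read_html alone will not pick up.

```python
import pandas as pd

frames = []
for season in range(2001, 2021):
    # Each season summary page holds the standings and team per-game tables.
    url = f"https://www.basketball-reference.com/leagues/NBA_{season}.html"
    tables = pd.read_html(url)        # parses every plain-HTML table on the page
    season_df = tables[0].copy()      # assume the first table is the standings
    season_df["Season"] = season
    frames.append(season_df)

# Stack all seasons into one table and save it for later analysis.
all_seasons = pd.concat(frames, ignore_index=True)
all_seasons.to_csv("nba_team_stats_20_years.csv", index=False)
```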
We now had a CSV file filled with all the information above that we could perform analysis on.
With all this data available to us, we decided a good place to start was to see which statistics correlated the most with Win/Loss percentage.
As you can see, some obvious stats correlate strongly with Win/Loss %, such as Wins, Losses, Strength of Schedule, and others.
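Computing those correlations with pandas is nearly a one-liner; a sketch, assuming the CSV and column names described above (the exact filename is hypothetical):

```python
import pandas as pd

df = pd.read_csv("nba_team_stats_20_years.csv")

# Correlation of every numeric column with Win/Loss percentage,
# sorted from strongest positive to strongest negative.
correlations = df.corr(numeric_only=True)["W/L%"].sort_values(ascending=False)
print(correlations)
```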
We then trained a linear regression model using every available statistic, but we found that the model had near-perfect accuracy; most likely too many features were directly correlated with Win/Loss %, which significantly skewed our model.
We tested many subsets of columns to find a model that would make for good analysis, and we settled on these columns: W/L%, FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, PTS, ORB, DRB, TRB, STL, TOV.
Using these stats we were able to train a linear regression model with around 81% accuracy and predict Win/Loss percentage from the statistics above.
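A sketch of that model with scikit-learn is below. The feature names follow the list above, but the train/validation/test split sizes and random seed are assumptions, not necessarily what we used.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

df = pd.read_csv("nba_team_stats_20_years.csv")

features = ["FG", "FGA", "3P", "3PA", "2P", "2PA", "FT", "FTA",
            "PTS", "ORB", "DRB", "TRB", "STL", "TOV"]
X = df[features]
y = df["W/L%"]

# Hold out validation and test sets (the split sizes here are assumed).
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_train, y_train))  # R^2 on the training data, reported as ~81% above
```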
We wanted to make sure our residuals (the difference between our predicted values and the actual values) did not follow any consistent trend with respect to our predictions, so we plotted them.
We also plotted our residuals with respect to Win/Loss % and found that they did not follow a trend either.
Pictured below is a graph of our predicted values versus the actual values in our training data; you can clearly see that they follow a linear trend.
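All three diagnostic plots can be produced with matplotlib, continuing from the fitted model in the sketch above:

```python
import matplotlib.pyplot as plt

preds = model.predict(X_train)
residuals = y_train - preds

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Residuals vs predictions: no visible pattern suggests a linear fit is reasonable.
axes[0].scatter(preds, residuals, alpha=0.5)
axes[0].axhline(0, color="red")
axes[0].set(xlabel="Predicted W/L%", ylabel="Residual")

# Residuals vs actual Win/Loss %.
axes[1].scatter(y_train, residuals, alpha=0.5)
axes[1].axhline(0, color="red")
axes[1].set(xlabel="Actual W/L%", ylabel="Residual")

# Predicted vs actual: points hugging the diagonal indicate a good fit.
axes[2].scatter(y_train, preds, alpha=0.5)
axes[2].plot([0, 1], [0, 1], color="red")
axes[2].set(xlabel="Actual W/L%", ylabel="Predicted W/L%")

plt.tight_layout()
plt.show()
```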
We had a training model that passed our tests, so we could now see which factors (coefficients) in the model had the most significant effect on Win/Loss %.
The picture above clearly shows that PTS (points) had the largest influence on a team's win rate. The second-largest factor was 3P (3 pointers made), and the third most influential was FG (field goals made). It's easy to see that points scored in a variety of ways influenced a team's win rate the most.
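A coefficient ranking like the one pictured can be pulled straight from the fitted model in the earlier sketch:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Pair each feature with its fitted coefficient and rank by absolute size.
coefs = pd.Series(model.coef_, index=features).sort_values(key=abs, ascending=False)
print(coefs)

coefs.plot.barh()  # horizontal bar chart of the coefficient sizes
plt.xlabel("Coefficient")
plt.tight_layout()
plt.show()
```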
It's clear that PTS has the largest influence on win rate and dwarfs the other variables in comparison, but what would happen if we dropped PTS from our training model?
Surprisingly, our model does not lose much accuracy at all, with our training score staying at around 81%.
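Dropping the column and refitting is a small change to the earlier sketch:

```python
# Refit the same model with PTS removed from the feature set.
features_no_pts = [f for f in features if f != "PTS"]
model_no_pts = LinearRegression().fit(X_train[features_no_pts], y_train)

# Training R^2 barely moves even without PTS (reported as staying around 81% above).
print(model_no_pts.score(X_train[features_no_pts], y_train))
```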
Our residuals still follow no trend and our predicted values are still linear with respect to our actual values, but the biggest change is in our coefficients.
Now that PTS is no longer part of the training data, most of the other variables have increased in influence. Surprisingly, 3P has dropped from the second most influential statistic to nearly the least influential, while 3PA (3 point attempts), FG (field goals), and FGA (field goal attempts) have increased significantly in our graph.
To double-check the accuracy of our new model with PTS removed, we again plotted our residuals, predictions, and W/L % on our validation set.
With PTS missing from our training data, we still had an accuracy of 83% on our validation set, and our residuals did not follow a trend with respect to our predictions or our actual values, pictured below.
Our predictions still followed a linear trend with respect to our actual values, which is a very good sign for our model.
Now let's plot the same variables on our testing set, which had a slightly lower accuracy of 78%.
These results are very similar to our previous graphs on the validation set and still clearly show our model's accuracy.
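The validation and test scores come from the same held-out splits assumed in the earlier sketch:

```python
# R^2 of the PTS-free model on the held-out validation and test sets
# (reported above as roughly 83% and 78%).
print("validation:", model_no_pts.score(X_val[features_no_pts], y_val))
print("test:", model_no_pts.score(X_test[features_no_pts], y_test))
```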
We found a variety of ways to model our data using different subsets of columns, but limiting the number of columns allowed for more interesting analysis, since many of the columns listed earlier had a direct influence on win rate, which would skew our model into seeming more accurate than it actually was. After reducing the number of features, our model became more trustworthy even though it produced a smaller linear regression score, and it let us perform more advanced analysis with residuals to show the relationships between the different factors of basketball that affect win rate.
The most surprising thing to us was that even after getting an accurate training model, we could drop columns that seemed very important (e.g., PTS) and still produce an accurate model.