Blog Post

In our project we set out to find meaningful correlations between basketball statistics and team win rates. We hoped to be able to predict a team's win rate from a variety of stats. We came across basketball-reference.com, which has NBA team statistics going back over 50 years.

Here is an example of the data that was available to us (this example shows the 2020 season). The columns, in order, are: Wins, Losses, Win/Loss %, Games Behind, Points Per Game, Opponent Points Per Game, and Simple Rating System.

Another example shows the per game stats. The columns in this table are: Games Played, Minutes Played, Field Goals, Field Goal Attempts, Field Goal %, 3 Points, 3 Point Attempts, 3 Point %, 2 Points, 2 Point Attempts, 2 Point %, Free Throws, Free Throw Attempts, Free Throw %, Offensive Rebounds, Defensive Rebounds, Total Rebounds, Assists, Steals, Blocks, Turnovers, Personal Fouls, and Points.

The per game stats could be exported as a CSV file one season at a time, but we needed more than one year's worth of stats for our project, so we scraped the website ourselves to collect our data.

We wrote a Python notebook to scrape 20 years' worth of conference standings and per game stats, which we combined into a single CSV file. It ended up with around 600 rows of data from 20 seasons of NBA games and around 30 columns of different statistics to analyze.
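The combining step can be sketched roughly as follows. This is a minimal illustration, not our actual notebook code: the function names, the `Season` and `Team` join keys, and every value in the sample rows are made-up placeholders.

```python
import csv
import io


def merge_season_tables(standings_rows, per_game_rows):
    """Join standings and per game rows on the (Season, Team) pair."""
    per_game_by_key = {(r["Season"], r["Team"]): r for r in per_game_rows}
    merged = []
    for row in standings_rows:
        stats = per_game_by_key.get((row["Season"], row["Team"]))
        if stats is None:
            continue  # skip teams that are missing from the per game table
        merged.append({**row, **stats})
    return merged


def write_csv(rows, handle):
    """Write the merged rows to any file-like object as CSV."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


# Placeholder rows standing in for two scraped tables (values are invented).
standings = [
    {"Season": "2020", "Team": "Team A", "W": "50", "L": "22"},
    {"Season": "2020", "Team": "Team B", "W": "30", "L": "42"},
]
per_game = [
    {"Season": "2020", "Team": "Team A", "PTS": "115.0"},
    {"Season": "2020", "Team": "Team B", "PTS": "108.5"},
]

merged = merge_season_tables(standings, per_game)
buffer = io.StringIO()
write_csv(merged, buffer)
csv_text = buffer.getvalue()
```

Repeating this for each season and appending the rows gives one table with a row per team per season.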

We now had a CSV file with all of the information above that we could perform analysis on.

With all this data available to us, we decided a good place to start was to see which statistics correlated the most with Win/Loss %.

Exploratory Findings

As you can see, some obvious stats correlate strongly with Win/Loss %, such as Wins, Losses, Strength of Schedule, and more.
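For readers who want to reproduce this kind of ranking, here is a minimal sketch that computes the Pearson correlation of each column with Win/Loss % and sorts by strength. The helper names and all of the numbers are invented for illustration; they are not values from our dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_by_correlation(columns, target):
    """Return (column, correlation) pairs, strongest absolute value first."""
    target_values = columns[target]
    scores = {
        name: pearson(values, target_values)
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Invented values for four teams; real data would have one entry per row.
columns = {
    "W/L%": [0.70, 0.55, 0.40, 0.30],
    "W": [57.0, 45.0, 33.0, 25.0],
    "PTS": [114.0, 112.0, 109.0, 110.0],
    "TOV": [13.0, 14.5, 14.0, 15.5],
}

ranking = rank_by_correlation(columns, "W/L%")
for name, value in ranking:
    print(f"{name}: {value:+.3f}")
```

Sorting by absolute value keeps strongly negative correlations (such as Losses) near the top alongside the strongly positive ones.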

We then trained a linear regression model on all of our data using every statistic, but found that the model had near perfect accuracy. Most likely, too many features (such as Wins and Losses) were directly tied to Win/Loss %, which made the model look far better than it really was.

We tested many subsets of columns to find a model that would make for good analysis, and we settled on these columns: W/L% (the value we predict), FG, FGA, 3P, 3PA, 2P, 2PA, FT, FTA, PTS, ORB, DRB, TRB, STL, TOV.

Using these stats we were able to train a linear regression model with a score of around 81% that predicts Win/Loss % from the above statistics.
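Below is a minimal sketch of fitting and scoring an ordinary least-squares model, written in plain Python so it runs without extra libraries. It assumes the score is the R² value that libraries such as scikit-learn report for linear regression. The two features (FG and TOV) and every number are made up for illustration, and the helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is not singular, i.e. no feature is an exact
    linear combination of the others.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear_regression(features, targets):
    """Return [intercept, coef_1, ...] from the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(weights, features):
    """Apply the fitted intercept and coefficients to each feature row."""
    return [
        weights[0] + sum(w * v for w, v in zip(weights[1:], f)) for f in features
    ]


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented (FG, TOV) per game values and Win/Loss % for six teams.
features = [(40.1, 14.2), (42.3, 13.1), (38.7, 15.4),
            (41.0, 15.9), (39.5, 12.8), (43.2, 14.0)]
win_pct = [0.52, 0.66, 0.38, 0.49, 0.50, 0.68]

weights = fit_linear_regression(features, win_pct)
training_score = r_squared(win_pct, predict(weights, features))
```

A score of 1.0 means the predictions match the actual values exactly, so a score near 0.81 means the model explains roughly 81% of the variance in Win/Loss %.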

We wanted to make sure our residuals (the difference between our predicted values and the actual values) did not follow any consistent trend with respect to our predictions, so we plotted them.

We also plotted our residuals against Win/Loss % and found that they did not follow a trend either.
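The same check can be done numerically as a complement to the plots. The sketch below computes residuals and the least-squares slope of the residuals against another variable; a slope near zero suggests no linear trend. The helper names and the numbers are invented for illustration.

```python
def residuals(predicted, actual):
    """Residual for each row: predicted value minus actual value."""
    return [p - a for p, a in zip(predicted, actual)]


def trend_slope(xs, ys):
    """Least-squares slope of ys against xs; near zero suggests no trend."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


# Invented predicted and actual Win/Loss % values for five teams.
predicted = [0.62, 0.48, 0.55, 0.35, 0.70]
actual = [0.60, 0.50, 0.52, 0.38, 0.71]

res = residuals(predicted, actual)
slope_vs_predictions = trend_slope(predicted, res)
slope_vs_actual = trend_slope(actual, res)
```

One caveat: for a least-squares model with an intercept, the training residuals are uncorrelated with the fitted values by construction, so this check is most informative on data the model was not trained on.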

Pictured below is a graph of our predicted values against the actual values in our training data. You can clearly see that they follow a linear trend.

We now had a trained model that passed our tests, so we could see which factors (coefficients) in the model had the most significant effect on Win/Loss %.

The above picture clearly shows that PTS (Points) had the largest influence on a team's win rate. The second largest factor was 3P (3 Pointers Scored) and the third most influential factor was FG (Field Goals Scored). It's easy to see that points scored in a variety of ways influenced a team's win rate the most.

Removing PTS from our training model

It's clear that PTS has the largest influence on win rate and that it dwarfs the other variables in comparison. But what would happen if we dropped PTS from our training model?

Surprisingly, our model does not lose much accuracy at all, with our training score staying at around 81%.
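A plausible explanation, offered here as an assumption rather than something we tested directly, is that PTS carries little information the other columns do not already contain. By the scoring rules, points equal two per 2 pointer, three per 3 pointer, and one per free throw, and field goals are the sum of 2 pointers and 3 pointers. The sketch below checks those identities on an invented per game stat line.

```python
def points_from_components(two_pointers, three_pointers, free_throws):
    """Rebuild points from made shots: 2 per 2P, 3 per 3P, 1 per FT."""
    return 2 * two_pointers + 3 * three_pointers + free_throws


# Invented per game stat line for one team (not taken from our dataset).
team = {"2P": 29.0, "3P": 12.0, "FT": 18.0, "FG": 41.0, "PTS": 112.0}

rebuilt_pts = points_from_components(team["2P"], team["3P"], team["FT"])
rebuilt_fg = team["2P"] + team["3P"]

print(rebuilt_pts == team["PTS"])  # PTS is recoverable from 2P, 3P and FT
print(rebuilt_fg == team["FG"])    # FG is recoverable from 2P and 3P
```

When one feature is (nearly) a linear combination of others, a linear model can reach the same predictions with many different sets of coefficients, so removing that feature mostly redistributes weight across the remaining columns rather than reducing the score.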

Our residuals still follow no trend and our predicted values are still linear with respect to our actual values, but the biggest change is to our coefficients.

Now that PTS is no longer part of the training data, most of the other variables have increased in influence. Surprisingly, 3P has dropped from the second most influential statistic to almost the least influential. Meanwhile 3PA (3 Point Attempts), FG (Field Goals), and FGA (Field Goal Attempts) have increased significantly in our graph.

To double check the accuracy of our new model with PTS removed, we again plotted our residuals, predictions, and W/L % on our validation set.
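For readers who want to reproduce a similar setup, here is a minimal sketch of dividing rows into training, validation, and test sets. The fractions and the seed are illustrative assumptions rather than the exact values from our notebook, and `split_rows` is a hypothetical helper name.

```python
import random


def split_rows(rows, validation_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle a copy of the rows and cut it into train/validation/test."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


# Stand-in for roughly 600 team-season rows; integers keep the example short.
rows = list(range(600))
train, validation, test = split_rows(rows)
print(len(train), len(validation), len(test))
```

The model is fitted on the training rows only; the validation rows guide choices such as which columns to keep, and the test rows are held back for a final check.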

With PTS missing from our training data, we still had an accuracy of 83% on our validation set, and our residuals did not follow a trend with respect to our predictions or our actual values, as pictured below.

Our predictions still followed a linear trend with respect to our actual values, which is a very good sign for our model.

Plotting results on test set

Now let's plot the same variables as above on our test set, which had a slightly lower accuracy of 78%.

These results are very similar to the previous graphs for our validation set and still clearly show our model's accuracy.

Final Thoughts

We found a variety of ways to model our data using different subsets of columns. Using a more limited set of columns let us do more interesting analysis, because most of the columns we started with had a direct influence on win rate, which skewed our model into seeming more accurate than it actually was. After reducing the number of features, we believe our model became more meaningful even though it produced a lower linear regression score. This allowed us to use residuals to check the model and to examine the relationship between the different factors of basketball that affected win rate.

The most surprising thing to us was that even after getting an accurate training model, we could drop columns that seemed very important (e.g. PTS) and still produce an accurate model.

Andrew Conrad

Van Mason
