The goal of our project was to predict the amount of daily precipitation in Seattle using a regression machine learning model. When brainstorming ideas for the project, we wanted to delve deeper into some of our analyses from Lab 2, as we all had fun exploring the data from that lab in different ways. Since we all are from the Seattle area, we wanted to make a model that could predict precipitation so that we could apply machine learning towards our shared experiences.
While weather prediction and forecasting already exist, we wanted to try building our own model trained specifically on Seattle. Because precipitation varies from location to location and is affected by local geography, it's one of the more difficult aspects of weather to predict, compared to temperature and other variables. Our model has a smaller scope than modern weather models, which can produce forecasts 5 days in advance with 90% accuracy. Most weather forecasts predict the chance of rain, rather than the amount of rain the area might experience in a day. According to the National Weather Service, the "chance of rain" used in weather forecasting means the chance of precipitation occurring at any point in a given area; our model is instead trained to predict the daily water-equivalent amount of precipitation (including frozen precipitation), in inches to the hundredth.
Our model is trained on NOAA's Local Climatological Data (LCD), gathered from SeaTac Airport for the years 2010 to 2020. Originally, the dataset had 144,259 rows and 126 columns; after extracting the data we needed (daily summaries rather than hourly observations), it had 3,873 rows and 19 columns. We also collected data from January 2021 through November 2021 to use as a secondary test set for our model.
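The daily-summary extraction above can be sketched as follows. This is a minimal illustration on a tiny hypothetical frame, not our actual preprocessing code; the real LCD export has 126 columns, and the column names here ("DailyPrecipitation", "HourlyDryBulbTemperature") are just representative of the LCD naming scheme, where daily-summary fields are populated on only one row per day.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the hourly LCD export. In the real file, "Daily*"
# columns are filled in on a single summary row per day and are NaN on
# the ordinary hourly-observation rows.
raw = pd.DataFrame({
    "DATE": ["2010-01-01T07:53", "2010-01-01T23:59", "2010-01-02T23:59"],
    "HourlyDryBulbTemperature": [41.0, np.nan, np.nan],
    "DailyPrecipitation": [np.nan, 0.12, 0.0],
})

# Keep only the rows that carry daily summaries, then keep the date and
# the daily-summary columns, dropping the hourly ones.
daily = raw.dropna(subset=["DailyPrecipitation"])
daily = daily[[c for c in daily.columns if c.startswith(("DATE", "Daily"))]]
print(len(daily))  # 2 daily rows survive
```

Applied to the full eleven-year export, this kind of filter is what collapses 144,259 hourly rows down to one summary row per day.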
Our baseline prediction was the mean amount of daily precipitation. After creating a dummy baseline model and running it through our exploratory evaluation environment, this resulted on our validation set in an R² score of -0.11%, a mean absolute error of 0.1636, and a root mean squared error of 0.56. By calculating relative error, we identified which model had the smallest gap between training and validation error.
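A mean-predicting baseline like this can be built with scikit-learn's DummyRegressor. The sketch below uses synthetic precipitation-like data (a gamma distribution, chosen only to mimic many small values and a few large ones), not our actual dataset, so the printed metrics won't match the numbers above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
# Synthetic stand-in for daily precipitation in inches: mostly near zero.
y_train = rng.gamma(shape=0.3, scale=0.3, size=500)
y_val = rng.gamma(shape=0.3, scale=0.3, size=100)
X_train = rng.normal(size=(500, 3))
X_val = rng.normal(size=(100, 3))

# Always predicts the training-set mean, regardless of the inputs.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_val)

r2 = baseline.score(X_val, y_val)              # R²; at or below 0 for a mean predictor
mae = mean_absolute_error(y_val, pred)
rmse = mean_squared_error(y_val, pred) ** 0.5  # root mean squared error
print(r2, mae, rmse)
```

Any real model then has to beat this constant predictor to be worth keeping.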
For our exploratory analysis, we first graphed a pairplot of all the variables against daily precipitation, and saw trends in most of the columns. Below are scatterplots of the daily average dry bulb temperature, relative humidity, sea level pressure, station pressure, and wet bulb temperature, all plotted against daily precipitation. These graphs vaguely resemble skewed Gaussian distributions, and indicated to us that we were on the right track.
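A pairplot like the one described can be produced with seaborn. The frame below is a small synthetic stand-in; the column names are merely suggestive of the LCD daily fields and are not our exact columns.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import seaborn as sns

rng = np.random.default_rng(1)
# Hypothetical daily summary frame; real column names come from the LCD data.
df = pd.DataFrame({
    "DailyAverageRelativeHumidity": rng.uniform(30, 100, 200),
    "DailyAverageStationPressure": rng.normal(29.5, 0.3, 200),
    "DailyPrecipitation": rng.gamma(0.3, 0.3, 200),
})

# One row of scatterplots: each candidate predictor vs. precipitation.
grid = sns.pairplot(df, y_vars=["DailyPrecipitation"],
                    x_vars=["DailyAverageRelativeHumidity",
                            "DailyAverageStationPressure"])
grid.savefig("pairplot.png")
```

Restricting `y_vars` to the response keeps the figure to a single row instead of the full all-pairs grid.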
After we created our model, we graphed the absolute values of the linear regression coefficients for each of our input features, in order to compare the strength of the relationships between our response variable (precipitation) and our predictor variables. According to our bar chart, station pressure and sea level pressure have the highest impacts on daily precipitation. Wet bulb temperature, dew point temperature, and daily snowfall had around the same weights, followed by snow depth and relative humidity. Finally, wind speed, departure from average temperature, and maximum dry bulb temperature had very slight impacts. Other than these 10 features, the rest appear to have negligible coefficients in our model.
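Ranking features by the magnitude of their fitted coefficients can be sketched as below. The data and feature names are synthetic (the toy response is built so the two pressure features dominate, loosely mirroring our finding); note that comparing raw coefficient magnitudes is only meaningful when the features are on comparable scales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
features = ["StationPressure", "SeaLevelPressure", "WetBulbTemp"]  # illustrative
X = rng.normal(size=(300, 3))
# Toy response: driven mostly by the pressure columns, plus a little noise.
y = -0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.1 * X[:, 2] + rng.normal(0, 0.1, 300)

model = LinearRegression().fit(X, y)
weights = np.abs(model.coef_)
ranking = [features[i] for i in np.argsort(weights)[::-1]]
print(ranking)  # pressure features rank first in this toy setup
```

The same `weights` array is what we fed into the bar chart.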
Next, we used PCA to graph how much of the variance in our data is explained as more features are included. According to this plot, the cumulative explained variance plateaus at around 10 features. Based on these two graphs, we opted to train our model on the ten columns from our dataset listed above, with the highest absolute coefficients.
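The explained-variance analysis can be sketched with scikit-learn's PCA. The data here is synthetic, constructed to have roughly 10 underlying directions of variation spread across 19 columns (mimicking our dataset's shape); the 0.99 cutoff is just an illustrative threshold, not the one we used.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# 19 observed columns driven by ~10 latent directions, plus tiny noise.
latent = rng.normal(size=(500, 10))
mixing = rng.normal(size=(10, 19))
X = latent @ mixing + rng.normal(0, 0.01, size=(500, 19))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumulative = np.cumsum(pca.explained_variance_ratio_)
# Cumulative explained variance plateaus once the latent directions are covered.
n_components = int(np.searchsorted(cumulative, 0.99) + 1)
print(n_components)
```

Plotting `cumulative` against component count gives the elbow-style curve we used to pick ten features.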
We made a regression model zoo to compare 12 different models, using strategies such as dropping columns as determined above, scaling, hyperparameter tuning, and polynomial feature expansion to reach the best model.
After all of these steps, we determined that the best model was LinearRegression(), using the top 10 columns and polynomial feature expansion.
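The winning configuration can be sketched as a scikit-learn pipeline: scaling, degree-2 polynomial feature expansion, then plain linear regression. The data below is synthetic (a toy quadratic relationship on 10 columns, standing in for our selected LCD features), so the printed score is illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(4)
# Toy stand-in for the 10 selected columns from the LCD dataset.
X = rng.normal(size=(800, 10))
# A response with quadratic and interaction structure, which plain linear
# regression misses but a degree-2 expansion can capture.
y = X[:, 0] ** 2 - X[:, 1] + 0.5 * X[:, 2] * X[:, 3] + rng.normal(0, 0.2, 800)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(),
                      PolynomialFeatures(degree=2, include_bias=False),
                      LinearRegression())
model.fit(X_train, y_train)
print(round(model.score(X_val, y_val), 2))  # validation R²
```

Wrapping the steps in a pipeline keeps the scaler and expansion fit only on training data, avoiding leakage into the validation score.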
Our Linear Regression model got an R² score of 0.50 on our first testing set, which was an unseen subset of the data our model was trained on. We also tested our model on data from 2021, a year our model hadn't been trained on, to see if it could use past data to predict "future" results. On the 2021 data, our model's R² score was 0.43, which was unfortunately noticeably lower.
While station pressure and sea level pressure had the strongest relationships with precipitation, our model needed the other input features to perform well. Our findings show that while it rains in Seattle on many days, the average amount of precipitation is fairly low.
Below are scatterplots of our linear regression model's performance on the training and testing sets. The graphs show that our model performed best on the training set, and worst on the 2021 testing set.
Ultimately, we were satisfied with the results of our model on the first testing set, but slightly disappointed that it wasn't as accurate on 2021 data. This implies that our model isn't as useful for data outside of 2010-2020. A warming climate is expected to increase precipitation (according to the EPA), and since the climate is heating at an increasing rate, we can expect increasingly drastic changes in weather features as time goes on.
Exploratory Analysis notebook: Exploratory
Model training notebook: Model Training