Fall 2021
The last time I taught CSCI 141, my students did a data-science-flavored final project. Their task was to develop and refine a data question that can be answered using weather records provided by NOAA. For example, my question was:
Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY (where I moved here from)?
The refined version of the question is one that the available data can answer unambiguously. In this case, I framed it like this:
In the months of December and January through March in the year 2020, did Bellingham or Ithaca have more hourly observations where the sky was 7/8ths or more covered by clouds?
My 141 students were tasked with writing a Python program to answer their question - without using any of the fancy libraries we’re using in this class! In this lab, we’ll do a similar thing - answer a bunch of weather questions based on data, but we’ll do it using pandas.
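As a rough illustration of how the refined cloudiness question above could be phrased in pandas, here is a minimal sketch. The rows are made up, the city column is a hypothetical label added for the example, and the code assumes that each cloud layer in HourlySkyConditions looks like BKN:07 25 (a coverage code, a layer amount in eighths, then a height); check the LCD documentation before relying on that format.

```python
import io
import re

import pandas as pd

# Tiny synthetic stand-in for two stations' hourly rows (all values made up).
csv_text = """city,DATE,REPORT_TYPE,HourlySkyConditions
Bellingham,2020-01-05T08:53:00,FM-15,BKN:07 25 OVC:08 40
Bellingham,2020-01-05T09:53:00,FM-15,FEW:02 30
Bellingham,2020-07-04T12:53:00,FM-15,OVC:08 10
Ithaca,2020-02-10T08:53:00,FM-15,OVC:08 15
Ithaca,2020-03-01T09:53:00,FM-15,SCT:04 50
Ithaca,2020-12-24T10:53:00,FM-15,BKN:07 20
"""
obs = pd.read_csv(io.StringIO(csv_text), parse_dates=["DATE"])


def max_oktas(sky):
    """Largest layer amount (in eighths) found in one sky-condition string."""
    if not isinstance(sky, str):
        return 0  # missing value: counted as not cloudy (a judgment call)
    amounts = re.findall(r"[A-Z]+:(\d+)", sky)  # assumed layer format
    return max((int(a) for a in amounts), default=0)


# Keep December and January through March, then count rows at 7/8ths or more.
winter = obs[obs["DATE"].dt.month.isin([12, 1, 2, 3])]
cloudy = winter[winter["HourlySkyConditions"].map(max_oktas) >= 7]
cloudy_counts = cloudy.groupby("city").size()
print(cloudy_counts)
```

Taking the largest layer amount is only one way to read "7/8ths or more covered"; a different reading would change the counts.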
For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.
The meanings of the columns available in the data are spelled out in detail in the LCD Documentation. Spend a few minutes reading through to see which columns might be relevant. Here are a couple specific bits of domain knowledge that may be helpful:
When a question is about temperature, you will most likely want HourlyDryBulbTemperature, DailyMaximumDryBulbTemperature, or some other dry bulb measurement.
Pressure (HourlyStationPressure) and wind speed (HourlyWindSpeed) are measured at the time the observation is taken.
Precipitation columns (e.g., HourlyPrecipitation) give the liquid amount of precipitation that fell during the observation period (e.g., during the hour preceding the observation). This means snow is melted before being measured.
Some columns, such as HourlySkyConditions, have more complicated contents, but you can use them too - you’ll just need to read up on what they mean in the documentation.
This data mixes hourly, daily, and monthly information, so different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.
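Measurement columns like the ones above do not always load as numbers. The sketch below assumes (check this against the LCD documentation) that a value may carry a trailing s flag, or may be a non-numeric marker such as T or *; the sample values and the helper name to_number are made up for illustration.

```python
import pandas as pd

# Made-up raw values, as they might look when read in as text.
raw = pd.DataFrame({
    "HourlyDryBulbTemperature": ["41", "39s", "*", "45"],
    "HourlyPrecipitation": ["0.00", "T", "0.12", None],
})


def to_number(column):
    """Strip a trailing 's' flag, then turn anything non-numeric into NaN."""
    cleaned = column.astype(str).str.replace(r"s$", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")


clean = raw.apply(to_number)
print(clean)
```

Turning markers like T into NaN is a simplification; depending on your question, you might prefer to treat such a value as zero or as some small amount instead.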
A useful way to select the rows that have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
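Selecting the regular hourly observations might look like the following sketch. The rows are invented, SOD is used here only as an example of a non-hourly report type, and the call to str.strip() is a precaution under the assumption that some REPORT_TYPE values may carry stray whitespace.

```python
import pandas as pd

# Made-up rows: two hourly observations and one daily-summary row.
rows = pd.DataFrame({
    "DATE": ["2020-01-01T00:53:00", "2020-01-01T01:53:00", "2020-01-01T23:59:00"],
    "REPORT_TYPE": ["FM-15", "FM-15 ", "SOD  "],
    "HourlyWindSpeed": [5.0, 7.0, None],
    "DailyMaximumDryBulbTemperature": [None, None, 48.0],
})

# Boolean mask: True for the regular hourly observations.
is_hourly = rows["REPORT_TYPE"].str.strip() == "FM-15"
hourly = rows[is_hourly]
print(hourly[["DATE", "HourlyWindSpeed"]])
```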
To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. over the entire year 2020.
You can find the pre-downloaded data files here. The URLs of the files in that directory can be given directly to pd.read_csv to load the data directly into your notebook.
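Here is a minimal sketch of loading one file. Because the actual file URLs are not reproduced in this text, an in-memory buffer with made-up contents (including the station ID) stands in for the file; with a real file you would pass its URL string as the first argument instead. The sketch assumes the timestamp column is named DATE.

```python
import io

import pandas as pd

# Stand-in for one pre-downloaded file; the contents are invented.
sample = io.StringIO(
    "STATION,DATE,REPORT_TYPE,HourlyDryBulbTemperature\n"
    "72797624217,2020-01-01T00:53:00,FM-15,43\n"
    "72797624217,2020-01-01T01:53:00,FM-15,42\n"
)

# parse_dates turns the DATE strings into timestamps; low_memory=False makes
# pandas infer each column's type from the whole file rather than chunk by chunk.
weather = pd.read_csv(sample, parse_dates=["DATE"], low_memory=False)
weather = weather.set_index("DATE")
print(weather.head())
```

Indexing by timestamp is optional, but it makes later selection by month or date range more convenient.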
If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.
If you want to answer a question that requires other weather stations and/or other timeframes, you’ll need to retrieve your own data. Here’s how I downloaded the files that I provided to you:
If your submission relies on data files not included in the pre-downloaded data, upload them to Canvas alongside your notebook. When reading the files in your notebook, your code should assume the CSV files are in the same directory as the notebook; i.e., to load a file called NV_LasVegas.csv, you’d call pd.read_csv("NV_LasVegas.csv").
Please refer to Lab 1 for instructions on how to get the notebook server running, in case you’ve forgotten how; instructions for running on the labs remotely are here.
Use the pre-downloaded 2020 weather data to perform your choice of two of the following pre-defined analyses. Create a separate notebook for each question you address.
Perform one additional analysis that is substantively different from the above (this doesn’t mean you can’t use the same columns, just that you are getting at different questions and, ideally, using different techniques). Feel free to collect additional data beyond what I’ve pre-downloaded; this would, for example, enable investigation of trends over years (e.g., which of the past 10 summers in Bellingham was hottest?). Submit your analysis in a third notebook.
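As one possible shape for a multi-year question like the hottest-summer example, the sketch below groups daily highs by year. The numbers are invented, and "hottest" is defined here as the highest mean daily maximum over June through August, which is only one reasonable choice.

```python
import pandas as pd

# Made-up daily rows spanning several years.
daily = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-07-15", "2018-08-01", "2019-06-20",
        "2019-07-30", "2020-07-04", "2020-12-25",
    ]),
    "DailyMaximumDryBulbTemperature": [78.0, 82.0, 75.0, 77.0, 85.0, 40.0],
})

# Keep June through August, then average the daily highs within each year.
summer = daily[daily["DATE"].dt.month.isin([6, 7, 8])]
mean_high_by_year = (
    summer.groupby(summer["DATE"].dt.year)["DailyMaximumDryBulbTemperature"].mean()
)
hottest_year = mean_high_by_year.idxmax()
print(mean_high_by_year)
print("Hottest summer:", hottest_year)
```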
Perform an additional in-depth analysis of your choice; ideally this will involve multiple data columns, multiple cities, and/or analysis by time or date. More credit will be awarded for analyses that result in insights that are surprising or non-obvious.
Each of your analyses should tell a story, and should do so clearly and convincingly.
If your analyses use any data other than the pre-downloaded files, please submit a zip file containing every data file your notebooks need, except the pre-downloaded ones. Please zip the data even if it’s only one file; this preserves the data filename so your notebook can load it without modification. Be sure that the notebook reads pre-downloaded files from their URLs, and other data files from the same directory as the notebook.
For each notebook, make sure that all your cells have been run and have the output you intended, then download a copy in .ipynb format and submit the resulting file to the Lab 2 assignment on Canvas.
Finally, fill out the Week 2 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.
Each analysis in Part 1 is worth 20 points, while your Part 2 analysis is worth 10 points. For full credit, your analysis must be correct, convincing, and clear.
Extra Credit
Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.