DATA 311 - Lab 2: Answering Weather Questions

Scott Wehrwein

Fall 2021

Introduction

The last time I taught CSCI 141, my students did a data-science-flavored final project. Their task was to develop and refine a data question that can be answered using weather records provided by NOAA. For example, my question was:

Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY (where I moved here from)?

The refined version of the question is one that can be answered unambiguously from the available data. In this case, I framed it like this:

In the months of December and January through March in the year 2020, did Bellingham or Ithaca have more hourly observations where the sky was 7/8ths or more covered by clouds?

My 141 students were tasked with writing a Python program to answer their question - without using any of the fancy libraries we’re using in this class! In this lab, we’ll do a similar thing - answer a bunch of weather questions based on data, but we’ll do it using pandas.

Collaboration Policy

For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

The Data

For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.

The meanings of the columns available in the data are spelled out in detail in the LCD Documentation. Spend a few minutes reading through to see which columns might be relevant. Here are a couple specific bits of domain knowledge that may be helpful:

Temperature is measured in two ways: with a “dry bulb” and a “wet bulb”. These two measurements together allow for the calculation of dewpoint and humidity. The regular air temperature, as reported by your favorite weather app or website, is the dry bulb temperature. So if you want to answer a question about air temperature, you’ll probably want to look at HourlyDryBulbTemperature, DailyMaximumDryBulbTemperature, or some other dry bulb measurement.
Quantities like barometric pressure (HourlyStationPressure) and wind speed (HourlyWindSpeed) are measured at the time the observation is taken.
Precipitation amounts (e.g., HourlyPrecipitation) give the liquid amount of precipitation that fell during the observation period (e.g., during the hour preceding the observation). This means snow is melted before being measured.
Some fields, such as HourlySkyConditions, have more complicated contents, but you can use them too - you’ll just need to read up on what they mean in the documentation.
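
As a rough illustration of working with a more complicated field, here is a minimal sketch that pulls a cloud-cover number out of HourlySkyConditions. It assumes, purely for illustration, that each cloud layer is written as a coverage code, a colon, and a count in eighths (oktas), as in the made-up strings below; check the LCD Documentation for the actual layout before relying on this. The helper name max_oktas and the choice to take the largest layer value are my own, not something the documentation prescribes.

```python
import re
import pandas as pd

# Made-up sample values. The "CODE:OKTAS HEIGHT" layout is an assumption
# for illustration and should be checked against the LCD Documentation.
sky = pd.Series(
    ["FEW:02 55", "BKN:07 25 OVC:08 40", "CLR:00", None],
    name="HourlySkyConditions",
)

def max_oktas(report):
    """Return the largest okta count in one report as a float, or NaN."""
    if not isinstance(report, str):
        return float("nan")  # missing observation
    counts = [int(n) for n in re.findall(r":(\d+)", report)]
    return float(max(counts)) if counts else float("nan")

oktas = sky.map(max_oktas)

# Comparisons against NaN are False, so missing rows are not counted.
mostly_cloudy = oktas >= 7
print(int(mostly_cloudy.sum()))
```

Whether the largest layer value is the right summary of a report is itself an analysis decision worth explaining in your notebook.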

CSV Specifics

This data mixes hourly, daily, and monthly information, so different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.

A useful way to select the rows that have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
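
The row selection described above might look like the following sketch. The table here is tiny and made up (including the non-hourly code "SOD"), and stripping whitespace before comparing is a defensive assumption on my part in case the codes are padded in the real files.

```python
import pandas as pd

# Tiny made-up table standing in for a real LCD file.
df = pd.DataFrame({
    "DATE": ["2020-01-01T00:53:00", "2020-01-01T01:53:00", "2020-01-01T23:59:00"],
    "REPORT_TYPE": ["FM-15", "FM-15 ", "SOD"],
    "HourlyDryBulbTemperature": [41, 40, None],
    "DailyMaximumDryBulbTemperature": [None, None, 45],
})

# Strip whitespace defensively, then keep only the regular hourly reports.
is_hourly = df["REPORT_TYPE"].str.strip() == "FM-15"
hourly = df[is_hourly]

print(len(hourly))
```

It is worth checking df["REPORT_TYPE"].value_counts() on a real file first to see which codes actually appear.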

Pre-Downloaded Data

To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. over the entire year 2020.

You can find the pre-downloaded data files here. The URL of any file in that directory can be passed to pd.read_csv to load the data directly into your notebook.
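
Here is a minimal sketch of loading one file. The helper name load_lcd is hypothetical, and the in-memory sample below is a made-up stand-in so the snippet runs without network access; in your notebook you would pass the URL of one of the pre-downloaded files instead.

```python
import io
import pandas as pd

def load_lcd(source):
    """Load one LCD CSV from a URL, a local path, or a file-like object."""
    # low_memory=False has pandas infer each column's type from the whole
    # file rather than chunk by chunk, which avoids mixed-type columns.
    return pd.read_csv(source, low_memory=False)

# Made-up stand-in for a real file. In a notebook, replace `sample` with
# the URL string of a pre-downloaded file.
sample = io.StringIO(
    "STATION,DATE,REPORT_TYPE,HourlyDryBulbTemperature\n"
    "12345,2020-01-01T00:53:00,FM-15,41\n"
    "12345,2020-01-01T01:53:00,FM-15,40\n"
)
df = load_lcd(sample)

# Convert the timestamp strings to datetimes so the .dt accessor works later.
df["DATE"] = pd.to_datetime(df["DATE"])
print(df.shape)
```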

If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.

Getting More Data

If you want to answer a question that requires other weather stations and/or other timeframes, you’ll need to retrieve your own data. Here’s how I downloaded the files that I provided to you:

Go to https://www.ncdc.noaa.gov/cdo-web/datatools/lcd
Select one of the available Location Types to search for a weather station in your location of interest.
Find the station of interest in the Station Details list and click “Add To Cart”. Don’t worry - the data is free!
Search for additional locations and add them to your cart as needed. Note that the data ordering system puts a 10 station-year limit on the amount of data you can order at once. For example, you could download 10 years from 1 station or 1 year from 10 stations. You can make as many orders as you want, but if you need more than 10 station-years you’ll need to break it into multiple orders.
Mouse over the orange “Cart (Free Data)” button in the top right and click “View All Items”.
Under Select the Output Format, choose the “LCD CSV” format radio button.
Enter the date range you want to get data for and click Continue at the bottom of the page.
On the next page, enter your email address and click Submit Order. The system will email you an order confirmation, and then shortly thereafter you’ll get another email with a link to download your data.
If you downloaded data from multiple stations, they’ll be batched into a single CSV file. I wrote a short program breakout_stations.py to process such a file into one CSV file per station.
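
If you would rather split a multi-station file yourself, the idea can be sketched as below. This is not the contents of breakout_stations.py; it is a minimal illustration that assumes the combined file has a STATION column identifying each row's station, and the function name split_by_station and the output filenames are hypothetical.

```python
import tempfile
from pathlib import Path
import pandas as pd

def split_by_station(combined, out_dir):
    """Write one CSV per distinct STATION value; return the paths written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for station, rows in combined.groupby("STATION"):
        path = out_dir / f"{station}.csv"
        rows.to_csv(path, index=False)
        paths.append(path)
    return paths

# Made-up combined table with two stations.
combined = pd.DataFrame({
    "STATION": ["A1", "A1", "B2"],
    "DATE": ["2020-01-01T00:53:00", "2020-01-01T01:53:00", "2020-01-01T00:53:00"],
    "HourlyDryBulbTemperature": [41, 40, 33],
})

# Write into a temporary directory so the sketch cleans up after itself.
with tempfile.TemporaryDirectory() as tmp:
    written = split_by_station(combined, tmp)
    names = sorted(p.name for p in written)
    row_counts = {p.name: len(pd.read_csv(p)) for p in written}

print(names)
```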

If your submission relies on data files not included in the pre-downloaded data, upload them to Canvas alongside your notebook. When reading the files in your notebook, your code should assume the CSV files are in the same directory as the notebook; i.e., to load a file called NV_LasVegas.csv, you’d call pd.read_csv("NV_LasVegas.csv").

Getting Started

Please refer to Lab 1 for instructions on how to get the notebook server running, in case you’ve forgotten how; instructions for running on the lab machines remotely are here.

Your Tasks

Part 1

Use the pre-downloaded 2020 weather data to perform your choice of two of the following pre-defined analyses. Create a separate notebook for each question you address.

Determine which of the provided cities was the rainiest in 2020.
Find out how often Bellingham was overcast in each month of 2020.
Compare Fall weather in Bellingham and Portland based on temperature, humidity, rainfall, and wind.

Part 2

Perform one additional analysis that is substantively different from the above (this doesn’t mean you can’t use the same columns, just that you are getting at different questions and, ideally, using different techniques). Feel free to collect additional data beyond what I’ve pre-downloaded; this would, for example, enable investigation of trends over years (e.g., which of the past 10 summers in Bellingham was hottest?). Submit your analysis in a third notebook.
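
To give a flavor of a multi-year analysis, here is a minimal sketch of comparing summers across years. The readings are made up, and both the definition of summer (June through August) and the use of mean hourly dry bulb temperature as the measure of "hottest" are assumptions you would need to choose and justify yourself.

```python
import pandas as pd

# Made-up hourly readings spanning two years.
df = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2019-07-01 12:53", "2019-08-01 12:53", "2019-12-01 12:53",
        "2020-06-15 12:53", "2020-07-15 12:53",
    ]),
    "HourlyDryBulbTemperature": [70, 74, 40, 78, 80],
})

# One possible definition of "summer": June through August.
summer = df[df["DATE"].dt.month.isin([6, 7, 8])]

# Mean summer temperature per calendar year, then the year with the largest mean.
mean_by_year = summer.groupby(summer["DATE"].dt.year)["HourlyDryBulbTemperature"].mean()
hottest_year = int(mean_by_year.idxmax())
print(hottest_year)
```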

Extra Credit

Perform an additional in-depth analysis of your choice; ideally this will involve multiple data columns, multiple cities, and/or analysis by time or date. More credit will be awarded for analyses that result in insights that are surprising or non-obvious.

Guidelines

Each of your analyses should tell a story, and should do so clearly and convincingly.

Notice that I’ve left terms here imprecise - it’s up to you to decide how to define things in terms of the data available. For example, for the first Part 1 question (the rainiest city) you would need to decide what “rainy” means to you and explain that decision.
A convincing analysis might compare multiple interpretations of a concept. For example, it may be interesting to see both number of days with rain and also total amount of rain for each city to get a more complete picture of “raininess”.
Any preprocessing, cleaning, or assumptions you make in your analysis should be explained. If you find something strange in the data and can’t find an explanation for it in the LCD Documentation, you may need to ignore it; be explicit about doing this.
When answering these kinds of questions, I often start with a fairly specific question (“is it cloudier in Bellingham or Ithaca?”), but once I do the work to answer it, it’s quite easy to answer other similar questions or perform more in-depth analysis (“what are the relative frequencies of each possible sky condition in each city?”). You’re not required to, but you should feel free to follow your curiosity into further analysis after answering the original question.
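
As an illustration of comparing two interpretations of one concept, the sketch below computes both total rainfall and the number of days with rain from a made-up table. The "T" entry is an assumed example of a non-numeric flag; how you treat such values in the real data is a preprocessing decision to look up in the LCD Documentation and explain in your notebook.

```python
import pandas as pd

# Made-up hourly observations over three days.
hourly = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2020-01-01 00:53", "2020-01-01 01:53",
        "2020-01-02 00:53", "2020-01-03 00:53",
    ]),
    "HourlyPrecipitation": ["0.10", "T", "0.00", "0.25"],
})

# Coerce to numbers; anything non-numeric (such as the assumed "T" flag) becomes NaN.
precip = pd.to_numeric(hourly["HourlyPrecipitation"], errors="coerce")

# Interpretation 1: total amount of rain (sum skips NaN by default).
total_rain = precip.sum()

# Interpretation 2: number of calendar days with any measurable rain.
daily = precip.groupby(hourly["DATE"].dt.normalize()).sum()
rainy_days = int((daily > 0).sum())

print(round(total_rain, 2), rainy_days)
```

Two cities can rank differently under these two measures, which is exactly the kind of result worth discussing.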

Submitting your work

If your analyses use any data other than the pre-downloaded files, please submit a zip file containing all of the additional data files needed by your notebooks; do not include the pre-downloaded files. Please zip the data even if it’s only one file, since this preserves the data filename so your notebook can load it without modification. Be sure that each notebook reads pre-downloaded files from their URLs, and other data files from the same directory as the notebook.

For each notebook, make sure that all your cells have been run and have the output you intended, then download a copy in .ipynb format and submit the resulting file to the Lab 2 assignment on Canvas.

Finally, fill out the Week 2 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.

Rubric

Each analysis in Part 1 is worth 20 points, while your Part 2 analysis is worth 10 points. For full credit, your analysis will be correct, convincing, and clear.

Parts 1 and 2

50% - Correct
- 5/5 - analysis is completely technically sound
- 4/5 - small technical issues, but no apparent effect on the outcome of the analysis
- 3/5 - some techniques are applied or interpreted incorrectly in a way that affects the analysis
- 2/5 - an honest attempt was made, but it’s seriously incomplete or flawed
- 1/5 - some effort was made
- 0/5 - no effort was made / no submission
30% - Convincing
- 3/3 - Sensible preprocessing decisions are made; assumptions are reasonable
- 2/3 - Some questionable assumptions are made or important information ignored, weakening the conclusion of the analysis
- 1/3 - The analysis attempts to, but does not, support the conclusion or provide insight into the issue in question
- 0/3 - no effort was made / no submission
20% - Clear
- 2/2 - code and analysis are clearly and concisely explained and justified
- 1/2 - code and analysis are partly explained/justified, or explanations are difficult to understand
- 0/2 - no effort was made / no submission

Extra Credit

Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.