DATA 311 - Lab 2: Answering Weather Questions

Scott Wehrwein

Fall 2021

Introduction

The last time I taught CSCI 141, my students did a data-science-flavored final project. Their task was to develop and refine a data question that can be answered using weather records provided by NOAA. For example, my question was:

Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY (where I moved here from)?

The refined question should be one that can be answered unambiguously using the available data. In this case, I framed it like this:

In the months of December and January through March in the year 2020, did Bellingham or Ithaca have more hourly observations where the sky was 7/8ths or more covered by clouds?

My 141 students were tasked with writing a Python program to answer their question - without using any of the fancy libraries we’re using in this class! In this lab, we’ll do a similar thing - answer a bunch of weather questions based on data, but we’ll do it using pandas.
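
To give a concrete sense of what that looks like in pandas, here is a rough sketch of how my example question might be answered. This is only an illustration, not a required approach: the file names are placeholders, and the cloud check is a simplification (it looks for fully-overcast "OVC" reports rather than parsing the exact 7/8ths sky-cover threshold). The REPORT_TYPE, DATE, and HourlySkyConditions columns are described in the LCD documentation discussed below.

    import pandas as pd

    def overcast_winter_count(csv_path):
        # Count Dec-Mar 2020 hourly observations whose sky report includes "OVC".
        df = pd.read_csv(csv_path, low_memory=False)
        # Keep only the routine hourly observations (see REPORT_TYPE, below).
        hourly = df[df["REPORT_TYPE"].str.strip() == "FM-15"]
        months = pd.to_datetime(hourly["DATE"]).dt.month
        winter = hourly[months.isin([12, 1, 2, 3])]
        # Simplification: count fully-overcast ("OVC") reports only.
        return winter["HourlySkyConditions"].fillna("").str.contains("OVC").sum()

    # Placeholder file names -- substitute whatever files you actually use.
    print(overcast_winter_count("WA_Bellingham.csv"))
    print(overcast_winter_count("NY_Ithaca.csv"))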

Collaboration Policy

For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

The Data

For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded by land-based weather stations. The data we’ll work with consists primarily of observations taken once per hour, around the clock, every day of the year.

The meanings of the columns available in the data are spelled out in detail in the LCD Documentation. Spend a few minutes reading through it to see which columns might be relevant. Here are a couple of specific bits of domain knowledge that may be helpful:

CSV Specifics

This data mixes hourly, daily, and monthly information, so different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.

A useful way to select the rows containing the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all of the Hourly columns should have values.
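
For example, a quick way to check this looks something like the following (the file name is a placeholder for one of the data files):

    import pandas as pd

    # Placeholder file name. First see which report types are present, then
    # keep only the FM-15 rows and check how complete the Hourly columns are.
    df = pd.read_csv("WA_Bellingham.csv", low_memory=False)
    print(df["REPORT_TYPE"].value_counts())

    # .str.strip() guards against stray whitespace in the raw values
    hourly = df[df["REPORT_TYPE"].str.strip() == "FM-15"]
    # Fraction of non-missing values in each column whose name contains "Hourly"
    print(hourly.filter(like="Hourly").notna().mean())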

Pre-Downloaded Data

To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. for the entire year 2020.

You can find the pre-downloaded data files here. The URLs of the files in that directory can be passed directly to pd.read_csv to load the data into your notebook.
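
For example (the URL below is a placeholder; substitute the actual URL of one of the files in that directory):

    import pandas as pd

    # Placeholder URL -- replace with the URL of one of the pre-downloaded files.
    url = "https://example.com/data311/lab2/WA_Bellingham.csv"
    bellingham = pd.read_csv(url, low_memory=False)
    print(bellingham.shape)
    print(bellingham.columns.tolist()[:10])   # peek at the first few column names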

If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.

Getting More Data

If you want to answer a question that requires other weather stations and/or other timeframes, you’ll need to retrieve your own data. Here’s how I downloaded the files that I provided to you:

  1. Go to https://www.ncdc.noaa.gov/cdo-web/datatools/lcd
  2. Select one of the available Location Types to search for a weather station in your location of interest.
  3. Find the station of interest in the Station Details list and click “Add To Cart”. Don’t worry - the data is free!
  4. Search for additional locations and add them to your cart as needed. Note that the data ordering system puts a 10 station-year limit on the amount of data you can order at once. For example, you could download 10 years from 1 station or 1 year from 10 stations. You can make as many orders as you want, but if you need more than 10 station-years you’ll need to break it into multiple orders.
  5. Mouse over the orange “Cart (Free Data)” button in the top right and click “View All Items”.
  6. Under Select the Output Format, choose the “LCD CSV” format radio button.
  7. Enter the date range you want to get data for and click Continue at the bottom of the page.
  8. On the next page, enter your email address and click Submit Order. The system will email you an order confirmation, and shortly thereafter you’ll get another email with a link to download your data.
  9. If you downloaded data from multiple stations, they’ll be batched into a single CSV file. I wrote a short program breakout_stations.py to process such a file into one CSV file per station; a rough sketch of the idea is shown below.
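
The actual breakout_stations.py isn’t reproduced here, but the core idea is roughly the following sketch, which assumes the combined file has a STATION column identifying each weather station (check the LCD documentation) and uses hypothetical file names:

    import pandas as pd

    # Sketch: split a combined multi-station LCD CSV into one file per station.
    combined = pd.read_csv("combined_order.csv", low_memory=False)
    for station_id, rows in combined.groupby("STATION"):
        rows.to_csv(f"station_{station_id}.csv", index=False)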

If your submission relies on data files not included in the pre-downloaded data, upload them to Canvas alongside your notebook. When reading the files in your notebook, your code should assume the CSV files are in the same directory as the notebook; i.e., to load a file called NV_LasVegas.csv, you’d call pd.read_csv("NV_LasVegas.csv").

Getting Started

Please refer to Lab 1 for instructions on how to get the notebook server running, in case you’ve forgotten how; instructions for running remotely on the lab machines are here.

Your Tasks

Part 1

Use the pre-downloaded 2020 weather data to perform your choice of two of the following pre-defined analyses. Create a separate notebook for each question you address.

  1. Determine which of the provided cities was the rainiest in 2020.
  2. Find out how often Bellingham was overcast in each month of 2020.
  3. Compare Fall weather in Bellingham and Portland based on temperature, humidity, rainfall, and wind.

Part 2

Perform one additional analysis that is substantively different from the above (this doesn’t mean you can’t use the same columns, just that you are getting at different questions and, ideally, using different techniques). Feel free to collect additional data beyond what I’ve pre-downloaded; this would, for example, enable investigation of trends over years (e.g., which of the past 10 summers in Bellingham was hottest?). Submit your analysis in a third notebook.

Extra Credit

Perform an additional in-depth analysis of your choice; ideally this will involve multiple data columns, multiple cities, and/or analysis by time or date. More credit will be awarded for analyses that result in insights that are surprising or non-obvious.

Guidelines

Each of your analyses should tell a story, and should do so clearly and convincingly.

Submitting your work

If your analyses use any data other than the pre-downloaded files, please submit a zip file containing all of the additional data files your notebooks need (please zip the data even if it’s only one file - this preserves the filename so your notebook can load it without modification). Be sure that your notebooks read pre-downloaded files from their URLs and any other data files from the same directory as the notebook.

For each notebook, make sure that all your cells have been run and have the output you intended, then download a copy in .ipynb format and submit the resulting file to the Lab 2 assignment on Canvas.

Finally, fill out the Week 2 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.

Rubric

Each analysis in Part 1 is worth 20 points, and your Part 2 analysis is worth 10 points. For full credit, your analysis must be correct, convincing, and clear.

Parts 1 and 2

Extra Credit

Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.