DATA 311 - Lab 2: Answering Weather Questions

Scott Wehrwein

Winter 2023

Introduction

The last time I taught CSCI 141, my students did a data-science-flavored final project. Their task was to develop and refine a data question that can be answered using weather records provided by NOAA. For example, my question was:

Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY (where I moved here from)?

The goal of refining the question is to arrive at one that can be unambiguously answered by the available data. In this case, I framed it like this:

In the months of January through March and December of 2020, did Bellingham or Ithaca have more hourly observations in which the sky was 7/8ths or more covered by clouds?
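As a preview of the kind of pandas logic involved: one way to count such observations is to pull the coverage values out of the sky-condition strings. This sketch assumes the HourlySkyConditions column encodes coverage in oktas (eighths of the sky) with codes like "SCT:04" or "OVC:08" -- check the LCD documentation to confirm the format before relying on it:

```python
import re
import pandas as pd

def max_oktas(sky):
    """Return the largest cloud-coverage value (in oktas, out of 8) found in a
    sky-condition string such as 'BKN:07 55 OVC:08 110', or 0 if none."""
    if not isinstance(sky, str):
        return 0
    oktas = [int(m) for m in re.findall(r":(\d{2})", sky)]
    return max(oktas, default=0)

# Tiny synthetic example standing in for real hourly observations
df = pd.DataFrame({"HourlySkyConditions": [
    "FEW:02 25",                # 2/8 covered
    "BKN:07 55 OVC:08 110",     # 8/8 covered
    "OVC:08 30",                # 8/8 covered
]})

# Count observations at least 7/8 covered
mostly_cloudy = (df["HourlySkyConditions"].map(max_oktas) >= 7).sum()
```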

My 141 students were tasked with writing a Python program to answer their question - without using any of the fancy libraries we’re using in this class! In this lab, we’ll do a similar thing - answer a bunch of weather questions based on data, but we’ll do it using pandas.

Collaboration Policy

For this lab, you may (and are encouraged to) spend the lab period working together with a partner. Together means synchronously and collaboratively: no divide and conquer. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate with classmates in accordance with the individual assignment collaboration policy (i.e., you can discuss ideas and strategies but not share or view code). Your submission must acknowledge anyone you collaborated with, and for which parts of the lab (this should be included as a statement at the top of your notebook(s)).

The Data

For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.

The meanings of the columns available in the data are spelled out in detail in the LCD Documentation. Spend a few minutes reading through to see which columns might be relevant. Here are a couple specific bits of domain knowledge that may be helpful:

CSV Specifics

This data mixes hourly, daily, and monthly information, so different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.

A useful way to filter down to the rows that have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
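A minimal sketch of this filter, using a synthetic frame mimicking the LCD layout (the real files have many more columns; stripping whitespace from REPORT_TYPE is a defensive measure in case values are padded):

```python
import pandas as pd

# Synthetic stand-in: one Daily summary row mixed in with hourly observations
df = pd.DataFrame({
    "REPORT_TYPE": ["FM-15", "SOD", "FM-15"],
    "HourlyDryBulbTemperature": [41.0, None, 39.0],
    "DailyMaximumDryBulbTemperature": [None, 45.0, None],
})

# Keep only the regular hourly observations; the Daily* columns
# are empty in these rows, and the Hourly* columns are populated
hourly = df[df["REPORT_TYPE"].str.strip() == "FM-15"]
```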

Pre-Downloaded Data

To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. over the entire year 2020.

You can find the pre-downloaded data files here. The URLs of the files in that directory can be given to pd.read_csv to load the data directly into your notebook. For the sake of bandwidth, load the data you need in a single cell at the top of your notebook so you only need to download the data once each time you’re working on the lab.
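A sketch of such a loading cell (the URL below is a hypothetical placeholder; substitute the actual links from the data directory). pd.read_csv accepts a URL or a local path interchangeably, and low_memory=False avoids mixed-dtype warnings caused by the sparsely populated columns:

```python
import pandas as pd

def load_station(path_or_url):
    """Load one LCD CSV from a URL or local path, parsing DATE as datetimes."""
    return pd.read_csv(path_or_url, parse_dates=["DATE"], low_memory=False)

# Hypothetical URL -- replace with a real link from the data directory:
# bham = load_station("https://example.edu/lcd/WA_Bellingham.csv")
```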

If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.

Getting More Data

If you want to answer a question that requires other weather stations and/or other timeframes, you’ll need to retrieve your own data. Here’s how I downloaded the files that I provided to you:

  1. Go to https://www.ncdc.noaa.gov/cdo-web/datatools/lcd
  2. Select one of the available Location Types to search for a weather station in your location of interest.
  3. Find the station of interest in the Station Details list and click “Add To Cart”. Don’t worry - the data is free!
  4. Search for additional locations and add them to your cart as needed. Note that the data ordering system puts a 10 station-year limit on the amount of data you can order at once. For example, you could download 10 years from 1 station or 1 year from 10 stations. You can make as many orders as you want, but if you need more than 10 station-years you’ll need to break it into multiple orders.
  5. Mouse over the orange “Cart (Free Data)” button in the top right and click “View All Items”.
  6. Under Select the Output Format, choose the “LCD CSV” format radio button.
  7. Enter the date range you want to get data for and click Continue at the bottom of the page.
  8. On the next page, enter your email address and click Submit Order. The system will email you an order confirmation, and then shortly thereafter you’ll get another email with a link to download your data.
  9. If you downloaded data from multiple stations, they’ll be batched into a single CSV file. I wrote a short program breakout_stations.py to process such a file into one CSV file per station.
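The provided script isn’t reproduced here, but splitting a combined order into per-station files is essentially a groupby. A sketch, assuming the combined file has a STATION column identifying each record’s station (as LCD CSVs do):

```python
import pandas as pd

def breakout_stations(combined_csv, out_dir="."):
    """Split a combined LCD download into one CSV file per station."""
    df = pd.read_csv(combined_csv, low_memory=False)
    for station, group in df.groupby("STATION"):
        # Name each output file after its station identifier
        group.to_csv(f"{out_dir}/{station}.csv", index=False)
```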

If your submission relies on data files not included in the pre-downloaded data, upload them to Canvas alongside your notebook. When reading the files in your notebook, your code should assume the CSV files are in the same directory as the notebook; i.e., to load a file called NV_LasVegas.csv, you’d call pd.read_csv("NV_LasVegas.csv").

Your Tasks

Part 1

Use the pre-downloaded 2020 weather data to perform the following three pre-defined analyses. Create a separate notebook for each question you address.

  1. In a notebook titled lab2_q1.ipynb, determine which of the provided cities was the rainiest in 2020.
  2. In a notebook titled lab2_q2.ipynb, find out how often Bellingham was overcast in each month of 2020.
  3. In a notebook titled lab2_q3.ipynb, compare Fall weather in Bellingham and Portland based on temperature, humidity, rainfall, and wind. Use the “climatological” definition of Fall, which runs from the beginning of September through the end of November.
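One domain-specific pitfall for question 1: HourlyPrecipitation is stored as text and, per the LCD documentation, includes non-numeric entries such as "T" for trace amounts, so coerce to numbers before summing. A sketch of the comparison, with the per-city DataFrames and city names purely hypothetical:

```python
import pandas as pd

def total_precip(df):
    """Sum hourly precipitation (inches), treating 'T' (trace) and blanks as zero."""
    hourly = df[df["REPORT_TYPE"].str.strip() == "FM-15"]
    inches = pd.to_numeric(hourly["HourlyPrecipitation"], errors="coerce")
    return inches.fillna(0).sum()

# Hypothetical usage once each city's CSV has been loaded:
# cities = {"Bellingham": bham_df, "Portland": pdx_df}
# rainiest = max(cities, key=lambda c: total_precip(cities[c]))
```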

Part 2

Perform one additional analysis that is substantively different from the above (this doesn’t mean you can’t use the same columns, just that you are getting at different questions and, ideally, using different techniques). Feel free to collect additional data beyond what I’ve pre-downloaded; this would, for example, enable investigation of trends over years (e.g., which of the past 10 summers in Bellingham was hottest?). Submit your analysis in a fourth notebook titled lab2_q4.ipynb.

Extra Credit

Perform an additional in-depth analysis of your choice; ideally this will involve multiple data columns, multiple cities, and/or analysis by time or date. More credit will be awarded for analyses that result in insights that are surprising or non-obvious.

Guidelines

Each of your analyses should clearly and convincingly tell a story.

Submitting your work

For each notebook, make sure that all your cells have been run and have the output you intended, then download a copy in .ipynb format. Create a single zip file containing all of your notebooks, following the naming conventions given above, where spelling, spacing, and capitalization matter. Do not include any of the provided (pre-downloaded) data files; be sure that the notebook reads pre-downloaded files from their URLs. If your analysis uses data other than the pre-downloaded files, include only those extra data files in your zip file. Submit your zip file to the Lab 2 assignment on Canvas.

Finally, fill out the Lab 2 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.

Rubric

Each analysis in Part 1 is worth 15 points, while your Part 2 analysis is worth 10 points. For full credit, your analysis will be correct, convincing, and clear.

Parts 1 and 2

Extra Credit

Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.