CSCI 141 Final Project: Weather Data Science

Scott Wehrwein

Spring 2021

Update: You can find the collected Abstracts summarizing everyone’s results here!

Important Dates:

Proposal: Wednesday, May 19th
Draft program: Wednesday, May 26th
Finalized code and report: Monday, May 31st.

Note: Slip days can be used on the final project, but a slip day used on one of the interim deadlines does not carry to later ones.

Introduction

Does it rain more often in Olympia than in Bellingham? Is Ithaca, NY (where I went to graduate school) actually any less cloudy than Bellingham is in the winter? It also seems like it’s more windy more often here than anywhere I’ve lived before; is that true?

These are all questions that I find it easy to wonder about - and they’re also questions that could be carefully refined into a form that can be definitively answered with the right data. Fortunately, the United States is home to a large network of weather observation stations that record standardized information about the current weather in locations across the country on an hourly basis.

The goal of this project is to formulate a question that can be answered using weather data from one or more locations in the US, and develop a Python program to find a quantitative answer to your question.

The Data

For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.

Pre-Downloaded Data

To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. over the entire year 2020.

You can find the pre-downloaded data files here.

If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.

The CSV files are large and a little messy, but there’s a lot of useful information in there. Here’s a list of just some of the columns that might be of interest:

STATION
DATE
REPORT_TYPE
DailyAverageDewPointTemperature
DailyAverageDryBulbTemperature
DailyAverageRelativeHumidity
DailyAverageSeaLevelPressure
DailyAverageStationPressure
DailyAverageWetBulbTemperature
DailyAverageWindSpeed
DailyCoolingDegreeDays
DailyDepartureFromNormalAverageTemperature
DailyHeatingDegreeDays
DailyMaximumDryBulbTemperature
DailyMinimumDryBulbTemperature
DailyPeakWindDirection
DailyPeakWindSpeed
DailyPrecipitation
DailySnowDepth
DailySnowfall
DailySustainedWindDirection
DailySustainedWindSpeed
DailyWeather
HourlyAltimeterSetting
HourlyDewPointTemperature
HourlyDryBulbTemperature
HourlyPrecipitation
HourlyPresentWeatherType
HourlyPressureChange
HourlyPressureTendency
HourlyRelativeHumidity
HourlySeaLevelPressure
HourlySkyConditions
HourlyStationPressure
HourlyVisibility
HourlyWetBulbTemperature
HourlyWindDirection
HourlyWindGustSpeed
HourlyWindSpeed
MonthlyAverageRH
Sunrise
Sunset
TStorms

The meanings of these columns are spelled out in detail in the LCD Documentation, but here are a few points to get you started:

Temperature is measured in two ways: with a “dry bulb” and a “wet bulb”. These two measurements together allow for the calculation of dewpoint and humidity. The regular air temperature, as reported by your favorite weather app or website, is the dry bulb temperature. So if you want to answer a question about air temperature, you’ll probably want to look at HourlyDryBulbTemperature, DailyMaximumDryBulbTemperature, or some other dry bulb measurement.
Quantities like barometric pressure (HourlyStationPressure) and wind speed (HourlyWindSpeed) are measured at the time the observation is taken.
Precipitation amounts (e.g., HourlyPrecipitation) give the liquid amount of precipitation that fell during the observation period (e.g., during the hour preceding the observation). This means snow is melted before being measured.
Some fields, such as HourlySkyConditions, have more complicated contents, but you can use them too - you’ll just need to read up on what they mean in the documentation.

CSV Specifics

Because this data mixes hourly, daily, and monthly information, different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.

A useful way to filter out the rows that will have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
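As a minimal sketch of that filtering step, suppose each row has already been loaded into a dict keyed by column name. The helper name is_hourly_observation and the sample rows below are made up for illustration (including the placeholder report type OTHER), and the call to strip() is only a defensive assumption about stray whitespace in the files:

```python
def is_hourly_observation(observation):
    """Return True if the observation dict is a regular hourly (FM-15) report."""
    # strip() is a defensive assumption: it tolerates stray spaces around
    # the value, which the files may or may not contain.
    return observation["REPORT_TYPE"].strip() == "FM-15"


# Made-up rows for illustration; real rows have many more columns.
rows = [
    {"REPORT_TYPE": "FM-15", "HourlyDryBulbTemperature": "41"},
    {"REPORT_TYPE": "OTHER", "HourlyDryBulbTemperature": ""},
    {"REPORT_TYPE": "FM-15 ", "HourlyDryBulbTemperature": "43"},
]

hourly_rows = [row for row in rows if is_hourly_observation(row)]
print(len(hourly_rows), "hourly observations")
```

Keeping the check in its own small function means the decision about which rows count lives in one place if you later need to tighten it.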

You may find it helpful to look at the CSV files in a spreadsheet program, to help you understand what’s there and how they’re laid out. I recommend opening them in Microsoft Excel, LibreOffice Calc, or Google Sheets (you can upload a CSV file to Google Drive, then right click it and select Open In > Google Sheets).

Commas in CSV Fields

Some data files have CSV fields that contain commas; this is tricky because the commas are part of the data, but they look just like the commas used to separate the values. The typical solution to this is to enclose the values in quotes (in this case, double quotes), so "SNOW, FOG" is considered a single value rather than two separate values (SNOW and FOG). This means that simply splitting each line on commas is not enough to get the right values back. To help with this, I’ve written a function that handles this properly. You can download the code for that function here: splitty.py. You are free to use this function in your solution.
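To see why quoted fields need special handling, here is one way a quote-aware split could be written. This is an illustrative sketch only, not the contents of splitty.py: the name split_csv_line is hypothetical, the sample line is made up, and the function assumes quotes are never escaped inside a field.

```python
def split_csv_line(line):
    """Split one CSV line on commas, ignoring commas inside double quotes.

    Returns a list of field strings with the enclosing quotes removed.
    Hypothetical helper for illustration; it does not handle escaped
    quotes inside a quoted field.
    """
    fields = []
    current = ""
    in_quotes = False
    for ch in line.rstrip("\n"):
        if ch == '"':
            in_quotes = not in_quotes  # toggle state; the quote is dropped
        elif ch == "," and not in_quotes:
            fields.append(current)     # a separator comma ends the field
            current = ""
        else:
            current += ch              # ordinary character (or quoted comma)
    fields.append(current)             # the last field has no trailing comma
    return fields


# Made-up line: three fields, the middle one containing a comma.
print(split_csv_line('72797,"SNOW, FOG",31'))
# prints ['72797', 'SNOW, FOG', '31']
```

By comparison, calling '72797,"SNOW, FOG",31'.split(",") would return four pieces, with the quoted value broken in two.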

Optional: Getting More Data

If you want to answer a question that requires other weather stations and/or other timeframes, you’re encouraged to retrieve your own data. Here’s how I downloaded the files that I provided to you:

Go to https://www.ncdc.noaa.gov/cdo-web/datatools/lcd
Select one of the available Location Types to search for a weather station in your location of interest.
Find the station of interest in the Station Details list and click “Add To Cart”. Don’t worry - the data is free!
Search for additional locations and add them to your cart as needed. Note that the data ordering system puts a 10 station-year limit on the amount of data you can order at once. For example, you could download 10 years from 1 station or 1 year from 10 stations. You can make as many orders as you want, but if you need more than 10 station-years you’ll need to break it into multiple orders.
Mouse over the orange “Cart (Free Data)” button in the top right and click “View All Items”.
Under Select the Output Format, choose the “LCD CSV” format radio button.
Enter the date range you want to get data for and click Continue at the bottom of the page.
On the next page, enter your email address and click Submit Order. The system will email you an order confirmation, and then shortly thereafter you’ll get another email with a link to download your data.
If you downloaded data from multiple stations, they’ll be batched into a single CSV file. I wrote a short program breakout_stations.py to process such a file into one CSV file per station.

Part 1: Formulating a Question and Planning Your Program

Submission: PDF upload to Canvas by the start of class on Wednesday, 5/19.

Start with curiosity: what do you want to know? Ground your curiosity in the available data: try to come up with a question that pertains to one or more of the columns in the data files described in the prior section.

Example 1: Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY? I read up on the HourlySkyConditions field; this will tell me about the amount of cloud cover on an hourly basis, so I think I should be able to investigate this question using the available data.

Example 2: How do the daily temperature variations change across seasons of the year in Bellingham? I should be able to use the DailyMaximumDryBulbTemperature and DailyMinimumDryBulbTemperature fields to investigate this.

Refining Your Question

At this point, we need to get more specific: we need to re-phrase our question precisely enough that it has a concrete answer that can be found in, or calculated from, the data.

Example 1: The HourlySkyConditions column includes a number from 0-10; if it’s between 0 and 8, it indicates the number of “oktas” (eighths) of the sky that are covered with clouds. If it’s 9 or 10, that means the sky is obscured somehow. Given this, I’ll decide on the following definitions:

define “cloudy” as 7 or more oktas; I’ll include obscuration (which I think will be rare) as cloudy
define “winter” as the months between December and March; I could cover more months, but those are the months that are unambiguously “winter” in both places.
I’ll use the pre-downloaded data and focus on the year 2020

I can now phrase my question precisely as follows:

In the months from December through March in the year 2020, did Bellingham or Ithaca have more hourly observations that were cloudy (i.e., 7 or more oktas covered with clouds)?

Example 2: Let’s define the “temperature swing” for a day as the difference between the daily maximum temperature and the daily minimum temperature (both dry bulb). One way to understand how the temperature swing varies across seasons is to calculate the average temperature swing for each month of the year. We can then look at a table (or a plot) of this data and perhaps we’ll see a trend. So my goal is then:

Calculate the average daily temperature swing (i.e., difference in daily dry bulb max and min) for each month of the year 2020.

Submitting your Question Proposal

Once you’ve formulated your question as described in this section, you will submit a Question Proposal to Canvas so I can review your question and make sure I agree that it’s sufficiently precise and answerable.

Write a short description and justification of your question; this probably won’t require more than a few paragraphs. Be sure to include the following information:

A description of your original, unrefined, motivating question.
Your refined question.
An explanation of how your question can be answered using the available data; include the names of the columns from the dataset that you will need to use in answering your question.

Submit your proposal in PDF format to the FP Proposal assignment on Canvas.

Part 2: Answering the Question

Planning your Program

At this point, your question should be precise enough that you could hand it to a competent programmer and they could answer it by writing a program. There’s no ambiguity about what you’re looking for; just the challenge of extracting the desired information from the data.

The next step you should take is to write pseudocode for your program. Start with a focus on the high-level here: what steps need to be taken to solve the problem?

The following is some pseudocode for Example 2; keep in mind that there are many ways this could be done, so even if your question is similar to this one, your pseudocode likely won’t look the same:

create a list of 12 temperature swing totals, one per
month, initialized to zero

for each row of the CSV file:
  if it's a daily observation:
    calculate the temperature swing
    add it to the appropriate monthly total list element

for each monthly total:
   divide it by the number of days in the given month to compute the
   average temperature swing

print a table of the calculated monthly averages
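For comparison, here is one way the middle of that pseudocode might look in Python. Treat it as a sketch under stated assumptions rather than a reference solution: it assumes each row is already a dict keyed by column name, that DATE begins with YYYY-MM (check your files), and that daily rows can be recognized by non-empty daily max and min fields. The function names and sample rows are made up. It also divides by the number of days that actually had data rather than the calendar length of the month, which is a small variation on the pseudocode.

```python
def month_index(date_string):
    """Return the 0-based month from a date string like '2020-03-14T23:59:00'."""
    # ASSUMPTION: the DATE value starts with YYYY-MM; verify against your files.
    return int(date_string[5:7]) - 1


def average_swings(observations):
    """Return a list of 12 average daily temperature swings, one per month.

    observations is a list of dicts keyed by column name. A month with no
    daily data gets None instead of an average.
    """
    totals = [0.0] * 12
    counts = [0] * 12
    for obs in observations:
        high = obs["DailyMaximumDryBulbTemperature"]
        low = obs["DailyMinimumDryBulbTemperature"]
        if high == "" or low == "":
            continue  # not a daily row (or the daily values are missing)
        month = month_index(obs["DATE"])
        totals[month] += float(high) - float(low)
        counts[month] += 1
    averages = []
    for total, count in zip(totals, counts):
        averages.append(total / count if count > 0 else None)
    return averages


# Made-up rows for illustration only.
sample = [
    {"DATE": "2020-01-01T23:59:00", "DailyMaximumDryBulbTemperature": "45",
     "DailyMinimumDryBulbTemperature": "35"},
    {"DATE": "2020-01-02T12:53:00", "DailyMaximumDryBulbTemperature": "",
     "DailyMinimumDryBulbTemperature": ""},
    {"DATE": "2020-01-02T23:59:00", "DailyMaximumDryBulbTemperature": "50",
     "DailyMinimumDryBulbTemperature": "30"},
    {"DATE": "2020-07-04T23:59:00", "DailyMaximumDryBulbTemperature": "80",
     "DailyMinimumDryBulbTemperature": "55"},
]
monthly = average_swings(sample)
print(monthly[0], monthly[6])  # prints 15.0 25.0
```

Real values may need extra cleaning before float() will accept them; printing a few raw values first is a cheap way to find out.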

Writing your Program

At this point, you’re ready to start coding up your solution. This section provides guidelines for how your program should behave, suggestions for how to go about developing it, and style guidelines that you should follow. Read this section carefully at least twice: once before you start programming and once after you finish to make sure you’ve met all of the stated requirements.

Program Input and Output

Your program should read data from LCD CSV files like the ones I’ve provided. If you use the files I provided, make sure the file names match mine. If you use other data files, name them by a similar convention (ST_City.csv) and include them with your submission to Canvas. Your program should read all files under the assumption that they are available in the same folder as your code.

If your program can take user input (e.g., to decide which cities to compare), this input must be read using command line arguments, not the input function. If your program takes command line arguments, the comment at the top of your program should include an additional component called Usage that describes how a user can run your program and what the command line arguments should be, and that provides at least one example of how the program might be run.
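A minimal sketch of reading command line arguments with sys.argv, together with a Usage comment of the kind described above. The script name compare_cities.py and the helper get_filenames are hypothetical; the CSV names follow the ST_City.csv convention.

```python
# Usage: python3 compare_cities.py CSV_FILE [CSV_FILE ...]
# Example: python3 compare_cities.py WA_Bellingham.csv NY_Ithaca.csv
import sys


def get_filenames(argv):
    """Return the list of CSV filenames given on the command line.

    argv is the full argument list (as in sys.argv), where argv[0] is the
    script name. Returns an empty list if no filenames were given.
    """
    return argv[1:]


if __name__ == "__main__":
    filenames = get_filenames(sys.argv)
    if len(filenames) == 0:
        print("Usage: python3 compare_cities.py CSV_FILE [CSV_FILE ...]")
    else:
        for name in filenames:
            print("Would process", name)
```

Passing the argument list into the function, rather than reading sys.argv inside it, keeps the function free of outside variables and makes it easy to test with a hand-written list.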

The output of your program should be nicely formatted and present its results with context, ideally in tabular form; the specifics will depend on your question, but as an example, here’s the output from my Example 1 program:

This table shows the percentage of hourly observations in which the
sky was at least a given number of eighths covered in clouds in each city.

City \  Eighths sky cover      >= 4      >= 5      >= 6      >= 7      >= 8
WA_Bellingham.csv            72.12%    67.17%    67.17%    67.17%    54.89%
NY_Ithaca.csv                75.60%    73.06%    73.06%    73.06%    62.72%
WA_Seattle.csv               87.36%    78.95%    78.95%    78.95%    54.62%
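Fixed-width columns like these can be produced with format specifiers. The sketch below is one possible approach, not the code that generated the table above; format_row is a made-up helper, and the column widths (25 and 10) are arbitrary choices.

```python
def format_row(label, percentages):
    """Return one table row: a left-aligned label, then right-aligned percents.

    label is a string; percentages is a list of numbers. Each percentage
    is shown with two decimal places in a column 10 characters wide.
    """
    cells = ["{:>10}".format("{:.2f}%".format(p)) for p in percentages]
    return "{:<25}".format(label) + "".join(cells)


# The two values are copied from the first row of the sample output above.
print(format_row("WA_Bellingham.csv", [72.12, 67.17]))
```

Formatting the percentage first and then padding the resulting string keeps the percent sign attached to the number while the column stays right-aligned.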

Test Early and Often

You are likely to learn (possibly unexpected) things about the data once you start processing it, so I strongly recommend that you test as early as possible. Don’t wait until you have the full program written out! Here’s an example demonstrating why that’s a good idea.

When implementing code for Example 1, I started with a loop like the one in the pseudocode above. I quickly realized that I’d need to be looking at many fields of each row even just to decide whether to include it or not. To make this easier, I loaded each row into a dictionary with the column headers as keys and the row’s values as values.

To make sure that was working, I looped over these and printed out the HourlySkyConditions field for all observations with REPORT_TYPE equal to FM-15. I quickly realized that HourlySkyConditions was still missing in a lot of rows; looking at the raw data I noticed that the SOURCE column was 4 for all the observations that were missing HourlySkyConditions, and 7 when it wasn’t missing. So I just added another check to ignore observations with SOURCE not equal to 7.
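A sketch of those two steps, loading a row into a dictionary and then checking REPORT_TYPE and SOURCE, might look like the following. The function names, the station id, and the sample values are made up for illustration, and the rows are assumed to be already split into lists of strings.

```python
def make_observation(headers, values):
    """Return a dict mapping each column header to this row's value."""
    return dict(zip(headers, values))


def has_sky_conditions(observation):
    """Return True for hourly (FM-15) observations whose SOURCE is 7."""
    return (observation["REPORT_TYPE"].strip() == "FM-15"
            and observation["SOURCE"].strip() == "7")


# Made-up header subset and rows; real files have many more columns.
headers = ["STATION", "DATE", "REPORT_TYPE", "SOURCE", "HourlySkyConditions"]
rows = [
    ["00000000001", "2020-01-01T00:53:00", "FM-15", "7", "OVC:08 25"],
    ["00000000001", "2020-01-01T01:00:00", "FM-15", "4", ""],
]

# Early sanity check: print the field for every row that is not skipped.
for values in rows:
    obs = make_observation(headers, values)
    if has_sky_conditions(obs):
        print(obs["DATE"], obs["HourlySkyConditions"])
```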

Notice that this would have been harder to track down and discover if I’d finished my whole program first - the key thing that made this easy was that I decided to print out all the sky conditions for each record I wasn’t skipping.

Write Helper Functions and Test Them Independently

The complexity of your task is sufficiently high that you should not write your code as a single monolithic program. You should write a short (no more than about 30 lines) main program inside a main guard (i.e., if __name__ == "__main__"). Furthermore, your code should not be overly repetitious - if you’re performing some action more than once or twice, create a function to perform that action and call it as many times as you need it. In my Example 1 solution, no single function has more than 15 lines of code, not counting their docstrings; you should strive for many short, self-contained helper functions that make the main program as clear and readable as possible.
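As a rough illustration of that shape, here is a toy program with two short helper functions and a main program inside a main guard. The helpers and the sample readings are invented for this sketch and do not answer any particular weather question.

```python
def count_nonempty(values):
    """Return how many strings in the list values are not empty."""
    count = 0
    for value in values:
        if value != "":
            count += 1
    return count


def percent(part, whole):
    """Return part as a percentage of whole, or 0.0 if whole is 0."""
    if whole == 0:
        return 0.0
    return 100 * part / whole


if __name__ == "__main__":
    # Stand-in for values read from a CSV file.
    readings = ["41", "", "43", "40"]
    present = count_nonempty(readings)
    print("{:.1f}% of readings present".format(percent(present, len(readings))))
```

Each helper receives everything it needs as an argument, so it can be called and checked on its own before the main program exists.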

Your program should have at least two helper functions of your own devising. The helper functions should have precise docstrings describing their behavior and the meanings of their inputs and outputs. To help you find bugs early and easily, test your functions separately from the rest of the program. Here’s another anecdotal example of why this is a good idea, based on my own experience coding up Example 1:

I made several more interesting discoveries along the way towards a working solution. For example, the documentation is simply incorrect about the format of the sky conditions column: the docs say it’s ccc:ll-xxx but it’s actually ccc:ll xx (a space instead of a dash). Fortunately, I was already writing a standalone function to parse the sky conditions column and I tested it independently of the rest of my program, so it was easy to identify where things were going wrong and fix it.
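To make the idea of testing a helper independently concrete, here is a hypothetical parser with a few assertions that exercise it on its own. It assumes each cloud layer has the form ccc:ll xx described above and that multiple layers are separated by spaces; the sample strings are made up, and taking the largest ll value across layers is a design choice of this sketch, not something the documentation prescribes.

```python
def parse_sky_cover(sky_conditions):
    """Return the largest cover number (ll) found in a sky conditions string.

    Assumes each layer looks like 'ccc:ll xx' and that layers are separated
    by spaces. Returns None if no layer is found.
    """
    covers = []
    for token in sky_conditions.split():
        if ":" in token:
            number = token.split(":")[1]
            if number.isdigit():
                covers.append(int(number))
    if len(covers) == 0:
        return None
    return max(covers)


def test_parse_sky_cover():
    """Check parse_sky_cover on made-up inputs, independent of any CSV file."""
    assert parse_sky_cover("OVC:08 25") == 8
    assert parse_sky_cover("FEW:02 55 BKN:07 80") == 7
    assert parse_sky_cover("") is None
    assert parse_sky_cover("garbage") is None


test_parse_sky_cover()
print("all parse_sky_cover tests passed")
```

If the real data used a different separator, only the assertions and this one function would need to change.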

Additional Style Guidelines:

Your program should not import any modules except sys, math, and random (you might not need all, or even any of these). If you have a use for other modules, ask me and I can grant you permission to use them on a case-by-case basis.
Code inside a function should not refer to variables defined outside that function; any information a function needs should be passed into that function as an argument.
As usual, your program should have a comment at the top listing the author, date, and a short description of the program’s purpose.
Any code that is not self-explanatory should be explained using concise inline comments.

Minimize the use of mysterious hard-coded constants. For example:

# not great:
val1 = csv_row[19] # I made up 19, no idea what column it actually represents

# better:
dry_bulb_index = headers.index("HourlyDryBulbTemperature") # done early in the program
# ...
dry_bulb = csv_row[dry_bulb_index] # done anytime you need to access the hourly temp

# also better:
# read each CSV row into a dict with headers as keys
dry_bulb = observation["HourlyDryBulbTemperature"]

If Circumstances Change

It’s possible that you will discover that your original question cannot be answered as you expected due to limitations in the data or other hiccups that come with interacting with real-world data.

If you aren’t sure how to proceed, come talk to me or send me an email. If you know how to modify your question to a new one that follows the original guidelines for the Question Proposal, you may do so and explain it in your interim report without consulting me. Just keep in mind that your final grade will depend on whether you answered the question in your original proposal unless your interim report provides a new question and solid justification for the change. If in doubt, ask me to confirm that your revised question is good to go.

Submitting Your Draft Program and Interim Report

Before the final deadline, you will submit a draft program and interim report. At this point, your program need not be fully polished. You may also focus on getting the code working for a limited set of data (for example, if you’re measuring something over five years, it’s fine to have confirmed that the code measures the quantity of interest for one year). The program’s output need not be polished yet.

Submit the following items to the FP Draft assignment on Canvas:

A working program (a .py file) that reads the relevant data and computes the substance of the answer to your question.
A brief interim report (in PDF format) containing the following information:
- Describe the progress you’ve made.
- Enumerate the work remaining to be done on your program before it is ready for the final submission.
- If your question (either the motivating question or the specifics of how you refined it) needed to be adjusted in any way, include the old question, the new question, and a justification explaining why you needed to change it.
If your program uses any data files not provided by me in the data directory, upload those to Canvas as well. Do not upload files that are provided in the data directory.

Part 3: Reporting your Results

The final report is roughly modeled after a scientific paper or white paper reporting the results of your investigation. You should think of this document as simultaneously serving a few different audiences: some readers (e.g., a layperson deciding where to live) just want to know your conclusions; others (e.g., a fellow data scientist investigating similar trends) will want to know your methods and might go as far as reading or using your code to reproduce your results.

The report should contain the following sections:

Abstract: A one-sentence summary of the answer to your question.
Introduction: Introduce your question and explain how you formalized / refined it into something answerable with the data. If your question has changed since your proposal, include an explanation of that here as well.
Methods: A short description of the high-level architecture of your code. Describe the main program’s algorithm at a very high level; give enough detail to explain how it answers the question, but do not get into details that are unimportant to a reader who isn’t planning to read your code.
Results: Include the output of your program and interpret the output in terms of the original question. This should include a clear statement of the answer to your original question. If you created any plots, charts, or graphics, this is the place to include them and talk about them.
Discussion: What limitations does your approach have? Did you encounter anything odd or interesting along the way? Do you have any ideas for future analysis to address limitations or to investigate as-yet-unanswered questions?
Code guide: this should contain two subsections; the first is for someone who wants to run your code, while the second is for someone who wants to read your code.
- Usage: Provide instructions for running your code. Be sure to include a list of any data files that need to be available and a description of any command line arguments or other inputs your program expects.
- Architecture: Describe the architecture of your code; this should be a “tour” of the code for a reader who has the code and this description open side-by-side and is interested in understanding how your code works. This should include a description of each function you’ve written and how they fit together into the overall program.

For best results, please make the report as clear and concise (i.e., short) as possible while completely covering all the requested content.

Submitting Your Final Program and Report

Submit your program (as a .py file) and your report (as a .pdf file) to the FP Final Code and Report assignment on Canvas. As with the draft, also include any data files your program reads from, unless they are already included in the data directory.

Finally, please submit the FP Survey assignment on Canvas to let me know how many hours you spent on the project and (optionally) let me know how it went and provide feedback.

Rubric

Program - Functionality (40 points)

Regardless of your question, your program will need to:

Read lines from a CSV file (10)
Extract relevant fields from those lines (10)
Perform necessary computation or aggregation of values from the data (10)
Output nicely formatted results with sufficient labels and context that it can be interpreted without reading the code (10)

Program - Code Quality (25 points)

Your code should be polished with a focus on clarity as well as correctness. Clarity not only includes good variable names and appropriate commenting, but also writing code that is as concise as possible. Deductions will also be made for violating explicit guidelines including but not limited to the following:

program does not use modules other than sys, math, and random unless permission was given
program uses a main guard
main program is < 30 lines
program uses at least two helper functions
functions have proper docstrings
functions do not refer to global variables
commenting is appropriate

Report (20 points)

Report contains all the appropriate information (10)

Report is clear and readable (10)