Spring 2021
Update: You can find the collected Abstracts summarizing everyone’s results here!
Important Dates:
Note: Slip days can be used on the final project, but a slip day used on one of the interim deadlines does not carry to later ones.
Does it rain more often in Olympia than in Bellingham? Is Ithaca, NY (where I went to graduate school) actually any less cloudy than Bellingham is in the winter? It also seems like it’s windy more often here than anywhere I’ve lived before; is that true?
These are all questions that I find it easy to wonder about - and they’re also questions that could be carefully refined into a form that can be definitively answered with the right data. Fortunately, the United States is home to a large network of weather observation stations that record standardized information about the current weather in locations across the country on an hourly basis.
The goal of this project is to formulate a question that can be answered using weather data from one or more locations in the US, and develop a Python program to find a quantitative answer to your question.
For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.
To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. covering the entire year 2020.
You can find the pre-downloaded data files here.
If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.
The CSV files are large and a little messy, but there’s a lot of useful information in there. Here’s a list of just some of the columns that might be of interest:
STATION
DATE
REPORT_TYPE
DailyAverageDewPointTemperature
DailyAverageDryBulbTemperature
DailyAverageRelativeHumidity
DailyAverageSeaLevelPressure
DailyAverageStationPressure
DailyAverageWetBulbTemperature
DailyAverageWindSpeed
DailyCoolingDegreeDays
DailyDepartureFromNormalAverageTemperature
DailyHeatingDegreeDays
DailyMaximumDryBulbTemperature
DailyMinimumDryBulbTemperature
DailyPeakWindDirection
DailyPeakWindSpeed
DailyPrecipitation
DailySnowDepth
DailySnowfall
DailySustainedWindDirection
DailySustainedWindSpeed
DailyWeather
HourlyAltimeterSetting
HourlyDewPointTemperature
HourlyDryBulbTemperature
HourlyPrecipitation
HourlyPresentWeatherType
HourlyPressureChange
HourlyPressureTendency
HourlyRelativeHumidity
HourlySeaLevelPressure
HourlySkyConditions
HourlyStationPressure
HourlyVisibility
HourlyWetBulbTemperature
HourlyWindDirection
HourlyWindGustSpeed
HourlyWindSpeed
MonthlyAverageRH
Sunrise
Sunset
TStorms
The meanings of these columns are spelled out in detail in the LCD Documentation, but here are a few points to get you started:
For temperature, you probably want HourlyDryBulbTemperature, DailyMaximumDryBulbTemperature, or some other dry bulb measurement.
Pressure (HourlyStationPressure) and wind speed (HourlyWindSpeed) are measured at the time the observation is taken.
Precipitation columns (e.g., HourlyPrecipitation) give the liquid amount of precipitation that fell during the observation period (e.g., during the hour preceding the observation). This means snow is melted before being measured.
Some columns, such as HourlySkyConditions, have more complicated contents, but you can use them too - you’ll just need to read up on what they mean in the documentation.
Because this data mixes hourly, daily, and monthly information, different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.
A useful way to pick out the rows that will have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
You may find it helpful to look at the CSV files in a spreadsheet program, to help you understand what’s there and how they’re laid out. I recommend opening them in Microsoft Excel, LibreOffice Calc, or Google Sheets (you can upload a CSV file to Google Drive, then right-click it and select Open In > Google Sheets).
Commas in CSV Fields
Some data files have CSV fields that contain commas; this is tricky because the commas are part of the data, but they look just like the commas used to separate the values. The typical solution to this is to enclose the values in quotes (in this case, double quotes), so "SNOW, FOG" is considered a single value, rather than the two values "SNOW and FOG". This makes splitting on commas insufficient to get the right values back. To help with this, I’ve written a function that handles this properly. You can download the code for that function here: splitty.py. You are free to use this function in your solution.
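To make the quoting problem concrete, here is a minimal sketch of a quote-aware splitter. It is not the splitty.py function mentioned above, just an illustration: the name split_csv_line and the sample line (including the station ID) are made up, and the sketch assumes fields never contain escaped double quotes.

```python
def split_csv_line(line):
    """Split one line of CSV text into a list of field strings.

    Commas inside double quotes are treated as data, not separators.
    The surrounding quotes are removed from the returned values.
    Assumption: quoted fields never contain escaped quotes ("").
    """
    fields = []
    current = ""
    in_quotes = False
    for ch in line.rstrip("\n"):
        if ch == '"':
            in_quotes = not in_quotes   # entering or leaving a quoted field
        elif ch == "," and not in_quotes:
            fields.append(current)      # a separator ends the current field
            current = ""
        else:
            current += ch
    fields.append(current)              # the last field has no trailing comma
    return fields


# Made-up sample line, for illustration only.
print(split_csv_line('12345,2020-01-01T00:53:00,"SNOW, FOG",30'))
```

Calling line.split(",") on the same text would return five pieces instead of four, which is exactly the problem described above.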
If you want to answer a question that requires other weather stations and/or other timeframes, you’re encouraged to retrieve your own data. Here’s how I downloaded the files that I provided to you:
Submission: PDF upload to Canvas by the start of class on Wednesday, 5/19.
Start with curiosity: what do you want to know? Ground your curiosity in the available data: try to come up with a question that pertains to one or more of the columns in the data files described in the prior section.
Example 1: Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY? I read up on the HourlySkyConditions field; this will tell me about the amount of cloud cover on an hourly basis, so I think I should be able to investigate this question using the available data.
Example 2: How do the daily temperature variations change across seasons of the year in Bellingham? I should be able to use the DailyMaximumDryBulbTemperature and DailyMinimumDryBulbTemperature fields to investigate this.
At this point, we need to get more specific: we need to re-phrase our question precisely enough that it has a concrete answer that can be found in, or calculated from, the data.
Example 1: The HourlySkyConditions column includes a number from 0 to 10; if it’s between 0 and 8, it indicates the number of “oktas” (eighths) of the sky that are covered with clouds. If it’s 9 or 10, that means the sky is obscured somehow. Given this, I’ll decide on the following definitions:
I can now phrase my question precisely as follows:
In the months from December through March in the year 2020, did Bellingham or Ithaca have more hourly observations that were cloudy (i.e., 7 or more oktas covered with clouds)?
Example 2: Let’s define the “temperature swing” for a day as the difference between the daily maximum temperature and the daily minimum temperature (both dry bulb). One way to understand how the temperature swing varies across seasons is to calculate the average temperature swing for each month of the year. We can then look at a table (or a plot) of this data and perhaps we’ll see a trend. So my goal is then:
Calculate the average daily temperature swing (i.e., difference in daily dry bulb max and min) for each month of the year 2020.
Once you’ve formulated your question as described in this section, you will submit a Question Proposal to Canvas so I can review your question and make sure I agree that it’s sufficiently precise and answerable.
Write a short description and justification of your question; this probably won’t require more than a few paragraphs. Be sure to include the following information:
Submit your proposal in PDF format to the FP Proposal assignment on Canvas.
At this point, your question should be precise enough that you could hand it to a competent programmer and they could answer it by writing a program. There’s no ambiguity about what you’re looking for; just the challenge of extracting the desired information from the data.
The next step you should take is to write pseudocode for your program. Start with a focus on the high-level here: what steps need to be taken to solve the problem?
The following is some pseudocode for Example 2; keep in mind that there are many ways this could be done, so even if your question is similar to this one, your pseudocode likely won’t look the same:
create a list of 12 temperature swing totals, one per
    month, initialized to zero
for each row of the CSV file:
    if it's a daily observation:
        calculate the temperature swing
        add it to the appropriate monthly total list element
for each monthly total:
    divide it by the number of days in the given month to compute the
        average temperature swing
print a table of the calculated monthly averages
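As a rough illustration of how pseudocode like this might turn into Python, here is a sketch of the core calculation. It rests on several assumptions that you should check against the real data: each row has already been loaded into a dict keyed by column name, the DATE value begins with YYYY-MM, and a row counts as a daily observation when both daily temperature fields parse as numbers. Unlike the pseudocode, it divides by the number of days that had usable data rather than the calendar length of the month. The function names and sample rows are made up.

```python
def to_float(text):
    """Return text as a float, or None if it is empty or not a plain number."""
    try:
        return float(text)
    except ValueError:
        return None


def monthly_average_swings(observations):
    """Return a list of 12 average daily temperature swings, January first.

    observations: list of dicts mapping column names to string values.
    A month with no usable daily rows gets None instead of an average.
    """
    totals = [0.0] * 12
    counts = [0] * 12
    for obs in observations:
        high = to_float(obs["DailyMaximumDryBulbTemperature"])
        low = to_float(obs["DailyMinimumDryBulbTemperature"])
        if high is None or low is None:
            continue  # not a daily row, or the value is unusable
        month = int(obs["DATE"][5:7])  # assumes DATE starts with YYYY-MM
        totals[month - 1] += high - low
        counts[month - 1] += 1
    averages = []
    for i in range(12):
        averages.append(totals[i] / counts[i] if counts[i] > 0 else None)
    return averages


# Made-up rows: two daily rows in January, one hourly row, one daily row in February.
sample = [
    {"DATE": "2020-01-01T23:59:00", "DailyMaximumDryBulbTemperature": "48",
     "DailyMinimumDryBulbTemperature": "40"},
    {"DATE": "2020-01-02T23:59:00", "DailyMaximumDryBulbTemperature": "50",
     "DailyMinimumDryBulbTemperature": "40"},
    {"DATE": "2020-01-02T10:53:00", "DailyMaximumDryBulbTemperature": "",
     "DailyMinimumDryBulbTemperature": ""},
    {"DATE": "2020-02-01T23:59:00", "DailyMaximumDryBulbTemperature": "55",
     "DailyMinimumDryBulbTemperature": "41"},
]
print(monthly_average_swings(sample)[:2])
```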
At this point, you’re ready to start coding up your solution. This section provides guidelines for how your program should behave, suggestions for how to go about developing it, and style guidelines that you should follow. Read this section carefully at least twice: once before you start programming and once after you finish to make sure you’ve met all of the stated requirements.
Your program should read data from LCD CSV files like the ones I’ve provided. If you use the files I provided, make sure the file names match mine. If you use other data files, name them by a similar convention (ST_City.csv) and include them with your submission to Canvas. Your program should read all files under the assumption that they are available in the same folder as your code.
If your program can take user input (e.g., to decide which cities to compare), this input must be read using command line arguments, not the input function. If your program takes command line arguments, the comment at the top of your program should include an additional component called Usage that describes how a user can run your program, explains what the command line arguments should be, and provides at least one example of how the program might be run.
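Here is one possible shape for a program that takes its file names from the command line, including a Usage component in the top comment. The program name compare_clouds.py, the helper get_filenames, and the rule that at least two files are required are all invented for this sketch; adapt them to your own question.

```python
# Author: (your name)
# Date: (date)
# Description: Compares cloud cover between cities.
# Usage: python3 compare_clouds.py FILE1.csv FILE2.csv [...]
#   Example: python3 compare_clouds.py WA_Bellingham.csv NY_Ithaca.csv
import sys


def get_filenames(argv):
    """Return the list of CSV filenames given on the command line.

    argv: the full argument list (argv[0] is the program name).
    Returns an empty list if fewer than two files were given.
    """
    filenames = argv[1:]
    if len(filenames) < 2:
        return []
    return filenames


if __name__ == "__main__":
    filenames = get_filenames(sys.argv)
    if not filenames:
        print("Usage: python3 compare_clouds.py FILE1.csv FILE2.csv [...]")
    else:
        for name in filenames:
            print("Would process", name)
```

Passing sys.argv into the helper, rather than reading it inside the function, keeps the function easy to test on its own with a hand-written list.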
The output of your program should be nicely formatted and present its results with context, ideally in tabular form; the specifics will depend on your question, but as an example, here’s the output from my Example 1 program:
This table shows the percentage of hourly observations in which the
sky was at least a given number of eighths covered in clouds in each city.
City \ Eighths sky cover       >= 4     >= 5     >= 6     >= 7     >= 8
WA_Bellingham.csv            72.12%   67.17%   67.17%   67.17%   54.89%
NY_Ithaca.csv                75.60%   73.06%   73.06%   73.06%   62.72%
WA_Seattle.csv               87.36%   78.95%   78.95%   78.95%   54.62%
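If you want columns that line up like this, one approach is to pad every cell to a fixed width using the string methods ljust and rjust. The sketch below is only one way to do it; the function names and the default widths are arbitrary choices, and it is not the code that produced the output above.

```python
def format_header(title, thresholds, label_width=26, cell_width=9):
    """Return the header row: a title, then one '>= n' cell per threshold."""
    cells = [f">= {t}".rjust(cell_width) for t in thresholds]
    return title.ljust(label_width) + "".join(cells)


def format_row(label, percentages, label_width=26, cell_width=9):
    """Return one table row: a left-justified label, then right-justified
    percentages with two decimal places."""
    cells = [f"{p:.2f}%".rjust(cell_width) for p in percentages]
    return label.ljust(label_width) + "".join(cells)


print(format_header("City \\ Eighths sky cover", [4, 5, 6, 7, 8]))
print(format_row("WA_Bellingham.csv", [72.12, 67.17, 67.17, 67.17, 54.89]))
```

Because every cell has the same width, the numbers stay aligned no matter how long the file names are, as long as they fit within label_width.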
You are likely to learn (possibly unexpected) things about the data once you start processing it, so I strongly recommend that you test as early as possible. Don’t wait until you have the full program written out! Here’s an example demonstrating why that’s a good idea.
When implementing code for Example 1, I started with a loop like the one in the pseudocode above. I quickly realized that I’d need to be looking at many fields of each row even just to decide whether to include it or not. To make this easier, I loaded each row into a dictionary with the column headers as keys and the row’s values as values.
To make sure that was working, I looped over these and printed out the HourlySkyConditions field for all observations with REPORT_TYPE equal to FM-15. I quickly realized that HourlySkyConditions was still missing in a lot of rows; looking at the raw data I noticed that the SOURCE column was 4 for all the observations that were missing HourlySkyConditions, and 7 when it wasn’t missing. So I just added another check to ignore observations with SOURCE not equal to 7.
Notice that this would have been harder to track down and discover if I’d finished my whole program first - the key thing that made this easy was that I decided to print out all the sky conditions for each record I wasn’t skipping.
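A sketch of that workflow might look like the following. The helper names are made up, the sample rows are invented rather than taken from the real files, and the calls to strip() are just a precaution against stray spaces in the values.

```python
def make_observation(headers, values):
    """Pair each column header with the matching value from one row."""
    observation = {}
    for i in range(len(headers)):
        # Rows shorter than the header get empty strings for missing columns.
        observation[headers[i]] = values[i] if i < len(values) else ""
    return observation


def is_usable_hourly(observation):
    """Return True for regular hourly rows (REPORT_TYPE FM-15) with SOURCE 7."""
    return (observation["REPORT_TYPE"].strip() == "FM-15"
            and observation["SOURCE"].strip() == "7")


# Invented rows, already split into lists of field values.
headers = ["REPORT_TYPE", "SOURCE", "HourlySkyConditions"]
rows = [
    ["FM-15", "7", "OVC:08 12"],
    ["FM-15", "4", ""],
    ["FM-12", "4", ""],
]

for values in rows:
    observation = make_observation(headers, values)
    if is_usable_hourly(observation):
        # Print early and often: this is how surprises in the data show up.
        print(repr(observation["HourlySkyConditions"]))
```

Printing with repr makes empty strings and stray whitespace visible, which helps when a field looks fine at a glance but is actually blank.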
The complexity of your task is sufficiently high that you should not write your code as a single monolithic program. You should write a short (no more than about 30 lines) main program inside a main guard (i.e., if __name__ == "__main__"). Furthermore, your code should not be overly repetitious - if you’re performing some action more than once or twice, create a function to perform that action and call it as many times as you need it. In my Example 1 solution, no single function has more than 15 lines of code, not counting their docstrings; you should strive for many short, self-contained helper functions that make the main program as clear and readable as possible.
Your program should have at least two helper functions of your own devising. The helper functions should have precise docstrings describing their behavior and the meanings of their inputs and outputs. To help you find bugs early and easily, test your functions separately from the rest of the program. Here’s another anecdotal example of why this is a good idea, based on my own experience coding up Example 1:
I made several more interesting discoveries along the way towards a working solution. For example, the documentation is simply incorrect about the format of the sky conditions column: the docs say it’s ccc:ll-xxx but it’s actually ccc:ll xx (a space instead of a dash). Fortunately, I was already writing a standalone function to parse the sky conditions column and I tested it independently of the rest of my program, so it was easy to identify where things were going wrong and fix it.
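To give a flavor of what a standalone, separately tested parsing function could look like, here is a sketch. It is not my actual solution, and it makes assumptions you would need to verify against the documentation and the data: that a value may hold several ccc:ll xx layers separated by spaces, and that the largest ll among them is the number you care about. The function name and the sample strings are made up, and how to treat the values 9 and 10 is left for you to decide.

```python
def max_sky_cover(sky_conditions):
    """Return the largest coverage number found in a sky conditions string,
    or None if no layer could be parsed.

    Assumes layers look like "ccc:ll xx" and are separated by spaces,
    e.g. "FEW:02 55 BKN:07 120".
    """
    largest = None
    for token in sky_conditions.split():
        if ":" not in token:
            continue  # the xx part of a layer, not a ccc:ll code
        amount = token.split(":")[1]
        if amount.isdigit():
            value = int(amount)
            if largest is None or value > largest:
                largest = value
    return largest


# Quick checks run on their own, before the function is used anywhere else.
print(max_sky_cover("FEW:02 55 BKN:07 120"))
print(max_sky_cover(""))
```

Testing a function like this on a handful of hand-written strings is what makes a format surprise, such as a space where a dash was expected, show up immediately.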
Your program should not import any modules except sys, math, and random (you might not need all, or even any, of these). If you have a use for other modules, ask me and I can grant you permission to use them on a case-by-case basis.
Code inside a function should not refer to variables defined outside that function; any information a function needs should be passed into that function as an argument.
As usual, your program should have a comment at the top listing the author, date, and a short description of the program’s purpose.
Any code that is not self-explanatory should be explained using concise inline comments.
Minimize the use of mysterious hard-coded constants. For example:
# not great:
val1 = csv_row[19] # I made up 19, no idea what column it actually represents
# better:
dry_bulb_index = headers.index("HourlyDryBulbTemperature") # done early in the program
# ...
dry_bulb = csv_row[dry_bulb_index] # done anytime you need to access the hourly temp
# also better:
# read each CSV row into a dict with headers as keys
dry_bulb = observation["HourlyDryBulbTemperature"]
It’s possible that you will discover that your original question cannot be answered as you expected due to limitations in the data or other hiccups that come with interacting with real-world data.
If you aren’t sure how to proceed, come talk to me or send me an email. If you know how to modify your question to a new one that follows the original guidelines for the Question Proposal, you may do so and explain it in your interim report without consulting me. Just keep in mind that your final grade will depend on whether you answered the question in your original proposal unless your interim report provides a new question and solid justification for the change. If in doubt, ask me to confirm that your revised question is good to go.
Before the final deadline, you will submit a draft program and interim report. At this point, your program need not be fully polished. You may also focus on getting the code working for a limited set of data (for example, if you’re measuring something over five years, it’s fine to have confirmed that the code measures the quantity of interest for one year). The program’s output need not be polished yet.
Submit the following items to the FP Draft assignment on Canvas:
Your draft program (as a .py file) that reads the relevant data and computes the substance of the answer to your question.
The final report is roughly modeled after a scientific paper or white paper reporting the results of your investigation. You should think of this document as simultaneously serving a few different audiences: some readers (e.g., a layperson deciding where to live) just want to know your conclusions; others (e.g., a fellow data scientist investigating similar trends) will want to know your methods and might go as far as reading or using your code to reproduce your results.
The report should contain the following sections:
Abstract: A one-sentence summary of the answer to your question.
Introduction: Introduce your question and explain how you formalized / refined it into something answerable with the data. If your question has changed since your proposal, include an explanation of that here as well.
For best results, please make the report as clear and concise (i.e., short) as possible while completely covering all the requested content.
Submit your program (as a .py file) and your report (as a .pdf file) to the FP Final Code and Report assignment on Canvas. As with the draft, also include any data files your program reads from, unless they are already included in the data directory.
Finally, please submit the FP Survey assignment on Canvas to let me know how many hours you spent on the project and (optionally) let me know how it went and provide feedback.
Regardless of your question, your program will need to:
Your code should be polished with a focus on clarity as well as correctness. Clarity not only includes good variable names and appropriate commenting, but also writing code that is as concise as possible. Deductions will also be made for violating explicit guidelines including but not limited to the following:
Importing modules other than sys, math, and random unless permission was given
Report contains all the appropriate information (10)
Report is clear and readable (10)