Winter 2023
The last time I taught CSCI 141, my students did a data-science-flavored final project. Their task was to develop and refine a data question that could be answered using weather records provided by NOAA. For example, my question was:
Is it cloudy more often in the winter in Bellingham, WA than in Ithaca, NY (where I moved here from)?
The refinement of the question is one that can be unambiguously answered by the available data. In this case, I framed it like this:
In the months of December and January through March in the year 2020, did Bellingham or Ithaca have more hourly observations where the sky was 7/8ths or more covered by clouds?
My 141 students were tasked with writing a Python program to answer their question - without using any of the fancy libraries we’re using in this class! In this lab, we’ll do a similar thing - answer a bunch of weather questions based on data - but we’ll do it using pandas.
For this lab, you may (and are encouraged to) spend the lab period working together with a partner. Together means synchronously and collaboratively: no divide and conquer. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate with classmates in accordance with the individual assignment collaboration policy (i.e., you can discuss ideas and strategies but not share or view code). Your submission must acknowledge anyone you collaborated with, and for which parts of the lab (this should be included as a statement at the top of your notebook(s)).
For this project, we’ll be drawing from a dataset called Local Climatological Data (LCD), which is maintained by the National Oceanic and Atmospheric Administration (NOAA). You can find information about the datasets available from NOAA here, but we’ll focus for now on data recorded from land-based weather stations. The data we’ll work with consists primarily of recordings that are taken once per hour around the clock, every day of the year.
The meanings of the columns available in the data are spelled out in detail in the LCD Documentation. Spend a few minutes reading through to see which columns might be relevant. Here are a couple of specific bits of domain knowledge that may be helpful:

- Temperature appears as HourlyDryBulbTemperature, DailyMaximumDryBulbTemperature, or some other dry bulb measurement.
- Pressure (HourlyStationPressure) and wind speed (HourlyWindSpeed) are measured at the time the observation is taken.
- Precipitation columns (HourlyPrecipitation) give the liquid amount of precipitation that fell during the observation period (e.g., during the hour preceding the observation). This means snow is melted before being measured.
- Some columns, like HourlySkyConditions, have more complicated contents, but you can use them too - you’ll just need to read up on what they mean in the documentation.

This data mixes hourly, daily, and monthly information, so different rows of the table will have values in different columns. For example, for each hourly observation, all of the columns beginning with Daily will simply be empty.
A useful way to filter out the rows that will have the data you’re interested in is to look at the REPORT_TYPE column. For example, rows with REPORT_TYPE value FM-15 correspond to the regular hourly observations. If you consider only these rows, all the Hourly columns should have values.
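As a sketch, that filtering step might look like the following. The DataFrame here is a tiny hand-made stand-in for one station’s LCD file - the values, and the SOD daily-summary row, are illustrative rather than real NOAA data:

```python
import numpy as np
import pandas as pd

# A tiny hand-made stand-in for one station's LCD data: two hourly
# observations plus one daily-summary row (values are illustrative).
df = pd.DataFrame({
    "REPORT_TYPE": ["FM-15", "FM-15", "SOD"],
    "HourlyDryBulbTemperature": [41.0, 43.0, np.nan],
    "DailyMaximumDryBulbTemperature": [np.nan, np.nan, 45.0],
})

# Keep only the regular hourly observations.
hourly = df[df["REPORT_TYPE"] == "FM-15"]

print(len(hourly))                                            # 2
print(hourly["HourlyDryBulbTemperature"].notna().all())       # True
print(hourly["DailyMaximumDryBulbTemperature"].isna().all())  # True
```

Note that in the filtered rows every Daily column is empty, just as described above.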
To make things easier for you, I’ve downloaded LCD data for about 18 locations across the U.S. over the entire year 2020.
You can find the pre-downloaded data files here.
The URLs of the files in that directory can be given to pd.read_csv to load the data directly into your notebook.
For the sake of bandwidth, load the data you need in a single cell at
the top of your notebook so you only need to download the data once each
time you’re working on the lab.
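That single loading cell might look something like this sketch. The base URL and station filenames below are placeholders - substitute the actual directory URL and filenames from the lab page:

```python
import pandas as pd

# Placeholder for the actual directory URL given on the lab page.
BASE_URL = "https://example.edu/data/lcd-2020"

def station_url(name: str) -> str:
    """Build the CSV URL for one station file (name without extension)."""
    return f"{BASE_URL}/{name}.csv"

# Download each file once, in this one cell at the top of the notebook.
# (These station filenames are hypothetical; use the real ones.)
# bellingham = pd.read_csv(station_url("WA_Bellingham"), low_memory=False)
# portland   = pd.read_csv(station_url("OR_Portland"), low_memory=False)

print(station_url("WA_Bellingham"))
```

Putting the downloads in one cell means you can re-run the rest of the notebook without re-fetching the data.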
If you’re interested in examining data from other locations or timeframes, I’ve included instructions below for how to do that.
If you want to answer a question that requires other weather stations and/or other timeframes, you’ll need to retrieve your own data. Here’s how I downloaded the files that I provided to you:
If your submission relies on data files not included in the pre-downloaded data, upload them to Canvas alongside your notebook. When reading the files in your notebook, your code should assume the CSV files are in the same directory as the notebook; i.e., to load a file called NV_LasVegas.csv, you’d call pd.read_csv("NV_LasVegas.csv").
Use the pre-downloaded 2020 weather data to perform the following three pre-defined analyses. Create a separate notebook for each question you address.
- In lab2_q1.ipynb, determine which of the provided cities was the rainiest in 2020.
- In lab2_q2.ipynb, find out how often Bellingham was overcast in each month of 2020.
- In lab2_q3.ipynb, compare Fall weather in Bellingham and Portland based on temperature, humidity, rainfall, and wind. Use the “climatological” definition of Fall, which runs from the beginning of September through the end of November.

Perform one additional analysis that is substantively different from the above (this doesn’t mean you can’t use the same columns, just that you are getting at different questions and, ideally, using different techniques). Feel free to collect additional data beyond what I’ve pre-downloaded; this would, for example, enable investigation of trends over years (e.g., which of the past 10 summers in Bellingham was hottest?). Submit your analysis in a fourth notebook titled lab2_q4.ipynb.
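One hedged sketch of the kind of computation the rainiest-city question calls for, run here on made-up rows rather than the real files: LCD precipitation values load as strings, and trace amounts are recorded as "T", so coerce to numeric before summing.

```python
import pandas as pd

# Made-up rows standing in for the per-city LCD files (not real NOAA data).
df = pd.DataFrame({
    "city": ["Bellingham", "Bellingham", "LasVegas"],
    "REPORT_TYPE": ["FM-15", "FM-15", "FM-15"],
    "HourlyPrecipitation": ["0.12", "T", "0.00"],
})

# "T" (trace) and other non-numeric codes become NaN, which sum() ignores.
rain = pd.to_numeric(df["HourlyPrecipitation"], errors="coerce")
totals = df.assign(rain=rain).groupby("city")["rain"].sum()

print(totals.idxmax())  # "Bellingham" -- the rainiest city in this toy data
```

Whether trace amounts should count as zero (as they do here) is exactly the kind of judgment call your notebook should state explicitly.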
Perform an additional in-depth analysis of your choice; ideally this will involve multiple data columns, multiple cities, and/or analysis by time or date. More credit will be awarded for analyses that result in insights that are surprising or non-obvious.
Each of your analyses should clearly and convincingly tell a story.
For each notebook, make sure that all your cells have been run and have the output you intended, then download a copy in .ipynb format. Create a single zip file containing all of your notebooks, following the naming conventions given above, where spelling, spacing, and capitalization matter. Do not include any of the provided (pre-downloaded) data files; be sure that the notebook reads pre-downloaded files from their URLs. If your analysis uses data other than the pre-downloaded files, include only those extra data files in your zip file. Submit your zip file to the Lab 2 assignment on Canvas.
Finally, fill out the Lab 2 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.
Each analysis in Part 1 is worth 15 points, while your Part 2 analysis is worth 10 points. For full credit, your analysis will be correct, convincing, and clear.
Extra Credit
Up to 5 points of extra credit are available for the extra credit analysis, judged on the same criteria as above, but more stringently. Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 5 points.