Winter 2023
In this lab, you’ll get comfortable working in Colab and begin learning the basics of how to manipulate tabular data in Python using the pandas library.
For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
In this lab, you’ll start with a provided notebook and fill in the code cells to complete some analysis of a couple of datasets. Start by downloading the Lab 1 starter notebook - you may need to right-click the link and choose “Save As…” or similar.
Next, open Colab in a web browser. Select the “Upload” tab at the top of the overlay window, and use the “Browse…” button to select the Lab 1 notebook file you downloaded and load it into Colab.
Add a new Text cell above the current top cell (“Part 1…”). In that cell, create a top-level heading (preceded by a single #) with the lab number and your name (e.g., I’d write “DATA 311 Lab 1 - Scott Wehrwein”).
Run the first code cell, either using the triangular “Play” button or the keyboard shortcut Ctrl+Enter. Notice that no output is produced. Add a line in the same cell containing just the variable name data_base_url. Re-run the cell and notice that the value of the last line is displayed below the code cell.
Take a look at the Tools>Keyboard Shortcuts page. There are a lot of shortcuts, and a lot more things you can assign key combinations for! It’s probably worth your time to learn a few of these, since you’ll be working in Colab a lot this quarter. I encourage you to revisit these regularly and pick up a few at a time to continue improving your efficiency.
The purpose of the remainder of this lab is to get you familiar with the basics of using pandas, a Python library for working with tabular data.
The notebook starts with some setup. The first code cell contains the following:
import pandas as pd

# Make the graphs a bit prettier, and bigger
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)

data_base_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_23w/lab1/data/"
The first line imports the pandas module under the shorthand alias pd; this is a near-universal convention, so I recommend following it every time you import pandas.
The next lines just tweak some of the plotting settings to make the graphs prettier; don’t worry about this for now.
Finally, we set a variable data_base_url pointing to the directory on my webpage where the datasets for this lab live. The next code cell calls the read_csv function to read the data from the URL we created above. A couple of things to notice here:
1. The call passes encoding='latin-1' to specify the text encoding format of the CSV file. In my experience, most other CSV files can be read without this argument, because they are encoded using the default encoding (UTF-8).
2. The result is stored in a variable named df. This stands for DataFrame, which is the primary data structure that Pandas uses to represent a table of data.
Run both the first and second cell now.
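To make the encoding argument concrete, here is a minimal sketch; it is not the lab's actual cell. It reads a tiny made-up CSV from memory instead of from the course URL so that it runs offline. The sample contents and the names csv_text, raw_bytes, and demo_df are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical CSV contents with accented characters (made up for illustration).
csv_text = "city,count\nMontréal,3\nQuébec,5\n"

# Encode the text as Latin-1, mimicking a file that was not saved as UTF-8.
raw_bytes = csv_text.encode("latin-1")

# Passing encoding="latin-1" tells read_csv how to decode the bytes.
demo_df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(demo_df)
print(type(demo_df))  # the result of read_csv is a DataFrame
```

If the encoding argument were left out here, pandas would try to decode these bytes as UTF-8 and would most likely fail with a UnicodeDecodeError, which is the usual hint that a file needs an explicit encoding.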
Pro tip: the Runtime menu has some useful options for running multiple cells; of particular note are “Run All” and “Run All Above”.
Although I did some live demos of many of the pandas features you’ll need, you’re unlikely to have retained every detail and some details you’ll need for this lab were intentionally left out. Now and going forward, you will need to become proficient at figuring out how to do things that I haven’t told you how to do. This section provides some tips on how to approach this process.
In a word: Google (or another search engine of your choice). However, using search engines effectively is a skill, so here are a few tips.
Generally, I recommend using whatever resources you can find - learning how to do something by searching the Pandas documentation, Stack Overflow, or the internet at large is simply part of everyday life for a data scientist. Personally, I tend to search for whatever I’m looking to do; when skimming search results, I tend to prefer results in the following order:
1. Official Pandas tutorials
2. Official Pandas documentation
3. Stack Overflow
4. The rest of the internet
Lots of the content on the internet at large is great, but lots of it is also not so great, which is why I prefer the more authoritative sources. The official tutorials are pretty high-quality and easier to read than the documentation, at the expense of completeness. For understanding error messages and debugging weird behaviors, Stack Overflow is great because someone else has probably had your problem before and asked about it on Stack Overflow.
When you do click on a search result, you may find yourself on an overwhelmingly long webpage. If you’re looking for a specific function and the search result didn’t jump you straight to it, it’s often helpful to use your browser’s text search feature to find what you’re looking for on the page. I do this a lot. Ctrl+F will open up a search box in most browsers, where you can type in the name you’re looking for.
To get you started, I recommend checking out a couple of specific Pandas tutorials; these are likely to contain much of the basic material you’ll need for this lab.
I especially recommend skimming through the Getting Started tutorial “What kind of data does pandas handle?” and having “How do I select a subset of a DataFrame?” close at hand. Take a look at the other Getting Started tutorial titles so you know what’s there. Finally, “10 minutes to pandas” is a little more detailed (and, for me anyway, much longer than 10 minutes) but covers a lot of ground and includes useful links to the more thorough documentation in the User Guide. This is a good one to have open for Ctrl+F searches.
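As a rough preview of what the subset-selection tutorial covers, the sketch below shows three common ways to pick out part of a DataFrame. The table and its column names are made up for illustration and have nothing to do with the lab's datasets.

```python
import pandas as pd

# A small made-up table; the column names are hypothetical.
demo_df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan"],
        "year": [1815, 1906, 1912],
        "field": ["math", "computing", "computing"],
    }
)

# Selecting one column by name gives a Series.
years = demo_df["year"]

# Selecting with a list of column names gives a DataFrame.
name_and_field = demo_df[["name", "field"]]

# Selecting rows with a boolean condition keeps only the matching rows.
born_after_1900 = demo_df[demo_df["year"] > 1900]

print(demo_df.head(2))   # first two rows
print(born_after_1900)
```

The official tutorials go into much more depth, including the loc and iloc indexers for selecting rows and columns together.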
Your job for the remainder of this lab is to fill in the code cells to accomplish the goals outlined in each text cell. Each todo item is numbered (1.1 through 1.13 and 2.1 through 2.13). Many code cells are one-liners, while a few might require as many as 3 or 4. If you have more lines than that, you should probably look for a simpler approach.
The analysis for the main lab is closely guided, step-by-step. For a few points of extra credit, do some other interesting analysis: find something interesting in the data and tell me something I don’t know! Include your analysis under a new 2nd-level header cell titled Extra Credit below the existing cells. Explain your analysis clearly in interleaved text cells, and ideally include some basic visualization of your results.
Before downloading your notebook, make sure that all your cells have been run and have the output you intended. I recommend using the Run All option from the Runtime menu to ensure all your outputs are up-to-date. It’s easy to forget that you’ve made code changes and have stale outputs, so look over the freshly computed results and make sure they match your expectations.
Once your notebook is ready, download a copy. From the File menu, go to Download and select Download ipynb to get the notebook in its native format.
Submit the resulting .ipynb file to the Lab 1 assignment on Canvas.
Fill out the Lab 1 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.
There are 26 tasks in this lab; successfully completing each one is worth 2 points. The lab is scored out of a total of 52 points.
Extra Credit
Up to 3 points of extra credit are available for additional analysis. Each extra credit point is exponentially more difficult to get; to earn all 3 points, your analysis needs to be involved, interesting, and well-presented.
This lab is based on parts of the Pandas Cookbook by Julia Evans. Thanks to her for developing it and sharing it under a CC-BY-SA license. This lab is released under the same CC-BY-SA license.