DATA 311 - Lab 1: Jupyter and Pandas - The Basics

Scott Wehrwein

Winter 2023

Introduction

In this lab, you’ll get comfortable working in Colab and begin learning the basics of how to manipulate tabular data in Python using the pandas library.

Collaboration Policy

For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

Colab Setup

In this lab, you’ll start with a provided notebook and fill in the code cells to complete some analysis of a couple datasets. Start by downloading the Lab 1 starter notebook - you may need to right-click the link and choose “Save As…” or similar.

Next, open Colab in a web browser. Select the “Upload” tab at the top of the overlay window, and use the Browse… button to select the lab 1 notebook file you downloaded and load it into Colab.

Preliminaries: Getting Familiar with Colab

Add a new Text cell above the current top cell (“Part 1…”). In that cell, create a top-level heading (preceded by a single #) with the lab number and your name (e.g., I’d write “DATA 311 Lab 1 - Scott Wehrwein”).
Run the first code cell, either using the triangular “Play” button or the keyboard shortcut Ctrl+Enter. Notice that no output is produced. Add a line in the same cell containing just the variable name data_base_url. Re-run the cell and notice that the value of the last line is displayed below the code cell.
Take a look at the Tools>Keyboard Shortcuts page. There are a lot of shortcuts, and a lot more things you can assign key combinations for! It’s probably worth your time to learn a few of these, since you’ll be working in Colab a lot this quarter. I encourage you to revisit these regularly and pick up a few at a time to continue improving your efficiency.
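To illustrate the second step above, here is a trimmed-down sketch of what the end of the first code cell might look like after your edit. It keeps only the URL assignment from the starter cell (the imports and plot settings are left out for brevity), followed by the bare variable name.

```python
# Same URL as in the lab's first code cell.
data_base_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_23w/lab1/data/"

# A bare expression on the last line of a cell: the notebook displays its value
# below the cell. An assignment on the last line displays nothing.
data_base_url
```

Note that this display behavior is specific to notebooks; if you ran the same lines as a plain Python script, the last line would produce no output unless you wrapped it in print().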

Parts 1 and 2: Basic Pandas

The purpose of the remainder of this lab is to get you familiar with the basics of using pandas, a Python library for working with tabular data.

Setup

The notebook starts with some setup. The first code cell contains the following:

import pandas as pd

# Make the graphs a bit prettier, and bigger
import matplotlib.pyplot as plt
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)

data_base_url = "https://fw.cs.wwu.edu/~wehrwes/courses/data311_23w/lab1/data/"

The first line imports the pandas module under the shorthand alias pd; this is a near-universal convention, so I recommend following it every time you import pandas.

The next lines just tweak some of the plotting settings to make the graphs prettier; don’t worry about this for now.

Finally, we set a variable data_base_url pointing to the directory on my webpage where the datasets for this lab live. The next code cell calls the read_csv function to read the data from the URL we created above. A couple things to notice here:

We’ve given it a URL, and the details of fetching the document from the course webpage have been taken care of for us. We could also have given it a path to a CSV file on our local computer if we were running there, or a file on the Colab instance if we’d uploaded one there.
For this particular CSV file, I’ve passed encoding='latin-1' to specify the text encoding format of the CSV file. In my experience, most other CSV files can be read without this argument, because they are encoded using the default encoding (UTF-8).
We’ve assigned the result to a variable called df. This stands for DataFrame, which is the primary data structure that Pandas uses to represent a table of data.
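As a rough sketch of the same call pattern, the snippet below reads a small Latin-1 encoded CSV from an in-memory buffer instead of the course URL, so it runs without a network connection. The column names and values are made up for illustration and are not from the lab's dataset; in the notebook, the first argument is a URL string built from data_base_url rather than a buffer.

```python
import io

import pandas as pd

# Hypothetical CSV content, encoded as Latin-1 bytes (not the lab's dataset).
raw_bytes = "city,cafes\nMontréal,12\nBellingham,7\n".encode("latin-1")

# read_csv accepts a URL, a local path, or a file-like object such as this buffer.
# encoding="latin-1" tells pandas how to decode the bytes into text.
df = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")

print(type(df))   # the result is a pandas DataFrame
print(df.head())  # first few rows of the table
```

If you drop the encoding argument here, pandas falls back to UTF-8 and fails on the accented character, which is the kind of error that tells you a file needs an explicit encoding.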

Run both the first and second cell now.

Pro tip: the Runtime menu has some useful options for running multiple cells; of particular note are “Run All” and “Run All Above”.

Tips: Teaching Yourself to Teach Yourself

Although I did live demos of many of the pandas features you’ll need, you’re unlikely to have retained every detail, and some details you’ll need for this lab were intentionally left out. Now and going forward, you will need to become proficient at figuring out how to do things that I haven’t told you how to do. This section provides some tips on how to approach this process.

In a word: Google (or another search engine of your choice). However, using search engines effectively is a skill, so here are a few tips.

Which Search Results?

Generally, I recommend using whatever resources you can find - learning how to do something by searching the Pandas documentation, Stack Overflow, or the internet at large is simply part of everyday life for a data scientist. Personally, I tend to search for whatever I’m looking to do; when skimming search results, I tend to prefer results in the following order:

1. Official Pandas tutorials
2. Official Pandas documentation
3. Stack Overflow
4. The rest of the internet

Lots of the content on the internet at large is great, but lots of it is also not so great, which is why I prefer the more authoritative sources. The official tutorials are pretty high-quality and easier to read than the documentation, at the expense of completeness. For understanding error messages and debugging weird behaviors, Stack Overflow is great because someone else has probably had your problem before and asked about it on Stack Overflow.

Life is short, and webpages are long. Ctrl+F is your friend.

When you do click on a search result, you may find yourself on an overwhelmingly long webpage. If you’re looking for a specific function and the search result didn’t jump you straight to it, it’s often helpful to use your browser’s text search feature to find what you’re looking for on the page. I do this a lot. Ctrl+F will open up a search box in most browsers, where you can type in the name you’re looking for.

Recommended Resources for Getting Started

To get you started, I recommend checking out a couple of specific Pandas tutorials; these have a high likelihood of containing much of the basic material you’ll need for this lab.

I especially recommend skimming through the Getting Started tutorial What kind of data does pandas handle?, and having How do I select a subset of a DataFrame? close at hand. Take a look at the other Getting Started tutorial titles so you know what’s there. Finally, 10 minutes to pandas is a little more detailed (and, for me anyway, takes much longer than 10 minutes) but covers a lot of ground and includes useful links to the more thorough documentation in the User Guide. This is a good one to keep open for frequent Ctrl+F searches.
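To give a flavor of what the subset-selection tutorial covers, here is a minimal sketch of a few common selection patterns. The DataFrame and its column names are hypothetical, made up for illustration rather than taken from the lab's datasets, and the tutorials cover more options than are shown here.

```python
import pandas as pd

# Hypothetical table, made up for illustration (not the lab's dataset).
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "city": ["London", "New York", "London"],
})

ages = df["age"]                 # a single column name gives a Series
name_age = df[["name", "age"]]   # a list of column names gives a DataFrame
over_40 = df[df["age"] > 40]     # a boolean mask keeps the matching rows

# .loc selects rows (here by mask) and columns (here by name) in one step.
london_names = df.loc[df["city"] == "London", "name"]

print(over_40)
print(london_names.tolist())
```

Each of these is a one-liner, which is about the size of answer most of the lab's tasks are looking for.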

Your Tasks: A Pandas Scavenger Hunt

Your job for the remainder of this lab is to fill in the code cells to accomplish the goals outlined in each text cell. Each todo item is numbered (1.1 through 1.13 and 2.1 through 2.13). Many code cells are one-liners, while a few might require as many as 3 or 4. If you have more lines than that, you should probably look for a simpler approach.

Extra Credit

The analysis for the main lab is closely guided, step-by-step. For a few points of extra credit, do some other interesting analysis: find something interesting in the data and tell me something I don’t know! Include your analysis under a new 2nd-level header cell titled Extra Credit below the existing cells. Explain your analysis clearly in interleaved text cells, and ideally include some basic visualization of your results.

Submitting your work

Before downloading your notebook, make sure that all your cells have been run and have the output you intended. I recommend using the Run All option from the Runtime menu to ensure all your outputs are up-to-date. It’s easy to forget that you’ve made code changes and have stale outputs, so look over the freshly computed results and make sure they match your expectations.
Once your notebook is ready, download a copy. From the File menu, go to Download and select Download ipynb to get the notebook in its native format.
Submit the resulting .ipynb file to the Lab 1 assignment on Canvas.
Fill out the Lab 1 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.

Rubric

There are 26 tasks in this lab; successfully completing each one is worth 2 points. The lab is scored out of a total of 52 points.

Extra Credit

Up to 3 points are available for the extra credit analysis. Each extra credit point is exponentially more difficult to get; in order to get 3 points your analysis needs to be involved, interesting, and well-presented.

Acknowledgements

This lab is based on parts of the Pandas Cookbook by Julia Evans. Thanks to her for developing it and sharing it under a CC-BY-SA license. This lab is released under the same CC-BY-SA license.