Fall 2021
In this lab, you’ll get comfortable working in the Jupyter Notebook environment and begin learning the very basics of how to manipulate tabular data in Python using the pandas
library.
For this lab, you may (are encouraged) to spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
Today, you’ll need to complete the following steps to get things set up to work with Jupyter notebooks in the CS lab environment.
Open up a terminal (you can press the “windows key” on the keyboard and type “Terminal” to do this) and enter the following commands, one at a time. I’ll demo this, but the commands are here for your reference.
cd # change to your home directory, in case you're not al ready there
mkdir 311 # create a folder for all your 311 materials
cd 311 # move into that folder
python3 -m venv data311_env # create a virtual environment
source data311_env/bin/activate # activate the virtual environment
pip install jupyter pandas numexpr matplotlib # install some packages
jupyter notebook # launch the jupyter notebook server; a browser will open
Once you’ve completed the above steps, you just need to do the following three steps to get the notebook server running again so you can resume work on an existing notebook, or begin working on a new one:
cd ~/311 # move into your 311 directory
source data311_env/bin/activate # activate the virtual environment
jupyter notebook # launch the notebook server; a browser will open
First, create a new notebook by selecting the New drop-down menu at the top right and choosing Python 3. This opens a new browser tab with a blank notebook. Before doing anything else, click the notebook title (which has defaulted to something like Untitled) and rename it to lab1
.
Allow you to intersperse formatted text with code.
Type your markdown syntax, then “run” the cell to see it rendered with formatting.
Basic markdown formatting:
# Heading 1
This markdown block has:
* A bulleted list
* Some *italicized* text
* Some **bold** text
* A line with some `code` (those are backticks - the key just below Esc)
## Heading 2
Also
1.A numbered list
2. A second list element
renders to:
This markdown block has:
code
(those are backticks - the key just below Esc)Also:
Go to the Help menu and take the User Interface Tour to get familiar with the basics of how to interact with a Jupyter Notebook.
Change the empty cell at the top of the notebook to Markdown using the “Code” dropdown type in a title (e.g., DATA 311 Lab 1) preceded by #
to make it a heading. On the next line, include your name as a second-level heading by preceding it with ##
. Press the Run button to render the formatting of the markdown cell.
Create a code cell. In it, enter the following assignment statement:
= 'https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/avengers/avengers.csv' data_url
Run the cell - notice that no output is produced. Next, add a line in the same cell containing just the variable name data_url
. Re-run the cell and notice that the value of the last line is displayed below the code cell.
Now that you’ve created and run a markdown cell and a code cell, press h
and spend a few minutes looking over the keyboard shortcuts. Learning to efficiently navigate, edit, and run notebook cells will be a very good investment for this course, since we’ll be doing pretty much all of our coding and writing in Jupyter notebooks.
Create a new markdown cell above the code cell but below the title cell. In it, make two bulleted lists, each listing two keyboard shortcuts:
I recommend regularly revisiting the keyboard shortcuts and continuing to learn more as you get more comfortable working in notebooks.
The purpose of the remainder of this lab is to get you familiar with the basics of using pandas
, a Python library for working with tabular data. Rather than tell you exactly what to type, this portion of the lab is formatted as a scavenger hunt. I’ll get you started by showing you how to load a table of data, demo the process of finding out how to solve one or two tasks live, then send you off to figure out how to do a bunch of simple manipulations. Don’t hesitate to follow your curiosity and explore beyond what I’ve asked for - we’ll be learning more about Pandas over the next couple weeks, and likely continue discovering new things throughout the quarter, so time spent playing around now will almost certainly pay off later!
Let’s get set up to use Pandas to interact with some data. Below your keyboard shortcuts markdown cell, create a new code cell and paste the following code into it:
import pandas as pd
= pd.read_csv(data_url, encoding='latin-1')
df df
The first line imports the pandas
module under the shorthand alias pd
; this is a near-universal convention, so I recommend following it every time you import pandas.
The next line calls the read_csv
function to read the data from the URL we created above. A couple things to notice here:
encoding='latin-1'
to specify the text encoding format of the CSV file. In my experience, most other CSV files can be read without this argument, because they are encoded using the default encoding (UTF-8).df
. This stands for DataFrame
, which is the primary data structure that Pandas uses to represent a table of data.Now we’ll run the cell, but first make sure that the cell above containing the definition of data_url
has already been run. You can tell that a code cell has been run because the label next to it will have a number, such as In [1]
.
Pro tip: the Cell menu has some useful options for running multiple cells; of particular note are “Run All” and “Run All Above”.
Go ahead and run the cell now.
The last line of the cell just has the df
variable by itself so that the notebook will show us its value when the cell is run. You should see a nicely formatted representation of some of the data table. This particular dataset was downloaded from FiveThirtyEight, which compiled it for a 2015 article entitled Joining The Avengers Is As Deadly As Jumping Off A Four-Story Building. It catalogs information about all of the characters from the Marvel comic books that were ever members of the Avengers. You can find some meta-information about the dataset including a description of what each column means in the accompanying readme file (it’s in Markdown format; one easy way to display it nicely would be to paste its contents into a Markdown cell in a notebook).
Here’s an example of a “scavenger hunt” item I might ask you to figure out how to do, but for this one I’ll show it to you live.
"URL"
, "Probationary Introl"
, "Full/Reserve Avengers Intro"
, "Honorary"
. Store the trimmed table to a variable called avengers
. Useful function(s): df.drop
Here is, roughly, the code block I expect to end up with in the live demo. The explanatory text above it might look like:
The DataFrame
’s drop
method allows you to drop a specified list of columns or rows. The following example creates a new DataFrame
without the columns listed in drop_cols
. The columns can be given as the columns
keyword argument, or as the first argument with the axis
keyword argument specifying that the list contains 'columns'
to be dropped.
= ["URL", "Probationary Introl", "Full/Reserve Avengers Intro", "Honorary"]
drop_cols = df.drop(columns=drop_cols)
avengers # should also work: df.drop(drop_cols, axis='columns')
avengers
Add this to your code and take a look at the output.
In a word: Google (or another search engine of your choice). However, using search engines effectively is a skill, so here are a few tips.
Generally, I recommend using whatever resources you can find - learning how to do something by searching the Pandas documentation, Stack Overflow, or the internet at large is simply part of everyday life for a data scientist. Personally, I tend to search for whatever I’m looking to do; when skimming search results, I tend to prefer results in the following order:
1. Official Pandas tutorials
2. Official Pandas documentation
3. Stack Overflow
4. The rest of the internet
Lots of the content on the internet at large is great, but lots of it is also not so great, which is why I prefer the more authoritative sources. The official tutorials are pretty high-quality and easier to read than the documentation, at the expense of completeness. For understanding error messages and debugging weird behaviors, Stack Overflow is great because someone else has probably had your problem before and asked about it on Stack Overflow.
When you do click on a search result, you may find yourself on an overwhelmingly long webpage. If you’re looking for a specific function and the search result didn’t jump you straight to it, it’s often helpful to use your browser’s text search feature to find what you’re looking for on the page. I do this a lot. Ctrl+F will open up a search box in most browsers, where you can type in the name you’re looking for.
To get you started, I recommend checking out a couple specific Pandas tutorials; these have a high likelihood of containing many of the most basic stuff you’ll need for this lab.
I especially recommend skimming through the Getting Started tutorial What kind of data does pandas handle?, and having How do I select a subset of a DataFrame
? close at hand. Take a look at the other Getting Started tutorial titles so you know what’s there. Finally, 10 minutes to pandas is a little more detailed (and, for me anyway, much longer than 10 minutes) but covers a lot of ground and includes useful links to the more thorough documentation in the User Guide. This is a good one to have open for recent Ctrl-F searches.
Add cells to your notebook to perform each of the following tasks on the avengers
DataFrame in sequence. For each one, include a Markdown cell above with a brief explanation of what the code does such that the remainder of your notebook reads like a tutorial. Unless explicitly stated, these operations should not modify the avengers
variable. I’ve provided suggestions for useful functions, operators, or syntax that might help guide your searches for the functionality you need.
First few rows: Display only the first 10 rows of the table. Useful function(s): df.head
Last few rows: Display only the last 5 rows of the table. Useful function(s): df.tail
Find the dimensions of the table: Display the number of rows and the number of columns in avengers
. Useful property: df.shape
Extract a single column: Extract and display just the “Name/Alias” column. Useful operator: indexing using square brackets []
Sort the table: Sort the table by number of appearances so that the Avenger with the most appearances appears at the top. Store the sorted table in a new variable called avengers_sorted
. Useful function: df.sort_values
Find the not-quite-superstars: Display a table of the avengers with the 5th- through 10th-largest values in the Appearances column. Useful operator: slicing using square brackets []
Calculate the gender balance: Count the number of MALE and FEMALE avengers based on the Gender column. The result should display as follows:
MALE 115
FEMALE 58
Name: Gender, dtype: int64
Useful function: df.value_counts
Calculate the gender ratio: Based on the above, calculate the percentage of FEMALE avengers. Useful operators: indexing using square brackets []
; basic Python arithmetic operations
Get a subset of the columns: Extract a DataFrame containing just the “Name/Alias”, “Appearances”, “Gender”, and “Years since joining” columns. Store it in a variable called year_apps
. Useful operator: using a list as the index in square brackets [[]]
.
Plot # appearances vs years since joining: Create a scatter plot with “Years since joining” on the x-axis and “Appearances” on the y-axis. Useful function: df.plot.scatter
(Extra Credit) Plot the gender balance over time. Let’s simplify things a tad and assume that all avengers are current members of the avengers and all of them are alive. Given this, how has the gender balance of the avengers changed over time? Answer this with a scatter plot showing the number of MALE characters who have joined the avengers through each year found in the table and a corresponding scatter plot for FEMALE characters.
(Extra Credit) Do some other interesting analysis. Find something interesting in the data and tell me something I don’t know! Explain it clearly, ideally with some basic visualization.
Before downloading your notebook, make sure that all your cells have been run and have the output you intended. I recommend using the Run All option from the Cell menu to ensure all your outputs are up-to-date. It’s easy to forget that you’ve made code changes and have stale outputs, so look over the freshly computed results and make sure they match your expectations.
Once your notebook is ready, download a copy. From the File menu, go to Download As and select Notebook (ipynb) to get the notebook in its native format.
Submit the resulting .ipynb
file to the Lab 1 assignment on Canvas.
Fill out the Lab 1 Survey on Canvas. Your submission will not be considered complete until the survey is submitted.
As noted on the syllabus, in data science, presentation matters. Make sure that the writing in your Markdown cells is as carefully and clearly written as your code.
Jupyter
Pandas
For each task (#1-10):
Extra Credit
Up to 3 points of extra credit are available for each of the extra credit items (#11 and 12). Each extra credit point is exponentially more difficult to get; you will need to amaze me in order to get 6 points.