Fall 2025
This pre-lab and lab serve to give hands-on experience with another important Python library: Pandas. As shown in class, it is a powerful, popular library for processing and analyzing tabular data in Python.
Details for submission and development environment are the same as for Lab 1; please see the Lab 1 handout if you need a refresher. As a reminder, if you choose to use Google Colab or any other alternative notebook hosting service, you must disable any built-in generative AI features.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.
You must work with a different partner for Lab 2 than you did for Lab 1.
Although I gave live demos of many of the pandas features you’ll need, you’re unlikely to have retained every detail, and some details you’ll need for this lab were intentionally left out. Now and going forward, you will need to become proficient at figuring out how to do things that I haven’t told you how to do. This section provides some tips on how to approach this process.
In a word: Google (or another search engine of your choice). However, using search engines effectively is a skill, so here are a few tips.
Generally, I recommend using whatever resources you can find: learning how to do something by searching the Pandas documentation, Stack Overflow, or the internet at large is simply part of everyday life for a data scientist. Personally, I search for whatever I’m trying to do and, when skimming the results, prefer sources in the following order:
1. Official Pandas tutorials
2. Official Pandas documentation
3. Stack Overflow
4. The rest of the internet
Lots of the content on the internet at large is great, but lots of it is also not so great, which is why I prefer the more authoritative sources. The official tutorials are pretty high-quality and easier to read than the documentation, at the expense of completeness. For understanding error messages and debugging weird behaviors, Stack Overflow is great because someone else has probably had your problem before and asked about it on Stack Overflow.
When you do click on a search result, you may find yourself on an overwhelmingly long webpage. If you’re looking for a specific function and the search result didn’t jump you straight to it, it’s often helpful to use your browser’s text search feature to find what you’re looking for on the page. I do this a lot. Ctrl+F will open up a search box in most browsers, where you can type in the name you’re looking for.
Please carefully read 10 minutes to pandas and Intro to data structures.
In a text editor of your choice that is capable of exporting to PDF, answer these questions:
What is a DataFrame? What about a Series?
If df is the name of a DataFrame, what does calling df.head(3) do?
How do you select a column in a DataFrame?
What are two different ways you can create a DataFrame?
What does describe() do? Briefly describe what each of the parameters in the function means.
How do you perform Boolean indexing on a DataFrame?
What does groupby do?
List at least three example operations you can perform on the object that results from calling groupby.
What is NaN and when might you encounter it in a DataFrame?
List three different possible dtypes for columns of a DataFrame.
The Getting Started tutorials are great; especially useful to start with are "What kind of data does pandas handle?" and "How do I select a subset of a DataFrame?". You may want to take a look at the other Getting Started tutorial titles so you know what’s there.
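If you want something to experiment on while working through the readings and questions, a small scratch DataFrame is handy. The following is only a minimal sketch: the column names and values are made up for illustration and have nothing to do with the lab data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: names, departments, and salaries are invented
# purely so there is something small to experiment on.
scratch = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan", "Edsger"],
        "department": ["Math", "CS", "CS", "Math"],
        "salary": [72000.0, 85000.0, np.nan, 64000.0],  # one missing value
    }
)

print(scratch)
```

Try the calls named in the questions above on scratch in a notebook cell and compare what you see with what the documentation says.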
With your lab partner, download lab2.ipynb, upload it to JupyterHub, and complete the lab following the instructions in the notebook. Any work not completed during lab time must be completed outside of lab hours, and should only be done with both partners present.
The analysis for the main lab is closely guided, step-by-step. For a few points of extra credit, do some other interesting analysis on the state salary data: find something interesting in the data and tell me something I don’t know! Include your analysis under a new 2nd-level header cell titled Extra Credit below the existing cells. Explain your analysis clearly in interleaved text cells, and ideally include some basic visualization of your results.
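To give a sense of the shape such an analysis might take, here is a minimal sketch on made-up data. The column names agency and salary and all of the values are hypothetical; the actual state salary data has its own columns, and an analysis worth extra credit should go well beyond a single grouped summary.

```python
import pandas as pd

# Hypothetical stand-in for the salary data; the real columns will differ.
toy = pd.DataFrame(
    {
        "agency": ["Parks", "Parks", "Health", "Health", "Transit", "Transit"],
        "salary": [40000, 50000, 60000, 80000, 55000, 45000],
    }
)

# Median salary per agency, sorted from highest to lowest.
median_by_agency = (
    toy.groupby("agency")["salary"].median().sort_values(ascending=False)
)

print(median_by_agency)
```

A result like this is a starting point for a question (why does one group differ from another?), and a simple chart of it would make the comparison easier to read.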
There are 20 subtasks in this lab; successfully completing each one is worth 2 points. The lab is scored out of a total of 40 points.
Extra Credit
Up to 3 points of extra credit are available for doing some additional interesting analysis on the state employee salary data. Each extra credit point is exponentially more difficult to get; in order to get 3 points your analysis needs to be involved, interesting, and well-presented.
Part 1 of this lab is based on parts of the Pandas Cookbook by Julia Evans. Thanks to her for developing it and sharing it under a CC-BY-SA license. Part 1 is released under the same CC-BY-SA license. Part 2 is based on a lab developed by Aaron Tuor.