DATA 311 - Lab 2: Basic Pandas

Scott Wehrwein

Fall 2025

Overview

This pre-lab and lab serve to give hands-on experience with another important Python library: Pandas. As shown in class, it is a powerful, popular library for processing and analyzing tabular data in Python.

Submission and Development Environment

The submission process and development environment are the same as for Lab 1; please see the Lab 1 handout if you need a refresher. As a reminder, if you choose to use Google Colab or any other alternative notebook hosting service, you must disable any built-in generative AI features.

Collaboration Policy

You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.

You must work with a different partner for Lab 2 than you did for Lab 1.

Tips: Teaching Yourself to Teach Yourself

Although I gave live demos of many of the pandas features you’ll need, you’re unlikely to have retained every detail, and some details you’ll need for this lab were intentionally left out. Now and going forward, you will need to become proficient at figuring out how to do things that I haven’t told you how to do. This section provides some tips on how to approach this process.

In a word: Google (or another search engine of your choice). However, using search engines effectively is a skill, so here are a few tips.

Which Search Results?

Generally, I recommend using whatever resources you can find - learning how to do something by searching the Pandas documentation, Stack Overflow, or the internet at large is simply part of everyday life for a data scientist. Personally, I tend to search for whatever I’m looking to do; when skimming search results, I tend to prefer results in the following order:

1. Official Pandas tutorials
2. Official Pandas documentation
3. Stack Overflow
4. The rest of the internet

Lots of the content on the internet at large is great, but lots of it is also not so great, which is why I prefer the more authoritative sources. The official tutorials are pretty high-quality and easier to read than the documentation, at the expense of completeness. For understanding error messages and debugging weird behaviors, Stack Overflow is great because someone else has probably had your problem before and asked about it on Stack Overflow.

Life is short, and webpages are long. Ctrl+F is your friend.

When you do click on a search result, you may find yourself on an overwhelmingly long webpage. If you’re looking for a specific function and the search result didn’t jump you straight to it, it’s often helpful to use your browser’s text search feature to find what you’re looking for on the page. I do this a lot. Ctrl+F will open up a search box in most browsers, where you can type in the name you’re looking for.

Pre-Lab

Pandas Primer

Please carefully read "10 minutes to pandas" and "Intro to data structures" in the official pandas documentation.
In a text editor of your choice that is capable of exporting to PDF, answer these questions:
1. What is a DataFrame? What about a Series?
2. If df is the name of a DataFrame, what does calling df.head(3) do?
3. How do you select a column in a DataFrame?
4. What are two different ways you can create a DataFrame?
5. What does describe() do? Briefly describe what each of the function's parameters means.
6. How do you perform Boolean indexing on a DataFrame?
7. What does groupby do?
8. List at least three example operations you can perform on the object that results from calling groupby.
9. What is NaN and when might you encounter it in a DataFrame?
10. List three different possible dtypes for columns of a DataFrame.
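
While working through the reading, it may help to have a small DataFrame on hand to experiment with. The following is a minimal sketch using made-up data; the column names and values are hypothetical and have nothing to do with the lab's dataset. Try running it, then change things and see what happens.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for experimentation.
df = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Alan", "Edsger"],
        "dept": ["math", "cs", "cs", "math"],
        "score": [91.0, 88.5, np.nan, 79.0],  # one missing value on purpose
    }
)

first_rows = df.head(2)        # the first two rows
scores = df["score"]           # a single column
high = df[df["score"] > 80]    # rows where the condition holds
dept_means = df.groupby("dept")["score"].mean()  # one mean per department

print(df.dtypes)
print(dept_means)
```

Compare what you see against the reading, and still write your pre-lab answers in your own words.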

Other Resources

The Getting Started tutorials are great; especially useful to start with are "What kind of data does pandas handle?" and "How do I select a subset of a DataFrame?" You may want to take a look at the other Getting Started tutorial titles so you know what’s there.

Lab

With your lab partner, download lab2.ipynb, upload it to JupyterHub, and complete the lab following the instructions in the notebook. Any work not completed during lab time must be completed outside of lab hours, and should only be done with both partners present.

Extra Credit

The analysis for the main lab is closely guided, step-by-step. For a few points of extra credit, do some other interesting analysis on the state salary data: find something interesting in the data and tell me something I don’t know! Include your analysis under a new 2nd-level header cell titled Extra Credit below the existing cells. Explain your analysis clearly in interleaved text cells, and ideally include some basic visualization of your results.
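
To give a sense of the shape such an analysis might take, here is a rough sketch that summarizes one column by group. The DataFrame below is a hypothetical stand-in: the column names "agency" and "salary" and all values are invented, and the real salary data may be organized differently.

```python
import pandas as pd

# Hypothetical stand-in for the salary data; real column names may differ.
salaries = pd.DataFrame(
    {
        "agency": ["Parks", "Parks", "Transit", "Transit", "Health"],
        "salary": [52000, 61000, 70000, 58000, 66000],
    }
)

# Median salary per agency, highest first.
median_by_agency = (
    salaries.groupby("agency")["salary"]
    .median()
    .sort_values(ascending=False)
)
print(median_by_agency)

# With matplotlib installed, median_by_agency.plot.barh() would draw
# a simple horizontal bar chart of the same result.
```

A summary like this is only a starting point; the interesting part is the question you choose to ask and how clearly you explain what you found.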

Rubric

Pre-Lab (10 points)

1 point per question

Lab (40 points)

2 points per subtask

Extra Credit

Up to 3 points of extra credit are available for doing some additional interesting analysis on the state employee salary data. Each extra credit point is exponentially more difficult to get; in order to get 3 points your analysis needs to be involved, interesting, and well-presented.

Acknowledgements

Part 1 of this lab is based on parts of the Pandas Cookbook by Julia Evans. Thanks to her for developing it and sharing it under a CC-BY-SA license. Part 1 is released under the same CC-BY-SA license. Part 2 is based on a lab developed by Aaron Tuor.