DATA 311 - Lecture 3¶

Announcements¶

  • Our first Data Ethics assignment is out; due Wednesday before class. You will:
    • Download all the data that some tech company has on you
    • Poke around
    • Write a short reflection
    • Discuss in class on Wednesday 4/15
  • Quiz 1 is Friday, covering material through today.
  • Today's lecture: you may want to follow along in your own copy of the notebook. You can grab the ipynb I'm working with from the course webpage.

Goals¶

  • Know how to use integer and boolean indexing in numpy.

  • Understand the fundamental data structures and concepts of the pandas library, and how they relate to each other:

    • Series, DataFrame, index
  • Know enough about pandas to be able to do, or look up how to do, the following basic data manipulation tasks:

    • Show the first or last k rows
    • Drop columns from a DataFrame
    • Get the dimensions of the table
    • Extract a single column
    • Extract multiple columns
    • Extract a single column as a DataFrame
    • Sort the table on a column
    • Get a custom slice of rows
    • Count number of rows with each value in a categorical column
    • Plot a column as a line graph
    • Scatterplot 2 columns
    • Group by a categorical variable and apply reductions to each group
    • Get only rows that meet some condition
    • Show summary statistics of a DataFrame

We may not get to all of these today, but by the time you complete Lab 2, you should be able to accomplish the above.

Advantage of Jupyter:¶

  • Reproducibility. Important rule of data science: Always start with the raw data, and document all processing and analysis so anyone can reproduce your findings (and understand what assumptions / decisions you made). Notebooks help you do this seamlessly.

Numpy, Continued¶

In [1]:
import numpy as np
In [2]:
a = np.array(range(12)).reshape((4, 3))
a
Out[2]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
In [3]:
a[0:2,:]
Out[3]:
array([[0, 1, 2],
       [3, 4, 5]])

Fancy indexing¶

  • Integer indexing: a[ list or ndarray of integer indices ]
  • Boolean indexing: a[ list or ndarray of booleans ] where the list/ndarray's shape matches a's

See https://numpy.org/doc/stable/user/basics.indexing.html for much more.

Integer indexing¶

In [4]:
a = np.array(range(10, 20))
a
Out[4]:
array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

Indexing with a list or array of integers pulls out only the elements at those indices:

In [5]:
# get the first, third, and fifth elements:
a[np.array([0, 2, 4])]
Out[5]:
array([10, 12, 14])
In [6]:
# get the fourth, second, and second elements (!):
a[np.array([3, 1, 1])]
Out[6]:
array([13, 11, 11])

Boolean Indexing¶

In [7]:
b = np.ones((2, 2))
b[0,0] = 2
b[1,1] = 0

Quick quiz: what does b look like now?

In [8]:
b
Out[8]:
array([[2., 1.],
       [1., 0.]])

Make a "mask" of booleans that's the same shape as b:

In [9]:
mask = np.array([
    [True, False],
    [False, True]
])
mask
Out[9]:
array([[ True, False],
       [False,  True]])

Index b with a boolean mask:

In [10]:
b[mask]
Out[10]:
array([2., 0.])

A common pattern - use a comparison to generate the mask on the fly:

In [11]:
# pull out all elements of b that are greater than zero:
b[b > 0]
Out[11]:
array([2., 1., 1.])
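
Masks can also be combined, and they can appear on the left-hand side of an assignment. The following is a minimal sketch that extends the cell above; it is not part of the original lecture, and the name `b2` and the thresholds are made up for illustration. numpy uses `&`, `|`, and `~` for elementwise logic, and each comparison needs its own parentheses.

```python
import numpy as np

# Same values as b above, rebuilt so this cell runs on its own.
b2 = np.array([[2.0, 1.0],
               [1.0, 0.0]])

# Combine two conditions elementwise; the parentheses are required
# because & binds more tightly than the comparisons.
between = (b2 > 0) & (b2 < 2)
selected = b2[between]      # boolean indexing returns a copy

# Assigning through a mask modifies only the selected elements.
b2[b2 == 0] = -1.0

print(selected)
print(b2)
```

Because `selected` is a copy, the later assignment into `b2` does not change it.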

Tips for multidimensional arrays¶

  • I never display anything that's more than 2D.
  • I never try to visualize anything that's more than 3D.
In [12]:
c = np.array(range(24)).reshape(2, 4, 3)
c
Out[12]:
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11]],

       [[12, 13, 14],
        [15, 16, 17],
        [18, 19, 20],
        [21, 22, 23]]])
In [13]:
# take one 2D slice
c[:,:,0]
Out[13]:
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21]])
In [14]:
# take another 2D slice along a different axis
c[:, 1, :]
Out[14]:
array([[ 3,  4,  5],
       [15, 16, 17]])
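
One way to keep track of what a slice does is to check shapes rather than printing values. This is a small sketch under that assumption, reusing the same 2x4x3 array; the variable names `first`, `middle`, and `kept` are illustrative only.

```python
import numpy as np

c = np.arange(24).reshape(2, 4, 3)

# Indexing an axis with a single integer removes that axis;
# slicing it with ":" (or any range) keeps it.
first = c[:, :, 0]     # axes 0 and 1 remain
middle = c[:, 1, :]    # axes 0 and 2 remain
kept = c[:, 1:2, :]    # a length-1 slice keeps all three axes

print(first.shape, middle.shape, kept.shape)
```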

Exercise 3 Play with my cat¶

In pairs (seriously - one computer open per pair please): In this exercise, we'll manipulate an image as a 2D array.

We'll start by loading a picture of my cat, Beans:

In [15]:
import imageio.v3 as imageio
import matplotlib.pyplot as plt
In [16]:
beans = imageio.imread("/cluster/academic/DATA311/202620/beans_gray.jpeg")

We'll use plt.imshow to visualize the image:

In [17]:
plt.imshow(beans, cmap='gray')
Out[17]:
<matplotlib.image.AxesImage at 0x14ee7d243260>
In [18]:
beans.shape
Out[18]:
(200, 200)
  1. What is the dtype of the resulting array? What are the minimum and maximum values?
In [19]:
beans.dtype
Out[19]:
dtype('uint8')
  2. Display a binary image showing which pixels are greater than half the maximum pixel intensity (127).
In [20]:
plt.imshow(beans>127, cmap='gray')
Out[20]:
<matplotlib.image.AxesImage at 0x14ee7cf047a0>
  3. What is the average value of pixels that have intensity value above 127?
In [21]:
pixels_of_interest = beans[beans > 127]
pixels_of_interest.mean()
Out[21]:
np.float64(160.46054636482367)
In [22]:
np.sum(beans)
Out[22]:
np.uint64(4149006)
  4. Which column of the image has the highest average pixel value?
In [23]:
beans.mean(axis=0).argmax()
Out[23]:
np.int64(193)
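
The image file lives on the course cluster, so here is a self-contained sketch of the same steps on a tiny stand-in array. The array `img` and its values are hypothetical, chosen only so the cell runs anywhere; the operations mirror the exercise (dtype, min/max, thresholding, masked mean, per-column mean).

```python
import numpy as np

# Hypothetical stand-in for the image: a small 4x5 uint8 array.
img = np.array([[ 10,  20, 200, 250,  30],
                [ 40, 130, 210, 240,  50],
                [ 60, 140, 220, 230,  70],
                [ 80,  90, 100, 255, 110]], dtype=np.uint8)

print(img.dtype, img.min(), img.max())

bright = img > 127                     # boolean mask, same shape as img
mean_bright = img[bright].mean()       # average over the selected pixels only
best_col = img.mean(axis=0).argmax()   # axis=0 averages down each column

print(bright.sum(), mean_bright, best_col)
```

Note that `axis=0` collapses the rows, leaving one mean per column, which is why it answers the "which column" question.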

Pandas: a library for working with tabular data¶

How to Learn Pandas (and other tools we'll use in this class):¶

(Three images; no descriptions were provided.)

But seriously¶

I won't teach you every little thing you need to use. I will expect you to be able to find and use functionality that gets the job done. The Lab 2 handout has some suggestions for how to go about the searching and learning process. I also won't quiz/test you on syntactic minutiae, though the basics are fair game.

Pandas: basic data structures/concepts¶

  • Series
  • Index
  • DataFrame
In [24]:
from pandas import Series, DataFrame
import pandas as pd

Series - a 1D list-like thing (think of it as a column with labels)¶

In [25]:
s = Series([9,6,8,4])
s
Out[25]:
0    9
1    6
2    8
3    4
dtype: int64

The values are the items in the Series themselves:

In [26]:
# get the values:
s.values
Out[26]:
array([9, 6, 8, 4])

The indices are the "labels" - the left column in the display above. We didn't provide labels, so they defaulted to sequential integers:

In [27]:
# get the index:
s.index
Out[27]:
RangeIndex(start=0, stop=4, step=1)

Square bracket indexing pulls out the value at an index:

In [28]:
# get the third value:
s[2]
Out[28]:
np.int64(8)

We can customize the index:

In [29]:
s2 = Series([9,6,8,4],index=['win','spr','sum','fal'])
s2
Out[29]:
win    9
spr    6
sum    8
fal    4
dtype: int64
In [30]:
# get the values:
s2.values
Out[30]:
array([9, 6, 8, 4])
In [31]:
# get the indices:
s2.index
Out[31]:
Index(['win', 'spr', 'sum', 'fal'], dtype='str')
In [32]:
# get the value at index "win":
s2["win"]
Out[32]:
np.int64(9)

What if I want the second thing? Don't do this:

In [33]:
s2[1]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/miniforge/lib/python3.12/site-packages/pandas/core/indexes/base.py:3641, in Index.get_loc(self, key)
   3640 try:
-> 3641     return self._engine.get_loc(casted_key)
   3642 except KeyError as err:

File pandas/_libs/index.pyx:168, in pandas._libs.index.IndexEngine.get_loc()
--> 168 'Could not get source, probably due dynamically evaluated source code.'

File pandas/_libs/index.pyx:176, in pandas._libs.index.IndexEngine.get_loc()
--> 176 'Could not get source, probably due dynamically evaluated source code.'

File pandas/_libs/index.pyx:583, in pandas._libs.index.StringObjectEngine._check_type()
--> 583 'Could not get source, probably due dynamically evaluated source code.'

KeyError: 1

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[33], line 1
----> 1 s2[1]

File /opt/miniforge/lib/python3.12/site-packages/pandas/core/series.py:959, in Series.__getitem__(self, key)
    954     key = unpack_1tuple(key)
    956 elif key_is_scalar:
    957     # Note: GH#50617 in 3.0 we changed int key to always be treated as
    958     #  a label, matching DataFrame behavior.
--> 959     return self._get_value(key)
    961 # Convert generator to list before going through hashable part
    962 # (We will iterate through the generator there to check for slices)
    963 if is_iterator(key):

File /opt/miniforge/lib/python3.12/site-packages/pandas/core/series.py:1046, in Series._get_value(self, label, takeable)
   1043     return self._values[label]
   1045 # Similar to Index.get_value, but we do not fall back to positional
-> 1046 loc = self.index.get_loc(label)
   1048 if is_integer(loc):
   1049     return self._values[loc]

File /opt/miniforge/lib/python3.12/site-packages/pandas/core/indexes/base.py:3648, in Index.get_loc(self, key)
   3643     if isinstance(casted_key, slice) or (
   3644         isinstance(casted_key, abc.Iterable)
   3645         and any(isinstance(x, slice) for x in casted_key)
   3646     ):
   3647         raise InvalidIndexError(key) from err
-> 3648     raise KeyError(key) from err
   3649 except TypeError:
   3650     # If we have a listlike key, _check_indexing_error will raise
   3651     #  InvalidIndexError. Otherwise we fall through and re-raise
   3652     #  the TypeError.
   3653     self._check_indexing_error(key)

KeyError: 1

Integer keys in square brackets are treated as labels, and 1 is not a label in this index. Instead, do this:

In [34]:
# get the second thing using iloc
s2.iloc[1]
Out[34]:
np.int64(6)

iloc allows you to use numerical (numpy-like) indexing into a Series or DataFrame even when its index has different labels.

Notice that iloc is, weirdly, not a function - it's .iloc[ind], not .iloc(ind).

Slicing works too:

In [35]:
# get the second and third things:
s2.iloc[1:3]
Out[35]:
spr    6
sum    8
dtype: int64
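
For comparison, pandas also has a label-based counterpart, `.loc`, which was not shown above. The sketch below is an illustrative aside rather than part of the lecture; it rebuilds `s2` so it runs on its own. The main thing to notice is that positional slices exclude the stop, while label slices include it.

```python
from pandas import Series

s2 = Series([9, 6, 8, 4], index=['win', 'spr', 'sum', 'fal'])

by_position = s2.iloc[1:3]        # positions 1 and 2; the stop (3) is excluded
by_label = s2.loc['spr':'sum']    # labels 'spr' through 'sum'; the stop is included
one_label = s2.loc['fal']         # a single label returns a single value

print(by_position)
print(by_label)
print(one_label)
```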

We can create a Series from a dictionary:

In [36]:
d = {}
d['win'] = 9
d['spr'] = 6
d['sum'] = 8
d['fal'] = 4
s3 = Series(d)
s3
Out[36]:
win    9
spr    6
sum    8
fal    4
dtype: int64

Many things that work on dictionaries and lists work on Series:

In [37]:
# is 'fal' a key in s3? using the in keyword
'fal' in s3
Out[37]:
True
In [38]:
# is 'jan' a key in s3?
'jan' in s3
Out[38]:
False
In [39]:
# is 6 a value in s3?
6 in s3.values
Out[39]:
True

DataFrames¶

DataFrames represent 2D tables.

Create a DataFrame from scratch:

In [40]:
data = {'city': ['Seattle','Spokane','Tacoma','Vancouver'],
        'pop': [787,230,222,189], # units are in thousands
        'tax': [10.25,9.0,10.3,8.5]}
df = DataFrame(data)
df
Out[40]:
        city  pop    tax
0    Seattle  787  10.25
1    Spokane  230   9.00
2     Tacoma  222  10.30
3  Vancouver  189   8.50

Each column is a Series. Indexing the DataFrame by the column name extracts the Series:

In [41]:
# get the city column using square brackets:
df["city"]
Out[41]:
0      Seattle
1      Spokane
2       Tacoma
3    Vancouver
Name: city, dtype: str

Another way to access a column; generally prefer the square brackets, since column names can have spaces and other weirdness.

In [42]:
# get the city column using property accessor:
df.city
Out[42]:
0      Seattle
1      Spokane
2       Tacoma
3    Vancouver
Name: city, dtype: str

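Attribute access is not always possible: column names are not always valid Python identifiers (they can contain spaces, dashes, etc.), and a name that matches an existing DataFrame attribute or method, such as pop here, resolves to that attribute instead of the column.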

Elementwise arithmetic works on Series (they are based on numpy arrays):

In [43]:
# divide the tax column by 100:
df["tax"] / 100
Out[43]:
0    0.1025
1    0.0900
2    0.1030
3    0.0850
Name: tax, dtype: float64

Add a column to an existing DataFrame:

In [44]:
df['visits'] = [20,2,5,4]
df
Out[44]:
        city  pop    tax  visits
0    Seattle  787  10.25      20
1    Spokane  230   9.00       2
2     Tacoma  222  10.30       5
3  Vancouver  189   8.50       4
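
Elementwise arithmetic and column assignment combine naturally: a new column can be computed from existing ones. This is a small sketch, assuming the same toy table as above; the column name `visits_per_k` is hypothetical and only for illustration.

```python
from pandas import DataFrame

df = DataFrame({'city': ['Seattle', 'Spokane', 'Tacoma', 'Vancouver'],
                'pop': [787, 230, 222, 189],    # thousands of people
                'visits': [20, 2, 5, 4]})

# Hypothetical derived column: visits per thousand residents.
# The division is applied row by row, aligned on the index.
df['visits_per_k'] = df['visits'] / df['pop']

print(df)
```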

More pandas, now with Avengers¶

For demo purposes, we'll use a dataset downloaded from FiveThirtyEight, which compiled it for a 2015 article entitled Joining The Avengers Is As Deadly As Jumping Off A Four-Story Building. It catalogs information about all of the characters from the Marvel comic books that were ever members of the Avengers. You can find some meta-information about the dataset including a description of what each column means in the accompanying readme file (it's in Markdown format; one easy way to display it nicely would be to paste its contents into a Markdown cell in a notebook).

In [45]:
data_url = '/cluster/academic/DATA311/202620/avengers/avengers.csv'
In [46]:
avengers = pd.read_csv(data_url, encoding='latin-1')
avengers
Out[46]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 http://marvel.wikia.com/Henry_Pym_(Earth-616) Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full ... NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 http://marvel.wikia.com/Janet_van_Dyne_(Earth-... Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...
2 http://marvel.wikia.com/Anthony_Stark_(Earth-616) Anthony Edward "Tony" Stark 3068 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Death: "Later while under the influence of Imm...
3 http://marvel.wikia.com/Robert_Bruce_Banner_(E... Robert Bruce Banner 2089 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Ghosts of the Future arc. However "he ...
4 http://marvel.wikia.com/Thor_Odinson_(Earth-616) Thor Odinson 2402 YES MALE NaN Sep-63 1963 52 Full ... YES YES NO NaN NaN NaN NaN NaN NaN Dies in Fear Itself brought back because that'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
168 http://marvel.wikia.com/Eric_Brooks_(Earth-616)# Eric Brooks 198 YES MALE NaN 13-Nov 2013 2 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
169 http://marvel.wikia.com/Adam_Brashear_(Earth-6... Adam Brashear 29 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
170 http://marvel.wikia.com/Victor_Alvarez_(Earth-... Victor Alvarez 45 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
171 http://marvel.wikia.com/Ava_Ayala_(Earth-616)# Ava Ayala 49 YES FEMALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 http://marvel.wikia.com/Kaluu_(Earth-616)# Kaluu 35 YES MALE NaN 15-Jan 2015 0 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

173 rows × 21 columns

Display only the first 2 rows of the table:

In [47]:
# use head:
avengers.head(2)
Out[47]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 http://marvel.wikia.com/Henry_Pym_(Earth-616) Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full ... NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 http://marvel.wikia.com/Janet_van_Dyne_(Earth-... Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...

2 rows × 21 columns

Only the last 3:

In [48]:
# use tail:
avengers.tail(3)
Out[48]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
170 http://marvel.wikia.com/Victor_Alvarez_(Earth-... Victor Alvarez 45 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
171 http://marvel.wikia.com/Ava_Ayala_(Earth-616)# Ava Ayala 49 YES FEMALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 http://marvel.wikia.com/Kaluu_(Earth-616)# Kaluu 35 YES MALE NaN 15-Jan 2015 0 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 21 columns

What are all the columns?

In [49]:
# list the columns
avengers.columns
Out[49]:
Index(['URL', 'Name/Alias', 'Appearances', 'Current?', 'Gender',
       'Probationary Introl', 'Full/Reserve Avengers Intro', 'Year',
       'Years since joining', 'Honorary', 'Death1', 'Return1', 'Death2',
       'Return2', 'Death3', 'Return3', 'Death4', 'Return4', 'Death5',
       'Return5', 'Notes'],
      dtype='str')

A new DataFrame with the URL column dropped (drop returns a copy; avengers itself is unchanged):

In [52]:
# use drop:
avengers.drop(columns=["URL"])
Out[52]:
Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary Death1 Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full YES NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...
2 Anthony Edward "Tony" Stark 3068 YES MALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Death: "Later while under the influence of Imm...
3 Robert Bruce Banner 2089 YES MALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Ghosts of the Future arc. However "he ...
4 Thor Odinson 2402 YES MALE NaN Sep-63 1963 52 Full YES YES YES NO NaN NaN NaN NaN NaN NaN Dies in Fear Itself brought back because that'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
168 Eric Brooks 198 YES MALE NaN 13-Nov 2013 2 Full NO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
169 Adam Brashear 29 YES MALE NaN 14-Jan 2014 1 Full NO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
170 Victor Alvarez 45 YES MALE NaN 14-Jan 2014 1 Full NO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
171 Ava Ayala 49 YES FEMALE NaN 14-Jan 2014 1 Full NO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 Kaluu 35 YES MALE NaN 15-Jan 2015 0 Full NO NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

173 rows × 20 columns

Look at the dimensions of the table:

In [53]:
# use shape
avengers.shape
Out[53]:
(173, 21)
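
The shape above still reports 21 columns because `drop` does not modify the DataFrame it is called on. A minimal sketch of that behavior on a made-up two-row table (the name `toy` and its contents are hypothetical):

```python
from pandas import DataFrame

toy = DataFrame({'URL': ['u1', 'u2'],
                 'Name': ['a', 'b'],
                 'Year': [1963, 2015]})

dropped = toy.drop(columns=['URL'])   # returns a new DataFrame
shape_after_drop = toy.shape          # toy still has its URL column

toy = toy.drop(columns=['URL'])       # reassign to make the change stick

print(shape_after_drop, dropped.shape, toy.shape)
```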

Extract the Series representing one column:

In [54]:
# extract the Name/Alias column:
avengers["Name/Alias"]
Out[54]:
0        Henry Jonathan "Hank" Pym
1                   Janet van Dyne
2      Anthony Edward "Tony" Stark
3              Robert Bruce Banner
4                     Thor Odinson
                  ...             
168                    Eric Brooks
169                  Adam Brashear
170                 Victor Alvarez
171                      Ava Ayala
172                          Kaluu
Name: Name/Alias, Length: 173, dtype: str

Extract a DataFrame containing only a subset of the columns:

In [60]:
# get a dataframe with only the Name/Alias and Appearances columns
na = avengers[["Name/Alias", "Appearances"]]
na
Out[60]:
                      Name/Alias  Appearances
0      Henry Jonathan "Hank" Pym         1269
1                 Janet van Dyne         1165
2    Anthony Edward "Tony" Stark         3068
3            Robert Bruce Banner         2089
4                   Thor Odinson         2402
...                          ...          ...
168                  Eric Brooks          198
169                Adam Brashear           29
170               Victor Alvarez           45
171                    Ava Ayala           49
172                        Kaluu           35

173 rows × 2 columns

If the size of the subset is one, it gives you a one-column table:

In [56]:
# extract a DataFrame with just Name/Alias:
avengers[["Name/Alias"]]
Out[56]:
                      Name/Alias
0      Henry Jonathan "Hank" Pym
1                 Janet van Dyne
2    Anthony Edward "Tony" Stark
3            Robert Bruce Banner
4                   Thor Odinson
...                          ...
168                  Eric Brooks
169                Adam Brashear
170               Victor Alvarez
171                    Ava Ayala
172                        Kaluu

173 rows × 1 columns

Notice that this is different from avengers["Name/Alias"] above because this is a DataFrame, whereas the above gives you a Series.
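
The difference is easy to check directly. A short sketch on a made-up two-row table (the name `toy` and its rows are illustrative, not the real dataset):

```python
from pandas import DataFrame, Series

toy = DataFrame({'Name/Alias': ['Thor Odinson', 'Kaluu'],
                 'Appearances': [2402, 35]})

col = toy['Name/Alias']        # a single name in brackets gives a Series
table = toy[['Name/Alias']]    # a list of names gives a DataFrame

print(type(col).__name__, col.shape)      # 1D: shape has one entry
print(type(table).__name__, table.shape)  # 2D: rows and columns
```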

Sort the Avengers by number of appearances:

In [61]:
# use sort_values:
na = na.sort_values("Appearances", ascending=False)
na
Out[61]:
                      Name/Alias  Appearances
73         Peter Benjamin Parker         4333
6                  Steven Rogers         3458
92         James "Logan" Howlett         3130
2    Anthony Edward "Tony" Stark         3068
4                   Thor Odinson         2402
...                          ...          ...
117                 Dennis Sykes            6
65                  Gene Lorrene            4
68                  Doug Taggert            3
39                 Moira Brandon            2
125                        Fiona            2

173 rows × 2 columns

Exercise: Show the 10th through 20th most-appearing avengers.

In [ ]:
# get the 10th through 20th most-appearing avengers
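
One possible approach, sketched on made-up data so that it runs without the course files: sort in descending order, then take a positional slice with iloc. The table `toy` and its values are hypothetical, and reading "10th through 20th" as inclusive on both ends is an assumption.

```python
from pandas import DataFrame

# Hypothetical stand-in data: 25 made-up characters with distinct counts.
toy = DataFrame({'Name/Alias': [f'hero_{i}' for i in range(25)],
                 'Appearances': [(i * 37) % 101 for i in range(25)]})

ranked = toy.sort_values('Appearances', ascending=False)

# Positions are 0-based and the stop is excluded,
# so ranks 10 through 20 correspond to positions 9:20.
tenth_to_twentieth = ranked.iloc[9:20]

print(tenth_to_twentieth)
```

After sorting, the index labels are shuffled, which is why iloc (position) is used here rather than the labels.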

Note on iloc - it can also index across columns:

In [63]:
# Extract the first four rows and the columns at positions 1 through 4 (na has only two columns, so only Appearances is returned):
na.iloc[:4, 1:5]
Out[63]:
    Appearances
73         4333
6          3458
92         3130
2          3068

Basic plotting is built into pandas!

In [64]:
# line plot of appearances
na.plot(y="Appearances", use_index=False)
Out[64]:
<Axes: >

For a categorical column like Gender, we can count the frequency of each value:

In [65]:
# show the Gender column:
avengers["Gender"]
Out[65]:
0        MALE
1      FEMALE
2        MALE
3        MALE
4        MALE
        ...  
168      MALE
169      MALE
170      MALE
171    FEMALE
172      MALE
Name: Gender, Length: 173, dtype: str
In [66]:
# use value_counts to see frequency of each categorical label:
avengers["Gender"].value_counts()
Out[66]:
Gender
MALE      115
FEMALE     58
Name: count, dtype: int64
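
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A brief sketch on a made-up four-element Series (the data are illustrative, not the real Gender column):

```python
from pandas import Series

gender = Series(['MALE', 'FEMALE', 'MALE', 'MALE'])

counts = gender.value_counts()                 # absolute counts per label
shares = gender.value_counts(normalize=True)   # fractions that sum to 1

print(counts)
print(shares)
```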
In [70]:
# scatter plot Years since joining vs Appearances
avengers.plot.scatter(x="Years since joining", y="Appearances")
Out[70]:
<Axes: xlabel='Years since joining', ylabel='Appearances'>

Getting fancier¶

Group the avengers by gender, then average the Appearances column of each group:

In [ ]:
avengers.groupby("Gender")["Appearances"].mean()
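
The same grouped column can be reduced in several ways at once with `agg`. This is a minimal sketch on made-up data; the table `toy` and its numbers are hypothetical, chosen so the group means are easy to verify by hand.

```python
from pandas import DataFrame

toy = DataFrame({'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE'],
                 'Appearances': [100, 300, 200, 500]})

by_gender = toy.groupby('Gender')['Appearances']

means = by_gender.mean()                            # one value per group
summary = by_gender.agg(['count', 'mean', 'max'])   # one column per reduction

print(means)
print(summary)
```

The result of a grouped reduction is indexed by the group labels, so `means['FEMALE']` looks up the value for that group.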

Look at the three most frequently-appearing female avengers:

In [ ]:
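# filter to female avengers, sort by appearances (most first), then take the top 3:
avengers[avengers["Gender"] == "FEMALE"].sort_values("Appearances", ascending=False).head(3)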

Get a table of only the avengers that appear more than 2000 times:

In [ ]:
avengers[avengers["Appearances"] > 2000]
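
Conditions can be combined the same way as numpy masks, with `&`, `|`, and `~`. The sketch below uses a made-up four-row table (the name `toy` and its rows are hypothetical) to show all three.

```python
from pandas import DataFrame

toy = DataFrame({'Name/Alias': ['a', 'b', 'c', 'd'],
                 'Gender': ['FEMALE', 'MALE', 'FEMALE', 'MALE'],
                 'Appearances': [2500, 3000, 150, 40]})

# Each comparison yields a boolean Series aligned with the rows.
is_female = toy['Gender'] == 'FEMALE'
is_frequent = toy['Appearances'] > 2000

both = toy[is_female & is_frequent]     # AND: rows meeting both conditions
either = toy[is_female | is_frequent]   # OR: rows meeting at least one
rare = toy[~is_frequent]                # NOT: rows failing the condition

print(both)
print(either)
print(rare)
```

When writing the comparisons inline, wrap each one in parentheses, e.g. `toy[(toy['Gender'] == 'FEMALE') & (toy['Appearances'] > 2000)]`.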
In [ ]:
# column info
avengers.info()
In [ ]:
avengers.describe()

pandas: a whirlwind tour¶

Things to demo:

  • Show the first 5 rows, last 10 rows
  • Drop some columns
  • Get the dimensions of the table
  • Get a single column
  • Get multiple columns
  • Get a single column as a DataFrame
  • Sort the table
  • Get a custom slice of rows (not-quite-superstars)
  • Count number of rows with each value in a categorical column (e.g., gender)
  • Plot a column
  • Scatterplot 2 columns
  • Compute by groups
  • Get only rows that meet some condition
  • Show summary statistics of a DataFrame