DATA 311 - Lecture 2

Announcements

A couple of logistics reminders:

  • Labs (50 minutes) are on Fridays, held in CF 420 with our fantastic TA, Rory Little
  • Quizzes are timed async on Canvas:
    • Released Friday after lab
    • 10 minute time limit
    • Finish by the start of class on Monday

Extra credit opportunity - Faculty candidate talks:

The CS department is interviewing faculty candidates. Each candidate gives 2 public talks - one research talk and one teaching demo. Students are welcome and encouraged to attend these! To incentivize attendance, I offer a bit of extra credit. See the "Talk Attendance Extra Credit" assignment on Canvas for details, but the deal is this:

  • If you attend a talk, you can grab an index card from me at the start. During the talk, fill out the card with four things:

    1. Your name and class (DATA 311)
    2. Something you learned from the talk
    3. Something that could have been improved in the talk
    4. Your thoughts on whether we should offer a faculty position to the speaker
  • Hand the card to me at the end of the talk for 1 point of extra credit in the Quizzes category.

  • I will remind you of scheduled talks; the first two are this coming Thursday and Friday at 4pm:

    • Thu 1/12 CF 105: William Hoza - Research (computational complexity and pseudorandomness)
    • Fri 1/13 CF 316: William Hoza - Teaching Demo
  • Some of these folks will be your future professors!

Goals

  • Be able to navigate and work in Jupyter with

    • Basic markdown syntax in text cells
    • Python code in code cells
  • Understand the fundamental data structures and concepts of the pandas library, and how they relate to each other:

    • Series, DataFrame, index
  • Know enough about pandas to be able to do, or look up how to do, the following basic data manipulation tasks:

    • Show the first or last k rows
    • Drop columns from a DataFrame
    • Get the dimensions of the table
    • Extract a single column
    • Extract multiple columns
    • Extract a single column as a DataFrame
    • Sort the table on a column
    • Get a custom slice of rows
    • Count number of rows with each value in a categorical column
    • Plot a column as a line graph
    • Scatterplot 2 columns
    • Group by a categorical variable and apply reductions to each group
    • Get only rows that meet some condition
    • Show summary statistics of a DataFrame

Jupyter / Colab Basics

What is Jupyter? What is Colab?

http://colab.research.google.com

Notebook features:

  • Python cells
    • Contain Python code
    • You can run and re-run cells
    • State is maintained after running a cell
    • The value of the last line, if any, is displayed (not printed)
In [1]:
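a = 2+2

b = 8
a+b   # evaluated, but not displayed because it is not the last line
a     # the value of the last line is what shows up as Out[1]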
Out[1]:
4
In [2]:
a
Out[2]:
4
  • Markdown cells:
    • Allow you to intersperse formatted text with code.
    • Type your markdown syntax, see the formatted text on the right
    • Basic markdown formatting
      • headings
      • lists (bulleted, numbered)
      • italics, bold, monospace
      • code block
      • link
      • image https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/lab4/diagonal_example.png

Markdown:

  • bulleted lists
  • other lists
  1. numbered lists

  2. other lists

  3. another

  4. list

a = 4
b = 7

link text

alt text
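The cell above shows only the rendered result, not what was typed. A sketch of Markdown source that would render to roughly the same thing is below, held in a Python string purely so it can be displayed here. The heading level, the link target, and the exact spacing are assumptions; only the image URL comes from these notes.

```python
# Hypothetical Markdown source for a text cell (not the original cell's source).
markdown_source = """\
## Markdown:

- bulleted lists
- other lists

1. numbered lists
2. other lists

*italics*, **bold**, `monospace`

    a = 4
    b = 7

[link text](http://colab.research.google.com)

![alt text](https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_21f/lab4/diagonal_example.png)
"""

# Paste the printed text into a Markdown (text) cell to see it rendered.
print(markdown_source)
```

Lines indented by four spaces render as a code block; a pair of triple backticks around the lines does the same job.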

  • Why Jupyter?
    • Interleaved display
    • Quick, interactive development cycle
    • Reproducibility. Cardinal rule of data science: Always start with the raw data.

Pandas Basics - How to Learn Pandas (and other tools we'll use in this class):

But seriously: I won't teach you every little thing you need to use. I will expect you to be able to find and use functionality that gets the job done. I also won't quiz/test you on syntactic minutiae.

Pandas: a library for working with tabular data

For demo purposes, we'll use a dataset downloaded from FiveThirtyEight, which compiled it for a 2015 article entitled Joining The Avengers Is As Deadly As Jumping Off A Four-Story Building. It catalogs information about all of the characters from the Marvel comic books that were ever members of the Avengers. You can find some meta-information about the dataset including a description of what each column means in the accompanying readme file (it's in Markdown format; one easy way to display it nicely would be to paste its contents into a Markdown cell in a notebook).

In [3]:
data_url = 'https://fw.cs.wwu.edu/~wehrwes/courses/data311_21f/data/avengers/avengers.csv'
In [4]:
import pandas as pd
df = pd.read_csv(data_url, encoding='latin-1')
df
Out[4]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 http://marvel.wikia.com/Henry_Pym_(Earth-616) Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full ... NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 http://marvel.wikia.com/Janet_van_Dyne_(Earth-... Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...
2 http://marvel.wikia.com/Anthony_Stark_(Earth-616) Anthony Edward "Tony" Stark 3068 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Death: "Later while under the influence of Imm...
3 http://marvel.wikia.com/Robert_Bruce_Banner_(E... Robert Bruce Banner 2089 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Ghosts of the Future arc. However "he ...
4 http://marvel.wikia.com/Thor_Odinson_(Earth-616) Thor Odinson 2402 YES MALE NaN Sep-63 1963 52 Full ... YES YES NO NaN NaN NaN NaN NaN NaN Dies in Fear Itself brought back because that'...
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
168 http://marvel.wikia.com/Eric_Brooks_(Earth-616)# Eric Brooks 198 YES MALE NaN 13-Nov 2013 2 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
169 http://marvel.wikia.com/Adam_Brashear_(Earth-6... Adam Brashear 29 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
170 http://marvel.wikia.com/Victor_Alvarez_(Earth-... Victor Alvarez 45 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
171 http://marvel.wikia.com/Ava_Ayala_(Earth-616)# Ava Ayala 49 YES FEMALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 http://marvel.wikia.com/Kaluu_(Earth-616)# Kaluu 35 YES MALE NaN 15-Jan 2015 0 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

173 rows × 21 columns

Pandas: basic data structures/concepts

  • Series
  • DataFrame
  • Index
In [5]:
from pandas import Series, DataFrame
import pandas as pd

Series - a 1D list-like thing (think of it as a column with labels)

In [6]:
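s = Series([9,6,8,4])
s         # the whole Series: labels on the left, values on the right
s.values  # the underlying array of values
s.index   # the labels; by default 0, 1, ..., n-1
s[2]      # look up the entry labeled 2; only this last value is displayed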
Out[6]:
8

We can customize the labels:

In [7]:
s2 = Series([9,6,8,4],index=['win','spr','sum','fal'])
s2
s2.values
s2.index
s2.iloc[2]  # by position; plain s2[2] is deprecated once the labels are not integers
s2['sum']   # by label
Out[7]:
8
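Once the labels are strings, it helps to say explicitly whether a lookup is by label or by position. A minimal sketch using the same Series; `.loc` and `.iloc` are standard pandas accessors, and the variable names are just for illustration:

```python
import pandas as pd

s2 = pd.Series([9, 6, 8, 4], index=['win', 'spr', 'sum', 'fal'])

by_label = s2.loc['sum']          # look up by index label
by_position = s2.iloc[2]          # look up by integer position
several = s2.loc[['win', 'fal']]  # a list of labels gives back a Series

print(by_label, by_position)
print(several)
```

Both single lookups land on the same entry here, because 'sum' happens to be the label at position 2.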

We can create a Series from a dictionary:

In [8]:
d = {}
d['win'] = 9
d['spr'] = 6
d['sum'] = 8
d['fal'] = 4
s3 = Series(d)
s3
Out[8]:
win    9
spr    6
sum    8
fal    4
dtype: int64

Many things that work on dictionaries and lists work on Series:

In [9]:
'fal' in s2  # True: membership is tested against the index labels
'jan' in s2  # False; only this last result is displayed
Out[9]:
False
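One thing to watch for: like a dictionary, `in` on a Series checks the labels (the "keys"), not the values. A small sketch of a few dictionary-like and list-like operations, reusing `s2` from above:

```python
import pandas as pd

s2 = pd.Series([9, 6, 8, 4], index=['win', 'spr', 'sum', 'fal'])

has_label = 'fal' in s2          # True: 'fal' is a label
has_value = 9 in s2              # False: 9 is a value, not a label
value_present = 9 in s2.values   # True: check the values explicitly
n = len(s2)                      # number of entries, like a list
as_dict = s2.to_dict()           # back to a plain dictionary

print(has_label, has_value, value_present, n)
print(as_dict)
```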

DataFrames represent 2D tables; each column is a Series.

Create a DataFrame from scratch:

In [10]:
data = {'city': ['Seattle','Spokane','Tacoma','Vancouver'],
        'pop': [787,230,222,189], # units are thousands
        'tax': [10.25,9.0,10.3,8.5]}
df = DataFrame(data)
df
Out[10]:
        city  pop    tax
0    Seattle  787  10.25
1    Spokane  230   9.00
2     Tacoma  222  10.30
3  Vancouver  189   8.50
In [11]:
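df['city']  # bracket access works for any column name
df.city     # attribute access works when the name is a valid Python identifier
# df[0] would raise a KeyError: [] on a DataFrame looks up column labels, not row positions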
Out[11]:
0      Seattle
1      Spokane
2       Tacoma
3    Vancouver
Name: city, dtype: object

Elementwise arithmetic works on Series:

In [12]:
df['tax'] / 100
Out[12]:
0    0.1025
1    0.0900
2    0.1030
3    0.0850
Name: tax, dtype: float64
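The same idea extends to comparisons and to arithmetic between two Series. A short sketch, assuming the city table from above; the Series `a` and `b` are made up to show that entries are paired by label rather than by position:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Seattle', 'Spokane', 'Tacoma', 'Vancouver'],
                   'pop': [787, 230, 222, 189],  # units are thousands
                   'tax': [10.25, 9.0, 10.3, 8.5]})

people = df['pop'] * 1000   # Series times scalar: applied to every entry
high_tax = df['tax'] > 10   # a comparison gives a boolean Series

# Series-with-Series arithmetic matches entries up by index label.
a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([10, 20], index=['y', 'x'])
total = a + b               # x: 1 + 20, y: 2 + 10

print(people)
print(high_tax)
print(total)
```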

Add a column to an existing DataFrame:

In [13]:
df['visits'] = [20,2,5,4]
df
Out[13]:
        city  pop    tax  visits
0    Seattle  787  10.25      20
1    Spokane  230   9.00       2
2     Tacoma  222  10.30       5
3  Vancouver  189   8.50       4
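A new column can also be computed from existing ones. A sketch under the same city table; the column names `tax_rate` and `people` are invented for this example:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Seattle', 'Spokane', 'Tacoma', 'Vancouver'],
                   'pop': [787, 230, 222, 189],  # units are thousands
                   'tax': [10.25, 9.0, 10.3, 8.5]})

# Assigning to a new column name modifies df in place.
df['tax_rate'] = df['tax'] / 100

# assign() leaves df alone and returns a new DataFrame with the extra column.
with_people = df.assign(people=df['pop'] * 1000)

print(df.columns.tolist())
print(with_people.columns.tolist())
```

Whether to modify in place or build a new table is mostly a matter of taste; the second style makes it harder to lose track of what a re-run cell has already changed.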
In [14]:
avengers = pd.read_csv(data_url, encoding='latin-1')
avengers.head(2)
Out[14]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 http://marvel.wikia.com/Henry_Pym_(Earth-616) Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full ... NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 http://marvel.wikia.com/Janet_van_Dyne_(Earth-... Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...

2 rows × 21 columns

In [15]:
avengers.tail(3)
Out[15]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
170 http://marvel.wikia.com/Victor_Alvarez_(Earth-... Victor Alvarez 45 YES MALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
171 http://marvel.wikia.com/Ava_Ayala_(Earth-616)# Ava Ayala 49 YES FEMALE NaN 14-Jan 2014 1 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
172 http://marvel.wikia.com/Kaluu_(Earth-616)# Kaluu 35 YES MALE NaN 15-Jan 2015 0 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

3 rows × 21 columns

In [16]:
avengers.drop(columns=["URL"]).head()
Out[16]:
Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary Death1 Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
0 Henry Jonathan "Hank" Pym 1269 YES MALE NaN Sep-63 1963 52 Full YES NO NaN NaN NaN NaN NaN NaN NaN NaN Merged with Ultron in Rage of Ultron Vol. 1. A...
1 Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...
2 Anthony Edward "Tony" Stark 3068 YES MALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Death: "Later while under the influence of Imm...
3 Robert Bruce Banner 2089 YES MALE NaN Sep-63 1963 52 Full YES YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Ghosts of the Future arc. However "he ...
4 Thor Odinson 2402 YES MALE NaN Sep-63 1963 52 Full YES YES YES NO NaN NaN NaN NaN NaN NaN Dies in Fear Itself brought back because that'...
In [17]:
avengers.shape
Out[17]:
(173, 21)
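Notice that the shape still reports 21 columns even though a cell above dropped "URL": `drop` returns a new DataFrame and leaves the original alone. A small sketch on the city table from earlier shows the same behavior alongside `head`, `tail`, and `shape`:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Seattle', 'Spokane', 'Tacoma', 'Vancouver'],
                   'pop': [787, 230, 222, 189],  # units are thousands
                   'tax': [10.25, 9.0, 10.3, 8.5]})

first_two = df.head(2)              # first 2 rows
last_one = df.tail(1)               # last row
trimmed = df.drop(columns=['tax'])  # new DataFrame; df itself is unchanged

print(df.shape)       # (rows, columns) of the original
print(trimmed.shape)  # one column fewer
```

To keep the result, assign it to a name, as in `trimmed = ...` here.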
In [18]:
avengers["Name/Alias"]
Out[18]:
0        Henry Jonathan "Hank" Pym
1                   Janet van Dyne
2      Anthony Edward "Tony" Stark
3              Robert Bruce Banner
4                     Thor Odinson
                  ...             
168                    Eric Brooks
169                  Adam Brashear
170                 Victor Alvarez
171                      Ava Ayala
172                          Kaluu
Name: Name/Alias, Length: 173, dtype: object
In [19]:
na = avengers[["Name/Alias", "Appearances"]]
In [20]:
avengers[["Name/Alias"]]
Out[20]:
Name/Alias
0 Henry Jonathan "Hank" Pym
1 Janet van Dyne
2 Anthony Edward "Tony" Stark
3 Robert Bruce Banner
4 Thor Odinson
... ...
168 Eric Brooks
169 Adam Brashear
170 Victor Alvarez
171 Ava Ayala
172 Kaluu

173 rows × 1 columns
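The difference between the last few cells is the type that comes back: a column name in single brackets gives a Series, while a list of names (double brackets) gives a DataFrame, even if the list has only one name. A quick check on the city table:

```python
import pandas as pd

df = pd.DataFrame({'city': ['Seattle', 'Spokane', 'Tacoma', 'Vancouver'],
                   'pop': [787, 230, 222, 189],  # units are thousands
                   'tax': [10.25, 9.0, 10.3, 8.5]})

one_col_series = df['city']       # single brackets: a Series
one_col_frame = df[['city']]      # list with one name: a one-column DataFrame
two_cols = df[['city', 'pop']]    # list with several names: a DataFrame

print(type(one_col_series).__name__, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
print(type(two_cols).__name__, two_cols.shape)
```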

In [21]:
na = na.sort_values("Appearances", ascending=False)
In [22]:
na[10:20]
Out[22]:
                         Name/Alias  Appearances
140                    Ororo Munroe         1598
49                   Namor McKenzie         1561
7            Clinton Francis Barton         1456
141                    Matt Murdock         1375
104  Doctor Stephen Vincent Strange         1324
0         Henry Jonathan "Hank" Pym         1269
9                    Wanda Maximoff         1214
1                    Janet van Dyne         1165
15       Natalia Alianovna Romanova         1112
13             Victor Shade (alias)         1036
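Note how the index labels on the left are out of order: sorting moves rows around, and each row keeps its original label. A sketch on a handful of rows copied from the outputs above (the name `small` is just for this example); `iloc` makes it explicit that the slice is by position in the sorted order:

```python
import pandas as pd

# Five rows copied from the Avengers output shown earlier.
small = pd.DataFrame({
    'Name/Alias': ['Henry Jonathan "Hank" Pym', 'Janet van Dyne',
                   'Thor Odinson', 'Wanda Maximoff', 'Kaluu'],
    'Appearances': [1269, 1165, 2402, 1214, 35],
})

ranked = small.sort_values('Appearances', ascending=False)
runners_up = ranked.iloc[1:3]              # 2nd and 3rd rows of the sorted table
renumbered = ranked.reset_index(drop=True) # fresh 0..n-1 labels, if wanted

print(runners_up)
print(renumbered)
```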
In [24]:
na.plot(y="Appearances", use_index=False)
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6e91f63e80>
In [ ]:
avengers["Gender"]
avengers.value_counts("Gender")
Out[ ]:
Gender
MALE      115
FEMALE     58
dtype: int64
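`value_counts` is also available on a single column (a Series), and `normalize=True` turns the counts into fractions. A sketch on a few rows copied from the outputs above; `small` is a made-up name:

```python
import pandas as pd

# Seven rows copied from the Avengers output shown earlier.
small = pd.DataFrame({
    'Name/Alias': ['Henry Jonathan "Hank" Pym', 'Janet van Dyne',
                   'Anthony Edward "Tony" Stark', 'Thor Odinson',
                   'Wanda Maximoff', 'Natalia Alianovna Romanova', 'Kaluu'],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'MALE', 'FEMALE', 'FEMALE', 'MALE'],
})

counts = small['Gender'].value_counts()                # rows per category
shares = small['Gender'].value_counts(normalize=True)  # fraction per category

print(counts)
print(shares)
```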
In [26]:
avengers.plot.scatter("Years since joining", "Appearances")
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f6e90d840d0>
In [30]:
avengers.groupby("Gender")["Appearances"].mean()
Out[30]:
Gender
FEMALE    263.327586
MALE      490.069565
Name: Appearances, dtype: float64
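The pattern is: split the rows by group, pick a column, then reduce each group to a single number. Several reductions can be requested at once with `agg`. A sketch on six rows copied from the outputs above; `small` and `summary` are made-up names:

```python
import pandas as pd

# Six rows copied from the Avengers output shown earlier.
small = pd.DataFrame({
    'Name/Alias': ['Henry Jonathan "Hank" Pym', 'Janet van Dyne',
                   'Anthony Edward "Tony" Stark', 'Wanda Maximoff',
                   'Natalia Alianovna Romanova', 'Kaluu'],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE', 'FEMALE', 'MALE'],
    'Appearances': [1269, 1165, 3068, 1214, 1112, 35],
})

# One row per group, one column per reduction.
summary = small.groupby('Gender')['Appearances'].agg(['count', 'mean', 'max'])

print(summary)
```

With only six rows the numbers mean nothing about the full dataset; the point is the shape of the result.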
In [42]:
avengers[avengers["Gender"] =="FEMALE"].head(3)
Out[42]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
1 http://marvel.wikia.com/Janet_van_Dyne_(Earth-... Janet van Dyne 1165 YES FEMALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Secret Invasion V1:I8. Actually was se...
9 http://marvel.wikia.com/Wanda_Maximoff_(Earth-... Wanda Maximoff 1214 YES FEMALE NaN May-65 1965 50 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Uncanny_Avengers_Vol_1_14. Later comes...
15 http://marvel.wikia.com/Natalia_Romanova_(Eart... Natalia Alianovna Romanova 1112 YES FEMALE NaN May-73 1973 42 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Killed by The Hand. Later revived with The Sto...

3 rows × 21 columns

In [45]:
avengers[avengers["Appearances"] > 2000]
Out[45]:
URL Name/Alias Appearances Current? Gender Probationary Introl Full/Reserve Avengers Intro Year Years since joining Honorary ... Return1 Death2 Return2 Death3 Return3 Death4 Return4 Death5 Return5 Notes
2 http://marvel.wikia.com/Anthony_Stark_(Earth-616) Anthony Edward "Tony" Stark 3068 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Death: "Later while under the influence of Imm...
3 http://marvel.wikia.com/Robert_Bruce_Banner_(E... Robert Bruce Banner 2089 YES MALE NaN Sep-63 1963 52 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies in Ghosts of the Future arc. However "he ...
4 http://marvel.wikia.com/Thor_Odinson_(Earth-616) Thor Odinson 2402 YES MALE NaN Sep-63 1963 52 Full ... YES YES NO NaN NaN NaN NaN NaN NaN Dies in Fear Itself brought back because that'...
6 http://marvel.wikia.com/Steven_Rogers_(Earth-616) Steven Rogers 3458 YES MALE NaN Mar-64 1964 51 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Dies at the end of Civil War. Later comes back.
40 http://marvel.wikia.com/Benjamin_Grimm_(Earth-... Benjamin Jacob Grimm 2305 NO MALE NaN Jun-86 1986 29 Full ... YES NaN NaN NaN NaN NaN NaN NaN NaN Once killed during a battle with Doctor Doom.'...
57 http://marvel.wikia.com/Reed_Richards_(Earth-6... Reed Richards 2125 YES MALE NaN Feb-89 1989 26 Full ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
73 http://marvel.wikia.com/Peter_Parker_(Earth-616)# Peter Benjamin Parker 4333 YES MALE NaN Apr-90 1990 25 Full ... YES YES YES NaN NaN NaN NaN NaN NaN Since joining the New Avengers: First death Ki...
92 http://marvel.wikia.com/James_Howlett_(Earth-6... James "Logan" Howlett 3130 YES MALE NaN 5-Jun 2005 10 Full ... NO NaN NaN NaN NaN NaN NaN NaN NaN Died in Death_of_Wolverine_Vol_1_4. Has not ye...

8 rows × 21 columns
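What goes inside the brackets is a boolean Series (a "mask") with one True/False per row. Masks can be combined with `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses. A sketch on six rows copied from the outputs above, with an arbitrary threshold of 1150 chosen for illustration:

```python
import pandas as pd

# Six rows copied from the Avengers output shown earlier.
small = pd.DataFrame({
    'Name/Alias': ['Henry Jonathan "Hank" Pym', 'Janet van Dyne',
                   'Anthony Edward "Tony" Stark', 'Wanda Maximoff',
                   'Natalia Alianovna Romanova', 'Kaluu'],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE', 'FEMALE', 'MALE'],
    'Appearances': [1269, 1165, 3068, 1214, 1112, 35],
})

mask = (small['Gender'] == 'FEMALE') & (small['Appearances'] > 1150)
selected = small[mask]    # rows where the mask is True
the_rest = small[~mask]   # rows where the mask is False

print(selected)
print(the_rest)
```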

In [39]:
# column info
avengers.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 173 entries, 0 to 172
Data columns (total 21 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   URL                          173 non-null    object
 1   Name/Alias                   163 non-null    object
 2   Appearances                  173 non-null    int64 
 3   Current?                     173 non-null    object
 4   Gender                       173 non-null    object
 5   Probationary Introl          15 non-null     object
 6   Full/Reserve Avengers Intro  159 non-null    object
 7   Year                         173 non-null    int64 
 8   Years since joining          173 non-null    int64 
 9   Honorary                     173 non-null    object
 10  Death1                       173 non-null    object
 11  Return1                      69 non-null     object
 12  Death2                       17 non-null     object
 13  Return2                      16 non-null     object
 14  Death3                       2 non-null      object
 15  Return3                      2 non-null      object
 16  Death4                       1 non-null      object
 17  Return4                      1 non-null      object
 18  Death5                       1 non-null      object
 19  Return5                      1 non-null      object
 20  Notes                        75 non-null     object
dtypes: int64(3), object(18)
memory usage: 28.5+ KB
In [38]:
avengers.describe()
Out[38]:
       Appearances         Year  Years since joining
count   173.000000   173.000000           173.000000
mean    414.052023  1988.445087            26.554913
std     677.991950    30.374669            30.374669
min       2.000000  1900.000000             0.000000
25%      58.000000  1979.000000             5.000000
50%     132.000000  1996.000000            19.000000
75%     491.000000  2010.000000            36.000000
max    4333.000000  2015.000000           115.000000
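Only three of the 21 columns appear above because `describe` summarizes numeric columns by default. Calling it on a text column gives a different summary (count, number of unique values, most common value). A sketch on six rows copied from the outputs above; `small` is a made-up name:

```python
import pandas as pd

# Six rows copied from the Avengers output shown earlier.
small = pd.DataFrame({
    'Name/Alias': ['Henry Jonathan "Hank" Pym', 'Janet van Dyne',
                   'Anthony Edward "Tony" Stark', 'Wanda Maximoff',
                   'Natalia Alianovna Romanova', 'Kaluu'],
    'Gender': ['MALE', 'FEMALE', 'MALE', 'FEMALE', 'FEMALE', 'MALE'],
    'Appearances': [1269, 1165, 3068, 1214, 1112, 35],
})

numeric_summary = small.describe()           # numeric columns only
gender_summary = small['Gender'].describe()  # summary of a text column

print(numeric_summary)
print(gender_summary)
```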

Pandas: a whirlwind tour

Things to demo:

  • Show the first 5 rows, last 10 rows
  • Drop some columns
  • Get the dimensions of the table
  • Get a single column
  • Get multiple columns
  • Get a single column as a DataFrame
  • Sort the table
  • Get a custom slice of rows (not-quite-superstars)
  • Count number of rows with each value in a categorical column (e.g., gender)

End of L02; done at the beginning of L03:

  • Plot a column
  • Scatterplot 2 columns
  • Compute by groups
  • Get only rows that meet some condition
  • Show summary statistics of a DataFrame