DATA 311 Lecture 2 - numpy demo¶
import numpy as np
import random
import matplotlib.pyplot as plt
import imageio.v3 as imageio
Creating Arrays¶
- array, zeros, ones, *_like
- dtype argument
# create a python list with 0..9
a = list(range(10))
a
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# create a numpy array with the list's contents
a = np.array(a)
# show the array's data type
a.dtype
dtype('int64')
# get the element at index 1
a[1]
np.int64(1)
# get the shape of the array
a.shape
(10,)
Basic list-like slicing¶
# slice with beginning and end
a[3:6]
array([3, 4, 5])
# slice with implicit start (0)
a[:4]
array([0, 1, 2, 3])
# slice with implicit end (len)
a[5:]
array([5, 6, 7, 8, 9])
# slice with implicit start and end
a[:]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# slice with a step size
a[1:7:2]
array([1, 3, 5])
Elementwise Operations¶
a
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
# array + scalar
a + 4
array([ 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
# array + array
a + a
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
# scalar * array
2 * a
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
# array + array dimension matching
a + a[:4]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[16], line 2
      1 # array + array dimension matching
----> 2 a + a[:4]

ValueError: operands could not be broadcast together with shapes (10,) (4,)
Exercise 1: Speed Check¶
In pairs: I've claimed numpy is faster than native Python. Let's find out how much faster.
- In the cell below, create a Python list (not a numpy array!) of 1,000,000 random floating-point numbers between 0.0 and 1.0. Useful tools: import random, random.random().
import random
b = []
for i in range(1_000_000):
    b.append(random.random())
b[:5]
[0.8437826572010122, 0.6696759345208393, 0.14343354017923993, 0.8904407099190859, 0.2182667489356157]
- In the next cell, create a new Python list containing the same numbers as the original, but with 0.5 subtracted from each. Don't modify the original list. I've added the IPython magic command %%time to the top of the cell to measure and report the time it takes to execute that cell.
%%time
c = []
for v in b:
    c.append(v - 0.5)
CPU times: user 83.7 ms, sys: 16.1 ms, total: 99.8 ms Wall time: 99.4 ms
- In the next cell, create a numpy array np_nums containing the same numbers as your original (0.0 to 1.0) list.
np_nums = np.array(b)
- In the cell below, create a new numpy array np_result by subtracting 0.5 from np_nums (i.e., using elementwise operations). Time this cell's execution. How much faster is the numpy version than the native Python version?
%%time
np_result = np_nums - 0.5
# your code here
CPU times: user 1.84 ms, sys: 2.96 ms, total: 4.8 ms Wall time: 4.14 ms
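Outside a notebook, where %%time isn't available, the same comparison can be sketched with the standard library's time.perf_counter. This is only a rough illustration, not part of the original demo: the names py_nums, py_result, py_seconds, and np_seconds are made up here, and the timings will differ from machine to machine.

```python
import random
import time

import numpy as np

# Build the same kind of data as above: 1,000,000 random floats in [0.0, 1.0)
py_nums = [random.random() for _ in range(1_000_000)]
np_nums = np.array(py_nums)

# Time the pure-Python version (builds a new list; the original is untouched)
start = time.perf_counter()
py_result = [v - 0.5 for v in py_nums]
py_seconds = time.perf_counter() - start

# Time the numpy elementwise version
start = time.perf_counter()
np_result = np_nums - 0.5
np_seconds = time.perf_counter() - start

print(f"python: {py_seconds * 1000:.1f} ms")
print(f"numpy:  {np_seconds * 1000:.1f} ms")
print(f"speedup: about {py_seconds / np_seconds:.0f}x")
```

A single run like this is noisy; repeating the measurement a few times (or using the timeit module) gives a steadier estimate.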
Multidimensional Arrays¶
- 2D arrays, slicing across dimensions
- elementwise operations
- comparisons / boolean dtype, masking
- visualizing as an image
More ways of making arrays:
# create an array from [1, 2, 3]
np.array([1, 2, 3])
array([1, 2, 3])
# create an array of 6 zeros
np.zeros((6,))
array([0., 0., 0., 0., 0., 0.])
# create an array of 6 zeros with 64-bit integer datatype
np.zeros((6,), dtype=np.int64)
array([0, 0, 0, 0, 0, 0])
# create a 2-by-3 array of zeros
np.zeros((2, 3))
array([[0., 0., 0.],
[0., 0., 0.]])
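The topic list for this section also mentions ones and the *_like functions, which aren't shown in the cells above. Here is a small sketch of how they behave; the variable names (template, ones_int, and so on) are made up for illustration.

```python
import numpy as np

# np.ones works like np.zeros but fills the array with 1s
ones_int = np.ones((6,), dtype=np.int64)

# The *_like functions copy the shape (and, by default, the dtype)
# of an existing array
template = np.zeros((2, 3))
ones_like_template = np.ones_like(template)             # 2-by-3, float64
zeros_as_int = np.zeros_like(template, dtype=np.int64)  # same shape, dtype overridden

print(ones_int)                                            # [1 1 1 1 1 1]
print(ones_like_template.shape, ones_like_template.dtype)  # (2, 3) float64
print(zeros_as_int.dtype)                                  # int64
```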
Reshaping¶
- more than 2 dimensions
# set b to an array of 0..5, reshaped to 2-by-3
b = np.array(range(6)).reshape((2, 3))
b
array([[0, 1, 2],
[3, 4, 5]])
# demo indexing into b
b[1,2]
np.int64(5)
Aggregation / Projection¶
# find the sum of all elements in b
b.sum()
np.int64(15)
# find the minimum value in b
b.min()
np.int64(0)
# display b, just for reference
b
array([[0, 1, 2],
[3, 4, 5]])
# sum the elements of b along axis 0 (the row dimension)
b.sum(axis=0)
array([3, 5, 7])
# sum the elements of b along axis 1 (the column dimension)
b.sum(axis=1)
array([ 3, 12])
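One way to keep the axes straight: the axis you pass is the one that disappears from the result's shape. The sketch below checks that on the same b and shows that other aggregations (max and mean here, chosen only as examples) take the same axis argument.

```python
import numpy as np

b = np.array(range(6)).reshape((2, 3))

# axis=0 collapses the 2 rows, leaving one value per column
col_sums = b.sum(axis=0)
# axis=1 collapses the 3 columns, leaving one value per row
row_sums = b.sum(axis=1)

print(col_sums.shape, col_sums)  # (3,) [3 5 7]
print(row_sums.shape, row_sums)  # (2,) [ 3 12]

# Other aggregations accept the same axis argument
col_max = b.max(axis=0)
row_mean = b.mean(axis=1)
print(col_max)   # [3 4 5]
print(row_mean)  # [1. 4.]
```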
Exercise 2 - Broadcasting¶
In pairs: We've seen that, to perform elementwise operations, the dimensions of the arrays must match. There's one convenient exception to this. Let's see it in action below:
b = np.array(range(6)).reshape((2, 3))
b.shape
(2, 3)
c = np.array([2, 4])
c.shape
(2,)
# dimension mismatch
b * c
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[34], line 2
      1 # dimension mismatch
----> 2 b * c

ValueError: operands could not be broadcast together with shapes (2,3) (2,)
# reshape c to be 2D
c_2x1 = c.reshape((2, 1))
c_2x1.shape
(2, 1)
b
array([[0, 1, 2],
[3, 4, 5]])
c_2x1
array([[2],
[4]])
# example of broadcasting:
b * c_2x1
array([[ 0, 2, 4],
[12, 16, 20]])
- This is called broadcasting. Explain what happened here.
The values in c_2x1 got repeated across the column dimension to match the column dimension of b.
Now, run the following cells to see another example.
d = np.array([1, 0, 1]).reshape(1, 3)
d.shape
(1, 3)
b
array([[0, 1, 2],
[3, 4, 5]])
d
array([[1, 0, 1]])
b * d
array([[0, 0, 2],
[3, 0, 5]])
Now, explain the general rule for:
- What kind of dimension mismatches are allowed?
Each dimension's sizes must match exactly, unless one of the arrays has size 1 in that dimension.
- How do elementwise operations behave when such a mismatch is present?
The elements are repeated along the size-1 (singleton) dimension to match the other array's size in that dimension.
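The rule can be checked without doing any arithmetic. The sketch below uses np.broadcast_shapes (available in NumPy 1.20 and later) to test the shape pairs from this exercise, and np.broadcast_to to make the repetition explicit. The name c_repeated is made up for illustration. Note that shapes are compared starting from the last dimension, which is why (2,) is lined up against the 3 in (2, 3) and fails.

```python
import numpy as np

# Shape pairs from the exercise that are compatible
print(np.broadcast_shapes((2, 3), (2, 1)))  # (2, 3)
print(np.broadcast_shapes((2, 3), (1, 3)))  # (2, 3)

# (2, 3) with (2,) is not compatible: the trailing sizes 3 and 2 differ
try:
    np.broadcast_shapes((2, 3), (2,))
    compatible = True
except ValueError as err:
    compatible = False
    print("not compatible:", err)

b = np.array(range(6)).reshape((2, 3))
c_2x1 = np.array([2, 4]).reshape((2, 1))

# broadcast_to shows what c_2x1 looks like once repeated to b's shape
c_repeated = np.broadcast_to(c_2x1, b.shape)
print(c_repeated)
# [[2 2 2]
#  [4 4 4]]

# Multiplying by the repeated version gives the same result as broadcasting
print(np.array_equal(b * c_2x1, b * c_repeated))  # True
```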
Numpy, Continued¶
Fancy indexing¶
- Integer indexing: a[ list or ndarray of integer indices ]
- Boolean indexing: a[ list or ndarray of booleans ] where the list/ndarray's shape matches a's
See https://numpy.org/doc/stable/user/basics.indexing.html for much more.
Integer indexing¶
a = np.array(range(10, 20))
a
Indexing with a list or array of integers pulls out only the elements at those indices:
# get the first, third, and fifth elements:
# get the fourth, second, and second elements (!):
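The two cells above were left empty for live coding. One possible way to fill them in is sketched here; the variable names are made up for illustration.

```python
import numpy as np

a = np.array(range(10, 20))

# first, third, and fifth elements live at indices 0, 2, and 4
first_third_fifth = a[[0, 2, 4]]

# fourth, second, and second elements: indices can repeat
# and can appear in any order
fourth_second_second = a[[3, 1, 1]]

print(first_third_fifth)     # [10 12 14]
print(fourth_second_second)  # [13 11 11]
```

The result has the same length as the index list, not the same length as a.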
Boolean Indexing¶
b = np.ones((2, 2))
b[0,0] = 2
b[1,1] = 0
Quick quiz: what does b look like now?
b
Make a "mask" of booleans that's the same shape as b:
mask = np.array([
[True, False],
[False, True]
])
mask
# index b with the boolean mask:
A common pattern - comparison operators to generate a mask:
# get an array of only the elements of b that are greater than zero:
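The masking cells above were also left empty. Here is a sketch of what they might contain, rebuilding b and mask so that it runs on its own; the names masked and positive are made up for illustration.

```python
import numpy as np

b = np.ones((2, 2))
b[0, 0] = 2
b[1, 1] = 0
# b is now [[2., 1.],
#           [1., 0.]]

mask = np.array([
    [True, False],
    [False, True]
])

# Indexing with a boolean mask returns a 1D array of the elements
# where the mask is True, in row-major order
masked = b[mask]
print(masked)  # [2. 0.]

# A comparison produces a boolean array of the same shape,
# which can be used directly as the mask
positive = b[b > 0]
print(positive)  # [2. 1. 1.]
```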
Tips for multidimensional arrays¶
- I never display anything that's more than 2D.
- I never try to visualize anything that's more than 3D.
c = np.array(range(24)).reshape(2, 4, 3)
c
# take one 2D slice
# take another 2D slice along a different axis
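The slicing cells above are empty in these notes. A sketch of two possible 2D slices of c follows. Which slices were taken in lecture is not recorded, so the choices here (and the names first_block and last_channel) are only illustrative.

```python
import numpy as np

c = np.array(range(24)).reshape(2, 4, 3)

# Fix the first axis: a 4-by-3 slice
first_block = c[0]
print(first_block.shape)  # (4, 3)
print(first_block)

# Fix the last axis instead: a 2-by-4 slice
last_channel = c[:, :, 2]
print(last_channel.shape)  # (2, 4)
print(last_channel)
# [[ 2  5  8 11]
#  [14 17 20 23]]
```

Each index that is a single integer removes that axis from the result, so the output is small enough to display as an ordinary 2D array.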
Exercise 3 - Play with my cat¶
In pairs: In this exercise, we'll manipulate an image as a 2D array.
We'll start by loading a picture of my cat, Beans:
beans = imageio.imread("/cluster/academic/DATA311/202620/beans_gray.jpeg")
We'll use plt.imshow to visualize the image:
plt.imshow(beans, cmap='gray')
- What is the dtype of the resulting array? What are the minimum and maximum values?
- Display a binary image showing which pixels are greater than half the maximum pixel intensity (127).
- What is the average value of pixels that have intensity value above 127?
- Which column of the image has the highest average pixel value?
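The operations these questions call for can be tried without the image file. The sketch below uses a random 8-bit array as a stand-in for beans. That stand-in is an assumption (the real image's size and dtype should be checked with beans.shape and beans.dtype), and the names img, bright, and col_means are made up for illustration.

```python
import numpy as np

# Stand-in for the real image: random 8-bit grayscale pixels (an assumption)
rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(120, 160), dtype=np.uint8)

# dtype and value range
print(img.dtype, img.min(), img.max())

# Boolean mask of pixels brighter than 127; it has the same shape as img,
# so it could be shown with plt.imshow(bright, cmap='gray')
bright = img > 127

# Average of only the bright pixels, selected with the mask
mean_bright = img[bright].mean()

# Mean over axis 0 (rows) leaves one average per column;
# argmax gives the index of the largest one
col_means = img.mean(axis=0)
brightest_col = col_means.argmax()

print(mean_bright, brightest_col)
```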
