Winter 2023
This lab gives you some practice working with multidimensional numpy arrays (in particular, images) and data preprocessing (in particular, outlier detection and text normalization).
For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
The three parts of this lab will each be done in a separate notebook. For the first two parts, I’ve provided you with a little starter code. Download the starter notebooks here: Part 1, Part 2 and upload them to Colab to get started.
As we saw in class, color images can be represented by 3D `ndarray`s with shape `(rows, columns, channels)`, where `channels` is typically size 3, representing the red, green, and blue values of each pixel. A video can be thought of as a stack of images: a 4D array, sometimes represented as an array with shape `(frames, rows, columns, channels)`.
You can imagine a video as being a cube-like object, with rows and columns as two dimensions and time (frames) as the third. The channel dimension just comes along for the ride - when displayed on your screen, those three values get stuck together into a single pixel. If you imagine taking a two-dimensional slice of the “video cube” in the plane of the rows and columns axes at some fixed value of the frames dimension, you’ll just end up with a single frame of the video. But we can slice the cube in other directions as well, and this can result in some pretty interesting images! In this task, you’ll experiment with “slicing” a video cube in a couple of non-traditional ways.
Load the frames into a single `ndarray`. Different approaches can work - here are a couple of suggestions (a sketch of both appears below):

- Read the images in one at a time and combine the `(height, width, 3)` images into a `(frames, height, width, 3)` array (e.g., with `np.stack`).
- Preallocate a `(frames, height, width, 3)` array, then read each image in, assigning it to a slice of the array, as in `cube[i,:,:,:] = frame_i`.

You’ll probably want to test and debug with a subset (e.g., the first 5) of the frames.
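Here’s a minimal sketch of both suggestions; the filename pattern is a hypothetical placeholder you’d adjust to match the provided frames:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file naming -- adjust to match the provided frames.
paths = [f"frames/frame_{i:04d}.png" for i in range(5)]  # small subset for debugging

# Suggestion 1: read into a list, then stack along a new leading axis.
frames = [plt.imread(p) for p in paths]   # each frame is (height, width, 3)
cube = np.stack(frames, axis=0)           # (frames, height, width, 3)

# Suggestion 2: preallocate, then fill one slice per frame.
first = plt.imread(paths[0])
cube = np.empty((len(paths),) + first.shape, dtype=first.dtype)
for i, p in enumerate(paths):
    cube[i, :, :, :] = plt.imread(p)
```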
Before moving on to the next part, I recommend reviewing the notes on numpy indexing and slicing. You should be comfortable with the syntax for, e.g., extracting the *i*th frame from the video cube, or taking a 100-by-100 crop from the top left of the video cube.
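For a quick self-check, those two operations look something like this (assuming the video cube from above is named `cube`):

```python
# Assuming `cube` has shape (frames, rows, columns, channels):
i = 0
frame_i = cube[i]               # the i-th frame, shape (rows, columns, 3)
crop = cube[:, :100, :100, :]   # 100-by-100 top-left crop of every frame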
To get a frame, you can fix the frame dimension and take all rows and all columns. What if instead, we fix the column dimension and take all rows and all frames? If we do this, we get an image that’s similar to a “slit scan” photograph (wikipedia, examples). Your task is to play with creating “time slice photographs” that fix the image column (or row) and produce an image where one of the dimensions represents time. Here’s an example output that I made from the given image set:
Write code to compute a time slice image like the above. Play around and see what you can create! The only requirement is that one dimension must represent time and another must represent a fixed row or column of the image’s spatial dimensions.
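Here’s one minimal sketch of a starting point, assuming the cube from above is in `cube`; the choice of a middle column and the final transpose are arbitrary choices for you to play with:

```python
# Fix one column; frames become the new horizontal axis.
col = cube.shape[2] // 2                    # middle column (any column works)
time_slice = cube[:, :, col, :]             # (frames, rows, 3)
time_slice = time_slice.transpose(1, 0, 2)  # rows vertical, time horizontal
plt.imshow(time_slice)
```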
Note: loading all the images is not a super quick operation. Do not load the images in more than one place in your notebook, and do not load the images more than once during a single session of work on this lab. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.
At the end of the “Part 1.1” section of your notebook, call `plt.imshow` to display your result.
Above, we took a slice straight along the time dimension, orthogonal to the (rows, columns) plane. But there’s no reason we have to stick to that! If we slice the space-time cube diagonally, we can mix change over time and change across the image’s spatial dimensions into a single image. Here’s an example I created using a timelapse of the New York City skyline (you can see the original video here):
Write code to compute a “diagonal” time slice image like this one. Again, you can play around with the specifics and get creative, but for this result please make sure that your slice is not parallel to any of the video cube’s axis-aligned planes.
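As one hedged starting point (not necessarily the method behind the example above), you could step through frames as you step through rows, which yields a cut that isn’t parallel to any axis-aligned plane:

```python
import numpy as np
import matplotlib.pyplot as plt

# A minimal "diagonal" slice sketch, assuming `cube` has shape
# (frames, rows, columns, 3). As the output row advances, we also
# advance through time, so the cut crosses the cube diagonally.
n_frames, n_rows, n_cols, _ = cube.shape
out = np.empty((n_rows, n_cols, 3), dtype=cube.dtype)
for r in range(n_rows):
    f = round(r * (n_frames - 1) / (n_rows - 1))  # map output row -> frame
    out[r] = cube[f, r, :, :]
plt.imshow(out)
```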
Note: as in Part 1.1, loading all the images is not a super quick operation. Do not load the images in more than one place in your notebook, and do not load the images more than once during a single session of work on this lab. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.
At the end of the “Part 1.2” section of your notebook, call `plt.imshow` to display your result.
In this task, we’ll also work with a large pile of images, but this time they don’t come from a video. These are small versions of aerial imagery tiles from Bing Maps (because they have a nice API). My research students downloaded these for a computer vision project analyzing international political borders - we’re working with a small subset of the data that comes from the border between Ecuador and Peru.
After we downloaded the images, we realized there was a problem: though the Bing API served us up files for all the locations we asked for, some of the images weren’t actually available. Instead of conveniently returning an error code or something, Bing simply gave us a bogus file that doesn’t look like a normal image. Since there are 3300+ images (and in reality, more like 660,000 covering the international borders for the whole globe), we need a way to detect these bogus images so we can ignore them when applying our computer vision models.
As in Part 1, I’ve provided you with an index DataFrame with the filename and URL of each of the images; your task is to augment this DataFrame with a new column that tells us whether the image is real or not. This column should be called `Inlier` and should be `True` if the image is real (i.e., not bogus) and `False` if the image doesn’t actually contain image data (i.e., bogus, an outlier). The starter notebook includes some cells at the bottom, including one that writes your completed DataFrame to a CSV file. Submit this CSV file to Canvas along with your notebook.
Also as in part 1, you should not download and re-download the images unnecessarily. Set up your data downloading code in a single cell and run it only once per session. A grade deduction will be applied to notebooks that show evidence of a workflow involving repeated downloads of imagery.
Your notebook should not include mysterious hard-coded knowledge. The process of figuring out how to detect the outliers may be interactive - you’ll want to actually look at images and test out hypotheses. Hard-coding the knowledge you gained is not sufficient: your notebook should include these tests and convincingly demonstrate the source of any knowledge you gained when exploring.
Here’s a contrived example to illustrate my point: suppose that after poking around, you found that the bogus images have different dimensions (e.g., 32x32) than the real images (64x64) (real life is not so convenient, but let’s roll with the hypothetical). In this case, your notebook should at least include some code that pulls out a sample of the 32x32 images and visualizes them to confirm that they’re all bogus, before using that knowledge to create the `Inlier` column of the DataFrame. Even more convincingly, you might identify another hallmark of an outlier and show that all the 32x32 images have such a hallmark and none of the rest do.
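In code, that kind of check might look roughly like the sketch below; `load_image` and the `filename` column are hypothetical stand-ins for whatever the starter notebook actually provides:

```python
import matplotlib.pyplot as plt

# Contrived-scenario sketch: group images by height, then visually
# inspect a sample of the suspicious group before trusting the rule.
index["height"] = [load_image(f).shape[0] for f in index["filename"]]
suspects = index[index["height"] == 32]

fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for ax, fname in zip(axes, suspects["filename"].sample(5)):
    ax.imshow(load_image(fname))
    ax.set_title(fname, fontsize=6)
    ax.axis("off")
```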
Loading and processing all 3358 images at once is going to be a little slow. I suggest working with a smaller number of images when testing. Since for all we know the outliers may be bunched together somewhere, I recommend looking at a random sample of the data, rather than, say, the first k images. Do be sure to sample enough images that your sample is likely to have enough outliers to stand out, though. This is hard to do before you know what fraction of the images are outliers, but hey, real life is hard.
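With a pandas index DataFrame, a reproducible random sample is one line (the sample size here is just an arbitrary starting point):

```python
# A fixed seed keeps the sample reproducible across sessions.
sample = index.sample(n=200, random_state=0)
```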
In a notebook titled `lab4_p3.ipynb`, load the contents of `lab4_p3.txt`¹ into a string. You will complete two subtasks: look at the distribution of sentence lengths in the story, and plot the frequency of the most commonly-appearing words.
Use `spacy`’s sentence tokenization to break the story into sentences and:
Use whatever preprocessing or normalization strategies you need to ensure your analysis includes only things that you would consider “real” sentences. Explain your steps; describe any issues you encountered and how you handled them.
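To get started with the tokenization step, a minimal sketch (assuming the small English pipeline is installed and `text` holds the story) might look like:

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # `text` holds the contents of lab4_p3.txt
sentences = [sent.text.strip() for sent in doc.sents]
lengths = [len(sent) for sent in doc.sents]   # sentence length in tokens
```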
Produce a plot showing the frequency of the 30 most commonly-appearing words in the story. Your analysis should focus on only substantive words; for example, it’s not very interesting to know that ‘the’ appears quite often in the text, or that spaces happen between (almost?) every pair of words, so your plot should avoid things like that.
Use whatever preprocessing or normalization strategies you need. Explain these steps, and describe any issues you encountered along the way and how you handled them.
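One possible (certainly not the only reasonable) normalization pipeline for the word counts, again as a hedged sketch building on the `doc` from above:

```python
from collections import Counter
import matplotlib.pyplot as plt

# Keep alphabetic, non-stop-word tokens and count their lowercased
# forms (counting lemmas instead is another reasonable choice).
words = [tok.text.lower() for tok in doc if tok.is_alpha and not tok.is_stop]
top30 = Counter(words).most_common(30)

labels, counts = zip(*top30)
plt.figure(figsize=(10, 4))
plt.bar(labels, counts)
plt.xticks(rotation=90)
plt.ylabel("count")
```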
Each flavor of extra credit can earn you up to 3 points. As usual, each point is exponentially more difficult to earn.
Part 1 Experiment with (or generalize!) the techniques from Part 1 on other videos of your choosing and make something cool! For example, you could try out non-planar slices. If you want to load video files directly, you can - there’s some info on a few approaches here; I just decided to keep things simple and minimize extra dependencies by breaking the videos out into images.
For inspiration, here are a couple research papers (with neat video results!) that play with the space-time cube in interesting ways:
Part 3 Expand your analysis in some interesting way, like color-coding your most-frequent words by part of speech or plotting the sentiment over the duration of the story.
Although these notebooks differ from the ones we’ve produced before, you should still think of them as presentable artifacts. Your approach should be explained (to a knowledgeable reader who knows the tools you’re using) in Markdown cells interleaved with the code such that your notebook tells the story of how you got to your results.
Make sure that all your cells have been run and have the output you intended, then download a copy of each notebook in `.ipynb` format. Run the last cell in the Part 2 notebook to generate `image_index.csv`. Combine all three notebooks and your `image_index.csv` into a single zip file and submit it to the Lab 4 assignment on Canvas.
Finally, fill out the Week 4 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.
Bandwidth etiquette Parts 1 and 2 of this lab involve downloading collections of images. If your notebook(s) show evidence of a workflow involving excessive and unnecessary downloads of imagery, up to 10 points will be deducted.
Parts 1.1 and 1.2 are each worth 10 points, and are graded on the correctness, clarity, and creativity/style criteria.
Part 2 is worth 20 points and is graded on the correct, convincing, and clear criteria:

- `Inlier` column is completely correct
- `Inlier` column has a few mistakes
- `Inlier` column has some mistakes
- `Inlier` column is missing

Parts 3.1 and 3.2 are each worth 10 points and are graded on the convincing and clear criteria.
Extra Credit
Up to 3 points as detailed above.
¹ Source: The New York Times Magazine ↩︎