DATA 311 - Lab 4: Numpy, Data Preprocessing

Scott Wehrwein

Winter 2023

Introduction

This lab gives you some practice working with multidimensional numpy arrays (in particular, images) and data preprocessing (in particular, outlier detection and text normalization).

Collaboration Policy

For this lab, you may (are encouraged) to spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.

Getting Started

The three parts of this lab will each be done in a separate notebook. For the first two parts, I’ve provided you with a little starter code. Download the starter notebooks here: Part 1, Part 2 and upload them to Colab to get started.

Part 1: Simulating Slit Scan and Time Slice Photography

As we saw in class, color images can be represented by 3D ndarrays with shape (rows, columns, channels), where channels is typically size 3, representing red, green, and blue values of each pixel. A video can be thought of as a stack of image: a 4D array, sometimes represented as an array with shape (frames, rows, columns, channels).

You can imagine a video as being a cube-like object, with rows and columns as two dimensions and time (frames) as the third. The channel dimension just comes along for the ride - when displayed on your screen, those three values get stuck together into a single pixel. If you imagine taking a two-dimensional slice of the “video cube” in the plane of the rows and columns axis at some fixed value of the frames dimension, you’ll just end up with a single frame of the video. But we can slice the cube in other directions as well, and this can result in some pretty interesting images! In this task, you’ll experiment with “slicing” a video cube in a couple of non-traditional ways.

1.0: Create a Video Cube

Load the frames into a single ndarray. Different approaches can work - here are a couple suggestions:

Load all the images into a python list, then call np.array on that list; this will concatenate the list of (height, width, 3) images into a (frames, height, width, 3) array.
Preallocate a (frames, height, width, 3) array, then read each image in assigning it to a slice of the array, as in cube[i,:,:,:] = frame_i.

You’ll probably want to test and debug with a subset (e.g., the first 5) of the frames.

Before moving onto the next part, I recommend reviewing the notes on numpy indexing and slicing. You should be comfortable with the synatx for, e.g., extracting the ith frame from the video cube, or taking a 100-by-100 crop from the top left of the video cube.

1.1: Slicing Directly Across Time

To get a frame, you can fix the frame dimension and take all rows and all columns. What if instead, we fix the column dimension and take all rows and all frames? If we do this, we get an image that’s similar to a “slit scan” photograph (wikipedia, examples). Your task is to play with creating “time slice photographs” that fix the image column (or row) and produce an image where one of the dimensions represents time. Here’s an example output that I made from the given image set:

Write code to compute a time slice image like the above. Play around and see what you can create! The only requirement is that one dimension must represent time and another must represent a fixed row or column of the image’s spatial dimensions.

Note: loading all the images is not a super quick operation. Do not load the images in more than one place in your notebook, and do not load the images more than once during a single session of work on this lab. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.

At the end of the “Part 1.1” section of your notebook, call plt.imshow to display your result.

1.2: Slicing Diagonally Across Time and Space

Above, we took a slice straight along the time dimension, orthogonal to the (rows, columns) plane. But there’s no reason we have to stick to that! If we slice the space-time cube diagonally, we can mix change over time and change across the image’s spatial dimensions into a single image. Here’s an example I created using a timelapse of the New York City skyline (you can see the original video here):

Write code to compute a “diagonal” time slice image like this one. Again, you can play around with the specifics and get creative, but for this result please make sure that your slice is not parallel to any of the video cube’s axis-aligned planes.

Note: as in Part 1.1, loading all the images is not a super quick operation. Do not load the images in more than one place in your notebook, and do not load the images more than once during a single session of work on this lab. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.

At the end of the “Part 1.2” section of your notebook, call plt.imshow to display your result.

Part 2 - Outlier Detection in Aerial Imagery

In this task, we’ll also work with a large pile of images, but this time they don’t come from a video. These are small versions of aerial imagery tiles from Bing Maps (because they have a nice API). My research students downloaded these for a computer vision project analyzing international political borders - we’re working with a small subset of the data that comes from the border between Ecuador and Peru.

After we downloaded the images, we realized there was a problem: though the Bing API served us up files for all the locations we asked for, some of the images weren’t actually available. Instead of conveniently returning an error code or something, Bing simply gave us a bogus file that doesn’t look like a normal image. Since there are 3300+ images (and in reality, more like 660,000 covering the international borders for the whole globe), we need a way to detect these bogus images so we can ignore them when applying our computer vision models.

As in Part 1, I’ve provided you with an index DataFrame with the filename and URL of each of the images; your task is to augment this dataframe with a new column that tell us whether the image is real or not. This column should be called Inlier and should be True if the image is real (i.e., not bogus) and False if the image doesn’t actually contain image data (i.e., bogus, an outlier). The starter notebook includes some cells at the bottom, including one that writes your completed dataframe to a CSV file. Submit this CSV file to Canvas along with your notebook.

Also as in part 1, you should not download and re-download the images unnecessarily. Set up your data downloading code in a single cell and run it only once per session. A grade deduction will be applied to notebooks that show evidence of a workflow involving repeated downloads of imagery.

Guidelines and Suggestions

Your notebook should not include mysterious hard-coded knowledge. The process of figuring out how to detect the outliers may be interactive - you’ll want to actually look at images and test out hypotheses. Hard-coding the knowledge you gained is not sufficient: your notebook should include these tests and convincingly demonstrate the source of any knowledge you gained when exploring.

Here’s a contrived example to illustrate my point: suppose that after poking around, you found that the bogus images have different dimensions (e.g., 32x32) than the real images (64x64) (real life is not so convenient, but let’s roll with the hypothetical). In this case, your notebook should at least include some code that pull out a sample of the 32x32 images and visualizes them to confirm that they’re all bogus, before using that knowledge to create the Inlier column of the DataFrame. Even more convincingly, you might identify another hallmark of an outlier and show that all the 32x32 images have such a hallmark and none of the rest do.
Loading and processing all 3358 images at once is going to be a little slow. I suggest working with a smaller number of images when testing. Since for all we know the outliers may be bunched together somewhere, I recommend looking at a random sample of the data, rather than, say, the first k images. Do be sure to sample enough images that your sample is likely to have enough outliers to stand out, though. This is hard to do before you know what fraction of the images are outliers, but hey, real life is hard.

Part 3 - Text Normalization

In a notebook titled lab4_p3.ipynb, load the contents of lab4_p3.txt ¹ into a string. You will complete two subtasks: look at the distribution of sentence lengths in the story, and plot the frequency of the most commonly-appearing words.

3.1: Sentence Length Distribution

Use spacy’s sentence tokenization to break the story into sentences and:

Plot the observed distribution of sentence lengths (in words)
Find and print the 3 shortest sentences
Find and print the 3 longest sentences

Use whatever preprocessing or normalization strategies you need to ensure your analysis includes only things that you would consider “real” sentences. Explain your steps; describe any issues you encountered and how you handled them.

3.2: Word Frequency Distribution

Produce a plot showing the frequency of the 30 most commonly-appearing words in the story. Your analysis should focus on only substantive words; for example, it’s not very interesting to know that ‘the’ appears quite often in the text, or that spaces happen between (almost?) every pair of words, so your plot should avoid things like that.

Use whatever preprocessing or normalization strategies you need. Explain these steps, and describe any issues you encountered along the way and how you handled them.

Extra Credit

Each flavor of extra credit can earn you up to 3 points. As usual, each point is exponentially more difficult to earn.

Part 1 Experiment with (or generalize!) the techniques from Part 1 on other videos of your choosing and make something cool! For example, you could try out non-planar slices. If you want to load video files directly, you can - there’s some info on a few approaches here; I just decided to keep things simple and minimize extra dependencies by breaking the videos out into images.

For inspiration, here are a couple research papers (with neat video results!) that play with the space-time cube in interesting ways:

Part 3 Expand your analysis in some interesting way, like color-coding your most-frequent words by part of speech or plotting the sentiment over the duration of the story.

Guidelines

Although these notebooks differ from the ones we’ve produced before, you should still think of them as presentable artifacts. Your approach should be explained (to a knowledgeable reader who knows the tools you’re using) in Markdown cells interleaved with the code such that your notebook tells the story of how you got to your results.

Submitting your work

Notebooks and CSV

Make sure that all your cells have been run and have the output you intended, then download a copy of each notebook in .ipynb format. Run the last cell in the Part 2 notebook to generate image_index.csv. Combine all three notebooks and your image_index.csv into a single zip file and submit it to the Lab 4 assignment on Canvas.

Survey

Finally fill out the Week 4 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.

Rubric

Bandwidth etiquette Parts 1 and 2 of this lab involve downloading collections of images. If your notebook(s) show evidence of a workflow involving excessive and unnecessary downloads of imagery, up to 10 points will be deducted.

Parts 1.1 and 1.2 are each worth 10 points, and are graded on the correctness, clarity, and creativity/style criteria.

5 points - a correct time slice image is created and displayed
4 points - code and approach are clearly explained
1 point - the result is creative/interesting to look at (e.g., not just solid horizontal stripes of the same colors)

Part 2 is worth 20 points and is graded on the correct, convincing, and clear criteria:

10 points - Correct
- 5/5 the Inliers column is completely correct
- 4/5 the Inliers column has a few mistakes
- 3/5 some outliers are detected, but the Inliers column has some mistakes
- 2/5 an honest attempt was made, but the outlier detection works rarely or never
- 1/5 minimal effort was made
- 0/5 minimal effort / no submission / the Inliers column is missing
6 points - Convincing
- 3/3 - Exploration is clearly documented and the outlier detection criteria are thoroughly validated.
- 2/3 - Exploration is partially documented, outlier criteria are believable but not thoroughly demonstrated.
- 1/3 - Some exploration is documented, but the outlier criteria are hard-coded without justification.
- 0/3 - No exploration is documented.
4 points - Clear
- 2/2 - code and analysis are clearly and concisely explained and justified
- 1/2 - code and analysis are partly explained, or explanations are difficult to understand
- 0/2 - minimal effort was made / no submission

Parts 3.1 and 3.2 are each worth 10 points and are graded on convincing, and clear criteria:

6 points - Convincing
- 3/3 - Analysis is solid and preprocessing ensures that only substance is included
- 2/3 - Analysis is solid and preprocessing removes most non-substantive content
- 1/3 - Analysis is flawed or preprocessing is minimally effective
- 0/1 - Minimal effort / no submission
4 points - Clear
- 2/2 - code and analysis are clearly and concisely explained and justified
- 1/2 - code and analysis are partly explained, or explanations are difficult to understand
- 0/2 - no effort was made / no submission

Extra Credit

Up to 3 points as detailed above.

Source: The New York Times Magazine ↩︎