Fall 2021
This lab gives you some practice working with multidimensional numpy arrays (in particular, images) and data cleaning (in particular, outlier detection).
For this lab, you may (and are encouraged to) spend the lab period working together with a partner. After the lab period ends, you will work independently and submit your own solution, though you may continue to collaborate in accordance with the individual assignment collaboration policy listed on the syllabus.
To work on this lab, you’ll need to install the imageio package into your virtual environment. Assuming you followed the instructions from Lab 1, this would look like the following:
cd 311
source data311_env/bin/activate
pip install imageio
The two parts of this lab will each be done in a separate notebook. Unlike in prior labs, I’ve provided you with a little starter code for each one. You can find the starter notebooks here: Part 1, Part 2. Download each notebook and use the code therein to access the relevant data for each part.
As we saw in class, color images can be represented by 3D arrays with shape (rows, columns, channels), where channels is typically size 3, representing the red, green, and blue values of each pixel. A video can be thought of as a stack of images: a 4D array, sometimes represented as an array with shape (frames, rows, columns, channels).
You can imagine a video as being a cube-like object, with rows and columns as two dimensions and time (frames) as the third. The channel dimension just comes along for the ride - when displayed on your screen, those three values get stuck together into a single pixel. If you imagine taking a two-dimensional slice of the “video cube” in the plane of the rows and columns axes at some fixed value of the frames dimension, you’ll just end up with a single frame of the video. But we can slice the cube in other directions as well, and this can result in some pretty interesting images! In this task, you’ll experiment with “slicing” a video cube in a couple of non-traditional ways.
To get a frame, you can fix the frame dimension and take all rows and all columns. What if instead, we fix the column dimension and take all rows and all frames? If we do this, we get an image that’s similar to a “slit scan” photograph (wikipedia, examples). Your task is to load the images provided (see the starter notebook for details) and play with creating “time slice photographs” that fix the image column (or row) and produce an image where one of the dimensions represents time. Here’s an example output that I made from the given image set:
Write code to compute a time slice image like the above. Play around and see what you can create! The only requirement is that one dimension must represent time and another must represent a fixed row or column of the image’s spatial dimensions.
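One possible shape for this computation, again using a synthetic array in place of the loaded frames (in the lab, `video` would be built by stacking the provided images, e.g. with `np.stack`):

```python
import numpy as np

# Hypothetical stand-in for the stacked frames of the real image set.
rng = np.random.default_rng(0)
video = rng.integers(0, 256, size=(100, 64, 80, 3), dtype=np.uint8)

# Fix a single column and take all frames and all rows:
col = 40
time_slice = video[:, :, col, :]            # shape: (frames, rows, channels)

# Transpose so rows stay vertical and time runs horizontally:
time_slice = time_slice.transpose(1, 0, 2)  # shape: (rows, frames, channels)
print(time_slice.shape)  # (64, 100, 3)
```

The same idea works with a fixed row instead of a fixed column; the transpose is only there so the result displays with time along the horizontal axis.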
Note: loading all the images is not a super quick operation. Try not to do it more than once in your notebook. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.
At the end of the “Part 1.1” section of your notebook, call plt.imshow to display your result.
Above, we took a slice straight along the time dimension, orthogonal to the (rows, columns) plane. But there’s no reason we have to stick to that! If we slice the space-time cube diagonally, we can mix change over time and change across the image’s spatial dimensions into a single image. Here’s an example I created using a timelapse of the New York City skyline (you can see the original video here):
Write code to compute a “diagonal” time slice image like this one. Again, you can play around with the specifics and get creative, but for this result please make sure that your slice is not parallel to any of the video cube’s axis-aligned planes.
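One way to sketch a diagonal slice is with integer fancy indexing: advance the frame index as the column index advances, so the slice cuts through the (frames, columns) plane at an angle. This is just one possible angle, with a synthetic array standing in for the real video:

```python
import numpy as np

# Hypothetical stand-in for the stacked timelapse frames.
rng = np.random.default_rng(1)
frames, rows, cols = 100, 64, 80
video = rng.integers(0, 256, size=(frames, rows, cols, 3), dtype=np.uint8)

# For each output column c, sample from a frame index that advances with c,
# producing a slice that is not parallel to any axis-aligned plane.
frame_idx = np.linspace(0, frames - 1, cols).astype(int)
col_idx = np.arange(cols)

# With advanced indices separated by a slice, the indexed dimensions come
# first in the result, so this has shape (cols, rows, channels).
diag = video[frame_idx, :, col_idx, :]
diag = diag.transpose(1, 0, 2)  # (rows, cols, channels) for display
print(diag.shape)  # (64, 80, 3)
```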
Note: as in Part 1.1, loading all the images is not a super quick operation. Try not to do it more than once in your notebook. Also, when you’re testing and developing, you may want to try working on a small subset of the images (e.g., the first 20 frames) so you can try stuff out more quickly.
At the end of the “Part 1.2” section of your notebook, call plt.imshow to display your result.
In this task, we’ll also work with a large pile of images, but this time they don’t come from a video. These are small versions of aerial imagery tiles from Bing Maps (because they have a nice API). My research students downloaded these for a computer vision project analyzing international political borders - we’re working with a small subset of the data that comes from the border between Ecuador and Peru.
After we downloaded the images, we realized there was a problem: though the Bing API served us up files for all the locations we asked for, some of the images weren’t actually available. Instead of conveniently returning an error code or something, Bing simply gave us a bogus file that doesn’t look like a normal image. Since there are 3300+ images (and in reality, more like 660,000 covering the international borders for the whole globe), we need a way to detect these bogus images so we can ignore them when applying our computer vision models.
As in Part 1, I’ve provided you with an index DataFrame with the filename and URL of each of the images; your task is to augment this DataFrame with a new column that tells us whether each image is real or not. This column should be called Inlier and should be True if the image is real (i.e., not bogus) and False if the image doesn’t actually contain image data (i.e., bogus, an outlier). The starter notebook includes some cells at the bottom, including one that writes your completed DataFrame to a CSV file. Submit this CSV file to Canvas along with your notebook.
Your notebook should not include mysterious hard-coded knowledge. The process of figuring out how to detect the outliers may be interactive - you’ll want to actually look at images and test out hypotheses. Hard-coding the knowledge you gained is not sufficient: your notebook should include these tests and convincingly demonstrate the source of any knowledge you gained when exploring.
Here’s a contrived example to illustrate my point: suppose that after poking around, you found that the bogus images have different dimensions (e.g., 32x32) than the real images (64x64) (real life is not so convenient, but let’s roll with the hypothetical). In this case, your notebook should at least include some code that pulls out a sample of the 32x32 images and visualizes them to confirm that they’re all bogus, before using that knowledge to create the Inlier column of the DataFrame. Even more convincingly, you might identify another hallmark of an outlier and show that all the 32x32 images have that hallmark and none of the rest do.
Loading and processing all 3358 images at once is going to be a little slow. I suggest working with a smaller number of images when testing. Since for all we know the outliers may be bunched together somewhere, I recommend looking at a random sample of the data, rather than, say, the first k images. Do be sure to sample enough images that your sample is likely to have enough outliers to stand out, though. This is hard to do before you know what fraction of the images are outliers, but hey, real life is hard.
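The overall shape of such a workflow might look like the sketch below. The data, the summary statistic, and the threshold are all invented for illustration (per-image standard deviation is not necessarily the actual hallmark of the bogus files - finding a convincing one is the point of the task):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 20 "real" images with varied pixels and
# 5 nearly constant "bogus" ones. In the lab, these arrays would come
# from reading the files listed in the index DataFrame with imageio.
real = [rng.integers(0, 256, size=(64, 64, 3)) for _ in range(20)]
bogus = [np.full((64, 64, 3), 128) for _ in range(5)]
images = real + bogus

index = pd.DataFrame(
    {"filename": [f"tile_{i:04d}.png" for i in range(len(images))]}
)

# One summary statistic per image; near-zero variation is the
# (assumed, illustrative) hallmark of a bogus file in this sketch.
index["pixel_std"] = [img.std() for img in images]
index["Inlier"] = index["pixel_std"] > 1.0

print(index["Inlier"].sum())  # 20 inliers in this synthetic example
```

When exploring the real data, `index.sample(n=..., random_state=...)` is a convenient way to pull the kind of random subset described above.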
Experiment with (or generalize!) the techniques from Part 1 on other videos of your choosing and make something cool! For example, you could try out non-planar slices. If you want to load video files directly, you can - there’s some info on a few approaches here; I just decided to keep things simple and minimize extra dependencies by breaking the videos out into images.
If you make some creative and interesting results, I’ll award up to 3 points of extra credit - as usual, each extra credit point is exponentially more difficult to earn.
For inspiration, here are a couple research papers (with neat video results!) that play with the space-time cube in interesting ways:
Although these notebooks differ from the ones we’ve produced before, you should still think of them as presentable artifacts. Your approach should be explained (to a knowledgeable reader who knows pandas and numpy) in Markdown cells interleaved with the code such that your notebook tells the story of how you got to your results.
Make sure that all your cells have been run and have the output you intended, then download a copy of your notebook in .ipynb format and submit the resulting files to the Lab 4 assignment on Canvas. Run the last cell in the Part 2 notebook to generate image_index.csv and submit that to Canvas along with both notebooks.
Finally, fill out the Week 4 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.
Parts 1.1 and 1.2 are each worth 15 points, and are graded on the correctness, clarity, and creativity/style criteria.
Part 2 is worth 20 points and is graded on the correctness, convincingness, and clarity criteria:
- Inlier column is completely correct
- Inlier column has a few mistakes
- Inlier column has a few mistakes
- Inlier column is missing

Extra Credit: Up to 3 points as detailed above.