DATA 311 - Lab 5: Scraping Movie Data

Scott Wehrwein

Spring 2026

Overview

In this lab, you will analyze trends in the movie industry. The investigation is motivated by an interest in questions pertaining to the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average popular ratings; average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.

Collaboration Policy

You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students attend lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.

You must work with a different partner for Lab 5 than you did for Labs 1, 2, and 4.

Getting Started

There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.

The Data

IMDB - The Internet Movie Database

The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie, including title, release date, runtime, cast and crew, user ratings, critic ratings, and so on. IMDB actually does provide much of their data in a friendly, downloadable form (see https://developer.imdb.com/non-commercial-datasets/). The problem is that this data doesn’t contain everything we want, so we are going to scrape the webpages instead.

Specifically, we are going to scrape the IMDB pages for the top 1000 movies sorted by box office returns. This is available here:

https://www.imdb.com/list/ls098063263/

When you visit this page, it shows the top 250 movies. You can click a button at the bottom, or append &page=2 to go to the next page (there are a total of four, covering the top 1000 movies). Unfortunately, these pages don’t seem to take kindly to automated requests like the ones we’ll make using the requests library. Although there are techniques for getting around the blocking of automated requests like this, we are going to be polite and not try to work around this restriction.

Instead, I’ve cached a version of this list, rendered as four separate pages, each with 250 of the top 1000 movies. You can find the first page here:

https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/imdb/Top%201000%20Highest-Grossing%20Movies%20of%20All%20Time.html

and subsequent pages are at similar URLs with filenames such as ...Time_page2.html.

The Numbers

IMDB is good for the basics, but for juicy financial information, we have to look to The Numbers. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. It no longer seems to have an all-time list of highest-budget films, but I’ve made a local mirror of the list they previously had.

Unfortunately, this list is sorted by budget, not by domestic gross - if we had that, we could get the same list of movies from The Numbers and from IMDB. We’ll have to settle for the heuristic that the movies with the highest budgets (from The Numbers) are likely to overlap with the movies with the highest domestic gross.

A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be accessed via URLs with the following pattern:

https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/101

Each URL has 100 films out of a total list of 1000.
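The two URLs above suggest that the number at the end is the rank of the first film on that page. Assuming that pattern continues (201, 301, and so on up to 901; this is an inference from the two examples, not something the mirror documents), a minimal sketch for generating all ten page URLs might look like this. The function name budget_page_urls is hypothetical.

```python
# Base URL of the local mirror, copied from the pattern above.
BASE_URL = (
    "https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/"
    "lab5/the-numbers.com/movie/budgets/all/"
)


def budget_page_urls(total=1000, per_page=100):
    """Return one URL per page.

    Assumption: each numeric suffix is the rank of the page's first film,
    so pages after the first end in 101, 201, ..., 901.
    """
    urls = [BASE_URL]  # the first page has no numeric suffix
    for start in range(per_page + 1, total + 1, per_page):
        urls.append(f"{BASE_URL}{start}")
    return urls


urls = budget_page_urls()
print(len(urls))
print(urls[1])
```

It is worth opening a couple of the generated URLs in a browser to confirm the assumed pattern before scraping all of them.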

Scraping Etiquette

Most websites are built for humans to visit and view their content using a web browser. As soon as we start programmatically visiting websites, we risk abusing these resources. Above all, it’s important to make sure that your scraping activities are not putting undue strain on those hosting the resources.

When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests at once in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping:

Pre-Lab

Part I: HTML

Part II: Requests

Part III: BeautifulSoup

Part IV: Robots.txt

Lab

For both of the scraping tasks below, use the locally-hosted mirrored versions of the websites described above, rather than scraping from imdb.com and the-numbers.com.
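Even when scraping a mirror, it helps to download each page only once and to space out requests. The following is one possible sketch, not a required design: the helper name fetch_cached, its parameters, and the one-second default delay are all assumptions made for illustration. The download function is passed in as an argument so the helper can be tried without touching the network.

```python
import time
from pathlib import Path


def fetch_cached(url, cache_path, fetch, delay_seconds=1.0):
    """Return the page text for url, downloading it at most once.

    fetch is any function that maps a URL to page text
    (hypothetical parameter; for example, a wrapper around requests.get).
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Already downloaded: read from disk and generate no network traffic.
        return cache_path.read_text(encoding="utf-8")
    text = fetch(url)
    cache_path.write_text(text, encoding="utf-8")
    # Pause after a real download so back-to-back requests are spaced out.
    time.sleep(delay_seconds)
    return text
```

With the requests library, a call might look like fetch_cached(url, "page1.html", lambda u: requests.get(u).text). Re-running the notebook cell then reads from disk instead of hitting the server again.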

Part 1: Scrape IMDB

Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:

Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
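As a rough illustration of the save-and-reload step, the sketch below writes a small DataFrame to CSV and reads it back. The rows, the column names, and the filename imdb_sample.csv are made up for the example and are not the lab's required schema.

```python
import pandas as pd

# Hypothetical rows standing in for scraped results.
movies = pd.DataFrame({
    "Title": ["Example Movie A", "Example Movie B"],
    "Rating": [7.9, 8.4],
})

# index=False avoids writing the row index as an extra unnamed column.
movies.to_csv("imdb_sample.csv", index=False)

# Later (or in a fresh session), load from disk instead of re-scraping.
reloaded = pd.read_csv("imdb_sample.csv")
print(reloaded)
```

Note that values come back from CSV with types inferred by read_csv, so columns that were stored as text (such as dollar amounts with symbols) will still be text after reloading.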

Part 2: Scrape The Numbers

Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:

Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
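Since The Numbers data is described above as a table, one plausible approach is to walk the table rows with BeautifulSoup. The sketch below runs on a toy HTML string with made-up values; the actual markup of the mirrored pages may be structured differently, so inspect the page source before adapting it.

```python
from bs4 import BeautifulSoup

# Toy HTML with made-up values; the real pages' markup may differ.
html = """
<table>
  <tr><th>Title</th><th>Budget</th></tr>
  <tr><td>Example Movie A</td><td>$200,000,000</td></tr>
  <tr><td>Example Movie B</td><td>$150,000,000</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.find_all("tr"):
    # Collect the text of each data cell, trimming surrounding whitespace.
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:  # the header row has only <th> cells, so its list is empty
        rows.append(cells)

print(rows)
```

Each inner list can then become one row of a DataFrame, for example with pd.DataFrame(rows, columns=[...]).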

Part 3: Clean and Merge

Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.

Merge

Since our two DataFrames have a matching column, we can merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables should simply not be included in the merged table. Discard any entries with matching titles but different years - these are likely not referring to the same movie (e.g., reboots or unrelated films with the same title).
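A minimal sketch of this merge on toy data follows. It assumes each table has a Title column and a numeric Year column; the Year, Rating, and Budget names and all values are hypothetical, and in your data the year may first need to be extracted from a date column.

```python
import pandas as pd

# Hypothetical data; column names other than "Title" are assumptions.
imdb = pd.DataFrame({
    "Title": ["Example A", "Example B", "Example C"],
    "Year": [2001, 2010, 1995],
    "Rating": [7.1, 8.0, 6.5],
})
numbers = pd.DataFrame({
    "Title": ["Example A", "Example B", "Example D"],
    "Year": [2001, 1980, 2015],
    "Budget": [100, 50, 75],
})

# An inner join keeps only titles that appear in both tables.
merged = imdb.merge(numbers, on="Title", how="inner", suffixes=("_imdb", "_numbers"))

# Drop rows where the two sources disagree on the year.
merged = merged[merged["Year_imdb"] == merged["Year_numbers"]].reset_index(drop=True)
print(merged)
```

Here "Example B" matches by title but is dropped because the years differ, while "Example C" and "Example D" never make it into the merged table because each appears in only one source.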

Clean

Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. This will require us to unify some of the numerical representations in the two tables. In particular, you should at least handle the following:

  1. For all the columns that represent numerical values, remove all instances of '$' and ',' and convert them to numerical types using df.astype().
  2. Clean the release year column using pd.to_datetime(). Tip: you can now access the year a movie was released with df['Release Date'].dt.year.
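The two cleaning steps above can be sketched on toy data as follows. The column name Budget, the sample values, and the date format "%b %d, %Y" are assumptions for illustration; check how dates are actually written in your scraped data before choosing a format.

```python
import pandas as pd

# Hypothetical values in the style of scraped text.
df = pd.DataFrame({
    "Budget": ["$200,000,000", "$1,500,000"],
    "Release Date": ["Dec 18, 2009", "Jul 4, 1996"],
})

# Strip currency symbols and thousands separators, then convert to integers.
# regex=False treats "$" as a literal character rather than a regex anchor.
df["Budget"] = (
    df["Budget"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype("int64")
)

# Parse the date strings; the format string is an assumption about the data.
df["Release Date"] = pd.to_datetime(df["Release Date"], format="%b %d, %Y")

print(df["Budget"].tolist())
print(df["Release Date"].dt.year.tolist())
```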

Part 4: Analyze

You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that’s pretty neat! Explore the data for interesting trends. For each of these analyses, create plots with title and axis labels (with units!) and comment on any trends you find. For correlation plots, use df.corr to compute and display the correlation coefficients. Your analysis should:

  1. Investigate the empirical distribution of the domestic gross returns by creating a histogram.

  2. Investigate user rating vs domestic gross by creating a scatterplot.

  3. Create a plot to investigate budget by release year.

  4. Create one more visualization and analysis of your own choosing. Your insights should include at least one thing that could not be discovered using only one dataset or the other (i.e., something involving relationships between IMDB columns and The Numbers columns).
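For the correlation coefficients, one detail worth knowing is that a merged table usually contains text columns such as the title, so it can help to select the numeric columns before calling corr. The sketch below uses made-up values and hypothetical column names.

```python
import pandas as pd

# Hypothetical merged data with one text column and two numeric columns.
df = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "Rating": [6.0, 7.0, 8.0, 9.0],
    "Domestic Gross": [100.0, 200.0, 300.0, 400.0],
})

# Select only the numeric columns of interest, then compute pairwise correlations.
corr = df[["Rating", "Domestic Gross"]].corr()
print(corr)
```

The result is a small symmetric table whose off-diagonal entry is the correlation between the two columns; it can be displayed alongside the corresponding scatterplot.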

Rubric

Pre-Lab (10 points)

Lab (50 points)

Parts 1-2 (10 points each)

The scraping process collects all of the requested data, respects proper scraping etiquette, and is sufficiently well-documented.

Part 3 (10 points)
Part 4 (20 points)