Spring 2026
In this lab, you will analyze trends in the movie industry. The investigation is motivated by questions about the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings, average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.
You must work with a different partner for Lab 5 than you did for Labs 1, 2, and 4.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie, such as title, release date, runtime, cast and crew, user ratings, and critic ratings. IMDB actually does provide much of its data in a friendly, downloadable form (see https://developer.imdb.com/non-commercial-datasets/). The problem is that this data doesn't contain everything we want, so we are going to scrape the webpages instead.
Specifically, we are going to scrape the IMDB pages for the top 1000 movies sorted by box office returns. This is available here:
https://www.imdb.com/list/ls098063263/
When you visit this page, it shows the top 250 movies. You can click a button at the bottom, or append &page=2 to the URL, to go to the next page (there are a total of four pages, covering the top 1000 movies).
Unfortunately, these pages don't seem to take kindly to automated requests like the ones we'll make using requests. Although there are techniques for getting around this kind of blocking, we are going to be polite and not try to work around the restriction.
Instead, I’ve cached a version of this list, rendered as four separate pages, each with 250 of the top 1000 movies. You can find the first page here:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/imdb/Top%201000%20Highest-Grossing%20Movies%20of%20All%20Time.html
and subsequent pages are at similar URLs with filenames such as
...Time_page2.html.
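Since the four cached pages follow a predictable naming pattern, you can build the list of URLs programmatically rather than typing each one out. A minimal sketch; the exact filenames for pages 3 and 4 are an assumption based on the ...Time_page2.html example:

```python
# Build the URLs of the four cached IMDB pages.
# Assumes pages 2-4 all follow the "_pageN" suffix pattern shown for page 2.
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/imdb/"
        "Top%201000%20Highest-Grossing%20Movies%20of%20All%20Time")

imdb_urls = [BASE + ".html"] + [f"{BASE}_page{i}.html" for i in range(2, 5)]
```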
IMDB is good for the basics, but for juicy financial information, we have to look to The Numbers. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. It no longer seems to have an all-time list of highest-budget films, but I’ve made a local mirror of the list they previously had.
Unfortunately, this list is sorted by budget, not by domestic gross; if it were sorted by domestic gross, we could get the same list of movies from The Numbers as from IMDB. We'll have to settle for the heuristic that the movies with the highest budgets (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be accessed via URLs with the following pattern:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/101
Each URL has 100 films out of a total list of 1000.
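One way to generate all ten page URLs, assuming the offsets continue in steps of 100 (i.e., /101, /201, ..., /901, which is an extrapolation from the two URLs shown):

```python
# Build URLs for the ten budget pages: the bare /all/ page, then /all/101, /all/201, ...
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/"
        "the-numbers.com/movie/budgets/all/")

numbers_urls = [BASE] + [f"{BASE}{offset}" for offset in range(101, 1000, 100)]
```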
Most websites are built for humans to visit and view their content using a web browser. As soon as we start programmatically visiting websites, we risk abusing these resources. Above all, it's important to make sure that your scraping activities are not putting undue strain on those hosting the resources.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody's servers by making many requests in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you're scraping:
- Import the time module; after each time you make a webpage request, call time.sleep(10) (or more) to pause for ten seconds and make sure you're not making requests at an unreasonable rate.
- Do your get request in an early cell that you don't need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Once you've scraped your data, save it out with the to_csv method. Anytime you come back to continue work, load from the CSV file instead of re-doing the whole scrape.

Part I: HTML
Please navigate to the “Structuring content with HTML” online learning module and go through the following tutorials as an introduction to HTML structures: Basic HTML syntax and Structuring documents.
For additional resources, please see “HTML Tutorial.” The following modules are recommended: HTML Introduction and HTML Basic Examples.
In a text editor of your choice that is capable of exporting to pdf, answer these questions:
What is the character reference for a double quote? What are two other symbols with character references, and what are the references for them?
Where does the title of a page typically show up?
What's the difference between a span and a div?
What roles do the head, title, and body elements play in an HTML document?
Part II: Requests
Please read these two sections of the "Requests Quickstart" for the Python requests package: Make a Request and Response Content.
In the same editor, answer these questions:
What does the requests.get() command do?
How do you read the content of the server’s response for an html page?
Part III: BeautifulSoup
Please read “Guide to Parsing HTML with BeautifulSoup in Python” and answer these questions in the same editor:
What does the find_all() method do?
How do you search for all <a> tags that have the "element" class?
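As a concrete illustration of the kind of query the second question asks about, here is a toy sketch (the HTML snippet is made up, not part of the lab's data):

```python
from bs4 import BeautifulSoup

# A made-up snippet: two links carry the "element" class, one does not.
html = """
<div>
  <a class="element" href="/one">One</a>
  <a class="element special" href="/two">Two</a>
  <a href="/three">Three</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag; class_ filters by CSS class
# (a tag matches if "element" is among its classes).
links = soup.find_all("a", class_="element")
```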
Part IV: Robots.txt

Using the requests library, find out what your user agent is when making requests. You may find it handy to parse the resulting page with BeautifulSoup, but combing through the raw HTML is fine too.

My get requests returned empty pages when I tried to access the Top 1000 Grossing list (at URL https://www.imdb.com/list/ls098063263/) using requests. Does IMDB's robots.txt explicitly forbid us (using the default requests user-agent) from accessing this page programmatically? If so, give the relevant parts of the robots.txt file that forbid this access. If not, are there any other restrictions?

For both of the scraping tasks below, use the locally-hosted mirrored versions of the websites described above, rather than scraping from imdb.com and the-numbers.com.
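For the robots.txt questions, you can read the file by eye, or let Python's standard library interpret it for you with urllib.robotparser. A sketch with made-up rules (these are NOT IMDB's actual robots.txt contents; fetch the real file to answer the question):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only.
rules = """\
User-agent: *
Disallow: /list/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request.
blocked = rp.can_fetch("python-requests/2.31.0",
                       "https://www.imdb.com/list/ls098063263/")
allowed = rp.can_fetch("python-requests/2.31.0",
                       "https://www.imdb.com/title/tt0111161/")
```

Under these made-up rules, `blocked` is False and `allowed` is True; with the real file you'd first fetch https://www.imdb.com/robots.txt and parse its contents instead.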
Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
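Putting the etiquette rules together, the overall scrape-then-cache pattern might look something like this sketch. The function names, the fetch and parse_page helpers, and the CSV filename are all placeholders (your soup-parsing code goes inside parse_page):

```python
import time

import pandas as pd


def scrape_all(urls, fetch, parse_page, delay=10):
    """Fetch each page politely, parse it into rows, and return a DataFrame.

    fetch(url) -> str          e.g., lambda u: requests.get(u).text
    parse_page(html) -> list   one dict of column values per movie (your soup code)
    """
    rows = []
    for url in urls:
        html = fetch(url)
        rows.extend(parse_page(html))
        time.sleep(delay)  # be polite: pause between requests
    return pd.DataFrame(rows)


# After scraping once, cache to disk so later sessions skip the scrape:
# df.to_csv("imdb_top1000.csv", index=False)
# df = pd.read_csv("imdb_top1000.csv")
```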
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
Since our two DataFrames have a matching column, we can merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables should simply not be included in the merged table. Discard any entries with matching titles but different years - these are likely not referring to the same movie (e.g., reboots or unrelated films with the same title).
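A toy sketch of the merge-and-filter step; the column names (Title, Year) and toy values are assumptions, so match them to whatever your cleaned tables actually use:

```python
import pandas as pd

imdb = pd.DataFrame({"Title": ["A", "B", "C"],
                     "Year": [1999, 2005, 2010],
                     "Rating": [8.1, 7.2, 6.5]})
numbers = pd.DataFrame({"Title": ["A", "B", "D"],
                        "Year": [1999, 1984, 2011],
                        "Budget": [100, 50, 75]})

# Inner merge on Title keeps only movies present in both tables;
# suffixes disambiguate the two Year columns.
merged = imdb.merge(numbers, on="Title", how="inner",
                    suffixes=("_imdb", "_num"))

# Discard same-title rows whose years disagree (likely different movies).
merged = merged[merged["Year_imdb"] == merged["Year_num"]]
```

Here "C" and "D" drop out of the inner merge, and "B" is discarded because its years disagree, leaving only "A".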
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. This will require us to unify some of the numerical representations in the two tables. In particular, you should at least handle the following:
- Convert monetary columns (budgets and grosses) to numeric types with df.astype().
- Convert release dates to datetime with pd.to_datetime().
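For instance, the cleaning might look something like this sketch; the column names and the exact string formats (like "$200,000,000" and "Jun 11, 1993") are assumptions about what your scrape will produce:

```python
import pandas as pd

df = pd.DataFrame({"Budget": ["$200,000,000", "$1,500,000"],
                   "Release Date": ["Jun 11, 1993", "Dec 19, 1997"]})

# Strip "$" and "," so the column can be cast to a numeric dtype.
df["Budget"] = df["Budget"].str.replace(r"[$,]", "", regex=True).astype("int64")

# Parse the date strings into a proper datetime column.
df["Release Date"] = pd.to_datetime(df["Release Date"], format="%b %d, %Y")
```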
Tip: you can now access the year a movie was released with df['Release Date'].dt.year.

You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that's pretty neat! Explore the data for interesting trends. For each of these analyses, create plots with a title and axis labels (with units!) and comment on any trends you find. For correlation plots, use df.corr to compute and display the correlation coefficients. Your analysis should:
Investigate the empirical distribution of the domestic gross returns by creating a histogram.
Investigate user rating vs domestic gross by creating a scatterplot.
Create a plot to investigate budget by release year.
Create one more visualization and analysis of your own choosing. Your insights should include at least one thing that could not be discovered using only one dataset or the other (i.e., something involving relationships between IMDB columns and The Numbers columns).
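For the correlation pieces, a minimal sketch of computing the coefficients; the column names and values here are placeholders standing in for your merged table:

```python
import pandas as pd

# Toy stand-in for the merged table's numeric columns.
df = pd.DataFrame({"Budget": [10, 20, 30, 40],
                   "Domestic Gross": [15, 35, 50, 80]})

# corr() returns a symmetric matrix of pairwise Pearson correlations;
# read off a single coefficient with .loc[row, col].
corr = df.corr()

# For the plots themselves: df.plot.scatter(x="Budget", y="Domestic Gross"),
# df["Domestic Gross"].plot.hist(), etc., then add titles and axis labels.
```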
The scraping process collects all of the requested data, respects proper scraping etiquette, and is sufficiently well-documented.