Fall 2025
In this lab, you will analyze trends in the movie industry. The investigation is motivated by questions about the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings, average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.
You must work with a different partner for Lab 5 than you did for Labs 1, 2, and 4.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie, including title, release date, runtime, cast and crew, user ratings, critic ratings, and more. IMDB actually does provide much of its data in a friendly, downloadable form (see https://developer.imdb.com/non-commercial-datasets/). The problem is that this data doesn’t contain everything we want, so we are going to scrape the webpages instead.
Specifically, we are going to scrape the IMDB pages for the top 1000 movies sorted by box office returns. This is available here:
https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc
When you visit this page, it shows the top 50 movies. In the past, this was more scrape-friendly: you could append (e.g.) &start=50 to the URL and get a page with the next 50 movies, starting at #50. Unfortunately, the additional movies are now revealed via JavaScript interaction using the “50 more” button at the bottom of the page. Although more sophisticated crawlers/scrapers exist that can be configured to perform actions like this, we aren’t going to go that far.
Instead, I’ve cached a version of this list, rendered as a single page with the top 1000 movies. You can find that here: https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/imdb/imdb_top1000.html
Note that when you visit that page in a browser (at least when I
did), it only shows the first 50 movies. However, when you load it in
Python using the requests library, you’ll get the HTML
containing all 1000 movies.
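For example, a minimal fetch along the lines below should return the HTML for all 1000 movies. This is just a sketch; the helper name, the timeout value, and the error handling are my own choices, not part of the assignment.

```python
import requests

# Cached copy of the IMDB top-1000 list (URL from the lab handout).
URL = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
       "data311_25f/lab5/imdb/imdb_top1000.html")

def fetch(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None

html = fetch(URL)
if html is not None:
    print(len(html))  # length of the full page: all 1000 movies' worth of HTML
```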
IMDB is good for the basics, but for juicy financial information, we have to look to The Numbers. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be found here:
https://www.the-numbers.com/movie/budgets/all
Unfortunately, they don’t seem to have the same data sorted by domestic gross; if they did, we could in theory get the same list of movies from The Numbers as from IMDB. We’ll have to settle for the heuristic that the movies with the highest budget (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
On this page, you can click the 101-200 link at the bottom of the page to see how the URL changes when you advance to the next 100 movies. Before you start scraping The Numbers, please make sure you’ve read the Scraping Etiquette section below.
If you are unable to scrape The Numbers’ live webpage, either because their robots.txt disallows it or because they are blocking your requests (even though your requests respect robots.txt), you can access a cached version of the pages that I’ve hosted, using URLs with the following pattern:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/the-numbers.com/movie/budgets/all/
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/the-numbers.com/movie/budgets/all/101
Notice that these URLs mirror the true ones, but with the course webpage URL for Lab 5 prepended.
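Assuming the cached pages follow the same 100-movies-per-page offsets as the live site (the first page unnumbered, then 101, 201, and so on; this is an assumption worth verifying by clicking through), the full list of page URLs can be built with a comprehension:

```python
# Base of the cached mirror (URL from the lab handout).
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
        "data311_25f/lab5/the-numbers.com/movie/budgets/all/")

# First page has no number; later pages start at 101, 201, ..., 901.
page_urls = [BASE] + [BASE + str(start) for start in range(101, 1001, 100)]
print(page_urls[:2])
```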
Most websites are built for humans to visit and view their content using a web browser. As soon as we start programmatically visiting websites, we risk abusing these resources. Above all, it’s important to make sure that your scraping activities are not putting undue strain on those hosting the resources.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping:
- Import the time module, then after each time you make a webpage request, call time.sleep(10) (or more) to pause for ten seconds and make sure you’re not making requests at an unreasonable rate.
- Cache the result of each get request in an early cell that you don’t need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Save your scraped data to disk using the to_csv method. Anytime you come back to continue work, load from the CSV file instead of re-doing the whole scrape.

Part I: HTML
Please navigate to the “Structuring content with HTML” online learning module and go through the following tutorials as an introduction to HTML structures: Basic HTML syntax and Structuring documents.
For additional resources, please see “HTML Tutorial.” The following modules are recommended: HTML Introduction and HTML Basic Examples.
In a text editor of your choice that is capable of exporting to pdf, answer these questions:
- What is the character reference for a double quote? What are two other symbols with character references, and what are the references for them?
- Where does the title of a page typically show up?
- What’s the difference between a span and a div?
- What roles do the head, title, and body elements play in an HTML document?
Part II: Requests
Please read these two sections of the “Requests Quickstart” for the Python requests package: Make a Request and Response Content.
In the same editor, answer these questions:
- What does the requests.get() command do?
- How do you read the content of the server’s response for an HTML page?
Part III: BeautifulSoup
Please read “Guide to Parsing HTML with BeautifulSoup in Python” and answer these questions in the same editor:
- What does the find_all() method do?
- How do you search for all <a> tags that have the “element” class?
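As a warm-up, here is a sketch of both ideas on a tiny hand-written document. The markup below is made up for illustration; it is not the structure of the IMDB or The Numbers pages.

```python
from bs4 import BeautifulSoup

# A tiny invented document to practice on (not a real page).
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div class="links">
      <a class="element" href="/a">First</a>
      <a class="element" href="/b">Second</a>
      <a class="other" href="/c">Third</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="element")  # only the two "element" links
print([a["href"] for a in links])
```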
Part IV: Robots.txt
- Using the requests library, fetch a page that reports your user agent, and find out what your user agent is when using requests. You may find it handy to parse the resulting page with BeautifulSoup, but combing through the raw HTML is fine too.
- Read the robots.txt rules that apply to https://www.the-numbers.com/movie/budgets/all. Are you (via the requests user-agent you found in the prior question) allowed to scrape this page? If so, are there any restrictions?

Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
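The parse-then-save workflow might look like the sketch below. The tag and class names here are invented for illustration; the real selectors must be discovered by inspecting the page in your browser’s developer tools.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's tags and class names will differ.
html = """
<ul>
  <li class="movie"><span class="title">Movie A</span><span class="rating">8.1</span></li>
  <li class="movie"><span class="title">Movie B</span><span class="rating">7.4</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for li in soup.find_all("li", class_="movie"):
    rows.append({
        "Title": li.find("span", class_="title").text,
        "Rating": li.find("span", class_="rating").text,
    })

df = pd.DataFrame(rows)
df.to_csv("imdb_top1000.csv", index=False)  # reload later with pd.read_csv
```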
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
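A polite multi-page loop might be structured like this sketch. The function and parameter names are my own; the stub fetcher at the bottom exists only so the example runs without touching the network. In the real scrape you would omit the fetch argument and keep delay at ten seconds or more.

```python
import time
import requests

def scrape_pages(urls, parse_page, delay=10, fetch=None):
    """Politely scrape pages: fetch each, parse it, pause between requests."""
    if fetch is None:
        # Default fetcher hits the live web.
        fetch = lambda url: requests.get(url, timeout=30).text
    all_rows = []
    for url in urls:
        all_rows.extend(parse_page(fetch(url)))
        time.sleep(delay)  # etiquette: at least ten seconds between requests
    return all_rows

# Demo with a stub fetcher and zero delay so nothing touches the network:
rows = scrape_pages(["p1", "p2"],
                    parse_page=lambda html: [html.upper()],
                    delay=0,
                    fetch=lambda url: url)
print(rows)  # ['P1', 'P2']
```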
As mentioned above, if you encounter any issues scraping the live site, please use my mirrored version instead. Do not continue hammering their servers if you find your requests are being blocked, even if your usage is not explicitly going against their robots.txt.
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
Since our two DataFrames have a matching column, we can merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables should simply not be included in the merged table. Discard any entries with matching titles but different years - these are likely not referring to the same movie (e.g., reboots or unrelated films with the same title).
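A sketch of the merge on toy tables follows; the column names are assumptions, so match them to whatever your cleaned DataFrames actually use.

```python
import pandas as pd

# Toy stand-ins for the scraped tables (column names are illustrative).
imdb = pd.DataFrame({"Title": ["A", "B", "C"], "Year": [2005, 2010, 2015],
                     "Rating": [8.0, 7.5, 6.9]})
numbers = pd.DataFrame({"Title": ["A", "B", "D"], "Year": [2005, 1999, 2012],
                        "Budget": [100, 200, 300]})

# Inner merge on Title keeps only titles present in both tables.
merged = pd.merge(imdb, numbers, on="Title", suffixes=("_imdb", "_nums"))

# Same title but different year: probably a different movie, so drop it.
merged = merged[merged["Year_imdb"] == merged["Year_nums"]]
print(merged["Title"].tolist())  # ['A']
```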
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. This will require us to unify some of the numerical representations in the two tables. In particular, you should at least handle the following:
- Money columns should be numeric: strip out characters like “$” and “,”, then convert, e.g., using df.astype().
- The Release Date column should be a datetime: convert it using pd.to_datetime().
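For instance, on a toy frame (the column names and date format here are illustrative; check what your scrape actually produced):

```python
import pandas as pd

df = pd.DataFrame({"Budget": ["$200,000,000", "$55,500,000"],
                   "Release Date": ["Jun 11, 2010", "Mar 3, 2015"]})

# Strip '$' and ',' and convert to a numeric dtype.
df["Budget"] = df["Budget"].str.replace(r"[$,]", "", regex=True).astype(float)

# Parse the date strings into datetimes.
df["Release Date"] = pd.to_datetime(df["Release Date"])
print(df["Release Date"].dt.year.tolist())  # [2010, 2015]
```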
Tip: you can now access the year a movie was released with df['Release Date'].dt.year.

You are now in possession of a dataset that, to my knowledge, did not
exist (or could not be obtained for free, anyway) until you made it -
that’s pretty neat! Explore the data for interesting trends. For each of
these analyses, create plots with title and axis labels (with units!)
and comment on any trends you find. For correlation plots, use
df.corr to compute and display the correlation
coefficients. Your analysis should:
Investigate the empirical distribution of the domestic gross returns by creating a histogram.
Investigate user rating vs domestic gross by creating a scatterplot.
Create a plot to investigate budget by release year.
Create one more visualization and analysis of your own choosing. Your insights should include at least one thing that could not be discovered using only one dataset or the other (i.e., something involving relationships between IMDB columns and The Numbers columns).
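A plotting-plus-correlation skeleton, shown here on made-up numbers, might look like the following; the column names are placeholders for your real ones.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for the merged, cleaned dataset.
df = pd.DataFrame({"Rating": [6.0, 7.0, 8.0, 9.0],
                   "Domestic Gross (USD)": [1e8, 2e8, 4e8, 8e8]})

fig, ax = plt.subplots()
ax.scatter(df["Rating"], df["Domestic Gross (USD)"])
ax.set_title("User rating vs. domestic gross")
ax.set_xlabel("IMDB user rating (0-10)")
ax.set_ylabel("Domestic gross (USD)")
fig.savefig("rating_vs_gross.png")

print(df.corr())  # pairwise correlation coefficients
```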
The scraping process collects all of the requested data, respects proper scraping etiquette, and is sufficiently well-documented.