Spring 2026
In this lab, you will analyze trends in the movie industry. The investigation is motivated by questions about the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings, average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.
You must work with a different partner for Lab 5 than you did for Labs 1, 2, and 4.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie, such as title, release date, runtime, cast and crew, user ratings, and critic ratings. IMDB actually does provide much of its data in a friendly, downloadable form (see https://developer.imdb.com/non-commercial-datasets/). The problem is that this data doesn't contain everything we want, so we are going to scrape the webpages instead.
Specifically, we are going to scrape the IMDB pages for the top 1000 movies sorted by box office returns. This is available here:
https://www.imdb.com/list/ls098063263/
When you visit this page, it shows the top 250 movies. You can click a button at the bottom, or append &page=2 to the URL, to go to the next page (there are a total of four pages, covering the top 1000 movies).
Unfortunately, these pages don't seem to take kindly to automated requests like the ones we'll make using requests. Although there are techniques for getting around this kind of blocking, we are going to be polite and not try to work around the restriction.
Instead, I’ve cached a version of this list, rendered as four separate pages, each with 250 of the top 1000 movies. You can find the first page here:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/imdb/Top%201000%20Highest-Grossing%20Movies%20of%20All%20Time.html
and subsequent pages are at similar URLs with filenames such as
...Time_page2.html.
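Since the four cached pages follow a predictable naming pattern, you can build the list of URLs programmatically rather than typing each one out. A minimal sketch; the exact filenames for pages 3 and 4 are an assumption based on the ...Time_page2.html example:

```python
# Build the URLs of the four cached IMDB pages.
# Assumes pages 2-4 all follow the "_pageN" suffix pattern shown for page 2.
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/imdb/"
        "Top%201000%20Highest-Grossing%20Movies%20of%20All%20Time")

imdb_urls = [BASE + ".html"] + [f"{BASE}_page{i}.html" for i in range(2, 5)]
```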
IMDB is good for the basics, but for juicy financial information, we have to look to The Numbers. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. It no longer seems to have an all-time list of highest-budget films, but I’ve made a local mirror of the list they previously had.
Unfortunately, this list is sorted by budget, not by domestic gross; if it were sorted by domestic gross, we could get the same list of movies from The Numbers as from IMDB. We'll have to settle for the heuristic that the movies with the highest budgets (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be accessed via URLs with the following pattern:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/the-numbers.com/movie/budgets/all/101
Each URL has 100 films out of a total list of 1000.
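One way to generate all ten page URLs, assuming the offsets continue in steps of 100 (i.e., /101, /201, ..., /901, which is an extrapolation from the two URLs shown):

```python
# Build URLs for the ten budget pages: the bare /all/ page, then /all/101, /all/201, ...
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_26s/lab5/"
        "the-numbers.com/movie/budgets/all/")

numbers_urls = [BASE] + [f"{BASE}{offset}" for offset in range(101, 1000, 100)]
```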
Most websites are built for humans to visit and view their content using a web browser. As soon as we start programmatically visiting websites, we risk abusing these resources. Above all, it's important to make sure that your scraping activities are not putting undue strain on those hosting the resources.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody's servers by making many requests in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you're scraping:
- Import the time module; after each time you make a webpage request, call time.sleep(10) (or more) to pause for ten seconds and make sure you're not making requests at an unreasonable rate.
- Do your get request in an early cell that you don't need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Once you've scraped your data, save it out with the to_csv method. Anytime you come back to continue work, load from the CSV file instead of re-doing the whole scrape.

Part I: HTML
Please navigate to the “Structuring content with HTML” online learning module and go through the following tutorials as an introduction to HTML structures: Basic HTML syntax and Structuring documents.
For additional resources, please see “HTML Tutorial.” The following modules are recommended: HTML Introduction and HTML Basic Examples.
In a text editor of your choice that is capable of exporting to pdf, answer these questions:
What is the character reference for a double quote? What are two other symbols with character references, and what are the references for them?
Where does the title of a page typically show up?
What's the difference between a span and a div?
What roles do the head, title, and body elements play in an HTML document?
Part II: Requests
Please read these two sections of the "Requests Quickstart" for the Python requests package: Make a Request and Response Content.
In the same editor, answer these questions:
What does the requests.get() command do?
How do you read the content of the server’s response for an html page?
Part III: BeautifulSoup
Please read “Guide to Parsing HTML with BeautifulSoup in Python” and answer these questions in the same editor:
What does the find_all() method do?
How do you search for all <a> tags that have the "element" class?
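As a concrete illustration of the kind of query the second question asks about, here is a toy sketch (the HTML snippet is made up, not part of the lab's data):

```python
from bs4 import BeautifulSoup

# A made-up snippet: two links carry the "element" class, one does not.
html = """
<div>
  <a class="element" href="/one">One</a>
  <a class="element special" href="/two">Two</a>
  <a href="/three">Three</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns every matching tag; class_ filters by CSS class
# (a tag matches if "element" is among its classes).
links = soup.find_all("a", class_="element")
```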
Part IV: Robots.txt

Using the requests library, find out what your user agent is when making requests. You may find it handy to parse the resulting page with BeautifulSoup, but combing through the raw HTML is fine too.

My get requests returned empty pages when I tried to access the Top 1000 Grossing list (at URL https://www.imdb.com/list/ls098063263/) using requests. Does IMDB's robots.txt explicitly forbid us (using the default requests user-agent) from accessing this page programmatically? If so, give the relevant parts of the robots.txt file that forbid this access. If not, are there any other restrictions?

For both of the scraping tasks below, use the locally-hosted mirrored versions of the websites described above, rather than scraping from imdb.com and the-numbers.com.
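For the robots.txt questions, you can read the file by eye, or let Python's standard library interpret it for you with urllib.robotparser. A sketch with made-up rules (these are NOT IMDB's actual robots.txt contents; fetch the real file to answer the question):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules for illustration only.
rules = """\
User-agent: *
Disallow: /list/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch(user_agent, url) reports whether the rules permit the request.
blocked = rp.can_fetch("python-requests/2.31.0",
                       "https://www.imdb.com/list/ls098063263/")
allowed = rp.can_fetch("python-requests/2.31.0",
                       "https://www.imdb.com/title/tt0111161/")
```

Under these made-up rules, `blocked` is False and `allowed` is True; with the real file you'd first fetch https://www.imdb.com/robots.txt and parse its contents instead.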
Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
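Putting the etiquette rules together, the overall scrape-then-cache pattern might look something like this sketch. The function names, the fetch and parse_page helpers, and the CSV filename are all placeholders (your soup-parsing code goes inside parse_page):

```python
import time

import pandas as pd


def scrape_all(urls, fetch, parse_page, delay=10):
    """Fetch each page politely, parse it into rows, and return a DataFrame.

    fetch(url) -> str          e.g., lambda u: requests.get(u).text
    parse_page(html) -> list   one dict of column values per movie (your soup code)
    """
    rows = []
    for url in urls:
        html = fetch(url)
        rows.extend(parse_page(html))
        time.sleep(delay)  # be polite: pause between requests
    return pd.DataFrame(rows)


# After scraping once, cache to disk so later sessions skip the scrape:
# df.to_csv("imdb_top1000.csv", index=False)
# df = pd.read_csv("imdb_top1000.csv")
```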
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
Since our two DataFrames have a matching column, we can merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables should simply not be included in the merged table. Discard any entries with matching titles but different years - these are likely not referring to the same movie (e.g., reboots or unrelated films with the same title).
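A toy sketch of the merge-and-filter step; the column names (Title, Year) and toy values are assumptions, so match them to whatever your cleaned tables actually use:

```python
import pandas as pd

imdb = pd.DataFrame({"Title": ["A", "B", "C"],
                     "Year": [1999, 2005, 2010],
                     "Rating": [8.1, 7.2, 6.5]})
numbers = pd.DataFrame({"Title": ["A", "B", "D"],
                        "Year": [1999, 1984, 2011],
                        "Budget": [100, 50, 75]})

# Inner merge on Title keeps only movies present in both tables;
# suffixes disambiguate the two Year columns.
merged = imdb.merge(numbers, on="Title", how="inner",
                    suffixes=("_imdb", "_num"))

# Discard same-title rows whose years disagree (likely different movies).
merged = merged[merged["Year_imdb"] == merged["Year_num"]]
```

Here "C" and "D" drop out of the inner merge, and "B" is discarded because its years disagree, leaving only "A".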
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. This will require us to unify some of the numerical representations in the two tables. In particular, you should at least handle the following:
- Convert monetary columns (budgets and grosses) to numeric types with df.astype().
- Convert release dates to datetime with pd.to_datetime().
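For instance, the cleaning might look something like this sketch; the column names and the exact string formats (like "$200,000,000" and "Jun 11, 1993") are assumptions about what your scrape will produce:

```python
import pandas as pd

df = pd.DataFrame({"Budget": ["$200,000,000", "$1,500,000"],
                   "Release Date": ["Jun 11, 1993", "Dec 19, 1997"]})

# Strip "$" and "," so the column can be cast to a numeric dtype.
df["Budget"] = df["Budget"].str.replace(r"[$,]", "", regex=True).astype("int64")

# Parse the date strings into a proper datetime column.
df["Release Date"] = pd.to_datetime(df["Release Date"], format="%b %d, %Y")
```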
Tip: you can now access the year a movie was released with df['Release Date'].dt.year.

You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that's pretty neat! Explore the data for interesting trends. For each of these analyses, create plots with a title and axis labels (with units!) and comment on any trends you find. For correlation plots, use df.corr to compute and display the correlation coefficients. Your analysis should:
Investigate the empirical distribution of the domestic gross returns by creating a histogram.
Investigate user rating vs domestic gross by creating a scatterplot.
Create a plot to investigate budget by release year.
Create one more visualization and analysis of your own choosing. Your insights should include at least one thing that could not be discovered using only one dataset or the other (i.e., something involving relationships between IMDB columns and The Numbers columns).
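For the correlation pieces, a minimal sketch of computing the coefficients; the column names and values here are placeholders standing in for your merged table:

```python
import pandas as pd

# Toy stand-in for the merged table's numeric columns.
df = pd.DataFrame({"Budget": [10, 20, 30, 40],
                   "Domestic Gross": [15, 35, 50, 80]})

# corr() returns a symmetric matrix of pairwise Pearson correlations;
# read off a single coefficient with .loc[row, col].
corr = df.corr()

# For the plots themselves: df.plot.scatter(x="Budget", y="Domestic Gross"),
# df["Domestic Gross"].plot.hist(), etc., then add titles and axis labels.
```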
The scraping process collects all of the requested data, respects proper scraping etiquette, and is sufficiently well-documented.