Fall 2025
In this lab, you will analyze trends in the movie industry. The investigation is motivated by questions about the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings, average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You must complete the pre-lab individually, but complete the lab in pairs. Your TA will help facilitate pairing. If an odd number of students are in attendance at lab, your TA will arrange one group of three. The lab must be done synchronously and collaboratively with your partner, with no free-loaders and no divide-and-conquer; every answer should reflect the understanding and work of both members. If your group does not finish during lab time, please arrange to meet as a pair outside of class time to finish up any work.
You must work with a different partner for Lab 5 than you did for Labs 1, 2, and 4.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie, including title, release date, runtime, cast and crew, user ratings, critic ratings, and more. IMDB actually does provide much of its data in a friendly, downloadable form (see https://developer.imdb.com/non-commercial-datasets/). The problem is that this data doesn’t contain everything we want, so we are going to scrape the webpages instead.
Specifically, we are going to scrape the IMDB pages for the top 1000 movies sorted by box office returns. This is available here:
https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc
When you visit this page, it shows the top 50 movies. In the past, this was more scrape-friendly: you could append (e.g.) &start=50 to the URL and get a page with the next 50 movies, starting at #50. Unfortunately, the additional movies are now revealed via JavaScript interaction using the “50 more” button at the bottom of the page. Although more sophisticated crawlers/scrapers exist that can be configured to perform actions like this, we aren’t going to go that far.
Instead, I’ve cached a version of this list, rendered as a single page with the top 1000 movies. You can find that here: https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/imdb/imdb_top1000.html
Note that when you visit that page in a browser (at least when I
did), it only shows the first 50 movies. However, when you load it in
Python using the requests library, you’ll get the HTML
containing all 1000 movies.
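For example, a minimal fetch along the lines below should return the HTML for all 1000 movies. This is just a sketch; the helper name, the timeout value, and the error handling are my own choices, not part of the assignment.

```python
import requests

# Cached copy of the IMDB top-1000 list (URL from the lab handout).
URL = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
       "data311_25f/lab5/imdb/imdb_top1000.html")

def fetch(url):
    """Fetch a page and return its HTML text, or None if the request fails."""
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return None

html = fetch(URL)
if html is not None:
    print(len(html))  # length of the full page: all 1000 movies' worth of HTML
```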
IMDB is good for the basics, but for juicy financial information, we have to look to The Numbers. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be found here:
https://www.the-numbers.com/movie/budgets/all
Unfortunately, they don’t seem to have the same data sorted by domestic gross; if they did, we could in theory get the same list of movies from The Numbers as from IMDB. We’ll have to settle for the heuristic that the movies with the highest budget (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
On this page, you can click the 101-200 link at the bottom of the page to see how the URL changes when you advance to the next 100 movies. Before you start scraping The Numbers, please make sure you’ve read the Scraping Etiquette section below.
If you are unable to scrape The Numbers’ live webpage, either because their robots.txt disallows it or because they are blocking your requests (even though your requests respect robots.txt), you can access a cached version of the pages that I’ve hosted, using URLs with the following pattern:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/the-numbers.com/movie/budgets/all/
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_25f/lab5/the-numbers.com/movie/budgets/all/101
Notice that these URLs mirror the true ones, but with the course webpage URL for Lab 5 prepended.
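Assuming the cached pages follow the same 100-movies-per-page offsets as the live site (the first page unnumbered, then 101, 201, and so on; this is an assumption worth verifying by clicking through), the full list of page URLs can be built with a comprehension:

```python
# Base of the cached mirror (URL from the lab handout).
BASE = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
        "data311_25f/lab5/the-numbers.com/movie/budgets/all/")

# First page has no number; later pages start at 101, 201, ..., 901.
page_urls = [BASE] + [BASE + str(start) for start in range(101, 1001, 100)]
print(page_urls[:2])
```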
Most websites are built for humans to visit and view their content using a web browser. As soon as we start programmatically visiting websites, we risk abusing these resources. Above all, it’s important to make sure that your scraping activities are not putting undue strain on those hosting the resources.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping:
- Import the time module, then after each time you make a webpage request, call time.sleep(10) (or more) to pause for ten seconds and make sure you’re not making requests at an unreasonable rate.
- Cache the result of each get request in an early cell that you don’t need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Save your scraped data to disk using the to_csv method. Anytime you come back to continue work, load from the CSV file instead of re-doing the whole scrape.

Part I: HTML
Please navigate to the “Structuring content with HTML” online learning module and go through the following tutorials as an introduction to HTML structures: Basic HTML syntax and Structuring documents.
For additional resources, please see “HTML Tutorial.” The following modules are recommended: HTML Introduction and HTML Basic Examples.
In a text editor of your choice that is capable of exporting to pdf, answer these questions:
- What is the character reference for a double quote? What are two other symbols with character references, and what are the references for them?
- Where does the title of a page typically show up?
- What’s the difference between a span and a div?
- What roles do the head, title, and body elements play in an HTML document?
Part II: Requests
Please read these two sections of the “Requests Quickstart” for the Python requests package: Make a Request and Response Content.
In the same editor, answer these questions:
- What does the requests.get() command do?
- How do you read the content of the server’s response for an HTML page?
Part III: BeautifulSoup
Please read “Guide to Parsing HTML with BeautifulSoup in Python” and answer these questions in the same editor:
- What does the find_all() method do?
- How do you search for all <a> tags that have the “element” class?
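As a warm-up, here is a sketch of both ideas on a tiny hand-written document. The markup below is made up for illustration; it is not the structure of the IMDB or The Numbers pages.

```python
from bs4 import BeautifulSoup

# A tiny invented document to practice on (not a real page).
html = """
<html>
  <head><title>Demo</title></head>
  <body>
    <div class="links">
      <a class="element" href="/a">First</a>
      <a class="element" href="/b">Second</a>
      <a class="other" href="/c">Third</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a", class_="element")  # only the two "element" links
print([a["href"] for a in links])
```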
Part IV: Robots.txt
- Using the requests library, fetch a page that reports your user agent, and find out what your user agent is when using requests. You may find it handy to parse the resulting page with BeautifulSoup, but combing through the raw HTML is fine too.
- Read the robots.txt rules that apply to https://www.the-numbers.com/movie/budgets/all. Are you (via the requests user-agent you found in the prior question) allowed to scrape this page? If so, are there any restrictions?

Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
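The parse-then-save workflow might look like the sketch below. The tag and class names here are invented for illustration; the real selectors must be discovered by inspecting the page in your browser’s developer tools.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup; the real page's tags and class names will differ.
html = """
<ul>
  <li class="movie"><span class="title">Movie A</span><span class="rating">8.1</span></li>
  <li class="movie"><span class="title">Movie B</span><span class="rating">7.4</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for li in soup.find_all("li", class_="movie"):
    rows.append({
        "Title": li.find("span", class_="title").text,
        "Rating": li.find("span", class_="rating").text,
    })

df = pd.DataFrame(rows)
df.to_csv("imdb_top1000.csv", index=False)  # reload later with pd.read_csv
```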
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
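A polite multi-page loop might be structured like this sketch. The function and parameter names are my own; the stub fetcher at the bottom exists only so the example runs without touching the network. In the real scrape you would omit the fetch argument and keep delay at ten seconds or more.

```python
import time
import requests

def scrape_pages(urls, parse_page, delay=10, fetch=None):
    """Politely scrape pages: fetch each, parse it, pause between requests."""
    if fetch is None:
        # Default fetcher hits the live web.
        fetch = lambda url: requests.get(url, timeout=30).text
    all_rows = []
    for url in urls:
        all_rows.extend(parse_page(fetch(url)))
        time.sleep(delay)  # etiquette: at least ten seconds between requests
    return all_rows

# Demo with a stub fetcher and zero delay so nothing touches the network:
rows = scrape_pages(["p1", "p2"],
                    parse_page=lambda html: [html.upper()],
                    delay=0,
                    fetch=lambda url: url)
print(rows)  # ['P1', 'P2']
```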
As mentioned above, if you encounter any issues scraping the live site, please use my mirrored version instead. Do not continue hammering their servers if you find your requests are being blocked, even if your usage is not explicitly going against their robots.txt.
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
Since our two DataFrames have a matching column, we can merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables should simply not be included in the merged table. Discard any entries with matching titles but different years - these are likely not referring to the same movie (e.g., reboots or unrelated films with the same title).
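A sketch of the merge on toy tables follows; the column names are assumptions, so match them to whatever your cleaned DataFrames actually use.

```python
import pandas as pd

# Toy stand-ins for the scraped tables (column names are illustrative).
imdb = pd.DataFrame({"Title": ["A", "B", "C"], "Year": [2005, 2010, 2015],
                     "Rating": [8.0, 7.5, 6.9]})
numbers = pd.DataFrame({"Title": ["A", "B", "D"], "Year": [2005, 1999, 2012],
                        "Budget": [100, 200, 300]})

# Inner merge on Title keeps only titles present in both tables.
merged = pd.merge(imdb, numbers, on="Title", suffixes=("_imdb", "_nums"))

# Same title but different year: probably a different movie, so drop it.
merged = merged[merged["Year_imdb"] == merged["Year_nums"]]
print(merged["Title"].tolist())  # ['A']
```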
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. This will require us to unify some of the numerical representations in the two tables. In particular, you should at least handle the following:
- Money columns should be numeric: strip out characters like “$” and “,”, then convert, e.g., using df.astype().
- The Release Date column should be a datetime: convert it using pd.to_datetime().
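For instance, on a toy frame (the column names and date format here are illustrative; check what your scrape actually produced):

```python
import pandas as pd

df = pd.DataFrame({"Budget": ["$200,000,000", "$55,500,000"],
                   "Release Date": ["Jun 11, 2010", "Mar 3, 2015"]})

# Strip '$' and ',' and convert to a numeric dtype.
df["Budget"] = df["Budget"].str.replace(r"[$,]", "", regex=True).astype(float)

# Parse the date strings into datetimes.
df["Release Date"] = pd.to_datetime(df["Release Date"])
print(df["Release Date"].dt.year.tolist())  # [2010, 2015]
```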
Tip: you can now access the year a movie was released with df['Release Date'].dt.year.

You are now in possession of a dataset that, to my knowledge, did not
exist (or could not be obtained for free, anyway) until you made it -
that’s pretty neat! Explore the data for interesting trends. For each of
these analyses, create plots with title and axis labels (with units!)
and comment on any trends you find. For correlation plots, use
df.corr to compute and display the correlation
coefficients. Your analysis should:
Investigate the empirical distribution of the domestic gross returns by creating a histogram.
Investigate user rating vs domestic gross by creating a scatterplot.
Create a plot to investigate budget by release year.
Create one more visualization and analysis of your own choosing. Your insights should include at least one thing that could not be discovered using only one dataset or the other (i.e., something involving relationships between IMDB columns and The Numbers columns).
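A plotting-plus-correlation skeleton, shown here on made-up numbers, might look like the following; the column names are placeholders for your real ones.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for the merged, cleaned dataset.
df = pd.DataFrame({"Rating": [6.0, 7.0, 8.0, 9.0],
                   "Domestic Gross (USD)": [1e8, 2e8, 4e8, 8e8]})

fig, ax = plt.subplots()
ax.scatter(df["Rating"], df["Domestic Gross (USD)"])
ax.set_title("User rating vs. domestic gross")
ax.set_xlabel("IMDB user rating (0-10)")
ax.set_ylabel("Domestic gross (USD)")
fig.savefig("rating_vs_gross.png")

print(df.corr())  # pairwise correlation coefficients
```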
The scraping process collects all of the requested data, respects proper scraping etiquette, and is sufficiently well-documented.