Winter 2023
In this lab, you will analyze trends in the movie industry. The investigation is motivated by an interest in questions pertaining to the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings; average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You are required to complete this lab in pairs. I highly recommend collaborating synchronously, as each partner will be responsible for understanding (and being able to independently explain) every aspect of your submission. As a reminder, here’s the collaboration policy for labs done in pairs from the syllabus:
For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies - most things you might want to know about one, including title, release date, runtime, cast and crew, user ratings, critic ratings, and more. IMDB actually does provide much of their data in a friendly, downloadable form (see https://imdb.com/interfaces/). The problem is that this data doesn’t contain box office returns or critic scores. For this reason, we’re going to need to scrape the webpages for this information.
There are various pages you can find this stuff on, but the easiest and most efficient (requiring the fewest page requests) is a search results page like the following:
https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc&start=1
For the price of one page request, we get 50 movies with a good amount of information on each, including title, year, runtime, user rating, critic rating (Metascore), number of votes, and “Gross”, which is the amount of money the movie made in the US. Getting more movies is as simple as changing the parameter to the start keyword in the URL - if you click the “Next” link on that page, you’ll see the URL changes to contain &start=51.
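Since each results page holds 50 movies, the search URLs for the top 1000 can be generated programmatically. Here’s a sketch (the variable names and loop bounds are my own, not prescribed):

    # Build the search-page URLs for the top 1000 movies, 50 per page:
    # start = 1, 51, 101, ..., 951.
    base_url = ("https://www.imdb.com/search/title/"
                "?release_date=2000-01-01,2020-12-31"
                "&sort=boxoffice_gross_us,desc")
    imdb_urls = [f"{base_url}&start={start}" for start in range(1, 1000, 50)]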
IMDB is good for the basics, but for juicy financial information, we have to look to https://www.the-numbers.com/. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be found here:
https://www.the-numbers.com/movie/budgets/all
Unfortunately, they don’t seem to have the same data sorted by domestic gross, which would theoretically let us get the same list of movies from The Numbers as from IMDB. We’ll have to settle for the heuristic that the movies with the highest budget (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
Similar to IMDB, you can click the 101-200 link at the bottom of the page to see how the URL changes to show the next 100 movies.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests at once in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping (a sketch of this workflow follows the list):

- Import the time module; then, after each time you make a webpage request, call time.sleep(1) (or more) to pause for a second and make sure you’re not making requests at an unreasonable rate.
- While developing your scraping code, make a single get request in an early cell that you don’t need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Save your scraped data to CSV files using the DataFrame to_csv method. Download these from Colab and re-upload them at the start of each session so you aren’t re-doing the whole scrape each session.
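For instance, here’s a minimal sketch of the cache-then-parse workflow (the URL is the IMDB pattern from above; everything else is illustrative):

    import time
    import requests
    from bs4 import BeautifulSoup

    # One request, cached in an early cell. Later parsing cells can be
    # re-run freely without re-fetching the page.
    url = ("https://www.imdb.com/search/title/"
           "?release_date=2000-01-01,2020-12-31"
           "&sort=boxoffice_gross_us,desc&start=1")
    response = requests.get(url)
    time.sleep(1)  # pause so repeated fetches don't come at an unreasonable rate

    # In later cells, parse the cached response as many times as you like.
    soup = BeautifulSoup(response.text, "html.parser")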
Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
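A sketch of the save/load round trip (the DataFrame and file names here are arbitrary choices, not requirements):

    import pandas as pd

    # Save once after scraping succeeds...
    imdb_df.to_csv("imdb_top1000.csv", index=False)

    # ...then in later sessions, load from disk instead of re-scraping.
    imdb_df = pd.read_csv("imdb_top1000.csv")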
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Note that I’ve included Domestic Gross from both sites - it might be interesting to see if they agree! Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
Update: scraping woes continue, so I’ve mirrored the pages you need.
Please access these pages by prepending the lab 5 handout URL to the the-numbers URL (with the leading http://www removed). For example, the mirrored first page is here:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lab5/the-numbers.com/movie/budgets/all
and the remaining pages, for k in range(101, 902, 100), look like this:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lab5/the-numbers.com/movie/budgets/all/{k}
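A sketch of generating the full list of mirrored URLs from the pattern above:

    # The first page has no suffix; the rest append /101, /201, ..., /901.
    mirror_base = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
                   "data311_23w/lab5/the-numbers.com/movie/budgets/all")
    numbers_urls = [mirror_base] + [f"{mirror_base}/{k}"
                                    for k in range(101, 902, 100)]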
For posterity:
As of February 2023, The Numbers seems to be blocking requests coming from Python’s requests library, giving a “403 Forbidden” error for a page that loads fine in a browser. When you ask for a webpage, you send along various metadata with the request. One such piece of metadata is the “user agent”, a string that usually corresponds to the browser or program being used to access the page, so I suspect this is what the-numbers.com is looking at when deciding to block the requests.
I took a look at the site’s robots.txt (at https://www.the-numbers.com/robots.txt), and that didn’t really explain it: there are no specific rules for the requests library’s user agent (something along the lines of python-requests/<version>), and for the default (*) user agent, there’s no Disallow rule for the location we’re accessing. With this established, it seems like we’re not breaking the site’s own rules if we change the user agent to get past this restriction. After a little experimentation, I found that most custom user-agent strings work. I set this for the request with something like this:
    requests.get(url, headers={'User-agent': custom_agent_string})
Notice I’ve purposefully not told you what to set it to - I’m hoping that if we all do something different, it’s less likely to come to someone’s attention and get us blocked.
Finally, I noticed that for the various bots that have special rules in robots.txt, there’s a Crawl-delay: 10 rule, meaning the site asks search-engine indexing bots and the like to wait 10 seconds between requests. Please be polite and respect this in your code: use a generous 10-second delay between page requests. Since we’re only requesting a handful of pages, it’s not really long to wait.
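Putting these pieces together, here’s a sketch of a polite request loop for the mirrored pages (the user-agent string below is a placeholder; per the note above, pick your own):

    import time
    import requests

    # Placeholder user-agent string; choose your own, per the note above.
    headers = {"User-agent": "data311-lab5-yourname"}

    mirror_base = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
                   "data311_23w/lab5/the-numbers.com/movie/budgets/all")
    urls = [mirror_base] + [f"{mirror_base}/{k}" for k in range(101, 902, 100)]

    pages = []
    for url in urls:
        response = requests.get(url, headers=headers)
        pages.append(response.text)
        time.sleep(10)  # honor the site's Crawl-delay: 10 rule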
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
If two DataFrames have a matching column, you can use pd.merge to merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables will simply not be included in the merged table.
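For example, assuming your two DataFrames are named imdb_df and numbers_df (the names are mine, not prescribed) and each has a Title column, the merge might look like this:

    import pandas as pd

    # Inner join on Title: rows whose titles appear in only one table
    # are dropped from the result.
    merged = pd.merge(imdb_df, numbers_df, on="Title")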
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. For all the columns that represent numerical values, clean and convert them to numerical types.
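As one hypothetical example, a currency-formatted string column like Budget (values such as "$160,000,000") could be cleaned and converted this way; the column name is an assumption about your data, not a given:

    # Strip "$" and "," from a currency-formatted string column,
    # then convert the remaining digits to a numeric type.
    merged["Budget"] = (merged["Budget"]
                        .str.replace(r"[$,]", "", regex=True)
                        .astype(float))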
You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that’s pretty neat! Explore the data for interesting trends. Your analysis should:
Here are some examples of questions I’d be curious to answer. Don’t treat this list as exhaustive - I don’t even know whether anything interesting would result from each of these, but they should give you an idea of what I’m looking for.
There are many avenues that might make this analysis more interesting or better. Here are some ideas:
- Use fuzzy string matching (e.g., via the thefuzz python package) to unify titles that are nearly but not quite exactly the same before merging your datasets. How many additional movies did this buy you? Did you merge any pairs that should have been separate? (A sketch using thefuzz appears below.)

Interesting explorations can receive up to 5 points of extra credit. Submit your extra credit as a separate zip file in the same format as the base assignment.
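Here’s a minimal sketch of the fuzzy-matching idea, assuming a numbers_df DataFrame from the earlier steps; the example title and score threshold are illustrative:

    from thefuzz import process

    # For one IMDB title, find the closest title among The Numbers titles,
    # along with a similarity score from 0 to 100.
    choices = numbers_df["Title"].tolist()
    best_match, score = process.extractOne("The Avengers", choices)

    # A high score (say, >= 90) suggests the two titles refer to the same
    # movie; validate any threshold by inspecting the matches it produces.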
As usual, your analysis should tell a story clearly and convincingly. All the general guidelines from prior labs apply: assumptions, preprocessing, cleaning, and analysis should be clearly documented.
Your final notebook should not include “scratch work” that you used to develop your scraping approach - just include the final code that scrapes the data, but make sure that it’s clearly documented.
Make sure that all your cells have been run and have the output you intended (exception: you don’t need to re-run your entire scrapes), then download a copy of your notebook in .ipynb format. Make sure you have saved out raw scraped CSV files from each dataset and your notebook reads them in by filename only (i.e., your notebook should assume that your CSV files are in the same directory).
Create a zip file containing your notebook and both CSV files, and submit the resulting zip file to the Lab 5 assignment on Canvas. Note this is different from past labs - this time, all your files should be submitted in one zip file.
Finally, fill out the Lab 5 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.
Parts 1 and 2 are worth 30 points:
Parts 3 and 4 will be graded as usual on the extent to which your work is Correct, Convincing, and Clear:
15 points - Correct
5/5 analysis is completely technically sound
4/5 small technical issues, but no apparent effect on the outcome of the analysis
3/5 some techniques are applied or interpreted incorrectly in a way that affects the analysis
2/5 an honest attempt was made, but it’s seriously incomplete or flawed
1/5 some effort was made
0/5 no effort / no submission
5 points - Convincing
10 points - Clear
Extra Credit
Up to 5 points as detailed above.