DATA 311 - Lab 5: Scraping Movie Data

Scott Wehrwein

Fall 2021

Introduction

In this lab, you will analyze trends in the movie industry. The investigation is motivated by questions about the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings, average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.

Collaboration Policy

You are required to complete this lab in pairs. I highly recommend collaborating synchronously, as each partner will be responsible for understanding (and being able to independently explain) every aspect of your submission. As a reminder, here’s the collaboration policy for labs done in pairs from the syllabus:

For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.

Getting Started

To work on this lab, you’ll need to install the beautifulsoup4 and requests packages to your virtual environment. Assuming you followed the instructions from Lab 1, this would look like the following:

  1. Open a new terminal
  2. cd 311
  3. source data311_env/bin/activate
  4. pip install beautifulsoup4 requests

There is no starter notebook for this lab; create a new notebook, give it a title, and include both partners’ names in a subheading at the top.

The Data

IMDB - The Internet Movie Database

The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies: most things you might want to know about a movie - title, release date, runtime, cast and crew, user ratings, critic ratings, and so on - can be found there. IMDB actually does provide much of its data in a friendly, downloadable form (see https://imdb.com/interfaces/). The problem is that this downloadable data doesn’t include box office returns or critic scores. For this reason, we’re going to need to scrape the webpages for this information.

There are various pages you can find this stuff on, but the easiest and most efficient (requiring the fewest page requests) is a search results page like the following:

https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc&start=1

For the price of one page request, we get 50 movies with a good amount of information on each, including title, year, runtime, user rating, critic rating (Metascore), number of votes, and “Gross”, which is the amount of money the movie made in the US. Getting more movies is as simple as changing the value of the start parameter in the URL - if you click the “Next” link on that page, you’ll see the URL changes to contain &start=51.
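One way to cover the top 1000 movies, then, is to generate one URL per page of 50 results. A minimal sketch (the date range and sort parameters are simply copied from the example URL above):

```python
# Sketch: build the search-page URLs for the top 1000 movies, 50 per page.
BASE = ("https://www.imdb.com/search/title/"
        "?release_date=2000-01-01,2020-12-31"
        "&sort=boxoffice_gross_us,desc&start={start}")

# start takes the values 1, 51, 101, ..., 951: 20 pages of 50 movies each.
urls = [BASE.format(start=s) for s in range(1, 1001, 50)]
```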

The Numbers

IMDB is good for the basics, but for juicy financial information, we have to look to https://www.the-numbers.com/. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be found here:

https://www.the-numbers.com/movie/budgets/all

Unfortunately, they don’t seem to have the same data sorted by domestic gross, which would in theory let us get the same list of movies from The Numbers as from IMDB. We’ll have to settle for the heuristic that the movies with the highest budgets (from The Numbers) are likely to overlap substantially with the movies with the highest domestic gross.

As with IMDB, you can click the 101-200 link at the bottom of the page and watch how the URL changes to show the next 100 movies.
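If the pattern matches what that link suggests - an offset appended to the path, which you should verify in your browser before relying on it - the page URLs can be generated much like the IMDB ones:

```python
# Sketch, assuming each page shows 100 movies and later pages append the
# 1-based offset to the path (e.g. .../all/101 for movies 101-200).
# Check this against the actual "101-200" link before reusing it.
def numbers_url(offset):
    base = "https://www.the-numbers.com/movie/budgets/all"
    return base if offset == 1 else f"{base}/{offset}"

# Offsets 1, 101, ..., 901 cover the top 1000 movies in 10 pages.
urls = [numbers_url(o) for o in range(1, 1001, 100)]
```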

Scraping Etiquette

When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests at once in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping:
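Two habits cover most of this: sleep between requests, and cache each page to disk so re-running your notebook never re-fetches it. A sketch of both ideas (the helper name and its structure are my own, not part of the assignment; the fetch function is passed in so the sketch works without touching the network):

```python
import hashlib
import time
from pathlib import Path

def polite_get(url, fetch, cache_dir="cache", delay=1.0):
    """Fetch a URL at most once, sleeping before each real request.

    `fetch` does the actual request (e.g. lambda u: requests.get(u).text);
    passing it in keeps this sketch testable without network access.
    """
    Path(cache_dir).mkdir(exist_ok=True)
    # Stable filesystem-safe name for this URL's cached copy.
    name = hashlib.md5(url.encode()).hexdigest() + ".html"
    cache_file = Path(cache_dir) / name
    if cache_file.exists():
        return cache_file.read_text()   # cached: no request, no delay
    time.sleep(delay)                   # rate-limit outgoing requests
    html = fetch(url)
    cache_file.write_text(html)         # re-runs will hit the cache
    return html
```

With requests installed, fetch could be `lambda u: requests.get(u).text`.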

Your Tasks

Part 1: Scrape IMDB

Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:

Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
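As a sketch of the parse-then-save flow: the HTML snippet and its tag/class names below are invented stand-ins, not IMDB’s real markup - inspect the actual page source to find the tags and classes you need.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy stand-in for one page of search results; the class names here are
# hypothetical -- use your browser's inspector to find the real ones.
html = """
<div class="result"><h3>Movie A</h3><span class="gross">$936.66M</span></div>
<div class="result"><h3>Movie B</h3><span class="gross">$858.37M</span></div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.find_all("div", class_="result"):
    rows.append({
        "Title": item.h3.get_text(strip=True),
        "Gross": item.find("span", class_="gross").get_text(strip=True),
    })

df = pd.DataFrame(rows)
df.to_csv("imdb_raw.csv", index=False)  # reload from disk instead of re-scraping
```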

Part 2: Scrape The Numbers

Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:

Note that I’ve included Domestic Gross from both sites - it might be interesting to see if they agree! Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.

Part 3: Clean and Merge

Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.

Merge

If two DataFrames have a matching column, you can use pd.merge to merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables will simply not be included in the merged table.
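A toy illustration with made-up rows (an inner merge is pd.merge’s default, which is why unmatched titles drop out):

```python
import pandas as pd

imdb = pd.DataFrame({"Title": ["Movie A", "Movie B", "Movie C"],
                     "Metascore": [80, 65, 90]})
numbers = pd.DataFrame({"Title": ["Movie B", "Movie C", "Movie D"],
                        "Budget": [200, 150, 100]})

# Inner merge on Title: "Movie A" and "Movie D" appear in only one table,
# so they are absent from the result.
merged = pd.merge(imdb, numbers, on="Title")
```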

Clean

Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. For all the columns that represent numerical values, clean and convert them to numerical types.
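For example, a scraped gross needs its currency symbols stripped and any magnitude suffix scaled. A sketch - the sample strings are made up in the style of scraped values, so check your actual raw columns for the exact formats before reusing these patterns:

```python
import pandas as pd

# Made-up raw strings resembling scraped values; verify your own columns'
# formats before copying these cleaning rules.
df = pd.DataFrame({"Gross": ["$936.66M", "$858.37M", None],
                   "Votes": ["1,234,567", "89,012", "3,456"]})

# "$936.66M" -> 936660000.0: strip "$", ",", "M", parse, scale by 1e6.
# errors="coerce" turns unparseable or missing entries into NaN.
df["Gross"] = pd.to_numeric(
    df["Gross"].str.replace(r"[$,M]", "", regex=True), errors="coerce") * 1e6
df["Votes"] = pd.to_numeric(df["Votes"].str.replace(",", ""), errors="coerce")
```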

Part 4: Analyze

You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that’s pretty neat! Explore the data for interesting trends. Your analysis should:

Here are some examples of questions I’d be curious to answer. Don’t treat this list as exhaustive - I don’t even know whether anything interesting would result from each of these, but they should give you an idea of what I’m looking for.

Extra Credit

There are many avenues that might make this analysis more interesting or better. Here are some ideas:

Interesting explorations can receive up to 5 points of extra credit. Submit your extra credit as a separate zip file in the same format as the base assignment.

Guidelines

As usual, your analysis should tell a story clearly and convincingly. All the general guidelines from prior labs apply: assumptions, preprocessing, cleaning, and analysis should be clearly documented.

Your final notebook should not include “scratch work” that you used to develop your scraping approach - just include the final code that scrapes the data, but make sure that it’s clearly documented.

Submitting your work

Notebooks and CSVs

Make sure that all your cells have been run and have the output you intended (exception: you don’t need to re-run your entire scrapes), then download a copy of your notebook in .ipynb format. Make sure you have saved out raw scraped CSV files from each dataset and that your notebook reads them in by filename only (i.e., it should assume the CSV files are in the same directory as the notebook).

Create a zip file containing your notebook and both CSV files, and submit the resulting zip file to the Lab 5 assignment on Canvas. Note that this is different from past labs - this time, all your files should be submitted in one zip file.

Survey

Finally, fill out the Week 5 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.

Rubric

Parts 1 and 2 are worth 30 points:

Parts 3 and 4 will be graded as usual on the extent to which your work is Correct, Convincing, and Clear:

Extra Credit

Up to 5 points as detailed above.