Winter 2023
In this lab, you will analyze trends in the movie industry. The investigation is motivated by an interest in questions pertaining to the relationship between money-related properties (budget, box office returns) and quality/popularity-related properties (average user ratings; average critic ratings). The good news is that all of this data is available; the bad news is that some of it is only available via webpages; the worse news is that we need to go to two different websites to get it.
You are required to complete this lab in pairs. I highly recommend collaborating synchronously, as each partner will be responsible for understanding (and being able to independently explain) every aspect of your submission. As a reminder, here’s the collaboration policy for labs done in pairs from the syllabus:
For labs done in pairs, any and all collaboration is permissible between members of the same pair. That said, both members must understand and be able to explain in detail all aspects of their submission. For this reason, “pair programming” is highly recommended - you should not split the tasks up for each group member to complete independently. I reserve the right to meet with any student one-on-one and ask them to explain any part of their submission to me in detail.
There is no starter notebook for this lab; create a new notebook in Colab, give it a title, and include both partners’ names in a subheading at the top.
The Internet Movie Database (https://www.imdb.com/) contains a wealth of information about movies - most things you might want to know about one, including title, release date, runtime, cast and crew, user ratings, critic ratings, and more. IMDB actually does provide much of their data in a friendly, downloadable form (see https://imdb.com/interfaces/). The problem is that this data doesn’t contain box office returns or critic scores. For this reason, we’re going to need to scrape the webpages for this information.
There are various pages you can find this stuff on, but the easiest and most efficient (requiring the fewest page requests) is a search results page like the following:
https://www.imdb.com/search/title/?release_date=2000-01-01,2020-12-31&sort=boxoffice_gross_us,desc&start=1
For the price of one page request, we get 50 movies with a good amount of information on each, including title, year, runtime, user rating, critic rating (Metascore), number of votes, and “Gross”, which is the amount of money the movie made in the US. Getting more movies is as simple as changing the parameter to the start keyword in the URL - if you click the “Next” link on that page, you’ll see the URL changes to contain &start=51.
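Since each results page holds 50 movies, the search URLs for the top 1000 can be generated programmatically. Here’s a sketch (the variable names and loop bounds are my own, not prescribed):

    # Build the search-page URLs for the top 1000 movies, 50 per page:
    # start = 1, 51, 101, ..., 951.
    base_url = ("https://www.imdb.com/search/title/"
                "?release_date=2000-01-01,2020-12-31"
                "&sort=boxoffice_gross_us,desc")
    imdb_urls = [f"{base_url}&start={start}" for start in range(1, 1000, 50)]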
IMDB is good for the basics, but for juicy financial information, we have to look to https://www.the-numbers.com/. This website displays data on budgets, US (“Domestic”) box office gross, and worldwide gross. A paginated table of movies sorted by budget showing release date, title, budget, domestic, and worldwide gross can be found here:
https://www.the-numbers.com/movie/budgets/all
Unfortunately, they don’t seem to have the same data sorted by domestic gross, which would theoretically let us get the same list of movies from The Numbers as from IMDB. We’ll have to settle for the heuristic that the movies with the highest budget (from The Numbers) are likely to overlap with the movies with the highest domestic gross.
Similar to IMDB, you can click the 101-200 link at the bottom of the page to see how the URL changes to show the next 100 movies.
When scraping websites, internet etiquette dictates that you should avoid hammering somebody’s servers by making many requests at once in rapid succession. Be sure that you follow these rules, several of which will make your life easier as well as minimize unnecessary traffic to the pages you’re scraping (a sketch of this workflow follows the list):

- Import the time module; then, after each time you make a webpage request, call time.sleep(1) (or more) to pause for a second and make sure you’re not making requests at an unreasonable rate.
- While developing your scraping code, make a single get request in an early cell that you don’t need to re-run every time you add a step in your soup parsing. When your scraping seems rock solid, you can then apply it to each of the desired pages in a loop.
- Save your scraped data to CSV files using the DataFrame to_csv method. Download these from Colab and re-upload them at the start of each session so you aren’t re-doing the whole scrape each session.
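For instance, here’s a minimal sketch of the cache-then-parse workflow (the URL is the IMDB pattern from above; everything else is illustrative):

    import time
    import requests
    from bs4 import BeautifulSoup

    # One request, cached in an early cell. Later parsing cells can be
    # re-run freely without re-fetching the page.
    url = ("https://www.imdb.com/search/title/"
           "?release_date=2000-01-01,2020-12-31"
           "&sort=boxoffice_gross_us,desc&start=1")
    response = requests.get(url)
    time.sleep(1)  # pause so repeated fetches don't come at an unreasonable rate

    # In later cells, parse the cached response as many times as you like.
    soup = BeautifulSoup(response.text, "html.parser")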
Using the URL pattern provided above (which sorts movies in descending order by US Box Office, a.k.a. Domestic Gross), extract the following data from each of the top 1000 movies:
Form this data into the columns of a DataFrame, and save it out to a CSV file so you can load it from disk instead of re-scraping.
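A sketch of the save/load round trip (the DataFrame and file names here are arbitrary choices, not requirements):

    import pandas as pd

    # Save once after scraping succeeds...
    imdb_df.to_csv("imdb_top1000.csv", index=False)

    # ...then in later sessions, load from disk instead of re-scraping.
    imdb_df = pd.read_csv("imdb_top1000.csv")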
Use the URL pattern provided above to extract the following data for the top 1000 movies, ordered by budget:
Note that I’ve included Domestic Gross from both sites - it might be interesting to see if they agree! Form this data into a separate DataFrame and save it out to a CSV file so you can load it from disk instead of re-scraping.
Update: scraping woes continue, so I’ve mirrored the pages you need.
Please access these pages by prepending the lab 5 handout URL to the the-numbers URL (with the leading http://www removed). For example, the mirrored first page is here:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lab5/the-numbers.com/movie/budgets/all
and the remaining pages, for k in range(101, 902, 100), look like this:
https://facultyweb.cs.wwu.edu/~wehrwes/courses/data311_23w/lab5/the-numbers.com/movie/budgets/all/{k}
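A sketch of generating the full list of mirrored URLs from the pattern above:

    # The first page has no suffix; the rest append /101, /201, ..., /901.
    mirror_base = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
                   "data311_23w/lab5/the-numbers.com/movie/budgets/all")
    numbers_urls = [mirror_base] + [f"{mirror_base}/{k}"
                                    for k in range(101, 902, 100)]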
For posterity:
As of February 2023, The Numbers seems to be blocking requests coming from Python’s requests library, giving a “403 Forbidden” error for a page that loads fine in a browser. When you ask for a webpage, you send along various metadata with the request. One such piece of metadata is the “user agent”, a string that usually corresponds to the browser or program being used to access the page, so I suspect this is what the-numbers.com is looking at when deciding to block the requests.
I took a look at the site’s robots.txt (at https://www.the-numbers.com/robots.txt), and that didn’t really explain it: there are no specific rules for the requests library’s user agent (something along the lines of python-requests/<version>), and for the default (*) user agent, there’s no Disallow rule for the location we’re accessing. With this established, it seems like we’re not breaking the site’s own rules if we change the user agent to get past this restriction. After a little experimentation, I found that most custom user-agent strings work. I set this for the request with something like this:
    requests.get(url, headers={'User-agent': custom_agent_string})
Notice I’ve purposefully not told you what to set it to - I’m hoping that if we all do something different, it’s less likely to come to someone’s attention and get us blocked.
Finally, I noticed that for the various bots that have special rules in robots.txt, there’s a Crawl-delay: 10 rule, meaning the site asks search-engine indexing bots and the like to wait 10 seconds between requests. Please be polite and respect this in your code: use a generous 10-second delay between page requests. Since we’re only requesting a handful of pages, it’s not really long to wait.
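Putting these pieces together, here’s a sketch of a polite request loop for the mirrored pages (the user-agent string below is a placeholder; per the note above, pick your own):

    import time
    import requests

    # Placeholder user-agent string; choose your own, per the note above.
    headers = {"User-agent": "data311-lab5-yourname"}

    mirror_base = ("https://facultyweb.cs.wwu.edu/~wehrwes/courses/"
                   "data311_23w/lab5/the-numbers.com/movie/budgets/all")
    urls = [mirror_base] + [f"{mirror_base}/{k}" for k in range(101, 902, 100)]

    pages = []
    for url in urls:
        response = requests.get(url, headers=headers)
        pages.append(response.text)
        time.sleep(10)  # honor the site's Crawl-delay: 10 rule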
Merge the data tables together and clean the columns. It’s up to you what order you do this in - you may find it makes sense to clean then merge, or merge then clean, or perhaps clean, merge, then clean some more.
If two DataFrames have a matching column, you can use pd.merge to merge them together, matching up rows using that column; for a bit more explanation, see this section of the Pandas Getting Started tutorials. Merge the IMDB and The Numbers data on the Title column. Note that any movie titles that don’t appear in both tables will simply not be included in the merged table.
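For example, assuming your two DataFrames are named imdb_df and numbers_df (the names are mine, not prescribed) and each has a Title column, the merge might look like this:

    import pandas as pd

    # Inner join on Title: rows whose titles appear in only one table
    # are dropped from the result.
    merged = pd.merge(imdb_df, numbers_df, on="Title")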
Among other things, we’re interested in looking at correlations, which means we need to be able to scatterplot our columns. For all the columns that represent numerical values, clean and convert them to numerical types.
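As one hypothetical example, a currency-formatted string column like Budget (values such as "$160,000,000") could be cleaned and converted this way; the column name is an assumption about your data, not a given:

    # Strip "$" and "," from a currency-formatted string column,
    # then convert the remaining digits to a numeric type.
    merged["Budget"] = (merged["Budget"]
                        .str.replace(r"[$,]", "", regex=True)
                        .astype(float))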
You are now in possession of a dataset that, to my knowledge, did not exist (or could not be obtained for free, anyway) until you made it - that’s pretty neat! Explore the data for interesting trends. Your analysis should:
Here are some examples of questions I’d be curious to answer. Don’t treat this list as exhaustive - I don’t even know whether anything interesting would result from each of these, but they should give you an idea of what I’m looking for.
There are many avenues that might make this analysis more interesting or better. Here are some ideas:
- Use fuzzy string matching (e.g., via the thefuzz python package) to unify titles that are nearly but not quite exactly the same before merging your datasets. How many additional movies did this buy you? Did you merge any pairs that should have been separate? (A sketch using thefuzz appears below.)

Interesting explorations can receive up to 5 points of extra credit. Submit your extra credit as a separate zip file in the same format as the base assignment.
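Here’s a minimal sketch of the fuzzy-matching idea, assuming a numbers_df DataFrame from the earlier steps; the example title and score threshold are illustrative:

    from thefuzz import process

    # For one IMDB title, find the closest title among The Numbers titles,
    # along with a similarity score from 0 to 100.
    choices = numbers_df["Title"].tolist()
    best_match, score = process.extractOne("The Avengers", choices)

    # A high score (say, >= 90) suggests the two titles refer to the same
    # movie; validate any threshold by inspecting the matches it produces.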
As usual, your analysis should tell a story clearly and convincingly. All the general guidelines from prior labs apply: assumptions, preprocessing, cleaning, and analysis should be clearly documented.
Your final notebook should not include “scratch work” that you used to develop your scraping approach - just include the final code that scrapes the data, but make sure that it’s clearly documented.
Make sure that all your cells have been run and have the output you intended (exception: you don’t need to re-run your entire scrapes), then download a copy of your notebook in .ipynb format. Make sure you have saved out raw scraped CSV files from each dataset and your notebook reads them in by filename only (i.e., your notebook should assume that your CSV files are in the same directory).
Create a zip file containing your notebook and both CSV files, and submit the resulting zip file to the Lab 5 assignment on Canvas. Note this is different from past labs - this time, all your files should be submitted in one zip file.
Finally, fill out the Lab 5 Survey on Canvas. Your submission will not be considered complete until you have submitted the survey.
Parts 1 and 2 are worth 30 points:
Parts 3 and 4 will be graded as usual on the extent to which your work is Correct, Convincing, and Clear:
15 points - Correct
5/5 analysis is completely technically sound
4/5 small technical issues, but no apparent effect on the outcome of the analysis
3/5 some techniques are applied or interpreted incorrectly in a way that affects the analysis
2/5 an honest attempt was made, but it’s seriously incomplete or flawed
1/5 some effort was made
0/5 no effort / no submission
5 points - Convincing
10 points - Clear
Extra Credit
Up to 5 points as detailed above.