Case Study 1

Use a web scraper to prepare a dataset of all the movies released in the last
5 years, using IMDb.
(a) Find all movies that were released in the last 5 years.
Answer: 2016 Movie Titles

(b) Generate a file containing the URLs of the top 50 movies for each year on IMDb.
Answer: https://www.imdb.com/list/ls033133511/

(c) Read the IMDb page at each URL and scrape the following information:
Producer(s), Director(s), Star(s), Taglines, Genres, (Partial) Storyline, Box
office budget, and Box office gross.

(d) Make a table with these variables as columns, with the movie name as the
first column.

(e) Analyze the movie-count for every Genre. See if you can come up with some
interesting hypotheses. For example, you could hypothesize that “Action Genres
occur significantly more often than Drama in the top-250 list.” or that “Action
movies gross higher than Romance movies in the top-250 list.”
Answer: Action movies are usually the best rated by viewers and the most watched. Some
of the movies in the top 50 are rated PG-13. Drama is the second most-watched genre.
The highest average star rating among these movies is only 8.4. Co-directed movies
tend to have higher gross sales; for example, Captain America: Civil War, directed by
the Russo brothers, earned $408.08 M, and Disney’s Zootopia, directed by Byron Howard
and Rich Moore, earned $341.27 M.
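
The tabulation in (d) and the genre analysis in (e) can be sketched with Python's standard library. This is only an illustration under assumptions: the records below are hypothetical placeholders, not scraped IMDb data, and the field names (name, genres, gross_musd) and helper names are invented for the example.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical placeholder records standing in for scraped movie data.
movies = [
    {"name": "Movie A", "genres": ["Action", "Adventure"], "gross_musd": 400.0},
    {"name": "Movie B", "genres": ["Drama"], "gross_musd": 50.0},
    {"name": "Movie C", "genres": ["Action", "Drama"], "gross_musd": 150.0},
]


def to_csv(records, columns):
    """Return the records as CSV text, with columns in the given order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for rec in records:
        row = dict(rec)
        # Flatten the genre list into one cell so each movie stays on one row.
        row["genres"] = "|".join(rec["genres"])
        writer.writerow(row)
    return buf.getvalue()


def genre_counts(records):
    """Count how many movies carry each genre label."""
    counts = Counter()
    for rec in records:
        counts.update(rec["genres"])
    return counts


def mean_gross_by_genre(records):
    """Average gross (in millions of USD) over the movies in each genre."""
    totals = defaultdict(float)
    n = Counter()
    for rec in records:
        for genre in rec["genres"]:
            totals[genre] += rec["gross_musd"]
            n[genre] += 1
    return {genre: totals[genre] / n[genre] for genre in totals}


# Movie name is the first column, as task (d) asks.
table = to_csv(movies, ["name", "genres", "gross_musd"])
counts = genre_counts(movies)
means = mean_gross_by_genre(movies)
```

Counts and per-genre averages like these give a starting point for a hypothesis such as "Action occurs more often than Drama"; whether a difference is significant would still need a proper statistical test on the real data.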

Case Study 2
Use a web scraper to prepare a dataset of all the beers that got over 500 reviews.
(a) Go to https://www.ratebeer.com/beer/top-50/ and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, count, style.”

(c) Filter the data frame. Retain only those beers that got over 500 reviews. Let us
call this Table 1.
(d) Now for each of the remaining beers, go to the beer’s own web page on the
ratebeer site, and scrape the following information:
“Brewed by, Weighted Avg, Seasonal, Est.Calories, ABV, commercial
description” from the top of the page.
Add these fields to Table 1 in that beer’s row.
(e) Now build a separate table for each beer in Table 1 from that beer’s ratebeer
web page. Scrape the first three pages of reviews of that beer and in each review,
scrape the following info:
“rating, aroma, appearance, taste, palate, overall, review (text), location (of
the reviewer), date of the review.”

(f) Store the output in a dataframe, let us call it Table 2.
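
Steps (b) and (c) can be sketched as follows. This is a minimal illustration under assumptions: the HTML snippet is a made-up stand-in, since the real RateBeer markup is not shown here and will differ, and the class name BeerTableParser, the beer names, and the URLs are hypothetical. Only the standard library is used; the resulting list of dictionaries could be handed to a data frame library afterwards.

```python
from html.parser import HTMLParser

# Hypothetical markup: one row per beer with name link, review count, style.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/beer/example-a/1/">Example A</a></td><td>1200</td><td>Imperial Stout</td></tr>
  <tr><td><a href="/beer/example-b/2/">Example B</a></td><td>340</td><td>IPA</td></tr>
  <tr><td><a href="/beer/example-c/3/">Example C</a></td><td>501</td><td>Lambic</td></tr>
</table>
"""


class BeerTableParser(HTMLParser):
    """Collect rows with the columns name, url, count, style."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None    # cell texts of the row being read, or None
        self._url = None      # first link found in the current row
        self._in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._url = [], None
        elif tag == "td" and self._cells is not None:
            self._in_td = True
            self._cells.append("")
        elif tag == "a" and self._in_td and self._url is None:
            self._url = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_td:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_td = False
        elif tag == "tr" and self._cells is not None:
            if len(self._cells) == 3:
                name, count, style = (c.strip() for c in self._cells)
                self.rows.append(
                    {"name": name, "url": self._url,
                     "count": int(count), "style": style}
                )
            self._cells = None


parser = BeerTableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows

# Table 1: keep only beers with more than 500 reviews.
table1 = [row for row in rows if row["count"] > 500]
```

The fields scraped in step (d) could then be added to each row of table1 with dict.update, and Table 2 built the same way, with one dictionary per review.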
