Webscraper Activity

This document outlines a case study to use web scraping to analyze movie and beer data: 1) Scrape IMDB to generate a dataset of the top 50 movies each year for the last 5 years, extracting details like producers, directors, stars, genres, budgets and box office gross. Analyze the data to identify trends like genres that are most common or highest earning. 2) Scrape ratebeer.com to create a dataset of beers with over 500 reviews, collecting information on the brewer, ratings, calories, ABV and description. Then scrape reviews for each beer to analyze ratings and opinions by location and date.

Uploaded by

Rizamae Rabusa

Case Study 1

Use a web scraper to prepare a dataset of all the movies released in the last 5 years,
using IMDb.
(a) Find all movies that were released in the last 5 years.
Answer: 2016 Movie Titles

(b) Generate a file containing URLs for the top 50 movies of each year on IMDb.
Answer: https://www.imdb.com/list/ls033133511/
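One way to approach step (b) is sketched below, using only the Python standard library. It assumes that links to movies on the list page are anchor tags whose href starts with /title/tt followed by digits; the sample HTML, the IDs in it, and the helper names (TitleLinkParser, extract_title_urls, write_urls) are hypothetical stand-ins, not taken from the real page.

```python
import io
import re
from html.parser import HTMLParser

# Assumption: movie links on the list page look like /title/tt1234567/...
TITLE_PATH = re.compile(r"^/title/(tt\d+)/")


class TitleLinkParser(HTMLParser):
    """Collect unique IMDb title URLs from anchor tags, in page order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        match = TITLE_PATH.match(href)
        if match:
            url = f"https://www.imdb.com/title/{match.group(1)}/"
            if url not in self.urls:  # the same movie is often linked twice
                self.urls.append(url)


def extract_title_urls(html, limit=50):
    """Return at most `limit` title URLs found in the given HTML."""
    parser = TitleLinkParser()
    parser.feed(html)
    return parser.urls[:limit]


def write_urls(urls, fh):
    """Write one URL per line to an open text file object."""
    for url in urls:
        fh.write(url + "\n")


# Hypothetical stand-in for a downloaded list page.
SAMPLE_HTML = """
<div><a href="/title/tt0000001/?ref_=x">First</a></div>
<div><a href="/title/tt0000001/?ref_=y"><img src="p.jpg"></a></div>
<div><a href="/title/tt0000002/">Second</a></div>
<div><a href="/name/nm0000003/">A person</a></div>
"""

urls = extract_title_urls(SAMPLE_HTML)
buffer = io.StringIO()  # pass open("top50_urls.txt", "w") to write a real file
write_urls(urls, buffer)
url_file_text = buffer.getvalue()
```

Running the same function once per year's list page and appending to one file would give the URL file the step asks for.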

(c) Read in each URL’s IMDb page and scrape the following information:
Producer(s), Director(s), Star(s), Taglines, Genres, (Partial) Storyline, Box
office budget, and Box office gross.

(d) Build a table with these variables as columns, with the movie name as the first
column.
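A minimal sketch of step (d), assuming each scraped movie is held as a Python dict: the column names and helper functions below are illustrative choices, not part of the assignment. The two sample rows use only the directors and gross figures quoted in the answer to (e); every other field is left blank rather than guessed.

```python
import csv
import io

# Movie name first, then the variables listed in step (c).
COLUMNS = [
    "movie", "producers", "directors", "stars", "taglines",
    "genres", "storyline", "budget", "gross",
]


def to_row(record):
    """Flatten one scraped record into a dict with every column present."""
    row = {}
    for col in COLUMNS:
        value = record.get(col, "")
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)  # multi-valued fields share one cell
        row[col] = value
    return row


def write_table(records, fh):
    """Write the records as CSV with a header row to an open file object."""
    writer = csv.DictWriter(fh, fieldnames=COLUMNS)
    writer.writeheader()
    for record in records:
        writer.writerow(to_row(record))


records = [
    {"movie": "Captain America: Civil War",
     "directors": ["Anthony Russo", "Joe Russo"], "gross": "$408.08 M"},
    {"movie": "Zootopia",
     "directors": ["Byron Howard", "Rich Moore"], "gross": "$341.27 M"},
]

buffer = io.StringIO()  # pass open("movies.csv", "w", newline="") for a real file
write_table(records, buffer)
table_csv = buffer.getvalue()
```

Joining multi-valued fields with a semicolon keeps one row per movie; a separate table per field would be the alternative if the analysis needs one row per person or genre.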

(e) Analyze the movie-count for every Genre. See if you can come up with some
interesting hypotheses. For example, you could hypothesize that “Action Genres
occur significantly more often than Drama in the top-250 list.” or that “Action
movies gross higher than Romance movies in the top-250 list.”
Answer: Action movies are usually the best rated by viewers and the most watched. Some
of the movies in the top 50 are rated PG-13. Drama is the second most watched genre.
The highest average star rating among the movies is only 8.4 stars. Co-directed movies
tend to have higher gross sales; for example, Captain America: Civil War by the Russo
brothers earned $408.08 M and Disney’s Zootopia by Byron Howard and Rich Moore earned
$341.27 M.
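The genre analysis in step (e) can be sketched as a count per genre plus a mean gross per genre. The rows below are hypothetical placeholders (Movie A to Movie D, with made-up genres and grosses in millions of dollars) and are only there to show the mechanics; real values would come from the scraped table.

```python
from collections import Counter, defaultdict

# Hypothetical placeholder rows, not real movie data.
movies = [
    {"movie": "Movie A", "genres": ["Action", "Adventure"], "gross_musd": 400.0},
    {"movie": "Movie B", "genres": ["Drama"], "gross_musd": 50.0},
    {"movie": "Movie C", "genres": ["Action", "Drama"], "gross_musd": 200.0},
    {"movie": "Movie D", "genres": ["Romance"], "gross_musd": None},
]


def genre_counts(rows):
    """Count how many movies carry each genre (a movie can have several)."""
    return Counter(genre for row in rows for genre in row["genres"])


def mean_gross_by_genre(rows):
    """Average gross per genre, skipping movies with no gross figure."""
    grosses = defaultdict(list)
    for row in rows:
        if row["gross_musd"] is None:
            continue
        for genre in row["genres"]:
            grosses[genre].append(row["gross_musd"])
    return {genre: sum(vals) / len(vals) for genre, vals in grosses.items()}


counts = genre_counts(movies)
means = mean_gross_by_genre(movies)
```

A difference in these summaries is only a starting point for a hypothesis such as "Action movies gross higher than Romance movies"; with 50 movies per year, a significance test on the per-movie grosses would be needed before calling a difference significant.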

Case Study 2
Use a web scraper to prepare a dataset of all the beers that received over 500 reviews.
(a) Go to https://www.ratebeer.com/beer/top-50/ and examine the page.
(b) Scrape the page and tabulate the output into a data frame with columns “name,
url, count, style.”
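A sketch of the tabulation in step (b), using the standard-library HTML parser. The real page markup is not shown in this document, so the table layout below (one row per beer, with a linked name, a review count, and a style) is an assumption, and the two beers in it are invented placeholders.

```python
from html.parser import HTMLParser

BASE = "https://www.ratebeer.com"


class BeerTableParser(HTMLParser):
    """Turn assumed <tr><td>...</td></tr> rows into name/url/count/style records."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = None     # cell texts of the current row, None outside a row
        self._href = None      # first link found in the current row
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells, self._href = [], None
        elif tag == "td" and self._cells is not None:
            self._cells.append("")
            self._in_cell = True
        elif tag == "a" and self._in_cell and self._href is None:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._in_cell:
            self._cells[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            # Header rows use <th>, so they have no <td> cells and are skipped.
            if len(self._cells) == 3 and self._href:
                name, count, style = (cell.strip() for cell in self._cells)
                self.rows.append({
                    "name": name,
                    "url": BASE + self._href,
                    "count": int(count.replace(",", "")),  # "1,234" -> 1234
                    "style": style,
                })
            self._cells = None


# Hypothetical markup and placeholder beers.
SAMPLE_HTML = """
<table>
<tr><th>Name</th><th>Count</th><th>Style</th></tr>
<tr><td><a href="/beer/example-stout/1/">Example Stout</a></td><td>1,234</td><td>Imperial Stout</td></tr>
<tr><td><a href="/beer/example-lager/2/">Example Lager</a></td><td>87</td><td>Pale Lager</td></tr>
</table>
"""

parser = BeerTableParser()
parser.feed(SAMPLE_HTML)
rows = parser.rows
```

Because the count is stored as an integer, the later filter is a one-liner such as `[r for r in rows if r["count"] > 500]`.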

(c) Filter the data frame. Retain only those beers that got over 500 reviews. Let us
call this Table 1.
(d) Now for each of the remaining beers, go to the beer’s own web page on the
ratebeer site, and scrape the following information:
“Brewed by, Weighted Avg, Seasonal, Est.Calories, ABV, commercial
description” from the top of the page.
Add these fields to Table 1 in that beer’s row.
(e) Now build a separate table for each beer in Table 1 from that beer’s ratebeer
web page. Scrape the first three pages of reviews of that beer and in each review,
scrape the following info:
“rating, aroma, appearance, taste, palate, overall, review (text), location (of
the reviewer), date of the review.”

(f) Store the output in a data frame; let us call it Table 2.
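Steps (e) and (f) amount to a loop over beers and over the first three review pages, flattened into one table. In the sketch below the actual page scraping is passed in as a function, because the review page URLs and markup are not given here; fake_fetch, the placeholder beer, and the review values are all hypothetical.

```python
REVIEW_FIELDS = [
    "rating", "aroma", "appearance", "taste", "palate",
    "overall", "review", "location", "date",
]


def collect_reviews(beers, fetch_reviews, pages=3):
    """Build Table 2: one row per review from the first `pages` pages of each beer.

    `fetch_reviews(url, page)` must return a list of dicts for that page.
    """
    table2 = []
    for beer in beers:
        for page in range(1, pages + 1):
            for review in fetch_reviews(beer["url"], page):
                row = {"beer": beer["name"], "page": page}
                # Missing fields become None so every row has the same columns.
                row.update({field: review.get(field) for field in REVIEW_FIELDS})
                table2.append(row)
    return table2


def fake_fetch(url, page):
    """Hypothetical stand-in for a real scraper: one canned review on pages 1-2."""
    if page > 2:
        return []
    return [{"rating": 4.0, "aroma": 8, "review": f"placeholder text, page {page}"}]


# Placeholder Table 1 with a single invented beer.
table1 = [{"name": "Example Stout",
           "url": "https://www.ratebeer.com/beer/example-stout/1/"}]
table2 = collect_reviews(table1, fake_fetch)
```

Keeping the beer name in every row lets all beers share one Table 2; splitting it by the "beer" column afterwards gives the separate per-beer tables that step (e) describes.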
