
Tutorial:

Web scraping in R

In this tutorial, we will learn to scrape the list of the most watched series of all time
according to IMDb, found at https://www.imdb.com/list/ls095964455/.

What is web scraping?


Web scraping is when we extract information (data) from a web page and then export it to
another format -such as Excel- so that it is easier to work with. It can be done manually, by
copying and pasting the information into a new document, or we can automate the process by
using diverse tools -in this case, R. This is especially useful when we are working with a lot of data.

Getting started…
We will need to download a package into our R system that will help with the scraping. For this
tutorial we will use the rvest package, which allows us to download and read an HTML document.
To install it, use the following code; remember you only have to do this once.

install.packages("rvest")
If you are not familiar with HTML code, we will also need a Google Chrome extension that will
allow us to select and import only the data we are interested in. You can install it by looking
for SelectorGadget on the Chrome Web Store.

Now that we have our package installed, we need to load it into this session. Do this every time
you open an R session.
library(rvest)

We will also need to install and load the tidyverse package to create tibbles and easily manipulate
the data, as well as to make use of the pipe (%>%). This is a tool that enables us to chain
many steps without saving intermediate results, allowing for shorter and more elegant code.
When going over your code, it is useful to read the pipe operator as a "then". To install and load
the package, use the next code chunk:
install.packages("tidyverse")
library(tidyverse)
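As a quick illustration of how the pipe reads, here is a toy example unrelated to the scraping itself (it assumes tidyverse is loaded, since that is what provides %>%):

```r
# Nested calls read inside-out:
sum(sqrt(c(4, 9, 16)))
#> [1] 9

# The pipe reads left to right: take the vector, THEN take the
# square root of each element, THEN sum the results
c(4, 9, 16) %>% sqrt() %>% sum()
#> [1] 9
```

Both lines compute the same thing; the piped version simply matches the order in which you would describe the steps out loud.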
Next, we will have to load the webpage we would like to scrape into R. For this, we will get the
URL from the site and assign it to a variable called link. To do this, copy and paste the whole
link of the webpage you want to scrape, in between quotation marks, in the R console.
link <- "https://www.imdb.com/list/ls095964455/"

Once the page is in our R script, we need to read it in a way that is understandable for R, so that
we can work with the data. For this, we will use the function read_html() from the rvest package,
which takes a URL as input and returns the HTML document in a form that is "understandable" for R.
Because our URL is stored in a variable called link, this is the argument we will pass to the read
function; it looks like this:
read_html(link)
In this tutorial we will extract each series' name, its rating, its rank in the list (the most
watched series being #1), the intended audience (R, PG-13, etc.), the episode length, and the
genre tags. Let's start with the series name; for this, we will create a variable called series_names.
series_names <- read_html(link)

The read_html() function reads the whole HTML code, so we now need to select only the part we
are interested in: the names of the series. For this, it is important to know that HTML uses tags
and classes to mark the different kinds of information; if you know HTML, look for the tag you
need. Otherwise, we can use the SelectorGadget tool to help us find the tags. Once you have the
extension installed, simply go to the page you want to scrape, open the extension, and with your
mouse select the information you want to extract (in this case the series name). All the series
names should appear in yellow; if something else is also selected, click on it to deselect it (it
will turn red). Once you are sure only the data you want is highlighted (selected), copy the path
that appears in the lower right corner; this is the node.
This path is the argument we will pass to our next function, html_nodes(), which selects only
the elements we indicated (the names) and extracts them. Keep in mind the tag also has to go
in between quotation marks. Our code now looks like this:
series_names <- read_html(link) %>% html_nodes(".lister-item-header a")
Notice that we are using the %>% operator, which takes the result of what is on the left of the
pipe and passes it as the first argument to what is on the right of the pipe.
The html_nodes() function also extracts the whole HTML string for each element, which contains
symbols that are not useful for us (it will extract something like this:
<a href="/title/tt0944947/?ref_=ttls_li_tt">Game of Thrones</a>). To solve this issue, we can
use the html_text() function to extract only the text part; as you may infer, we will add it to
the pipe.
series_names <- read_html(link) %>% html_nodes(".lister-item-header a") %>%
  html_text()

We can run series_names to confirm we correctly extracted the info we wanted. The first
lines should look like this
[1] "Game of Thrones" "Stranger Things"
[3] "The Walking Dead" "13 Reasons Why" ….

Repeat this process for all of the other information you wish to extract (you will find the whole
code on the last page).
There is an extra step for when you are extracting more than one element for the same variable;
for example, the series genre can have two or more categories separated by commas (Game of
Thrones has "Action, Adventure, Drama" as genres). So after running the code:
series_genre <- read_html(link) %>% html_nodes(".genre") %>% html_text()
R will extract strings like "\nAction, Adventure, Drama", and we want to delete the \n for
all the series, but doing it manually would take a lot of time. So we will substitute every \n
with an empty string; for this purpose we will use the gsub() function, whose first argument is
the pattern we wish to replace, the second one is the replacement, and the last argument is the
data in which the replacement will be made.
series_genre <- gsub("\n", "", series_genre)
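To see what gsub() does before running it on the real scraped data, here is a quick illustration on a made-up vector (the strings below are invented to mimic what html_text() returns):

```r
# Toy vector mimicking the raw genre strings from html_text()
raw_genre <- c("\nAction, Adventure, Drama", "\nDrama, Fantasy, Horror")

# Replace every "\n" with an empty string
gsub("\n", "", raw_genre)
#> [1] "Action, Adventure, Drama" "Drama, Fantasy, Horror"
```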

Finally, we will create a tibble (a table) with all this information, where each of the variables we
have created will become a column. We will assign the tibble to a variable called series_info. To
create it, use the tibble() function from the tidyverse package. The arguments for this function
must have the format 'Column name' = variable_name, 'Column name 2' = variable_name_2.
series_info <- tibble('Rank' = series_rank, 'Title' = series_names,
                      'IMDb Rating' = series_rating,
                      'Episode duration' = series_episode_length,
                      'Audience' = series_audience, 'Genre' = series_genre)

Optionally, you can export the tibble you just created to a CSV file (which you can open in
Excel). Use the write.csv() function, where the first argument is the tibble name, followed by
the name you want for your file in between quotes.
write.csv(series_info, "Most_watched_series.csv")
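By default, write.csv() also writes a first column with R's row numbers. If you prefer the file without it, write.csv() accepts the optional argument row.names = FALSE:

```r
# Same export, but without the extra column of row numbers
write.csv(series_info, "Most_watched_series.csv", row.names = FALSE)
```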
library(rvest)

library(tidyverse)

link <- "https://www.imdb.com/list/ls095964455/"

# Name of the series

series_names <- read_html(link) %>% html_nodes(".lister-item-header a") %>% html_text()

# Rating given by IMDb users

series_rating <- read_html(link) %>% html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>% html_text()

# Rank in the list (most watched)

series_rank <- read_html(link) %>% html_nodes(".text-primary") %>% html_text()

# Recommended audience

series_audience <- read_html(link) %>% html_nodes(".certificate") %>% html_text()

# Average length of one episode

series_episode_length <- read_html(link) %>% html_nodes(".runtime") %>% html_text()

# Series genres or categories

series_genre <- read_html(link) %>% html_nodes(".genre") %>% html_text()

series_genre <- gsub("\n", "", series_genre)

series_info <- tibble('Rank' = series_rank, 'Title' = series_names,
                      'IMDb Rating' = series_rating,
                      'Episode duration' = series_episode_length,
                      'Audience' = series_audience,
                      'Genre' = series_genre)

write.csv(series_info, "Most_watched_series.csv")
