
Web Scraping using Python

Topics Covered:
● Introduction to Web Scraping
● How Does Web Scraping Work?
● Steps involved in web scraping
● Installing BeautifulSoup
● Installing Requests
● Scraping and Analyzing data from Worldometer website

Introduction to Web Scraping


What is Web Scraping? Why do we use Web Scraping?

Web scraping, web harvesting, or web data extraction is an
automated process of collecting large amounts of (often unstructured)
data from websites. It is the process of gathering information from the
Internet. Even copying and pasting the lyrics of your favorite song is a
form of web scraping! However, the words “web scraping” usually refer
to a process that involves automation. Some websites don’t like it when
automatic scrapers gather their data, while others don’t mind. The
data collected can be stored in a structured format for further
analysis.

If you’re scraping a page respectfully for educational purposes, then
you’re unlikely to have any problems. Still, it’s a good idea to do some
research on your own and make sure that you’re not violating any
Terms of Service before you start a large-scale project.

How Does Web Scraping Work?

When we scrape the web, we write code that sends a request to the
server that’s hosting the page we specified. The server will return the
source code — HTML, mostly — for the page (or pages) we requested.
So far, we’re essentially doing the same thing a web browser does —
sending a server request with a specific URL and asking the server to
return the code for that page.
But unlike a web browser, our web scraping code won’t interpret the
page’s source code and display the page visually. Instead, we’ll write
some custom code that filters through the page’s source code looking
for specific elements we’ve specified, and extracting whatever content
we’ve instructed it to extract.
For example, if we wanted to get all of the data from inside a table that
was displayed on a web page, our code would be written to go
through these steps in sequence:

● Request the content (source code) of a specific URL from the
server
● Download the content that is returned
● Identify the elements of the page that are part of the table we
want
● Extract and (if necessary) reformat those elements into a dataset
we can analyze or use in whatever way we require.
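As a hedged sketch, the four steps above might look like this with the requests and BeautifulSoup libraries introduced below (the URL and the sample table here are hypothetical placeholders, not a real dataset):

```python
import requests
from bs4 import BeautifulSoup

def table_to_rows(html):
    """Parse HTML and return the first table's rows as lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")          # identify the table element we want
    if table is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        for tr in table.find_all("tr")  # extract and reformat each row
    ]

# Steps 1-2: request and download the source code of a page
# (https://example.com/data is a placeholder URL):
# html = requests.get("https://example.com/data", timeout=10).text

# Steps 3-4: identify the table and extract it into a dataset:
sample = "<table><tr><th>Country</th></tr><tr><td>India</td></tr></table>"
print(table_to_rows(sample))  # [['Country'], ['India']]
```

Keeping the parsing in its own function makes the extraction step easy to test against a small HTML snippet before pointing it at a live page.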

Steps involved in web scraping

● Send an HTTP request to the URL of the webpage you want to
access. The server responds to the request by returning the
HTML content of the webpage.
● Once we have accessed the HTML content, we are left with the
task of parsing the data. Since most of the HTML data is nested,
we cannot extract data simply through string processing. We
need a parser that can create a nested/tree structure of the
HTML data.

● Now, all we need to do is navigate and search the parse tree that
we created, i.e. tree traversal. For this task, we will be using
another third-party Python library, Beautiful Soup. It is a Python
library for pulling data out of HTML and XML files.
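A small illustration of that parse tree and its traversal (the HTML snippet is made up for the example):

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a downloaded page.
html = """
<html><body>
  <div id="main">
    <p class="intro">Hello</p>
    <p>World</p>
  </div>
</body></html>
"""

# Beautiful Soup turns the nested HTML into a tree we can search and traverse.
soup = BeautifulSoup(html, "html.parser")

print(soup.find("p", class_="intro").get_text())   # Hello
print([p.get_text() for p in soup.find_all("p")])  # ['Hello', 'World']
print(soup.p.parent.get("id"))                     # main -- a tag knows its parent
```

Because the document is a tree, every tag knows its parent, children, and siblings, which is what makes navigation like `soup.p.parent` possible.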

Installing BeautifulSoup

BeautifulSoup is one of the most prolific Python libraries in existence,
having in some part shaped the web as we know it. BeautifulSoup is a
lightweight, easy-to-learn, and highly effective way to
programmatically isolate information on a single webpage at a time.
It's common to use BeautifulSoup in conjunction with the requests
library, where requests will fetch a page, and BeautifulSoup will extract
the resulting data.

● To install BeautifulSoup, type pip install beautifulsoup4 in the
Command prompt/terminal.

● Or type !pip install beautifulsoup4 or %pip install beautifulsoup4
in a Jupyter notebook cell.

● Then type from bs4 import BeautifulSoup to import BeautifulSoup.

For more information on BeautifulSoup, please refer to the official
Beautiful Soup Documentation.
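After installing, a one-line parse is a quick way to confirm the import works:

```python
from bs4 import BeautifulSoup

# Parse a tiny snippet to confirm the install; "html.parser" ships with Python.
soup = BeautifulSoup("<p>ok</p>", "html.parser")
print(soup.p.get_text())  # ok
```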

Installing Requests

The first thing we’ll need to do to scrape a web page is to download
the page. We can download pages using the Python requests library.
The requests library will make a GET request to a web server, which
will download the HTML contents of a given web page for us. There are
several different types of requests we can make using requests, of
which GET is just one.
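To see what requests assembles before anything is sent, we can build a GET request without dispatching it (https://example.com/ is a placeholder URL):

```python
import requests

# Build -- but do not send -- a GET request, to inspect what requests
# puts together: the method and the fully encoded URL.
req = requests.Request("GET", "https://example.com/", params={"q": "demo"})
prepared = req.prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://example.com/?q=demo

# Actually sending a GET and reading the page's HTML is a one-liner:
# html = requests.get("https://example.com/", timeout=10).text
```

The same `requests.Request` constructor accepts other methods ("POST", "HEAD", and so on), which is what "several different types of requests" refers to.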

● To install Requests, type pip install requests in the Command
prompt/terminal.

● Or type !pip install requests or %pip install requests in a Jupyter


notebook cell.

● Then type import requests to import requests.

For more information on Requests, please refer to the official Requests
Documentation.

Scraping and Analyzing data from Worldometer
website

We have scraped Covid19 confirmed cases and deaths by country and
continent from the Worldometer website, purely for educational
purposes. The website link, along with the reference notebook and the
scraped dataset with analysis, is given below:

● Worldometer website link


● Jupyter Notebook Download Link
● Scraped Covid19 Dataset Download Link
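A hedged sketch of the scrape: the table id "main_table_countries_today" matched the Worldometer COVID-19 page when this was written, but the site's markup can change, so inspect the live page yourself before relying on it.

```python
import requests
from bs4 import BeautifulSoup

def scrape_covid_table(html):
    """Extract the country rows from Worldometer's COVID-19 table."""
    soup = BeautifulSoup(html, "html.parser")
    # This id matched the live page at the time of writing; it may change.
    table = soup.find("table", id="main_table_countries_today")
    if table is None:
        return []
    rows = []
    for tr in table.find_all("tr"):
        cells = [c.get_text(strip=True) for c in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    return rows

# Fetch the page and extract the table (requires network access):
# html = requests.get("https://www.worldometers.info/coronavirus/",
#                     timeout=10).text
# rows = scrape_covid_table(html)  # each row: country, cases, deaths, ...
```

The extracted rows can then be loaded into a DataFrame or written to CSV for the kind of analysis shown in the reference notebook.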
