
Data collection from the Web:

Web Scraping
Web Analytics – 4th course. Degree in Data Science and Engineering

Patricia Callejo (pcallejo@inst.uc3m.es)


Angel Cuevas (acrumin@it.uc3m.es)
Rubén Cuevas (rcuevas@it.uc3m.es)
Data collection from the web
• With the creation of the Internet and the World Wide Web, the need to collect
data from the web emerged. Indeed, specific tools have accompanied the
Internet's expansion and evolution, allowing users to collect and process data
from the web.

• What are the main tools for extracting data from the web?
• Web scraping
• APIs (Application Programming Interfaces) (Covered in Block 2, Part 2 - APIs).
Web scraping history
The history of web scraping dates back almost to the time when the World Wide Web was born.
• After the birth of the World Wide Web in 1989, the first web robot, the World Wide Web Wanderer,
was created in June 1993; it was intended only to measure the size of the web.
• In December 1993, the first crawler-based web search engine, JumpStation, was launched. As
there were not many websites available on the web, search engines at that time used to rely on
human website administrators to collect and edit the links into a particular format. In
comparison, JumpStation brought a new leap as the first WWW search engine to rely on a web robot.
• In 2000, the first Web API and API crawler arrived. API stands for Application Programming
Interface. It is an interface that makes it much easier to develop a program by providing the
building blocks. In 2000, Salesforce and eBay launched their own APIs, with which programmers
were able to access and download some of the data available to the public. Since then, many
websites have offered web APIs for people to access their public databases.
Source: Wikipedia
Web scraping definition

Web scraping is the process of automatically retrieving and downloading data


from websites. Web scraping tools are pieces of software that access the WWW
directly, using an HTTP session or a web browser, to extract information.
Although a user can perform web scraping manually, the term usually refers to
automated processes implemented using a bot or web crawler. The data is obtained
and copied from the web, and usually stored in a database or a file for later analysis.
Uses of web crawlers
1. They are one of the most important components of web search engines (e.g.,
Google, Bing, Yahoo). Search engines are software systems that allow users to
conduct web searches. They assemble a corpus of web pages, index them, and allow
users to make queries to find the web pages that match them; this process is
known as web indexing. Search engines only store and index information from
the open web.
Uses of web crawlers
2. A related use is web archiving (a service provided by, e.g., the Internet Archive),
digital libraries of Internet sites, and other cultural artifacts in digital form.

3. A third use is web data mining, where web pages are analyzed for statistical
properties, or where data analytics is performed on them.

4. But there are more uses …


Focus on Data Science
Businesses that use web scraping
Web crawler overview
(web archiving / web indexing)
1. A web crawler starts with a list of web pages to visit.
2. As the crawler visits these websites, it identifies all the hyperlinks in the pages
and adds them to the list of sites to download.
3. It saves the text and metadata of each URL visited.

Challenges:
- Deal with the large volume of URLs to crawl within a given time.
- Avoid duplicated content.
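
As an illustration, the crawl loop above can be sketched in a few lines of Python using the Requests and Beautiful Soup libraries (both introduced later in this lesson); the seed URLs, page limit, and timeout are placeholders, and a real crawler would also need politeness rules and better error handling.

# Minimal sketch of the crawl loop: visit URLs, collect links, save text.
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=20):
    frontier = deque(seed_urls)   # 1. list of web pages to visit
    visited = set()               # used to avoid duplicated content
    pages = {}                    # URL -> extracted text, saved for later use

    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            response = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(response.text, "html.parser")
        pages[url] = soup.get_text()                  # 3. save the text of the URL
        for link in soup.find_all("a", href=True):    # 2. identify hyperlinks
            frontier.append(urljoin(url, link["href"]))
    return pages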
Web crawler overview
(web archiving / web indexing)
Selection policy
• Due to the current size of the Web, even the most powerful search engines
cover only a fraction of the openly accessible web content.
• Because a crawler only downloads a portion of the Web pages, it is critical that
the downloaded fraction contains the most relevant pages rather than a random
sample. For this reason, it requires importance metrics for prioritizing the
set of URLs. The importance of a page is a function of its intrinsic quality and its
popularity in terms of links or visits. Common prioritization strategies nowadays include
breadth-first search and PageRank (covered in Block 3, Graph Theory).
Steps for web data extraction
1. We start the process by sending an HTTP GET request to receive the HTTP
response with the web content.
2. Once we have the HTML code of the web, we parse the code following a tree
structure path.
3. Last, we extract the data selected from the HTML and download it to a file or a
database.
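
A minimal sketch of these three steps in Python, assuming the Requests and Beautiful Soup libraries presented later in this lesson and a placeholder URL:

import requests
from bs4 import BeautifulSoup

# 1. Send an HTTP GET request and receive the HTTP response.
response = requests.get("https://example.com")

# 2. Parse the HTML code of the page into a tree structure.
soup = BeautifulSoup(response.text, "html.parser")

# 3. Extract the selected data and download it to a file.
title = soup.title.string if soup.title else ""
with open("output.txt", "w", encoding="utf-8") as f:
    f.write(title)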
HTML review
Hypertext Markup Language (HTML) is the main language used to write/build web
pages. HTML describes the structure of a web page and it can be used with Cascading
Style Sheets (CSS) (to describe the presentation of web pages, including colors, layout,
and fonts) and a scripting language such as JavaScript (to create interactive websites).
HTML tags
• HTML tags are used to define the
start of an HTML element. All tags
are enclosed in angle brackets < >.
• Most tags must be opened <start
tag> and closed </end tag>, although
some tags do not need to be closed.
• When a web browser reads an HTML
document, it is able to display its
content following the rules defined
by each tag.
HTML tags
• <!DOCTYPE html> defines the document type and the HTML version. Current version of HTML is 5.
• <html> root element of an HTML page
• <head> contains meta information about the page
• <body> contains the visible part of the page
• <title> specifies the title of the document. Seen on browser tab and search results
• <p> defines a paragraph, use for normal text
• <a> anchor link, i.e., hyperlink
• <h1> main page heading. <h2> … <h6> subheadings within text
• <img> images
• <ul> unordered list, bullet point lists. <ol> ordered list, numbered lists. <li> list items
• <table> , <tr> , <td> table, rows, cells
• <div> division, a section of the page
Check the full list here: https://www.w3schools.com/tags/
HTML attributes
• The HTML tags can have attributes. The attributes contain additional
information about elements.
• HTML attributes are always specified in the start tag.
• Attributes usually come in name/value pairs like: name="value"

An example of an attribute is:


<img src="myimage.png" alt="This is an empty photo">
src: specifies the path to the image
alt: indicates an alternate text for the image, if it cannot be displayed
HTML attributes
• The id attribute is used to specify a unique id for an element. It is used by CSS and JavaScript to
style/select a specific element.
• The class attribute specifies one or more class names for an element. Classes are used by CSS and
JavaScript to select and access specific elements.
• The href attribute of <a> specifies the URL of the page the link goes to.
• The src attribute of <img> specifies the path to the image to be displayed.
• The width and height attributes of <img> provide size information for images.
• The alt attribute of <img> provides an alternate text for an image.
• The style attribute is used to add styles to an element, such as color, font, size, and more.
• The lang attribute of the <html> tag declares the language of the Web page.

Check all HTML attributes: https://www.w3schools.com/tags/ref_attributes.asp
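
As a small illustration (using Beautiful Soup, presented later in this lesson), this is how such attributes can be read once an element has been located; the HTML fragment is made up:

from bs4 import BeautifulSoup

html = '<img id="logo" class="photo" src="myimage.png" alt="This is an empty photo">'
soup = BeautifulSoup(html, "html.parser")
img = soup.find("img")

print(img["src"])        # myimage.png
print(img["alt"])        # This is an empty photo
print(img.get("id"))     # logo
print(img.get("class"))  # ['photo'] -- class may hold several names, so it is returned as a list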


DOM tree
• The DOM represents HTML as a tree
structure of tags, where each node is
an object representing a part of the
document.
• Each branch of the tree ends in a
node, and each node contains
objects.
• DOM methods allow programmatic
access to the tree.

https://en.wikipedia.org/wiki/Document_Object_Model
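
A small sketch of such programmatic access to the tree using Beautiful Soup's navigation attributes (the HTML document is invented):

from bs4 import BeautifulSoup

html = "<html><body><p>Hello, <a href='/about'>world</a>!</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

p = soup.find("p")
print(p.name)          # p
print(p.parent.name)   # body  (the node that contains <p>)
print(p.a.name)        # a     (the first <a> inside <p>)
print(p.a["href"])     # /about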
Web browser developer tools
• Every modern web browser includes a powerful suite of developer tools. These tools do
a range of things, from inspecting currently-loaded HTML, CSS and JavaScript to
showing which assets the page has requested and how long they took to load.
• Keyboard: Ctrl + Shift + I, except
•Internet Explorer and Edge: F12
•macOS: ⌘ + ⌥ + I

• Menu bar:
•Firefox: Menu ➤ Web Developer ➤ Toggle Tools, or Tools ➤ Web Developer ➤ Toggle Tools
•Chrome: More tools ➤ Developer tools
•Safari: Develop ➤ Show Web Inspector. If you can't see the Develop menu, go to Safari ➤ Preferences ➤ Advanced,
and check the Show Develop menu in menu bar checkbox.
•Opera: Developer ➤ Developer tools

• Context menu: Press-and-hold/right-click an item on a webpage (Ctrl-click on the Mac), and choose Inspect Element from
the context menu that appears. (An added bonus: this method straight-away highlights the code of the element you right-
clicked.)

More info: here


CSS
• CSS (Cascading Style Sheets) is the language used to
style an HTML document. CSS controls how HTML
elements look in web pages.
• While HTML is used to define the structure and
semantics of your content, CSS is used to style it and
lay it out. For example, you can use CSS to alter the
font, color, size, and spacing of your content, split it
into multiple columns, or add animations and other
decorative features.
• A CSS selector is the first part of a CSS Rule. It is a
pattern of elements and other terms that tell the
browser which HTML elements should be selected to
have the CSS property values inside the rule applied
to them.
CSS Selectors

More info: https://www.w3schools.com/cssref/css_selectors.asp
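
As a sketch of how CSS selectors are used in practice, here through Beautiful Soup's select() method (the HTML fragment is made up):

from bs4 import BeautifulSoup

html = """
<div id="main">
  <p class="intro">First paragraph</p>
  <p>Second paragraph</p>
  <a href="https://example.com">A link</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.select("p"))        # every <p> element
print(soup.select(".intro"))   # elements with class="intro"
print(soup.select("#main"))    # the element with id="main"
print(soup.select("div > a"))  # <a> elements that are direct children of a <div>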


XPaths Syntax
• XPath uses path expressions to select nodes in an XML document (and, similarly, in
an HTML document). Nodes are selected by following a path or steps. The most
useful path expressions are listed below:

More info: https://www.w3schools.com/xml/xpath_syntax.asp
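
A short sketch of the same idea with XPath, using the lxml library (not one of the four tools covered below) and a made-up HTML fragment:

from lxml import html

doc = html.fromstring("""
<div id="main">
  <p class="intro">First paragraph</p>
  <a href="https://example.com">A link</a>
</div>
""")

print(doc.xpath("//p"))                  # every <p> node, anywhere in the document
print(doc.xpath("//p[@class='intro']"))  # <p> nodes with class="intro"
print(doc.xpath("//a/@href"))            # the href attribute of every <a>
print(doc.xpath("//div[@id='main']/a"))  # <a> children of the div with id="main"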


Web Scraping tools in Python
• Requests: is a simple HTTP library for Python. It allows you to download the HTML
document, which is the first step before parsing the web page.
• Beautiful Soup: is a Python library for pulling data out of HTML and XML files.
• Scrapy: is an open-source Python framework used for crawling websites and
extracting data.
• Selenium: set of tools and libraries that enable and support the automation of
web browsers. It is used for testing websites, but also for crawling them.
Requests
• Requests is the most user-friendly HTTP library available. It permits the user to
send requests to an HTTP server and get the response back in the form of HTML or
JSON.
When to use Requests?
• When it comes to web scraping, Requests is the ideal starting point. It is simple to use
and does not require a lot of practice. However, it only downloads the content of web pages;
you will need another library, like BeautifulSoup, to parse the HTML and extract
the information.
• If the web page you are trying to scrape has dynamic JavaScript content, don't rely on
Requests alone: the response may not contain the data rendered by JavaScript.
Alternative libraries:
• urllib and urllib2: also allow you to send requests to HTTP servers. However, they are more
complicated to use than Requests.
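
A minimal sketch of a GET request with Requests; the URL is a placeholder:

import requests

response = requests.get("https://example.com", timeout=10)

print(response.status_code)              # 200 if everything went well
print(response.headers["Content-Type"])  # e.g. text/html; charset=UTF-8
html = response.text                     # the raw HTML, ready to be parsed with another library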
Beautiful Soup
• Beautiful Soup is a Python library used to extract information from XML and
HTML files. It is considered a parser library. Parsers help in the task of extracting
data from HTML files; without them, we would have to rely on regular
expressions, which is inefficient.
When to use Beautiful Soup?
• Beautiful Soup is the perfect library for starting with web scraping. It is
straightforward to use and allows the usage of CSS selectors to locate HTML
elements.
• However, it is only recommended for small web scraping projects, as it is not
very flexible and becomes difficult to maintain as the project grows. It is also not
recommended if we need to extract information generated by JavaScript.
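
A minimal Beautiful Soup sketch; in a real scraper, the HTML string would come from a Requests response rather than being hard-coded:

from bs4 import BeautifulSoup

html = """
<h1>Products</h1>
<ul>
  <li class="item">Laptop</li>
  <li class="item">Phone</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("h1").get_text())                        # Products
print([li.get_text() for li in soup.find_all("li")])     # ['Laptop', 'Phone']
print([li.get_text() for li in soup.select("li.item")])  # same result with a CSS selector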
Scrapy
• Scrapy has become one of the most popular Python tools for web scraping. It is
a free and open-source framework and a complete tool for
systematically scraping and crawling the web.
• It can scrape multiple websites recursively, which makes it the perfect tool for
big scraping projects. It is also very efficient in terms of CPU and memory.
When to use Scrapy?
• Scrapy is designed for large-scale web scraping tasks. It allows creating well-
structured and flexible projects that are easy to maintain and to grow, e.g., by
adding more web pages to the spider.
• Although it may seem to be all advantages, it is not recommended if the scraper targets
only one or two web pages, as it may add more complexity than needed.
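
A minimal sketch of a Scrapy spider; the spider name, start URL, and CSS selectors are illustrative (they follow the public practice site quotes.toscrape.com used in the Scrapy tutorial) and would need to match the real site being scraped:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Extract one item per quote block on the page.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow the "next page" link recursively, if there is one.
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Saved as, e.g., quotes_spider.py, it can be run with: scrapy runspider quotes_spider.py -o quotes.json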
Selenium
• Selenium is an open-source framework based on web browser automation. It
uses a web driver, which means it can open a web page, fill in a form, click a button,
and save the results. It is a powerful tool for automating tests. It allows the code
to act like a person, which makes it possible to obtain information even from
content generated by JavaScript.
• It is a beginner-friendly tool that doesn't have a steep learning curve.
When to use Selenium?
• Selenium is the right choice if you need to scrape a few web pages, and even
more so if the information you need to extract is rendered by JavaScript. It is a powerful,
flexible tool that does not require much effort to build your first scraper.
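
A minimal Selenium sketch (Selenium 4 style); the URL is a placeholder, and a Chrome browser with a matching driver is assumed to be available:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()         # opens a real browser window
driver.get("https://example.com")   # load the page; JavaScript runs as in a normal visit

heading = driver.find_element(By.TAG_NAME, "h1").text
print(heading)

driver.quit()                       # always close the browser when done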
Ethical guidelines
• Respect the robots.txt rules.
• If there is an API available for downloading the content, use it, and avoid scraping
that site if there is another way to get the data.
• Request data at a reasonable rate; otherwise, the site may think you are attacking it.
• Save only the data you really need for your purpose.
• If dealing with personal data, follow the GDPR, and anonymize any personal
information you store.

Always have good intentions when scraping sites!
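
As a hedged sketch of what "a reasonable rate" can look like in practice, the scraper below identifies itself with a custom User-Agent header and pauses between requests; the URLs, header value, and delay are placeholders.

import time
import requests

# A descriptive User-Agent (with contact details) helps site owners identify the scraper.
headers = {"User-Agent": "web-analytics-lab-scraper (contact: student@example.com)"}
urls = ["https://example.com/page1", "https://example.com/page2"]

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    # ... extract and save only the data you really need ...
    time.sleep(2)  # pause so the server is not flooded with requests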


Robots.txt
• The robots exclusion standard, or simply robots.txt, is a standard used by
websites to communicate with web crawlers and other web robots. It tells a
web robot/scraper which parts of the site should not be crawled or processed.
Examples:

Source: Wikipedia
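
A small sketch of how a scraper can check robots.txt before crawling, using Python's standard urllib.robotparser module; the URLs and paths are placeholders:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the robots.txt file

# can_fetch(user_agent, url) tells whether the given agent may crawl the URL.
print(rp.can_fetch("*", "https://example.com/private/page.html"))
print(rp.can_fetch("*", "https://example.com/index.html"))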
Main challenges of web scraping
• Many websites have anti-scraping/crawling policies, making it more challenging
to gather information. They may block certain user agents or directly block IPs,
preventing any data collection. Other sites may limit content extraction, e.g., you
may only be able to scrape a website once per day.
• Some crawling/scraping jobs are very time-consuming and require scheduling in
advance or even adding asynchronous and parallel jobs to be able to extract the
information in a reasonable time.
• In addition, depending on the scale of the project, they may require a large
infrastructure deployment, which is another aspect to consider.
Prepare yourself for the labs

• Check if you have access to Google Colab notebooks or your own Python3+Jupyter
installation on your PC.
• Check if you have installed the four libraries we learned about today. If not, check
how to install them.
• And if you have time… start building your first spider!