Web Scraping
Web Analytics – 4th course. Degree in Data Science and Engineering
• What are the main tools for extracting data from the web?
• Web scraping
• APIs (Application Programming Interfaces) (Covered in Block 2, Part 2 - APIs).
Web scraping history
The history of web scraping dates back nearly to the birth of the World Wide Web.
• The World Wide Web was born in 1989. The first web robot, the World Wide Web Wanderer, was
created in June 1993 and was intended only to measure the size of the web.
• In December 1993, the first crawler-based web search engine, JumpStation, was launched. Because
few websites existed at the time, earlier search engines relied on human website administrators
to collect links and edit them into a particular format. JumpStation was a leap forward: it was
the first WWW search engine to rely on a web robot.
• In 2000, the first web APIs and API crawlers appeared. API stands for Application Programming
Interface: an interface that makes it much easier to develop a program by providing the
building blocks. That year, Salesforce and eBay launched their own APIs, which let programmers
access and download some of their publicly available data. Since then, many websites have
offered web APIs that give access to their public data.
Source: Wikipedia
Web scraping definition
3. A third use is web data mining, where web pages are analyzed for statistical
properties, or where data analytics is performed on them.
Challenges:
- Deal with the large volume of URLs to crawl within a given time.
- Avoid duplicated content.
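One simple way to approach both challenges is to normalize every URL before queuing it and to remember which URLs have already been seen, so the same page is not downloaded twice. The sketch below is only an illustration: the names `normalize_url` and `Frontier` are hypothetical, and it detects duplicate URLs only, not pages with identical content under different addresses.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form so trivially different URLs compare equal."""
    url, _fragment = urldefrag(url)          # drop "#section" anchors
    parts = urlsplit(url)
    path = parts.path or "/"                 # treat "" and "/" as the same page
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


class Frontier:
    """FIFO queue of URLs to crawl that ignores URLs it has already seen."""

    def __init__(self):
        self._queue = deque()
        self._seen = set()

    def add(self, url):
        key = normalize_url(url)
        if key in self._seen:
            return False                     # duplicate, not queued again
        self._seen.add(key)
        self._queue.append(key)
        return True

    def pop(self):
        return self._queue.popleft()

    def __len__(self):
        return len(self._queue)


frontier = Frontier()
for link in ["https://Example.com", "https://example.com/#top",
             "https://example.com/about"]:
    frontier.add(link)
print(len(frontier))   # number of distinct pages queued
```

A set lookup takes roughly constant time, which matters when the number of URLs grows large. Very large crawls often replace the in-memory set with a disk-backed or probabilistic structure.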
Web crawler overview
(web archiving / web indexing)
Selection policy
• Due to the current size of the Web, even the most powerful search engines
cover only a fraction of the publicly accessible content.
• Because a crawler downloads only a portion of all Web pages, it is critical that
the downloaded fraction contains the most relevant pages rather than a random
sample. The crawler therefore needs a measure of importance for prioritizing the
set of URLs. The importance of a page is a function of its intrinsic quality and its
popularity in terms of links or visits. Strategies in use today include
breadth-first ordering and PageRank (Covered in Block 3, Graph Theory).
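To make the idea of prioritizing URLs concrete, here is a small sketch of a crawl frontier ordered by an importance score. The score used here (the number of in-links known so far), the class name `PriorityFrontier`, and the sample URLs and counts are all assumptions made for illustration. A real crawler could plug in PageRank or another metric instead.

```python
import heapq
import itertools


class PriorityFrontier:
    """Crawl frontier that always returns the highest-scored URL first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        _neg_score, _order, url = heapq.heappop(self._heap)
        return url

    def __bool__(self):
        return bool(self._heap)


# Hypothetical in-link counts discovered so far during a crawl.
in_links = {
    "https://example.com/blog/old-post": 1,
    "https://example.com/": 42,
    "https://example.com/products": 7,
}

priority_frontier = PriorityFrontier()
for url, count in in_links.items():
    priority_frontier.push(url, count)

crawl_order = []
while priority_frontier:
    crawl_order.append(priority_frontier.pop())
print(crawl_order)
```

If every page is given the same score, the tie-breaker counter makes this frontier behave like a plain first-in, first-out queue, which is the queue discipline used by breadth-first crawling.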
Steps for web data extraction
1. We start the process by sending an HTTP GET request and receiving the HTTP
response that contains the web content.
2. Once we have the HTML code of the web page, we parse it into a tree
structure and navigate that tree to the elements of interest.
3. Last, we extract the selected data from the HTML and save it to a file or a
database.
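The three steps above can be sketched with the Python standard library alone. This is a minimal illustration, not the approach used in the labs: the function names, the choice to extract headings, and the `course-lab-scraper` user agent are assumptions. The demonstration runs on an inline HTML string, so no network access is needed.

```python
import csv
import io
from html.parser import HTMLParser
from urllib.request import Request, urlopen


def fetch_html(url, user_agent="course-lab-scraper"):
    """Step 1: send an HTTP GET request and return the response body as text."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


class HeadingParser(HTMLParser):
    """Step 2: walk the HTML and collect the text of <h1>-<h3> headings."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.rows = []          # list of (tag, text) tuples
        self._current = None    # heading tag we are currently inside, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            self.rows.append((tag, "".join(self._buffer).strip()))
            self._current = None


def extract_headings(html):
    parser = HeadingParser()
    parser.feed(html)
    return parser.rows


def save_rows(rows, file_obj):
    """Step 3: write the extracted data as CSV."""
    writer = csv.writer(file_obj)
    writer.writerow(["tag", "text"])
    writer.writerows(rows)


# Offline demonstration on an inline page. With network access,
# html = fetch_html("https://example.com") would replace this string.
html = "<html><body><h1>Web Scraping</h1><p>Intro</p><h2>History</h2></body></html>"
rows = extract_headings(html)
output = io.StringIO()
save_rows(rows, output)
print(output.getvalue())
```

Passing an open file (for example `open("headings.csv", "w", newline="")`) instead of the `io.StringIO` buffer would write the same rows to disk.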
HTML review
Hypertext Markup Language (HTML) is the main language used to build web
pages. HTML describes the structure of a web page. It is typically combined with
Cascading Style Sheets (CSS), which describe the presentation of web pages (colors,
layout, and fonts), and with a scripting language such as JavaScript, which makes
websites interactive.
HTML tags
• HTML tags mark the start and
end of an HTML element. All tags
are enclosed in angle brackets < >.
• Most tags must be opened <start
tag> and closed </end tag>, although
some tags (for example <img>) do not
need to be closed.
• When a web browser reads an HTML
document, it is able to display its
content following the rules defined
by each tag.
HTML tags
• <!DOCTYPE html> defines the document type and the HTML version. The current version of HTML is HTML5.
• <html> root element of an HTML page
• <head> contains meta information about the page
• <body> contains the visible part of the page
• <title> specifies the title of the document. Seen on browser tab and search results
• <p> defines a paragraph, used for normal text
• <a> anchor link, i.e., hyperlink
• <h1> main page heading. <h2> … <h6> subheadings within text
• <img> images
• <ul> unordered list, bullet point lists. <ol> ordered list, numbered lists. <li> list items
• <table> , <tr> , <td> table, table row, table cell
• <div> division, a section of the page
Check full list here: https://www.w3schools.com/tags/
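To see how these tags nest inside one another, the following sketch feeds a small page to Python's built-in `html.parser` and prints each start tag indented by its depth. The page content, the `OutlineParser` class, and the `VOID_TAGS` set (a subset of the tags that take no closing tag) are invented for illustration.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag (a subset of HTML's void elements).
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}

PAGE = """<!DOCTYPE html>
<html>
<head><title>Demo page</title></head>
<body>
<h1>Main heading</h1>
<p>Some text with a <a href="https://example.com">link</a>.</p>
<ul><li>First</li><li>Second</li></ul>
<img src="logo.png" alt="Logo">
</body>
</html>"""


class OutlineParser(HTMLParser):
    """Record each start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1      # children of this tag are one level deeper

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


parser = OutlineParser()
parser.feed(PAGE)
print("\n".join(parser.outline))
```

The indented outline is the same tree structure that a scraper navigates when it looks for a particular element.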
HTML attributes
• The HTML tags can have attributes. The attributes contain additional
information about elements.
• HTML attributes are always specified in the start tag.
• Attributes usually come in name/value pairs like: name="value"
https://en.wikipedia.org/wiki/Document_Object_Model
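Attributes are often exactly what a scraper is after, for instance the `href` of a link or the `src` of an image. As a minimal sketch, Python's built-in `html.parser` passes the name/value pairs of each start tag to `handle_starttag`. The `AttributeCollector` class and the HTML snippet are hypothetical examples.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes of every start tag as a dictionary."""

    def __init__(self):
        super().__init__()
        self.elements = []   # list of (tag, {name: value}) pairs

    def handle_starttag(self, tag, attrs):
        # `attrs` arrives as a list of (name, value) pairs from the start tag.
        self.elements.append((tag, dict(attrs)))


snippet = ('<a href="https://example.com" target="_blank">Example</a>'
           '<img src="logo.png" alt="Logo">')
collector = AttributeCollector()
collector.feed(snippet)

for tag, attributes in collector.elements:
    print(tag, attributes)

# Keep only the link targets.
links = [attrs["href"] for tag, attrs in collector.elements
         if tag == "a" and "href" in attrs]
print(links)
```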
Web browser developer tools
• Every modern web browser includes a powerful suite of developer tools. These tools do
a range of things, from inspecting currently-loaded HTML, CSS and JavaScript to
showing which assets the page has requested and how long they took to load.
• Keyboard: Ctrl + Shift + I, except
  • Internet Explorer and Edge: F12
  • macOS: ⌘ + ⌥ + I
• Menu bar:
  • Firefox: Menu ➤ Web Developer ➤ Toggle Tools, or Tools ➤ Web Developer ➤ Toggle Tools
  • Chrome: More tools ➤ Developer tools
  • Safari: Develop ➤ Show Web Inspector. If you can't see the Develop menu, go to Safari ➤ Preferences ➤ Advanced, and check the "Show Develop menu in menu bar" checkbox.
  • Opera: Developer ➤ Developer tools
• Context menu: Press-and-hold/right-click an item on a webpage (Ctrl-click on the Mac), and choose Inspect Element from
the context menu that appears. (An added bonus: this method immediately highlights the code of the element you
right-clicked.)
Source: Wikipedia
Main challenges of web scraping
• Many websites have anti-scraping/crawling policies, making it more challenging
to gather information. They may block certain user agents from scraping their
pages, or block IP addresses outright, preventing any data collection. Other sites
limit content extraction, e.g., you may only be able to scrape a website once
per day.
• Some crawling/scraping jobs are very time-consuming and require scheduling in
advance or even adding asynchronous and parallel jobs to be able to extract the
information in a reasonable time.
• Depending on the scale of the project, these jobs may also require deploying
substantial infrastructure, which is another aspect to consider.
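Two of these challenges can be illustrated in a few lines of Python: respecting a site's crawling rules and downloading several pages in parallel. The sketch below assumes the site publishes its rules in a robots.txt file, which the slides do not mention explicitly. The `course-lab-scraper` user agent, the function names, and the sample URLs are hypothetical. The demonstration uses an inline robots.txt text and a fake downloader, so it runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "course-lab-scraper"   # hypothetical identifier for this sketch


def allowed_urls(robots_txt, urls, user_agent=USER_AGENT):
    """Keep only the URLs that the site's robots.txt permits for this agent."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return [url for url in urls if rules.can_fetch(user_agent, url)]


def fetch(url, user_agent=USER_AGENT):
    """Download one page, identifying the scraper through its User-Agent."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        return response.read()


def fetch_all(urls, fetch_one=fetch, max_workers=4):
    """Download several pages in parallel threads; returns {url: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch_one, urls)))


# Offline demonstration: robots.txt text and a fake downloader stand in
# for real network traffic.
robots_txt = "User-agent: *\nDisallow: /private/\n"
candidates = ["https://example.com/public/a", "https://example.com/private/b"]
permitted = allowed_urls(robots_txt, candidates)
pages = fetch_all(permitted, fetch_one=lambda url: f"<html>{url}</html>")
print(permitted)
print(pages)
```

Keeping `max_workers` small and adding a pause between requests to the same site reduces the chance of being blocked. Threads suit this task because the work is dominated by waiting on the network rather than by computation.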
Prepare yourself for the labs