
BACHELOR PAPER

Term paper submitted in partial fulfillment of the requirements


for the degree of Bachelor of Science in Engineering at the
University of Applied Sciences Technikum Wien - Degree
Program Business Informatics

Web Scraping
Data Extraction from websites

By: Vojtech Draxl


Student Number: 1310256058

Supervisor: DI Dr. Gerd Holweg

Wien, 04.02.2018
Declaration of Authenticity
“As author and creator of this work to hand, I confirm with my signature knowledge of the
relevant copyright regulations governed by higher education acts (see Urheberrechtsgesetz/
Austrian copyright law as amended as well as the Statute on Studies Act Provisions /
Examination Regulations of the UAS Technikum Wien as amended).

I hereby declare that I completed the present work independently and that any ideas, whether
written by others or by myself, have been fully sourced and referenced. I am aware of any
consequences I may face on the part of the degree program director if there should be
evidence of missing autonomy and independence or evidence of any intent to fraudulently
achieve a pass mark for this work (see Statute on Studies Act Provisions / Examination
Regulations of the UAS Technikum Wien as amended).

I further declare that up to this date I have not published the work to hand nor have I presented
it to another examination board in the same or similar form. I affirm that the version submitted
matches the version in the upload tool.”

Place, Date Signature


Abstract
Web Scraping is a set of methods that allows a user to collect information presented on the World Wide Web. The similar technology used by search engines, known as Web Crawling, is not discussed; however, the difference between the two techniques is explained.

This paper covers the available techniques and the developments in the recent history of Web Scraping. Legal aspects of Web Scraping are introduced. Currently available software tools are listed with a brief summary of their functionalities. The process of Web Scraping is explained in full through a practical example.

Keywords: Web Scraping, World Wide Web, Data Extraction, API


Kurzfassung
Unter Web Scraping werden Technologien verstanden, die der Gewinnung von Informationen
durch gezieltes Extrahieren der benötigten Daten vom World Wide Web dienen.
Suchmaschinen verwenden eine ähnliche Technologie, welche Web Crawling heißt. Der
Unterschied zwischen diesen wird geklärt.

Die vorliegende Arbeit behandelt die Frage, warum und wie man Web Scraping einsetzt. Der
rechtliche Aspekt dieser Aktivitäten wird vorgestellt. Derzeit verfügbare Software wird
aufgelistet und kurz beschrieben. Der praktische Teil dieser Arbeit befindet sich im vorletzten
Kapitel, in dem Web Scraping anhand eines Beispiels erläutert wird.

Schlagwörter: Web Scraping, World Wide Web, Datenextraktion

Acknowledgements
I would like to thank my family for their support. Special thanks to Simon Newby for being a kind and uncomplaining editor.

Table of Contents

1 Introduction ............................................................................................................ 6

2 General facts about Web Scraping ......................................................................... 7

3 Purpose of Web Scraping ....................................................................................... 9


3.1 Market analysis and research ................................................................................. 9
3.2 Enterprise technologies ........................................................................................ 10
3.3 Opinion Poll .......................................................................................................... 10
3.4 Human Resources Agencies ................................................................................ 10
3.5 Social Network mining .......................................................................................... 10
3.6 Government Services ........................................................................................... 11
3.7 Corporate spying .................................................................................................. 12
3.8 Social Mining and Sentiment Analysis .................................................................. 12

4 Methods of Web Scraping .................................................................................... 13


4.1 Manual Scraping .................................................................................................. 13
4.2 HTML Parsing ...................................................................................................... 13
4.3 DOM Parsing ........................................................................................................ 14
4.4 XPath ................................................................................................................... 14
4.5 APIs ..................................................................................................................... 15

5 Available Software Tools ...................................................................................... 16


5.1 Cloud Software ..................................................................................................... 16
5.1.1 Dexi.io .................................................................................................................. 16
5.1.2 Octoparse............................................................................................................. 17
5.1.3 Scrapinghub Platform ........................................................................................... 17
5.2 Desktop Software ................................................................................................. 17
5.2.1 ParseHub ............................................................................................................. 17
5.2.2 FMiner .................................................................................................................. 18
5.3 Programming libraries .......................................................................................... 18
5.3.1 Scrapy .................................................................................................................. 19
5.3.2 Goutte .................................................................................................................. 19
5.4 Browser Extensions .............................................................................................. 19

5.4.1 Outwit Hub ........................................................................................................... 19
5.4.2 Web Scraper ........................................................................................................ 20
5.5 Conclusion ........................................................................................................... 20

6 Hands on Web Scraping ....................................................................................... 21


6.1 Task ..................................................................................................................... 21
6.2 Prerequisites ........................................................................................................ 22
6.3 Selected Software ................................................................................................ 22
6.4 Preparation........................................................................................................... 22
6.5 Execution ............................................................................................................. 23
6.6 Results ................................................................................................................. 24
6.7 Conclusion ........................................................................................................... 24

7 Legal Aspects of Data Extraction .......................................................................... 25


7.1 United States ........................................................................................................ 26
7.2 European Union ................................................................................................... 27
7.2.1 General Data Protection Regulation ..................................................................... 27
7.2.2 Compliance of currently owned data with GDPR .................................................. 28
7.2.3 Data Breach reporting under GDPR ..................................................................... 29
7.2.4 Purpose limitation ................................................................................................. 29
7.3 Conclusion ........................................................................................................... 29

8 General conclusion ............................................................................................... 30

9 Bibliography ......................................................................................................... 31

1 Introduction
The World Wide Web consists of an interlinked network of information, which is presented to users through websites. The World Wide Web has significantly changed the way we share, collect, and publish data, and the amount of presented information grows constantly.

As data grows in amount, variety, and importance, business leaders must focus their attention on the data that matters the most. Not all data is equally important to businesses or consumers. The enterprises that thrive during this data transformation will be those that can identify and take advantage of the critical subset of data that will drive meaningful positive impact for user experience, solving complex problems, and creating new economies of scale. Business leaders should focus on identifying and servicing that unique, critical slice of data to realize the vast potential it holds. ((IDC), 2017, Data Age 2025, p. 4)

With the use of the Web as a new marketing and sales channel, the quantity of content has also multiplied. Online merchants offer large packs of data to describe their products, and knowledge base providers offer access to their databases.

IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (that is a trillion
gigabytes). That’s ten times the 16.1ZB of data generated in 2016. All this data will unlock
unique user experiences and a new world of business opportunities. ((IDC), 2017)

With this unorganized growth, it is no longer possible to manually track and record all available sources. This is the moment when Web Scraping evolved. Compared to manual data extraction, automated techniques allow the collection of massive amounts of data from the Web.

Together with Web Scraping, another term became very important – metadata. The massive collections of data obtained by Web Scraping make metadata analysis possible.

This bachelor thesis describes the most common reasons for Web Scraping and the legal aspects connected to this topic. Techniques of Web Scraping are presented, with appropriate explanations, in separate chapters. Currently available software tools are listed with a brief summary of their functionalities. The process of Web Scraping is explained in full through a practical example at the end of the paper.

2 General facts about Web Scraping
Several definitions of Web Scraping came up during the literature research. All three definitions presented below mention data extraction from multiple sources; they differ in the form of the initial sources for the extracted information.

Sometimes it is necessary to gather information from web sites that are intended for human
readers, not software agents. This process is known as “web scraping”. (Apress, 2009)

The first definition mentions data sources which were originally intended for human readers. This definition has proven to be outdated: with the evolution of automated techniques, extraction from software-readable sources also became possible. However, it must be taken into account that it was published in 2009, when Application Programming Interface (API) sources were still very limited. The public API directory on the ProgrammableWeb website (Berlind, 2015) listed approx. 750 sources at that time, compared to 17,175 listed in 2017.

Web scraping, also known as web extraction or harvesting, is a technique to extract data from
the World Wide Web (WWW) and save it to a file system or database for later retrieval or
analysis. Commonly, web data is scrapped utilizing Hypertext Transfer Protocol (HTTP) or
through a web browser. This is accomplished either manually by a user or automatically by a
bot or web crawler. Due to the fact that an enormous amount of heterogeneous data is
constantly generated on the WWW, web scraping is widely acknowledged as an efficient and
powerful technique for collecting big data. (Mooney, 2001)

The current state of affairs is portrayed more precisely by the second definition, where Web Scraping is mentioned as one of the sources for big data collection. The definition also mentions another term – Web Crawler.

Web Crawling is performed in a different way and with different outcomes. Figure 1 illustrates both activities. The process steps visible on the left side show that Web Crawling does not have a defined unique target and processes any available data without aiming at specific information. In comparison, the right side shows how the Web Scraper receives, processes and parses data from a specified source. Web Crawling is not covered in this paper.

Figure 1 Web Scraping vs. Web Crawling - Source Santosh Kalwar [6]

The definition below does not go into much detail. However, it captures the activities of Web Scraping most succinctly and precisely.

Web Scraping involves the process of querying a source, retrieving the results page and
parsing the page to obtain the results. (John J. Salerno, 2003)

3 Purpose of Web Scraping
Enormous amounts of source information available on the World Wide Web are still in the format of Hypertext Markup Language (HTML) pages. Automated extraction is difficult because the intended reader is a human. This chapter introduces the motivation and purpose of information extraction through Web Scraping.

Rapid growth of the World Wide Web has significantly changed the way we share, collect, and
publish data. Vast amount of information is being stored online, both in structured and
unstructured forms. Regarding certain questions or research topics, this has resulted in a new
problem—no longer is the concern of data scarcity and inaccessibility but, rather, one of
overcoming the tangled masses of online data. (B.C., 2016)

These utilizations are often only possible because of the existence of automated Web Scraping. Without these techniques, it would be impossible to collect such amounts of data repeatedly and in a reasonable time.

3.1 Market analysis and research

Data collection from online sources has become one of the market research methods. It offers a much faster response compared to classical surveying.

While Knowit considers it best to utilize traditional surveys, Web-scraping is seen as cost
effective support for such instruments. To get comprehensive picture and to gain knowledge
of the tools in markets multiple sources should be used. (Raulamo-Jurvanen P., 2016)

Consumers are active in the online world and share their experiences, frustrations and motivations. Companies that wish to learn more from consumers can add online sources of information; Web Scraping is one of the methods to collect such data.

Targeted data collection from e-shop and advertising servers helps to update indexes that are based on frequently changing prices. Indexes built through automated Web Scraping can offer more frequent update intervals.

With the increasing relevance and availability of on-line prices that we see today, it is natural
to ask whether the prediction of the consumer price index (CPI), or related statistics, may
usefully be computed more frequently than existing monthly schedules allow for.[13]

Wegmann and Chapple (2013) used a small sample of 338 Craigslist listings to study the
prevalence of secondary dwelling units in the San Francisco Bay Area. Finally, Feng (2014)
web-scraped 6,000 Craigslist listings to study Seattle’s housing market. (Daniel Glez-Peña, n.d.)

3.2 Enterprise technologies


Incompatible enterprise technologies are common in larger projects, yet a unified presentation of data from several systems is still necessary. In some specific cases the solution is based on Web Scraping. The example below demonstrates a possible approach to collecting and evaluating measurements from several independent sources:

Water utility company in Surabaya that called as PDAM Surabaya has few reservoirs in their
water supply system which are being monitored by WTW IQ SensorNet 2020 XT. However,
on that sensor devices while it can provide some of the water quality parameters value
information but the sensor is passive and the internal data is still stored in the sensor itself. To
solve the problem, we proposed an application of data logger to manage data collections online
water quality monitoring system by using web scraping. (Jakob G. Thomsen, 2015)

3.3 Opinion Poll


Movie producers collect data about their current blockbusters. Such data includes user feedback shared in reviews on movie portals. (Jie Yang & Brian Yecies, 2016)

3.4 Human Resources Agencies


Human resource (HR) departments in large companies process many job offers for their companies and try to match positions with prospective employees. It is not sufficient to use only incoming vacancy applications from candidates. HR departments also cooperate with third-party companies, which can offer them their own directories of professionals. Contact mining is an important activity for such agencies.

“hiQ is a tech startup which collects and analyzes public profile information on LinkedIn in order to provide its clients – mostly large companies – with insights about their employees, such as which employees are likely to be poached by a competitor or which skills its employees have.” (United States District Court, 2017)

3.5 Social Network mining


Social platforms provide trends and the level of reaction from users on many topics. Figure 2 composes 4.9 million traces of social media data related to 12 February 2017 (the evening of the Grammy Awards ceremony) into a graph. The spikes mark the most interesting moments as observed by the users of the social media services.

Figure 2 Analyses of social media spikes in GRAMMY evening (Joyce, 2017)

Social media (such as blogs, online social networks, microblogs) has become one of the major data sources for quantitative communication research over the past decade. By employing simple programming tools, researchers can extract relevant messages from social media platforms for various research purposes.


3.6 Government Services


The monitoring of criminal activities on social websites and specific forums is an important source of information for government agencies and law enforcement bodies. Sources documenting this usage are, understandably, not available; we can assume that it takes place based on patents filed by the United States Government, which are publicly listed:

Method and apparatus for improved web scraping. The United States Of America As
Represented By The Secretary Of The Air Force (John J. Salerno, 2003).

3.7 Corporate spying
In the corporate context, Web Scraping allows a company to review both its own appearance and that of its competitors in the headlines of news servers. A company can also collect details about competitors and even about its own employees.

However, while there are significant benefits to using website data through methods such as
web scraping or web mining in innovation research, the literature on the use and validity of
these approaches is relatively underdeveloped. (Huan Liu, 2016)

At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data
analysis in Business and Competitive Intelligence systems as well as for business process re-
engineering. (Emilio Ferraraa, 2014)

3.8 Social Mining and Sentiment Analysis

Social media is a new source of data that is significantly different from conventional ones.
Social media data sets are mostly user-generated, and are big, interlinked, and
heterogeneous.

Social media data can be obtained from publicly available sources via various means such as
scraping, using site-provided apps, and crawling. The possibility of obtaining social media data
makes research on social media data feasible. In Twitter Data Analytics, for example, a begin-
to-end process for Twitter data analytics is elaborated with four key steps: crawling, storing,
analyzing, and visualizing. Social media data are just like conventional data, which is a
potential treasure trove, but requires data mining to uncover hidden treasures. (Huan Liu,
2016)

4 Methods of Web Scraping
The methods of Web Scraping evolved together with the World Wide Web, and not all of the listed methods were available at the beginning. Two examples are worth mentioning here, because they are presently the most used techniques.

Since 2000, the Document Object Model (DOM) has become more popular through DHTML. Its broader acceptance later allowed the HTML Parsing technique to evolve into DOM Parsing.

The second example is Application Programming Interfaces (APIs). This technique is the youngest on the list; the growth of available content APIs dates from 2005. According to ProgrammableWeb.com, the number of APIs grew within 8 years from 0 to 10,302. (Berlind, 2015)

4.1 Manual Scraping

Manual scraping is still an option in specific situations. These situations are:

• when the amount of data is minimal,
• when the data being scraped does not require a repetitive task,
• when setting up automated scraping would take longer than the data collection itself,
• when security measures or specific characteristics of the website do not allow automated methods.

4.2 HTML Parsing

Web sites don’t always provide their data in comfortable formats such as .csv or .json files. HTML pages are created by the server as a response to a user’s request. At this point the server software is not relevant; rather, the output in the browser is important. An analysis of the HTML structure of the provided web page (a simple page sample is given in Figure 3) will show repeated elements. With a programming language script or a Web Scraping tool, each page with a similar pattern can be used as a source for data.

Figure 3 HTML Structure
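
To make the technique concrete, the following minimal sketch extracts such repeated HTML elements in Python with the third-party libraries requests and BeautifulSoup (an assumption of this sketch; any HTML parser would serve). The URL and the class names are hypothetical placeholders, since a real page dictates its own structure.

    import requests
    from bs4 import BeautifulSoup

    # Fetch the HTML page exactly as a browser would receive it.
    html = requests.get("http://example.com/products").text

    # Build a searchable parse tree from the raw HTML.
    soup = BeautifulSoup(html, "html.parser")

    # Repeated elements sharing one pattern become the data rows.
    for item in soup.find_all("div", class_="product"):  # hypothetical class
        name = item.find("h2").get_text(strip=True)
        price = item.find("span", class_="price").get_text(strip=True)
        print(name, price)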

4.3 DOM Parsing

Document Object Model (DOM) Parsing is an evolution of HTML Parsing, based on developments of the language and browsers which led to the introduction of the Document Object Model. The DOM is heavily used for Cascading Stylesheets (CSS) and JavaScript. The integration of the DOM revealed new possibilities for addressing specific parts of a webpage. Figure 4 shows containers with their own DOM addresses. These are used in Web Scraping for easier navigation through webpage content.

Figure 4 DOM Selector
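
As an illustration, the sketch below drives a real browser through Selenium (assuming the library and a matching browser driver are installed), so that the complete DOM is built, including content generated by JavaScript. The URL and the CSS selector are hypothetical.

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()            # any supported browser driver works
    driver.get("http://example.com/menu")   # hypothetical target page

    # Address containers through their DOM (CSS) selectors, as in Figure 4.
    for box in driver.find_elements(By.CSS_SELECTOR, "div.menu-item"):
        print(box.text)

    driver.quit()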

4.4 XPath

XPath (XML Path Language) provides a similar addressing possibility to the DOM. The name suggests usage for XML documents, but it is also applicable to the HTML format.

XPath requires a more precisely structured webpage than the DOM, and has the same ability to address segments within the webpage. Figure 5 shows the document structure as interpreted in XPath.

Figure 5 XPath Navigation
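
The same kind of addressing can be expressed as an XPath query. The short Python sketch below uses the lxml library (an assumption; any XPath-capable parser would do) with a hypothetical expression that selects the first cell of every row of a menu table.

    import requests
    from lxml import html

    # Parse the downloaded page into an XPath-navigable document tree.
    tree = html.fromstring(requests.get("http://example.com/menu").content)

    # Hypothetical expression: the first cell of every row of a table
    # whose id attribute is 'menu'.
    for cell in tree.xpath("//table[@id='menu']//tr/td[1]/text()"):
        print(cell.strip())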

4.5 APIs
Whilst the previous methods work by scraping human-readable outputs, an Application Programming Interface (API) expects an application as a communication partner. Thus APIs are often described as machine-readable interfaces (versus human-readable ones).

Even though APIs were introduced much later than the WWW, their growth has been very fast. The world of APIs is fragmented, so API directories were created to provide a simple overview and orientation. Most of the available APIs are registered and described in a directory with relevant links to the sources. Two examples of such directories are ProgrammableWeb (https://www.programmableweb.com) and APIs.guru (https://apis.guru/).

API directories also provide their own APIs, which allow users to search their databases for API sources. A standard HTTP request sent to an API endpoint returns an answer from the server. Each API has its own specification and options; the format of the answer can be set as an option in the request. The most widely used format for API communication is JSON.
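
As a sketch, this request-and-answer cycle can look as follows in Python, using the requests library; the endpoint, parameters and response layout are hypothetical, since every API defines its own specification.

    import requests

    # Standard HTTP request sent to a (hypothetical) API endpoint.
    response = requests.get(
        "https://api.example.com/v1/search",
        params={"q": "web scraping", "format": "json"},
    )
    response.raise_for_status()  # fail early on HTTP errors

    # The server answers with JSON, which maps directly to Python structures.
    for entry in response.json().get("results", []):  # hypothetical layout
        print(entry)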

5 Available Software Tools
This chapter introduces several published software tools for Web Scraping. We can assume that more tools exist which have been developed for the internal purposes of individuals and companies and thus have not been published. Therefore, due to the proliferation of applications and platforms, this list does not include all available software.

The following sections are formed based on software type:

• Cloud Software
• Desktop Software
• Programming libraries
• Browser Extensions

Each section introduces at least 2 publicly available solutions.

5.1 Cloud Software

Cloud solutions provide a user interface through a Web browser while the application backend resides on a cloud server. Such a configuration reduces the hardware requirements on the user's end to a minimum. Large amounts of data can be scraped without having one's own computer switched on, and larger projects can be created without additional hardware or extended internet bandwidth.

In some situations it is very important where the cloud is located. For example, some webpages may not be accessible from Europe in their full range; a cloud solution connected through a US-based ISP may be able to access other sections of such a website.

Cloud solutions are generally financed by subscription fees. The price depends on the allocated resources, and the owners offer monthly or annual payments.

5.1.1 Dexi.io
Website: https://dexi.io/
Free Tariff: 1 hour to test
Pricing Range: 120-700 USD/month

Dexi has an up-to-date, sleek user interface. The workflow starts with the creation of a robot. The webpage is presented in the background and the user needs to select the requested areas and define the robot's tasks by selecting actions in a drop-down menu. The free tariff option provides all functionalities, limited only by the period of use: Dexi sets a very strict evaluation period of just one hour.

5.1.2 Octoparse
Website: https://www.octoparse.com/
Free Tariff: Unlimited time, limit of 10 threads

Octoparse has a desktop client. The client communicates with the cloud and submits the tasks, which are then processed in the cloud. The user interface is very complex, with a lot of controls available at the first click. The user should first create a task and define its steps. The finished task then moves to the queue for execution. Execution occurs in the cloud, with the results being returned to the user in the desktop interface when the task has been completed.

5.1.3 Scrapinghub Platform


Website: https://scrapinghub.com
Free Tariff: Unlimited time, limit of 1 thread
Pricing Range: based on RAM usage

Scrapinghub offers a platform for creating tasks built on the Scrapy Framework and running them in the cloud. Additionally, visual programming is also available, which is an advantage for users who are not familiar with programming. The pricing is based on the RAM usage of the created scraping tasks. Use of the Scrapy Framework guarantees excellent portability of the created tasks to end-user hardware resources, or in the opposite direction, if the user's hardware resources lack the processing power to finish the Scrapy tasks.

5.2 Desktop Software

Website data are downloaded, parsed and saved locally. Desktop applications require a broadband internet connection; the internet link of the computer directly influences the processing time of a Web Scraping task. Compared to cloud software, desktop software requires more advanced hardware. A workstation with more than 8 GB of random access memory (RAM) is recommended as a minimum.

5.2.1 ParseHub
Website: https://www.parsehub.com/
Free Tariff: No time limit, max. 200 scraped pages
Pricing Range: 149-499 USD/month

ParseHub is a web scraping software that supports complicated data extraction from sites that use AJAX, JavaScript, redirects and cookies. Machine learning technology is used to read and analyze documents, and this analysis allows ParseHub to output the relevant data. The software is available as a desktop client for Windows, Mac and Linux. Additionally, there is a web application that can be used within the browser. Within the free tariff it is possible to scrape up to 200 pages.

5.2.2 FMiner
Website: https://www.fminer.com/
Free Tariff: Trial for 15 days
Pricing Range: 138-248 USD/month

FMiner combines visual configuration with scripting features. Steps are recorded in the same way as in manual web scraping from a website, while FMiner simultaneously adds the steps to a macro. A visual interpretation of such a macro is captured in Figure 6; in that case the website requires a selection from a dropdown menu and a wait time of 2 seconds. The visual dashboard makes extracting data from websites intuitive. Whether you want to scrape data from simple web pages or carry out complex data fetching projects, both scenarios are possible with FMiner.

Figure 6 FMiner Script

5.3 Programming libraries


Ready-built Web Scraping software tools are much easier to use than programming one's own web scraping setup, but for some purposes the available software products are not the ideal solution. However, when a developer is appointed to create a custom Web Scraping solution, it is not necessary to start from zero. Frameworks allow developers to save time by re-using generic modules in order to focus on the specific areas for which an off-the-shelf solution isn't suitable.

Frameworks for 5 programming languages are presented in Figure 7, which visualizes the availability of three basic functionalities.

Figure 7 Features of Programming Libraries (Adamuz, 2015)

5.3.1 Scrapy
Website: https://scrapy.org/
Pricing Range: Open Source

Scrapy is one of the most advanced web scraping frameworks available. The framework is written in Python. It is an application framework and provides many commands for creating one's own applications and using them for Web Scraping projects.

Scrapy is a web crawling framework for developer to write code to create spider, which define
how a certain site (or a group of sites) will be scraped. The biggest feature is that it is built on
Twisted, an asynchronous networking library, so Scrapy is implemented using a non-blocking
(aka asynchronous) code for concurrency, which makes the spider performance is very great.
(MichaelYin, 2017)

Extensive documentation is available to support developers who wish to integrate it into their
own applications. The choice between a Local Desktop Application or a Cloud Solution is not
a limitation, both can be powered by Scrapy code. One of the presented Cloud Solutions
encourages the user to move their Scrapy applications over to cloud infrastructure.
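
For illustration, a minimal Scrapy spider might look as follows; the start URL and the CSS selectors are hypothetical placeholders for a real target site.

    import scrapy

    # A minimal sketch of a spider that defines how one site is scraped.
    class MenuSpider(scrapy.Spider):
        name = "menu"
        start_urls = ["http://example.com/menu"]  # hypothetical source

        def parse(self, response):
            # Scrapy calls parse() for every downloaded page; the yielded
            # dictionaries become the scraped items.
            for row in response.css("div.menu-item"):  # hypothetical selector
                yield {
                    "day": row.css("h3::text").get(),
                    "menu": row.css("p::text").get(),
                }

Such a spider can be run with the "scrapy runspider" command, which can also export the collected items, for example, to a JSON file.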

5.3.2 Goutte
Website: https://github.com/FriendsOfPHP/Goutte
Pricing Range: Open Source

Goutte is a screen scraping and web crawling library for PHP. It provides a nice API to crawl websites and extract data from the HTML/XML responses. Goutte makes use of the Symfony BrowserKit component, which simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically. The developer webpage does not offer advanced documentation.

5.4 Browser Extensions


Browser extensions belong to the entry-level category of Web Scraping software. These tools are capable enough if data are scraped from a limited number of websites for quick research or a small project. They are available in the browser-specific add-on stores.

5.4.1 Outwit Hub


Website: http://outwit.com/
Available: Firefox add-ons store
Pricing Range: Freeware

Once installed and activated, Outwit Hub adds web scraping capabilities to the previously installed browser. It offers ready-made templates for the extraction of pictures, tables, lists and texts. Extracting data from sites using Outwit Hub does not demand advanced skills, and the setup is not complex. Enough documentation to refer to is provided on the Outwit Hub website. Outwit Hub was discontinued as a Firefox add-on and the newest versions are available only as standalone applications.

5.4.2 Web Scraper


Website: http://webscraper.io/
Available: Google Chrome Store
Pricing Range: Freeware

Web Scraper is a browser extension available for Google Chrome that can be used for web scraping. A sitemap defines how a website should be navigated and what data should be extracted. The extension can scrape multiple pages simultaneously and even has dynamic data extraction capabilities. Web Scraper can also handle pages with JavaScript and Ajax, which makes it more powerful. The tool extracts data to a CSV file or to CouchDB. For larger Web Scraping jobs the developer offers a cloud version; the Google Chrome extension is also used as a test environment before scraping jobs are assigned to the cloud version.

5.5 Conclusion
It is difficult to compare the listed tools to each other. Some are meant for hobbyists and some
are suitable for enterprises.

Browser extensions belong to the hobbyists category. To collect data from a few websites for
a brief research or project, these tools are more than enough.

Desktop applications have a broad set of functionalities, which is reflected in their high price. Cloud solutions offer a similar set of functions. Compared to desktop software, cloud solutions offer 3 main advantages:

• monthly billing based on resource usage instead of a high initial investment
• low hardware requirements on one's own server/PC
• own bandwidth is used only for job setup and control

Cloud Software currently offers the most accessible way to begin and test web scraping
options.

The most complex option is frameworks. Here the language used and the available documentation are important for prospective developers.

6 Hands on Web Scraping
The topic of Web Scraping might not be fully clear from the description alone. Therefore, this chapter offers a guide through a practical demonstration.

6.1 Task
For demonstration purposes, the websites of 5 canteens based in Wien are targeted. The weekly lunch menus should be collected. Each of the canteens publishes the data on its own website.

• http://www.kantine.at/speisen/Menu
• http://www.mq-kantine.at/tagesteller.php?pageid=104
• http://www.magdas-kantine.at/
• https://www.aera.at/wochenmenue/
• http://www.restaurant-wild.at/wochenmenue.php

The goal is to extract the data from the different websites (Figure 8). As a result, a JSON file with all lunch menus is expected.

Figure 8 Different structures of the websites

The JSON format was selected for the result because the extracted data should be usable as a data source for other applications; the suggested format is presented in Figure 9. Such data could be used as an information source for an info screen or a local restaurant directory. To keep the results updated, it is important to repeat the web scraping task at least twice a week.

Figure 9 JSON Format of the final result:
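
To make the suggested format concrete, the following Python sketch shows one possible shape of such a result; the field names are hypothetical and may differ from those used in Figure 9.

    import json

    # Hypothetical structure: one object per canteen, the week days as the
    # main items and the two menus as the related items.
    result = [
        {
            "canteen": "kantine.at",
            "days": [
                {"day": "Monday", "menu1": "Soup of the day", "menu2": "Pasta"},
                # ... remaining week days
            ],
        },
        # ... remaining canteens
    ]

    print(json.dumps(result, indent=2, ensure_ascii=False))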

6.2 Prerequisites
• Access to the selected websites
• Installed ParseHub software – downloaded at https://www.parsehub.com/
• Sufficient bandwidth of the internet connection – this use case is not bandwidth intensive; the available 75 Mbit/s down and up won't be fully utilized
• Sufficient Central Processing Unit (CPU) and RAM capacity – this use case is not intensive on computing power; a Xeon 3.3 GHz with 12 GB RAM would suffice even for a scaled-up solution

6.3 Selected Software

From the list presented in the previous chapter, the recently published desktop application ParseHub was selected. The following aspects were important for the required task:

• Fast learning curve – the user interface is based on visual controls
• Simple setup – standalone desktop application
• Available documentation – the product website contains a video-based tutorial

6.4 Preparation
ParseHub collects the data from websites based on templates. As a first step, we need to define the URL and start to define the template. At this point it is necessary to provide a structure: common usage requires defining the main object and binding the related ones to it. In this case the main object is a week day and the related objects are Menu 1 and Menu 2.

Figure 10 shows the ParseHub user interface during the preparation. It allows the selection to be made in a straightforward way, and the result of each selection is immediately recognizable on the screenshot of the webpage itself.

Figure 10 ParseHub Visualization of the template on website

6.5 Execution

Once the template for the webpage was created, the ParseHub application had a description of the data areas which we want to collect. Additionally, related content such as images and metadata was set to be collected in the ParseHub software. Figure 11 shows how to proceed with 5 different sources: each separate page requires a new template.

Our data set contains 5 templates (canteens), each with 5 items (week days) and 2 related items (Menu 1, Menu 2); due to this, the execution of the task took less than 5 minutes.

Figure 11 ParseHub Template definition

To keep the results updated, it is important to repeat the web scraping task at least twice a week. Unfortunately, such scheduling settings are available only with a paid subscription of the ParseHub software.

6.6 Results

The application presented the results in the JSON format, as this was selected during the configuration. Figure 12 shows highlighted sections on 2 data rows.

Figure 12 Menus from 5 different canteens formatted as JSON string

6.7 Conclusion
The practical example has shown how to use different websites as data sources for one's own applications.

The purpose of canteen menu collection requires periodic updates of the web scraping task. Therefore, the selection of a desktop application was not ideal: it would require manual action to keep the data source updated. The proper solution for such a case would be a cloud application.

In the example, the websites' terms of use were not examined. This would be necessary if such a task were performed for a real purpose. The next chapter describes the legal aspects and the necessary steps.

7 Legal Aspects of Data Extraction
The use of data obtained by Web Scraping is questioned by several website owners. Individuals and companies active in Web Scraping have a simple legal defense strategy: websites are filled with data which was published with the clear intention of being seen by others. (United States District Court, 2017)

For example, in the case of the website LinkedIn, the main business plan is to publish profiles and derive income by charging for a portion of the services sold to users. The goal of users is to expose their CVs to prospective employers. Therefore, keeping the data private would not comply with the business plan, and by extension any available data is thus public.

Web Scraping could possibly breach several laws or contract terms. Table 1 explains the areas of law in which a Web Scraper should be prepared for a defensive action. (Daly, 2016)

Website terms and conditions: Many companies expressly prohibit scraping within their website terms and conditions. Whether they can enforce such terms is still unclear but, depending on circumstances, a claim for breach of contract is possible.

Copyright, Intellectual Property Rights: As scraping involves copying, it may lead to a claim for copyright infringement. Whether such a claim has any merits will depend on the particular circumstances, because not all scraped data qualifies for copyright protection. Works which are copyright protected include original literary and artistic works such as computer programs, website graphics, and photographs.

Database rights: A database right is infringed when all or a substantial part of a database is extracted or re-utilised without the owner's consent. The repeated extraction or re-utilisation of insubstantial parts of a database which conflicts with the normal use of the database may also infringe database rights. Infringement of database rights may also apply when scraping directories or listings from third-party websites if the owner has incurred costs in developing and maintaining them.

Trademarks: If the scraper reproduces a website owner's (registered or unregistered) trademarks without their consent, the website owner could take an action claiming trademark infringement and/or passing off. Passing off prohibits a third party from selling goods or carrying on business under a name, mark, description, or in any other way that is likely to mislead, deceive or confuse the public into believing that the merchandise or business is that belonging to the brand owner.

Data protection: Businesses looking to use automated scraping techniques to collect information about individuals should be aware that they risk breaching local data protection law if they collect "personal data" (that is, any information that identifies a living individual). The central issue is whether individuals have consented to their personal data being collected. Although data obtained from one website may in isolation not be personal data, when it is aggregated from multiple websites, a business may inadvertently find itself in possession of personal data without the consent of the individual concerned. Use of such personal data will infringe data protection laws.

Criminal Damage: It is an offence to cause criminal damage to a computer (including damage to data) or to use a computer to access data without authorisation. Accordingly, data scraping could be a criminal offence where the website owner has not authorised access to the data.

Table 1 Law Topics related to Web Scraping (Daly, 2016)

In the case of a lawsuit, especially when dealing with global companies, it is very important in which region the suit is brought to court for examination, since each jurisdiction has different laws. This chapter covers the United States and the European Union.

7.1 United States

A recent lawsuit between the companies “LinkedIn” and “hiQ Labs” illustrates a possible outcome for cases opened because of Web Scraping activities. Below is a cited part of the hiQ Labs defense against the accusation.

“hiQ is a tech startup which collects and analyzes public profile information on LinkedIn in order
to provide its clients – mostly large companies – with insights about their employees, such as
which employees are likely to be poached by a competitor or which skills its employees have.
hiQ does not analyze the private sections of LinkedIn, such as profile information that is only
visible when you are signed-in as a member, or member private data that is visible only when
you are “connected” to a member. Rather, the information that is at issue here is wholly public
information visible to anyone with an internet connection.” (United States District Court, 2017)

Web Scraping activity can reach a point where it violates the Computer Fraud and Abuse Act (CFAA). The CFAA was designed to prevent unauthorized access to websites and servers. Individual states in the USA have their own laws with requirements similar to the CFAA.

The CFAA imposes criminal penalties on a party who “intentionally accesses a computer
without authorization or exceeds authorized access, and thereby obtains information from any
protected computer.” (Hirschey, 2014)

7.2 European Union
In comparison to the United States, the legal arguments in Europe are different. Instead of arguing unauthorized excessive usage, plaintiffs rely on intellectual property rights.

In Ryanair Ltd v PR Aviation BV [2015] the Court of Justice of the European Union (“CJEU”)
held that no intellectual property rights subsisted in the scraped data (Ryanair’s database of
flight times and prices) and therefore the company scraping the data had not infringed
Ryanair’s IP. This was because the database was not the result of the requisite creative input
necessary to be afforded copyright protection. The CJEU made it clear, however, that it is
possible for a website owner to restrict the re-use of the mined data through its terms and
conditions. This is therefore something that companies should bear in mind – if they access a
website and consent to the terms of use which contain a restriction on the re-use of the website
data, if they do go on to re-use that data they may be liable for breach of contract. (Rezai,
2017)

7.2.1 General Data Protection Regulation

The General Data Protection Regulation, publicly known by the abbreviation GDPR, has been a heavily discussed topic since it was adopted by the European Parliament and the European Council. EU member states already have their own data protection laws in action, and an EU-wide directive was already in place. Compared to Directive 95/46/EC (the Data Protection Directive), which it replaces, the GDPR seeks to extend the reach of EU data protection law.

The key aim of the GDPR is to harmonize data protection law throughout the EU. It sets rules about how personal information can be used by companies, governments and public entities. The time period dedicated to implementation ends in May 2018; Table 2 outlines the exact schedule of the GDPR implementation.

Date            Action
April 2016      Adopted by the European Parliament and the European Council
May 2016        Published in the EU Official Journal
25th May 2018   First day on which the GDPR is applied

Table 2 GDPR Timeline.

The text of the GDPR sets out the rights of individuals regarding access to the data held about them by organizations and companies. Individuals are the weaker party in such a process; to enforce their rights, new fines are defined. The responsibility of organizations is strictly defined: they must obtain the consent of the people they collect information about.

Any company that uses personal data from EU citizens or offers goods or services to them must attain GDPR compliance. Table 3 lists examples that would qualify for needing to comply with the GDPR’s information privacy rules. The GDPR applies irrespective of whether the actual data processing takes place in the EU or not.

Table 3 GDPR Guidance of Application (Ruth Boardman, 2017)

The GDPR applies to organizations:

• located within the EU;
• located outside of the EU, if they offer goods or services to (even for free), or monitor the behavior of, EU residents;
• processing and holding personal data of EU residents, regardless of the organization’s location.

Examples of activities that qualify:

• Collection of email addresses and sending out email messages
• Data harvesting from European citizens, which is kept or sold
• Processing or storage of EU citizens’ personal data as a service for a different company
• Selling goods or services to EU consumers
• Data collection from mobile devices owned by European citizens
• Data collection from mobile devices owned by European citizens, when they travel outside of the EU

7.2.2 Compliance of currently owned data with GDPR

A crucial point for all data owners is that there is no exception for data captured before the GDPR. Once the GDPR becomes applicable, it will no longer be legally possible to process data that is in violation, even if it was not violating the law when it was collected. It is defined as the responsibility of the data owners to review and revise their datasets in order to verify that they fall within the bounds of the law and that a valid consent from the affected persons is in place. Such consent can be dated before May 2018 but has to fulfill all requirements set by the GDPR.

7.2.3 Data Breach reporting under GDPR

Web Scraping might be affected by the GDPR requirement to report a data breach immediately. When a company or individual starts any scraping activity, they should consider the impact on the targeted website. Webpages may implement guarding mechanisms to fulfill the requirement of immediate data breach reporting; otherwise the website owner would face penalties.

Article 33 of the GDPR deals separately with the event of data breaches:

„In the case of a personal data breach, the controller shall without undue delay and, where
feasible, not later than 72 hours after having become aware of it, notify the personal data
breach to the supervisory authority competent in accordance with Article 55, unless the
personal data breach is unlikely to result in a risk to the rights and freedoms of natural persons.
Where the notification to the supervisory authority is not made within 72 hours, it shall be
accompanied by reasons for the delay.“ (Union, 2016)

7.2.4 Purpose limitation

The collection of personal data is possible only with the consent of the affected person. Such consent must list specified, explicit and legitimate purposes, and the collected data cannot be further processed in a way incompatible with those purposes.

Further processing of personal data for archiving purposes in the public interest, or scientific
and historical research purposes or statistical purposes shall not be considered incompatible
with the original processing purposes. However, conditions in Article 89(1) (which sets out
safeguards and derogations in relation to processing for such purposes) must be met. (Union,
2016)

7.3 Conclusion
In summary, there is no specific piece of legislation which forbids the use of Web Scraping to gather information. The website owners, however, may have legal rights against the scraping company under intellectual property law and contract law.

Each case will turn on its own facts though and this is very much dependent upon what
information is scraped from the websites. Companies should beware of contractual provisions
which they have agreed to in respect of a website’s terms of use – these may prohibit the user
from taking and using the data off the site. If the data being scraped includes personal data,
then compliance with data protection law must also be borne in mind. The only way to be truly
certain that the rights of a website owner have not been infringed is to obtain their express
consent to the screen scraping and subsequent use of the information. (Rezai, 2017)

The overall outcome of forbidding Web Scraping could be negative for the website owner: possibly fewer visitors, fewer links from content aggregator websites and less income from advertising.

Thus, Data hosts should only use legal actions against scrapers when the scraper presents a
threat to the data host’s core business and the data host has a strong enough claim to prevail
legally against the scraper.

From the legal perspective, it is necessary to adjust the terms of use on the websites. A restriction of Web Scraping techniques can be included directly. Such a step does not require many resources and allows a direct argument in court.

8 General conclusion

Depending on the primary purpose, different Web Scraping techniques can be used, taking the amount of data, the periodicity and the required outcome into consideration.

Web scrapers have a broad selection of tools to choose from. A project does not consist only of the technical solution and its execution: in all Web Scraping projects, the legal aspect of the specific job should be examined and the necessary steps identified.

Data hosts should always assess the benefit scrapers can provide and take a pragmatic approach to those who scrape their data.

Web scrapers should keep a connection to the data hosts and allow the identification of the data host as the source of the presented information.

9 Bibliography
1. IDC (International Data Corporation), 2017. Data Age 2025. Framingham, USA: IDC.

2. Adamuz, P. L., 2015. Development of a generic test-bed for web scraping, Barcelona:
European Education and Training Accreditation Center.

3. Apress, 2009. Using Web Scraping to Create Semantic Relations. Scripting


Intelligence, pp. 205-228.

4. B.C., B., 2016. Scraping Data. In: Data Wrangling with R. Use R!.. Cham: Springer.

5. Berlind, D., 2015. APIs Are Like User Interfaces--Just With Different Users in Mind.
[Online]
Available at: https://www.programmableweb.com/news/api-economy-delivers-
limitless-possibilities/analysis/2015/12/03
[Accessed 17 November 2017].

6. Daly, M., 2016. Dublin Globe - Legal briefs: 6 Reasons Why Not to Scrape Data.
[Online]
Available at: http://www.dublinglobe.com/community/toolbox/legal-briefs-6-reasons-
not-scrape-data
[Accessed 30 November 2017].

7. Glez-Peña, D., Reboiro-Jato, M. & Fdez-Riverola, F., n.d. Web scraping technologies in an API world. s.l.: s.n.

8. Ferrara, E., De Meo, P., Fiumara, G. & Baumgartner, R., 2014. Web data extraction, applications and techniques: A survey. Knowledge-Based Systems, vol. 70, pp. 301-323.

9. Hai Liang, J. J. H. Z., 2017. Big Data, Collection of (Social Media, Harvesting).. In:
The International Encyclopedia of Communication Research Methods. s.l.:s.n., pp. 1-
18.

10. Hirschey, J. K., 2014. Symbiotic Relationships: Pragmatic Acceptance of Data


Scraping. Berkeley Technology Law Journal, 29(4).

11. Huan Liu, F. M. J. T. R. Z., 2016. The good, the bad, and the ugly: uncovering novel
research opportunities in social media mining. International Journal of Data Science
and Analytics, 1(3-4), pp. 137-143.

12. Jakob G. Thomsen, E. E. B. a. M. S., 2015. WebSelF: A Web Scraping Framework,


s.l.: IT University of Copenhagen.

13. Jie Yang & Brian Yecies, 2016. Mining Chinese social media UGC: a big-data
framework for analyzing Douban movie reviews. Journal of Big Data.

14. John J. Salerno, D. M. B., 2003. Method and apparatus for improved web scraping.
United States of America, Patent no. US 7072890 B2.

15. Joyce, G., 2017. Data Reveals the GRAMMYs 2017 Highlights on Social Media.
[Online]
Available at: https://www.brandwatch.com/blog/react-grammys-2017-highlights/
[Accessed 2 January 2018].

16. Kalvar, S., 2017. Is scraping and crawling to collect data illegal?. [Online]
Available at: https://www.quora.com/Is-scraping-and-crawling-to-collect-data-illegal
[Accessed 24 November 2017].

17. Kriesel, D., 2016. Auch Spiegelredakteure feiern Weihnachten. Eine Analyse von
70.000 SpiegelOnline-Artikeln, http://www.dkriesel.com/spiegelmining: s.n.

18. MichaelYin, 2017. Scrapy Tutorial. [Online]


Available at: https://blog.michaelyin.info/scrapy-tutorial-series-web-scraping-using-
python/
[Accessed 22 December 2017].

19. Mooney, C., 2001. s.l.: Bar Ilan University.

20. Raulamo-Jurvanen P., K. K. M. M., 2016. Using Surveys and Web-Scraping to Select
Tools for Software Testing Consultancy. In: Lecture Notes in Computer Science, vol
10027. Cham: Springer.

21. Rezai, A., 2017. Beware of the Spiders: Web Crawling and Screen Scraping – the
Legal Position. [Online]
Available at: https://parissmith.co.uk/blog/beware-spiders-web-crawling-screen-
scraping-legal-position/
[Accessed 28 November 2017].

22. Rizqi Putri Nourma Budiarti, Nanang Widyatmoko, Mochamad Hariadi & Mauridhi
Hery Purnomo, 2016. Web scraping for automated water quality monitoring system: A
case study of PDAM Surabaya. Lombok, Indonesia, IEEE.

23. Ruth Boardman, J. M. A. M., 2017. Guide to the General Data Protection Regulation.
London: Bird & Bird LLP.

24. Union, E., 2016. GDPR Law - Article 33. [Online]


Available at: https://www.privacy-regulation.eu/en/33.htm
[Accessed 29 November 2017].

25. United States District Court, N. D. o. C., 2017. COMPLAINT FOR DECLARATORY
JUDGMENT. [Online]
Available at:
https://ia801501.us.archive.org/10/items/gov.uscourts.cand.312704/gov.uscourts.can
d.312704.1.0.pdf
[Accessed 1 December 2017].

26. Waddell, G. B. a. P., 2016. New Insights into Rental Housing Markets across the
United States: Web Scraping and Analyzing Craigslist Rental Listings. Journal of
Planning Education and Research, pp. 1-20.

List of Figures
Figure 1 Web Scraping vs. Web Crawling - Source Santosh Kalwar [6] ............................... 8
Figure 2 Analyses of social media spikes in GRAMMY evening (Joyce, 2017) .....................11
Figure 3 HTML Structure ......................................................................................................13
Figure 4 DOM Selector .........................................................................................................14
Figure 5 XPath Navigation ....................................................................................................14
Figure 6 FMiner Script ..........................................................................................................18
Figure 7 Features of Programming Libraries (Adamuz, 2015) ..............................................18
Figure 8 Different structures of the websites.........................................................................21
Figure 9 JSON Format of the final result: .............................................................................22
Figure 10 ParseHub Visualization of the template on website ..............................................23
Figure 11 ParseHub Template definition ..............................................................................23
Figure 12 Menus from 5 different canteens formatted as JSON string ..................................24

List of Tables
Table 1 Law Topics related to Web Scraping (Daly, 2016) ...................................................26
Table 2 GDPR Timeline........................................................................................................27
Table 3 GDPR Guidance of Application (Ruth Boardman, 2017) ..........................................28

List of Abbreviations
WWW World Wide Web
GDPR General Data Protection Regulation (referencing to EU Law)
HTTP Hypertext Transfer Protocol
HTML Hypertext Markup Language
CPI Consumer Price Index
API Application Programming Interface
DOM Document Object Model
XPATH XML Path Language

