
A

Seminar Report
On

“A review on Web Scraping and its Applications”


Submitted
In partial fulfillment of the requirement for the award of degree of
Bachelor of Technology
In
Computer Science & Engineering

(Session 2023-2024)

Submitted to:
Mr. Abhishek Jain
Assistant Professor

Submitted by:
Manoj Kumar Khandelia (20EJCCS162)
VII Semester

Department of Computer Science & Engineering


Jaipur Engineering College & Research Centre, Jaipur
Rajasthan Technical University
Shri Ram ki Nangal, via Sitapura RIICO, Jaipur - 302 022
Academic Year 2023-2024

Table of Contents

Candidate's Declaration
Bonafide Certificate
Preface
Vision and Mission
Program Outcomes (POs)
Program Educational Objectives (PEOs)
Program Specific Outcomes (PSOs)
Course Outcomes (COs)
Mapping: COs and POs
Acknowledgement
Abstract
1. Introduction
2. General Facts about Web Scraping
3. Purpose of Web Scraping
   3.1 Market Analysis and Research
   3.2 Enterprise Technologies
   3.3 Opinion Poll
   3.4 Human Resources Agencies
   3.5 Social Network Mining
   3.6 Government Services
   3.7 Corporate Spying
   3.8 Social Mining and Sentiment Analysis
4. Methods of Web Scraping
   4.1 Manual Scraping
   4.2 HTML Parsing
   4.3 DOM Parsing
   4.4 XPath
   4.5 APIs
5. Available Software Tools
   5.1 Cloud Software
       5.1.1 Dexi.io
       5.1.2 Octoparse
       5.1.3 Scrapinghub Platform
   5.2 Desktop Software
       5.2.1 ParseHub
       5.2.2 FMiner
   5.3 Programming Libraries
       5.3.1 Scrapy
       5.3.2 Goutte
   5.4 Browser Extensions
       5.4.1 Outwit Hub
       5.4.2 Web Scraper
   5.5 Conclusion
6. Hands on Web Scraping
   6.1 Task
   6.2 Prerequisites
   6.3 Selected Software
   6.4 Preparation
   6.5 Execution
   6.6 Results
   6.7 Conclusion
7. Legal Aspects of Data Extraction
   7.1 United States
   7.2 European Union
       7.2.1 General Data Protection Regulation
       7.2.2 Compliance of currently owned data with GDPR
       7.2.3 Data Breach reporting under GDPR
       7.2.4 Purpose Limitation
   7.3 Conclusion
8. General Conclusion
9. Book Scraper Project
10. Bibliography

Table Index

1. Law Topics Related to Web Scraping
2. GDPR Timeline
3. GDPR Guidance of Application

Figure Index

1. Web Scraping vs Web Crawling
2. Analysis of social media spikes in Grammy evening
3. HTML Structure
4. DOM Selector
5. XPath Navigation
6. Features of Programming Libraries
7. Different structures of the websites
8. JSON Format of the final result
9. ParseHub Visualization of the template on website
10. ParseHub Template definition
11. Menus from 5 different canteens formatted as JSON string

CANDIDATE’S DECLARATION

I hereby declare that the report entitled "A review on Web Scraping and its Applications" has been carried out and submitted by the undersigned to the Jaipur Engineering College & Research Centre, Jaipur (Rajasthan) as an original work, conducted under the guidance and supervision of Mr. Abhishek Jain.

The empirical findings in this report are based on data collected by me. I have not reproduced material from any report of the University, neither of this year nor of any previous year.

I understand that any such reproduction of another's original work is liable to be punished in any way the University authorities deem fit.

Date: 7 December, 2023
Place: Jaipur

Manoj Kumar Khandelia
20EJCCS162

BONAFIDE CERTIFICATE

This is to certify that the submitted seminar report is the outcome of the seminar work entitled "A review on Web Scraping and its Applications", carried out by Manoj Kumar Khandelia bearing Roll No. 20EJCCS162 under my guidance and supervision, for the award of the degree of Bachelor of Technology of Jaipur Engineering College & Research Centre, Jaipur (Raj.), India during the academic year 2023-2024.
To the best of my knowledge the report
i) embodies the work of the candidate,
ii) has duly been completed,
iii) fulfills the requirement of the ordinance relating to the Bachelor of Technology degree of Rajasthan Technical University, and
iv) is up to the desired standard for the purpose for which it is submitted.

Dr. Sanjay Gaur                          Mr. Abhishek Jain
Head of Department                       Assistant Professor
Computer Science & Engineering           Computer Science & Engineering
JECRC, Jaipur                            JECRC, Jaipur

Place: Jaipur
Date: 7 December, 2023

PREFACE

Bachelor of Technology in Computer Science & Engineering is a four-year Rajasthan Technical University course (approved by AICTE). As a prerequisite of the syllabus, every student on this course has to prepare a seminar report in order to complete his or her studies successfully, and the report must be submitted upon its completion.

The main objective of this report is to create awareness regarding the application of theories in the
practical world of Computer Science & Engineering and to give a practical exposure of the real world
to the student.

I, therefore, submit this seminar report on “A review on Web Scraping and its Applications”, which
was undertaken at JECRC, Jaipur. I feel great pleasure to present this seminar report.


VISION OF CSE DEPARTMENT

To become a renowned centre of excellence in computer science and engineering and to produce competent engineers and professionals with high ethical values, prepared for lifelong learning.

MISSION OF CSE DEPARTMENT

M1. To impart outcome based education for emerging technologies in the field of computer science
and engineering.
M2. To provide opportunities for interaction between academia and industry.
M3. To provide platform for lifelong learning by accepting the change in technologies.
M4. To develop aptitude of fulfilling social responsibilities.

PROGRAM OUTCOMES (POs)

1. Engineering knowledge: Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
2. Problem analysis: Identify, formulate, research literature, and analyze complex engineering
problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and
engineering sciences.
3. Design/development of solutions: Design solutions for complex engineering problems and design
system components or processes that meet the specified needs with appropriate consideration for the
public health and safety, and the cultural, societal, and environmental considerations.


4. Conduct investigations of complex problems: Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
5. Modern tool usage: Create, select, and apply appropriate techniques, resources, and modern
engineering and IT tools including prediction and modeling to complex engineering activities with an
understanding of the limitations.
6. The engineer and society: Apply reasoning informed by the contextual knowledge to assess
societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the
professional engineering practice.
7. Environment and sustainability: Understand the impact of the professional engineering solutions
in societal and environmental contexts, and demonstrate the knowledge of, and need for sustainable
development.
8. Ethics: Apply ethical principles and commit to professional ethics and responsibilities and norms of
the engineering practice.
9. Individual and team work: Function effectively as an individual, and as a member or leader in
diverse teams, and in multidisciplinary settings.
10. Communication: Communicate effectively on complex engineering activities with the engineering
community and with society at large, such as, being able to comprehend and write effective reports and
design documentation, make effective presentations, and give and receive clear instructions.
11. Project management and finance: Demonstrate knowledge and understanding of the engineering
and management principles and apply these to one’s own work, as a member and leader in a team, to
manage projects and in multidisciplinary environments.
12. Life-long learning: Recognize the need for, and have the preparation and ability to engage in
independent and life-long learning in the broadest context of technological change.

PROGRAM EDUCATIONAL OBJECTIVES (PEOs)

The PEOs of the B.Tech (CSE) program are:


PEO1. To provide students with the fundamentals of Engineering Sciences with more emphasis in
computer science and engineering by way of analyzing and exploiting engineering challenges.
PEO2. To train students with good scientific and engineering knowledge so as to comprehend, analyze,
design, and create novel products and solutions for the real life problems.
PEO3. To inculcate professional and ethical attitude, effective communication skills, teamwork skills,
multidisciplinary approach, entrepreneurial thinking and an ability to relate engineering issues with
social issues.
PEO4. To provide students with an academic environment aware of excellence, leadership, written
ethical codes and guidelines, and the self-motivated life-long learning needed for a successful
professional career.
PEO5. To prepare students to excel in Industry and Higher education by Educating Students along with
High moral values and Knowledge.

PROGRAM SPECIFIC OUTCOMES (PSOs)


PSO1 Ability to interpret and analyze network specific and cyber security issues in real world
environment.

PSO2 Ability to design and develop Mobile and Web-based applications under realistic
constraints.


COURSE OUTCOMES (COs)


On completion of the seminar, graduates will be able to:
1. CO-1: Identify and analyze the latest technology and complex engineering problems through research methodology in Computer Science & Engineering.
2. CO-2: Explore and justify the latest industrial trends.
3. CO-3: Develop presentation skills, technical report writing, and professional ethics for life-long learning.

MAPPING: CO’s & PO’s


Subject Code: 7CS7-40 (Seminar)

COs    PO-1  PO-2  PO-3  PO-4  PO-5  PO-6  PO-7  PO-8  PO-9  PO-10  PO-11  PO-12
CO-1    3     3     3     2     2     2     1     1     2     2      2      3
CO-2    3     3     3     3     3     2     1     1     3     2      2      3
CO-3    3     3     3     2     2     2     2     2     3     3      3      3


ACKNOWLEDGEMENT

“Any serious and lasting achievement or success, one can never achieve without the help, guidance
and co-operation of so many people involved in the work.”

It is my pleasant duty to express my profound gratitude, regards and thanks to Mr. Arpit Agarwal, Dr. V.K. Chandna and Dr. Sanjay Gaur, who gave me the opportunity to take up this seminar report.

I am indebted to my supervisor, who allotted this seminar and whose precious time and advice during the period were invaluable to the report.

I would like to express deep gratitude to Dr. Sanjay Gaur, Head of Department (Computer Science &
Engineering), Jaipur Engineering College & Research Centre, Jaipur (Rajasthan) with whose support
the seminar report has been made possible.

Last but not least, I am heartily thankful to my friends and all those people who were involved directly or indirectly in this seminar report, for encouraging me whenever I needed their help in spite of their busy schedules.

Manoj Kumar Khandelia


20EJCCS162


ABSTRACT

Web scraping automatically extracts data from websites and online sources. It involves the use of
specialized software tools or programming scripts to retrieve and parse the desired data from web
pages. This abstract provides an overview of web scraping, highlighting its purpose, techniques, and
implications. Web scraping allows users to gather structured data, such as text, images, prices, reviews,
or other relevant information, from a wide range of websites for analysis, research, or business
purposes. It enables the collection of large volumes of data in a relatively short time, automating what
would otherwise be a manual and time-consuming process. However, web scraping also raises ethical
and legal considerations, as it may infringe on website terms of service, violate copyright laws, or
compromise user privacy. Organizations and individuals must navigate these ethical and legal
boundaries while ensuring responsible and respectful web scraping practices. Furthermore,
advancements in technologies and techniques, such as the use of APIs, HTML parsing, and machine
learning, have expanded the capabilities of web scraping, making it more powerful and flexible. As
data becomes increasingly vital for decision-making and innovation, web scraping plays a crucial role
in enabling data-driven insights and applications across various domains, including e-commerce,
finance, market research, and academia.


1. INTRODUCTION

The World Wide Web consists of an interlinked network of information, which is presented to users through websites. The World Wide Web has significantly changed the way we share, collect, and publish data, and the amount of presented information grows constantly.

As data grows in amount, variety, and importance, business leaders must focus their attention on the data
that matters the most. Not all data is equally important to businesses or consumers. The enterprises that
thrive during this data transformation will be those that can identify and take advantage of the critical
subset of data that will drive meaningful positive impact for user experience, solving complex problems,
and creating new economies of scale. Business leaders should focus on identifying and servicing that
unique, critical slice of data to realize the vast potential it holds.

With the use of the Web as a new marketing and sales channel, the quantity of content has also multiplied. Online merchants offer large packs of data to describe their products. Knowledge-base providers offer access to their databases. IDC forecasts that by 2025 the global datasphere will grow to 163 zettabytes (a zettabyte is a trillion gigabytes). That is ten times the 16.1 ZB of data generated in 2016. All this data will
unlock unique user experiences and a new world of business opportunities.

With this unorganized growth, it is no longer possible to manually track and record all available sources. That is the moment when Web Scraping evolved. Automated techniques allow the collection of a massive amount of data from the Web compared to manual data extraction.

Together with Web Scraping, another term became very important: metadata. The massive collection of data obtained by Web Scraping allows metadata analysis.

This report describes the most common reasons for Web Scraping and the legal aspects connected to this topic. Techniques of Web Scraping are presented with appropriate explanation in separate chapters. Currently available software tools are listed with a brief summary of their functionalities.


2. GENERAL FACTS ABOUT WEB SCRAPING


Several definitions of web scraping came up during the literature research. All three definitions presented below mention data extraction from multiple sources; they differ in the form of the initial sources for the extracted information.

Sometimes it is necessary to gather information from web sites that are intended for human readers, not
software agents. This process is known as “web scraping”.

The first definition mentions data sources which are originally intended for human readers. Such a definition has proven itself outdated: with the evolution of automated techniques, the possibility also arose of extracting from software-readable sources. Application Programming Interface (API) sources at that time were very limited.

Web scraping, also known as web extraction or harvesting, is a technique to extract data from the World
Wide Web (WWW) and save it to a file system or database for later retrieval or analysis. Commonly, web
data is scraped utilizing Hypertext Transfer Protocol (HTTP) or through a web browser. This is
accomplished either manually by a user or automatically by a bot or web crawler. Due to the fact that an
enormous amount of heterogeneous data is constantly generated on the WWW, web scraping is widely
acknowledged as an efficient and powerful technique for collecting big data.

The current state of affairs is more precisely portrayed by the second definition, where Web Scraping is mentioned as one of the sources for big data collection. The definition also mentions another term: Web Crawling. Web Crawling is performed in a different way, with different outcomes. Figure 1 illustrates both activities. The process steps visible on the left side show that Web Crawling does not have a defined unique target and processes any available data without aiming at specific information. In comparison, the right side shows the Web Scraper receiving, processing and parsing data from a specified source.


Figure 1. Web Scraping vs Web Crawling

The definition below does not mention many details. However, it succinctly captures the activities of Web Scraping most precisely.

Web Scraping involves the process of querying a source, retrieving the results page and parsing the page
to obtain the results.


3. PURPOSE OF WEB SCRAPING

Enormous amounts of source information available on the World Wide Web are still in the format of Hypertext Markup Language (HTML) pages. Automated extraction is difficult because the intended reader was a human. This chapter introduces the motivation and purpose of information extraction through Web Scraping.

The rapid growth of the World Wide Web has significantly changed the way we share, collect, and publish data. Vast amounts of information are being stored online, in both structured and unstructured forms. Regarding certain questions or research topics, this has resulted in a new problem: the concern is no longer data scarcity and inaccessibility but, rather, overcoming the tangled masses of online data.

These utilizations are often only possible because of the existence of automated Web Scraping. Without these techniques, it would be impossible to collect such amounts of data repeatedly and in reasonable time.

3.1 MARKET ANALYSIS AND RESEARCH

Data collection from online sources has become one of the market research methods. It offers a much faster response compared to classical surveying. While Knowit considers it best to utilize traditional surveys, web scraping is seen as cost-effective support for such instruments. To get a comprehensive picture and to gain knowledge of the tools in markets, multiple sources should be used.

Consumers are active in the online world and share their experience, frustration or motivation. Companies that wish to learn more from consumers can add online sources of information; web scraping is one of the methods to collect such data. Targeted data collection from e-shop and advertising servers helps to update indexes that are based on frequently changing prices. Indexes built through automated web scraping can offer more frequent update intervals.


With the increasing relevance and availability of on-line prices that we see today, it is natural to ask
whether the prediction of the consumer price index (CPI), or related statistics, may usefully be computed
more frequently than existing monthly schedules allow for.

Wegmann and Chapple (2013) used a small sample of 338 Craigslist listings to study the prevalence of secondary dwelling units in the San Francisco Bay Area. Finally, Feng (2014) web-scraped 6,000 Craigslist listings to study Seattle's housing market.

3.2 ENTERPRISE TECHNOLOGIES

Incompatible enterprise technologies are common in larger projects, yet a unified presentation of data from several systems is necessary. In some specific cases the solution is based on Web Scraping. The example below demonstrates a possible approach to collect and evaluate measurements from several independent sources:

The water utility company in Surabaya, called PDAM Surabaya, has several reservoirs in its water supply system which are monitored by the WTW IQ SensorNet 2020 XT. While the sensor devices can provide values for some water quality parameters, the sensors are passive and the data is stored internally in the sensor itself. To solve the problem, a data logger application was proposed to manage data collection for an online water quality monitoring system by using web scraping.

3.3 OPINION POLL

Movie producers collect data about their current blockbusters. Such data includes user feedback shared in reviews on movie portals.


3.4 HUMAN RESOURCES AGENCIES

Human resource (HR) departments in large companies process many job offers for their companies and try to match positions with prospective employees. It is not sufficient to use only incoming requests from candidates; HR departments also cooperate with 3rd-party companies, which can offer them their own directories of professionals. Contact mining is an important activity for such agencies.

“hiQ is a tech startup which collects and analyzes public profile information on LinkedIn in order to provide its clients – mostly large companies – with insights about their employees, such as which employees are likely to be poached by a competitor or which skills its employees have.”
(United States District Court, 2017)

3.5 SOCIAL NETWORK MINING


Social platforms provide trends and reaction levels from users on many topics. Figure 2 composes 4.9 million traces of social media data related to 12th February 2017 (the evening of the Grammy Awards ceremony) into a graph. The spikes mark the most interesting moments as observed by the users of the social media services.

Figure 2. Social data analysis via Brandwatch


Social media (such as blogs, online social networks, microblogs) has become one of the major data sources for quantitative communication research over the past decade. By employing simple programming tools, researchers can extract relevant messages from social media platforms for various research purposes.

3.6 GOVERNMENT SERVICES

The monitoring of criminal activities on social websites and specific forums is an important source of information for government agencies and law enforcement bodies. Public sources describing this usage are, understandably, not available; we can assume that it exists based on patents filed by the United States Government, which are publicly listed:

Method and apparatus for improved web scraping. The United States Of America As Represented By The Secretary Of The Air Force (John J. Salerno, 2003).
3.7 CORPORATE SPYING

In the corporate context, web scraping allows a company to review both its own appearance and that of its competitors in the headlines of news servers. A company can also collect details about competitors and even about its own employees.
However, while there are significant benefits to using website data through methods such as web scraping or web mining in innovation research, the literature on the use and validity of these approaches is relatively underdeveloped.
At the enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering.

3.8 SOCIAL MINING AND SENTIMENT ANALYSIS

Social media is a new source of data that is significantly different from conventional ones. Social media
data sets are mostly user-generated, and are big, interlinked, and heterogeneous.
Social media data can be obtained from publicly available sources via various means such as scraping, using site-provided apps, and crawling. The possibility of obtaining social media data makes research on social media data feasible. In Twitter Data Analytics, for example, a begin-to-end process for Twitter data analytics is elaborated with four key steps: crawling, storing, analyzing, and visualizing. Social media data are just like conventional data: a potential treasure trove, but one that requires data mining to uncover the hidden treasures.


4. METHODS OF WEB SCRAPING

The methods of Web Scraping evolved together with the World Wide Web; not all listed methods were available at the beginning. Two examples are worth mentioning, because they are presently the most used techniques. Since 2000 the Document Object Model (DOM) became more popular through DHTML, and its broader acceptance later allowed the HTML Parsing technique to evolve into DOM Parsing. The second example is Application Programming Interfaces (APIs). This technique is the youngest on the list; the growth of available content APIs dates from 2005. According to ProgrammableWeb.com, the number of APIs grew within 8 years from 0 to 10,302.

4.1 MANUAL SCRAPING

Manual scraping is still an option in specific situations:

1. when the amount of data is minimal,
2. when the data being scraped does not require a repetitive task,
3. when setting up automated scraping would take longer than the data collection itself,
4. when security measures or specific characteristics of the website do not allow automated methods.

4.2 HTML PARSING

Websites don't always provide their data in comfortable formats such as .csv or .json files. HTML pages are created by the server as a response to a user's request. At this point the server software is not relevant; rather, the output in the browser is important. Analysis of the HTML structure of the provided web page (a simple page sample is provided in Figure 3) will reveal repeated elements. With a programming-language script or a Web Scraping tool, each page with a similar pattern can be used as a source for data, as the sketch below illustrates.


Figure 3. HTML Structure
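
To make the technique concrete, below is a minimal Python sketch using the requests and BeautifulSoup libraries; the URL and the class names are hypothetical placeholders for a page with repeated elements of this kind.

# Minimal HTML parsing sketch (URL and class names are assumed placeholders)
import requests
from bs4 import BeautifulSoup

html = requests.get('https://example.com/products').text   # fetch the page
soup = BeautifulSoup(html, 'html.parser')                   # parse the HTML

# Every repeated element matching the same pattern becomes one data record
for item in soup.find_all('div', class_='product'):
    name = item.find('h2').get_text(strip=True)
    price = item.find('span', class_='price').get_text(strip=True)
    print(name, price)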

4.3 DOM PARSING

Document Object Model (DOM) Parsing is an evolution of HTML Parsing, based on developments of the language and browsers which led to the introduction of the Document Object Model. The DOM is heavily used for Cascading Style Sheets (CSS) and JavaScript. Integration of the DOM revealed new possibilities for addressing specific parts of a webpage. Figure 4 shows containers with their own DOM addresses; these are used in Web Scraping for easier navigation through webpage content.

Figure 4. DOM Selector
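
As an illustration, the same addressing idea can be sketched in Python through CSS selectors, which walk the DOM tree by id, class and tag; the document fragment below is invented for the example.

# DOM-based addressing via CSS selectors (illustrative document fragment)
from bs4 import BeautifulSoup

html = """
<div id="content">
  <ul class="menu">
    <li>Soup</li>
    <li>Pasta</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# '#content ul.menu li' addresses the list items by their position in the DOM
for li in soup.select('#content ul.menu li'):
    print(li.get_text())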


4.4 XPATH

XPath (XML Path Language) provides an addressing capability similar to the DOM. The name suggests usage for XML documents, but it is also applicable to the HTML format. XPath requires a more precisely structured webpage than the DOM and has the same ability to address segments within the webpage. Figure 5 shows the document structure as interpreted in XPath.

Figure 5. XPath Navigation
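
A small sketch of XPath addressing in Python follows, using the lxml library (one of several XPath-capable parsers); the table structure is invented for the example.

# XPath navigation over an HTML fragment (structure is illustrative)
from lxml import html

page = html.fromstring("""
<html><body>
  <table id="menu">
    <tr><td>Monday</td><td>Soup</td></tr>
    <tr><td>Tuesday</td><td>Pasta</td></tr>
  </table>
</body></html>""")

# The XPath expression describes a path through the document tree
for row in page.xpath('//table[@id="menu"]/tr'):
    day, dish = row.xpath('./td/text()')
    print(day, dish)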


4.5 APIs

Whilst the previous methods work by scraping human-readable outputs, an Application Programming Interface (API) expects an application as a communication partner. Thus APIs are often described as machine-readable interfaces (versus human-readable ones). Even though APIs were introduced much later than the WWW, their growth was very fast. The world of APIs is fragmented, and API directories were created for simple overview and orientation. Most of the available APIs are registered and described in a directory with relevant links to the sources. Two examples of such directories are ProgrammableWeb (https://www.programmableweb.com) and APIs.guru (https://apis.guru/). API directories also provide their own API, which allows users to search their database for API sources. A standard HTTP request sent to an API endpoint returns an answer from the server. Each API has its own specification and options; the format of the answer can be set as an option in the request. The most widely used format for API communication is JSON.
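
As a sketch, querying an API from Python typically amounts to a single HTTP request; the endpoint and parameters below are assumptions, since every real API defines its own specification and authentication.

# Hypothetical API call returning JSON (endpoint and parameters are assumed)
import requests

response = requests.get(
    'https://api.example.com/v1/books',        # API endpoint
    params={'format': 'json', 'page': 1},      # options sent with the request
    headers={'Accept': 'application/json'},
)
response.raise_for_status()                    # fail loudly on HTTP errors

data = response.json()                         # parse the JSON answer
print(data)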


5. AVAILABLE SOFTWARE TOOLS

This chapter introduces several published software tools for Web Scraping. We can assume that more tools exist which have been developed for the internal purposes of individuals and companies and thus have not been published. Therefore, due to the proliferation of applications and platforms, this list does not include all available software.

The tools are grouped into the following sections by software type:


• Cloud Software
• Desktop Software
• Programming libraries
• Browser Extensions

Each section introduces a minimum of 2 publicly available solutions.

5.1 CLOUD SOFTWARE

Cloud solutions provide a user interface through a web browser, while the application backend resides on a cloud server. Such a configuration reduces the user-end hardware requirements to a minimum. Large amounts of data can be scraped without keeping one's own computer running, and larger projects can be handled without additional hardware or extended internet bandwidth.

In some situations it is very important where the cloud is located. For example, some webpages may not be accessible from Europe in their full range; a cloud solution connected through a US-based ISP may be able to access other sections of such a website.

Cloud Solutions are generally financed by subscription fees. The price depends on allocated resources.
The owners offer monthly or annual payments.


5.1.1 Dexi.io
Dexi has an up-to-date sleek User Interface. The workflow starts with the creation of a robot. The
webpage is presented in the background and the user needs to select the requested areas and define the
robot tasks by selecting actions in a drop-down menu. The free tariff option provides all functionalities, limited only by the time period of use; Dexi sets a very strict evaluation period of just 1 hour.

5.1.2 Octoparse
Octoparse has a desktop client. The client communicates with the cloud and submits the tasks which are
then processed on the cloud. The User interface is very complex, with a lot of controls available at the
first click. The user should first create a task and define the steps. The finished task then moves to the
queue for execution. Execution occurs on the cloud with the results being returned to the user in the
desktop interface when the task has been completed.

5.1.3 Scrapinghub Platform


Scrapinghub offers a platform for creating tasks built on the Scrapy Framework and running them in the cloud. Additionally, visual programming is available, which is an advantage for users who are not familiar with programming. The pricing is based on the RAM used by the created scraping tasks. Use of the Scrapy Framework guarantees excellent portability of the created tasks to end-user hardware resources, or in the opposite direction if the user's hardware lacks the processing power to finish the Scrapy tasks.

5.2 DESKTOP SOFTWARE

Website data are downloaded, parsed and saved locally. Desktop applications require a broadband internet connection; the computer's internet link directly influences the processing time of a Web Scraping task. Compared to Cloud Software, more advanced hardware is required for Desktop Software: a workstation with more than 8 GB of Random Access Memory (RAM) is recommended as a minimum.


5.2.1 ParseHub
Parsehub is a web scraping software that supports complicated data extraction from sites that use AJAX,
JavaScript, redirects and cookies. Machine learning technology is used to read and analyze documents.
Analysis allows ParseHub to output relevant data. The software is available as a Desktop client for
Windows, Mac and Linux. Additionally, there is a Web Application, that can be used within the browser.
Within the free tariff it is possible to scrape up to 200 pages.

5.2.2 FMiner
FMiner combines visual configuration with scripting features. Steps are recorded in the same way as when manually scraping a website; FMiner simultaneously adds the steps to a macro. A visual interpretation of the macro is captured in Figure 6; in that case the website requires a selection from a dropdown menu and a wait time of 2 seconds. The visual dashboard makes extracting data from websites intuitive. Whether you want to scrape data from simple web pages or carry out complex data-fetching projects, both scenarios are possible with FMiner.

5.3 PROGRAMMING LIBRARIES

Ready-built Web Scraping software tools are much easier to use than programming your own web scraping setup, but for some purposes the available software products are not the ideal solution. When a developer is appointed to create a custom Web Scraping solution, however, it is not necessary to start from zero: frameworks allow developers to save time by re-using generic modules and to focus on the specific areas for which an off-the-shelf solution isn't suitable.

Frameworks for 5 programming languages are presented in Figure 6, which visualizes the availability of three basic functionalities.


Figure 6. Features of Programming Libraries

5.3.1 Scrapy
Scrapy is one of the most advanced web scraping frameworks available. The framework is written in Python. It is an application framework and provides many commands for creating one's own applications and using them for Web Scraping projects.

Scrapy is a web crawling framework in which developers write code to create spiders, which define how a certain site (or a group of sites) will be scraped. Its biggest feature is that it is built on Twisted, an asynchronous networking library, so Scrapy is implemented using non-blocking (aka asynchronous) code for concurrency, which makes spider performance very good.

Extensive documentation is available to support developers who wish to integrate it into their own applications. The choice between a local desktop application and a cloud solution is not a limitation; both can be powered by Scrapy code. One of the presented cloud solutions encourages users to move their Scrapy applications over to cloud infrastructure.
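
As a minimal sketch, a complete spider can be as short as the one below; the target site and selectors follow the public Scrapy tutorial example and are not tied to any tool discussed above.

# Minimal Scrapy spider sketch (site and CSS selectors as in the Scrapy tutorial)
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        # Each repeated container on the page yields one scraped item
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        # Follow the pagination link; Scrapy schedules requests asynchronously
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Such a spider can be run without creating a full project, for example with: scrapy runspider quotes_spider.py -o quotes.json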


5.3.2 Goutte
Goutte is a screen scraping and web crawling library for PHP. Goutte provides a nice API to crawl websites and extract data from HTML/XML responses. It makes use of the Symfony BrowserKit component, which simulates the behavior of a web browser, allowing you to make requests, click on links and submit forms programmatically. The developer webpage does not offer advanced documentation.

5.4 Browser Extensions

Browser extensions belong to the entry-level category of Web Scraping software. These tools are capable enough if data are scraped from a limited number of websites for a quick research project. They are available in the browser-specific add-on stores.

5.4.1 Outwit Hub


Once installed and activated, it adds web scraping capabilities to the browser. Outwit Hub offers ready-made templates for the extraction of pictures, tables, lists and texts. Extracting data from sites using Outwit Hub does not demand advanced skills, and the setup is not complex. Sufficient documentation is provided on the Outwit Hub website. Outwit Hub was discontinued as a Firefox add-on and the newest versions are available only as standalone applications.

5.4.2 Web Scraper


Web Scraper is a browser extension available for Google Chrome that can be used for web scraping. A sitemap defines how a website should be navigated and what data should be extracted. It can scrape multiple pages simultaneously and even has dynamic data extraction capabilities. Web Scraper can also handle pages with JavaScript and Ajax, which makes it more powerful. The tool extracts data to a CSV file or CouchDB. For larger Web Scraping jobs the developer offers a cloud version; the Google Chrome extension is also used as a test environment before scraping jobs are assigned to the cloud version.


5.5 CONCLUSION

It is difficult to compare the listed tools to each other. Some are meant for hobbyists and some are suitable
for enterprises.

Browser extensions belong to the hobbyist category. To collect data from a few websites for a brief research project, these tools are more than enough.

Desktop applications have a broad set of functionalities, which is reflected in their high price. Cloud solutions offer a similar set of functions. Compared to Desktop Software, Cloud Solutions offer 3 main advantages:

• monthly billing based on resource usage instead of a high initial investment
• low hardware requirements on one's own server/PC
• one's own bandwidth is used only for job setup and control

Cloud Software currently offers the most accessible way to begin and test web scraping options.

The most complex option is frameworks; here the language used and the available documentation are important for prospective developers.


6. HANDS ON WEB SCRAPING

The topic of Web Scraping might not be clear from the description alone. Therefore, this chapter offers a guide through a practical demonstration.

6.1 TASK
For demonstration purposes, the websites of 5 canteens based in Vienna are targeted. The weekly lunch menus are to be collected; each of the canteens publishes the data on its own website.
o http://www.kantine.at/speisen/Menu
o http://www.mq-kantine.at/tagesteller.php?pageid=104
o http://www.magdas-kantine.at/
o https://www.aera.at/wochenmenue/
o http://www.restaurant-wild.at/wochenmenue.php/
The goal is to extract the data from the different websites (Figure 7). As a result, a JSON document with all lunch menus is expected.

Figure 7. Different structures of the websites


The JSON format was selected for the result because the extracted data should be usable as a data source for other applications; the suggested format is presented in Figure 8. Such data could be used as an information source for an info screen or a local restaurant directory. To keep the results updated, it is important to repeat the web scraping task at least twice a week.

Figure 8. JSON Format of the final result
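
Since the figure itself is not reproduced here, the following Python sketch shows one possible shape of such a result; the field names and sample dishes are assumptions based on the task description, not the exact format from the figure.

# Hypothetical target structure for the scraped menus (field names assumed)
import json

result = {
    "canteens": [
        {
            "name": "Example Canteen",
            "url": "http://www.kantine.at/speisen/Menu",
            "days": [
                {"weekday": "Monday", "menu1": "Soup of the day", "menu2": "Pasta"},
                {"weekday": "Tuesday", "menu1": "Salad", "menu2": "Schnitzel"},
            ],
        },
    ],
}

print(json.dumps(result, indent=2, ensure_ascii=False))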

6.2 PREREQUISITES

• Access to the selected websites
• Installed ParseHub software, downloaded at https://www.parsehub.com/
• Sufficient bandwidth of the internet connection: this use case is not bandwidth intensive; the available 75 Mbit/s down / 75 Mbit/s up won't be fully utilized
• Sufficient Central Processing Unit (CPU) and RAM: this use case is not intensive on computing power; a Xeon 3.3 GHz with 12 GB RAM would suffice even for a scaled-up solution


6.3 SELECTED SOFTWARE

From the list presented in the previous chapter, the recently published desktop application ParseHub was selected. The following aspects were important for the required task:

• Fast learning curve – the user interface is based on visual controls
• Simple setup – standalone desktop application
• Available documentation – the product website contains a video-based tutorial

6.4 PREPARATION

ParseHub collects the data from websites based on templates. As a first step we need to define the URL and start defining the template. At this point it is necessary to provide a structure: common usage requires defining the main object and binding the related ones. In this case the main object is a weekday and the related objects are Menu 1 and Menu 2. Figure 9 shows the ParseHub user interface during the preparation. It allows the selection to be made in a straightforward way, and the result of each selection is immediately recognizable on the screenshot of the webpage itself.


Figure 9. ParseHub Visualization of the template on website


6.5 EXECUTION

Once the template for a webpage has been created, the ParseHub application has a description of the data areas we want to collect. Additionally, related content such as images and metadata was set to be collected in the ParseHub software. Figure 10 shows how to proceed with 5 different sources: each separate page requires a new template. Our data set contains 5 templates (canteens), each with 5 items (weekdays) and 2 related items (Menu 1, Menu 2); due to this small size, the execution of the task took less than 5 minutes. To keep the results updated, it is important to repeat the web scraping task at least twice a week. Unfortunately, such scheduling settings are available only with a paid subscription of the ParseHub software.

Figure 10. ParseHub Template definition


6.6 RESULTS

The application presented the results in the JSON format, as selected during the configuration. Figure 11 shows highlighted sections on 2 data rows.

Figure 11. Menus from 5 different canteens formatted as JSON string

6.7 CONCLUSION

The practical example has shown how to use different websites as data sources for one's own applications.

The purpose of canteen menu collection requires periodic updates of the web scraping task. Therefore, the selection of a desktop application was not correct, as it would require a manual action to keep the data source updated. A cloud application would be the proper solution for such a case.

In the example, the websites' terms of use were not examined. This would be necessary if such a task were performed for a real purpose. The next chapter describes the legal aspects and the necessary steps.


7. LEGAL ASPECTS OF DATA EXTRACTION

The use of data obtained by Web Scraping is questioned by several website owners. Individuals and companies active in Web Scraping have a simple legal defense strategy: websites are filled with data which was published with the clear intention of being seen by others.

For example, in the case of LinkedIn, the main business plan is to publish profiles and derive income by charging for a portion of the services sold to users. The goal of users is to expose their CVs to prospective employers. Therefore, holding the data private would not comply with the business plan, and by extension any data available is thus public.

Web Scraping could possibly breach several laws or contract terms. Table 1 explains the areas of law in which a Web Scraper should be prepared for a defense action.

Website terms and conditions: Many companies expressly prohibit scraping within their website terms and conditions. Whether they can enforce such terms is still unclear but, depending on the circumstances, a claim for breach of contract is possible.

Copyright, Intellectual Property Rights: As scraping involves copying, it may lead to a claim for copyright infringement. Whether such a claim has any merits will depend on the particular circumstances, because not all scraped data qualifies for copyright protection. Works which are copyright protected include original literary and artistic works such as computer programs, website graphics, and photographs.

Database rights: A database right is infringed when all or a substantial part of a database is extracted or re-utilised without the owner's consent. The repeated extraction or re-utilisation of insubstantial parts of a database which conflicts with the normal use of the database may also infringe database rights. Infringement of database rights may also apply when scraping directories or listings from third-party websites if the owner has incurred costs in developing and maintaining them.

Trademarks: If the scraper reproduces a website owner's (registered or unregistered) trademarks without their consent, the website owner could take an action claiming trademark infringement and/or passing off. Passing off prohibits a third party from selling goods or carrying on business under a name, mark, description, or in any other way that is likely to mislead, deceive or confuse the public into believing that the merchandise or business belongs to the brand owner.

Data Protection: Businesses looking to use automated scraping techniques to collect information about individuals should be aware that they risk breaching local data protection law if they collect "personal data" (that is, any information that identifies a living individual). The central issue is whether individuals have consented to their personal data being collected. Although data obtained from one website may in isolation not be personal data, when it is aggregated from multiple websites a business may inadvertently find itself in possession of personal data without the consent of the individual concerned. Use of such personal data will infringe data protection laws.

Criminal Damage: It is an offence to cause criminal damage to a computer (including damage to data) or to use a computer to access data without authorisation. Accordingly, data scraping could be a criminal offence where the website owner has not authorised access to the data.

Table 1. Law Topics related to Web Scraping

7.1 UNITED STATES

A recent lawsuit between companies “LinkedIn” and “hiQ Labs” should illustrate a possible outcome for
cases opened because of Web Scraping activities. Below is a cited part of the hiQ Labs defense against
the accusation.

“hiQ is a tech startup which collects and analyzes public profile information on LinkedIn in order to provide its clients – mostly large companies – with insights about their employees, such as which employees are likely to be poached by a competitor or which skills its employees have. hiQ does not analyze the private sections of LinkedIn, such as profile information that is only visible when you are signed in as a member, or member private data that is visible only when you are “connected” to a member. Rather, the information that is at issue here is wholly public information visible to anyone with an internet connection.”


Web Scraping activity can reach a point where it violates the Computer Fraud and Abuse Act (CFAA). The CFAA was designed to prevent unauthorized access to websites and servers. Individual states in the USA have similar state laws with requirements similar to the CFAA.

The CFAA imposes criminal penalties on a party who “intentionally accesses a computer without
authorization or exceeds authorized access, and thereby obtains information from any protected
computer.”

7.2 EUROPEAN UNION

In comparison to the United States, the legal arguments in Europe are different: instead of arguing unauthorized excessive usage, plaintiffs rely on intellectual property rights.

In Ryanair Ltd v PR Aviation BV [2015] the Court of Justice of the European Union (“CJEU”) held that
no intellectual property rights subsisted in the scraped data (Ryanair’s database of flight times and prices)
and therefore the company scraping the data had not infringed Ryanair’s IP. This was because the
database was not the result of the requisite creative input necessary to be afforded copyright protection.
The CJEU made it clear, however, that it is possible for a website owner to restrict the re-use of the mined
data through its terms and conditions. This is therefore something that companies should bear in mind: if they access a website and consent to terms of use which restrict the re-use of the website data, and they then go on to re-use that data, they may be liable for breach of contract.


7.2.1 General Data Protection Regulation

The General Data Protection Regulation, publicly known by the abbreviation GDPR, has been a heavily discussed topic since it was adopted by the European Parliament and the European Council. Member states in the EU already had their own data protection laws in place, and an EU-wide directive was also already in force. Compared to Directive 95/46/EC (the Data Protection Directive), which it replaces, the GDPR seeks to extend the reach of EU data protection law. The key aim of the GDPR is to harmonize data protection law throughout the EU. It sets rules about how personal information can be used by companies, governments and public entities. The time period dedicated to implementation ended in May 2018. Table 2 outlines the exact schedule of the GDPR implementation.

Date            Action
April 2016      Adopted by the European Parliament and the European Council
May 2016        Published in the EU Official Journal
25th May 2018   First day when the GDPR is applied

Table 2. GDPR Timeline

The text of the GDPR sets out the rights of individuals regarding access to data held about them by organizations and companies. Individuals are the weaker party in such a process, so new fines are defined to enforce their rights. The responsibility of organizations is strictly defined: they must obtain the consent of the people they collect information about.

Any company that uses personal data from EU citizens or offers goods or services to them must attain GDPR compliance. Table 3 lists examples that would qualify as needing to comply with the GDPR's information privacy rules. The GDPR applies irrespective of whether the actual data processing takes place in the EU or not.


Table 3. GDPR Guidance of Application


7.2.2 Compliance of currently owned data with GDPR

A crucial point for all data owners is that there is no exception for data captured before the GDPR. Once the GDPR becomes applicable, it will not be legally possible to process data that is in violation, even if it was not violating the law when it was collected. It is defined as the responsibility of the data owners to review and revise their datasets in order to verify that the data falls within the bounds of the law and that they possess valid consent from the affected persons. Such consent can be dated before May 2018 but has to fulfill all requirements set by the GDPR.

7.2.3 Data Breach reporting under GDPR

Web Scraping might be affected by the GDPR requirement to report data breaches immediately. When a company or individual starts any scraping activity, they should consider the impact on the website. Webpages may implement a guarding mechanism to fulfill the requirement of immediate data breach reporting; otherwise the website owner could face penalties.

Article 33 of the GDPR deals separately with the event of data breaches:

“In the case of a personal data breach, the controller shall without undue delay and, where feasible, not
later than 72 hours after having become aware of it, notify the personal data breach to the supervisory
authority competent in accordance with Article 55, unless the personal data breach is unlikely to result in
a risk to the rights and freedoms of natural persons. Where the notification to the supervisory authority is
not made within 72 hours, it shall be accompanied by reasons for the delay.“


7.2.4 Purpose Limitation


Collection of personal data is possible only with the consent of the affected person. Such consent must list specified, explicit and legitimate purposes. Collected data cannot be further processed in a way
incompatible with those purposes. Further processing of personal data for archiving purposes in the public
interest, or scientific and historical research purposes or statistical purposes shall not be considered
incompatible with the original processing purposes. However, conditions in Article 89(1) (which sets out
safeguards and derogations in relation to processing for such purposes) must be met.

7.3 CONCLUSION

In summary, there is no specific piece of legislation which forbids Web Scraping to gather information.
The website owners, however, may have legal rights against the company under intellectual property law
and contract law.

Each case will turn on its own facts, however, and much depends on what information is scraped from the
websites. Companies should beware of contractual provisions they have agreed to in a website's terms of
use, as these may prohibit the user from taking and using data from the site. If the scraped data includes
personal data, then compliance with data protection law must also be borne in mind. The only way to be
truly certain that the rights of a website owner have not been infringed is to obtain their express
consent to the scraping and the subsequent use of the information.

The overall outcome of forbidding Web Scraping could, however, be negative for the website owner: fewer
visitors, fewer links from content aggregator websites and less income from advertising.

Thus, data hosts should take legal action against scrapers only when the scraper presents a threat to the
data host's core business and the data host has a strong enough claim to prevail in court. From a legal
perspective, it is advisable to adjust the terms of use on the website; a restriction on Web Scraping
techniques can be included directly. Such a step does not require many resources and allows a direct
argument in court.
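
Beyond the terms of use, many websites also publish machine-readable access rules in a robots.txt file,
and a polite scraper can check these before fetching any page. The minimal sketch below uses Python's
standard urllib.robotparser; the target URLs are illustrative only.

# code
# Minimal sketch: honoring a site's robots.txt before scraping.
# Uses only the Python standard library; the URLs are illustrative.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('https://books.toscrape.com/robots.txt')
rp.read()

url = 'https://books.toscrape.com/catalogue/page-1.html'
if rp.can_fetch('*', url):
    print('Allowed to fetch', url)
else:
    print('robots.txt disallows fetching', url)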

8. GENERAL CONCLUSION

Depending on the primary purpose, different Web Scraping techniques can be used, taking the amount of
data, the periodicity and the required outcome into consideration.

Web scrapers have a broad selection of tools to choose from. A project does not consist only of the
technical solution and its execution: in every Web Scraping project, the legal aspects of the specific job
should be examined and the necessary steps identified.

Data hosts should always assess the benefit scrapers can provide and take a pragmatic approach to those
who scrape their data.

Web scrapers, in turn, should keep in contact with the data hosts and allow identification of the data
host as the source of the presented information.


9. BOOK SCRAPER PROJECT

Pre-requisites: Requests, BeautifulSoup, scraping a single page, scraping multiple pages, creating a
DataFrame and saving it.

In this section we will scrape the website books.toscrape.com. The URL for the first page
is 'https://books.toscrape.com/catalogue/page-1.html'.

# code
#Importing the required libraries
import requests
from bs4 import BeautifulSoup

link = 'https://books.toscrape.com/catalogue/page-1.html'

#Sending a request to the website
res = requests.get(link)

#Creating a soup using BeautifulSoup
soup = BeautifulSoup(res.text,'html.parser')

The above code is just the basic setup with which any web scraping project starts. Let's begin by
scraping the first page of https://books.toscrape.com/.

Whenever we want to scrape a website, the first step is inspecting it and finding the class/div/tag etc.
that we want to scrape. For the above URL we want information about the books listed on the page, so we
have to select the list items li having class = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'.
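
Before parsing, it is worth confirming that the request actually succeeded. This small check is our own
optional addition to the script:

# code
# Optional check: make sure the page was returned before parsing it.
if res.status_code != 200:
    raise RuntimeError('Request failed with status ' + str(res.status_code))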


# code
data = []
for sp in soup.find_all('li',class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'):
    #Finding the different aspects to be scraped from the first page
    book_link = "https://books.toscrape.com/catalogue/" + sp.find_all('a')[-1].get('href')
    title = sp.find_all('a')[-1].get('title')
    img_link = "https://books.toscrape.com/" + sp.find('img').get('src')[3:]
    book_rating = sp.find('p').get('class')[-1]
    price = sp.find('p',class_='price_color').text[1:]
    stock = sp.find('p',class_ = 'instock availability').text.strip()
    #Appending all the data into a list (data)
    data.append([title, book_rating, price, stock, book_link, img_link])
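
The star rating is stored in the markup as a word ('One' to 'Five'). As an optional post-processing step,
our own addition below converts it to a number; the mapping dictionary is an illustrative extra, not part
of the original script:

# code
# Optional: map the textual star rating to an integer for easier analysis.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

for row in data:
    row[1] = RATING_WORDS.get(row[1])  # row[1] holds the rating word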

Output from data: the list data now holds one entry per book on the first page, each entry containing the
title, rating, price, stock status, book link and image link.


Scraping Multiple Pages

# code
import pandas as pd
from tqdm import tqdm

Multiple_Pages = []
#tqdm is used to show a progress bar during the run.
for page in tqdm(range(1,51)):
    #There are 50 pages, so the loop builds the links page-1, page-2, ..., page-50
    link = 'https://books.toscrape.com/catalogue/page-'+str(page)+'.html'
    res = requests.get(link)
    soup = BeautifulSoup(res.text,'html.parser')
    #Same extraction code as when scraping a single page
    for sp in soup.find_all('li',class_ = 'col-xs-6 col-sm-4 col-md-3 col-lg-3'):
        book_link = "https://books.toscrape.com/catalogue/" + sp.find_all('a')[-1].get('href')
        title = sp.find_all('a')[-1].get('title')
        img_link = "https://books.toscrape.com/" + sp.find('img').get('src')[3:]
        book_rating = sp.find('p').get('class')[-1]
        price = sp.find('p',class_='price_color').text[1:]
        stock = sp.find('p',class_ = 'instock availability').text.strip()
        Multiple_Pages.append([title,book_rating,price,stock,book_link,img_link])

#Creating a DataFrame
Multiple_Pages_df = pd.DataFrame(data=Multiple_Pages)
Multiple_Pages_df = Multiple_Pages_df.rename(columns={0: 'Title', 1: 'Rating', 2: 'Price', 3: 'Stock Available', 4: 'Book Link', 5: 'Image Link'})
df_1 = Multiple_Pages_df

#Saving the DataFrame to a CSV file
Multiple_Pages_df.to_csv('All Books.csv',index=False)
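
Since this loop fires 50 requests in quick succession, a polite variant, our own suggestion rather than
part of the original script, inserts a short pause between pages; the 1-second delay is an illustrative
choice:

# code
# Polite-scraping sketch: the same page loop with a short delay between
# requests, so the server is not hit in a tight burst.
import time

for page in range(1, 51):
    link = 'https://books.toscrape.com/catalogue/page-' + str(page) + '.html'
    res = requests.get(link)
    # ... parse res.text exactly as before ...
    time.sleep(1)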


DataFrame output: one row per book with the six columns named above (the screenshot from the original
report is omitted).

Book Page Scraping

# code
data_2 = []
for links in tqdm(Multiple_Pages_df['Book Link']):
    res = requests.get(links)
    soup = BeautifulSoup(res.text,'html.parser')
    #The genre is the third link in the breadcrumb navigation
    Book_Genre = soup.find('ul',class_ = 'breadcrumb').find_all('a')[2].text
    #The product information table holds UPC, prices, tax, stock and reviews
    temp = soup.find('table',class_ = 'table table-striped').find_all('td')
    UPC = temp[0].text
    price_exc_tax = temp[2].text[1:]
    price_inc_tax = temp[3].text[2:]
    tax = temp[4].text[2:]
    availability = temp[5].text
    reviews = temp[6].text
    data_2.append([Book_Genre,price_exc_tax,price_inc_tax,tax,UPC,availability,reviews])

#Creating a DataFrame; the column names follow the order of the appended fields
df_2 = pd.DataFrame(data_2)
df_2 = df_2.rename(columns={0: 'Genre', 1: 'Price (Excluding Tax)', 2: 'Price (Including Tax)', 3: 'Tax', 4: 'UPC', 5: 'Stock', 6: 'Reviews'})


# code
#Combining the listing-level (df_1) and book-level (df_2) data into one
#DataFrame; this positional merge assumes both share the same row order
df = pd.DataFrame()
df['UPC'] = df_2['UPC']
df['Title'] = df_1['Title']
df['Genre'] = df_2['Genre']
df['Rating'] = df_1['Rating']
df['Price'] = df_2['Price (Including Tax)']
df['Tax'] = df_2['Tax']
df['Stock'] = df_2['Stock']
df['Review'] = df_2['Reviews']
df['Book Link'] = df_1['Book Link']
df['Cover Image'] = df_1['Image Link']

#Saving the final scrape
df.to_csv("Final Scrap.csv", index=False)
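
Because the merge relies on df_1 and df_2 sharing the same row order, a quick sanity check, added here as
our own suggestion, can catch mismatches before the file is saved:

# code
# Sanity check: both DataFrames must have one row per book, in the same
# order, for the column-wise merge above to be valid.
assert len(df_1) == len(df_2), 'Row counts differ; the positional merge would misalign books'
print(df.head())  # inspect the first few combined rows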


10. BIBLIOGRAPHY

1. International Data Corporation (IDC), 2017. Data Age 2025. Framingham, USA: IDC.

2. Adamuz, P. L., 2015. Development of a generic test-bed for web scraping. Barcelona: European Education
and Training Accreditation Center.

3. Apress, 2009. Using Web Scraping to Create Semantic Relations. In: Scripting Intelligence, pp. 205-228.

4. Boehmke, B. C., 2016. Scraping Data. In: Data Wrangling with R. Use R!. Cham: Springer.

5. Berlind, D., 2015. APIs Are Like User Interfaces--Just With Different Users in Mind. [Online] Available
at: https://www.programmableweb.com/news/api-economy-delivers-limitless-possibilities/analysis/2015/12/03
[Accessed 17 November 2017].

6. Daly, M., 2016. Dublin Globe - Legal briefs: 6 Reasons Why Not to Scrape Data. [Online] Available at:
http://www.dublinglobe.com/community/toolbox/legal-briefs-6-reasons-not-scrape-data [Accessed 30 November
2017].

7. Glez-Peña, D., Reboiro-Jato, M. and Fdez-Riverola, F., n.d. Web scraping technologies in an API world.
s.l.: s.n.

8. Ferrara, E., De Meo, P., Fiumara, G. and Baumgartner, R., 2014. Web data extraction, applications and
techniques: A survey. Knowledge-Based Systems, Vol. 70, pp. 301-323.

9. Liang, H. and Zhu, J. J. H., 2017. Big Data, Collection of (Social Media, Harvesting). In: The
International Encyclopedia of Communication Research Methods. s.l.: s.n., pp. 1-18.

10. Hirschey, J. K., 2014. Symbiotic Relationships: Pragmatic Acceptance of Data Scraping. Berkeley
Technology Law Journal, 29(4).

11. Liu, H., Morstatter, F., Tang, J. and Zafarani, R., 2016. The good, the bad, and the ugly: uncovering
novel research opportunities in social media mining. International Journal of Data Science and Analytics,
1(3-4), pp. 137-143.

12. Thomsen, J. G., Ernst, E., Brabrand, C. and Schwartzbach, M., 2015. WebSelF: A Web Scraping Framework.
s.l.: IT University of Copenhagen.
