
Self-Learning Material


Program: MCA
Specialization: Core
Semester: 3
Course Name: Application Development using Python
Course Code: 21VMT0C301
Unit Name: Web Scraping



Table of Contents:
- Introduction to web scraping
- Why do we use web scraping, and is it legal?
- Why Python for web scraping
- MapIt.py with the webbrowser module:
1. Figuring out the URL
2. Handling the command line arguments
3. Handling the clipboard content and launching the browser
- Downloading a web page with the requests.get() function
- Checking for errors
- Downloading files from the web using Python
- HTML basics
- BeautifulSoup
- Scraping URLs and email IDs from the web
- Scraping images from the web
- Advanced web scraping techniques: Selenium and Scrapy
- Dynamic pages or client-side rendering
- Authentication
- IP Blocking

Unit 11:
Web Scraping
Unit Overview:
Web scraping is an essential skill for getting data out of websites. This unit demonstrates how to use Python to scrape data, including images, from websites. Beautiful Soup (bs4) is a Python package used to extract data from HTML and XML files; it is not included with Python by default and must be installed from the terminal. The Requests library makes it incredibly simple to send HTTP/1.1 requests; it is likewise not included with Python by default and must be installed from the terminal.
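The install commands referred to above are not reproduced in this copy; they are presumably the standard pip commands:

pip install beautifulsoup4
pip install requests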
Unit Outcomes:
- Exploring web scraping with python
- Project
- Request module
- Saving downloaded files to hard drive
- HTML Basics
- Web scraping using BeautifulSoup
- Scrape URLs and Email IDs from Web
- Scraping images
- Scraping Data on Page Load

Web scraping comes in handy when one needs large amounts of information from a website as quickly as possible. It employs intelligent automation techniques to obtain thousands, if not millions, of data sets far more quickly than manual collection.
Web scraping is a computerised technique for gathering copious volumes of data from
websites. The majority of this data is unstructured in HTML format and is transformed into
structured data in a database or spreadsheet so that it can be used in multiple applications.
Web scraping can be done in a variety of ways to collect data from websites. These include leveraging specific APIs, online services, or even writing your own scraping code from scratch. You can access the structured data of many large websites, including Google, Twitter, Facebook, StackOverflow, and others, through their APIs. This is the best option; however, other websites either lack that technological sophistication or don't permit users to access significant volumes of structured data. In those cases, it's advisable to employ web scraping to collect data from the website.
Web scraping can be used for competitive analysis, R&D, social media scraping, brand monitoring, lead generation, and so on. Web scraping is not inherently illegal, but whether a particular use is legal depends on a number of other criteria, including how the data will be used and whether it appears to violate the website's "Terms & Conditions".
Most websites make their data available to the public, and saving that data on your device for personal use is generally permitted. However, if you intend to use it as your own without the owner's permission and in violation of the "Terms & Conditions", this will be regarded as illegal. Although the law on web scraping is opaque, there are still several rules that you can break if you perform it without authorization.
Following is a list of a few of these:
- Violation of the Digital Millennium Copyright Act (DMCA)
- Violation of the Computer Fraud and Abuse Act (CFAA)
- Breach of Contract
- Copyright Infringement
- Trespassing, etc.
Tactics to employ when performing web scraping –
- If the website provides an API, use it rather than scraping the pages directly.
- Ensure that there is a gap of 12 to 15 seconds between each request.
- Without the owner's permission, you must not utilise the data you have scraped for
commercial purposes.
- Always read the terms of service and abide by them.
- It would be wise to first obtain consent from the person who has placed restrictions on access to their data.
The scraper and the crawler are the two components needed for web scraping. The crawler
is an artificial intelligence system that searches the internet for the specific data needed by
clicking on links. On the other hand, the scraper is a unique tool designed to extract data
from the website. The scraper's architecture can vary significantly depending on the
difficulty and size of the project in order to efficiently and precisely extract the data.
Working of Web Scrapers:
Web scrapers can collect all the information from particular websites or only the specific information a user requests. It is ideal to specify the data you require so that the web scraper quickly retrieves only that information.
Therefore, the URLs are first provided when a web scraper needs to scrape a website. Then,
all of the websites' HTML code is loaded. A more sophisticated scraper might also extract all
of the CSS and Javascript parts. The scraper then extracts the necessary data from this HTML
code and outputs it in the format that the user has chosen. The data is typically stored as an Excel spreadsheet or a CSV file, but it is also possible to save it in other formats, such as a JSON file.
Why Python?
Some of the best Python frameworks and libraries for web scraping and web crawling are Scrapy, Beautiful Soup, and Requests.
Such frameworks are used by a large number of programmers because they provide data extraction tools that are quick and effective. They also offer many useful features, such as support for XPath, HTML parsing, and more, and code written in Python can be deployed quickly.
They provide a variety of debugging tools that enable uninterrupted, secure, and hassle-free development. Compared with other tools, Scrapy and Beautiful Soup are also more user-friendly when it comes to modifying parse trees and navigating websites.
Python also offers the Pandas package, which helps programmers convert all of their gathered data into concrete and practical information. The data can be converted into any necessary format, including .csv, .sav, .omv, and so on.
With a few easy steps, this makes all the data collected by web scraping usable; with other languages, the same work can turn into a fairly intricate process.
Python provides a number of frameworks and libraries that let you gather, filter, and organise data from a wide variety of sources, and users appreciate how customizable and approachable the language is.
Python's ability to interoperate with other programs makes up for whatever run-time shortcomings it may have: it can use their faster engines to achieve shorter run times, which also reduces energy usage.

MapIt.py with the webbrowser module (reference: Automate the Boring Stuff with Python):
The webbrowser module is a practical web browser controller in Python. It offers a high-level interface that lets users display documents hosted on the Web.
The webbrowser module can also be used as a CLI tool. It accepts a URL as input, and the other parameters are optional: when supported, the -n and -t options open the URL in a new browser window or a new tab, respectively.
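The snippet referred to below is not reproduced in this copy; a minimal sketch of it:

import webbrowser

# Open the Google homepage; depending on the browser this opens a new tab or window.
webbrowser.open('https://www.google.com')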

The given code will open a new tab on the browser with the Google webpage.

There are some intriguing possibilities made possible by the open() method. For instance, it
is time-consuming to copy a street address to the clipboard and then open Google Maps to
view a map of it. By creating a quick script that uses the information in your clipboard to
launch the map in your browser automatically, you might eliminate a few steps from this
task. In this manner, all you need to do to load the map is run the script after copying the
address to the clipboard.
Set up a MapIt.py file:
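The original listing is an image that is not reproduced here. Following the referenced book, a skeleton for mapIt.py might start out like this (the body is filled in over the next subsections):

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys
# TODO: Get the address from the command line arguments.
# TODO: Otherwise, get the address from the clipboard.
# TODO: Open the browser at the Google Maps URL for that address.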

Figuring out the URL:
Let’s say you see a place on a website and you wish to open the address of that place on Google Maps. Here, we will look at a Starbucks located in Colaba, Mumbai. We can do this automatically using Python.
The command line arguments will be used by the script rather than the clipboard. The
application will know to use the contents of the clipboard if there are no command-line
inputs.
To begin with, you must decide which URL to use for a given street address. When you use the browser to access http://maps.google.com and search for an address, the URL in the address bar looks like this:
https://www.google.com/maps/place/Terminal+2,+Navpada,+Vile+Parle+East,+Vile+Parle,+
Mumbai,+Maharashtra+400099/@19.0974424,72.8723077,17z/data=!3m1!4b1!4m5!3m4!1
s0x3be7c842b68282f1:0x200d8c72871da4f1!8m2!3d19.0974373!4d72.8745017
There is a lot of additional text in the URL in addition to the address. URLs are frequently
extended by websites in order to track users or to personalise content. However, if you try
simply going to:
https://www.google.com/maps/place/Terminal2,%20Vile%20Parle,%20Mumbai

Proprietary content. All rights reserved. Unauthorized use or distribution prohibited.


This file is meant for personal use by tushar.1801@gmail.com only.
Sharing or publishing the contents in part or full is liable for legal action.
You'll discover that it still displays the right page. So, you can instruct your software to
launch a web browser and navigate to https://www.google.com/maps/place/’your address
string’.
‘your address string’ is the location that has to be mapped.
Handling Command Line Arguments:
You must import the webbrowser module to start the browser and the sys module to read
potential command line arguments after the program's #! shebang line. A list of the
program's filename and command line arguments is kept in the sys.argv variable.
len(sys.argv) evaluates to an integer greater than 1 if this list contains more than just the filename, indicating that command line arguments have been supplied.
Normally, spaces are used to separate command-line arguments; however, in this situation,
you want to treat all of the parameters as a single string. You can give sys.argv to the join()
method, which produces a single string value, because it is a list of strings. You should pass
sys.argv[1:] to remove the first member of the array because you don't want the
programme name in this string. The address variable holds the resultant string that this
expression evaluates to.
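A sketch of the argument handling just described, building on the earlier skeleton:

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys

if len(sys.argv) > 1:
    # Join the command line arguments (excluding the script name) into one address string.
    address = ' '.join(sys.argv[1:])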


If you enter this into the command line to launch the programme:
mapit Terminal 2, Navpada, Vile Parle East, Vile Parle, Mumbai, Maharashtra 400099
The variable address will contain the string 'Terminal 2, Navpada, Vile Parle East, Vile Parle, Mumbai, Maharashtra 400099'.
sys.argv will then contain the values: ['mapIt.py', 'Terminal 2', 'Navpada', 'Vile Parle East', 'Vile Parle', 'Mumbai', 'Maharashtra', '400099'].
Handling the Clipboard content and launching the browser:
The program will assume the address is on the clipboard if there are no command-line arguments. With pyperclip.paste(), you can retrieve the contents of the clipboard and save them in a variable called address. Lastly, call webbrowser.open() to start a web browser with the Google Maps URL.
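Putting the pieces together, the complete script might look like this (pyperclip is a third-party module installed with pip install pyperclip):

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get the address from the command line arguments.
    address = ' '.join(sys.argv[1:])
else:
    # Get the address from the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)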

Request module:
We first install the requests module from the CMD prompt or Terminal window using:
pip install requests
The de facto standard for sending HTTP requests in Python is the requests library. It abstracts the complexities of making requests behind a simple, straightforward API so that you can concentrate on communicating with services and consuming data in your application.
When you submit an HTTP request, HTTP methods like GET and POST identify the action you're attempting to perform. In addition to GET and POST, you'll use a number of other common methods in the course of this unit.

GET is one of the most popular HTTP methods. Using the GET method, you can retrieve data from a specified resource. To send a GET request, invoke requests.get().
It returns a requests.Response object.

The status code is the first piece of information you can get from a response; it tells you the status of the request.
A simple example of a status code is 404 Not Found, which means that the resource you are interested in could not be found. Similarly, a 200 OK status indicates that your request was successful. Your program can make decisions based on these status codes.
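A minimal sketch of sending a GET request and acting on the status code (the URL is just an example):

import requests

response = requests.get('https://www.google.com')

# response is a requests.Response object; its status_code reports the outcome.
if response.status_code == 200:
    print('Request successful')
elif response.status_code == 404:
    print('Resource not found')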

Downloading files from the web using Python:
First, we import the requests module and store the URL of the image in a variable called image_url. We then use the requests.get() function explained above to fetch the image: an HTTP request is sent to the server, and the response is saved in a response object called r. We then save the content as a .jpg file by writing the response content to a new file opened in binary mode.
The image will then be found in the local directory where the script is located.
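A sketch of the steps just described (the image URL is an assumed example):

import requests

# URL of the image to download (an assumed example).
image_url = "https://example.com/sample_image.jpg"

# Send an HTTP GET request; the response object r holds the raw image bytes.
r = requests.get(image_url)

# Write the response content to a new file in binary mode.
with open("downloaded_image.jpg", "wb") as f:
    f.write(r.content)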


Downloading large files:


The HTTP response content (r.content) is nothing but a string containing the file data. Therefore, in the case of huge files, it is not practical to hold all the data in a single string. We make some adjustments to the program to address this issue:
Since a single string cannot sensibly hold all of the file's data, we use the r.iter_content() method to load the data in chunks while specifying the chunk size.

For the request, we pass the stream=True argument to the requests.get() method. When the stream option is set to True, only the response headers are downloaded at first and the connection remains open. This prevents reading the content into memory all at once for lengthy responses; as r.iter_content() is iterated, a fixed-size chunk is loaded each time.
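A sketch of the chunked download just described (the file URL and chunk size are assumed):

import requests

file_url = "https://example.com/large_file.zip"  # assumed example URL

# stream=True downloads only the headers up front and keeps the connection open.
r = requests.get(file_url, stream=True)

with open("large_file.zip", "wb") as f:
    # Load the response body in fixed-size chunks instead of all at once.
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:  # skip keep-alive chunks
            f.write(chunk)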

HTML:
An HTML document is a text file with the .html or .htm extension. It contains text and tags enclosed in angle brackets "<>" that provide the configuration instructions for the web page.
Each HTML document has two sections:
- One part that displays the complete page's content to the browser and cannot be altered directly.
- Another part that contains the page's source code, which we can use to edit the HTML file. We work with this part.
Simply right-click inside the page's text area and select "View source" or "View Frame Source" to view the source code of any HTML document. The page's source code will be displayed in a document opened in the text editor.
There are three tags that describe and provide basic information about the fundamental structure of an HTML document. These tags just frame and organise the HTML file; they have no impact on how the content looks.
<!DOCTYPE>:
A doctype, also known as a document type declaration, is a directive that informs the web
browser of the markup language used to create the current page. The Doctype, which is not
an element or tag, informs the browser of the HTML or other markup language version or
standard that is being used in the page.
A DOCTYPE declaration is shown at the top of a web page before any other elements. Every
HTML document is required to have a document type declaration in accordance with the
HTML specification or standards in order to guarantee that the pages are displayed as
intended.
<!DOCTYPE html> is case insensitive in HTML5.

Since HTML 4.01 was based entirely on the Standard Generalized Markup Language (SGML), the DOCTYPE declaration in that version was used to reference a document type definition (DTD). The DTD specified the SGML rules the browser needed in order to handle the content properly. However, as HTML5 is not based on SGML, there is no requirement for a reference to a DTD in that version of HTML.
A sample code and its corresponding output.

HTML headings:
The headings of a page are specified using HTML heading tags. HTML defines headings at six different levels. These six heading elements are designated h1, h2, h3, h4, h5, and h6, where h1 denotes the highest level and h6 the lowest.
For the primary heading, use <h1> (the largest in size).
Subheadings are designated with an <h2> element; if there are further sections beneath the subheadings, an <h3> element is used, and so on.
For the smallest heading, use <h6>.
How are headings important?
Headings are used by search engines to index the website's structure and organise its
content. They are used to draw attention to key points. They give us useful information and
describe the document's structure.
Example:

Major HTML tags:

Tag: <p> (Paragraph tag)
Purpose: In HTML, a paragraph is defined by the <p> tag. It has an opening and a closing tag, so everything written between <p> and </p> is treated as a paragraph. Even if we omit the closing tag </p>, most browsers still treat the line as a paragraph, although this can lead to unexpected results. Closing the tag is therefore both a wise convention and something we should do.

Tag: <center> (Centre alignment tag)
Purpose: In HTML, the <center> tag is used to align content in the middle of a page. HTML5 does not support this tag; instead of the <center> tag, the CSS text-align property is used to determine how an element is aligned. The tag has no attributes.

Tag: <hr> (Horizontal rule tag)
Purpose: The HTML tag known as "horizontal rule" (abbreviated "hr") is used to insert horizontal lines between sections of a document. There is no need for a closing tag because the tag is empty (unpaired).

Tag: <pre> (Preserve formatting tag)
Purpose: The <pre> tag in HTML defines a block of preformatted text that retains the tabs, line breaks, spaces, and other formatting that web browsers normally ignore. Although the text appears in a fixed-width font inside the <pre> element, it can be styled using CSS. The <pre> tag requires both a start and an end tag.

Tag: &nbsp; (Non-breaking space entity)
Purpose: A non-breaking space, often known as a fixed space, is indicated by the character entity &nbsp;. It can be thought of as taking about double the space of a typical space. It is used to create a space in a line that word wrapping cannot break. For two spaces between words we use &ensp;, and for four spaces we use &emsp;.

Web Scraping using BeautifulSoup:


The Python web scraping library Beautiful Soup provides the BeautifulSoup object. Web scraping is the technique of obtaining data from a website using automated tools to speed up the process. The BeautifulSoup object represents the entire parsed document; for most purposes, you can treat it as a Tag object.
Syntax:
BeautifulSoup(document, parser)
Where,
The parameter document contains HTML or XML document.
The parameter parser contains the name of the parser that is used to parse the document.
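The snippet discussed below is not reproduced in this copy; a minimal sketch of constructing a BeautifulSoup object and printing a tag:

from bs4 import BeautifulSoup

# A small HTML document used purely for illustration.
html_doc = "<html><body><p class='title'>Hello, world!</p></body></html>"

# Parse the document with Python's built-in html.parser.
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.p)       # <p class="title">Hello, world!</p>
print(soup.p.text)  # Hello, world!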

Here, we have printed a tag and created a document using a BeautifulSoup object.
Content can be scraped using a variety of techniques. One of them is Beautiful Soup's select() method, which takes a CSS selector as an argument and pulls content from the specified CSS path.
We first import the packages necessary to use the select() method.
We then create a sample HTML document containing links and text, and parse the HTML before extracting its contents. The html.parser argument is passed to the BeautifulSoup() method. Finally, we extract the contents from the HTML document using the select() method; inside select() we specify a CSS-like selector for the class name that needs to be found.
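A sketch of the select() steps just described (the sample document and class name are assumed):

from bs4 import BeautifulSoup

html_doc = """
<html><body>
  <div class="links">
    <a href="https://example.com" id="first">Example</a>
    <a href="https://example.org" id="second">Example Org</a>
  </div>
</body></html>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# select() accepts a CSS selector and returns a list of matching elements.
for link in soup.select("div.links a"):
    print(link.text, link["href"])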

Use square-bracket notation to extract attributes from elements in Beautiful Soup. For instance, el["id"] returns the value of the id attribute.


However, an error (KeyError) is thrown if the attribute is not present. In that case, you can first check whether the attribute exists.
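A sketch illustrating attribute access and checking for a missing attribute with has_attr():

from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">Link</a>', "html.parser")
el = soup.a

print(el["href"])  # https://example.com

# Accessing a missing attribute raises a KeyError, so check for it first.
if el.has_attr("id"):
    print(el["id"])
else:
    print("This element has no id attribute")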

Scraping URLs and Email IDs from the web (reference: https://medium.com/swlh/how-to-scrape-email-addresses-from-a-website-and-export-to-a-csv-file-c5d1becbd1a0):
We first import the necessary modules; the functionality of each module is mentioned next to its import statement in the code below. We then initialise the variables: a deque is used for saving unscraped URLs, and sets are used for the emails scraped and for the URLs successfully scraped from the website. original_url reads the URL input from the user, 'unscraped' saves the URLs that are still to be scraped, 'scraped' saves the URLs already visited, and 'emails' collects the email addresses. Note that a set does not allow duplicate elements. Next, we move a URL from the unscraped deque to the scraped set; .popleft() returns the element after removing it from the left side of the deque. The URL can be split using urlsplit(), which returns a 5-tuple: addressing scheme, network location, path, query, and fragment identifier. This is how we get the path and base of the website URL.
We then send an HTTP GET request to the website; if any page returns an error, we continue with the next URL.

Using a regular expression, we extract every email address from the response and add them to the emails set. Then we find all linked URLs on the page: BeautifulSoup is used to parse the HTML document and find the linked URLs, which appear in <a href=""> tags that indicate hyperlinks. If a new URL has not been scraped yet and is not already queued, we add it to the queue, excluding links that cannot be scraped. Finally, after scraping emails from the website, they can be exported to a CSV file.
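The original listing spans several screenshots that are not reproduced here; the following is a consolidated sketch of the approach described above (the starting URL and crawl limit are assumed):

import csv
import re
from collections import deque
from urllib.parse import urlsplit

import requests
from bs4 import BeautifulSoup

original_url = "https://example.com"      # starting URL (assumed for illustration)
unscraped = deque([original_url])         # URLs still to be scraped
scraped = set()                           # URLs already scraped
emails = set()                            # email addresses found so far

while unscraped and len(scraped) < 20:    # small crawl limit so the sketch terminates
    url = unscraped.popleft()
    scraped.add(url)

    # urlsplit() returns the scheme, network location, path, query and fragment.
    parts = urlsplit(url)
    base_url = f"{parts.scheme}://{parts.netloc}"

    try:
        response = requests.get(url, timeout=10)
    except requests.RequestException:
        continue  # skip pages that return errors and move on to the next URL

    # Extract every email address in the response with a regular expression.
    emails.update(re.findall(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]+", response.text))

    # Find linked URLs inside <a href=""> tags and queue the ones not seen yet.
    soup = BeautifulSoup(response.text, "html.parser")
    for anchor in soup.find_all("a", href=True):
        link = anchor["href"]
        if link.startswith("/"):
            link = base_url + link
        if link.startswith("http") and link not in scraped and link not in unscraped:
            unscraped.append(link)

# Export the collected email addresses to a CSV file.
with open("emails.csv", "w", newline="") as f:
    csv.writer(f).writerows([[email] for email in emails])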


Scraping images using Python:
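The snippet discussed below is not reproduced in this copy; a minimal sketch of it (the target URL is assumed; the original example's site returned an '.svg' logo link):

import requests
from bs4 import BeautifulSoup

url = "https://www.python.org/"  # assumed example site
r = requests.get(url)

# Parse the page and locate <img> tags.
soup = BeautifulSoup(r.text, "html.parser")

# Print the source links of the images found (e.g., the site's logo).
for img in soup.find_all("img"):
    print(img.get("src"))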

In the above code, we first import the necessary modules, that is, requests and BeautifulSoup. Then we send a request to the URL using requests.get(). Further, we pass the response into the BeautifulSoup() function and use the 'img' tag to find images. The output includes the link to the site's logo (an '.svg' file in the original example).
Likewise, we can also use the urlopen module to scrape web images.
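A sketch of the urlopen variant (same assumed URL):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.python.org/"  # assumed example site
page = urlopen(url)

# The file-like response from urlopen() can be passed straight to BeautifulSoup.
soup = BeautifulSoup(page, "html.parser")

for img in soup.find_all("img"):
    print(img.get("src"))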

The same approach is followed: first the modules are imported, the URL is read with urlopen(), the response is passed into the BeautifulSoup() function, and then the 'img' tag is used to find the images.
Advanced Web Scraping Techniques:
A full tool for scraping is provided by the web crawling framework known as Scrapy. In
Scrapy, we design Python classes called Spiders that specify how a specific site or set of
related sites will be scraped. Therefore, Scrapy is a great option if you want to construct a
reliable, concurrent, scalable, large-scale scraper. Additionally, Scrapy includes a number of
middlewares for cookies, redirects, sessions, caching, and other issues that you could
encounter.
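As a rough illustration of the Spider concept (the target site and selectors are assumptions, not part of the original material), a minimal Scrapy spider might look like this:

import scrapy

class QuotesSpider(scrapy.Spider):
    """A tiny spider that yields the text of each quote on the page."""
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # CSS selectors extract the quote text from each quote block.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}

Such a spider can be run with the scrapy runspider command, for example: scrapy runspider quotes_spider.py -o quotes.json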
Selenium WebDriver is the ideal tool to use for complex websites or pages rendered with a lot of JavaScript. Selenium, also referred to as a web driver, is a program that automates web browsers. You can use it to launch an automated Google Chrome or Mozilla Firefox window that visits a URL and follows links. It is not, however, as efficient as the tools we have already covered. When all other avenues for web scraping are shut down but you still need the information that matters to you, use this tool.
Choosing the right tool - Beginners who wish to begin with easy web scraping projects
should use Beautifulsoup. Scrapy is effective, especially for big projects where performance
and bandwidth are important considerations. Despite having a challenging learning curve,
Scrapy can handle a wide range of project requirements. Selenium can be a great tool for
working with dynamic front-ends because of its extensive feature set and smooth learning
curve.
Installing Scrapy
If you're using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows, and macOS.
Scrapy and its dependencies can be installed from PyPI using:
pip install Scrapy
Note that lxml is an efficient HTML and XML parser.
You can also install Scrapy with:
conda install -c conda-forge scrapy

Installing Selenium
We use pip to install the selenium package. Since pip is included with Python, one can simply run:
pip install selenium
For Selenium to communicate with the chosen browser, a driver is needed. For example, Firefox requires geckodriver to be installed. Ensure that it is on your PATH, for instance by placing it in /usr/bin or /usr/local/bin.
If you skip this step, you'll get an error: selenium.common.exceptions.WebDriverException: Message: 'geckodriver' executable needs to be in PATH.

Dynamic pages and client side rendering:


Despite the fact that websites are becoming more interactive and user-friendly, web
crawlers are suffering as a result.
Nowadays, dynamic coding techniques are widely used on modern websites, which are not
at all crawler friendly. Examples include slow image loading, infinite scrolling, or items
loaded through AJAX requests, all of which make it challenging for Googlebot to crawl.
JavaScript is widely used today to load dynamic content on webpages.
Viewing the page source will reveal whether a web page uses asynchronous loading or is dynamic (if you right-click on the page, you will find the option View Page Source). If you search the source for the content you're looking for but cannot find it, it is likely that JavaScript is used to render the information.
Web scrapers have a difficult time with modern websites because many pages are rendered with JavaScript. One of the most popular tools for web UI automation is Selenium WebDriver. It enables the automatic execution of tasks carried out in a web browser window, such as opening a website, completing forms (including interacting with text boxes, radio buttons, and drop-downs), submitting the forms, browsing web pages, responding to pop-up windows, and so on.
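A rough sketch of driving a browser with Selenium (the site and element name are assumptions; geckodriver must be on PATH, as noted above):

from selenium import webdriver
from selenium.webdriver.common.by import By

# Launch an automated Firefox window.
driver = webdriver.Firefox()
driver.get("https://www.python.org/")

# After the page (including any JavaScript-rendered content) loads, locate elements as usual.
search_box = driver.find_element(By.NAME, "q")
search_box.send_keys("web scraping")
search_box.submit()

print(driver.title)
driver.quit()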
Authentication and authentication handling:
Occasionally, private data that is only accessible after logging in to the website has to be scraped.
For simpler websites, we can send a POST request with the user's login information and
keep it in the cookie. However, there may be other issues as well, such as:
In some cases, you'll also need to enter CSRF_TOKEN along with your username and
password. It's also possible that you will need to provide additional header information
before sending a POST request.

Giving a user authorization to access a certain resource is referred to as authentication. Since not everyone can access data from every URL, authentication is the main requirement. Authentication data is typically sent via an Authorization header or a server-defined custom header. Your username and password should be substituted for "user" and "pass". The server will verify the request's credentials and either respond with code 200 or return an error such as 403.
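A sketch of basic authentication with requests (the endpoint is assumed; substitute your own username and password for "user" and "pass"):

import requests

# Send the credentials with the request; the server validates them and responds accordingly.
response = requests.get("https://httpbin.org/basic-auth/user/pass", auth=("user", "pass"))

print(response.status_code)  # 200 if authentication succeeds, an error code such as 401/403 otherwise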

IP Blocking:
Similar to physical addresses, IP addresses reveal details about the device and the network
being used to connect.
While you'll typically have the same IP address when connecting devices through your home
network, this address changes if you're using another network outside of your home. It can
also change if you reboot your router or switch Internet providers. IP addresses are not
static, unlike physical addresses.
IPv4 addresses, the most popular type of IP address, consist of four sets of up to three digits each, separated by dots.
A machine that serves as a gateway between your computer and the internet is referred to
as a proxy server or simply a "proxy." Your requests are sent through the proxy when you
are utilising one. The website you are scraping is not directly exposed to your IP address.
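A sketch of routing a request through a proxy with requests (the proxy address is a placeholder):

import requests

# Placeholder proxy address; replace with a real proxy host and port.
proxies = {
    "http": "http://203.0.113.10:8080",
    "https": "http://203.0.113.10:8080",
}

# The request is routed through the proxy, so the target site sees the proxy's IP address.
response = requests.get("https://httpbin.org/ip", proxies=proxies)
print(response.text)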
Rotating your identification, or IP address, on a regular basis is the best strategy to circumvent IP banning. To avoid having your spider blocked, it is always preferable to use proxy and VPN services, rotate IP addresses, and take other precautions. This will help reduce the chance of getting stuck and banned.
Rotating IP addresses is a simple task if you use Scrapy. You can choose to incorporate the
proxies in your spider using Scrapy.
You can easily integrate one of the many IP blocking APIs, like scraperapi, into your project
for web scraping.
