
Web Scraping With Python

SUBMITTED TO AMITY UNIVERSITY KOLKATA IN PARTIAL FULFILMENT OF THE


REQUIREMENTS FOR THE AWARD OF THE DEGREE OF BCA+MCA (Dual)
AMITY INSTITUTE OF INFORMATION TECHNOLOGY, KOLKATA

BY: Satyam Kumar

Enrollment No.: A91449519010
Aditya Sagar
Enrollment No.: A91449519008
7th Semester
Under the supervision of
Prof. Dhrubashish Sarkar
DECLARATION

 
 We, Satyam Kumar and Aditya Sagar, students of AMITY UNIVERSITY KOLKATA, hereby declare that the
Seminar / Term Paper / Summer Internship / Dissertation titled “Web Scraping Using Python”, which is
submitted to the Department of Information Technology, Amity Institute of Information Technology,
KOLKATA, in partial fulfilment of the requirements for the award of the degree of BCA+MCA (Dual), has not
previously formed the basis for the award of any degree, diploma or other similar title or recognition.

Kolkata

Date: 29/11/2022 Satyam Kumar


Aditya Sagar
ACKNOWLEDGEMENT

 We would like to express our deepest appreciation to all those who supported us in the
completion of this report. We owe special gratitude to our project guide,
Prof. Dhrubashish Sarkar, whose encouragement and guidance in developing our
understanding of information technology helped us coordinate the project and write
the report.
At last, we would like to thank our parents and mentors, who constantly encouraged us to
complete the report.
CONTENT

 Introduction
 Approaches to Web Scraping in Python
 Creating a spider using ScraPy
 ScraPy canned spider templates
 Libraries used to create a custom spider
 Requests
 LXML
 SQL Alchemy
 Celery
Python is great for collecting data from the
web:
 Python is unique as a programming language, in that it has very
strong communities for both web programming and data
management/analysis.
 As a result, your entire computational research pipeline can be
integrated.
 Instead of having separate applications for data collection, storage,
management, analysis and visualization, you have one code base.
Approaches to Web Scraping in Python

 Python's two main methods for web scraping are as follows:


 1. Use ScraPy to modify a canned spider
 2. Use requests, lxml, sqlalchemy, and celery to build a fully
custom spider.
 ScraPy is generally the best option unless you're trying to do
something really unusual, like distributed, high throughput
crawling.
How should I manage my data?

 Prior to discussing the intricacies of each method, we must first discuss our data
management plan. Keeping your data in a PostgreSQL database is generally the right answer.
All operating systems support PostgreSQL, which is also simple to set up.
 Additionally, managed PostgreSQL databases are provided via research computing.
 For each scraping job, you should make a table containing columns for the resource's URL,
the time and date it was scraped, and the different data fields you are collecting. You can
construct a JSON column and store the data there if the information you're gathering is
difficult to fit into a normal field, as in the sketch below.
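
 For illustration, here is a minimal sketch of such a table defined with SQL Alchemy. It assumes a
PostgreSQL database; the table name, column names and connection string are placeholders, not part
of any real schema.

import sqlalchemy
from sqlalchemy.dialects.postgresql import JSONB

metadata = sqlalchemy.MetaData()

scrape_results = sqlalchemy.Table(
    "scrape_results", metadata,
    sqlalchemy.Column("id", sqlalchemy.Integer, primary_key=True),
    sqlalchemy.Column("url", sqlalchemy.Text, nullable=False),             # the resource's URL
    sqlalchemy.Column("scraped_at", sqlalchemy.DateTime, nullable=False),  # when it was scraped
    sqlalchemy.Column("title", sqlalchemy.Text),                           # an example data field
    sqlalchemy.Column("extra", JSONB),                                     # JSON column for data that doesn't fit a normal field
)

engine = sqlalchemy.create_engine("postgresql://user:pass@host/database")
metadata.create_all(engine)   # creates the table if it doesn't already exist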
Creating a spider using ScraPy

 The first thing to do after installing ScraPy is to start a new project:


-> $ scrapy startproject <your_project_name>

 Next, change directory to the newly created project directory, and create a spider:
-> $ cd <your_project_name>

-> $ scrapy genspider <name> <domain>


Creating a spider using ScraPy

 After that is complete, ScraPy will have created a directory named after your project, and
inside it a spiders directory containing a Python file named after the spider you created. It
will look something like this when you open it up:

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        pass
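
 As a rough sketch, the skeleton above could be filled in like this; the CSS selectors and field
names are assumptions for illustration, not taken from any real site.

import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        # Extract a couple of example fields from the page (selectors are hypothetical)
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }
        # Follow every link on the page and parse it with the same method
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse)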
ScraPy canned spider templates

 Building scrapers for popular site types and data formats is simple with ScraPy
thanks to its collection of spider templates, which include:
 CrawlSpider is a general-purpose spider that, instead of requiring you to write extraction
code, lets you provide rules for following links and extracting items (see the sketch after
this list).
 XMLFeedSpider is a spider for crawling XML feeds. CSVFeedSpider is similar to
XMLFeedSpider, but for CSV feeds.
 A spider called SitemapSpider scans a website using the links in a sitemap.
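
 A minimal sketch of a CrawlSpider is shown below; the URL pattern, callback name and extracted
fields are assumptions for illustration.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyCrawlSpider(CrawlSpider):
    name = 'example_crawl'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    # Rules say which links to follow and which callback extracts items;
    # the /articles/ pattern is only an example
    rules = (
        Rule(LinkExtractor(allow=r'/articles/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }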
Using ScraPy spiders

 Once you've set up your ScraPy project, the next step (from the project directory) is to
test and run it.
 Run the following command to test your extraction logic:
scrapy shell --spider=<spider> <url>. This will start an interactive shell session, so you can
verify that your logic is extracting data and links from the page as intended.
 Once you're satisfied that your spider is working properly, you can crawl a site by
running the command: scrapy crawl <spider> -o results.json.
ScraPy shell

 The ScraPy shell is a good place to verify that you've written your spider
correctly.
 From the shell you can test your parse function and extraction code like so:
# Produce a list of all links and data extracted from the target URL
list(spider.parse(response))

# You can also test XPath selectors here
for anchor_text in response.xpath("//a/text()").extract():
    print(anchor_text)
What if ScraPy doesn't do what you need?

 First, ScraPy is quite configurable, so it is worth your time to
double-check that it really can't do what you need. If you can, try to avoid
building something entirely new.
 The next step is a custom spider if you are positive that ScraPy is
not up to the task.
 You will need Requests, LXML, SQL Alchemy, and perhaps Celery
for this.
Libraries used to create a custom spider

 The best HTTP request library for Python is called Requests. It will be used to retrieve web
pages via GET, POST, etc. requests.
 The best Python library for processing and parsing HTML/XML is LXML. It will be used to
extract information and links from the retrieved web pages.
 The best database library for Python is called SQL Alchemy. It will be used to insert
scraping results into the database.
 Celery is a library for task queues. You won't need it unless you want many spiders to
execute on the same domain at once.
Requests
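
 As an illustrative sketch (the URL is only a placeholder), retrieving a page with Requests looks
like this; the response object r is reused in the LXML example that follows.

import requests

# Fetch a page with an HTTP GET request; the URL is a placeholder
r = requests.get('http://www.example.com/')
r.raise_for_status()   # raise an exception if the server returned an HTTP error code
html = r.text          # the page body as text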
LXML

 LXML is a sizable, potent library that can be challenging to learn. To fully benefit from
LXML's capabilities, you must become familiar with XPath.

from lxml import etree

# here r is the Requests response object we got in the previous example
document = etree.fromstring(r.text, parser=etree.HTMLParser())   # parse as HTML
anchor_tags = document.xpath("//a")
hrefs = [a.attrib.get("href") for a in anchor_tags]
anchor_text = [a.xpath("text()") for a in anchor_tags]
SQL Alchemy

 A sizable, potent library for Python database interaction is called SQL Alchemy. Although
there is a lot to learn, at first you only need to be able to insert data.
import sqlalchemy

engine = sqlalchemy.create_engine("postgresql://user:pass@host/database")

# Reflect the existing tables from the database
metadata = sqlalchemy.MetaData()
metadata.reflect(bind=engine)

# engine.begin() opens a connection and commits when the block exits
with engine.begin() as connection:
    connection.execute(metadata.tables["my_table"].insert().values(column_1="random", column_2="data"))
Celery

 You can skip this section: unless numerous spiders are active on a site at once,
Celery is not necessary. Celery gives you access to a task queue.
 From this queue you can scrape all the pages you want. The task
queue holds jobs that each scrape a single page, collect data from it, and
add any links discovered as new jobs (sketched after this list).
 In one area of your application you add tasks to the task queue, and a
separate worker process completes tasks from the queue.
 A separate RabbitMQ (or comparable) server is necessary for Celery.
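
 A minimal sketch of this pattern is shown below; the broker URL is a placeholder that assumes a
RabbitMQ server is running, and the data-storage step is left as a comment.

import celery
import requests
from lxml import etree

app = celery.Celery('spider', broker='amqp://guest@localhost//')   # placeholder broker URL

@app.task
def scrape_page(url):
    # Each job scrapes a single page...
    r = requests.get(url)
    document = etree.fromstring(r.text, parser=etree.HTMLParser())
    # ...collects data from it (insert it into the database here)...
    # ...and adds any links discovered as new jobs on the queue
    for href in document.xpath('//a/@href'):
        scrape_page.delay(href)

 A separate Celery worker process then pulls these jobs off the queue and runs them.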
Conclusion and future scope

 The significance of web scraping is growing as more and more data is being
added to the internet. Today, a lot of businesses provide their customers with
specialized web scraping tools that collect data from the internet and
organize it into valuable and intelligible form. This saves valuable human resources
from having to manually visit each website and gather the information. For each
unique website, web scrapers are created and coded, and crawlers perform broad
scraping. More coding is needed to scrape data from a website with a complex
structure than one with a simple one. Web scraping has a bright future and will
become increasingly important for every type of business.
References

 The Web scraping Wikipedia page has a concise definition of many concepts discussed here.
 The School of Data Handbook has a short introduction to web scraping, with links to resources e.g. for data
journalists.
 This blog has a discussion on the legal aspects of web scraping.
 Scrapy documentation
 morph.io is a cloud-based web scraping platform that supports multiple frameworks, interacts with GitHub
and provides a built-in way to save and share extracted data.
 import.io is a commercial web-based scraping service that requires little coding.
 Software Carpentry is a non-profit organization that runs learn-to-code workshops worldwide. All lessons are
publicly available and can be followed independently. This lesson is heavily inspired by Software Carpentry.
 Data Carpentry is a sister organization of Software Carpentry focused on the fundamental data management
skills required to conduct research.
 Library Carpentry is another Software Carpentry spinoff focused on software skills for librarians.
