Data Collection
Web Scraping
Beautiful Soup
Scrapy
Expertise
What methods and procedures will be used to collect, store, and process the information?
PRE-DATA COLLECTION
Data Collection is the systematic approach to gathering relevant information from a variety of sources, depending on the problem statement.
DATA COLLECTION
PRIMARY VS. SECONDARY
Inconsistent Data
When working with multiple data sources, the same information may appear in incompatible forms.
The differences can be in formats, units, or occasionally spellings.
Data Downtime
Schema modifications and migration problems are just two causes of data downtime.
Data downtime must be continuously monitored and reduced through automation.
Data engineers spend about 80% of their time updating, maintaining, and guaranteeing the integrity of
the data pipeline.
Ambiguous Data/ Inaccurate Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes.
Hidden Data
Spelling mistakes can go unnoticed, formatting problems can occur, and column headings can be
misleading.
Duplicate Data
Too Much Data
Data scientists, data analysts, and business users devote about 80% of their time to finding and organizing the
appropriate data.
As data volume grows, other data-quality problems become more serious, particularly
when dealing with streaming data and large files or databases.
Relevant Data
CHALLENGES?
Web scraping refers to the extraction of data from a website.
The information is collected and then exported into a format more useful for the user,
be it a spreadsheet or an API.
WEB SCRAPING
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide
idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of
work.
Documentation -> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pak Wheels Data
BEAUTIFUL SOUP
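The navigate-and-search API described above can be sketched as follows. The HTML snippet is invented for illustration, and bs4 must be installed (`pip install beautifulsoup4`):

```python
# A minimal Beautiful Soup sketch on a hardcoded HTML snippet
# (the markup here is invented for illustration).
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Listings</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: grab the first heading's text.
title = soup.h1.get_text()

# Search: find every <li> with class "item" and pull out link text and href.
links = [(li.a.get_text(), li.a["href"])
         for li in soup.find_all("li", class_="item")]

print(title)   # Listings
print(links)   # [('First', '/a'), ('Second', '/b')]
```

On a real page the `html` string would come from an HTTP response body; Beautiful Soup itself does not fetch pages.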
Scrapy is an open-source and collaborative web crawling framework for Python. It's
used for extracting data from websites and can be used for a wide range of
purposes such as data mining, data processing, etc.
Scrapy works by sending HTTP requests to a website's server, and then it parses
the server's response to extract the data. It then saves the data in a structured
format such as CSV or JSON.
Scrapy is fast and efficient: it can handle multiple concurrent requests and is easy
to use. It also has built-in support for common web scraping tasks such
as logging in and handling cookies.
SCRAPY
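The fetch → parse → export cycle described above can be sketched with the standard library. In Scrapy proper this logic lives in a `Spider` subclass's `parse()` callback and the feed exports; here a canned HTML string stands in for a real HTTP response so the sketch runs offline, and the URLs and markup are invented:

```python
# Sketch of the fetch -> parse -> export cycle that Scrapy automates.
# A canned HTML "response" replaces a real HTTP fetch; markup is invented.
import csv
import io
import json
from html.parser import HTMLParser

RESPONSE_BODY = """
<html><body>
  <a href="/page/1">Page one</a>
  <a href="/page/2">Page two</a>
</body></html>
"""

class LinkParser(HTMLParser):
    """Collects link items, roughly what a spider's parse() would yield."""
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.items.append({"url": href})

# "Parse the server's response" into structured items.
parser = LinkParser()
parser.feed(RESPONSE_BODY)

# Export the items as JSON and CSV, as Scrapy's feed exports do.
as_json = json.dumps(parser.items)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url"])
writer.writeheader()
writer.writerows(parser.items)
as_csv = buf.getvalue()

print(as_json)   # [{"url": "/page/1"}, {"url": "/page/2"}]
```

Scrapy adds what this sketch omits: concurrent request scheduling, retries, cookies, and login handling.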
Information to be scraped: "Book names and prices"
BOOKBERRY.PK
CODE
CSV
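The name-and-price extraction ending in a CSV might look like the sketch below. The page structure is assumed (the tags and class names are invented, not inspected from bookberry.pk), and the markup is hardcoded so the example runs offline:

```python
# Hypothetical sketch: scrape book names and prices, then export to CSV.
# The markup stands in for a fetched page; a real scrape would first
# inspect the site's actual tags and class names.
import csv
from bs4 import BeautifulSoup

page = """
<div class="book"><h2>Book A</h2><span class="price">Rs. 500</span></div>
<div class="book"><h2>Book B</h2><span class="price">Rs. 750</span></div>
"""

soup = BeautifulSoup(page, "html.parser")
rows = [{"name": div.h2.get_text(strip=True),
         "price": div.find("span", class_="price").get_text(strip=True)}
        for div in soup.find_all("div", class_="book")]

# Export the structured rows to a CSV file.
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

print(rows)
```

In practice the `page` string would be fetched first (for example with the `requests` library) before being handed to Beautiful Soup.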
Thank You.
info@folio3.com www.folio3.com