Data Collection
Web Scraping
Beautiful Soup
Scrapy
Expertise
What methods and procedures will be used to collect, store, and process the information?
PRE-DATA COLLECTION
Data Collection is the systematic approach to gathering relevant information from a variety of sources, depending on the problem statement.
DATA COLLECTION
PRIMARY VS. SECONDARY
Inconsistent Data
When working with multiple data sources, the same information may appear in incompatible forms.
The differences can be in formats, units, or occasionally spellings.
Data Downtime
Schema modifications and migration problems are just two causes of data downtime.
Data downtime must be continuously monitored and reduced through automation.
Data engineers spend about 80% of their time updating, maintaining, and guaranteeing the integrity of
the data pipeline.
Ambiguous Data/ Inaccurate Data
Even with thorough oversight, some errors can still occur in massive databases or data lakes.
Hidden Data
Spelling mistakes can go unnoticed, formatting problems can occur, and column headings can be
misleading.
Duplicate Data
Too Much Data
Data scientists, data analysts, and business users devote about 80% of their time to finding and organizing the
appropriate data.
As data volume grows, other data-quality problems become more serious, particularly
when dealing with streaming data and large files or databases.
Relevant Data
CHALLENGES?
Web scraping refers to the extraction of data from a website.
The information is collected and then exported into a format more useful for the user,
be it a spreadsheet or an API.
WEB SCRAPING
Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide
idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of
work.
Documentation -> https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Pak Wheels Data
BEAUTIFUL SOUP
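The navigate-and-search API described above can be sketched as follows. The HTML snippet is invented for illustration, and bs4 must be installed (`pip install beautifulsoup4`):

```python
# A minimal Beautiful Soup sketch on a hardcoded HTML snippet
# (the markup here is invented for illustration).
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1>Listings</h1>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Navigate: grab the first heading's text.
title = soup.h1.get_text()

# Search: find every <li> with class "item" and pull out link text and href.
links = [(li.a.get_text(), li.a["href"])
         for li in soup.find_all("li", class_="item")]

print(title)   # Listings
print(links)   # [('First', '/a'), ('Second', '/b')]
```

On a real page the `html` string would come from an HTTP response body; Beautiful Soup itself does not fetch pages.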
Scrapy is an open-source and collaborative web crawling framework for Python. It's
used for extracting data from websites and can be used for a wide range of
purposes such as data mining, data processing, etc.
Scrapy works by sending HTTP requests to a website's server, and then it parses
the server's response to extract the data. It then saves the data in a structured
format such as CSV or JSON.
Scrapy is fast and efficient: it can handle multiple concurrent requests and is easy
to use. It also has built-in support for common web scraping tasks such
as logging in and handling cookies.
SCRAPY
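The fetch → parse → export cycle described above can be sketched with the standard library. In Scrapy proper this logic lives in a `Spider` subclass's `parse()` callback and the feed exports; here a canned HTML string stands in for a real HTTP response so the sketch runs offline, and the URLs and markup are invented:

```python
# Sketch of the fetch -> parse -> export cycle that Scrapy automates.
# A canned HTML "response" replaces a real HTTP fetch; markup is invented.
import csv
import io
import json
from html.parser import HTMLParser

RESPONSE_BODY = """
<html><body>
  <a href="/page/1">Page one</a>
  <a href="/page/2">Page two</a>
</body></html>
"""

class LinkParser(HTMLParser):
    """Collects link items, roughly what a spider's parse() would yield."""
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.items.append({"url": href})

# "Parse the server's response" into structured items.
parser = LinkParser()
parser.feed(RESPONSE_BODY)

# Export the items as JSON and CSV, as Scrapy's feed exports do.
as_json = json.dumps(parser.items)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["url"])
writer.writeheader()
writer.writerows(parser.items)
as_csv = buf.getvalue()

print(as_json)   # [{"url": "/page/1"}, {"url": "/page/2"}]
```

Scrapy adds what this sketch omits: concurrent request scheduling, retries, cookies, and login handling.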
Information to be scraped: "Book names and prices"
BOOKBERRY.PK
CODE
CSV
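The name-and-price extraction ending in a CSV might look like the sketch below. The page structure is assumed (the tags and class names are invented, not inspected from bookberry.pk), and the markup is hardcoded so the example runs offline:

```python
# Hypothetical sketch: scrape book names and prices, then export to CSV.
# The markup stands in for a fetched page; a real scrape would first
# inspect the site's actual tags and class names.
import csv
from bs4 import BeautifulSoup

page = """
<div class="book"><h2>Book A</h2><span class="price">Rs. 500</span></div>
<div class="book"><h2>Book B</h2><span class="price">Rs. 750</span></div>
"""

soup = BeautifulSoup(page, "html.parser")
rows = [{"name": div.h2.get_text(strip=True),
         "price": div.find("span", class_="price").get_text(strip=True)}
        for div in soup.find_all("div", class_="book")]

# Export the structured rows to a CSV file.
with open("books.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

print(rows)
```

In practice the `page` string would be fetched first (for example with the `requests` library) before being handed to Beautiful Soup.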
Thank You.
info@folio3.com www.folio3.com