
Data Collection
Zakaria KERKAOU
Zakaria.kerkaou@e-polytechnique.ma
Sources of Data
Company data:
• Collected by companies.
• Helps them make data-driven decisions.

Open data:
• Free open data sources.
• Can be used, shared, and built on by anyone.
Company Data
• Web events.
• Survey data.
• Customer data.
• Logistics data.
• Financial transactions.
Company Data
Example: Web events
When you visit a web page or click on a link, this information is usually
tracked by companies in order to calculate conversion rates or monitor the
popularity of different pieces of content.
User_id | event_name   | Time/date           | Links_clicked
4856    | Page_visited | 2022-10-05 11:01:02 | Link1, Link2, Link3
Company Data
Example: Survey Data
Data can also be collected by asking people for their opinions in
surveys.

• Face-to-face interviews.
• Online questionnaires.
• Focus groups.
Open Data
• API (Application Programming Interface).
• Public records.
Public data APIs
• Using APIs, we can request data from third-party companies over the internet.
• Many companies have public APIs that let anyone access their data.
• Some APIs:
  • YouTube Data API.
  • Facebook Graph API.
  • MediaWiki Action API (Wikipedia).
  • OpenWeather API.
  • Many more.
Public data APIs
Example: an API call to the OpenWeather API, and the JSON response it returns (a sketch follows below).
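The call and response appeared as screenshots on the original slide. As a stand-in, here is a minimal sketch of such a call with the requests library; the city and API key are placeholders (a free key comes from openweathermap.org):

```python
import requests

# Sketch of an OpenWeather "current weather" call.
url = "https://api.openweathermap.org/data/2.5/weather"
params = {"q": "Rabat,MA", "appid": "YOUR_API_KEY", "units": "metric"}  # placeholders

response = requests.get(url, params=params)
print(response.status_code)  # 200 if the call succeeded
print(response.json())       # the JSON response as a Python dict
```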
Public Record
Public records are another great way of gathering data. They can be collected and shared by:
• International organisations: World Bank, United Nations, World Health Organization.
• National statistical offices: censuses, surveys.
• Governmental agencies: weather, environment, population.
Public Record
For example: www.covidmaroc.ma
Application Programming Interface
Python APIs
To use an API, you make a request to a remote web server and retrieve the data you need.

But why use an API instead of a static CSV dataset you can download from the web? APIs are useful in the following cases:
• The data is changing quickly. An example of this is stock price data.
• You want a small piece of a much larger set of data. Reddit comments are one example.
• There is repeated computation involved. Spotify has an API that can tell you the genre of a piece of music.

In cases like the ones above, an API is the right solution.


What is an API?
An API, or Application Programming Interface, is a server that you can use to retrieve data from and send data to using code. APIs are most commonly used to retrieve data, and that will be the focus in our case.
API Documentation
When we work with APIs, it's important to consult the documentation to make sure our requests are successful. Documentation can seem intimidating at first, but it gets easier the more you use it.
API Requests in Python
To make a GET request, we'll use the requests.get() function, which most of the time requires one argument: the URL we want to make the request to.

Status codes are returned with every request that is made to a web server. Status codes indicate information about what happened with a request. Here are some codes that are relevant to GET requests:
• 200: Everything went okay, and the result has been returned (if any).
• 301: The server is redirecting you to a different endpoint.
• 400: The server thinks you made a bad request.
• 401: The server thinks you're not authenticated.
• 403: The resource you're trying to access is forbidden.
• 404: The resource you tried to access wasn't found on the server.
• 503: The server is not ready to handle the request.
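A minimal sketch of a GET request and its status code (the endpoint is from the Open Notify API introduced on the next slide):

```python
import requests

# Make a GET request and inspect the status code.
response = requests.get("http://api.open-notify.org/astros.json")
print(response.status_code)  # 200 means everything went okay
```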
Requests in Python
The documentation for this specific API tells us that the response we'll get is in JSON format. In the next section we'll learn about JSON, but first let's use the response.json() method to see the data we received back from the API.

Example:
The Open Notify API gives access to data about the International Space Station. It's a great API for learning because it has a very simple design and doesn't require authentication. We'll look at APIs that require authentication later on.
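A sketch of parsing the response (astros.json, which lists the people currently in space, is one of Open Notify's endpoints):

```python
import requests

response = requests.get("http://api.open-notify.org/astros.json")

# response.json() parses the JSON body into Python objects.
data = response.json()
print(data["number"])  # the number of people currently in space
```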
JSON Data in Python
JSON (JavaScript Object Notation) is the language of APIs. JSON is a way to encode data structures that ensures they are easily readable by machines.
You can think of JSON as a combination of Python dictionaries, lists, strings, and numbers, represented as a string.
JSON Data in Python
Python has great JSON support with the json package. The json package is part of the standard library, so we don't have to install anything to use it. We can convert lists and dictionaries to JSON, and convert strings back to lists and dictionaries.

The json library has two main functions:
• json.dumps(): takes in a Python object and converts (dumps) it to a string.
• json.loads(): takes a JSON string and converts (loads) it to a Python object.

The dumps() function is particularly useful because we can use it to print a formatted string, which makes the JSON output easier to read.
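A minimal sketch of both functions:

```python
import json

person = {"name": "Sara", "languages": ["Python", "SQL"]}  # hypothetical data

# dumps(): Python object -> JSON string; indent produces a formatted string.
json_string = json.dumps(person, indent=4)
print(json_string)

# loads(): JSON string -> Python object.
restored = json.loads(json_string)
print(restored["languages"][0])  # "Python"
```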
API with Query Parameters
To use query parameters with an API request, we can add the params argument to our request.

Or we can insert the parameters directly into the URL.
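Both approaches are sketched below. The iss-pass.json endpoint and its lat/lon parameters follow the classic Open Notify example; that endpoint has since been retired, so treat it purely as an illustration of the pattern:

```python
import requests

# 1. Pass a dict via params; requests builds the query string for us.
params = {"lat": 33.57, "lon": -7.59}
response = requests.get("http://api.open-notify.org/iss-pass.json", params=params)

# 2. Insert the parameters directly into the URL.
response = requests.get("http://api.open-notify.org/iss-pass.json?lat=33.57&lon=-7.59")

print(response.url)  # both forms produce the same request URL
```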


Web Scraping
Using the browser, requests, and Beautiful Soup.

Introduction to web scraping:
• Inspect the data source.
• Scrape HTML content from a page.
• Parse HTML code with Beautiful Soup.
What is web scraping
In the broadest sense, web scraping is any kind of gathering of information from the internet. But generally, when we talk about web scraping we mean the automated gathering of information from the web: writing code that fetches information from the internet.
Some websites offer data sets that are downloadable in CSV format, or accessible via an Application Programming Interface (API). But many websites with useful data don't offer these convenient options.
How Does Web Scraping Work?
When we scrape the web, we’re essentially doing the same thing
a web browser does.
We write code that sends a request to the server that’s hosting
the page we specified. The server will return the source code —
HTML, mostly — for the page (or pages) we requested.
Is Web Scraping Legal?
Unfortunately, there's no cut-and-dried answer here. Some websites explicitly allow web scraping. Others explicitly forbid it. Many websites don't offer any clear guidance one way or the other.
Before scraping any website, we should look for a terms and
conditions page to see if there are explicit rules about scraping. If
there are, we should follow them. If there are not, then it becomes
more of a judgement call.
Web Scraping Good Practices
• Never scrape more frequently than you need to.
• Consider caching the content you scrape so that it's only downloaded once.
• Build pauses into your code using functions like time.sleep() to keep from overwhelming servers with too many requests too quickly; see the sketch below.
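A minimal sketch of a polite scraping loop (the URLs are hypothetical):

```python
import time
import requests

urls = ["https://example.com/page1", "https://example.com/page2"]  # hypothetical

for url in urls:
    response = requests.get(url)
    # ... process response.content here ...
    time.sleep(2)  # pause between requests so we don't overwhelm the server
```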
How Does Web Scraping Work?
1. Request the content (source code) of a specific URL from the server.
2. Download the content that is returned.
3. Identify the elements of the page that are part of the table we want.
4. Extract and (if necessary) reformat those elements into a dataset we can analyse or use in whatever way we require.
HTML Page Structure
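The bullets below describe an example page that appeared only as an image on the original slide. A reconstruction of what such a page might look like (tag contents assumed), kept as a Python string for later reference:

```python
# Reconstructed example page (the original slide showed this as an image).
simple_html = """
<html>
    <head>
        <title>A simple example page</title>
    </head>
    <body>
        <p>Here is some simple content for this page.</p>
        <h1>A heading</h1>
    </body>
</html>
"""
```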
Tags have commonly used names that depend on their position in relation to other tags:
• child: a child is a tag inside another tag. So the two tags p and h1 above are both children of the body tag.
• parent: a parent is the tag another tag is inside. Above, the html tag is the parent of the body tag.
• sibling: a sibling is a tag that is nested inside the same parent as another tag. For example, head and body are siblings, since they're both inside html. The tags p and h1 are also siblings, since they're both inside body.

We can also add properties to HTML tags that change their behaviour.
The requests Library
The first thing we'll need to do to scrape a web page is to download the page. We can download pages using the Python requests library.

A status_code of 200 means that the page downloaded successfully.
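A minimal sketch of downloading a page (the URL is one of the static example pages from the Dataquest tutorial this material follows; substitute any page you want to scrape):

```python
import requests

# Download a simple static page and check that it succeeded.
page = requests.get("https://dataquestio.github.io/web-scraping-pages/simple.html")
print(page.status_code)  # 200: the page downloaded successfully
```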


The requests Library
We can print out the HTML content of the page using the content property:
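Continuing the sketch above:

```python
# page.content holds the raw bytes of the downloaded HTML document.
print(page.content)
```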
Parsing a page with BeautifulSoup
We can use the BeautifulSoup library to parse the document and extract the text from tags. We first have to import the library and create an instance of the BeautifulSoup class to parse our document.

We can print out the HTML content of the page, formatted nicely, using the prettify method on the BeautifulSoup object.
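A sketch, assuming beautifulsoup4 is installed (pip install beautifulsoup4) and page is the response downloaded above:

```python
from bs4 import BeautifulSoup

# Create a BeautifulSoup instance to parse the downloaded document.
soup = BeautifulSoup(page.content, "html.parser")

# prettify() returns the parsed HTML with consistent indentation.
print(soup.prettify())
```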
Parsing a page with BeautifulSoup
We can first select all the elements at the top level of the page using the children property of soup.
Note that children returns a list generator, so we need to call the list function on it:
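For example:

```python
# children is a generator, so call list() on it to inspect the items.
print(list(soup.children))
```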
Parsing a page with BeautifulSoup
All of the items are BeautifulSoup objects:
• The first is a Doctype object, which contains information about the type of the document.
• The second is a NavigableString, which represents text found in the HTML document.
• The final item is a Tag object, which contains other nested tags.
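We can confirm this by printing the type of each item:

```python
# Inspect the type of each top-level item.
print([type(item) for item in soup.children])
# Typically: [<class 'bs4.element.Doctype'>, <class 'bs4.element.NavigableString'>,
#             <class 'bs4.element.Tag'>]
```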
Parsing a page with BeautifulSoup
To retrieve the children inside the html tag, we use:
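A sketch, assuming (as in the listing above) that the Tag object is the third top-level item:

```python
# Select the html Tag object and list its own children.
html = list(soup.children)[2]
print(list(html.children))
```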
Finding all instances of a tag at once
If we want to extract a single kind of tag, we can instead use the find_all method, which will find all the instances of a tag on a page.
Note that find_all returns a list, so we'll have to loop through it, or use list indexing, to extract text.
To find the first instance of a tag, we can use the find method.
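A sketch of all three approaches:

```python
# find_all returns a list of every matching tag on the page.
for p in soup.find_all("p"):
    print(p.get_text())

# List indexing also works.
print(soup.find_all("p")[0].get_text())

# find returns only the first matching tag.
print(soup.find("p").get_text())
```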
Searching for tags by class
You can use the find_all method to search for items by class or by id, using the arguments class_ or id.

Example:
Let's search for any tag that has the class outer-text.
Let's search for any p tag that has the class outer-text.
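A sketch, assuming a page whose tags carry classes and ids (the URL is another static example page from the same tutorial series):

```python
import requests
from bs4 import BeautifulSoup

page = requests.get("https://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, "html.parser")

# Any tag with the class outer-text:
print(soup.find_all(class_="outer-text"))

# Only p tags with the class outer-text:
print(soup.find_all("p", class_="outer-text"))
```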
Searching for tags by id
The same thing can be done to search for elements by id, using the argument id.

Example:
Let's search for any tag that has the id first.
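Continuing with the same page:

```python
# Any tag whose id is "first":
print(soup.find_all(id="first"))
```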
Searching for tags using CSS selectors
We can also search for items using CSS selectors. These selectors are how the CSS language allows developers to specify HTML tags to style. Here are some examples:
• p a: finds all a tags inside of a p tag.
• body p a: finds all a tags inside of a p tag inside of a body tag.
• html body: finds all body tags inside of an html tag.
• p.outer-text: finds all p tags with a class of outer-text.
• p#first: finds all p tags with an id of first.
• body p.outer-text: finds any p tags with a class of outer-text inside of a body tag.
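BeautifulSoup exposes CSS selectors through its select method; a short sketch:

```python
# select() takes a CSS selector string and returns a list of matching tags.
print(soup.select("body p.outer-text"))  # p tags with class outer-text inside body
print(soup.select("p#first"))            # p tags with id first
```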
Challenges of web scraping
• Variety: every page is special.
• Durability: websites change their structures.
