
CSS 507 - Data Collection, Wrangling, Analysis and

Visualization

Lecture 2 - Acquiring data, scraping tools

Nazerke Sultanova
What is web scraping?
Web scraping is a technique for extracting information from websites. This can
be done manually but it is usually faster, more efficient and less error-prone to
automate the task.

Web scraping allows you to acquire non-tabular or poorly structured data from
websites and convert it into a usable, structured format, such as a .csv file or
spreadsheet.

Scraping is about more than just acquiring data: it can also help you archive
data and track changes to data online.
Why do we need it as a skill?
Web scraping is increasingly being used by academics and researchers to
create data sets for text mining projects; these might be collections of
journal articles or digitised texts. The practice of data journalism, in
particular, relies on the ability of investigative journalists to harvest data
that is not always presented or published in a form that allows analysis.
When do we need scraping?

As useful as scraping is, there might be better options for the task. Choose the
right (i.e. the easiest) tool for the job.

● Check whether or not you can easily copy and paste data from a site into
Excel or Google Sheets. This might be quicker than scraping.
● Check if the site or service already provides an API to extract structured data.
If it does, that will be a much more efficient and effective pathway. Good
examples are the Facebook API, the Twitter APIs or the YouTube comments
API.
● For much larger needs, Freedom of Information requests can be useful. Be
specific about the formats required for the data you want.
What is HTML?

● HTML stands for HyperText Markup Language


● It is the standard markup language for the webpages that make up the
internet.
● HTML consists of a series of elements that make up a webpage; webpages
can link to one another, together forming a website.
● HTML elements are written as tags, which tell the web browser how to
display the content.
A sample raw HTML file below
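The sample file from the slide is not reproduced here; a minimal raw HTML page along the same lines (an illustrative stand-in, not the slide's exact file) looks like:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>My First Webpage</title>
  </head>
  <body>
    <h1>Hello, world!</h1>
    <p>This paragraph is an HTML <b>element</b>.</p>
    <a href="https://example.com">This is a link</a>
  </body>
</html>
```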
Rendered HTML
We will start with the basics of sending a GET request to a web server for a
specific page, reading the HTML output from that page, and doing some simple
data extraction in order to isolate the content that we are looking for.
URL request and response

URL Request - A request to a web server for content to be viewed by the user.
This request is triggered whenever you click a link or open a webpage.

URL Response - The response to that request, whether it succeeds or fails. For
every request, the web server returns a mandatory response, and most of the time
this is the content asked for by the URL request.
Request in Python
urllib.request library
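The code for this slide is not reproduced here; a minimal sketch of a GET request with the standard-library urllib.request module follows. The URL below is a data: URI stand-in so the example runs without a network connection; in practice you would pass a real address such as a page URL.

```python
from urllib.request import urlopen

# A data: URI stands in for a real web address so the example runs offline.
url = "data:text/html,<html><h1>Hello</h1></html>"

response = urlopen(url)   # send the GET request
html = response.read()    # read the raw response body (bytes)
print(html.decode("utf-8"))
```

The `read()` call returns bytes, which is why the result is decoded before printing.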
Beautiful soup library
The BeautifulSoup library was named after a Lewis Carroll poem of the same
name in Alice’s Adventures in Wonderland.

Like its Wonderland namesake, BeautifulSoup tries to make sense of the
nonsensical; it helps format and organize the messy web by fixing bad HTML and
presenting us with easily traversable Python objects representing XML structures.
Installing BeautifulSoup
Because the BeautifulSoup library is not a default Python library, it must be
installed. We will be using the BeautifulSoup 4 library (also known as BS4)

Web Scraping with Python: Collecting Data from the Modern Web. p24 -
installation
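The installation command itself is not shown on the slide; it is the usual pip invocation:

```shell
# Installs BeautifulSoup 4. Note the naming quirk: the PyPI package is
# "beautifulsoup4", but the module you import is "bs4".
python3 -m pip install beautifulsoup4
```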
bs4 library
Connecting reliably
The web is messy. Data is poorly formatted, websites go down, and closing tags
go missing. One of the most frustrating experiences in web scraping is to go to
sleep with a scraper running, only to find the next morning that it crashed on
unexpected input.

There are two main things that can go wrong when requesting a page:

● The page is not found on the server (or there was some error in retrieving it)
● The server is not found
Catching errors
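The error-handling code for this slide is not shown; here is a sketch that guards against both failure modes with urllib's exception classes, returning None instead of crashing. The data: and bogus:// URLs are placeholders so the example runs offline.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

def get_page(url):
    """Return the page's bytes, or None if the request fails."""
    try:
        return urlopen(url).read()
    except HTTPError:
        # The server was reached but returned an error status (e.g. 404).
        return None
    except URLError:
        # The server could not be reached at all.
        return None

print(get_page("data:text/html,<h1>ok</h1>") is not None)  # True
print(get_page("bogus://not-a-real-scheme/") is None)      # True
```

Catching HTTPError before URLError matters: HTTPError is a subclass of URLError, so the more specific handler must come first if you want to treat the two cases differently.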
Another serving of Beautiful Soup

Web scrapers can easily separate these two different tags based on
their class; for example, they might use BeautifulSoup to grab all of the
red text but none of the green text. Because CSS relies on these
identifying attributes to style sites appropriately, you are almost
guaranteed that these class and ID attributes will be plentiful on most
modern websites.
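A minimal sketch of that idea (the green/red spans below are an invented stand-in for the slide's example page):

```python
from bs4 import BeautifulSoup

html = """
<span class="red">Heavens! what a virulent attack!</span>
<span class="green">Anna Pavlovna</span>
<span class="red">the prince</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Grab only the spans styled with class="red"; the green span is skipped.
for span in soup.find_all("span", class_="red"):
    print(span.get_text())
```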
findAll() function
find() and findAll() with BeautifulSoup
The two functions are extremely similar, as evidenced by their definitions in the
BeautifulSoup documentation:
tag
The tag argument is one that we’ve seen before — you can pass a string name of
a tag or even a Python list of string tag names. For example, the following will
return a list of all the header tags in a document:
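The example referred to above might look like this sketch:

```python
from bs4 import BeautifulSoup

html = "<h1>Title</h1><p>text</p><h2>Section</h2><h3>Subsection</h3>"
soup = BeautifulSoup(html, "html.parser")

# Passing a list of tag names matches any of them, in document order.
headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
print([h.name for h in headers])  # ['h1', 'h2', 'h3']
```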
attributes
The attributes argument takes a Python dictionary of attributes and matches tags
that contain any one of those attributes. For example, the following function would
return both the green and red span tags in the HTML document:
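A sketch of the attributes-dictionary form, again with invented green/red spans standing in for the slide's page:

```python
from bs4 import BeautifulSoup

html = ('<span class="green">name</span>'
        '<span class="red">dialogue</span>'
        '<span class="blue">other</span>')
soup = BeautifulSoup(html, "html.parser")

# The dictionary value can be a set of acceptable attribute values,
# so this matches both the green and the red span but not the blue one.
spans = soup.find_all("span", {"class": {"green", "red"}})
print([s["class"] for s in spans])
```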
recursive
The recursive argument is a boolean. How deeply into the document do you want
to go? If recursive is set to True, the findAll function looks into children, and
children’s children, for tags that match your parameters. If it is False, it will look
only at the top-level tags in your document. By default, findAll works recursively
(recursive is set to True); it’s generally a good idea to leave this as is, unless you
really know what you need to do and performance is an issue.
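A small demonstration of the difference, using an invented nested-div snippet:

```python
from bs4 import BeautifulSoup

html = """
<div id="top">
  <p>direct child</p>
  <div><p>nested grandchild</p></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
top = soup.find("div", id="top")

print(len(top.find_all("p")))                   # 2: searches all descendants
print(len(top.find_all("p", recursive=False)))  # 1: direct children only
```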
text
The text argument is unusual in that it matches based on the text content of the
tags, rather than properties of the tags themselves. For instance, if we want to find
the number of times “the prince” was surrounded by tags on the example page, we
could replace our .findAll() function in the previous example with the following
lines:
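The replacement call itself is not shown on the slide; a sketch of the idea follows. Note that text matching compares against a tag's entire string content, and newer BeautifulSoup versions prefer the alias `string=` for the same argument.

```python
from bs4 import BeautifulSoup

html = ('<span>the prince</span> met <span>the prince</span>'
        ' and <span>Anna Pavlovna</span>')
soup = BeautifulSoup(html, "html.parser")

# text= matches on a tag's text content rather than its attributes.
matches = soup.find_all(text="the prince")
print(len(matches))  # 2
```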
limit
The limit argument, of course, is only used in the findAll method; find is
equivalent to the same findAll call, with a limit of 1. You might set this if you’re only
interested in retrieving the first x items from the page. Be aware, however, that this
gives you the first items on the page in the order that they occur, not necessarily
the first ones that you want.
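A sketch of limit in action, including the find/findAll equivalence mentioned above:

```python
from bs4 import BeautifulSoup

html = "<p>one</p><p>two</p><p>three</p>"
soup = BeautifulSoup(html, "html.parser")

# limit=2 returns only the first two matches, in document order.
print([p.get_text() for p in soup.find_all("p", limit=2)])  # ['one', 'two']

# find() behaves like find_all() with limit=1, but returns the tag itself
# rather than a one-element list.
print(soup.find("p").get_text())  # one
```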
keyword
The keyword argument allows you to select tags that contain a particular attribute.
For example:
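The example is not reproduced on the slide; a sketch with an invented snippet:

```python
from bs4 import BeautifulSoup

html = '<h1 id="title">War and Peace</h1><p class="body">text</p>'
soup = BeautifulSoup(html, "html.parser")

# A keyword argument selects tags that carry a matching attribute.
print(soup.find_all(id="title")[0].get_text())  # War and Peace

# "class" is a reserved word in Python, so BeautifulSoup spells it class_.
print(len(soup.find_all(class_="body")))  # 1
```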
Keyword argument
The keyword argument can be very helpful in some situations. However, it is
technically redundant as a BeautifulSoup feature: anything that can be done with
keyword can also be accomplished using regular expressions.
Navigating trees
The findAll function is responsible for finding tags based on their name and
attributes. But what if you need to find a tag based on its location in a document?

Single way navigation:


Dealing with children and other descendants
Children are always exactly one tag below a parent, whereas descendants can be
at any level in the tree below a parent.
Find children
If you want to find only descendants that are children, you can use the .children
attribute:

If you were to use the .descendants attribute instead of .children, about two
dozen tags would be found within the table and printed.
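The table code for this slide is not shown; a sketch with a small invented gift table (the id is modeled on a typical example, not taken from the slide):

```python
from bs4 import BeautifulSoup

html = """
<table id="giftList">
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Vegetable Basket</td><td>$15.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", id="giftList")

# .children yields only the nodes one level below the table...
children = list(table.children)
# ...while .descendants walks the entire subtree beneath it.
descendants = list(table.descendants)
print(len(children), len(descendants))
```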
Dealing with siblings
The BeautifulSoup .next_siblings attribute makes it trivial to collect data from
tables, especially ones with title rows:
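The slide's code is not reproduced; a sketch with an invented table. Starting from the header row and walking its siblings yields every data row while conveniently skipping the header itself:

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr id="header"><th>Item</th><th>Price</th></tr>
  <tr><td>Basket</td><td>$15.00</td></tr>
  <tr><td>Parrot</td><td>$10.00</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find("tr", id="header").next_siblings:
    if row.name == "tr":  # skip the whitespace text nodes between rows
        print(row.td.get_text())
```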
Regular Expressions
Regular expressions are so called because they are used to identify regular
strings.

This can be exceptionally handy for quickly scanning large documents to look for
strings that look like phone numbers or email addresses.

What is a regular string?

It’s any string that can be generated by a series of linear rules


Example
1. Write the letter “a” at least once.

2. Append to this the letter “b” exactly five times.

3. Append to this the letter “c” any even number of times.

4. Optionally, write the letter “d” at the end.

Strings that follow these rules are: “aaaabbbbbccccd,” “aabbbbbcc,” and so on

(there are an infinite number of variations).


Regular expressions

Regular expressions are merely a shorthand way of expressing these sets of
rules.

For instance, here’s the regular expression for the series of steps just described:
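The expression itself does not appear on the slide; a pattern that encodes the four rules above, checked with Python's re module, is:

```python
import re

# aa*    -- "a" once, then zero or more more (rule 1: at least one "a")
# bbbbb  -- exactly five "b"s (rule 2)
# (cc)*  -- "cc" zero or more times, i.e. any even number of "c"s (rule 3)
# (d|)   -- an optional trailing "d" (rule 4)
pattern = r"aa*bbbbb(cc)*(d|)"

print(bool(re.fullmatch(pattern, "aaaabbbbbccccd")))  # True
print(bool(re.fullmatch(pattern, "aabbbbbcc")))       # True
print(bool(re.fullmatch(pattern, "aaabbbbbcccd")))    # False: odd number of "c"s
```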
Commonly used regular expression symbols
Bs4 and regular expressions
A regular expression can be inserted as any argument in a BeautifulSoup
expression, allowing you a great deal of flexibility in finding target elements.
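A sketch of this technique, filtering image tags by a regular expression on their src attribute (the image paths are invented for the example):

```python
import re
from bs4 import BeautifulSoup

html = """
<img src="../img/gifts/img1.jpg">
<img src="../img/logo.png">
<img src="../img/gifts/img2.jpg">
"""
soup = BeautifulSoup(html, "html.parser")

# A compiled regular expression can serve as an attribute filter,
# matching only the product images and skipping the logo.
images = soup.find_all("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
print([img["src"] for img in images])
```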
Thank you
