
Seminar 12:

Web Scraping
Make Python a Browser

import webbrowser
import requests

# Fetch the page, save it locally, then open it in the default browser
r = requests.get("http://finance.google.com/finance?q=aapl")
print(r)
with open("aapl.html", "w") as file:
    file.write(r.text)
webbrowser.open_new_tab("aapl.html")

2
Web scraping

Ø Websites are full of data, and lots of it!
Ø Web scraping is the use of a program to sift
through websites and extract data of interest.
Ø With the right use of modules, the data can
be formatted to produce insights.
Ø Caution!!
Check the Copyright and the Terms and Conditions of
the website prior to scraping the data!!
Scrape only when it is legitimate to do so.

3
Viewing HTML Codes Manually
Ø Browse to the www.lazada.sg ecommerce website.
Click on any product of interest, for example “Dyson
Tower Fan”. Then right-click the webpage and view
the page source.

4
Viewing HTML Codes Manually
Ø With the page source open, you will see
chunks of HTML and CSS code like the
following.
Ø Interpreting a webpage this way can be very
tedious, given the amount of HTML generated
by content management software or HTML
generators.

5
Components of a webpage
Ø HTML
Ø Cascading Style Sheets (CSS)
§ Lay out the structure and design of
the webpage.
§ See the CSS code at the top right.
Ø JavaScript
§ Typically used to validate user inputs
from the webpage.
§ See the JavaScript code at the
bottom right.
Ø Images and other media
resources.

6
Basics of HTML

All HTML documents must start with a document type declaration: <!DOCTYPE html>.
The HTML document itself begins with <html> and ends with </html>.
The visible part of the HTML document is between <body> and </body>.
HTML headings are defined with the <h1> to <h6> tags: <h1> defines the most important heading and <h6> the least important.

An HTML element usually consists of a matching pair of start tag and end tag, with the content inserted in between:
<tagname>Content goes here...</tagname>
The HTML element is everything from the start tag to the end tag:
<p>My first paragraph.</p>

Example:
<!DOCTYPE html>
<html>
<head> HELLO </head>
<body>
<h1> First Scraping </h1>
<p> Hello World </p>
</body>
</html>

Attributes provide additional information about HTML elements.
• All HTML elements can have attributes
• Attributes are always specified in the start tag
• Attributes usually come in name/value pairs like: name="value"

See https://www.w3schools.com/html/default.asp for other HTML tags.

7
Scraping Rules
• You should check a website’s Terms and Conditions before
you scrape it. Be careful to read the statements about legal
use of data. Usually, the data you scrape should not be used
for commercial purposes.
• Do not request data from the website too aggressively with
your program (also known as spamming), as this may break
the website. Make sure your program behaves in a
reasonable manner (i.e. acts like a human). One request for
one webpage per second is good practice.
• The layout of a website may change from time to time, so
make sure to revisit the site and rewrite your code as needed.
• Not all sites allow scraping.
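The one-request-per-second guideline above can be sketched with a small helper using time.sleep. This is an illustration, not part of the original slides; the fetch callable and link list are assumptions (in practice, fetch would wrap requests.get).

```python
import time

def fetch_politely(fetch, links, delay=1.0):
    """Call fetch(link) for each link, pausing ~delay seconds between
    requests so the scraper behaves reasonably (about 1 request/second)."""
    results = {}
    for link in links:
        results[link] = fetch(link)
        time.sleep(delay)  # be polite: wait before the next request
    return results
```

In a real scraper, fetch could be something like `lambda url: requests.get(url, timeout=5)`.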

8
Analysing a web page
https://www.bloomberg.com/quote/SPX:IND

• Search the page source to find
where the content of interest
is located.
• Identify any tags
or attributes that mark it.

9
Web Scraping with Beautiful Soup
Ø Beautiful Soup is a Python library that can parse HTML
and XML markup.
Ø Once the region of the data we are interested in is identified,
we pass the name of the division, e.g. <div class=xxxxx>, to a
Beautiful Soup function to retrieve its content.
Ø The resulting soup object can then be used to search
for the desired data.
Ø See more detailed examples from the documentation below:
§ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
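A minimal sketch of the step described above. The HTML snippet and the class name "price" are made up for illustration; a real script would first download the page, e.g. with requests.get(page_link).text.

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page;
# in a real script this would come from requests.get(page_link).text
html = '<html><body><div class="price">2,912.43</div></body></html>'

# Build the soup object used for all subsequent searching
soup = BeautifulSoup(html, "html.parser")

# Retrieve the <div class=...> region identified from the page source
price_div = soup.find("div", class_="price")
print(price_div.get_text())  # 2,912.43
```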

10
Web Scraping with Beautiful Soup
Ø Once the “soup” object is obtained,
we can call the different HTML properties
on the soup object to extract
information. The following example
shows how to extract the title and a paragraph
section.
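A hedged reconstruction of that example; the HTML here is a stand-in document, since the slide's screenshot is not reproduced in this text.

```python
from bs4 import BeautifulSoup

# Stand-in HTML document; the original slide worked on a live page
html = """
<html>
  <head><title>First Scraping</title></head>
  <body>
    <h1>My Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # <title>First Scraping</title>
print(soup.title.string)  # First Scraping
print(soup.p)             # <p>My first paragraph.</p>
```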

11
Web Scraping with Beautiful Soup
Ø More ways to extract HTML content …
Ø String manipulation techniques can be applied to select
or subset the data.
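The points above can be sketched as follows; the HTML, tag names, and class are assumptions made for the example, showing find_all, attribute access, and simple string subsetting.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; tags and class names are assumptions for this sketch
html = """
<html><body>
  <a href="https://example.com/a">Link A</a>
  <a href="https://example.com/b">Link B</a>
  <div class="name">S&amp;P 500 Index</div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; attributes read like dict keys
urls = [a["href"] for a in soup.find_all("a")]

# String manipulation to subset the extracted text
name = soup.find("div", class_="name").get_text()
first_word = name.split()[0]
print(urls)        # ['https://example.com/a', 'https://example.com/b']
print(first_word)  # S&P
```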

12
Handling Time Out

• Frequent appearance of status codes like 404 (Not Found),
403 (Forbidden) or 408 (Request Timeout) might indicate that
you got blocked. You may want to check for those error codes
and proceed accordingly.
• Also, be ready to handle exceptions raised by requests:

import requests

try:
    page_response = requests.get(page_link, timeout=5)
    if page_response.status_code == 200:
        # extract
        ...
    else:
        print(page_response.status_code)
        # notify, try again
except requests.Timeout as e:
    print("It is time to timeout")
    print(str(e))
except requests.RequestException as e:
    # handle other exceptions from requests
    print(str(e))

13
