
Seminar 12:

Web Scraping
Make Python a Browser

import webbrowser
import requests

# Fetch the page, save it locally, then open it in the default browser
r = requests.get("http://finance.google.com/finance?q=aapl")
print(r)
with open("aapl.html", "w") as file:
    file.write(r.text)
webbrowser.open_new_tab("aapl.html")

2
Web scraping

Ø Websites are full of data, and lots of it!
Ø Web scraping is the use of a program to sift
through websites and extract data of interest.
Ø With the right use of modules, the data can
be formatted to produce insights.
Ø Caution!!
Check the Copyright and the Terms and Conditions of
the website prior to scraping the data!!
Scrape only when it is legitimate to do so.

3
Viewing HTML Codes Manually
Ø Browse to the www.lazada.sg ecommerce website.
Click on any product of interest, for example “Dyson
Tower Fan”. Then right-click the webpage and view
the page source.

4
Viewing HTML Codes Manually
Ø With the page source open, you will see
chunks of HTML and CSS code like the
following.
Ø Interpreting a webpage this way can be very
tedious, given the amount of HTML generated
by content management software or HTML
generators.

5
Components of a webpage
Ø HTML
Ø Cascading Style Sheets (CSS)
§ Lay out the structure and design of
the webpage.
§ See the CSS code at the top right.
Ø JavaScript
§ Typically used to validate user inputs
from the webpage.
§ See the JavaScript code at the
bottom right.
Ø Images and other media
resources.

6
Basics of HTML

All HTML documents must start with a document type declaration: <!DOCTYPE html>.
The HTML document itself begins with <html> and ends with </html>.
The visible part of the HTML document is between <body> and </body>.
HTML headings are defined with the <h1> to <h6> tags: <h1> defines the most important heading and <h6> the least important.

An HTML element usually consists of a matching pair of start tag and end tag, with the content inserted in between:
<tagname>Content goes here...</tagname>
The HTML element is everything from the start tag to the end tag:
<p>My first paragraph.</p>

Example:
<!DOCTYPE html>
<html>
<head> HELLO </head>
<body>
<h1> First Scraping </h1>
<p> Hello World </p>
</body>
</html>

Attributes provide additional information about HTML elements.
• All HTML elements can have attributes
• Attributes are always specified in the start tag
• Attributes usually come in name/value pairs like: name="value"

See https://www.w3schools.com/html/default.asp for other HTML tags.

7
Scraping Rules
• You should check a website’s Terms and Conditions before
you scrape it. Be careful to read the statements about legal
use of data. Usually, the data you scrape should not be used
for commercial purposes.
• Do not request data from the website too aggressively with
your program (also known as spamming), as this may break
the website. Make sure your program behaves in a
reasonable manner (i.e. acts like a human). One request for
one webpage per second is good practice.
• The layout of a website may change from time to time, so
make sure to revisit the site and rewrite your code as needed.
• Not all sites allow scraping.
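The one-request-per-second guideline above can be sketched with a small helper using time.sleep. This is an illustration, not part of the original slides; the fetch callable and link list are assumptions (in practice, fetch would wrap requests.get).

```python
import time

def fetch_politely(fetch, links, delay=1.0):
    """Call fetch(link) for each link, pausing ~delay seconds between
    requests so the scraper behaves reasonably (about 1 request/second)."""
    results = {}
    for link in links:
        results[link] = fetch(link)
        time.sleep(delay)  # be polite: wait before the next request
    return results
```

In a real scraper, fetch could be something like `lambda url: requests.get(url, timeout=5)`.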

8
Analysing a web page
https://www.bloomberg.com/quote/SPX:IND

• Search the page source to find
where the content of interest
is located.
• Identify any tags
or attributes that mark it.

9
Web Scraping with Beautiful Soup
Ø Beautiful Soup is a Python library that can parse HTML
and XML markup.
Ø Once the region of the data we are interested in is identified,
we pass the name of the division, e.g. <div class=xxxxx>, to a
Beautiful Soup function to retrieve its content.
Ø The resulting soup object can then be used to search
for the desired data.
Ø See more detailed examples from the documentation below:
§ https://www.crummy.com/software/BeautifulSoup/bs4/doc/
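A minimal sketch of the step described above. The HTML snippet and the class name "price" are made up for illustration; a real script would first download the page, e.g. with requests.get(page_link).text.

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page;
# in a real script this would come from requests.get(page_link).text
html = '<html><body><div class="price">2,912.43</div></body></html>'

# Build the soup object used for all subsequent searching
soup = BeautifulSoup(html, "html.parser")

# Retrieve the <div class=...> region identified from the page source
price_div = soup.find("div", class_="price")
print(price_div.get_text())  # 2,912.43
```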

10
Web Scraping with Beautiful Soup
Ø Once the “soup” object is obtained,
we can call the different HTML properties
on the soup object to extract
information. The following example
shows how to extract the title and a paragraph
section.
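A hedged reconstruction of that example; the HTML here is a stand-in document, since the slide's screenshot is not reproduced in this text.

```python
from bs4 import BeautifulSoup

# Stand-in HTML document; the original slide worked on a live page
html = """
<html>
  <head><title>First Scraping</title></head>
  <body>
    <h1>My Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # <title>First Scraping</title>
print(soup.title.string)  # First Scraping
print(soup.p)             # <p>My first paragraph.</p>
```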

11
Web Scraping with Beautiful Soup
Ø More ways to extract HTML content …
Ø String manipulation techniques can be applied to select
or subset the data.
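The points above can be sketched as follows; the HTML, tag names, and class are assumptions made for the example, showing find_all, attribute access, and simple string subsetting.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; tags and class names are assumptions for this sketch
html = """
<html><body>
  <a href="https://example.com/a">Link A</a>
  <a href="https://example.com/b">Link B</a>
  <div class="name">S&amp;P 500 Index</div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; attributes read like dict keys
urls = [a["href"] for a in soup.find_all("a")]

# String manipulation to subset the extracted text
name = soup.find("div", class_="name").get_text()
first_word = name.split()[0]
print(urls)        # ['https://example.com/a', 'https://example.com/b']
print(first_word)  # S&P
```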

12
Handling Time Out

• Frequent appearance of status codes like 404 (Not Found),
403 (Forbidden) or 408 (Request Timeout) might indicate that
you got blocked. You may want to check for those error codes
and proceed accordingly.
• Also, be ready to handle exceptions raised by requests:

import requests

try:
    page_response = requests.get(page_link, timeout=5)
    if page_response.status_code == 200:
        # extract
        ...
    else:
        print(page_response.status_code)
        # notify, try again
except requests.Timeout as e:
    print("It is time to timeout")
    print(str(e))
except requests.RequestException as e:
    # handle other exceptions from requests
    print(str(e))

13
