Web Scraping
Make Python a Browser
import webbrowser
import requests

r = requests.get("http://finance.google.com/finance?q=aapl")
print(r)
with open("aapl.html", "w") as file:
    file.write(r.text)
webbrowser.open_new_tab("aapl.html")
Viewing HTML Codes Manually
Ø Browse to the www.lazada.sg e-commerce website.
Click on any product of interest, for example "Dyson
Tower Fan". Then right-click the webpage and view the
page source.
Viewing HTML Codes Manually
Ø With the page source opened, you will see
chunks of HTML and CSS code like the
following.
Ø Interpreting a webpage this way can be very
tedious, given the amount of HTML generated
by content management systems and HTML
generators.
Components of a webpage
Ø HTML
§ Defines the structure and content of
the webpage.
Ø Cascading Style Sheets (CSS)
§ Lay out the structure and design of
the webpage.
Ø JavaScript
§ Typically used for validating user
inputs from the webpage.
Ø Images or other media
resources.
Basics of HTML
All HTML documents must start with a document type declaration: <!DOCTYPE html>.
The HTML document itself begins with <html> and ends with </html>.
The visible part of the HTML document is between <body> and </body>.
An HTML element is everything from the start tag to the end tag, for example:
<p>My first paragraph.</p>

A minimal document:
<!DOCTYPE html>
<html>
<head> HELLO </head>
<body>
<p>My first paragraph.</p>
</body>
</html>

Attributes provide additional information about HTML elements.
• All HTML elements can have attributes
• Attributes are always specified in the start tag
• Attributes usually come in name/value pairs like: name="value"
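As a quick illustration that attributes really do arrive as name/value pairs, here is a minimal sketch using Python's built-in html.parser (the tag and attribute values are made up for the example):

```python
from html.parser import HTMLParser

class AttributePrinter(HTMLParser):
    """Collect (tag, attributes) pairs as the parser walks the markup."""
    def __init__(self):
        super().__init__()
        self.collected = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples taken from the start tag
        self.collected.append((tag, dict(attrs)))

parser = AttributePrinter()
parser.feed('<a href="https://example.com" title="demo">link</a>')
print(parser.collected)
```

Each start tag yields its attributes as name/value pairs, exactly as described in the bullets above.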
Scraping Rules
• You should check a website’s Terms and Conditions before
you scrape it. Be careful to read the statements about legal
use of data. Usually, the data you scrape should not be used
for commercial purposes.
• Do not request data from the website too aggressively with
your program (also known as spamming), as this may break
the website. Make sure your program behaves in a
reasonable manner (i.e. acts like a human). One request for
one webpage per second is good practice.
• The layout of a website may change from time to time, so
make sure to revisit the site and rewrite your code as needed.
• Not all sites can be scraped; some block automated requests.
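The one-request-per-second guideline above can be sketched as a small helper. The name polite_fetch and its injectable fetch parameter are illustrative assumptions, not a standard API; in practice fetch would be something like requests.get:

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    """Fetch each URL, pausing `delay` seconds between requests.

    `fetch` is whatever callable performs the request (e.g. requests.get).
    This is a sketch of the rate-limiting idea, not a full crawler.
    """
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # one request per second by default
        results.append(fetch(url))
    return results
```

Injecting the fetch function keeps the rate-limiting logic testable without touching the network.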
Analysing a web page
https://www.bloomberg.com/quote/SPX:IND
Web Scraping with Beautiful Soup
Ø Beautiful Soup is another Python library; it parses HTML
and XML markup.
Ø Once the region of the data we are interested in is identified,
we pass the name of the division, e.g. <div class=xxxxx>, to a
Beautiful Soup function to retrieve its content.
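As a sketch of the idea, assuming Beautiful Soup is installed (pip install beautifulsoup4) and using a made-up HTML string and class name rather than any real site's markup:

```python
from bs4 import BeautifulSoup  # assumed installed: pip install beautifulsoup4

# Made-up markup for illustration; the class name "price-section" is
# an assumption, not the real structure of any particular website.
html = """
<html><body>
  <div class="price-section"><span>2,789.65</span></div>
  <div class="other">ignore me</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Retrieve the division we are interested in by its class name
div = soup.find("div", class_="price-section")
print(div.get_text(strip=True))  # -> 2,789.65
```

When scraping a live page, the html string would instead come from requests.get(url).text.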
Web Scraping with Beautiful Soup
Ø Once the “soup” object is obtained,
we can call the different HTML
properties of the soup object to extract
information. The following example
shows how to extract the title and a
paragraph section.
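The title-and-paragraph extraction described above might look like this minimal sketch (the HTML string is a made-up illustration; Beautiful Soup is assumed to be installed):

```python
from bs4 import BeautifulSoup  # assumed installed: pip install beautifulsoup4

# Tiny made-up document reusing the example markup from the HTML basics slide
html = ("<html><head><title>HELLO</title></head>"
        "<body><p>My first paragraph.</p></body></html>")

soup = BeautifulSoup(html, "html.parser")
print(soup.title.string)  # -> HELLO
print(soup.p.get_text())  # -> My first paragraph.
```

soup.title and soup.p are shorthand for the first matching tag in the document.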
Web Scraping with Beautiful Soup
Ø More ways to extract HTML content …
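A few of those other extraction methods, sketched on a made-up HTML string (Beautiful Soup assumed installed):

```python
from bs4 import BeautifulSoup  # assumed installed: pip install beautifulsoup4

# Illustrative markup only; ids and URLs here are made up
html = """
<html><body>
  <p id="first">one</p>
  <p>two</p>
  <a href="https://example.com">a link</a>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

texts = [p.get_text() for p in soup.find_all("p")]  # every <p> element
link = soup.a["href"]                               # attribute access like a dict
first = soup.select_one("p#first").get_text()       # CSS selector syntax
print(texts, link, first)
```

find_all returns every match, tag["attr"] reads an attribute, and select_one accepts CSS selectors.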
Handling Time Out
• Frequent appearance of status codes like 404
(Not Found), 403 (Forbidden), or 408 (Request
Timeout) might indicate that you have been
blocked. You may want to check for those error
codes and proceed accordingly.
• Also, be ready to handle exceptions raised by
the request:

import requests

try:
    page_response = requests.get(page_link, timeout=5)
    if page_response.status_code == 200:
        pass  # extract the content here
    else:
        print(page_response.status_code)
        # notify, try again
except requests.Timeout as e:
    print("It is time to timeout")
    print(str(e))
except requests.RequestException as e:
    # handle other exceptions from the request
    print(str(e))