Modern Web data may also come in other formats such as JSON and XML.
Pandas provides the functions pd.read_html and pd.read_json for dealing with simple HTML and JSON.
However, pd.read_json has problems with nested JSON files, and it is advisable to use Python's json
module for them. For the XML format, other Python libraries such as lxml.etree, ElementTree, etc. are
required. Since JSON and XML are more complicated, we will not pursue them further.
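For reference only, the nested JSON case mentioned above can be sketched as follows. This is a minimal illustration, not a full treatment: the records and field names are made up, and pd.json_normalize is one possible way (an assumption here, not something the notes prescribe) to flatten the dictionaries that the json module returns.

```python
import json

import pandas as pd

# Hypothetical nested JSON document (made up for illustration).
raw = """
[
  {"id": 1, "name": "Alice", "address": {"city": "Ipoh", "postcode": "30000"}},
  {"id": 2, "name": "Bob", "address": {"city": "Kampar", "postcode": "31900"}}
]
"""

# Parse with the standard json module into plain Python lists and dicts.
records = json.loads(raw)

# Flatten the nested "address" objects into dotted column names.
df = pd.json_normalize(records)
print(df)
```

The nested fields end up as ordinary columns named address.city and address.postcode, so the result can be handled like any other DataFrame.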
Example 1.5.6. Write a Python script to read the above HTML tables.
Solution. Use pd.read_html to read the tables.
    import pandas as pd

    # read_html returns a list with one DataFrame per <table> found
    tbls = pd.read_html("tableeg.html")
    for i, tbl in enumerate(tbls):
        print("=" * 8 + f"Table {i+1}" + "=" * 8)
        print(tbl)
When the HTML tables are generated by JavaScript, we need to follow the StackOverflow advice of
calling an external browser to render the page first:
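    # https://stackoverflow.com/questions/25062365/python-parsing-html-table-generated-by-jay
    from io import StringIO

    from pandas.io.html import read_html
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()  # launches a real Firefox browser
    driver.get("http://www1.nyse.com/about/Listed/IPO_Index.html")
    # find_element_by_xpath was removed in Selenium 4; use find_element(By.XPATH, ...)
    table = driver.find_element(By.XPATH, '//div[@class="sp5"]/table//table/..')
    table_html = table.get_attribute("innerHTML")
    # Wrap the HTML string in StringIO, as newer pandas versions expect
    df = read_html(StringIO(table_html))[0]
    print(df)
    driver.close()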
§1.5.6 Handling Proprietary Formats
There are many proprietary formats one needs to deal with in data analysis. For example, much social
science research uses the SPSS or Stata format for storing data. On the other hand, many
companies use the SAS business intelligence system and store their data in SAS format.
Due to the popularity of Python, many software companies are providing Python support for their
formats. For example, according to https://blogs.sas.com/content/sasdummy/2017/04/08/python-to-sas-saspy/,
Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and
it is available on the SAS Software GitHub at https://github.com/sassoftware/saspy. It works with SAS
9.4 and higher, and requires Python 3.x.
The savReaderWriter and pyreadstat modules can be used to read the SPSS format.
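As a rough illustration of reading a proprietary format, the sketch below uses pandas' own Stata reader. It is only a sketch under stated assumptions: the survey data and the file name survey.dta are made up, and the file is written first purely so that the example has something to read back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical survey data, made up for illustration.
survey = pd.DataFrame({"age": [23, 35, 41], "score": [3.5, 4.0, 2.5]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "survey.dta")
    # Write a Stata file so that the example has something to read back.
    survey.to_stata(path, write_index=False)
    # pd.read_stata returns an ordinary DataFrame.
    loaded = pd.read_stata(path)

print(loaded)
# SAS and SPSS files follow the same pattern with pd.read_sas(...) and
# pd.read_spss(...); the latter needs the pyreadstat package installed.
```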
§1.5.7 Handling SQL Database and Cloud Server
Retrieving data from a SQL database server or a cloud server is much more complicated than opening
files because a connection to the server must be established first. We can use pd.read_sql to read
and store the result returned by the server.
For example, code for reading from a local SQL database (SQLite) is listed below.
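    import sqlite3

    from pandas.io import sql

    # Open a connection to the local SQLite database file
    conn = sqlite3.connect("data.db")
    query = "SELECT * FROM tablename"
    # parse_dates converts the 'date' column using the given day/month/year format
    tbl = sql.read_sql(query, con=conn, parse_dates={"date": "%d/%m/%Y"})
    print(tbl.head())
    conn.close()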
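The listing above needs an existing data.db file. A self-contained variant, offered as a sketch only, can use an in-memory SQLite database instead. The table name sales, its columns, and the rows are hypothetical and chosen just to show how parse_dates handles day/month/year strings.

```python
import sqlite3

import pandas as pd

# An in-memory database stands in for data.db; the table and column
# names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("01/02/2020", 10.5), ("15/03/2020", 20.0)],
)
conn.commit()

# parse_dates maps a column name to the strftime format of its strings.
tbl = pd.read_sql(
    "SELECT * FROM sales", con=conn, parse_dates={"date": "%d/%m/%Y"}
)
conn.close()

print(tbl.dtypes)
print(tbl)
```

Without the explicit format, a string such as 01/02/2020 could be read as 2 January rather than 1 February, which is why the format is passed.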
In general, the process can be simplified by using the SQLAlchemy module or another general-purpose
Python module such as PugSQL (https://pugsql.org).
Dealing with a cloud server is similar, and we sometimes need to use a special API. For example, to
deal with spreadsheet data on Google cloud, we need to use the Google Sheets API, as pointed out in
the following articles.
+ https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2
+ https://developers.google.com/sheets/api/quickstart/python
+ https://github.com/burnash/gspread
§1.6 Assignment Part 1
Based on your training experience, design a programme for your company to train the staff. Write up
a proposal which includes the following items:
+ The data structure of the training programme;
+ An update and review system for the programme structure;
+ An online testing system for the trainees;
+ A data analytical system for the results of the online trainees;
+ An expansion of your system to accommodate external trainees;
+ An evaluation of whether the data analytical system is worthwhile, listing the pros and cons of
the system.