Topic 1  Data Analytics from Programmer's Perspective

Data science has become very popular lately because everyone is envious of FANG (Facebook, Amazon, Netflix, Google) and BAT (Baidu, Alibaba, Tencent) being among the richest companies in the world thanks to their powerful big data processing capabilities. We will try to investigate the true nature of data science, in particular the collection, cleaning, visualisation, analysis and reporting of data. We will introduce various data analysis software, with a focus on Python data analysis tools. In this subject, the terms data science, data processing, data analysis, business intelligence, etc. are used to mean the same thing.

According to Chekanov [2016, Chapter 12] and https://www.stoltzmaniac.com, data science consists of the following components:

1. Data "collection" and extraction ... Topic 1
2. Data exploration and cleaning ... Topic 2
   + Understanding your data
   + Looking for red flags (warnings of danger)
   + Identifying things outside the "normal" range
   + Deciding what to do with NaN or missing values
   + Discovering data with the wrong data type
   + Utilising the pandas library and pyjanitor to transform the data into tidy format
   + Descriptive statistics
3. Data visualisation ... Topic 3
4. Data organisation and query ... Topic 4
   + Determining whether or not you actually need a database
   + Choosing the right database: deciding between relational and NoSQL
   + Schema design and normalisation ... UECS1203/UECS1403 Database System Fundamentals
   + Using an ORM (SQLAlchemy) to insert data
   + Data mining ... UECM3213/UECM3453 Data Mining
5. Data analysis and predictive modelling ... Topic 5
   + Building a data pipeline with Python luigi
   + Error monitoring
   + Statistical learning or machine learning ... UECM3993 Predictive Modelling
6. Data interpretation, presentation and reporting ... Topic 6

In reality, simple data requires only some of the components, while a large collection of data may involve all of them.

The most popular software for simple data processing is Excel. However, each new version of Excel introduces incompatibilities which can waste the user's time tuning the Excel file; for example, a workbook saved in the Excel 2010 format may not open correctly in Excel 2007. For other problems of Excel, see http://www.eusprig.org/horror-stories.htm.

Since more and more of us are trained with various programming skills in secondary schools or universities, programming languages are becoming an important alternative to spreadsheet software in data processing, especially large data processing. Python and R are currently two of the most popular programming languages which can be used freely and legally.

The main references are Kimball and Caserta [2004], Yau [2011] and Few [2006]. The main Python reference is McKinney [2013]. A supplementary reference is Grus [2015]. Advanced techniques for data processing can be found in books on data mining such as Witten et al. [2011] (using Weka) and Aggarwal [2015] (a theoretical textbook), and in books on machine learning such as Coelho and Richert [2015], etc.

Course Outcomes:
CO1. Apply the data analytics concepts in business scenarios ... PO5, C3
CO2. Apply the Extract-Transform-Load (ETL) process ... PO1, C3
CO3. Assess descriptive analytics for business intelligence ... PO2, C6
CO4. Construct predictive models for different business applications ... PO2, C5
CO5. Develop dashboards for data visualisation ... PO3, P3

Assessments:
+ Excel (Dr Chang Yun Fah): 20% Test + 30% Assignment/Presentation
+ Python:
  - 25% Test 2
    + Q1: 15% ... CO3
    + Q2: 10% ... CO5
  - Assignment/Presentation
    + Part 1: 10% ... CO3
    + Part 2: 10% ... CO4
    + Part 3: 5% ... CO5

§1.1 Business, Social and Scientific Data

Data come from various domains. A classification of data according to application domains (e.g. business, social science or scientific) is given below.

+ Agriculture: plant nutrients and classification, forest data, etc. ... scientific
+ Bioinformatics: important in finding out rare diseases (using DNA patterns) ... scientific
+ Climate + Weather: data in relation to global warming, etc. E.g. https://en.tutiempo.net/climate ... scientific
+ Computer Networks: e.g. http://www.caida.org/data/overview/ ... scientific
+ Earth Science: water resources, oceanographic data, earthquake data, etc. ... scientific
+ Economics: e.g. https://ourworldindata.org/ covers international trade, human resources, tax, corruption, etc. ... business
+ Education: US Scorecard data, PISA test scores (http://www.oecd.org/pisa/), etc. ... social
+ Energy: e.g. http://datasets.wri.org/dataset/globalpowerplantdatabase ... business
+ Finance: stock and derivatives data ... business
+ Geographical Information System (GIS): Waze, Google Maps, OpenStreetMap, etc. ... business
+ Government: e.g. Department of Statistics, Malaysia https://www.dosm.gov.my/v1/ ... business
+ Health care: food nutrients, https://www.gapminder.org/data/, etc. ... social
+ Image Processing: http://www.image-net.org/, animal images https://cvml.ist.ac.at/AwA2/, facial data http://www.face-rec.org/databases/, X-ray images http://dmery.ing.puc.cl/index.php/material/gdxray/, etc. ... social
+ Machine Learning: album or music related data such as https://www.imdb.com/interfaces/, https://github.com/mdeff/fma, etc. ... social
+ Museums ... scientific
+ Natural Language ... social
+ Neuroscience: MRI data https://openfmri.org/ ... scientific
+ Physics: particle physics data http://opendata.cern.ch/, cosmos observation https://icecube.wisc.edu/science/data, crystal structures http://www.crystallography.net/cod/, planetary observations such as https://exoplanetarchive.ipac.caltech.edu/, https://nssdc.gsfc.nasa.gov/nssdc/obtaining_data.html, Sloan Digital Sky Survey https://www.sdss.org/, etc. ... scientific
+ Social Networks: https://www.gharchive.org/, http://snap.stanford.edu/data/higgs-twitter.html, http://help.sentiment140.com/for-students/, http://netsg.cs.sfu.ca/youtubedata/, https://webscope.sandbox.yahoo.com/catalog.php?datatype=g, etc. ... social
+ Social Sciences: https://github.com/emorisse/FBI-Hate-Crime-Statistics/tree/master/2013, http://www.europeansocialsurvey.org/data/, etc. ... social
+ Sports: https://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/ ... social
+ Time Series: https://datamarket.com/data/list/?q=provider:tsdl, heart rate http://ecg.mit.edu/time-series/ ... scientific
+ Transportation: airline data, etc. ... social

Many datasets are very large, ranging from tens of megabytes to a few hundred terabytes.
Unless the data is more than a few terabytes, it is possible for us to process it using classical data analytic programming tools instead of big data analytic tools.

§1.2 Data Sources

There are many data sources, such as personal data, public data, classified data, and business data.

According to the European Union (https://ec.europa.eu/info/law/law-topic/data-protection/reform/what-personal-data_en), personal data is any information that relates to an identified or identifiable living individual. Different pieces of information, which collected together can lead to the identification of a particular person, also constitute personal data. Personal data that has been de-identified, encrypted or pseudonymised but can be used to re-identify a person remains personal data and falls within the scope of the GDPR.

Public data is usually provided by governments, non-profit organisations and non-governmental organisations for the benefit of public interests. Examples of public data are listed below.

+ U.S. Government's open data: https://www.data.gov/
+ U.K. Government's open data: https://data.gov.uk/
+ Malaysia's Open Data Portal: http://www.data.gov.my/
+ Google Trends: https://trends.google.com/trends/explore
+ Wikipedia: https://www.wikipedia.org/
+ CIA Factbook: https://www.cia.gov/library/publications/the-world-factbook/
  - https://iancoleman.io/exploring-the-cia-world-factbook/
  - https://codingdisciple.com/cia-factbook-sql.html
  - https://github.com/MikeAnthony6/factbook
+ National Centers for Environmental Information: https://www.ncdc.noaa.gov/data-access/quick-links
+ Earth Science Data Systems (ESDS) Program: https://earthdata.nasa.gov/
+ The United Nations Children's Fund (UNICEF): https://data.unicef.org/

Classified data are those data which the government or companies feel that disclosing can jeopardise country, public or company interests. Examples are public health data, military data, etc.

Business data are data that belong to business entities. These are the data sources a data "scientist" or analyst needs to deal with. Data collection refers to gathering the information of objects. A business entity has to collect customer data in order to provide services to customers. The data can be collected through the web (e.g. online shops), customer service counters, etc. Apart from customer data, a business entity also needs to store the data of its employees, business operations, etc. For a social network service provider, the data of interest are the social connectivity of users, the products users are interested in (based on tags), etc. [Russel, 2014] For a scientific institute, the data are normally collected from scientific observations such as telescopes or measurements from lab apparatus.

§1.3 Computer Data Structures

In the past, there was no difference between data science and statistics. However, with the invention of digital computers, data science has become a subject which uses computer software to store data and extract useful information from it. All computer software is written in some kind of programming language, so we will explore how programming languages represent data. As we will see in this section, programming languages support basic data structures, derived data structures and user defined data structures. Note that some languages prefer the term data types instead of data structures. We will treat them as the same thing.
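As a quick preview in Python (a minimal sketch; the Customer class below is a made-up illustration, not part of any library), all three categories appear in a few lines:

import sys

# Basic data structures: Boolean, string, integer, floating point
flag, name, count, price = True, "abc", 42, 3.14

# Derived data structure: a dictionary holding a list
record = {"id": 1, "tags": ["a", "b"]}

# User defined data structure built from basic and derived types
class Customer:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance

print(type(flag), type(record), type(Customer("Ali", 100.0)))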
The basic data structures are Boolean, character (or string), integer and floating point number. They are closely related to the underlying computer architecture (UECS1013 Introduction to Computer Organisation and Architecture and UEEA2283 Computer Organisation and Architecture).

The basic data structures are limited, and high level programming languages were developed to support more complex data structures. Around the 1960s, we had Fortran supporting data structures for scientific applications, Lisp supporting data structures used in symbolic and artificial intelligence applications, and COBOL supporting data structures for business applications. There were other programming languages such as ALGOL 68, but they were not popular. Entering the 21st century, there are many more high level programming languages such as Python, R, C++, Go, Java, Scala, Kotlin, C#, etc. Of all these high level programming languages, Python may be the "easiest to learn", which may be one of the reasons for its popularity.

Contrary to what people try to portray, "data science" has always been important, even before the invention of the "digital computer". Before the 1900s, data processing (in the form of statistics) was applied to business accounting (https://en.wikipedia.org/wiki/Accounting), insurance (https://en.wikipedia.org/wiki/Insurance) and financial data analysis.

In the 1960s, the rise of IBM was also due to the need for fast business data processing. The programming language COBOL was used in so many financial and business institutions that, even today, there are still a lot of COBOL programs that require minor tuning and maintenance.

Since 2000, the growth of the Internet and social media has led to the development of "Big Data" industries. "Big data" needs very specialised computer networks to stream, process and respond to user input. One could learn about "big data" from UECS3223/UECS3473 Cloud Computing (one is the older code while the other is the newer code, which requires students to score at least 40 in the final exam to pass).

In this section, we will learn the data structures used in the "old" programming language COBOL and in the "newer" programming languages such as Java, Python and R, as well as in SQL, plus a little bit about proprietary systems.

§1.3.1 COBOL Data Structures

In the 1960s, writing a program was complex because one needed to use 80-column punch cards to "write" the program and then "place" a stack of punch cards into the computer card reader to load the "program". Therefore, we have the following rules, which seem strange in modern days:

+ The 7th column in each line can be used to identify a line as a comment. If it is an asterisk symbol *, the rest of the line is ignored.
+ Columns 8 through 11 in each line are referred to as the A margin.
+ Columns 12 through 72 in each line are referred to as the B margin.
+ Columns 73 to 80 are not used.
+ Every statement must end with a full-stop ".".
+ A data name should only use letters, digits 0 to 9 and hyphens, and should not be more than 30 characters.
The data structures are declared using the Picture clause, i.e. a statement with the PIC keyword and suitable "characters" (unused positions are set to spaces or zeros). The following are the characters that can be used in Picture clauses according to Murach et al. [2005]:

Item type        Character   Meaning                   Examples
Alphanumeric     X           Any character             X, XX, X(3)
Numeric          9           Digit                     9(3)
                 S           Sign                      S999
                 V           Assumed decimal point     S9(5)V99
Numeric edited   9           Digit                     999
                 Z           Zero-suppressed digit     ZZ9
                 ,           Inserted comma            Z,ZZZ
                 .           Inserted decimal point    ZZZ,ZZZ.99
                 -           Minus sign if negative    ZZZ,ZZ9-
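COBOL's numeric-edited pictures act as output masks. As a rough analogy only (this mapping is our own, not from Murach et al.), Python's format specifications produce similar edited output:

# PIC ZZZ,ZZZ.99 roughly corresponds to zero suppression,
# an inserted comma and an inserted decimal point.
value = 1234.5
print(f"{value:,.2f}")    # 1,234.50
print(f"{-value:,.2f}")   # -1,234.50 (Python uses a leading minus sign)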
The concept behind a COBOL program format is based on the structure of a document outline, with a single top-level heading followed by subordinate levels. The levels of COBOL's hierarchy are PROGRAM, DIVISION (ENVIRONMENT DIVISION, DATA DIVISION, PROCEDURE DIVISION), SECTION, PARAGRAPH, SENTENCE, STATEMENT and CHARACTER. A SECTION contains zero or many PARAGRAPHs; a PARAGRAPH contains one or more STATEMENTs. A STATEMENT is a line of execution which is made up of CHARACTERs. Two sample COBOL programs are given below to illustrate the hierarchy.

Example 1.3.1. Read the following COBOL program and explain what it does.

IDENTIFICATION DIVISION.
PROGRAM-ID. Listing4-1.
AUTHOR. Michael Coughlan.

DATA DIVISION.
WORKING-STORAGE SECTION.
01 UserName       PIC X(20).
*> Receiving data item for DATE system variable: format is YYMMDD
01 CurDate.
   02 CurYear     PIC 99.
   02 CurMonth    PIC 99.
   02 CurDay      PIC 99.
*> Receiving data item for DAY system variable: format is YYDDD
01 DayOfYear.
   02 FILLER      PIC 99.
   02 YearDay     PIC 9(3).
*> Receiving data item for TIME: format is HHMMSSss, s = S/100
01 CurTime.
   02 CurHour     PIC 99.
   02 CurMinute   PIC 99.
   02 FILLER      PIC 9(4).
*> Receiving data item for DATE YYYYMMDD system variable
01 Y2KDate.
   02 Y2KYear     PIC 9(4).
   02 Y2KMonth    PIC 99.
   02 Y2KDay      PIC 99.
*> Receiving data item for DAY YYYYDDD system variable
01 Y2KDayOfYear.
   02 Y2KDOY-Year PIC 9(4).
   02 Y2KDOY-Day  PIC 999.

PROCEDURE DIVISION.
Begin.
   DISPLAY "Please enter your name - " WITH NO ADVANCING
   ACCEPT UserName
   ACCEPT CurDate FROM DATE
   *> GnuCOBOL returns the DayOfYear less by ONE DAY, a bug?
   ACCEPT DayOfYear FROM DAY
   ACCEPT CurTime FROM TIME
   ACCEPT Y2KDate FROM DATE YYYYMMDD
   ACCEPT Y2KDayOfYear FROM DAY YYYYDDD
   DISPLAY "Name is " UserName
   DISPLAY "Date is " CurDay "-" CurMonth "-" CurYear
   DISPLAY "Today is day " YearDay " of the year"
   DISPLAY "The time is " CurHour ":" CurMinute
   DISPLAY "Y2KDate is " Y2KDay SPACE Y2KMonth SPACE Y2KYear
   DISPLAY "Y2K Day of Year is " Y2KDOY-Day " of " Y2KDOY-Year
   STOP RUN.

Solution. The program prompts for the user's name, then uses ACCEPT ... FROM DATE, DAY, TIME, DATE YYYYMMDD and DAY YYYYDDD to read the system date, day of the year and time (in both two-digit and four-digit year formats). Finally it displays the name, the date, the day of the year, the time and the Y2K (four-digit year) versions of the date and day of year.

Example 1.3.2. Read the following COBOL program and explain what it does.

IDENTIFICATION DIVISION.
PROGRAM-ID. Listing5-11.
AUTHOR. Michael Coughlan.
*> Accepts two numbers and an operator from the user.
*> Applies the appropriate operation to the two numbers.

DATA DIVISION.
WORKING-STORAGE SECTION.
01 Num1     PIC 9      VALUE ZEROS.
01 Num2     PIC 9      VALUE ZEROS.
01 Result   PIC --9.99 VALUE ZEROS.
01 Operator PIC X      VALUE SPACE.
   88 ValidOperator VALUES "+", "-", "*", "/".

PROCEDURE DIVISION.
CalculateResult.
   DISPLAY "Enter a single digit number : " WITH NO ADVANCING
   ACCEPT Num1
   DISPLAY "Enter a single digit number : " WITH NO ADVANCING
   ACCEPT Num2
   DISPLAY "Enter the operator to be applied : " WITH NO ADVANCING
   ACCEPT Operator
   EVALUATE Operator
     WHEN "+" ADD Num2 TO Num1 GIVING Result
     WHEN "-" SUBTRACT Num2 FROM Num1 GIVING Result
     WHEN "*" MULTIPLY Num2 BY Num1 GIVING Result
     WHEN "/" DIVIDE Num1 BY Num2 GIVING Result ROUNDED
     WHEN OTHER DISPLAY "Invalid operator entered"
   END-EVALUATE
   IF ValidOperator
     DISPLAY "Result is = ", Result
   END-IF
   STOP RUN.

Solution. The program is a simple calculator. It accepts two single-digit numbers and an operator, uses EVALUATE to apply the corresponding arithmetic operation, and displays the edited result, or an error message if the operator entered is not one of +, -, * or /.

As we shall see in a later section, a business intelligence system needs to handle "COBOL copybooks" and EBCDIC character sets when extracting data from mainframe systems and certain minicomputer systems such as the IBM AS/400 [Kimball and Caserta, 2004].

§1.3.2 Java Data Structures

Java is an important language for many business intelligence and big data software. According to https://towardsdatascience.com/8-open-source-big-data-tools-to-use-in-2018-e35cabd7catd, the following big data processing projects all run on Java:

+ Apache Hadoop: big data processing;
+ Apache Spark: 100 times faster than MapReduce;
+ Apache Storm: a real-time framework for data stream processing;
+ Apache Cassandra: one of the pillars behind Facebook's massive success, as it allows processing of structured data sets distributed across a huge number of nodes across the globe.

Java has three different categories of data types:

1. Primitive data types
   (a) boolean
   (b) char (16-bit Unicode character)
   (c) byte (8-bit signed two's complement integer), short (16-bit), int (32-bit), long (64-bit)
   (d) float (32-bit IEEE 754 floating point), double (64-bit)
2. Derived data types (Java Collection Framework): they are made by using other data types.
   (a) java.lang.String
   (b) java.util.Vector
   (c) java.util.List
   (d) java.util.Set
   (e) java.util.Map
3. User defined data types: classes and interfaces. They are normally a combination of primitive data types and derived data types.

User defined data types are sufficiently powerful for modelling many business entities, but sometimes the BigInteger or BigDecimal data types may be required.

§1.3.3 Python 3 and NumPy Data Structures

The data structures available in standard Python 3 are:

+ Basic data structures: Boolean, strings, integers (equivalent to Java's BigInteger), floating point values and complex floating point values (Java doesn't have the latter);
+ Derived data structures: tuple, list, set, dictionary (hashmap);
+ Record data structures, which are constructed using class;
+ Special data structures: NoneType and Object.

We will first investigate a few data structures in Python and then proceed to study how to "process" data using Python programming.

Example 1.3.3 (Basic Data Structures). Write down the Python 3 statements to perform the following instructions.
1. Express 12345 as (a) a string, (b) an integer, (c) a floating point value.
2. What is the representation of "true" and "false" in Python?

Solution.

astr = '12345'    # 1(a) string
aint = 12345      # 1(b) integer
anum = 12345.0    # 1(c) floating point value
True, False       # 2. the Boolean literals

Example 1.3.4 (Derived Data Structures). Store the integer 1, the string "abc" and the float 2.0 as a (a) tuple, (b) list and (c) set.

Solution.

atup = (1, 'abc', 2.0)   # (a) tuple: immutable
alst = [1, 'abc', 2.0]   # (b) list: like a tuple but mutable, with more methods
aset = {1, 'abc', 2.0}   # (c) set: unordered, no duplicates

Example 1.3.5 (Derived Data Structures).
Store the following data (Malaysia population data from https://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Malaysia_by_population) using a Python dictionary.

Local government area   Total population | Local government area   Total population
Kuala Lumpur            1,588,750        | Padawan                 278,485
Seberang Perai            818,197        | Taiping                 245,182
Kajang                    795,522        | Miri                    234,541
Klang                     744,062        | Kulai                   234,532
Subang Jaya               708,296        | Kangar                  225,590
George Town               708,127        | Kuala Langat            220,214
Ipoh                      657,892        | Kubang Pasu             214,479
Petaling Jaya             613,977        | Bintulu                 212,994
Selayang                  542,409        | Manjung                 211,113
Shah Alam                 541,306        | Batu Pahat              209,461
Iskandar Puteri           529,074        | Sepang                  207,354
Seremban                  515,490        | Kuala Selangor          205,257
Johor Bahru               497,067        | Muar                    201,148
Melaka City               484,855        | Lahad Datu              199,830
Ampang Jaya               468,961        | Hulu Selangor           194,387
Kota Kinabalu             452,058        | Kinabatangan            182,328
Sungai Petani             443,488        | Pasir Mas               180,878
Kuantan                   427,515        | Penampang               176,607
Alor Setar                405,523        | Alor Gajah              173,712
Tawau                     397,673        | Keningau                173,103
Sandakan                  396,290        | Kluang                  167,833
Kuala Terengganu          337,553        | Kemaman                 166,750
Kuching                   325,132        | Sibu                    162,676
Kota Bharu                314,964        | Temerloh                158,724
Kulim                     281,260        | Ketereh                 153,474

Solution.

# Python 3.6+ allows underscores as digit separators in numeric literals
popul = {"Kuala Lumpur": 1_588_750, "Seberang Perai": 818_197, }  # and so on for the remaining rows

Example 1.3.6. Translate the COBOL program in Example 1.3.1 to a Python program.

Solution.

import time

UserName = input("Please enter your name - ")
CurTime = time.localtime()
CurYear = str(CurTime.tm_year)[-2:]
CurMonth = CurTime.tm_mon
CurDay = CurTime.tm_mday
CurHour = CurTime.tm_hour
CurMinute = CurTime.tm_min
YearDay = CurTime.tm_yday
Y2KYear = CurTime.tm_year
Y2KMonth = CurTime.tm_mon
Y2KDay = CurTime.tm_mday
Y2KDOY_Year = CurTime.tm_year
Y2KDOY_Day = CurTime.tm_yday
print("Name is " + UserName)
print("Date is " + str(CurDay) + "-" + str(CurMonth) + "-" + CurYear)
print("Today is day %3d of the year" % (YearDay,))       # Tuple
print("The time is %02d:%02d" % (CurHour, CurMinute))    # Tuple
print("Y2KDate is %d %d %d" % (Y2KDay, Y2KMonth, Y2KYear))
print("Y2K Day of Year is %d of %d" % (Y2KDOY_Day, Y2KDOY_Year))

Example 1.3.7. Translate the COBOL program in Example 1.3.2 to a Python program.

Solution.

num1 = float(input("Enter a single digit number : ")[0])
num2 = float(input("Enter a single digit number : ")[0])
operator = input("Enter the operator to be applied : ")[0]
valid_operator = True
if operator == '+':
    result = num1 + num2
elif operator == '-':
    result = num1 - num2
elif operator == '*':
    result = num1 * num2
elif operator == '/':
    result = num1 / num2
else:
    valid_operator = False
    print('Invalid operator entered')
if valid_operator:
    print("Result is %.2f" % (result,))   # Floating point

Unlike Java data structures, Python data structures are not suitable for large data processing because they are slow. The NumPy array is used as the basic data structure for the pandas module (Section 1.5.2) to handle series and data frames.
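A minimal sketch of the difference: a NumPy array is a homogeneous block of typed memory, so element-wise arithmetic runs in compiled code instead of a Python-level loop.

import numpy as np

xs = list(range(1_000_000))         # Python list: every element is a boxed object
arr = np.arange(1_000_000)          # NumPy array: one contiguous typed buffer

squares_slow = [x * x for x in xs]  # interpreted loop over a million objects
squares_fast = arr * arr            # single vectorised call executed in C

print(arr.dtype, arr.shape)         # e.g. int64 (1000000,)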
§1.3.4 R Data Structures

The rich data structures and statistical packages provided by R make it a powerful open source business intelligence platform. R has three categories of data types:

1. Basic data types:
   (a) Logical (boolean)
   (b) Character (string)
   (c) Integer
   (d) Numeric
   (e) Complex
2. Derived data types:
   (a) Date: Sys.Date()
   (b) Array: array(c(1:4))
   (c) List: list(1, "2", TRUE)
   (d) Matrix: matrix(c(1:4), 2, 2)
   (e) Time Series: ts(rep(1, 10))
   (f) Data Frame: data.frame(x=c(3,2,1), y=c("A","B","C"))
3. User defined types with S3 or S4 classes, e.g.
   setClass("student", slots=list(name="character", age="numeric", GPA="numeric"))

R data structures are rich, and R is used along with the JupyteR stack (Julia, Python, R) for enabling wide-scale statistical analysis and data visualisation. The JupyteR Notebook is one of the most popular big data visualisation tools, as it allows composing literally any analytical model from more than 9,000 CRAN (Comprehensive R Archive Network) algorithms and modules, running it in a convenient environment, adjusting it on the go and inspecting the analysis results at once.

The main benefits of using R are as follows:

1. R can run inside the SQL server;
2. R runs on both Windows (https://mran.microsoft.com/open) and Linux servers;
3. R supports Apache Hadoop and Spark;
4. R is highly portable;
5. R easily scales from a single test machine to vast Hadoop data lakes.

§1.3.5 SQL Data Structures

There is no standard set of data structures across SQL systems. SQLite (see https://www.sqlite.org/datatypes.html) only has the following data structures:

+ NULL. The value is a NULL value.
+ BLOB. The value is a blob of data, stored exactly as it was input.
+ TEXT. The value is a text string, stored using the database encoding (UTF-8, UTF-16BE or UTF-16LE).
+ INTEGER. The value is a signed integer, stored in 1, 2, 3, 4, 6, or 8 bytes depending on the magnitude of the value.
+ REAL. The value is a floating point value, stored as an 8-byte IEEE floating point number.
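SQLite's typing is dynamic: the declared column type only suggests an affinity, and each stored value carries its own storage class. A minimal sketch using Python's built-in sqlite3 module:

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
conn.execute("CREATE TABLE t (a INTEGER, b TEXT, c REAL, d BLOB)")
conn.execute("INSERT INTO t VALUES (?, ?, ?, ?)", (1, "abc", 2.0, b"\x00\x01"))

# typeof() reports the storage class actually used for each value
for row in conn.execute("SELECT typeof(a), typeof(b), typeof(c), typeof(d) FROM t"):
    print(row)                       # ('integer', 'text', 'real', 'blob')

conn.close()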
However, PostgreSQL (see http://www.postgresqltutorial.com/postgresql-data-types/) has richer data structures:

+ Boolean;
+ Character types such as char, varchar, and text;
+ Numeric types such as integer and floating-point number;
+ Temporal types such as date, time, timestamp, and interval;
+ UUID for storing Universally Unique Identifiers;
+ Array for storing arrays of strings, numbers, etc.;
+ JSON for storing JSON data;
+ hstore for storing key-value pairs;
+ Special types such as network addresses and geometric data.

Many companies store their data in SQL systems as a collection of tables of the various data structures above. However, the growth of the Internet has changed the scene of the "database" world from the old LAMP or WAMP stacks to the "cloud" and "NoSQL". So, when it comes to choosing a database, one of the biggest decisions is picking a relational (SQL) or non-relational (NoSQL) data structure. While both are viable options, there are certain key differences between the two that users must keep in mind when making a decision.

SQL databases use the structured query language (SQL) for defining and manipulating data. On one hand, this is extremely powerful: SQL is one of the most versatile and widely used options available, making it a safe choice, especially for complex queries. On the other hand, it can be restrictive: SQL requires that you use predefined schemas to determine the structure of your data before you work with it, and all of your data must follow the same structure. This can require significant up-front preparation, and it means that a change in the structure can be both difficult and disruptive to your whole system. Some examples of SQL databases include SQLite, PostgreSQL, MariaDB/MySQL, FirebirdDB, IBM DB2, Oracle DB, and Microsoft SQL Server. A comprehensive list is given at https://en.wikipedia.org/wiki/List_of_relational_database_management_systems.

According to OvidPerl [2018], prior to the creation of SQL in the 1970s, all databases were NoSQL; that's why we have SQL. There was a client who was running at a maximum of about 40 transactions per second. They won a major contract that required 500 transactions per second, and their technology director told them they needed to switch to NoSQL to get better performance. We got them to around 700 transactions per second in about three weeks ... using PostgreSQL. Most of their performance problems were simply a matter of technical debt and poor use of their database.

The design and setting up of an SQL database system require proper normalisation, configuration, query tuning and some de-normalisation. Embracing NoSQL just because it is a trend is never the correct answer. There are special purpose databases such as Hadoop that make sense in very specific contexts, but if a relational database is a viable option, then that's what one should be picking. Relational databases won the debate about database design four decades ago for very good reasons. Many NoSQL databases are just rehashes of those old, flawed designs that we discarded because they don't work anywhere near as well. And in the few cases where NoSQL does have an interesting trick, chances are the relational database can do it too (or will be able to very shortly).

§1.3.6 NoSQL Data Structures

According to https://en.wikipedia.org/wiki/NoSQL, NoSQL databases can be roughly classified into the following categories:

+ Column-oriented: Accumulo, Apache Cassandra, Scylla, Apache Druid, HBase, Vertica, Google BigTable;
+ Document-oriented: Apache CouchDB, ArangoDB, BaseX, Clusterpoint, Couchbase, Cosmos DB, IBM Domino, MarkLogic, MongoDB, OrientDB, Qizx, RethinkDB, RavenDB;
+ Key-value pair store: Aerospike, Apache Ignite, ArangoDB, Berkeley DB, Couchbase, Dynamo, FoundationDB, InfinityDB, MemcacheDB, MUMPS, Oracle NoSQL Database, OrientDB, Redis, Riak, SciDB, ZooKeeper;
+ Graph-based: AllegroGraph, ArangoDB, InfinityDB, Apache Giraph, MarkLogic, Neo4j, OrientDB, Virtuoso Universal Server;
+ Object database: Objectivity/DB, Perst, ZopeDB.

Since NoSQL databases are closely related to the Web, they support the JSON data structures:

1. null/empty: {"grade": null};
2. object: a set of name/value pairs between { and };
3. Boolean: {"result": true};
4. string: {"name": "Vivek"};
5. number: {"age": 20, "percentage": 82.44};
6. array: {"subjects": ["UECM1304", "UECM3013"]}.
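All six structures map directly onto Python values via the standard json module; a minimal sketch:

import json

doc = '''{"grade": null, "result": true, "name": "Vivek",
          "age": 20, "percentage": 82.44,
          "subjects": ["UECM1304", "UECM3013"]}'''

record = json.loads(doc)   # object -> dict, null -> None, true -> True
print(record["grade"], record["result"], record["subjects"][0])
print(json.dumps(record))  # and back to a JSON string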
A NoSQL database has a dynamic schema for unstructured data, and data is stored in the many ways mentioned earlier. This flexibility means that:

+ We can create documents without having to first define their structure;
+ Each document can have its own unique structure;
+ The syntax can vary from database to database; and
+ We can add fields as we go.

In most situations, SQL databases are vertically scalable, which means that we can increase the load on a single server by increasing things like CPU, RAM or SSD. NoSQL databases, on the other hand, are horizontally scalable: we handle more traffic by sharding, or adding more servers to the NoSQL database. It is like adding more floors to the same building versus adding more buildings to the neighbourhood. The latter can ultimately become larger and more powerful, making NoSQL databases the preferred choice for large or ever-changing data sets. SQL databases are table-based, while NoSQL databases are either document-based, key-value pairs, graph databases or wide-column stores. This makes relational SQL databases a better option for applications that require multi-row transactions, such as an accounting system, or for legacy systems that were built for a relational structure.

According to Kleppmann [2017], companies which do not have petabytes of data should use a relational database instead of NoSQL just for scale, because building for scale may be a waste of effort.

§1.4 Business Intelligence Infrastructure

According to https://en.wikipedia.org/wiki/Business_intelligence, business intelligence consists of the strategies and technologies for the data analysis of business information. A business intelligence infrastructure is developed to organise data and turn them into knowledge. The components of the infrastructure comprise [Kleppmann, 2017]:

+ a data storage (databases);
+ a cache for fast data reading (e.g. Memcached, Redis);
+ an indexing system for finding data by keywords (e.g. ElasticSearch, Solr);
+ a stream processing pipeline for sending messages to other processes asynchronously; and/or
+ a batch processing pipeline for periodically processing a large amount of accumulated data.

There are a lot of commercial offerings of business intelligence such as IBM Cognos, Microsoft PowerBI, Oracle Business Intelligence Suite Enterprise Edition, Hitachi Data Systems, Plotly, Qlik, SAP, SAS, Tableau Software, etc. (see https://en.wikipedia.org/wiki/Business_intelligence_software for more).

There are very limited "open source" business intelligence and data analytics tools which can be found on the Internet. According to Octoparse [2019], some open source tools are:

1. KNIME Analytics Platform (https://www.knime.com/, Java based)
2. OpenRefine (http://openrefine.org/, previously Google Refine, Java based)
3. R Programming (https://www.r-project.org/, S language, C, C++)
4. Orange Data Mining Tool (https://orange.biolab.si/, https://github.com/biolab/orange3, Python based)
5. RapidMiner Studio (https://rapidminer.com/, https://github.com/rapidminer/rapidminer-studio, Java based)
6. Pentaho Data Integration Kettle (https://help.pentaho.com/Documentation/8.2/Products/Data_Integration, Java based)
7. Talend Forge (https://www.talend.com/, https://www.talendforge.org/sources, Java based)
8. Weka (https://www.cs.waikato.ac.nz/ml/weka/, Java based)
9. NodeXL (https://archive.codeplex.com/?p=nodexl, Excel, C# based)
10. Gephi (https://gephi.org/, Java based)

§1.5 Data Handling with Python

In data analytics, text formats are popular because they are easy to debug. Popular text formats include JSON (popularised by the Web and JavaScript), XML and CSV [Kleppmann, 2017]. Popular binary formats include Microsoft Office documents, Apache Thrift (by Facebook), Google's Protocol Buffers, Apache Avro, etc.

In this section, we will study the Python packages that handle various data formats.

§1.5.1 Anaconda Python (on Windows)

The software "Anaconda Python 3.x" (which can be found on the Internet) has all the necessary modules, in particular the pandas module, for data processing with Python 3. The following libraries/modules are the essential components building the pandas module (and are available in Anaconda Python):

+ setuptools: 24.2.0 or higher
+ NumPy: 1.12.0 or higher
+ python-dateutil: 2.5.0 or higher
+ pytz
+ numexpr: for accelerating certain numerical operations; numexpr uses multiple cores as well as smart chunking and caching to achieve large speedups. If installed, must be version 2.6.1 or higher.
+ bottleneck: for accelerating certain types of NaN evaluations; bottleneck uses specialised Cython routines to achieve large speedups. If installed, must be version 1.2.0 or higher.
+ Cython: only necessary to build the development version. Version 0.28.2 or higher.
+ SciPy: miscellaneous statistical functions. Version 0.18.1 or higher.
+ xarray: pandas-like handling for > 2 dimensions, needed for converting Panels to xarray objects. Version 0.7.0 or higher is recommended.
+ PyTables: necessary for HDF5-based storage. Version 3.4.2 or higher.
+ pyarrow (>= 0.9.0): necessary for feather-based storage.
+ Apache Parquet support: either pyarrow (>= 0.7.0) or fastparquet (>= 0.2.1) for parquet-based storage. The snappy and brotli libraries are available for compression support.
+ SQLAlchemy: for SQL database support. Version 0.8.1 or higher recommended. Besides SQLAlchemy, you also need a database-specific driver. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs. Some common drivers are:
  - psycopg2: for PostgreSQL
  - pymysql: for MySQL
  - SQLite: included in Python's standard library by default
+ matplotlib: for plotting. Version 2.0.0 or higher.
+ For Excel I/O:
  - xlrd/xlwt: Excel reading (xlrd, version 1.0.0 or higher required) and writing (xlwt)
  - openpyxl: version 2.4.0 or higher for writing xlsx files (xlrd >= 0.9.0)
  - XlsxWriter: alternative Excel writer
+ Jinja2: template engine for conditional HTML formatting
+ blosc: for msgpack compression using blosc
+ gcsfs: necessary for Google Cloud Storage access (gcsfs >= 0.1.0)
+ One of qtpy (requires PyQt or PySide), PyQt5, PyQt4, pygtk, xsel, or xclip: necessary to use read_clipboard(). Most package managers on Linux distributions will have xclip and/or xsel immediately available for installation.
+ pandas-gbq: for Google BigQuery I/O (pandas-gbq >= 0.8.0)
+ One of the following combinations of libraries is needed to use the top-level read_html() function: (a) BeautifulSoup4 4.2.1+ and html5lib; or (b) BeautifulSoup4 4.2.1+ and lxml.

According to McKinney [2013], NumPy, pandas, matplotlib, jupyter, SciPy, scikit-learn and statsmodels are the essential Python libraries. For the over 200 packages available in Anaconda Python, refer to https://docs.anaconda.com/anaconda/.

§1.5.2 The Pandas Module & Its Data Structures

The standard Python is not suitable for data analysis because it does not provide the necessary tools for easy array handling. The NumPy library provides those tools, and the pandas library is designed for working with tabular data [McKinney, 2013, Chapter 5]. It depends on NumPy and Cython for array structure processing support, and on dateutil and pytz for date-time and timezone processing support (mentioned in Section 1.5.1).

To turn on data processing and analysis support, we need to load the necessary modules in Python as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels as sm
from urllib.request import urlopen
from bs4 import BeautifulSoup

Pandas provides two data structures for data processing, i.e. Series and DataFrame. A series is a 1D labelled array while a data frame is a 2D labelled array. A few important attributes of these data structures ds are: ds.size, ds.dtypes, ds.index, ds.values.
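A minimal sketch of these attributes on a small data frame (the two rows reuse figures from the population table in Example 1.3.5):

import pandas as pd

df = pd.DataFrame({"city": ["Kuala Lumpur", "Ipoh"],
                   "population": [1_588_750, 657_892]})
print(df.size)     # 4 = number of rows x number of columns
print(df.dtypes)   # one object column and one integer column
print(df.index)    # RangeIndex(start=0, stop=2, step=1)
print(df.values)   # the underlying NumPy array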
Example 1.5.1. Construct a pandas series with the list from Example 1.3.4.

Solution.

sr = pd.Series([1, 'abc', 2.0])

Example 1.5.2. Write a Python script to express the top 6 populations from the table in Example 1.3.5 as a pandas series.

Solution: A sample script is given below.

import pandas as pd

states = ["Kuala Lumpur", "Seberang Perai", "Kajang",
          "Klang", "Subang Jaya", "George Town"]
ts = pd.Series([1_588_750, 818_197, 795_522, 744_062, 708_296, 708_127],
               index=states)
print("The two key aspects of a Series are")
print("1. values =", ts.values, "of type", ts.dtype)
print("2. index =", ts.index, "of type", ts.index.dtype)
ts.index = ['KL', 'SP', 'Kj', 'Kg', 'SJ', 'GT']
print("After the index is changed, we have\nts =\n", ts, sep='')

Alternatively, the Python dictionary can also be used.

Example 1.5.3 (https://en.wikipedia.org/wiki/Birth_rate). The birth rate, which stands for the number of live births per thousand of population per year, is essential for predicting the future population of a country. By using years as the index, the birth rate can be represented as a pandas series. An interesting exercise is to read the Malaysia Department of Statistics' Open Data from https://www.dosm.gov.my/v1/index.php?r=column3/accordion&menu_id=alhRYUpWS3B4VYLYaVBOeUFONFpHUT09 and store it as a pandas series.

Example 1.5.4 (https://en.wikipedia.org/wiki/Heart_rate). A human's heart rate is a time series and can be represented as a pandas series.

§1.5.3 Handling Excel Data

We rarely load data by writing a Python program; we usually load data from data files such as Excel files. However, we need to be aware that there are many Excel file formats (see https://en.wikipedia.org/wiki/List_of_Microsoft_Office_filename_extensions):

+ xls: legacy Excel worksheets, 1997-2003 binary format;
+ xlsx: Excel workbook (2007, 2010, 2013, etc.; the versions are not fully compatible);
+ xlsm: Excel macro-enabled workbook;
+ xltx: Excel template;
+ xltm: Excel macro-enabled template.

So when we process Excel files with Python, we need to install the appropriate modules. The following are the standard modules from http://www.python-excel.org/ which handle different types of Excel files:

+ openpyxl: this module is used for reading and writing Excel 2010 (xlsx) files;
+ XlsxWriter: this module can write Excel 2010 files with better support for formatting and charts;
+ xlrd and xlwt: these modules are used for reading and writing legacy Excel xls files.

There are other modules such as pyexcel and xlutils which have some extra functionality but may depend on the modules above. The pandas function pd.read_excel('file.xlsx', index_col=None, header=None) depends on the above modules.
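For instance, a sketch assuming a hypothetical workbook sales.xlsx with a sheet named "2019" (both names are made up for illustration):

import pandas as pd

# pandas picks a reader engine (openpyxl/xlrd) automatically when installed
df = pd.read_excel("sales.xlsx", sheet_name="2019", index_col=None, header=0)
print(df.head())

df.to_excel("sales_copy.xlsx", index=False)   # writing uses openpyxl/XlsxWriter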
§1.5.4 Handling CSV Data

Text files are easier to process than binary files, so many companies and much software save data to CSV files. CSV, which stands for Comma-Separated Values, is a text format that stores tabular data separated by commas in a file. TSV, which stands for Tab-Separated Values, separates table data by tabs instead of commas. The pandas function for reading a CSV file is df = pd.read_csv("file.csv"), and saving df can be carried out with df.to_csv("newfile.csv").

Example 1.5.5. Download the world population data from https://ourworldindata.org/grapher/world-population-by-world-regions-post-1820 and store it as a data frame.

Solution: The Python script is as follows.

import pandas as pd

# https://ourworldindata.org/grapher/world-population-by-world-regions-post-1820
pop_data = pd.read_csv("data/world-population-by-world-regions-post-1820.csv")
print(pop_data.head(4))   # Print the first 4 rows of the data
print(pop_data.tail(4))   # Print the last 4 rows of the data
print(pop_data.index)
print(pop_data.columns)

We can find the similarity in the R script:

pop.data = read.csv("data/world-population-by-world-regions-post-1820.csv",
                    header=TRUE)   # sep=","; dec="."; stringsAsFactors=FALSE
print(head(pop.data, 4))
print(tail(pop.data, 4))
print(rownames(pop.data))
print(colnames(pop.data))

§1.5.5 Handling "Web" Data

Web data are much more complex these days with the flooding of JavaScript. Old style Web data is just simple HTML as shown below.

<!DOCTYPE html>
<html>
<!-- https://www.w3schools.com/html/html_tables.asp -->
<body>

<h4>Table 1</h4>
<table style="width:100%">
  <tr>
    <th>Firstname</th> <th>Lastname</th> <th>Age</th>
  </tr>
  <tr>
    <td>Jill</td> <td>Smith</td> <td>50</td>
  </tr>
  <tr>
    <td>Eve</td> <td>Jackson</td> <td>94</td>
  </tr>
</table>

<h4>Table 2</h4>
<table style="width:100%">
  <tr>
    <th>Name</th> <th colspan="2">Telephone</th>
  </tr>
  <tr>
    <td>Bill Gates</td> <td>55577854</td> <td>55577855</td>
  </tr>
</table>

</body>
</html>
Modern Web data may include other data structures such as JSON and XML. Pandas provides the functions pd.read_html and pd.read_json for dealing with simple HTML and JSON. However, pd.read_json has problems with nested JSON files, and it is advisable to flatten these with Python's json module first, as sketched below.
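A sketch of the flattening step (the nested record below is a made-up example); recent pandas versions expose json_normalize at the top level, older ones under pandas.io.json:

import json
import pandas as pd

raw = '{"users": [{"name": "Ana", "address": {"city": "KL", "postcode": "50000"}}]}'
data = json.loads(raw)

# json_normalize flattens nested objects into dotted column names
df = pd.json_normalize(data["users"])
print(df.columns.tolist())   # ['name', 'address.city', 'address.postcode']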
For the XML format, other Python libraries such as lxml.etree, ElementTree, etc. are required. Since JSON and XML handling is more involved, we will not pursue it further here.

Example 1.5.6. Write a Python script to read the above HTML tables.

Solution. Use pd.read_html to read all the tables at once.

import pandas as pd

tbls = pd.read_html("tableeg.html")
for i, tbl in enumerate(tbls):
    print("="*8 + f"Table {i+1}" + "="*8)
    print(tbl)

When the HTML contains JavaScript, we will need to follow the StackOverflow advice of calling an external browser to do some processing:

# https://stackoverflow.com/questions/25062365/python-parsing-html-table-generated-by-javascript
from pandas.io.html import read_html
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www1.nyse.com/about/listed/IPO_Index.html")

table = driver.find_element_by_xpath('//div[@class="sp5"]/table//table/..')
table_html = table.get_attribute('innerHTML')

df = read_html(table_html)[0]
print(df)

driver.close()

§1.5.6 Handling Proprietary Formats

There are many proprietary formats one needs to deal with in data analysis. For example, much social science research uses the SPSS or Stata formats for storing data. On the other hand, many companies use the SAS business intelligence system and store their data in the SAS format. Due to the popularity of Python, many software companies are providing Python support for their formats. For example, according to https://blogs.sas.com/content/sasdummy/2017/04/08/python-to-sas-saspy/, Python coders can now bring the power of SAS into their Python scripts. The project is SASPy, and it's available on the SAS Software GitHub https://github.com/sassoftware/saspy. It works with SAS 9.4 and higher, and requires Python 3.x. The savReaderWriter and pyreadstat modules can be used to read the SPSS format.
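A minimal sketch with pyreadstat (assuming a hypothetical SPSS file survey.sav exists); read_sav returns both the data and its metadata:

import pyreadstat

df, meta = pyreadstat.read_sav("survey.sav")   # "survey.sav" is a made-up file name
print(df.head())
print(meta.column_names)    # variable names stored in the .sav file
print(meta.column_labels)   # human-readable variable labels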
§1.5.7 Handling SQL Database and Cloud Server

Retrieving data from an SQL database server or a cloud server is more complicated than opening files because a connection to the server is required. We can use pd.read_sql to read and store the result returned by the server. For example, reading from a local SQLite database is shown below.

from pandas.io import sql
import sqlite3

conn = sqlite3.connect('data.db')
query = 'SELECT * FROM tablename'
tbl = sql.read_sql(query, con=conn, parse_dates={'date': '%d/%m/%Y'})
print(tbl.head())

In general, the process can be simplified by using the SQLAlchemy module or other generalised Python modules such as PugSQL (https://pugsql.org).

Dealing with a cloud server is similar, but we sometimes need to use a special API. For example, to deal with spreadsheet data on the Google cloud, we need to use the Google Sheets API, as explained in the following articles.

+ https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2
+ https://developers.google.com/sheets/api/quickstart/python
+ https://github.com/burnash/gspread

§1.6 Assignment Part 1

Based on your training experience, design a programme for your company to train the staff. Write up a proposal which includes the following items:

+ The data structure of the training programme;
+ An update and review system for the programme structure;
+ An online testing system for the trainees;
+ A data analytical system for the results of the online trainees;
+ An expansion of your system to accommodate external trainees;
+ An evaluation of whether the data analytical system is worthwhile, listing the pros and cons of the system.
