
Business Intelligence System Infrastructure

Database Systems
Administration and Management
(Foundations of transforming
and storing Big Data)

Algonquin College

20W_CST2200
References and Articles
CST2200 Database Systems Administration and Management
Foundations of transforming and storing Big Data

Contents
Introduction
ETL Core Competencies
Creativity
Technology and tools
  Microsoft Excel
  Python
  PostgreSQL
  Geospatial related
  Talend Open Studio
General ETL & Data Science
Interesting sources of data
  Canadian Open Data (municipal and federal)
Useful Data Science and ETL tools
  R
  KNIME
  Microsoft VBA
  QGIS
  Windows Subsystem for Linux

Algonquin College – W20_CST2200_300 - 200104 Page 2



Introduction
This document contains a collection of references and articles that may be helpful to your success in
CST2200. Content in this document is not examinable (will not appear on tests or exams) unless
specifically referred to in a slide or during a lecture.

As you work through these resources, be sure to keep notes on any issues you run into so we can talk
about them in class. Be sure to ask about anything here that sparks your curiosity.


ETL Core Competencies


Business Intelligence (BI) consists of strategies and technologies used for analysis of structured and
unstructured information (data). BI technologies can identify and create understanding that can be
leveraged to provide businesses with competitive market advantages.

BI analytics cannot be performed without data. But raw data is (usually/often) next to useless. It needs
to be structured in such a way that it can be recognized and differentiated in order to be understood.
Only then can a business glean intelligence out of their data.

At its root, the process of applying a controlled structure to data is known as ETL (extract, transform and
load). Having a solid foundation in ETL will significantly strengthen your analytical skills and abilities.
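
To make the three stages concrete, here is a minimal sketch in Python using only the standard library. The sample data, the table name `sales`, and the function names are assumptions made up for illustration; a real job would read from files or other systems and would usually load into a full RDBMS.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or an API.
RAW_CSV = """name,city,amount
 Alice ,ottawa,10.50
Bob,TORONTO,n/a
Carol,Ottawa,7
"""

def extract(text):
    """Extract: read raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, normalize case, drop rows with bad amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount is not numeric
        clean.append((row["name"].strip(), row["city"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the structured rows into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (name TEXT, city TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit
load(transform(extract(RAW_CSV)), conn)
totals = conn.execute(
    "SELECT city, SUM(amount) FROM sales GROUP BY city ORDER BY city"
).fetchall()
print(totals)
```

Only after the transform step can the data be grouped and summed reliably; the raw rows mix letter case, stray spaces and a non-numeric amount.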

Core competencies for every ETL developer.


- Modelling theory
- Technology and tools
- Performance tweaking / Debugging / Problem solving
- Creativity
- Continuous learning
- Data Stewardship (ownership and security)

Modelling theory

- A mandatory competency for all ETL work.


- A proper model (data modelling, process modelling) is critical to solve any problem.
- Database modelling (for staging areas and data warehouse databases) is critical to meet
performance and storage resource goals.

Technology and Tools

- A mandatory competency for all ETL work.


- You don’t need to be a full stack web development professional. But you should know what it
means to be one.
- While “most” of the data will be text based, you must be able to consider binary files (images,
audio and video) and their associated embedded metadata.
- An understanding of how computers process data is required in order to optimize performance
(SSD vs. 10k RPM vs. Raid disk structures; “in memory” tables; column vs row data structures).
- SQL, a standard within RDBMS environments, is a required skill for any ETL developer.
- Oracle, Microsoft SQL Server, MySQL, PostgreSQL and Teradata are examples of what? Are there
additional choices available? You must know enough about each to make educated decisions.
- When available to you, making the right choice of operating systems can significantly impact
overall performance. Knowing when to use desktop or server versions of Windows, UNIX and
even Apple OSes will be important.
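
The "column vs row data structures" point above can be illustrated with a small Python sketch. The sample records are hypothetical, and plain lists and dictionaries only approximate what a database engine does on disk, but the access pattern is the same idea.

```python
# Hypothetical sample data: three records with three fields each.
rows = [
    {"id": 1, "city": "Ottawa", "amount": 10.5},
    {"id": 2, "city": "Toronto", "amount": 4.0},
    {"id": 3, "city": "Ottawa", "amount": 7.0},
]

# Row-oriented: each record is stored together, so fetching one whole record is cheap.
record = rows[1]

# Column-oriented: each field is stored together, so scanning one field is cheap.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query such as SUM(amount) only touches one column here.
total = sum(columns["amount"])
print(record, total)
```

Transactional work (insert or fetch one record) tends to favour the row layout, while analytical work (aggregate one field over many records) tends to favour the column layout.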


Performance tweaking / Debugging / Problem solving

- For those who are more tech savvy, stronger skills here will allow you to differentiate yourself
from the growing crowd of ETL developers.
- Learning how to parameterize your ETL jobs can save time and reduce headaches. Using
parameters allows you to dynamically change certain aspects of your ETL job without altering
the job itself.
- There will be times where the ETL tools alone cannot do everything that is needed. Scripting
languages can aid with juggling files, directories, users, and permissions. Popular scripting
languages for ETL include Python, Perl, and Bash.
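
As a rough sketch of what parameterizing a job can look like in a Python script, the example below uses the standard `argparse` module. The parameter names (`--source`, `--table`, `--batch-size`) and the `run_job` function are hypothetical; ETL tools such as Talend have their own mechanisms for the same idea.

```python
import argparse

def build_parser():
    """Define the job's parameters (all names here are hypothetical)."""
    parser = argparse.ArgumentParser(description="Parameterized ETL job sketch")
    parser.add_argument("--source", required=True, help="path of the input file")
    parser.add_argument("--table", default="staging", help="target table name")
    parser.add_argument("--batch-size", type=int, default=1000, help="rows per insert batch")
    return parser

def run_job(args):
    """Stand-in for the real job: report what would be done with these settings."""
    return f"load {args.source} into {args.table} in batches of {args.batch_size}"

# Passing a list to parse_args() simulates command-line arguments.
args = build_parser().parse_args(["--source", "sales.csv", "--batch-size", "500"])
message = run_job(args)
print(message)
```

The same script can then be pointed at a different file or table by changing the arguments, without editing the job itself.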

Creativity

- The ability to transcend traditional ideas, rules, patterns, relationships, to create meaningful
new ideas, forms, methods, interpretations. 1
- Strong creative skills can make ETL work much more satisfying, yet creativity is often
overlooked as a core competency. This is another area where you can differentiate yourself
from the crowd of ETL developers.

Continuous learning

- To keep your career moving, don’t forget to watch trends in the market. Knowing when to adopt
new methods must be balanced with resources such as time and money. Technology tends to be
much less frustrating when you understand the inner workings.

Data Stewardship (reliability, ownership, security)

- Respecting data stewardship is another key competency. Being able to maintain data integrity
and security (from acquisition to dissemination) will require all of the above competencies.
- Consider ethical uses of data; lossy vs. lossless compression; data encryption; dropping
unneeded data; maintaining chains of ownership; laws governing retention and distribution of
data; international regulations about privacy and access to information; on premise vs. cloud
storage – These (and more) all come into play in the design and implementation of any data
solution that feeds BI efforts.
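
Two of the items above, lossless compression and maintaining data integrity, can be sketched together in Python. The payload is hypothetical, and SHA-256 with zlib is just one reasonable pairing chosen for illustration, not a prescribed standard for this course.

```python
import hashlib
import zlib

# Hypothetical payload standing in for an extracted data file.
original = b"id,city,amount\n1,Ottawa,10.50\n2,Toronto,4.00\n" * 100

# Record a checksum at acquisition time.
checksum_at_source = hashlib.sha256(original).hexdigest()

# Lossless compression: the exact bytes can be recovered later.
compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

# Verify integrity after the round trip by comparing checksums.
checksum_after = hashlib.sha256(restored).hexdigest()
print(len(original), len(compressed), checksum_at_source == checksum_after)
```

A lossy format (for example JPEG for images) would fail this check, because the restored bytes differ from the original.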

1 https://www.dictionary.com/browse/creativity


Creativity
The ability to transcend traditional ideas, rules, patterns, relationships, or the like, and to create
meaningful new ideas, forms, methods, interpretations.2

Some of the messages in the following videos are subtle, but they all have a common element of
creativity.

The human insights missing from big data (15 min)


https://www.ted.com/talks/tricia_wang_the_human_insights_missing_from_big_data/reading-list

Big data is better data (16 min)


https://www.ted.com/talks/kenneth_cukier_big_data_is_better_data?language=en

The rise of human-computer cooperation ( min)


https://www.ted.com/talks/shyam_sankar_the_rise_of_human_computer_cooperation?language=en

The birth of a word (min)


https://www.ted.com/talks/deb_roy_the_birth_of_a_word?language=en

The math behind basketball’s wildest moves


https://www.ted.com/talks/rajiv_maheswaran_the_math_behind_basketball_s_wildest_moves?language=en

Is Big Data Killing Creativity? | Michael Smith | TEDxHarvardCollege


https://youtu.be/A1XibEzp6K0

Why smart statistics are the key to fighting crime


https://www.ted.com/talks/anne_milgram_why_smart_statistics_are_the_key_to_fighting_crime?language=en

What do we do with all this big data?


https://www.ted.com/talks/susan_etlinger_what_do_we_do_with_all_this_big_data

Making data mean more through storytelling | Ben Wellington | TEDxBroadway


https://youtu.be/6xsvGYIxJok

The upside of data


https://www.ted.com/talks/jessica_donohue_the_upside_of_data

2 https://www.dictionary.com/browse/creativity


Technology and tools


Throughout this program, we will be utilizing the following tools and packages to help us traverse the
world of ETL and Database Administration.

- Python: A programming language / environment that serves many functions, from scripting
(executing other programs in a controlled way) to full applications with user interfaces.
- Microsoft Excel: This is much more than a number cruncher. Of course, Excel can be used to
perform actual analytical work, but that’s for a different course. We’ll be using it to help us move
data through the transform stage. One-off transformations of fewer than 1 million rows work well
in Excel.
- UNIX (Ubuntu): Ubuntu is one of many Linux distributions in the world of UNIX-like operating
systems. We will learn to use typical UNIX tools like sed, awk, tail and grep.
- PostgreSQL: Most of our database work will be completed with PostgreSQL.
- VMware: We will be leveraging virtual machines to install required software. This way we can
reduce the clutter on our host computers.
- Talend Open Studio for Data Integration: A free, open source tool that simplifies the loading,
extraction, transformation and processing of large and diverse data sets.
- Data Visualization Tools: Tableau and Power BI will be used to validate our ETL efforts. We will
connect to data at various stages of ETL (raw, CSV, Excel, Database) to help us understand
limitations prevalent at each of those stages.


Microsoft Excel
500 Excel Formula Examples
https://exceljet.net/formulas

Beginner level materials

Lynda – Learning Excel 2019


https://www.lynda.com/Excel-tutorials/Learning-Excel-2019/746264-2.html?org=algonquincollege.com

Lynda – Excel: Introduction to Formulas and Functions


https://www.lynda.com/Excel-tutorials/Excel-Introduction-Formulas-Functions/743149-2.html?org=algonquincollege.com

Lynda – Excel Quick Tips


https://www.lynda.com/Excel-tutorials/Excel-Quick-Tips/530432-2.html?org=algonquincollege.com

Lynda – Excel: PivotTables for Beginners


https://www.lynda.com/Excel-tutorials/Excel-PivotTables-Beginners/651187-2.html?org=algonquincollege.com

Intermediate level

Lynda – Excel: PivotTables in Depth


https://www.lynda.com/Excel-tutorials/Excel-PivotTables-Depth/761925-2.html?org=algonquincollege.com

Lynda – Excel Data Visualization Part 1: Mastering 20+ Charts and Graphs
https://www.lynda.com/Excel-tutorials/Excel-Data-Visualization-Part-1-Mastering-20-Charts-Graphs/791339-2.html?org=algonquincollege.com

Advanced

Lynda – Excel Data Visualization Part 2: Designing Custom Visualization


https://www.lynda.com/Excel-tutorials/Excel-Data-Visualization-Part-2-Designing-Custom-Visualizations/791340-2.html?org=algonquincollege.com

Various postings and articles of interest

Find Position of the Last Occurrence of a Character in a String in Excel


https://trumpexcel.com/find-characters-last-position/

29 ways to save time with Excel formulas


https://exceljet.net/blog/29-ways-to-save-time-with-excel-formulas

Excel performance: Improving calculation performance


https://docs.microsoft.com/en-us/office/vba/excel/concepts/excel-performance/excel-improving-calcuation-performance

Excel: Round a number to n significant digits


https://exceljet.net/formula/round-a-number-to-n-significant-digits
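
For comparison, the same rounding can be sketched in Python. The usual Excel approach combines ROUND with LOG10 to find the position of the leading digit; the function below follows that idea. The name `round_sig` is hypothetical and not part of any library.

```python
import math

def round_sig(value, digits):
    """Round value to the given number of significant digits."""
    if value == 0:
        return 0.0  # log10(0) is undefined, so handle zero separately
    # Position of the leading digit relative to the decimal point.
    magnitude = math.floor(math.log10(abs(value)))
    return round(value, digits - 1 - magnitude)

print(round_sig(1234567, 2))
print(round_sig(0.004567, 2))
```

For 1234567 the leading digit sits six places left of the units position, so rounding to two significant digits means rounding to -5 decimal places.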


Python
Python Libraries that are interesting

- openpyxl – A Python library to read/write Excel 2010 xlsx/xlsm files


https://media.readthedocs.org/pdf/openpyxl/latest/openpyxl.pdf
- pyinstaller – Packages Python code into an executable file for standalone distribution.
From a command prompt, run pyinstaller --onefile myprogram.py
https://www.pyinstaller.org/
- psycopg2 – A PostgreSQL database adapter for Python
https://wiki.postgresql.org/wiki/Psycopg2
- beautifulsoup4 – Web scraping library
https://pypi.org/project/beautifulsoup4/
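
To give a feel for what web scraping involves, here is a minimal sketch that pulls links out of a page. It deliberately uses the standard library's `html.parser` instead of beautifulsoup4 so that it runs without installing anything; beautifulsoup4 offers a friendlier interface for the same task. The `LinkCollector` class and the page content are hypothetical.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; a real scraper would download this first.
PAGE = """
<html><body>
  <a href="https://www.postgresql.org/">PostgreSQL</a>
  <p>No link here.</p>
  <a href="https://www.sqlite.org">SQLite</a>
</body></html>
"""

collector = LinkCollector()
collector.feed(PAGE)
print(collector.links)
```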

Various postings and articles of interest

Lynda – Introduction to Beautiful Soup (Web scraping tool)


https://www.lynda.com/Python-tutorials/Python-Data-Science-Essential-Training/520233-2.html?org=algonquincollege.com

Practical Introduction to Web Scraping in Python


https://realpython.com/python-web-scraping-practical-introduction/

Beginner’s guide to Web Scraping in Python (using BeautifulSoup)


https://www.analyticsvidhya.com/blog/2015/10/beginner-guide-web-scraping-beautiful-soup-python/

Python for beginners


https://www.pythonforbeginners.com/beautifulsoup/

How to Web Scrape with Python in 4 Minutes


https://towardsdatascience.com/how-to-web-scrape-with-python-in-4-minutes-bc49186a8460

Ten handy python libraries for (aspiring) data scientists


https://bigdata-madesimple.com/ten-handy-python-libraries-for-aspiring-data-scientists/

Top 20 Python libraries for data science in 2018


https://activewizards.com/blog/top-20-python-libraries-for-data-science-in-2018/

SQLite (a single-user SQL database, great for embedding SQL into Python without a full RDBMS)
https://www.sqlite.org
http://www.sqlitetutorial.net/sqlite-python
https://sqlitebrowser.org
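
A minimal sketch of embedding SQL in Python with the built-in `sqlite3` module follows. The table and the second row are made up for illustration.

```python
import sqlite3

# ":memory:" creates a temporary database; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE course (code TEXT PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO course VALUES (?, ?)",
    [("CST2200", "Database Systems Administration and Management"),
     ("DEMO101", "Hypothetical second course")],
)

# Placeholders (?) keep values separate from the SQL text.
row = conn.execute(
    "SELECT title FROM course WHERE code = ?", ("CST2200",)
).fetchone()
print(row[0])
conn.close()
```

No server, user accounts or configuration are needed, which is why SQLite is convenient for small scripts and prototypes.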


PostgreSQL
PostgreSQL primary website
https://www.postgresql.org/

Location of PostgreSQL install files


https://www.postgresql.org/download/windows/

YouTube video – Installation walkthrough


https://youtu.be/ghTksCsFBcI

PostgreSQL Tutorial
http://www.postgresqltutorial.com/

Learn PostgreSQL
https://www.tutorialspoint.com/postgresql/

PostgreSQL Python Tutorial


http://www.postgresqltutorial.com/postgresql-python/

DB Designer Fork
https://sourceforge.net/projects/dbdesigner-fork

System Architect
https://www.codebydesign.com

Geospatial related
Intro to Python GIS
https://automating-gis-processes.github.io/CSC18/course-info/Installing_Anacondas_GIS.html
https://automating-gis-processes.github.io/CSC18/lessons/L2/projections.html

Excel GIS conversion


https://geogeek.xyz/download-excel-template-convert-geographic-coordinates-utm.html
https://www.colby.edu/chemistry/Colby%20Compass/King%20Coordinate%20Conversion%20Master.xls
https://grindgis.com/wgs84-vs-nad83/

2016 Census - Boundary files (shapefiles)


https://www12.statcan.gc.ca/census-recensement/2011/geo/bound-limit/bound-limit-2016-eng.cfm

Talend Open Studio


Introductory video tutorials
https://www.talend.com/resources/data-integration-how-to-build-job

Components list
https://www.talendforge.org/components/index.php

Install walkthrough.
https://youtu.be/MtR-o0asWRU


General ETL & Data Science


Lynda – Twelve myths about big data
https://www.lynda.com/Big-Data-tutorials/Twelve-Myths-About-Data-Science/560047-2.html?org=algonquincollege.com

The 10 Best Web Scraping Tools of 2018


https://www.scraperapi.com/blog/the-10-best-web-scraping-tools

Talend Open Studio for Data Integration (Unix_Linux|Windows|MAC)


https://www.talend.com/products/data-integration-manuals-release-notes/?lang=en

How To Install Java with `apt` on Ubuntu 18.04


https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-on-ubuntu-18-04

TED – What we learned from 5 million books


https://www.ted.com/talks/what_we_learned_from_5_million_books

Data generator websites


https://mockaroo.com/
https://www.generatedata.com/
https://www.onlinedatagenerator.com/


Interesting sources of data


Canadian Open Data (municipal and federal)
- Open Data Standards Pilot Project – An Open Data collaborative effort between Toronto,
Ottawa, Vancouver and Edmonton
https://cdn.ymaws.com/www.misa-asim.ca/resource/collection/DB45CAF9-4640-4381-9B67-B1A4DF288874/MISA_Ontario_Open_Data_Standards_Pilot_Project_Report.pdf
- Open Data Ottawa
http://data.ottawa.ca
- Open Data Canada
https://open.canada.ca/en/open-data
- Kaggle – A crowd-sourced platform for data scientists
https://www.kaggle.com/


Useful Data Science and ETL tools


This section will list software packages and tools that we will discuss in class at least at an introductory
level.

R
R is a programming language for statistical computing. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS. It is considered a
mainstream tool within data science communities.

R is flexible enough that you can find it being used as an ETL tool in addition to statistical modelling.

- The R Project for Statistical Computing: https://www.r-project.org/


- RStudio: https://www.rstudio.com/
- Wikipedia R: https://en.wikipedia.org/wiki/R_(programming_language)

KNIME
KNIME is an open-source data analytics platform. It integrates components through a modular data
pipelining concept. Modeling, data analysis and visualization can be performed with little programming.
To some extent, KNIME can be considered a SAS alternative.

- KNIME: https://www.knime.com/
- Wikipedia KNIME: https://en.wikipedia.org/wiki/KNIME

Microsoft VBA
VBA (Visual Basic for Applications) is the programming language of Excel and other MS Office programs.
If, for example, you have tasks in Microsoft Excel that you do repeatedly, you can record a macro to
automate those tasks. These macros are written in VBA.

VBA code can link most of the MS Office suite. The most relevant for us are Excel and Access. VBA code
normally can only run within a host application, rather than as a standalone program. VBA can, however,
control one application from another using OLE Automation.

While VBA skills are not too useful when creating corporate ETL solutions, they can certainly make a
difference in your personal productivity (testing and preparing data).

Should you run out and become an expert VBA programmer? No, not really; Python and R skills are
more practical.

- Wikipedia VBA: https://en.wikipedia.org/wiki/Visual_Basic_for_Applications


- PC World article: https://www.pcworld.com/article/2880353/software-productivity/5-essential-tips-for-creating-excel-macros.html
- Is VBA dead? https://analystcave.com/vba-dead-whats-future-vba/

QGIS
QGIS is a free and open-source geographic information system application that supports viewing,
editing, and analysis of geospatial data.

QGIS integrates with other open-source GIS packages, including PostGIS, GRASS GIS, and MapServer.
Plugins can be written in Python or C++ to extend QGIS's capabilities. Plugins can geocode using the
Google Geocoding API, perform geoprocessing functions similar to those of the standard tools found in
ArcGIS, and interface with PostgreSQL/PostGIS, SpatiaLite and MySQL databases.

A Free and Open Source Geographic Information System


https://qgis.org/en/site/

Wikipedia – Spatial ETL


https://en.wikipedia.org/wiki/Spatial_ETL

Windows Subsystem for Linux


This is very cool for the UNIX geeks in the class.

What is Windows Subsystem for Linux (WSL)?

The Windows Subsystem for Linux (WSL) is a new Windows 10 feature that enables you to run native
Linux command-line tools directly on Windows, alongside your traditional Windows desktop and
modern store apps.

Who is WSL for?

This is primarily a tool for developers -- especially web developers and those who work on or with open
source projects. This allows those who want/need to use Bash, common Linux tools (sed, awk, etc.) and
many Linux-first tools (Ruby, Python, etc.) to use their toolchain on Windows.

What can I do with WSL?

WSL provides an application called Bash.exe that, when started, opens a Windows console running the
Bash shell. Using Bash, you can run command-line Linux tools and apps. For example, type lsb_release -a
and press Enter to see details of the Linux distro currently running.

https://docs.microsoft.com/en-us/windows/wsl/faq

https://docs.microsoft.com/en-us/learn/modules/get-started-with-windows-subsystem-for-linux/
