
Beginner’s Guide to Data Analytics









Oliver Theobald

First Edition. Copyright © 2017 by Oliver Theobald. All rights reserved. No part of this publication may be
reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or
other electronic or mechanical methods, without the prior written permission of the publisher, except in the
case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by
copyright law.
This book is written for anyone who is interested in making sense of data
analytics without assuming that you understand specific data science
terminology or advanced programming languages.

If you are just starting out as a student of data science, or already working in
marketing, medical research, senior management, policy analysis or IT, then this
book is ideally suited to you.

Overview of Data Science


The Evolution of Data Science
The discipline of studying large volumes of data, known as ‘data science’, is
relatively new and has grown hand-in-hand with the development and wide
adoption of computers. Prior to computers, data was calculated and processed by
hand under the umbrella of ‘statistics’ or what we might now refer to as
‘classical statistics’.

Baseball batting averages, for example, existed well before the evolution of
computers. Anyone with a pencil, notepad and basic mathematical skills could
calculate Babe Ruth’s batting average over a season with the aid of classical
statistics.

The process of calculating a batting average involved the dedication of time to
collect and review batting sheets, and the application of addition and division.

Growing up in Australia I calculated my own batting average playing cricket.
With a knack for remembering numbers, I recorded runs in my head as I scored
them on the field. Then in the car ride home I’d calculate my season batting
average to a single decimal point. I simply divided the sum of runs scored over
the season by the total number of dismissals (the cricket equivalent of being
‘out’ in baseball).

Admittedly I didn’t score a lot of runs, so my backseat number crunching hardly
qualified me as a statistical prodigy!

But the key point to make about classical statistics is that because you’re
working with a small data set, you don’t need a computer to manage the data and
draw out insightful information. An educated twelve-year-old should be more
than capable of applying basic statistics.

Indeed, statistics are still taught in schools today, as they have been for centuries.
There are advanced levels of classical statistics but the volume of data remains
manageable for us as humans to process.

However, what if you wanted to calculate numbers at a higher velocity, higher
volume and higher value? What if you wanted to conduct calculations on your
heartbeat? Not just on your heartbeat, but also on how your heartbeat
reacts to temperature fluctuations and the calories you consume. This is not
something you can calculate in your head, or even on paper for that matter. Nor
would it be practical to collect such data by hand.

This is where the information age and the advent of computers have radically
transformed the field of statistics. Modern computing technology provides the
infrastructure to now collect, store and analyze massive amounts of data.

Data storage, or database management systems, came into existence as early as
the 1960s. But it wasn’t until the following decade that statistics and computer
science were united, following the creation of relational database
management systems.

Following the accelerated growth of database systems, the terms “knowledge
economy” and “data mining” gained currency in the 1980s. “Big data”, in the form of
massive quantities of business data, then came into existence in the 1990s but
was not known as “big data” until the end of the decade.

The first official reference to big data was in an article published by the ACM
Digital Library in 1997.

Today, big data, big data analytics, machine learning and data mining are all
popular but less well understood terms. The following chapters will walk you
through the definitions and the unique characteristics of these terms.
Big Data
What is big data? Big data describes a data set which, due to its value,
variety and velocity, defies conventional ways of processing and relies on
technology to be handled. In other words, big data is a collection of data that
would be virtually impossible for a human to make sense of without the help of a
computer.

Big data does not have an exact definition in terms of size, or how many rows
and columns it would take to house such a large data set. But data sets are
becoming ever larger as we find new ways to efficiently collect and
store data at low cost.

It’s important as well to note that not all useful data is considered “big data”, as
the following example points out.

Imagine the total number of coffees sold by Starbucks over one business day in a
suburb in the U.S. Total sales can be calculated on the back of a napkin by
recording the total number of sales of each store within that suburb, and totalling
those numbers using simple addition. This however – as you may have guessed
by the mention of a napkin – is not considered ‘big data’.

Simple calculations such as total revenue, total profits and total assets have been
recorded for millennia with the aid of pen and paper. Other rudimentary
calculation tools such as abacuses in China have been used with equal success.

Nor does Starbucks dwarf the size of companies in existence prior to the
computer age. The British Empire is a notable example of a highly organised and
massive organization that could calculate income generated across a multitude of
far-flung geographical territories.

What today defines big data is the power to process much larger sets of data to
unearth information never seen before through the aid of computers.

Given the advent of computer-powered data mining, what then can the luxury
brand Louis Vuitton learn today from big data that they couldn’t 50 years ago?

We can assume that profits, sales revenue, expenses and wage outlays are
recorded with virtually the same precision today as they were 50 years ago. But
what about other observations? How do staff demographics, for example,
impact total sales?

Let’s say we want to know how the age, company experience and gender of
Louis Vuitton service staff impact a customer’s purchasing decision.

This is where technology and computers come into the frame. Digital equipment,
including staff fingerprint check-in systems, customer relationship management
systems (to manage details about sales and staff members), and payment systems
can be all linked into one ecosystem.

The data is then stored in a database management system on a physical server or
a distributed computing storage platform such as Hadoop, within a series of
interconnecting tables that can be retrieved for instant access, or analyzed at a
later date.

Big data analytics or data mining can then be applied to clean up and analyze the
data to uncover interesting variables and gain insight from the trove of
information collected.

Other business examples are plentiful. Starbucks now chooses store locations
based on big data reports that factor in nearby location check-ins on social
media, including Foursquare, Twitter and Facebook.

Netflix invested in a whole TV series based on a direct relationship they
extracted via big data analytics and data mining. Netflix identified that:
- Users who watched the David Fincher directed movie The Social Network typically watched it from beginning to end.
- The British version of “House of Cards” was well watched.
- Those who watched the British version of “House of Cards” also enjoyed watching films featuring Kevin Spacey, and/or films directed by David Fincher.

These three synergies equated to a potential audience lucrative enough to
warrant purchasing the broadcasting rights to the acclaimed American TV
series House of Cards.

Another big data business example is Juwai.com. Juwai is an Australian-founded
online real estate platform that lists overseas properties for its user base of
Chinese investors.

This online real estate platform is now leveraging its access to big data to feed
insights to hedge fund managers and investment bankers.

Based on what users search for on its portal, Juwai can collect data early in the
purchasing decision cycle and synthesise search queries rapidly through cloud
computing and a supercomputer called Vulcan (of which only five exist in the
world).

The online behaviour they can capture from users on the site can be packaged up
and commercialised to pinpoint future real estate patterns.

As an example, Juwai explained to me that a major trend over the last 12 months
has been a surge in interest in Japanese real estate. A historically low yen and
growing exposure to Japan through tourism from China are leading to strong
demand for Japanese properties from Chinese buyers.

With Juwai’s data 6-12 months ahead of the purchasing cycle, investment firms
can stock up on urban hotspots in Japan and properties in close proximity to
universities (which are a traditional magnet for Chinese investment money).

However, it’s important to remember that big data is not a process. It is a noun to
describe a lot of data. Extracting insight and value from big data requires data
science approaches, such as big data analytics and data mining. We will explore
the difference between big data analytics and data mining in the next two
chapters.
Big Data Analytics
Big data analytics is a discipline aimed at analyzing big data. The defining
feature of big data analytics is having a starting hypothesis.

An example of a big data analytics hypothesis could be: there is a relationship
between the ambience (measured in decibels) at Manchester United home games played
at Old Trafford and the likelihood of the home team coming from behind to win.

How do you conduct big data analytics? Is there an app for it?

Not really.

Due to the inherent size of big data, big data analytics involves the use
of significant IT infrastructure or distributed computing technologies as well as
advanced analytic techniques.

‘Distributed computing’ refers to an approach that breaks one project down into
smaller tasks spread across multiple computer servers. Once partitioned into smaller
tasks, each task is assigned to a computer server. These computer servers could
be based in the same location or in different geographical locations.

Cloud computing, for example, provides the infrastructure to run distributed
computing over two or more continents at virtually the same cost as processing
data in one location. Microsoft, Amazon and my own employer Alibaba Cloud
all have data centers located around the world from which you can access
distributed computing on the cloud.

Computer servers within one data center can link up with each other, or with
computer servers at another data center, in order to store and process data via
distributed computing.
Data Mining
Data mining is a term to describe the process of discovering and unearthing
patterns in a data set. The term is aptly named.

Much like prospecting for gold in the 19th Century Gold Rush, data mining
begins without a clear future outcome. In fact, you don’t even know what you
are mining for! It could be gold, but it could just as equally be silver or oil that
you stumble upon.

This is where data mining differs from big data analytics. Data mining does not
start with a set hypothesis.

Data mining involves applying advanced algorithms to unearth previously
unknown relationships, patterns and regularities from a very large data set.

A key point to remember here is that data mining only applies to situations when
you are looking to find patterns and regularities within the data set that are yet to
be seen. For example, there’s no point running advanced data mining algorithms
over your company’s website traffic to discern key user groups if you’re already
aware that these key groups exist. This would instead qualify as data analytics –
and wouldn’t actually register as big data anyway.

Given that data mining doesn’t begin with a hypothesis as an initial starting
point, a myriad of data sorting techniques are applied. One of the better-known
techniques is text retrieval.

Text retrieval labels a group of data entries into specific categories. For
instance, football fans could be broken down into two categories: hooligans and
non-hooligans.

We will discuss specific data mining techniques, including text retrieval, in later
chapters.
Machine Learning
A big question for people new to the field of data science is what’s the
difference between ‘data mining’ and ‘machine learning’?

As mentioned earlier, both disciplines fall under the broad umbrella of data
science. Under the umbrella of data science is a grouping of disciplines
including artificial intelligence. Machine learning then falls under the field of
artificial intelligence.

Data mining, on the other hand, is not considered artificial intelligence and
belongs to its own discipline within data science. Why machine
learning and data mining are separate disciplines is easily understood by
examining their properties.

Machine learning concentrates on prediction based on already known properties
learned from the data.

Data mining concentrates on the discovery of previously unknown properties
within the data.

There is however a strong correlation between the two. In some cases, data
mining utilizes the same algorithms applied to machine learning in order to work
the data. Popular algorithms such as k-means clustering, dimension reduction
algorithms and linear regression are used in both data mining and machine
learning.


Given the interconnectivity between data mining and machine learning, it is
important to understand and distinguish the two from each other.

Let’s now look at machine learning. Machine learning focuses on developing
algorithms that can learn from the data and make subsequent predictions to solve
a business, social or other problem.

Machine learning algorithms have existed for decades, but only in
recent times have computing power and data storage caught up to make machine
learning widely practical.

Computers for a long time were inept at mimicking human-specific tasks, such
as reading, translating, writing, video recognition and identifying objects.
However, with advances in computing power, machines cannot only do all this
but have now exceeded human capabilities at identifying patterns found in very
large data sets.

Another important advancement has been the adoption of self-improving
algorithms. Just as humans learn from previous experiences to make future
decisions, so too do self-improving algorithms.

Google’s machine learning backed search engine algorithm is a great example.

Let’s say you navigate to the Google search page and type in: “Who is Donald
Trump?”

Traditionally, Google would return a list of webpages featuring those four
keywords, and from there you would select the most relevant result.

But say you now want to know who Donald Trump’s wife is. You would then
search: “Donald Trump wife” or even “Who is Donald Trump’s wife?”

Prior to machine learning, the Google search engine would sift through its online
repository for webpages containing those exact keywords. But Google would not
be able to connect your first search query with the second search query.

Machine learning transforms the way we can now search. Given that Google
already knows our first search query regarding, “Donald Trump”, it can decipher
the second search query from less specific information provided.

For example, you could follow up by asking: “Who is his wife?”
And BANG Google will come back with results regarding Donald Trump’s wife
– all based on self-learning algorithms.

Google’s new line of thinking is very similar to human thinking, which is
why Google’s new technology falls under the banner of artificial intelligence
and machine learning.

Not only can machine learning think like us, but as mentioned, it’s also more
effective than us.

Humans are simply not predisposed to being as reliable and proficient at
repetitive tasks to the same standard as computers are in handling large sets of
data. In addition, the size, complexity, and speed at which big data can be
generated exceed our limited human capabilities.

Imagine the following data pattern:
1: [0, 0]
2: [3, 6]
3: [6, 12]
4: [9, 18]
5: [12, ? ]

As humans, it’s pretty easy for us to see the pattern here. As the second number
in each row is double the number to its left inside the brackets, we
can comfortably predict that the unknown number in the brackets on row five
will be ‘24’. In this scenario we hardly need the aid of a computer to predict the
unknown number.

However, what if each row was composed of much larger numbers, with decimal
points running into double digits and a far less clear relationship between
each value? This would make it extremely difficult, and near impossible, for
anyone to process and predict quickly. This mundane task, though, is not
daunting to a machine.

But how can we program a computer to calculate something we don’t even know
how to calculate ourselves?

This is an important feature of machine learning. If properly configured,
machine learning algorithms are capable of learning and recognising new
patterns within a matter of minutes.
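As a rough illustration (not a technique prescribed in this book), the short Python sketch below, assuming the scikit-learn package is installed, fits a linear model to the first four rows of the pattern shown earlier and asks it to predict the missing value in row five.

    from sklearn.linear_model import LinearRegression

    # The first number in each bracket is the input (X); the second is the output (y)
    X = [[0], [3], [6], [9]]
    y = [0, 6, 12, 18]

    model = LinearRegression()
    model.fit(X, y)                    # the model learns that y = 2 * x

    print(model.predict([[12]]))       # approximately [24.0]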

But machine learning naturally doesn’t just start by itself. As with any machine
or automated production line, there needs to be a human to program and
supervise the automated process. This is where data scientists and data
professionals come into the picture.

The role of data professionals is to configure the equipment (including servers,
operating systems and databases) and architecture (how the equipment interacts
with each other) as well as programming algorithms using various mathematical
operations.

You can think of programming a computer like training a Labrador to become a
guide dog. Through specialized training the dog is taught how to respond to
various scenarios. For example, the dog is trained to heel at a red light or to
safely lead its master around certain obstacles.

If the dog has been properly trained then the trainer is no longer needed and the
dog will be able to apply his/her training in unsupervised situations.

This example draws on a situational scenario, but what if you want to program a
computer to take on more complex tasks, such as image recognition?

How do you teach a computer to recognise the physical differences between
various animals? Again, it requires human input.

However, rather than programming the computer to respond to a fixed
possibility, such as navigating an obstacle on the path or responding to a red
light, the data scientist will need to approach this method differently.

The data scientist cannot program the computer to recognise animals based on a
human description (i.e. four legs, long tail and long neck), as this would induce a
high rate of failure. This is because there are numerous animals with similar
characteristics, such as wallabies and kangaroos.

Solving such complex tasks has long been the limitation of computers and
traditional computer science programming.

Instead, the data scientist needs to program the computer to identify animals
based on examples, the same way you would teach a child.

A young child cannot recognise a ‘goat’ accurately based on a description of its
key features. An animal with four legs, white fur and a short neck could be
confused with various other animals.

So rather than playing a guessing game with the child, it’s more effective to
show the child what a goat looks like by showing toy goats, images of goats or
even real-life goats in a paddock.

Image recognition in machine learning is done much the same way, except
teaching is managed via images and programming language.

For example, we can display to the computer various images which are labelled
with the subject matter, i.e. ‘goat’. Then, just as a child learns, the machine
will draw on these samples to identify the specific features of the subject.
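To make this concrete, here is a minimal Python sketch, assuming scikit-learn is installed; it uses the library’s bundled digits dataset as a stand-in for labelled animal photos. The classifier is shown many labelled examples and is then asked to identify images it has never seen before.

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    digits = load_digits()                       # small labelled images of the digits 0-9
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.3, random_state=0)

    classifier = SVC()
    classifier.fit(X_train, y_train)             # learn from the labelled examples
    print(classifier.score(X_test, y_test))      # accuracy on unseen images (typically ~0.98)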

At work I recently read an example of a Chinese company that had developed
machine learning algorithms to detect illicit video content and pornography
(which is officially banned in China but at the same time prevalent). Now, to
answer what you are probably thinking… Yes, the computer would have been
fed a high volume of pornographic material in order to develop those advanced
video recognition capabilities!

Whether it’s recognising animals, human faces or illicit adult material, the
machine can apply examples to write its own program and gain the capability
to recognise and identify subjects. This eliminates the need for humans to
explain in detail the characteristics of each subject and dramatically reduces the
chance of failure.

Once both the architecture and algorithms have been successfully configured,
machine learning can then take place. The computer can begin to implement
algorithms and models to classify, predict and cluster data in order to draw new
insights.





Data Management
Data Infrastructure
There are several important underlying technologies that provide the
infrastructure for advanced data analytics.

Data infrastructure is technology that allows data to be stored and processed.
Data infrastructure includes both traditional hardware and virtual resources.

Traditional hardware is equipment physically stored on premises in the form of
large computer servers. Virtual resources are provided in the form of cloud
computing from major cloud providers, including Amazon and Microsoft.

Similar to the way you consume and pay for your electricity, gas, water and
other traditional utilities, cloud computing offers you full control to consume
compute resources without actually owning the resource equipment. As a user
you can simply rent compute resources from a cloud provider in the form of
virtual machines, and pay for what you consume.

In government and virtually all sectors of business, traditional hardware is
rapidly being replaced by cloud infrastructure. Rather than procuring their own
private and physical hardware to house and process data, companies can pay for
access to advanced technology offered by cloud providers. This means that
companies only pay for what they need and use.

By using data infrastructure services available on the cloud, companies can
avoid the expensive upfront cost of provisioning traditional hardware as well as
the cost to maintain and later upgrade the equipment.

Data services, including storing, mining and analytics, are also available on the
cloud through vendors such as Amazon, IBM, Alibaba Cloud and Google.

The affordability of cloud technology has led to an increase in demand from
companies to conduct data analytics programs in order to solve business
problems. Meanwhile, this has led to greater demand for data scientists to
manage such programs.

Cloud technology also frees up data scientists to focus on data management and
data mining rather than configuring and maintaining the hardware. Updates and
data backups can be made automatically on the cloud.

Managing Data
Data analytics involves careful management of data.

That data could be measured in petabytes and be considered big data, or it
could just as easily be data stored in an Excel spreadsheet.

As a data scientist, or anyone interested in developing their expertise in data
mining or a sister discipline, it is imperative that you understand the key
categories and models of managing data.

Data management falls into three primary categories: storing, scrubbing and
analysis. It could also be argued that there are two additional categories,
collection and visualisation, which bookend these three. However,
this chapter will concentrate solely on the three key categories, as data collection
is most often automated these days and data visualisation is discussed in a later
chapter.

Data Storage
To ensure business information is easy and quick to access, data needs to be
stored in a centralized location. Data storage platforms come in many shapes and
sizes, including spreadsheets, relational database management systems, key
value stores, data warehouses and distributed computing clusters.

- Database management systems
A relational database management system (RDBMS) is software that stores
data items in organized and formally described tables that have a relationship to
each other.

RDBMS were invented in the 1970s to succeed traditional spreadsheets in
storing and retrieving data. Back then, it simply wasn’t practical to access data
from a single spreadsheet with up to a million rows on a black and white
terminal.

An RDBMS makes it easy to store data across connected tables. The data
stored in the tables is easy to retrieve, as each table (also called a relation)
contains data categories in columns. Each row contains a single instance of data
according to the category defined by the column.

A relational database for a small business, for example, may have customer data
divided into columns denoting name, mailing address, date of birth, phone
number, loyalty card number, etc.

Additional tables may then store information regarding orders, products,
suppliers, etc. The information in these tables interacts so that the database
manager can view or create a report based on customers that have purchased
certain products at a set time period.


In order to group and assign new data to the database tables, an RDBMS relies on
what is known as a schema.

A schema defines what your data looks like and where it can be placed within the
relational database. Relational databases therefore require considerable upfront
design to determine the schema, or the format and category of the data you are
collecting.

Most relational database management systems use Structured Query Language
(SQL) to access the data and perform commands to manipulate and view the
data across multiple tables.
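As a hedged sketch of these ideas (the table and column names below are invented for illustration), the following Python snippet uses the built-in sqlite3 module to define a simple schema, insert a few rows and run a SQL query that joins two related tables.

    import sqlite3

    conn = sqlite3.connect(":memory:")           # throwaway in-memory database
    cur = conn.cursor()

    # Schema: each table holds one category of data per column
    cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
    cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, product TEXT)")

    cur.execute("INSERT INTO customers VALUES (1, 'Ada', 'Melbourne')")
    cur.execute("INSERT INTO orders VALUES (1, 1, 'Handbag')")

    # SQL retrieves related data across the two tables in one command
    cur.execute("""SELECT customers.name, orders.product
                   FROM orders JOIN customers ON orders.customer_id = customers.id""")
    print(cur.fetchall())                        # [('Ada', 'Handbag')]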

Despite their long history, RDBMS are still widely used today, especially in
regards to data warehousing, also known as an Enterprise Data
Warehouse (EDW).

- Data Warehouse/EDW
An EDW is a relational database that focuses on storing data for the purpose of
future analysis. Whereas traditional databases typically store data that will be
processed in real time in order to access and record information, known as
online transaction processing (OLTP), an EDW is optimized for storing data and
then performing offline analytics and creating reports, also referred to as OLAP.


An easy to understand example of a traditional database/OLTP task would be
using SQL to retrieve information to process a purchase order. As an e-
commerce store, for example, you will need access to multiple tables in real time
in order to process the order, including the customer billing details, customer
mailing address and your inventory list. You don’t want to have to crank up the
servers and wait 30 minutes just to retrieve the data you need.

Instead, SQL is an easy way to access the information and perform online
transaction processing (OLTP).

The role of a data warehouse or OLAP (online analytical processing) is to
analyse tables and transactions after the purchasing has taken place. This may
mean analyzing transactions in relation to other transactions, i.e. what did other
customers buy? Or analysing the data for commonalities among types of
customers, i.e. where they live and what time they order.

Ultimately, the goal of data warehousing and OLAP is to store data that will be
processed later in order to produce new insight that helps decision makers
better understand their data and provides unique information to improve their
operations.

Also note that relational databases can be scripted to automatically upload data
to the data warehouse at regular intervals.


- Distributed computing clusters
One of the most popular distributed computer clusters to store data is
Hadoop. Hadoop operates on a distributed file system to store data on multiple
servers, also known as a Hadoop cluster. In addition, Hadoop provides the
infrastructure to later process your data across hundreds or even thousands of
servers by splitting tasks across the cluster.

- Key value store
A key value store is a simple and easy-to-use database that stores data as a key-
value pair. The key is marked by a string such as a filename, URI or hash, and
matches a value that can be any kind of data such as an image or document. The
value (data) is essentially stored as a blob, and therefore does not require a
schema definition.

This is a fast option for storing, retrieving and deleting data. However, as the value
is opaque to the database, you cannot filter or control what’s returned from a request.
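Conceptually, a key value store behaves much like a dictionary. The Python sketch below is only an analogy, not a real key value database, but it shows a key pointing at an opaque value (blob) with no schema attached.

    # Keys are simple strings; values are opaque blobs of any kind
    store = {}
    store["images/profile.jpg"] = b"\x89PNG..."      # an image blob (truncated placeholder)
    store["docs/readme.txt"] = "Any text at all"     # a document

    print(store["docs/readme.txt"])                  # retrieve by key
    del store["images/profile.jpg"]                  # delete by key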

Apache Cassandra is a free, open-source distributed database management
system. It is similar to Hadoop but fits into the category of a key value store, as it
stores data NoSQL-style, without a schema or the use of SQL.

Cassandra is known as one of the most highly available (reliable) and fault-tolerant
database systems available. It is therefore suited to handling massive amounts of
data, such as indexing every web page on the Internet, or serving as a backup
system to replace tape backups.

- Data Migration
A common task and challenge for data scientists is the migration of data from
one storage platform to another. This may entail migrating data from a legacy
database, spread sheet or data warehouse to a distributed computing based
storage platform such as Hadoop or a cloud storage platform.

Migrating data from data warehouses to Hadoop and cloud platforms is
becoming more common given the cost savings of storing data on a distributed
computing network.

The need for data migration is especially common when one business or website
is acquired by another, and the new owner wishes to merge data from both
entities into one storage repository.

ETL (extract, transform and load) is the process used to migrate data
from one storage platform to another. Under ETL, you first extract the data
in a standard format compatible with the future home of the data, which may
involve storage platforms with different schemas. The next task is to
scrub and transform the data so that it all fits. Lastly, the transformed data is
loaded onto the new storage platform.
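A toy version of ETL in Python might look like the sketch below; the file names and fields are invented. Data is extracted from a legacy CSV export, transformed into a consistent format, and loaded into a new SQLite table.

    import csv
    import sqlite3

    # Extract: read rows from a legacy CSV export (hypothetical file)
    with open("legacy_customers.csv", newline="") as f:
        rows = list(csv.DictReader(f))

    # Transform: scrub the data so it fits the new schema
    cleaned = [(r["name"].strip().title(), r["email"].lower()) for r in rows]

    # Load: write the transformed rows into the new storage platform
    conn = sqlite3.connect("warehouse.db")
    conn.execute("CREATE TABLE IF NOT EXISTS customers (name TEXT, email TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", cleaned)
    conn.commit()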


Data Scrubbing
After storing your data the next process is refining your data to make it easier to
work with, known as scrubbing. Data scrubbing can entail modifying or
removing incomplete, incorrectly formatted or repeated data within the dataset.
The overall goal of data scrubbing is to make the dataset more accurate and
convenient to process.

For data scientists, data scrubbing usually demands the most time
and effort. Data scrubbing can entail removing data (including anomalies and
outliers), data dimension reduction, classification and clustering.

Data scientists use a wide range of tools, including text editors, scripting
tools, and programming languages such as Python to scrub the data.

Hadoop can also be used for data scrubbing. As mentioned, Hadoop is used for
storage, but it also supports data scrubbing through MapReduce. Via MapReduce,
data is divided into smaller batches and each batch is subsequently processed across
the Hadoop cluster. Apache Spark is another alternative that can process data in
real time.
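The MapReduce idea itself can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Hadoop code: the data is split into batches, each batch is ‘mapped’ independently, and the partial results are then ‘reduced’ into one answer.

    from collections import Counter

    posts = ["we love big data", "I love data", "data data data"]

    # Map: count words in each batch (here, each post) independently
    partial_counts = [Counter(post.split()) for post in posts]

    # Reduce: merge the partial results into a single word count
    total = Counter()
    for counts in partial_counts:
        total.update(counts)

    print(total.most_common(2))      # [('data', 5), ('love', 2)]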

An example of scrubbing could be taking data collected from Facebook and
stored on a Hadoop cluster, and then using data techniques to categorize
Facebook posts. Various algorithms could be used to classify, cluster or group
Facebook posts into a variety of more manageable groups such as:
- Positive/negative
- Use of collective language (we, us) or individual language (I, me)
- Location of post

Data scrubbing could also entail reducing the dimensionality of your data. This
is a process to identify the combinations of variables (fields) that seem most
important and relevant to your hypothesis. This may mean merging two
variables into one. For example, ‘hours’ and ‘seconds’ could be merged and
denoted as “day of the week.”

This helps to simplify the data set and reduce data noise or distractions.

Another option is to delete the data altogether. However, it should be noted that
there is a counter-argument to permanently removing data. This argument
notes that as a data scientist, or a data science team, you will never truly
know what questions you may wish to ask of the data now and in the future.

Secondly, as it’s becoming increasingly more cost-effective to store massive
amounts of data, it could be safer to hold on to the data under a data retention
policy rather than throw it away.

This is a decision you as a data scientist or a decision maker will need to make
based on your company’s size, industry, IT resources, data storage budget
and data analysis goals. However, it’s still perfectly possible to hold on to all
your data and conduct data reduction – simply keep a backup copy of all
your data on record!


Data Analysis
Data analysis calls on statistical packages to deeply analyze the scrubbed
data. Popular package tools include R, SPSS and Python's data libraries. Many
of these tools allow you to visualize the data in the form of charts and graphs. To
a certain extent, data visualisation overlaps with data analysis, depending on
your package tool.
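As a hedged example of what this looks like in practice (the figures below are invented), the snippet uses Python’s pandas and matplotlib libraries, assuming both are installed, to summarise a small table and draw a simple chart.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical daily sales data
    sales = pd.DataFrame({
        "day": ["Mon", "Tue", "Wed", "Thu", "Fri"],
        "coffees_sold": [120, 135, 128, 160, 210],
    })

    print(sales["coffees_sold"].describe())      # basic statistics for the column

    sales.plot(x="day", y="coffees_sold", kind="bar", legend=False)
    plt.ylabel("Coffees sold")
    plt.show()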

More about data analysis techniques will be covered in a following section.

Web Scraping
A great way to get started with data analytics and collecting data is web
scraping. Web scraping is a computer software technique to extract information
from websites.

Expressed differently, web scraping goes out and automatically collects
information from the web for you. This saves you the hassle of manually
clicking through webpages, and copying and pasting that information into a
spreadsheet.

Web scraping can be extremely helpful for sales and marketing, and is an
efficient way to get large amounts of data in a very short period of time.

Web scraping is used widely and search engines are the best example. As part of
their search service, Google uses web scraping to crawl and index nearly every
website on the web. Other websites use web scraping to aggregate services and
prices, including hotels, flights and other booking sites.

The goal of scraping is typically to work backwards from your data needs. For
example, you may need a list of Twitter key opinion leaders, or you may need to
index the sales of products on e-commerce platforms. You therefore choose your
target source to scrape based on your data needs.

It is possible to scrape data from multiple different sources (websites), but
obviously it’s easier to concentrate on one source to start with.


While scraping may sound technically difficult, there are tools available that
make it a simple click and drag process for non-technical people. One such tool
is Import.io.

Import.io is a browser tool you can download to scrape websites. The tool was
previously free to download but now has a steep pricing system in place.
However, if you are a student, teacher, journalist, charity or startup then you
may request free access. Otherwise, there is also a free trial option.


You can use Import.io in three simple stages to scrape and collect information
from the web.

1) Download: Your web browser downloads information from a web server to
load and display to you. Import.io will crawl the web server and then retrieve
information from the webpage/s through parsing.

2) Parse: Via parsing, the scraper will selectively retrieve useful segments of
information based on defined guidelines. For instance, the tool can be configured to
scrape Twitter posts and comments but avoid information that you don’t want to
scrape, such as menu items and the website footer. Defining what you wish to
extract is a simple matter of clicking and dragging and following the prompts within
the toolbox.

3) Store: Finally, the tool takes this information and stores it for you online. You
can then decide whether you wish to export the data to a cloud storage platform,
JSON, or a simple CSV file saved on your Desktop.

For more technical web scraping Python is the go-to programming language in
the industry. Basic Python crawlers can be written in less than 40 lines of code.
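For the curious, here is a hedged sketch of such a crawler, assuming the third-party requests and beautifulsoup4 packages are installed and using a placeholder URL. A real scraper should also respect the target site’s terms of use and robots.txt.

    import requests
    from bs4 import BeautifulSoup

    url = "https://example.com"                  # placeholder target page
    html = requests.get(url, timeout=10).text    # download the raw HTML

    soup = BeautifulSoup(html, "html.parser")    # parse the page
    for link in soup.find_all("a"):              # extract every hyperlink
        print(link.get_text(strip=True), link.get("href"))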

Downloading Datasets
If manually downloading datasets is not of interest to you, and you want to first
concentrate on data analytics techniques, then look no further than Kaggle.

Kaggle is an online community – now owned by Google – for data scientists
and statisticians to access free datasets, join competitions, and simply hang out
and talk about data.

Below are five free sample datasets you might want to look into downloading
from the site.

Starbucks Locations Worldwide
Want to figure out which country has the highest density of Starbucks stores, or
which Starbucks store is the most isolated from any other? This dataset is for
you.
Scraped from the Starbucks store location webpage, this dataset includes the
name and location of every Starbucks store in operation as of February 2017.

European Football Database
Sometimes not a lot of action happens in 90 minutes, but with 25,000+ matches
and 10,000+ players across 11 leading European country championships from the
2008 to 2016 seasons, this is the dataset for football diehards.

The dataset even includes team line-ups with squad formations represented in X,Y
coordinates, betting odds from 10 providers, and detailed match events including
goals, possession, cards and corners.

Craft Beers Dataset
Do you like craft beer? This dataset contains a list of 2,410 American craft beers
and 510 breweries, collected in January 2017 from CraftCans.com. Drinking and
data crunching is also perfectly legal.

New York Stock Exchange
Interested in fundamental and technical analysis? With up to 30% of traffic on
stocks said to be machine generated, how far can we take this number based on
lessons learnt from historical data?

This dataset includes prices, fundamentals and securities retrieved from Yahoo
Finance, Nasdaq Financials, and EDGAR SEC databases. From this dataset you
can explore what impacts return on investment and what indicates future
bankruptcy.

Brazil's House of Deputies Reimbursements
As politicians in Brazil are entitled to receive refunds from money spent on
activities to "better serve the people," there’s a lot of interesting data and
suspicious outliers to be found from this dataset.

Data on these expenses are publicly available, but there is very little monitoring
of expenses in Brazil. So don’t be surprised to see one public servant racking up
over 800 flights in one year, and another who recorded R$140,000 (USD
$44,500) in mailing post expenses.

The following section will examine data analytics techniques applied to both
data mining and machine learning.
Data Mining & Machine Learning

Techniques
Regression
Regression, and linear regression specifically, is the “Hello World” equivalent
of data analytics. Just as programmers start with “Hello World” as the first line
of code they learn to write, prospective data scientists typically start with linear
regression.

Regression is a statistical measure that takes a group of random variables and
seeks to determine a mathematical relationship between them. Expressed
differently, regression uses various variables to predict an outcome or
score.

Regression is used in a range of disciplines including data mining, finance,
business and investing. In investment and finance, regression is used to value
assets and understand the relationship between variables such as exchange rates
and commodity prices.

In business, regression can help to predict sales for a company based on a range
of variables such as weather temperatures, social media mentions, previous
sales, GDP growth and inbound tourists.

Specifically, regression is applied to determine the strength of a relationship
between one dependent variable (typically represented as Y) and other changing
variables (known also as independent variables).

A simple and practical way to understand regression is to imagine a scatter plot
of two quantitative variables: house cost and square footage. House value is
measured on the vertical axis (Y), and square footage is expressed along the
horizontal axis (X). Each dot (data point) represents one paired measurement of
both ‘square footage’ and ‘house cost’. In this example, there are numerous data
points representing houses within a particular suburb.

To apply regression to this example, we simply draw a straight line through the
data points that deviates from them as little as possible.

But how do we know where to draw the straight line? There are many ways we
could split the data points with a regression line, but the goal is to draw a
straight line that best fits all the points on the graph, with the minimum possible
distance from each point to the regression line.

This means that if you were to draw a vertical line from the regression line to
every data point on the graph, the combined distance of the points would be the
smallest possible of any potential regression line.

As you can see also, the regression line is straight. This is a case of linear
regression. If the line were not straight, it would be known as non-linear
regression, but we will get to that in a moment.

Another important feature of regression is slope. The slope can be read directly
from the regression line: as one variable (X or Y) increases, you can expect the
other variable to increase by the average amount indicated by the regression line.
The slope is therefore very useful for forming predictions.

The closer the data points are to the regression line, the more accurate your
prediction will be. The greater the deviation in distance between the data points
and your regression line, the less accurate your slope will be in its predictive
ability.
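As a hedged illustration of fitting such a line in Python (the house figures below are invented), numpy’s polyfit returns the slope and intercept of the least-squares regression line, which can then be used to predict the value of a house from its square footage.

    import numpy as np

    square_footage = np.array([800, 1000, 1200, 1500, 1800])            # X
    house_value = np.array([240000, 290000, 350000, 420000, 500000])    # Y

    slope, intercept = np.polyfit(square_footage, house_value, deg=1)
    print(round(slope, 1), round(intercept, 1))

    predicted = slope * 1300 + intercept        # predicted value of a 1,300 sq ft house
    print(round(predicted))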

Do note that this particular example applies to data with a broadly linear trend,
where the data points generally ascend from left to right. The same linear
regression approach does not apply to all data scenarios. In other cases you will
need to use other regression techniques – beyond just linear.

There are various types of regression, including linear regression (as
demonstrated), multiple linear regression and non-linear regression methods,
which are more complicated.

Linear Regression
Linear regression uses one independent variable to predict the outcome of the
dependent variable (represented as Y).

Multiple Regression
Multiple regression uses two or more independent variables to predict the
outcome of the dependent variable (represented as Y).

Non-linear Regression
Non-linear regression modelling is similar in that it seeks to track a particular
response from a set of variables on the graph. However, non-linear models are
somewhat more complicated to develop.

Non-linear models are created through a series of approximations (iterations),
typically based on a system of trial-and-error. The Gauss-Newton method and
the Levenberg-Marquardt method are popular non-linear regression modelling
techniques.

Linear Regression on Google Sheets
An easy way to get started with Linear Regression is on Microsoft Excel or
Google Sheets. Below are instructions to create a linear regression line on
Google Sheets.

1. Open a spreadsheet in Google Sheets.
2. Enter your data into two columns (x and y).
3. Select a scatter plot chart. In the top right corner, click the Down arrow.
4. Click ‘Advanced edit’.
5. Click Customization and scroll down to the “Trendline” section. If for some
reason you don’t see the trendline option, it means that your data probably
doesn’t have X and Y coordinates and a trendline cannot be added.
6. Click the menu next to “Trendline.”
7. Select “Linear”
8. Click Update.
Done!


Data Reduction
While it may sound somewhat counter-intuitive, one of the core processes of
data mining and machine learning is data reduction – part of the data scrubbing
process as mentioned in the previous chapter.

You would think the more data the better, right? With more data comes more
potential insight to draw from. But not all data is important. Too much data
creates noise and distraction.

Think of it as having way too many files saved on the Desktop of your computer.
Having so many photos, word documents, videos and other files is not
necessarily a bad thing but it does make it hard to find what you’re looking for.

You can reduce data noise and simplify the data set through what is known as
dimensionality reduction. This is a process to identify the combinations of
variables (fields) that seem most important and relevant to your hypothesis.

The second reason why it could be important to conduct data reduction is that you
may have limited machine performance with which to manage the data. Just as
having too many browser tabs open on your computer affects its speed, having
too much data slows down your data mining process.

Available storage on the hard drive of your machine could be another limitation,
as well as memory constraints in the form of RAM.

You can overcome computer performance limitations by linking to cloud
services offered by Amazon, Microsoft, Alibaba Cloud and others, but this will cost
money to access their servers. (Most cloud providers offer a 1-12 month free
trial period.)

One approach to data reduction is applying a descending
dimension algorithm that effectively reduces data from high-dimensional to
low-dimensional.

Dimensions are the number of features characterizing the data. For instance,
hotel room data may have four features: room length, room width, number of rooms
and floor level (view).

Given the existence of four features, hotel room data would be expressed on a
four dimensional (4D) data graph. However, there is an opportunity to remove
redundant information and reduce the number of dimensions to three by
combining ‘room length’ and ‘room width’ to be expressed as ‘room area.’

Applying a descending dimension algorithm will thereby enable you to compress
the 4D data graph into a 3D data graph.
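One widely used descending dimension technique is principal component analysis (PCA). The book does not name a specific algorithm here, so treat the scikit-learn sketch below, with invented hotel room features, as an illustrative assumption rather than the method being described.

    from sklearn.decomposition import PCA

    # Invented hotel rooms: [room length, room width, number of rooms, floor level]
    rooms = [[5.0, 4.0, 1, 2],
             [6.0, 5.0, 2, 10],
             [4.5, 3.5, 1, 1],
             [8.0, 6.0, 3, 15]]

    pca = PCA(n_components=3)                     # compress four features down to three
    reduced = pca.fit_transform(rooms)

    print(reduced.shape)                          # (4, 3): same rooms, fewer dimensions
    print(pca.explained_variance_ratio_.round(2)) # how much information each new dimension keeps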

Another advantage of this algorithm is visualization and convenience.
Understandably, it’s much easier to work with and communicate information on a
2D plane than on a 4D data graph.

After reducing the data you will be able to focus on the patterns and
regularities. You can do this by zooming out of the data on a graphical interface.

Having too many dimensions or data points would otherwise make it harder for
you to spot these patterns and regularities.
Classification
Classification is a process to place new cases into the correct group. Think of it
like collecting stamps and then placing them into categories.

Key to classification is that the categories already exist and have been pre-
determined – which is very different from ‘clustering’, as we touch upon next.

A commonly used example of classification is email spam detection. Your email
client applies a classification algorithm to determine whether incoming email
should be classified under the two existing categories of ‘spam’ or ‘non-spam’.

Another example of classification could be sorting and allocating e-commerce
deliveries into zip codes at a central post depot.

From these two examples you can see that classification is a simple way to find
patterns within a dataset with known variables.
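A toy spam classifier in Python could look like the hedged sketch below, assuming scikit-learn is installed; the example messages and labels are invented. Note that the categories (‘spam’, ‘non-spam’) exist before training, which is what makes this classification rather than clustering.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    messages = ["win a free prize now", "cheap pills free offer",
                "meeting at 10am tomorrow", "please review the attached report"]
    labels = ["spam", "spam", "non-spam", "non-spam"]    # pre-determined categories

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(messages)               # turn text into word counts

    classifier = MultinomialNB()
    classifier.fit(X, labels)

    new_email = vectorizer.transform(["free prize offer inside"])
    print(classifier.predict(new_email))                 # likely ['spam']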

One of the challenges with classification is that it can be difficult to accurately
apply the classification system unless you know the existing variables.
Clustering though helps to solve this problem.
Clustering
Clustering is another key data principle to group similar data objects into a class,
and differs from classification.

Unlike classification, which starts with predefined labels reflected in the
database table, clustering creates its own labels after clustering the data set.
Analysis by clustering can be used in various scenarios such as pattern
recognition, image processing and market research.

For example, clustering can be applied to uncover customers that share similar
purchasing behaviour. By understanding a particular cluster of customer
purchasing preferences, you can decide which products to recommend to groups
based on their commonalities. You can do this by offering them the same
promotions via email or ad banners on your website.

The Netflix example I brought up earlier was a case of identifying a cluster of
viewers that both enjoyed watching the British version of House of Cards and
films featuring Kevin Spacey, and/or films directed by David Fincher.

There would have been very little way of seeing this relationship prior to the
clustering process, as there was no set data field classifying fans of Kevin Spacey
who also liked watching the British version of House of Cards. But by clustering
and isolating the analysis into a group of data points on a scatter map, we are able
to identify new and valuable relationships.

Clustering can also be complemented by classification. As mentioned, clustering
creates new classes or buckets to group data based on the application of
algorithms. From the process of clustering, new classification categories can be
created.

Clustering Data Algorithms
Various algorithms can be used to identify clusters; a brief k-means sketch follows this list. These include:
1) Measuring the distance between data points, known as Euclidean distance.
2) Measuring the distance from a centroid, or mean value, to the surrounding data points.
3) Measuring the density of the data within a space and drawing a border around those data points.
4) Distributional models, which draw a normal shape, such as an ellipse, around data points.
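The second approach, measuring distance from a centroid, is what the k-means algorithm does. Below is a hedged sketch using scikit-learn with invented two-dimensional data points; the algorithm assigns each point to one of two clusters and reports the centroids it found.

    from sklearn.cluster import KMeans

    # Invented (x, y) data points forming two loose groups
    points = [[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 10], [10, 9]]

    kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
    print(kmeans.labels_)             # the cluster label assigned to each point
    print(kmeans.cluster_centers_)    # the centroid (mean value) of each cluster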
Anomaly Detection
Anomaly detection is different in that you are now seeking out data points
that are different – different in regards to their location on the data plot,
and because they don't naturally fit into a cluster.

It’s important to first differentiate between anomalies and outliers. An anomaly
is an event which should not have happened and is usually seen as a problem.
For instance, you detect that the traffic lights at one train crossing on a
metropolitan network are not working and need to be fixed.

Outliers are closely linked but represent a slightly larger grouping than
anomalies. Outliers as you can imagine are small groups of data points that
diverge from the main clusters because they record unusual scores on at least
one variable.

While it will depend on the total size of the data set, some data scientists deem
categories with less than 10% of cases as an outlier category. Outliers can
therefore distort your conclusions – even if only caused by a very small number
of cases.

There are several options available to mitigate this challenge. If there is a small
number of outliers and deleting them would not have any substantive effect on
other analyses, then the best option is to go ahead and delete them.
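One common and simple way to do this (an illustrative assumption, not a method prescribed by the book) is to flag points that sit a few standard deviations from the mean and drop them, as in the numpy sketch below with invented values.

    import numpy as np

    values = np.array([48, 50, 51, 49, 52, 47, 50, 300])    # 300 looks like an outlier

    z_scores = (values - values.mean()) / values.std()
    cleaned = values[np.abs(z_scores) < 2]                   # keep points within 2 standard deviations

    print(cleaned)                                           # the extreme value has been removed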

But what then of the anomalies? You first want to detect anomalies. Then you
need to make a decision. You can decide to exclude them in order to focus on
other data points and clusters. But in other cases, you may want to study why
they are different.

Anomalies, for example, are commonly used in the domain of fraud detection to
identify illegal activities.

Human guinea pig and life hacker Tim Ferriss is another close observer of
anomalies. Rather than studying athletes, chefs, chess players, linguists and other
successful types who were destined to be successful by genetic makeup, family
background or upbringing, he studies the anomalies. He looks for examples that
defy the odds and then decodes and breaks down what they did to become world
class.


Text Mining
Text mining is one of the most important and popular methods of data mining for
managing unstructured data. Unstructured data is data not stored in the numerical
rows and columns found in a spreadsheet. In the case of text mining, we are
looking at unstructured data in the form of passages of text.

Text mining is commonly used to analyze social media posts but can be applied
to various other scenarios as well. An example of text mining could be
measuring approval or disapproval amongst users on Twitter by reviewing the
text of public tweets with the hashtag #Bigdata over a set period of time.

Clustering is often applied in combination with text mining. After mining text
data, the data science team then groups the results. Clustering allows the owner
of the data to make certain decisions on how to manage these found groups.

For example, brands may be able to use the data to identify ‘true-fans’ on social
media who denounce ‘haters’ and Internet trolls criticizing the brand. With this
information, brands could then reach out to these social media users to negotiate
terms as a Key Opinion Leader (KOL) or brand ambassador.

The process of text mining entails two primary algorithm categories. The first
category of algorithms identifies the nuances of the text language in the form of
verbs, adjectives, proper nouns, and adverbs. It is also able to identify
positive/negative sentiments.

The second algorithm category treats words simply as individual items. Rather
than analyze the function and context of the word to understand its meaning, the
algorithm treats the word as an individual object and analyzes how often the
word is mentioned and how frequently it appears next to other words.

In addition, the algorithm tabulates words into numbers.
One way to remember this is to think of the game of Scrabble. In Scrabble, each
letter you pull out of the bag has a number on it, except in text retrieval we are
looking at words, not letters.

Popular algorithms for analysing individual words include:

Naive Bayes
Naive Bayes treats all variables as conditionally independent given a certain
outcome, e.g. whether the word is an adjective.

K-means Clustering
A clustering technique used to uncover categories, for example words that
frequently appear next to each other.

Support Vector Machines
The objective of support vector machines is to categorize data into two classes. It
does so by drawing a straight or squiggly line between the data points of both
categories. In other words, it cuts down the middle of the two categories.

Term Frequency Inverse Document Frequency (TF-IDF) vectorization
TF-IDF weighs how frequently a word occurs in a document against how common
the word is across all documents.

Binary Presence
Binary presence simply analyses whether a word is present in a document or
not. Yes/No.
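To show what vectorization produces, here is a hedged sketch using scikit-learn (the example tweets are invented). TfidfVectorizer weights each word by its frequency, while setting binary=True on CountVectorizer gives the simple yes/no presence described above.

    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

    tweets = ["big data is big", "machine learning loves data", "data data everywhere"]

    tfidf = TfidfVectorizer()
    weights = tfidf.fit_transform(tweets)        # one TF-IDF weight per word per tweet
    print(sorted(tfidf.vocabulary_))             # the words that became columns
    print(weights.toarray().round(2))

    binary = CountVectorizer(binary=True)        # binary presence: 1 if the word appears at all
    print(binary.fit_transform(tweets).toarray())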
Association Analysis
Association analysis is a method to identify items that have an affinity for each
other, and fits under the statistical field of correlation.

Association analysis algorithms are commonly used by e-commerce companies
and off-line retailers to analyze transactional data and identify commonly
purchased together items. This insight allows e-commerce sites and retailers to
strategically showcase and recommend products to customers based on common
purchase combinations.

Association analysis is a relatively straightforward data mining concept to grasp.

Suppose your lemonade stand sells five different products. These products are
A, B, C, D and E. Over the course of the day you have multiple buyers stop by
your stand to purchase products. Your first customer purchases A and C. The
next customer buys C, D and E. Eight more customers arrive and purchase
various other combinations of products.

Based on this data you now want to predict what your next customer will
purchase.

The first step in association analysis is to construct frequent itemsets (X).
Frequent itemsets are a combination of items that regularly appear together, or
have an affinity for each other. The combination could be one item with another
single item. Alternatively, the combination could be two or more items with one
or more other items.

From here you can calculate an index number called support (SUPP) that
indicates how often these items appear together.

Please note that in practice, “support” and “itemset” are commonly expressed as
“SUPP” and “X”.

Support can be calculated by dividing X by T, where X is how often the itemset
appears in the data and T is your total number of transactions. For example, if E
only features once in five transactions, then the support will be only 1 / 5 = 0.2.

However, in order to save time and to allow you to focus on items with higher
support, you can set a minimum level known as minimum support, or minsup.
Applying minsup allows you to ignore low-support cases.

The other step in association analysis is rule generation. Rule generation
produces a collection of if/then statements, for each of which you calculate a
metric known as confidence. Confidence is essentially a conditional probability:
the confidence of a rule X > Y is the support of X and Y together divided by the
support of X alone.

For example: Onions + Bread Buns > Hamburger Meat

Numerous models can be applied to conduct association analysis. Below is a list
of the most common algorithms:
- Apriori
- Eclat (Equivalence Class Transformations)
- FP-growth (Frequent Pattern)
- RElim (Recursive Elimination)
- SaM (Split and Merge)
- JIM (Jaccard Itemset Mining)

The most common algorithm is Apriori. Apriori calculates support for itemsets
one item at a time. It first finds the support of a single item (how common that
item is in the dataset) and determines whether there is sufficient support for
that item. If the support happens to be less than the designated minimum support
amount (minsup) that you have set, the item will be ignored.

Apriori will then move on to the next item and evaluate the minsup value and
determine whether it should hold on to the item or ignore it and move on.

After the algorithm has completed all single-item evaluations, it will transition to
processing two-item itemsets. The same minsup criterion is applied to gather the
itemsets that meet the minsup value. As you can probably guess, it then proceeds
to analyze three-item combinations, and so on.
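
As a rough illustration of this level-by-level counting (not a full Apriori implementation), here is a minimal Python sketch based on the lemonade stand example; the ten transactions below are invented to match it.

# A minimal sketch of level-wise support counting in the spirit of Apriori.
# The ten transactions are invented to fit the lemonade stand example.
from itertools import combinations

transactions = [
    {"A", "C"}, {"C", "D", "E"}, {"A", "B", "C"}, {"B", "E"}, {"A", "C", "E"},
    {"C", "E"}, {"A", "D"}, {"B", "C", "E"}, {"A", "C", "D"}, {"C", "E"},
]
minsup = 0.3  # ignore itemsets appearing in fewer than 30% of transactions

def support(itemset):
    # SUPP(X) = transactions containing X / total transactions (T)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Round 1: keep the single items that meet minsup
singles = [frozenset([i]) for i in "ABCDE" if support(frozenset([i])) >= minsup]

# Round 2: build two-item itemsets only from the surviving single items
pairs = [frozenset(p) for p in combinations(sorted(set().union(*singles)), 2)]
frequent_pairs = [p for p in pairs if support(p) >= minsup]

for itemset in frequent_pairs:
    print(sorted(itemset), round(support(itemset), 2))

# Confidence of the rule A > C: support of A and C together / support of A
print(round(support(frozenset("AC")) / support(frozenset("A")), 2))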

The downside of the Apriori method is that it can be slow and demanding on
computational resources, with the time and resources required growing rapidly
at each round of analysis. This approach can thus be inefficient for processing
large data sets.

The most popular alternative is Eclat. Eclat again calculates support for a
single-item itemset, but should the minsup value be reached, it proceeds directly
to adding an additional item (forming a two-item itemset).

This differs from Apriori, which would move on to the next single item and
process all single items first. Eclat, on the other hand, seeks to add as many
items to the original single item as it can, until it fails to reach the set
minsup.

This approach is faster and less intensive in terms of computation and
memory, but the itemsets produced are long and difficult to manipulate.

As a data scientist you thus need to decide which algorithm to apply, factoring
in the trade-offs of each algorithm based on your available computing resources,
the amount of data and your time schedule.

Sequence Mining
Sequence mining is a process for identifying recurring sequences in a dataset.
Sequencing can be applied to various scenarios, including instructing a person
what to do next or predicting what is going to happen next.

Sequence mining is similar to association analysis in regards to the prediction
that if x occurs then z and y are also likely to occur. The big difference in
sequence mining is that the order of events matters. In association analysis it’s
not important if the combination is ‘x, y, and z’, or ‘z, y, and x’ but in sequence
mining it is.

Sequence mining algorithms
A number of models can be applied to conduct sequence mining.
- GSP (Generalized Sequential Patterns)
- SPADE (Sequential Pattern Discovery using Equivalence classes)
- FreeSpan
- HMM (Hidden Markov Model)

A common method for sequence mining is GSP. GSP is similar to Apriori, as
discussed in the previous section. But unlike Apriori, GSP adheres to the order
of events, which could be, say, ordinal or temporal.

Temporal refers to ordering in time, while ordinal refers to the logical
progression of categories, e.g. “elementary school > middle school > senior
school,” or “single > engaged > married.”

Unlike Apriori, GSP will not treat “X, Y and then Z” and “X, Z and then Y” as
the same thing. But, like Apriori, GSP must make many passes through the data
to reach its findings and can therefore be a slow and computationally draining
procedure.

For larger data sets, SPADE (Sequential Pattern Discovery using Equivalence
classes) is recommended. SPADE performs fewer database scans by using
intersecting ID-lists. It first builds a vertical ID-list of the database and
stretches the data out by row into a two-dimensional matrix of the first and
second items. Based on those results, it can then add a third, fourth, fifth and
so on item to analyze.
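
To see why the order of events matters, here is a minimal Python sketch of my own (it is not GSP or SPADE themselves) that counts how many customer event sequences contain a candidate sequence in the correct order; the event data is invented.

# A minimal sketch showing that sequence support respects the order of events.
# The event sequences below are invented for illustration.
def contains_in_order(sequence, candidate):
    # True if the items of 'candidate' appear in 'sequence' in the same
    # order, though not necessarily next to each other.
    position = 0
    for event in sequence:
        if event == candidate[position]:
            position += 1
            if position == len(candidate):
                return True
    return False

sequences = [
    ["single", "engaged", "married"],
    ["single", "married"],
    ["engaged", "single", "married"],
]

# The same two items in different orders receive very different support.
for candidate in (["single", "married"], ["married", "single"]):
    matches = sum(contains_in_order(s, candidate) for s in sequences)
    print(candidate, round(matches / len(sequences), 2))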










Machine Learning Techniques




Introduction
Machine learning algorithms can be split into different classes of algorithms,
including supervised and unsupervised.

Supervised
Supervised algorithms refer to learning guided by human observations and
feedback with known outcomes.

For instance, suppose you want the machine to separate email into spam and
non-spam messages.

In a supervised learning environment, you already have information that you can
feed the machine to describe what type of email should belong to which
category. The machine therefore knows that there are two labels available in
which to sort the incoming data (emails).

Or to predict who will win a basketball game, you could create a model to
analyze games over the last three years. The games could be analyzed by total
number of points scored and total number of points scored against in order to
predict who will win the next game.

This data could then be applied to a classification model. Once the data has
been classified and plotted, we can then apply regression to predict who will
win based on the average of previous performances. The final result would then
supply an answer based on overall points.

As with the first example, we have instructed the machine which categories to
analyze (points for, and points against). The data is therefore already pre-
tagged.

Supervised algorithms, which learn from tagged data, include Linear Regression,
Logistic Regression, Neural Networks, and Support Vector Machine algorithms.
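
As a minimal sketch of supervised learning in code (scikit-learn, the four example emails and their labels are all assumptions made for illustration):

# A minimal supervised learning sketch: classify emails as spam or not spam.
# The emails and labels are invented; scikit-learn is assumed to be installed.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",          # spam
    "claim your free money today",   # spam
    "meeting agenda for Monday",     # not spam
    "lunch with the team tomorrow",  # not spam
]
labels = ["spam", "spam", "not spam", "not spam"]

# The labels are the human-supplied feedback that guides the learning.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

new_email = vectorizer.transform(["you won a free prize"])
print(model.predict(new_email))
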
Unsupervised
In the case of an unsupervised learning environment, there is no such integrated
feedback or use of tags. Instead, the machine learning algorithm must rely
exclusively on clustering the data into groups and modifying its approach in
response to its initial findings - all without external feedback from humans.

Clustering algorithms are a popular example of unsupervised
learning. Clustering groups together data points which are discovered to possess
similar features.

For example, if you cluster data points based on the weight and height of 13-year
old high school students, you are likely to find that two clusters emerge from
the data. One large cluster will be male and the other large cluster will be
female, because girls and boys tend to differ in their physical measurements.

The advantage of applying unsupervised algorithms is that they enable you to
discover patterns within the data that you may not have known existed, such as
the presence of two different sexes.

Clustering can then provide the springboard to conduct further analysis after
particular groups have been discovered.

Unsupervised algorithms, which work without tags, include clustering algorithms
and descending dimension (dimensionality reduction) algorithms.
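
A minimal sketch of clustering in code (scikit-learn and the height and weight measurements below are assumptions for illustration; no labels are supplied to the algorithm):

# A minimal unsupervised learning sketch: k-means clustering of invented
# height (cm) and weight (kg) measurements. No labels are provided.
from sklearn.cluster import KMeans

measurements = [
    [150, 42], [152, 44], [149, 41], [151, 43],  # one loose grouping
    [162, 55], [165, 58], [160, 54], [163, 57],  # another loose grouping
]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(measurements)

print(cluster_ids)              # which cluster each student falls into
print(kmeans.cluster_centers_)  # the centre of each discovered cluster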

Support Vector Machine Algorithms
Support vector machine (SVM) algorithms are an advanced progression from
logistic regression algorithms. SVM algorithms are essentially logistic
regression algorithms with stricter set conditions. To that end, SVM algorithms
are better at drawing classification boundary lines.

Let’s see what this looks like in practice. Picture a plane of data points that
are linearly separable. A logistic regression algorithm, as we know, will split the
two groups of data points with a straight line that minimizes the distance
between all points. In the figure, Line A (the logistic regression hyperplane) is
positioned snugly between points from both groups.

As the figure also shows, Line B (the SVM hyperplane) likewise separates the two
groups, but from a position with maximum space between itself and the two
groups of data points.

You will also notice in the figure a light blue area that denotes the margin.
The margin is the distance between the hyperplane and the nearest point,
multiplied by two. An SVM hyperplane should be located in the middle of the
margin.

If, however, the data is not linearly separable, then it is possible to apply what
is known as the kernel trick. When combined with SVM, the kernel trick can map
data from a low-dimensional space to a higher-dimensional space.

Transitioning from a two-dimensional to a three-dimensional space allows you to
use a linear plane to split the data in a similar way, but within a 3-D space.
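
Here is a minimal sketch of both ideas with scikit-learn (the library and the data points are assumptions for illustration): a linear SVM versus an SVM using the RBF kernel, one common form of the kernel trick, on points that a straight line cannot separate.

# A minimal SVM sketch: linear kernel versus the RBF kernel (a common form
# of the kernel trick). The data points are invented: one class sits inside
# a ring formed by the other, so no straight line can separate them.
from sklearn.svm import SVC

X = [[0, 0], [0.2, 0.1], [-0.1, 0.2], [0.1, -0.2],               # inner class
     [2, 0], [0, 2], [-2, 0], [0, -2], [1.5, 1.5], [-1.5, 1.5]]  # outer class
y = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf").fit(X, y)

# The RBF kernel implicitly maps the points into a higher-dimensional space
# where a linear boundary can separate the two classes.
print("linear accuracy:", linear_svm.score(X, y))
print("rbf accuracy:", rbf_svm.score(X, y))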








[Figure: Line A (logistic regression hyperplane) and Line B (SVM hyperplane) separating two linearly separable groups of data points, with the margin highlighted. Source: ieeexplore]

Artificial Neural Networks - Deep Learning
Deep learning is a popular area within machine learning today.

Deep learning became widely popular in 2012 when tech companies started to
show off what they were able to achieve through sophisticated layer analysis,
including image classification and speech recognition.

Deep learning is largely a sexy term for Artificial Neural Networks (ANN) with
many layers, and ANN have been around for over forty years.

Artificial Neural Networks (ANN), also known as neural networks, are among the
most widely used algorithms within the field of machine learning. Neural
networks are commonly used in visual and audio recognition.

ANN analyzes data across many layers and was inspired by the human brain,
which visually processes objects through layers of neurons.

ANN is typically presented in the form of interconnected neurons that interact
with each other. Each connection has a numeric weight that can be altered based
on experience.

Much like building a human pyramid or a house of cards, the layers or neurons
are stacked on top of each other starting with a broad base.

The bottom layer consists of raw data such as text, images or sound, which is
divided into what we call neurons. Within each neuron is a collection of data.
Each neuron then sends information up to the layer of neurons above. As the
information ascends it becomes less abstract and more specific, and the more we
can learn from the data at each layer.

A simple neural network can be divided into input, hidden, and output layers.
Data is first received by the input layer, which detects broad features. The
hidden layer/s then analyze and process that data, and as it passes through each
successive layer (with fewer neurons at each step) the data becomes clearer,
based on the previous computations. The final result is shown as the output
layer.
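
To make the layer structure concrete, here is a minimal sketch of a single forward pass through a tiny network with an input layer, one hidden layer and an output layer; the weights and input values are random, invented numbers, so no actual learning takes place.

# A minimal sketch of one forward pass through a tiny neural network:
# 4 input neurons -> 3 hidden neurons -> 1 output neuron.
# The input and weights are random, invented numbers; no training occurs.
import numpy as np

rng = np.random.default_rng(0)

x = rng.random(4)                    # input layer: raw data as 4 numbers
weights_hidden = rng.random((4, 3))  # connections from input to hidden layer
weights_output = rng.random((3, 1))  # connections from hidden to output layer

def sigmoid(z):
    # squashes each value into the range 0 to 1
    return 1 / (1 + np.exp(-z))

hidden = sigmoid(x @ weights_hidden)       # hidden layer: 3 neurons
output = sigmoid(hidden @ weights_output)  # output layer: 1 neuron

print(hidden)
print(output)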

The middle layers are considered hidden layers, because like human sight we are
unable to naturally break down objects into layered vision.

For example, if you see four lines in the shape of a square you will visually
recognize those four lines as a square. You will not see the lines as four
independent objects with no relationship to each other.

ANN works much the same way in that it breaks data into layers and examines
the hidden layers we wouldn’t naturally recognize from the onset.

This is how a cat, for instance, would visually process a square. The cat’s brain
would follow a step-by-step process, where each polyline (of which there are
four in the case of a square) is processed by a single neuron.

Each polyline then merges into two straight lines, and then the two straight lines
merge into a single square. Via this staged neuron processing, the cat’s brain can
see the square.

Four decades ago neural networks were only two layers deep, because it was
computationally unfeasible to develop and analyze deeper networks. With the
development of modern technology, it is now possible to analyze ten or more
layers, or even over 100 layers, with relative ease.

Most other machine learning algorithms, including decision trees and naive
Bayes, are considered shallow algorithms, as they do not analyze information via
numerous layers as ANN can.

Data Visualization
Once your data analytics has been completed, you are one step closer to
commercializing your data set. But first you need a means to communicate the
value of your new insight. You need to convey the findings to the rest of the
organization, and to inform decision makers or other parties.

No matter how impactful and insightful your data discoveries are, you have to
find a way of effectively communicating the results to an audience who perhaps
aren’t proficient with data science terminology.

This is why data visualization has been so successful and widely adopted in data
science. Visualization is a highly effective medium for communicating data
findings to a general audience. Graphs, pie charts and other visual
representations of numbers make for fast and easy storytelling.

You can think of data visualization as the middleman between the data science
experts and the intended audience.

As a data scientist, it is an advantage to have a grasp of effective visualization
techniques, as this will assist your efforts to communicate findings to your
audience effectively.

Tableau is a popular visualization tool for data scientists. The software supports
a range of visualization techniques including charts, graphs, maps and other
options.
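
Dedicated tools aside, a basic chart can also be produced with a few lines of code. Here is a minimal sketch using the matplotlib library (the library choice and the sales figures are assumptions for illustration):

# A minimal visualization sketch with matplotlib; the figures are invented.
import matplotlib.pyplot as plt

products = ["A", "B", "C", "D", "E"]
units_sold = [14, 9, 21, 6, 11]

plt.bar(products, units_sold)
plt.title("Lemonade stand sales by product")
plt.xlabel("Product")
plt.ylabel("Units sold")
plt.show()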

Where to From Here
Career Opportunities in Data Analytics
Data analytics takes training and the absorption of theoretical knowledge,
technology and software in order to master.

Those with a fascination for unravelling how things work and deconstructing
complicated tasks through set teachings and theory, rather than common sense
and human intuition, will be naturally drawn to data mining.

Those of you with backgrounds in statistics, computer programming,
mathematics and technology systems are naturally going to take to the topic with
ease.

However, there’s nothing stopping you from mastering data analytics even if you
don’t have a background in related fields. As long as you have the willpower and
enthusiasm to go on from here and learn computational languages, statistics and
data software management, you should be able to progress and one day earn a
six-figure salary.

Career opportunities in data analytics are indeed expanding and becoming more
lucrative at the same time. Due to current shortages of qualified professionals
and the escalating demand for experts to manage and analyze data, the outlook
for data professionals is bright.

To work in data analytics you will need both a strong passion for the field of
study and dedication to educate yourself on the various facets of data analytics.

There are various channels through which you can start to train yourself in the
field. A university degree, an online degree program or an online curriculum are
common entry points.

Along the way it is also important to seek out mentors who you can turn to for
advice, both on technical analytics questions and on career options and
trajectories.

A mentor could be a professor, colleague, or even someone you don’t yet know.
If you are looking to meet professionals with more industry specific experience
it is recommended that you attend industry conferences or smaller offline events
held locally. You could decide to attend either as a participant or as a volunteer.
Volunteering may in fact offer you more access to certain experts and save
admission fees at the same time.

LinkedIn and Twitter are terrific online resources to identify professionals in the
field or access leading industry voices. When reaching out to established
professionals you may receive resistance or a lack of response depending on
whom you are contacting.

One way to overcome this potential problem is to offer your services in
exchange for mentoring. For example, if you have experience and expertise in
managing a WordPress website, you could offer your time to build a new website
or manage an existing one for the person you are seeking to form a relationship
with.

Other services you can offer include proofreading books, papers and blogs, or
interning at their particular company or institute.

Sometimes it’s better to start your search for mentors locally, as that opens more
opportunities to meet in person and to find local internship and job
opportunities. It also naturally conveys more initial trust than, say, emailing
someone on the other side of the world.

Interviewing experts is one of the most effective ways to access one-on-one time
with an industry expert. This is because it is an opportunity for the interviewee
to reach a larger audience with their ideas and opinions. In addition, you get to
choose your questions and ask your own selfish questions after the recording.

You can look for local tech media news outlets, university media groups, or even
start your own podcast series or industry blog channel. Bear in mind that
developing ongoing content via a podcast series entails a sizeable time
commitment to prepare, record, edit and market. The project though can bear
fruit as you produce more episodes.

Quora is an easy-to-access resource to ask questions and seek advice from a
community who are naturally very helpful. However, do keep in mind that
Quora responses tend to be influenced by self-interest and if you ask for a book
recommendation you will undoubtedly attract responses from people
recommending their own book!

That said, there is still a wealth of unbiased information available on Quora;
you just need to use your own judgement to discern high-value information from
a sales pitch.


College Degrees
Recommended Degrees in the U.S:

Southern Methodist University, Dallas, Texas
Online Master of Science in Data Science
Available online over 20 months. Ranked a Top National University by US
News.

Syracuse University, Syracuse, New York
Online Master of Science in Business Analytics
Available online. GMAT waivers available.

Syracuse University, Syracuse, New York
Online Master of Science in Information Management
Available online. GRE waivers available.

American University, Washington DC
Online Master of Science in Analytics
Available online. No GMAT/GRE required to apply.

Villanova University, Villanova, Pennsylvania
Online Master of Science in Analytics
Available online.

Purdue University, West Lafayette, Indiana
Master of Science in Business Analytics and Information Management
Full-time 12-month program. Eduniversal ranks Krannert's Management
Information Systems field of study #4 in North America.


University of California-Berkeley, Berkeley, California
Available online. #1 ranked public university by US News.

Recommended Resources
Introduction to Advanced Analytics
Covers Data Mining, as well as Predictive Analytics, and Business Intelligence.
By Rapid Miner.

Data Mining For the Masses
A comprehensive introduction to Data Mining. By Rapid Miner.

Data Science
Udacity Course.

Want to be a Data Scientist?
Udemy Course.

Introduction to Data Analysis and Visualization
By Mathworks.

20 short tutorials all data scientists should read (and practice)
By Data Science Central.

The Netflix Prize and Production Machine Learning Systems: An Insider Look
A very interesting blog showing how Netflix uses machine learning to form
movie recommendations. By MathWorks.

Final Word
I hope this book has helped to ease you into the field of data analytics and to
translate the topic into layman’s terms.

I hope you enjoyed this short book and I wish you all the best with your future
career in data analytics.

Yours sincerely,

Oliver Theobald
