Beginner's Guide To Data Analytics: Crash Course
Oliver Theobald
First Edition. Copyright © 2017 by Oliver Theobald. All rights reserved. No part of this publication may be
reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or
other electronic or mechanical methods, without the prior written permission of the publisher, except in the
case of brief quotations embodied in critical reviews and certain other non-commercial uses permitted by
copyright law.
This book is written for anyone who is interested in making sense of data
analytics without the assumption that you understand specific data science
terminology or advanced programming languages.
If you are just starting out as a student of data science or already working in
marketing, medical research, senior management, policy analysis or IT then this
book is ideally suited for you.
In order to group and designate new data to the database tables, an RDBMS relies on what is known as a schema. The schema defines what your data looks like and where it can be placed within the relational database. Relational databases therefore require considerable upfront design to determine the schema, or the format and category of the data you are collecting.
Most relational database management systems use Structured Query Language
(SQL) to access the data and perform commands to manipulate and view the
data across multiple tables.
Despite their long history, RDBMS are still widely used today, especially with regard to data warehousing, or what is also known as an Enterprise Data Warehouse (EDW).
- Data Warehouse/EDW
EDW is a relational database that focuses on storing data for the purpose of
future analysis. Whereas traditional databases typically store data that will be
processed in real-time in order to access and record information, known also as
online transaction processing (OLTP), an EDW is optimized for storing data and
then performing offline analytics and creating reports, also referred to as OLAP.
An easy-to-understand example of a traditional database/OLTP task would be using SQL to retrieve information to process a purchase order. If you run an e-commerce store, for example, you will need access to multiple tables in real time in order to process the order, including the customer's billing details, the customer's mailing address and your inventory list. You don't want to have to crank up the servers and wait 30 minutes just to retrieve the data you need.
Instead, SQL is an easy way to access the information and perform online
transaction processing (OLTP).
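To make this concrete, below is a minimal sketch of an OLTP-style query using Python's built-in sqlite3 module. The table and column names (customers, inventory, orders) are invented for illustration; a real store would have its own schema.

import sqlite3

# In-memory database with hypothetical tables for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
CREATE TABLE inventory (sku TEXT PRIMARY KEY, product TEXT, stock INTEGER);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, sku TEXT);
INSERT INTO customers VALUES (1, 'Ada', '12 High St');
INSERT INTO inventory VALUES ('B-100', 'Notebook', 45);
INSERT INTO orders VALUES (501, 1, 'B-100');
""")

# OLTP-style query: join three tables in real time to process a single order.
row = conn.execute("""
SELECT c.name, c.address, i.product, i.stock
FROM orders o
JOIN customers c ON c.id = o.customer_id
JOIN inventory i ON i.sku = o.sku
WHERE o.id = 501
""").fetchone()

print(row)  # ('Ada', '12 High St', 'Notebook', 45)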
The role of a data warehouse or OLAP (Online Analytical Processing) is to then analyze tables and transactions after the purchasing has taken place. This may mean analyzing transactions in relation to other transactions, i.e. what did other customers buy? Or analyzing the data for commonalities among types of customers, i.e. where they live and what time they order.
Ultimately, the goal of data warehousing and OLAP is to store data that will be later processed in order to produce new insight that helps decision makers better understand their data and provides unique information to improve operations.
Also note that relational databases can be scripted to automatically upload data
to the data warehouse at regular intervals.
- Distributed computing clusters
One of the most popular distributed computer clusters to store data is
Hadoop. Hadoop operates on a distributed file system to store data on multiple
servers, also known as a Hadoop cluster. In addition, Hadoop provides the
infrastructure to later process your data across hundreds or even thousands of
servers by splitting tasks across the cluster.
- Key value store
A key value store is a simple and easy-to-use database that stores data as a key-
value pair. The key is marked by a string such as a filename, URI or hash, and
matches a value that can be any kind of data such as an image or document. The
value (data) is essentially stored as a blob, and therefore does not require a
schema definition.
This is a fast option to store, retrieve and delete data. However, as the value is opaque, you cannot filter or control what's returned from a request.
Apache Cassandra is a free open-source distributed database management system similar to Hadoop, but it fits into the category of key value store as it is a NoSQL system that stores data without a schema and without the use of SQL.
Cassandra is known as one of the most highly available (reliable) and fault-tolerant database systems available. It is therefore suited to handling massive amounts of data, such as indexing every web page on the Internet or serving as a backup system to replace tape backups.
- Data Migration
A common task and challenge for data scientists is the migration of data from
one storage platform to another. This may entail migrating data from a legacy
database, spreadsheet or data warehouse to a distributed computing-based storage platform such as Hadoop or a cloud storage platform.
Migrating data from data warehouses to Hadoop and cloud platforms is
becoming more common given the cost savings of storing data on a distributed computing network.
The need for data migration is especially common when one business or website
is acquired by another, and the new owner wishes to merge data from both
entities into one storage repository.
ETL (extract, transform and load) is a process used to migrate data from one storage platform to another. Under ETL, you first extract the data from its existing home, which may be a storage platform with a different schema to the target. The task is then to scrub and transform the data into a standard format compatible with the data's future home so it will all fit. Lastly, the transformed data is loaded onto the new storage platform.
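A minimal ETL sketch in Python might look like the following, assuming pandas is installed. The file names, column names and target table are hypothetical placeholders.

import sqlite3
import pandas as pd

# Extract: pull records from the legacy source (a CSV export in this sketch).
df = pd.read_csv("legacy_customers.csv")            # hypothetical export file

# Transform: scrub the data so it fits the target schema.
df = df.drop_duplicates()
df["email"] = df["email"].str.strip().str.lower()   # assumes an 'email' column
df = df.rename(columns={"cust_name": "customer_name"})

# Load: write the cleaned records into the new storage platform.
target = sqlite3.connect("warehouse.db")
df.to_sql("customers", target, if_exists="append", index=False)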
Data Scrubbing
After storing your data the next process is refining your data to make it easier to
work with, known as scrubbing. Data scrubbing can entail modifying or
removing incomplete, incorrectly formatted or repeated data within the dataset.
The overall goal of data scrubbing is to make the dataset more accurate and
convenient to process.
For data scientists, data scrubbing usually demands the most time and effort. Data scrubbing could entail removing data (including anomalies and outliers), data dimension reduction, classification and clustering.
Data scientists use a wide range of tools, including text editors, scripting
tools, and programming languages such as Python to scrub the data.
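As a small, hedged example of what scrubbing can look like in Python with pandas, the snippet below removes duplicates, fixes inconsistent formatting, drops an obvious outlier and fills a missing value. The dataset is invented for illustration.

import pandas as pd

# Hypothetical raw data with typical problems: a duplicate row, a missing
# value, an impossible age and inconsistent email formatting.
df = pd.DataFrame({
    "age":   [34, 34, None, 290],
    "email": ["a@x.com", "a@x.com", "B@X.COM ", "c@x.com"],
})

df = df.drop_duplicates()                               # remove repeated rows
df["email"] = df["email"].str.strip().str.lower()       # standardize formatting
df = df[df["age"].between(0, 120) | df["age"].isna()]   # drop an impossible value
df["age"] = df["age"].fillna(df["age"].median())        # fill the missing value
print(df)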
Hadoop can also be used for data scrubbing. Hadoop is used for storage, as
mentioned, and also for data scrubbing through MapReduce. Via MapReduce
data is divided into smaller batches and each batch is subsequently processed via
the Hadoop cluster. Apache Spark is another alternative that can process data in
real-time.
An example of scrubbing could be taking data collected from Facebook and
stored on a Hadoop cluster, and then using data techniques to categorize
Facebook posts. Various algorithms could be used to classify, cluster or group
Facebook posts into a variety of more manageable groups such as:
- Positive/negative
- Use of collective language (we, us) or individual language (I, me)
- Location of post
Data scrubbing could also entail reducing the dimensionality of your data. This is a process to identify the combinations of variables (fields) that seem most important and relevant to your hypothesis. This may mean merging two variables into one. For example, 'hours' and 'seconds' could be merged and denoted as "day of the week."
This helps to simplify the data set and reduce data noise or distractions.
Another option is to delete the data altogether. However, it should be noted that there exists a counter-argument to permanently removing data. This argument notes that as a data scientist or data science team you will never truly know what questions you may wish to ask of the data now and in the future. Secondly, as it's becoming increasingly cost-effective to store massive amounts of data, it could be safer to hold on to the data under a data retention policy rather than throw it away.
This is a decision you as a data scientist or decision maker will need to make based on your company's size, industry, IT resources, data storage budget and data analysis goals. However, it's still perfectly possible to hold on to all your data and conduct data reduction – simply keep a backup copy of all your data on record!
Data Analysis
Data analysis calls on statistical packages to deeply analyze the scrubbed data. Popular package tools include R, SPSS and Python's data libraries. Many of these tools allow you to visualize the data in the form of charts and graphs. To a certain extent, data visualization overlaps with data analysis depending on your package tool.
More about data analysis techniques will be covered in a following section.
Web Scraping
A great way to get started with data analytics and collecting data is web
scraping. Web scraping is a computer software technique to extract information
from websites.
Expressed differently, web scraping goes out and automatically collects
information from the web for you. This saves you the hassle of manually
clicking through webpages, and copying and pasting that information into a
spreadsheet.
Web scraping can be extremely helpful for sales and marketing, and is an
efficient way to get large amounts of data in a very short period of time.
Web scraping is used widely and search engines are the best example. As part of
their search service, Google uses web scraping to crawl and index nearly every
website on the web. Other websites use web scraping to aggregate services and
prices, including hotels, flights and other booking sites.
The goal of scraping is typically to work backwards from your data needs. For example, you may need a list of Twitter key opinion leaders, or you may need to index the sales of products on e-commerce platforms. You therefore choose your target source to scrape based on your data needs.
It is possible to scrape data from multiple different sources (websites), but
obviously it’s easier to concentrate on one source to start with.
While scraping may sound technically difficult, there are tools available that
make it a simple click and drag process for non-technical people. One such tool
is Import.io.
Import.io is a browser tool you can download to scrape websites. The tool was previously free to download but now has a steep pricing system in place. However, if you are a student, teacher, journalist, charity or startup then you may request free access. Otherwise, they also offer a free trial option.
You can use Import.io in three simple stages to scrape and collect information
from the web.
1) Download: Your web browser downloads information from a web server to
load and display to you. Import.io will crawl the web server and then retrieve
information from the webpage/s through parsing.
2) Parse: Via parsing, the scraper will selectively retrieve useful segments of information based on defined guidelines. For instance, the tool can be configured to scrape Twitter posts and comments but avoid information that you don't want to scrape, such as menu items and the website footer. Defining what you wish to extract is a matter of just clicking and dragging and following the prompts within the toolbox.
3) Store: Finally, the tool takes this information and stores it for you online. You
can then decide whether you wish to export the data to a cloud storage platform,
JSON, or a simple CSV file saved on your Desktop.
For more technical web scraping Python is the go-to programming language in
the industry. Basic Python crawlers can be written in less than 40 lines of code.
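Below is a rough sketch of such a crawler using the requests and beautifulsoup4 libraries. The URL and the CSS class names (.product, .name, .price) are placeholders; you would swap in the structure of whichever site you have permission to scrape.

import csv
import requests
from bs4 import BeautifulSoup

# Download: fetch the page (hypothetical URL).
url = "https://example.com/products"
html = requests.get(url, timeout=10).text

# Parse: keep only the elements we care about, ignoring menus and footers.
soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select(".product"):                # CSS class assumed for illustration
    name = item.select_one(".name").get_text(strip=True)
    price = item.select_one(".price").get_text(strip=True)
    rows.append((name, price))

# Store: write the results to a simple CSV file.
with open("products.csv", "w", newline="") as f:
    csv.writer(f).writerows([("name", "price"), *rows])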
Downloading Datasets
If collecting datasets yourself is not of interest to you, and you want to first concentrate on data analytics techniques, then look no further than Kaggle.
Kaggle is an online community – now owned by Google – for data scientists and statisticians to access free datasets, join competitions, and simply hang out and talk about data.
Below are five free sample datasets you might want to look into downloading
from the site.
Starbucks Locations Worldwide
Want to figure out which country has the highest density of Starbucks stores, or which Starbucks store is the most isolated from any other? This dataset is for you.
Scraped from the Starbucks store location webpage, this dataset includes the
name and location of every Starbucks store in operation as of February 2017.
European Football Database
Sometimes not a lot of action happens in 90 minutes, but with 25,000+ matches and 10,000+ players across 11 leading European country championships from the 2008 to 2016 seasons, this is the dataset for football diehards.
The dataset even includes team line-ups with squad formation represented in X,Y coordinates, betting odds from 10 providers, and detailed match events including goals, possession, cards and corners.
Craft Beers Dataset
Do you like craft beer? This dataset contains a list of 2,410 American craft beers and 510 breweries, collected in January 2017 from CraftCans.com. Drinking and data crunching is also perfectly legal.
New York Stock Exchange
Interested in fundamental and technical analysis? With up to 30% of traffic on stocks said to be machine generated, how far can we take this number based on lessons learnt from historical data?
This dataset includes prices, fundamentals and securities retrieved from Yahoo Finance, Nasdaq Financials, and EDGAR SEC databases. From this dataset you can look to see what impacts return on investment and what indicates future bankruptcy.
Brazil's House of Deputies Reimbursements
As politicians in Brazil are entitled to receive refunds from money spent on
activities to "better serve the people," there’s a lot of interesting data and
suspicious outliers to be found from this dataset.
Data on these expenses is publicly available but there is very little monitoring of expenses in Brazil. So don't be surprised to see one public servant racking up over 800 flights in one year, and another who recorded R$140,000 (USD $44,500) in postal expenses.
The following section will examine data analytics techniques applied to both
data mining and machine learning.
Data Mining & Machine Learning
Techniques
Regression
Regression, and linear regression specifically, is the “Hello World” equivalent
of data analytics. Just as programmers start with “Hello World” as the first line
of code they learn to write, prospective data scientists typically start with linear
regression.
Regression is a statistical measure that takes a group of random variables and
seeks to determine a mathematical relationship between them. Expressed
differently, regression uses various variables to predict an outcome or score.
Regression is used in a range of disciplines including data mining, finance,
business and investing. In investment and finance, regression is used to value
assets and understand the relationship between variables such as exchange rates
and commodity prices.
In business, regression can help to predict sales for a company based on a range
of variables such as weather temperatures, social media mentions, previous
sales, GDP growth and inbound tourists.
Specifically, regression is applied to determine the strength of a relationship
between one dependent variable (typically represented as Y) and other changing
variables (known also as independent variables).
A simple and practical way to understand regression is to consider the scatter
plot below:
The two quantitative variables in this example are house cost and square footage. House cost is measured on the vertical axis (Y), and square footage is expressed along the horizontal axis (X). Each dot (data point) represents one paired measurement of both 'square footage' and 'house cost'. As you can see, there are numerous data points representing houses within a particular suburb.
To apply regression to this example, we simply draw a straight line that represents the least deviation through the data points.
But how do we know where to draw the straight line? There are many ways we could split the data points with the regression line, but the goal is to draw a straight line that best fits all the points on the graph, with the minimum distance possible from each point to the regression line.
This means that if you were to draw a vertical line from the regression line to every data point on the graph, the combined distance of those points would be the smallest possible for any potential regression line.
As you can see also, the regression line is straight. This is a case of linear
regression. If the line were not straight, it would be known as non-linear
regression, but we will get to that in a moment.
Another important feature of regression is the slope, which can be read directly from the regression line. As one variable (X) increases, you can expect the other variable (Y) to move towards the average value denoted on the regression line. The slope is therefore very useful for forming predictions.
The closer the data points are to the regression line, the more accurate your prediction will be. If there is a greater degree of deviation in the distance between the data points and your regression line, then the less accurate your slope will be in its predictive ability.
Do note that this particular example applies to data with a linear, ascending trend, where the data points generally move from left to right in an ascending fashion. The same linear regression approach does not apply to all data scenarios. In other cases you will need to use other regression techniques – beyond just linear.
There are various types of regression, including linear regression (as
demonstrated), multiple linear regression and non-linear regression methods,
which are more complicated.
Linear Regression
Linear regression uses one independent variable to predict the outcome of the dependent variable (represented as Y).
Multiple Regression
Multiple regression uses two or more independent variables to predict the
outcome of the dependent variable (represented as Y).
Non-linear Regression
Non-linear regression modelling is similar in that it seeks to track a particular
response from a set of variables on the graph. However, non-linear models are
somewhat more complicated to develop.
Non-linear models are created through a series of approximations (iterations),
typically based on a system of trial-and-error. The Gauss-Newton method and
the Levenberg-Marquardt method are popular non-linear regression modelling
techniques.
Linear Regression on Google Sheets
An easy way to get started with Linear Regression is on Microsoft Excel or
Google Sheets. Below are instructions to create a linear regression line on
Google Sheets.
1. Open a spreadsheet in Google Sheets.
2. Enter your data into two columns (x and y).
3. Select a scatter plot chart. In the top right corner, click the Down arrow.
4. Click ‘Advanced edit’.
5. Click Customization and scroll down to the "Trendline" section. If for some reason you don't see the trendline option, it means that your data probably doesn't have X and Y coordinates and a trendline cannot be added.
6. Click the menu next to “Trendline.”
7. Select “Linear”
8. Click Update.
Done!
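If you would rather work in Python, a comparable sketch using NumPy's polyfit fits the same kind of regression line. The square footage and house cost figures below are invented for illustration.

import numpy as np

# Hypothetical paired measurements: square footage (x) and house cost (y, in $1,000s).
x = np.array([1100, 1400, 1600, 1850, 2100, 2500])
y = np.array([199, 245, 262, 310, 345, 400])

# Fit a straight line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)
print(f"slope={slope:.3f}, intercept={intercept:.1f}")

# Use the fitted line to predict the cost of a 2,000 sq ft house.
print(f"predicted cost: {slope * 2000 + intercept:.0f}k")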
Data Reduction
While it may sound somewhat counter-intuitive, one of the core processes of
data mining and machine learning is data reduction – part of the data scrubbing
process as mentioned in the previous chapter.
You would think the more data the better, right? With more data comes more
potential insight to draw from. But not all data is important. Too much data
creates noise and distraction.
Think of it as having way too many files saved on the Desktop of your computer.
Having so many photos, word documents, videos and other files is not
necessarily a bad thing but it does make it hard to find what you’re looking for.
You can reduce data noise and simplify the data set through what is known as
dimensionality reduction. This is a process to identify important combinations or
variables (fields) that seem the most important and relevant to your hypothesis.
The second reason why it could be important to conduct data reduction is you
may have limited machine performance to manage the data. Just like having too
many browsers open on your computer affects your Internet speed, likewise
having too much data slows down your data mining process.
Available storage on the hard drive of your machine could be another limitation,
as well as memory constraints in the form of RAM.
You can overcome computer performance limitations by linking to cloud
services offered by Amazon, Microsoft and Alibaba Cloud etc, but this will cost
money to access their servers. (Most cloud providers offer a 1-12 month free trial period.)

One approach to data reduction is applying a descending dimension algorithm that effectively reduces data from high-dimensional to low-dimensional.
Dimensions are the number of features characterizing the data. For instance,
hotel prices may have four features: room length, room width, number of rooms
and floor level (view).
Given the existence of four features, hotel room data would be expressed on a
four dimensional (4D) data graph. However, there is an opportunity to remove
redundant information and reduce the number of dimensions to three by
combining ‘room length’ and ‘room width’ to be expressed as ‘room area.’
Applying a descending dimension algorithm will thereby enable you to compress
the 4D data graph into a 3D data graph.
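The snippet below sketches this idea with pandas, using made-up hotel room figures: 'room_length' and 'room_width' are merged into a single 'room_area' feature, taking the data from four dimensions down to three.

import pandas as pd

# Hypothetical 4-feature hotel room data.
rooms = pd.DataFrame({
    "room_length": [5.0, 6.5, 4.0],
    "room_width":  [4.0, 4.0, 3.5],
    "num_rooms":   [1, 2, 1],
    "floor_level": [3, 12, 7],
})

# Merge length and width into a single 'room_area' feature: 4D becomes 3D.
rooms["room_area"] = rooms["room_length"] * rooms["room_width"]
rooms = rooms.drop(columns=["room_length", "room_width"])
print(rooms)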
Another advantage of this algorithm is visualization and convenience. Understandably, it's much easier to work with and communicate information in two or three dimensions than on a 4D data graph.
After reducing the data you will be able to focus on the patterns and
regularities. You can do this by zooming out of the data on a graphical interface.
Having too many dimensions or data points would otherwise make it harder for
you to spot these patterns and regularities.
Classification
Classification is a process to place new cases into the correct group. Think of it
like collecting stamps and then placing them into categories.
Key to classification is that the categories already exist and have been pre-
determined – which is very different to ‘clustering’ as we touch upon next.
A commonly used example of classification is email spam detection. Your email
client applies a classification algorithm to determine whether incoming email
should be classified under the two existing categories of ‘spam’ or ‘non-spam’.
Another example of classification could be sorting and allocating e-commerce
deliveries into zip codes at a central post depot.
From these two examples you can see that classification is a simple way to find
patterns within a dataset with known variables.
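As a hedged illustration of placing new cases into pre-existing categories, the sketch below trains a small decision tree (one of many possible classifiers) on made-up email features (number of links and number of all-caps words), labelled as 'spam' or 'non-spam'.

from sklearn.tree import DecisionTreeClassifier

# Toy training data: [number of links, number of ALL-CAPS words] per email,
# with labels from the two pre-existing categories.
X_train = [[8, 5], [6, 7], [7, 6], [0, 0], [1, 1], [0, 2]]
y_train = ["spam", "spam", "spam", "non-spam", "non-spam", "non-spam"]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Place two new, unseen emails into the correct group.
print(model.predict([[5, 4], [0, 1]]))   # expected: ['spam' 'non-spam']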
One of the challenges with classification is that it can be difficult to accurately
apply the classification system unless you know the existing variables.
Clustering though helps to solve this problem.
Clustering
Clustering is another key data principle to group similar data objects into a class,
and differs from classification.
Unlike classification, which starts with predefined labels reflected in the
database table, clustering creates its own labels after clustering the data set.
Analysis by clustering can be used in various scenarios such as pattern
recognition, image processing and market research.
For example, clustering can be applied to uncover customers that share similar purchasing behaviour. By understanding a particular cluster of customer purchasing preferences, you can form decisions on which products you want to recommend to groups based on their commonalities. You can do this by offering them the same promotions via email or click ad banners on your website.
The Netflix example I brought up earlier was a case of identifying a cluster of
viewers that both enjoyed watching the British version of House of Cards and
films featuring Kevin Spacey, and/or films directed by David Fincher.
There would have been very little way of seeing this relationship prior to the clustering process, as there was no set data field classifying fans of Kevin Spacey who also liked watching the British version of House of Cards. But by clustering and isolating the analysis into a group of data points on a scatter map we are able to identify new valuable relationships.
Clustering can also be complemented by classification. As mentioned, clustering
creates new classes or buckets to group data based on the application of
algorithms. From the process of clustering, new classification categories can be
created.
Clustering Data Algorithms
Various algorithms can be used to identify clusters (a short k-means sketch follows this list). These include:
1) Measuring the distance between data points, known as Euclidean Distance.
2) Measuring the distance from a centroid, or mean value, to the surrounding data points
3) Measuring the density of the data within a space and drawing a border around those data points.
4) Distributional models, drawing a normal shape around data points such as an ellipse
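As promised above, here is a brief sketch of the centroid-based approach (2) using scikit-learn's k-means, run on invented customer data (average order value and orders per month). The labels it prints are created by the algorithm rather than pre-defined.

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [average order value, orders per month].
customers = np.array([
    [20, 1], [25, 2], [22, 1],       # occasional low spenders
    [90, 8], [95, 10], [85, 9],      # frequent high spenders
])

# Centroid-based clustering: k-means with two clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)            # cluster label assigned to each customer
print(kmeans.cluster_centers_)   # the two centroids (mean values)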
Anomaly Detection
Anomaly detection is different in that you are now seeking to identify data points that do not fit the norm. They are different in regards to their location on the data plot, and because they don't naturally fit into a cluster.
It's important to first differentiate between anomalies and outliers. An anomaly is an event which should not have happened and is usually seen as a problem. For instance, you detect that the traffic lights at one train crossing on a metropolitan network are not working and need to be fixed.
Outliers are closely linked but represent a slightly larger grouping than
anomalies. Outliers as you can imagine are small groups of data points that
diverge from the main clusters because they record unusual scores on at least
one variable.
While it will depend on the total size of the dataset, some data scientists deem categories with fewer than 10% of cases as an outlier category. Outliers can therefore distort your conclusions – even if only caused by a very small number of cases.
There are several options available to mitigate this challenge. If there is a small number of outliers and deleting them would not have any substantive effect on other analyses, then the best option is to go ahead and delete.
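One simple, commonly used way to flag candidate outliers before deciding what to do with them is a z-score check: measure how many standard deviations each point sits from the mean. The values and the 2.5 threshold below are illustrative only.

import numpy as np

# Hypothetical measurements with one suspicious value at the end.
values = np.array([52, 49, 51, 50, 48, 53, 50, 47, 51, 180])

# Flag points that sit more than 2.5 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
print(values[np.abs(z_scores) > 2.5])   # [180]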
But what then of the anomalies? You first want to detect anomalies. Then you
need to make a decision. You can decide to exclude them in order to focus on
other data points and clusters. But in other cases, you may want to study why
they are different.
Anomalies, for example, are commonly used in the domain of fraud detection to
identify illegal activities.
Human guinea pig and life hacker, Tim Ferriss, is another close observer of
anomalies. Rather than studying athletes, chefs, chess players, linguists and other
successful types who were destined to be successful by genetic makeup, family
background or upbringing, he studies the anomalies. He looks for examples that
defy the odds and then decodes and breaks down what they did to become world
class.
Text Mining
Text mining is one of the most important and popular methods of data mining to
manage unstructured data. Unstructured data is not numerical data stored in rows and columns as found in a spreadsheet. In the case of text mining, we are looking at unstructured data in the form of passages of text.
Text mining is commonly used to analyze social media posts but can be applied
to various other scenarios as well. An example of text mining could be
measuring approval or disapproval amongst users on Twitter by reviewing the text of public Tweets with the hashtag #Bigdata over a set period of time.
Clustering is often applied in combination with text mining. After mining text
data, the data science team then groups the results. Clustering allows the owner
of the data to make certain decisions on how to manage these found groups.
For example, brands may be able to use the data to identify ‘true-fans’ on social
media who denounce ‘haters’ and Internet trolls criticizing the brand. With this
information, brands could then reach out to these social media users to negotiate
terms as a Key Opinion Leader (KOL) or brand ambassador.
The process of text mining entails two primary algorithm categories. The first
category of algorithms identifies the nuances of the text language in the form of
verbs, adjectives, proper nouns, and adverbs. It is also able to identify
positive/negative sentiments.
The second algorithm category treats words simply as individual items. Rather
than analyze the function and context of the word to understand its meaning, the
algorithm treats the word as an individual object and analyzes how often the
word is mentioned and how frequently it appears next to other words.
In addition, the algorithm is actually tabulating words into numbers.
One way to remember this is to think of the game of Scrabble. In Scrabble, each letter you pull out of the bag has a number on it, except in text retrieval we are looking at words, not letters.
Popular algorithms for analysing individual words include:
Naive Bayes
Naive Bayes treats all variables as conditionally independent given a certain outcome, e.g. whether the word is an adjective.
K-means Clustering
A clustering technique used to uncover categories, for example words that
frequently appear next to each other.
Support Vector Machines
The objective of support vector machines is to categorize data into two classes. It
does so by drawing a straight or squiggly line between the data points of both
categories. In other words, it cuts down the middle of the two categories.
Term Frequency Inverse Document Frequency (TFIDF) vectorization
TFIDF weighs how frequently a word occurs in a document against how common that word is across all documents (see the sketch after this list).
Binary Presence
Binary presence simply analyses whether a word is present in a document or
not. Yes/No.
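As flagged above, here is a short sketch of both word-counting approaches using scikit-learn, applied to three made-up posts: TfidfVectorizer produces the TFIDF weights, and CountVectorizer with binary=True records simple yes/no presence.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical posts standing in for scraped tweets.
posts = [
    "big data is great",
    "big data is overhyped",
    "love data analytics",
]

# TFIDF: turns each word into a weighted number per document.
tfidf = TfidfVectorizer()
scores = tfidf.fit_transform(posts)
print(tfidf.get_feature_names_out())
print(scores.toarray().round(2))

# Binary presence: 1 if the word appears in the post, 0 if not.
binary = CountVectorizer(binary=True)
print(binary.fit_transform(posts).toarray())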
Association Analysis
Association analysis is a method to identify items that have an affinity for each
other, and fits under the statistical field of correlation.
Association analysis algorithms are commonly used by e-commerce companies and offline retailers to analyze transactional data and identify items that are commonly purchased together. This insight allows e-commerce sites and retailers to strategically showcase and recommend products to customers based on common purchase combinations.
Association analysis is a relatively straightforward data mining concept to grasp.
Suppose your lemonade stand sells five different products. These products are A, B, C, D and E. Over the course of the day you have multiple buyers stop by your stand to purchase products. Your first customer purchases A and C. The next customer buys C, D and E. Eight more customers arrive and purchase various other combinations of products.
Based on this data you now want to predict what your next customer will
purchase.
The first step in association analysis is to construct frequent itemsets (X).
Frequent itemsets are a combination of items that regularly appear together, or
have an affinity for each other. The combination could be one item with another
single item. Alternatively, the combination could be two or more items with one
or more other items.
From here you can calculate an index number called support (SUPP) that
indicates how often these items appear together.
Please note that in practice, “support” and “itemset” are commonly expressed as
“SUPP” and “X”.
Support can be calculated by dividing X by T, where X is how often the itemset appears in the data and T is your total number of transactions. For example, if E only features once in five transactions, then its support will be only 1 / 5 = 0.2.
However in order to save time and to allow you to focus on items with higher
support, you can set a minimum level known as minimal support or minsup.
Applying minsup will allow you to ignore low-level cases of support.
The other step in association analysis is rule generation. Rule generation is a
collection of if/then statements, in which you calculate what is known as
confidence. Confidence is a metric similar to conditional probability.
E.g., Onions + Bread Buns > Hamburger Meat
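The sketch below computes support and confidence directly in plain Python for a handful of invented lemonade-stand transactions, just to make the two metrics concrete.

# Hypothetical lemonade-stand transactions over products A-E.
transactions = [
    {"A", "C"}, {"C", "D", "E"}, {"A", "C", "D"},
    {"B", "C"}, {"A", "C", "E"},
]
T = len(transactions)

def support(itemset):
    # How often the itemset appears, divided by the total number of transactions.
    return sum(itemset <= t for t in transactions) / T

def confidence(antecedent, consequent):
    # Of the transactions containing the antecedent, how many also contain the consequent?
    return support(antecedent | consequent) / support(antecedent)

print(support({"E"}))             # E appears in 2 of 5 transactions -> 0.4
print(confidence({"A"}, {"C"}))   # A appears 3 times, always alongside C -> 1.0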
Numerous models can be applied to conduct association analysis. Below is a list of the most common algorithms:
- Apriori
- Eclat (Equivalence Class Transformations)
- FP-growth (Frequent Pattern)
- RElim (Recursive Elimination)
- SaM (Split and Merge)
- JIM (Jaccard Itemset Mining)
The most common algorithm is Apriori. Apriori is applied to calculate support
for itemsets one item at a time. It thereby finds the support of one item (how
common is that item in the dataset) and determines whether there is support for
that item. If the support happens to be less than the designated minimum support
amount (minsup) that you have set, the item will be ignored.
Apriori will then move on to the next item and evaluate the minsup value and
determine whether it should hold on to the item or ignore it and move on.
After the algorithm has completed all single-item evaluations, it will transition to processing two-item itemsets. The same minsup criterion is applied to gather itemsets that meet the minsup value. As you can probably guess, it then proceeds to analyze three-item combinations and so on.
The downside of the Apriori method is that it can be slow and demanding on computational resources, with the time and resources required growing exponentially at each round of analysis. This approach can thus be inefficient in processing large datasets.
The most popular alternative is Eclat. Eclat again calculates support for a single
itemset but should the minsup value be successfully reached, it will proceed
directly to adding an additional item (now a two-item itemset).
This is different to Apriori, which would move on to process the next single
item, and process all single items first. Eclat on the other hand will seek to add
as many items to the original single item as it can, until it fails to reach the set
minsup.
This approach is fast and less intensive in regards to computation and
memory, but the itemsets produced are long and difficult to manipulate.
As a data scientist you thus need to form a decision on which algorithm to apply and factor in the trade-offs of each algorithm based on your available computing resources, the amount of data and your time schedule.
Sequence Mining
Sequence mining is a process to identify repeating sequences in a dataset. Sequencing can be applied to various scenarios, including instructing a person what to do next or predicting what's going to happen next.
Sequence mining is similar to association analysis in regards to the prediction
that if x occurs then z and y are also likely to occur. The big difference in
sequence mining is that the order of events matters. In association analysis it’s
not important if the combination is ‘x, y, and z’, or ‘z, y, and x’ but in sequence
mining it is.
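A very small sketch of this order-sensitivity is shown below: it counts how often one item directly follows another across a few invented click sequences, so 'home then search' is counted separately from 'search then home'.

from collections import Counter

# Hypothetical click sequences: order matters, unlike association analysis.
sequences = [
    ["home", "search", "product", "checkout"],
    ["home", "product", "checkout"],
    ["home", "search", "product"],
]

# Count ordered pairs: how often does the second item directly follow the first?
pairs = Counter()
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        pairs[(a, b)] += 1

print(pairs.most_common(3))
# [(('home', 'search'), 2), (('search', 'product'), 2), (('product', 'checkout'), 2)]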
Sequence mining algorithms
A number of models can be applied to conduct sequence mining.
- GSP (Generalized Sequential Patterns)
- SPADE (Sequential Pattern Discovery using Equivalence classes)
- FreeSpan
- HMM (Hidden Markov Model)
A common method for sequence mining is GSP. GSP is similar to Apriori, as
discussed in the previous chapter. But unlike Apriori, GSP adheres to the order of events, which could be, say, ordinal or temporal.
Temporal refers to the state of time and ordinal refers to the logical progression
of categories, ie, “elementary school > middle school > senior school,” or
“single > engaged > married.”
Unlike Apriori, GSP will not treat “X, Y and then Z” and “X, Z and then Y” as
the same thing. But, like Apriori, GSP must do a lot of passes through the data to conduct its findings and can therefore be a slow and computationally draining procedure.
For larger data sets, SPADE (Sequential Pattern Discovery using Equivalence
classes) is recommended. SPADE does fewer database scans by using intersecting ID-lists. It first uses a vertical ID-list of the database and stretches out the data by row into a two-dimensional matrix of the first and second item. Based on those results, it can then add a third, fourth, fifth etc. item to analyze.
Artificial Neural Networks - Deep Learning
Deep learning is a popular area within machine learning today.
Deep learning became widely popular in 2012 when tech companies started to
show off what they were able to achieve through sophisticated layer analysis,
including image classification and speech recognition.
Deep learning is just a sexy term for Artificial Neural Networks (ANN), which
have been around for over forty years.
Artificial Neural Networks (ANN), also known as Neural Networks, is one of
the most widely used algorithms within the field of machine learning. Neural
networks are commonly used in visual and audio recognition.
ANN emphasizes analyzing data in many layers, and was inspired by the human brain, which can visually process objects through layers of neurons.
ANN is typically presented in the form of interconnected neurons that interact with each other. Each connection has a numeric weight that can be altered based on experience.
Much like building a human pyramid or a house of cards, the layers or neurons
are stacked on top of each other starting with a broad base.
The bottom layer consists of raw data such as text, images or sound, which are
divided into what we call neurons. Within each neuron is a collection of data.
Each neuron then sends information up to the layer of neurons above. As the information ascends it becomes less abstract and more specific, and we can learn more from the data at each layer.
A simple neural network can be divided into input, hidden, and output layers. Data is first received by the input layer, and this first layer detects broad features. The hidden layer/s then analyze and process that data, and through the passing of each layer, with fewer neurons at each step, the data becomes clearer, based on previous computations. The final result is shown as the output layer.
The middle layers are considered hidden layers, because like human sight we are
unable to naturally break down objects into layered vision.
For example, if you see four lines in the shape of a square you will visually
recognize those four lines as a square. You will not see the lines as four
independent objects with no relationship to each other.
ANN works much the same way in that it breaks data into layers and examines
the hidden layers we wouldn’t naturally recognize from the onset.
This is how a cat, for instance, would visually process a square. The cat’s brain
would follow a step-by-step process, where each polyline (of which there are
four in the case of a square) is processed by a single neuron.
Each polyline then merges into two straight lines, and then the two straight lines merge into a single square. Via staged neuron processing, the cat's brain can see the square.
Four decades ago neural networks were only two layers deep. This was because
it was computationally unfeasible to develop and analyze deeper networks.
Naturally, with the development of technology it is possible to easily analyze ten
or more layers, or even over 100 layers.
Most modern algorithms, including decision trees and Naive Bayes, are considered shallow algorithms, as they do not analyze information via numerous layers as ANN can.
Data Visualization
Once your data analytics has been completed, you are one step closer to
commercializing your data set. But first you need a means to communicate the
value of your new insight. You need to convey the findings to the rest of the
organization, and to inform decision makers or other parties.
No matter how impactful and insightful your data discoveries are, you have to
find a way of effectively communicating the results to an audience who perhaps
aren’t proficient with data science terminology.
This is why data visualization has been so successful and widely adopted in data science. Visualization is a highly effective medium to communicate data findings to a general audience. The visual storytelling applied by graphs, pie charts and the representation of numbers in shapes makes for fast and easy communication.
You can think of data visualization as the middleman between the data science experts and the intended audience.
As a data scientist it’s an advantage to have a grasp or understanding of effective
visualization techniques. This will assist your efforts in effectively
communicating data with your audience.
Tableau is a popular visualization tool for data scientists. The software program supports a range of visualization techniques including charts, graphs, maps and other options.
Where to From Here
Career Opportunities in Data Analytics
Data analytics takes training and absorption of theoretical knowledge,
technology and software in order to master.
Those with a fascination to unravel how things work and deconstruct
complicated tasks through set teachings and theory rather than common sense
and human intuition, will be naturally drawn to data mining.
Those of you with backgrounds in statistics, computer programming,
mathematics and technology systems are naturally going to take to the topic with
ease.
However, there's nothing stopping you from mastering data analytics even if you don't have a background in related fields. As long as you have the willpower and enthusiasm to go on from here and learn computational languages, statistics and data software management, you should be able to go on to one day earn a six-figure salary.
Career opportunities in data analytics are indeed both expanding and becoming more lucrative at the same time. Due to current shortages of qualified professionals and the escalating demand for experts to manage and analyze data, the outlook for data professionals is bright.
To work in data analytics you will need both a strong passion for the field of
study and dedication to educate yourself on the various facets of data analytics.
There are various channels in which you can start to train yourself in the field.
Identifying a university degree, an online degree program or online
curriculum are common entry points.
Along the way it is also important to seek out mentors who you can turn to for advice on technical analytics questions as well as on career options and trajectories.
A mentor could be a professor, colleague, or even someone you don’t yet know.
If you are looking to meet professionals with more industry specific experience
it is recommended that you attend industry conferences or smaller offline events
held locally. You could decide to attend either as a participant or as a volunteer.
Volunteering may in fact offer you more access to certain experts and save
admission fees at the same time.
LinkedIn and Twitter are terrific online resources to identify professionals in the
field or access leading industry voices. When reaching out to established
professionals you may receive resistance or a lack of response depending on
whom you are contacting.
One way to overcome this potential problem is to offer your services in exchange for mentoring. For example, if you have experience and expertise in managing a
WordPress website you could offer your time to build or manage an existing
website for the person you are seeking to form a relationship with.
Other services you can offer are proofreading books, papers and blogs, or interning at their particular company or institute.
Sometimes it's better to start your search for mentors locally, as that will open more opportunities to meet in person and to find local internship and job opportunities. This also naturally conveys more initial trust than, say, emailing someone on the other side of the world.
Interviewing experts is one of the most effective ways to access one-on-one time
with an industry expert. This is because it is an opportunity for the interviewee
to reach a larger audience with their ideas and opinions. In addition, you get to
choose your questions and ask your own selfish questions after the recording.
You can look for local tech media news outlets, university media groups, or even
start your own podcast series or industry blog channel. Bear in mind that
developing ongoing content via a podcast series entails a sizeable time
commitment to prepare, record, edit and market. The project though can bear
fruit as you produce more episodes.
Quora is an easy-to-access resource to ask questions and seek advice from a
community who are naturally very helpful. However, do keep in mind that
Quora responses tend to be influenced by self-interest and if you ask for a book
recommendation you will undoubtedly attract responses from people
recommending their own book!
However, there is still a wealth of non-biased information available on Quora; you just need to use your own judgement to discern high-value information from a sales pitch.
College Degrees
Recommended Degrees in the U.S.:
Southern Methodist University, Dallas, Texas
Online Master of Science in Data Science
Available online over 20 months. Ranked a Top National University by US
News.