
Unit 4 – Big Data Analytics and Visualization

Copyright Guideline
© 2018 Infosys Limited, Bangalore, India. All Rights Reserved.

Infosys believes the information in this document is accurate as of its publication date; such information is subject to change
without notice. Infosys acknowledges the proprietary rights of other companies to the trademarks, product names and such
other intellectual property rights mentioned in this document. Except as expressly permitted, neither this documentation nor
any part of it may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic,
mechanical, printing, photocopying, recording or otherwise, without the prior permission of Infosys Limited and/ or any named
intellectual property rights holders under this document.

Copyright © 2018, Infosys Limited PUBLIC



Learning Objectives
On completion of this module, the learner should be able to:

1. understand big data analytics

2. understand the data analysis process

3. identify types of data analytics

4. identify various types of data visualization tools


Table of Contents “Big Data Analytics and Visualization”


I. Big Data Analytics

1. Introduction to Big Data
2. Big Data Analytics
3. Big Data Technologies
4. Data Analysis Process
5. Types of Data Analytics

II. Data Visualization

1. What is Data Visualization
2. Why is it important?
3. Different Chart Types
4. Introduction to Data Visualization Tools


I. Big Data Analytics



1. Introduction to Big Data


As Wikipedia defines it, Big Data is a term for data that is too large or complex to be processed by
traditional data processing software.
It can comprise unstructured, semi-structured, or structured data, though it is mostly large sets of unstructured data,
which is therefore the primary focus. This unstructured data may exist as document files in many formats or as huge
volumes of sensor data, especially in an IoT ecosystem.

• Structured
– Data containing a defined data type, format, and structure
– Example: Transaction data and OLAP

• Semi-Structured
– Textual data files with a discernible pattern, enabling parsing
– Example: XML data files that are self-describing and defined by an XML schema

• Quasi-Structured
– Textual data with erratic data formats that can be formatted with effort, tools, and time
– Example: Web clickstream data that may contain some inconsistencies in data values and formats

• Unstructured
– Data that has no inherent structure and is usually stored as different types of files
– Example: Text documents, PDFs, images, and video
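
As a rough illustration of the difference between the first two categories, the sketch below parses a structured CSV record set and a semi-structured XML document using only the Python standard library. The sample data and the field names (order_id, amount, sensor, value, unit) are invented for this example and do not come from the course material.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Structured: fixed columns with a known type for each field (hypothetical sample).
CSV_TEXT = "order_id,amount\n1001,25.50\n1002,14.00\n"

# Semi-structured: self-describing tags; fields may vary between records (hypothetical sample).
XML_TEXT = """
<readings>
  <reading sensor="t1"><value>21.5</value><unit>C</unit></reading>
  <reading sensor="t2"><value>19.0</value></reading>
</readings>
"""


def parse_structured(text):
    """Return rows as dicts, converting each column to its expected type."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])} for r in rows]


def parse_semi_structured(text):
    """Walk the XML tree; tolerate optional child elements such as <unit>."""
    root = ET.fromstring(text.strip())
    out = []
    for node in root.findall("reading"):
        out.append({
            "sensor": node.get("sensor"),
            "value": float(node.findtext("value")),
            "unit": node.findtext("unit", default="unknown"),
        })
    return out


orders = parse_structured(CSV_TEXT)
readings = parse_semi_structured(XML_TEXT)
print(orders)
print(readings)
```

The structured parser can rely on every row having the same columns, while the semi-structured parser has to allow for elements that are present in some records and missing in others.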


Why Big Data?


Key enablers for the growth of “Big Data” are:
• Availability of data
• Increase in storage capacities
• Increase in processing power

This growth in data is expected to accelerate with the advent of the Internet of Things, as the number of
connected devices increases.


2. Big Data Analytics


Analytics is important in the context of IoT because connected devices generate huge volumes of data, and the
next step is to extract meaning from that data. The process of uncovering hidden patterns and deriving meaningful
conclusions from Big Data is commonly referred to as Big Data Analytics.
With the right analytics, big data can deliver richer insight, since it draws on multiple sources and
transactions, which helps uncover hidden relationships and patterns.

Let us look at some important characteristics of Big Data, the 4 Vs:

• Volume
• Velocity
• Variety
• Veracity

Volume: the data captured over time can range from a few terabytes to exabytes. One quoted projection is a
44x increase from 2010 to 2020 (1.2 zettabytes to 35.2 zettabytes).
Variety: data can exist in many forms, from structured data held in SQL systems to unstructured data received
directly from sensors.
Velocity: the speed at which the data must be processed for analysis, whether real-time, near real-time, or
periodic.
Veracity: the noise, biases, and abnormality in the data. The data that is stored and mined should be
meaningful to the problem being analyzed.
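
The growth figures quoted for Volume can be checked with simple arithmetic. The sketch below only restates the numbers given in the text (1.2 ZB, 35.2 ZB, and 44x) and adds no new data. The function names are chosen for this example.

```python
# Figures quoted in the text above (zettabytes).
START_ZB = 1.2
END_ZB = 35.2
QUOTED_FACTOR = 44


def growth_factor(start, end):
    """How many times larger `end` is than `start`."""
    return end / start


def implied_baseline(end, factor):
    """Starting volume that would produce `end` after growing by `factor`."""
    return end / factor


factor = growth_factor(START_ZB, END_ZB)
baseline = implied_baseline(END_ZB, QUOTED_FACTOR)
print(f"1.2 ZB -> 35.2 ZB is about {factor:.1f}x")
print(f"A 44x increase to 35.2 ZB implies a baseline of about {baseline:.1f} ZB")
```

By these figures, growth from 1.2 ZB to 35.2 ZB is closer to 29x. A 44x factor would correspond to a starting volume of about 0.8 ZB, so the quoted baseline and growth factor do not quite agree with each other.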


By Magnai17 (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons


3. Big Data Technologies


Databases and storage are required, along with the following technologies, to store and process Big Data.
Some of the popular tools and technologies are as follows:
• NoSQL Databases
A database that does not store data in relational tables (rows and columns), in contrast to a relational database.
MongoDB, CouchDB, Cassandra, Redis, BigTable, HBase, Hypertable, Voldemort, Riak, ZooKeeper
• MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets
with a parallel, distributed algorithm on a cluster.
Hadoop, Hive, Pig, Cascading, Cascalog, mrjob, Caffeine, S4, MapR, Acunu, Flume, Kafka, Azkaban,
Oozie, Greenplum
• Storage
S3, Hadoop Distributed File System
• Servers
EC2, Google App Engine, Elastic Beanstalk, Heroku
• Processing
R, Yahoo! Pipes, Mechanical Turk, Solr/Lucene, ElasticSearch, Datameer, BigSheets, Tinkerpop
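
The MapReduce programming model listed above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps using a word count. It does not use Hadoop or any cluster API, and the input strings are hypothetical. On a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into a single count."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


# Hypothetical input: each string stands in for a file split handled by one worker.
documents = ["big data needs big storage", "data drives decisions"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Because each call to map_phase depends only on its own document, and each key is reduced independently, both steps can be distributed without the workers needing to coordinate beyond the shuffle.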


4. Data Analysis Process


As seen previously, the ultimate goal for an organization or a user is to make better decisions.
For this, Big Data needs to be processed for analysis. The recommended steps in the analysis process are
as follows.

Question

The process begins with a question you want to answer or a problem you want to solve. This
might be something like "What are the characteristics of students who pass their projects?" or
"How can I better stock my store with the products people most want to buy?"

Wrangle

The next step of the process is data wrangling and this really has two parts, data acquisition
and data cleaning.
First, you need to acquire the data that you need to answer your question or solve your problem.
Then it's time to begin investigating the data and cleaning up any problems that you find.
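
A minimal sketch of the two wrangling parts, assuming a small CSV export about student projects. The data, the column names, and the choice to drop rows with missing scores are all assumptions made for this example. Real projects often use a library such as pandas, but the standard library is enough to show the idea.

```python
import csv
import io

# Hypothetical raw export: inconsistent casing, stray spaces, a missing score, a duplicate row.
RAW = """student,project_score,passed
 Asha ,82,yes
ravi,,no
Meena,91,YES
 Asha ,82,yes
"""


def acquire(text):
    """Acquisition: read the raw rows (here from a string instead of a file or API)."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Cleaning: normalise text, drop rows with missing scores, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["student"].strip().title()
        score = row["project_score"].strip()
        if not score:
            continue  # missing value: skipped here; imputing is another option
        record = (name, int(score), row["passed"].strip().lower() == "yes")
        if record not in seen:
            seen.add(record)
            cleaned.append({"student": name, "score": record[1], "passed": record[2]})
    return cleaned


rows = clean(acquire(RAW))
print(rows)
```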


Explore and Draw Conclusions

The third phase is data exploration. During this phase, you spend some time getting familiar
with your data, building your intuition about it, and finding patterns. Once you're familiar with
your data, you'll usually want to draw some conclusions about it or make some
predictions. For example, Netflix's movie recommendation system needs to predict which movies
its users will like. This phase usually involves statistics or machine learning, which are beyond
the scope of this course.
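
Exploration often starts with simple summaries before any modelling. The sketch below, using hypothetical records about hours spent on a project and whether it passed, computes overall statistics and then compares the groups. The field names and values are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical cleaned records: hours spent on a project and whether it passed.
records = [
    {"hours": 12, "passed": True},
    {"hours": 3, "passed": False},
    {"hours": 9, "passed": True},
    {"hours": 5, "passed": False},
    {"hours": 10, "passed": True},
]


def summarize(rows, field):
    """Basic descriptive statistics for one numeric field."""
    values = [r[field] for r in rows]
    return {"min": min(values), "max": max(values), "mean": mean(values), "median": median(values)}


def mean_by_group(rows, group_field, value_field):
    """Average of `value_field` within each value of `group_field`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_field]].append(r[value_field])
    return {key: mean(vals) for key, vals in groups.items()}


overall = summarize(records, "hours")
by_outcome = mean_by_group(records, "passed", "hours")
print(overall)
print(by_outcome)
```

A gap between the group averages is a pattern worth investigating further. On its own it does not show that more hours cause a pass.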

Communicate

Finally, you'll need to communicate your findings to other people.

Your findings are only as useful as your ability to communicate them. Even if your end goal is
to build some sort of system, like a movie recommender or a news feed ranking algorithm,
you'll usually need to share what you've built and how it works with your team. This
communication can take a variety of formats. You might write a blog post, a paper, an email, or a
PowerPoint presentation, or just have an in-person conversation. Data visualization is a common
technique that's almost always useful when communicating findings about data.


However, the data analysis process doesn't actually follow a straight line.

The data wrangling and data exploration phases in particular are closely intertwined, because you can't
really clean the data before you take a look to see what problems there are to solve.
And even when you think you're done wrangling and you're ready to just explore, you'll keep finding more
problems and have to go back. Throughout the process, you may need to return to your question and refine it
as you become more familiar with the data set.
And sometimes data acquisition actually comes before you pose a question. If a new, exciting data set is
released, you might acquire the data first, take a look and see what's there, and then think of some questions
you could answer with the data.

Even so, this should give you an idea of the high-level steps that are involved when you're doing data
analysis.


5. Types of Data Analytics


There are four types of Big Data analytics that aid business:

• Prescriptive – This type of analysis reveals what actions should be taken. This is the most valuable kind of
analysis and usually results in rules and recommendations for next steps.
• Predictive – An analysis of likely scenarios of what might happen. The deliverable is usually a predictive
forecast.
• Diagnostic – A look at past performance to determine what happened and why. The result of the analysis is
often an analytic dashboard.
• Descriptive – What is happening now, based on incoming data. To mine the analytics, you typically use a
real-time dashboard and/or email reports.
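
To make the four types more concrete, the sketch below applies each one to the same small series of hypothetical monthly sales figures. The data, the stock level, the straight-line forecast, and the reorder rule are all simplifying assumptions for illustration. Real analytics would use richer data and models.

```python
# Hypothetical monthly unit sales for one product.
sales = [100, 110, 125, 130, 150, 160]
STOCK_ON_HAND = 140  # hypothetical inventory level


def descriptive(series):
    """Descriptive: summarise what the data shows."""
    return {"latest": series[-1], "average": sum(series) / len(series)}


def diagnostic(series):
    """Diagnostic: look back for where the largest change occurred."""
    changes = [b - a for a, b in zip(series, series[1:])]
    biggest = max(range(len(changes)), key=lambda i: abs(changes[i]))
    return {"month_index": biggest + 1, "change": changes[biggest]}


def predictive(series):
    """Predictive: extrapolate one step with a least-squares straight line."""
    n = len(series)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(series) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, series)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return y_mean + slope * (n - x_mean)


def prescriptive(forecast, stock):
    """Prescriptive: turn the forecast into a recommended action (a toy rule)."""
    shortfall = forecast - stock
    return f"order {round(shortfall)} units" if shortfall > 0 else "no order needed"


forecast = predictive(sales)
print(descriptive(sales))
print(diagnostic(sales))
print(round(forecast, 1))
print(prescriptive(forecast, STOCK_ON_HAND))
```

The diagnostic step here only locates the largest month-to-month change, which is a starting point for asking why it happened. The prescriptive step shows how a forecast becomes a recommendation once a decision rule is attached.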


Big Data Analytics in Action


• Prescriptive analytics is valuable but little used. According to Gartner, 13 percent of
organizations are using predictive analytics but only 3 percent are using prescriptive analytics. Where big data
analytics in general sheds light on a subject, prescriptive analytics gives you a laser-like focus to answer specific
questions. For example, in the health care industry, you can better manage the patient population by using
prescriptive analytics to measure the number of patients who are clinically obese, then add filters for factors
like diabetes and LDL cholesterol levels to determine where to focus treatment. The same prescriptive model
can be applied to almost any industry target group or problem.

• Predictive analytics uses big data to identify past patterns in order to predict the future. For example, some
companies are using predictive analytics for sales lead scoring. Some companies have gone one step further and use
predictive analytics for the entire sales process, analyzing lead source, number of communications, types of
communications, social media, documents, CRM data, etc. Properly tuned predictive analytics can be used to
support sales, marketing, or other types of complex forecasts.


Big Data Analytics in Action…


• Diagnostic analytics are used for discovery or to determine why something happened. For example, for a
social media marketing campaign, you can use descriptive analytics to assess the number of posts, mentions,
followers, fans, page views, reviews, pins, etc. There can be thousands of online mentions that can be distilled
into a single view to see what worked in your past campaigns and what didn’t.

• Descriptive analytics or data mining are at the bottom of the big data value chain, but they can be valuable
for uncovering patterns that offer insight. A simple example of descriptive analytics would be assessing credit
risk; using past financial performance to predict a customer’s likely financial performance. Descriptive
analytics can be useful in the sales cycle, for example, to categorize customers by their likely product
preferences and sales cycle.

• As you can see, harnessing big data analytics can deliver big value to business, adding context to data that
tells a more complete story. By reducing complex data sets to actionable intelligence you can make more
accurate business decisions. If you understand how to demystify big data for your customers, then your value
has just gone up tenfold.


II. Data Visualization



What is Data Visualization and Why is it important?


• Data visualization is the representation of data in a pictorial or graphical format. It enables decision makers to
see analytics presented visually, so they can grasp difficult concepts or identify new patterns.
• The term may also refer to the technologies used for creating images, diagrams, or animations that communicate a
message. These are often used to synthesize the results of big data analyses.

• Why is data visualization important?

• Because of the way the human brain processes information, using charts or graphs to visualize large amounts
of complex data is easier than going through spreadsheets or reports. Data visualization is a quick, easy way
to convey concepts.


Chart Types
• There are many different charts that can be used to represent data, such as bar charts, line charts, and pie
charts, or even more complex forms that enable interactivity.
• A complete, contemporary catalogue of interesting chart types can be found here


Some of the most common types of data charts include:


Bar Graph. A bar chart (also known as a bar graph) shows the differences between categories or trends
over time using the length or height of its bars
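
The idea that bar length encodes value can be shown without any plotting library. The sketch below renders a text-only bar chart. The category values are hypothetical, and the function name text_bar_chart is chosen for this example. A real report would normally use a charting tool instead.

```python
# Hypothetical category totals to plot.
data = {"Category 1": 4, "Category 2": 2, "Category 3": 5, "Category 4": 3}


def text_bar_chart(values, width=20, symbol="#"):
    """Return one line per category; bar length is proportional to the value."""
    largest = max(values.values())
    label_width = max(len(label) for label in values)
    lines = []
    for label, value in values.items():
        bar = symbol * round(value / largest * width)
        lines.append(f"{label.ljust(label_width)} | {bar} {value}")
    return lines


chart = text_bar_chart(data)
print("\n".join(chart))
```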

[Figure: bar chart of Series 1, Series 2, and Series 3 across Category 1 to Category 4]


Stacked Bar Chart or Relative Value Chart

[Figure: horizontal stacked bar chart of Series 1, Series 2, and Series 3 for Category 1 to Category 4]


Clustered Bar Chart


[Figure: horizontal clustered bar chart of Series 1, Series 2, and Series 3 for Category 1 to Category 4]


Line Graph

[Figure: line graph of Series 1, Series 2, and Series 3 across Category 1 to Category 4]


Pie Chart

[Figure: pie chart titled “Sales” with slices for 1st Qtr, 2nd Qtr, 3rd Qtr, and 4th Qtr]


Area Charts

[Figure: area chart of Series 1 and Series 2 over dates from 1/5/2002 to 1/9/2002]


Combination Charts

[Figure: combination chart of Series 1, Series 2, and Series 3 across Category 1 to Category 4]


Points that should be considered before deciding on a chart type:


• Understand the data you’re trying to visualize, including its size and cardinality (the uniqueness of data values
in a column).
• Determine what you’re trying to visualize and what kind of information you want to communicate.
• Know your audience and understand how it processes visual information.
• Use a visual that conveys the information in the best and simplest form for your audience.
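
Cardinality, as described in the first point above, is easy to measure before choosing a chart. The sketch below counts distinct values in a column and applies a toy rule of thumb. The thresholds and the suggest_chart function are illustrative assumptions, not an established standard, and the sample columns are hypothetical.

```python
def cardinality(column):
    """Number of distinct values in a column."""
    return len(set(column))


def suggest_chart(column, is_time_series=False, part_of_whole=False):
    """Toy rule of thumb; the thresholds are illustrative assumptions, not a standard."""
    distinct = cardinality(column)
    if is_time_series:
        return "line chart"
    if part_of_whole and distinct <= 5:
        return "pie chart"
    if distinct <= 20:
        return "bar chart"
    return "aggregate or bin the values first"


regions = ["North", "South", "East", "West", "North", "East"]
dates = ["2002-01-05", "2002-01-06", "2002-01-07"]
customer_ids = [f"C{i}" for i in range(500)]

print(cardinality(regions), suggest_chart(regions, part_of_whole=True))
print(suggest_chart(dates, is_time_series=True))
print(suggest_chart(customer_ids))
```

A column with hundreds of distinct values, such as customer IDs, would produce an unreadable bar or pie chart, which is why the sketch suggests aggregating first.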


Data Visualization Tools


Today, there are plenty of visualization tools available in the market, and some charts can also be built
using JavaScript libraries.
Some of the popular tools and libraries are listed below.
• Tableau
• D3.js
• Highcharts
• Chart.js


Some more interesting examples can be found at the ZingChart site.

However, most of the analytics and visualization for IoT is available on the popular IoT platforms, which are
covered in the next chapter of this unit.



IoT Platforms

OT and IT of IoT

IT: Information Technology


OT: Operational Technology

Source: Cisco Live 2015


What is an IoT Platform


• An IoT ecosystem consists of several heterogeneous components and technologies:
– Things
– Nodes
– IoT Gateways
– IoT Communication Technologies
– IoT Platform

An IoT platform is essentially what makes IoT happen for your device. It is the application that connects the device
with the cloud and with the corresponding output device. An IoT platform:

• Combines hardware and software
• Includes data storage, data analytics, data security, and development tools
• Is designed to support small applications that solve business problems


Why Platforms
• Provide a common, standard application platform that hides the heterogeneity of devices
• Provide a common working environment
• Translate data from the IoT devices into useful information
• Because the platform resides in the cloud, it can offer:
– Predictive maintenance
– Analytics
– Real-time data management
• IoT platforms provide a comprehensive set of functionalities that can be used to build IoT applications. The
platform is a virtual solution that resides in the cloud.
• IoT application platforms provide a complete suite covering application development, deployment, and
maintenance.


Components of an IoT Platform

Source: IoT Analytics



Some Popular IoT Platforms

• End-to-end ecosystem strategy


• Has its own modules, network stack, and cloud on-boarding
• SDKs for Java/Android, ObjC (iOS and Mac), Python, Ruby, and more
• REST/HTTP, WebSockets, MQTT, and CoAP support

• Supports HTTP, WebSockets, and MQTT


• Rules Engine can route messages to AWS endpoints including AWS
Lambda, Amazon Kinesis, Amazon S3, Amazon Machine Learning,
Amazon DynamoDB, Amazon CloudWatch, and Amazon Elasticsearch
Service with built-in Kibana integration
• Create a persistent, virtual version, or “shadow,” of each device that
includes the device’s latest state
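
The “shadow” idea in the last bullet can be sketched as a small in-memory model: the platform keeps the last state a device reported alongside the state an application wants, so the two can be reconciled when the device reconnects. The DeviceShadow class and its method names are hypothetical. This is not the AWS SDK or its message format, only an illustration of the concept.

```python
class DeviceShadow:
    """Simplified in-memory model of a device shadow (not the AWS SDK or its wire format)."""

    def __init__(self):
        self.reported = {}  # last state the device told the cloud about
        self.desired = {}   # state an application has asked the device to reach

    def report(self, **state):
        """Called when the device publishes its current state."""
        self.reported.update(state)

    def request(self, **state):
        """Called when an application asks for a state change, even if the device is offline."""
        self.desired.update(state)

    def delta(self):
        """Desired settings that the device has not yet reported."""
        return {k: v for k, v in self.desired.items() if self.reported.get(k) != v}


shadow = DeviceShadow()
shadow.report(power="on", brightness=40)
shadow.request(brightness=75)

pending = shadow.delta()   # what the device should apply when it reconnects
shadow.report(**pending)   # device applies the change and reports back
print(pending, shadow.delta())
```

Applications can read or update the shadow at any time, which is what lets them work with devices that are only intermittently connected.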


Some Popular IoT Platforms


• Supports over 60 regulatory frameworks worldwide.
• Predix Services
• Based on Pivotal Cloud Foundry

• Easily integrate Azure IoT Suite with your systems and applications,
including Salesforce, SAP, Oracle Database, and Microsoft Dynamics
• Azure IoT Suite packages together Azure IoT services with
preconfigured solutions.
• Supports HTTP, Advanced Message Queuing Protocol (AMQP), and
MQ Telemetry Transport (MQTT).
• Gateway SDK


Some Popular IoT Platforms


• Machine Learning - Automate data processing and rank data based on
learned priorities.
• Raspberry Pi Support - Develop IoT apps that leverage Raspberry Pi,
cognitive capabilities and APIs
• Real-Time Insights - Contextualize and analyze real-time IoT data

• Coldlight - IoT Analytics


• Augmented Reality Integration (Vuforia Studio Enterprise)
• Edge Microserver & "Always On" SDK

• The messaging broker supports connections using native MQTT and MQTT over WebSockets.
• Xively provides a C client library for use on devices
• Xively provides an application for integrating connected products into the
Salesforce Service Cloud.


Open Source Platforms





Thank You
