
UNIT 5

DATA HANDLING &


ANALYTICS
What is big data analytics?
 As one of the most "hyped" terms in the market today,
there is no consensus on how to define big data.
 The term is often used synonymously with related concepts
such as Business Intelligence (BI) and data mining.
 It is true that all three terms are about analysing data, in
many cases with advanced analytics. But the big data
concept differs from the other two when data volumes, the
number of transactions and the number of data sources are
so big and complex that they require special methods and
technologies in order to draw insight out of the data (for
instance, traditional data warehouse solutions may fall
short when dealing with big data).
 This also forms the basis for the most used definition of
big data, the three Vs: Volume, Velocity and Variety, as
shown in the figure.



 Volume: Large amounts of data, from datasets with sizes of terabytes
to zettabytes.
 Velocity: Large amounts of data from transactions with a high refresh
rate, resulting in data streams coming in at great speed, where the time to
act on the basis of these data streams is often very short. There is a shift
from batch processing to real-time streaming.
 Variety: Data come from different data sources. First, data can
come from both internal and external data sources. More importantly,
data can come in various formats, such as transaction and log data from
various applications, structured data such as database tables, semi-
structured data such as XML data, unstructured data such as text,
images, video streams, audio, and more. There is a shift from
solely structured data to increasingly more unstructured data, or a
combination of the two.
 Definition
"Big data is high-volume, high-velocity and/or
high-variety information assets that demand
cost-effective, innovative forms of information
processing that enable enhanced insight,
decision making, and process automation."



 What data are we talking about?
 Organizations have a long tradition of capturing
transactional data. Apart from that, organizations
nowadays are capturing additional data from their
operational environment at an increasingly fast
pace. Some examples are listed here.

 Web data. Customer-level web behaviour data, such
as page views, searches, reading reviews and
purchasing, can be captured. They can enhance
performance in areas such as next best offer, churn
modelling, customer segmentation and targeted
advertisement.
 Text data (email, news, Facebook feeds, documents, etc.) is
one of the biggest and most widely applicable types of big
data. The focus is typically on extracting key facts from the
text and then using those facts as inputs to other analytic
processes (for example, automatically classifying insurance
claims as fraudulent or not); a minimal sketch of such text
classification follows below.
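The sketch below is only an illustration of that text-classification idea, using scikit-learn (a library choice of ours, not named in the slides) and invented example claims; it is not a production fraud model.

# Illustrative only: classify short insurance-claim texts as fraudulent or not.
# Library choice (scikit-learn) and all example texts/labels are assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

claims = [
    "Windscreen cracked by gravel on the motorway",
    "Phone lost again, third claim this month",
    "Kitchen fire damaged two cabinets, photos attached",
    "Brand new laptop stolen, no receipt, no police report",
]
labels = [0, 1, 0, 1]  # 0 = looks genuine, 1 = looks fraudulent

# TF-IDF turns each text into a numeric feature vector;
# logistic regression learns a simple decision rule over those features.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(claims, labels)

print(model.predict(["Expensive watch lost, no receipt available"]))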
 Time and location data. GPS, mobile phones and
Wi-Fi connections make time and location information a
growing source of data. At an individual level, many
organizations have come to realize the power of knowing when
their customers are at which location. Equally important is
to look at time and location data at an aggregated level. As
more individuals open up their time and location data more
publicly, lots of interesting applications start to emerge.
Time and location data is one of the most privacy-sensitive
types of big data and should be treated with great caution.
 Smart grid and sensor data. Sensor data are collected
nowadays from cars, oil pipes, wind turbines and more, and
they are collected at extremely high frequency. Sensor
data provide powerful information on the performance
of engines and machinery. They enable easier diagnosis of
problems and faster development of mitigation procedures.
 Social network data. Within social network sites like
Facebook, LinkedIn and Instagram, it is possible to do link
analysis to uncover the network of a given user. Social
network analysis can give insights into which
advertisements might appeal to given users; a small link-
analysis sketch follows below.
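As a rough illustration of link analysis only, the sketch below builds a tiny invented friendship graph with NetworkX (our library choice, not mentioned in the slides) and inspects a user's neighbourhood and centrality.

# Toy link analysis on an invented friendship graph (illustrative only).
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("alice", "carol"),
    ("bob", "carol"), ("carol", "dave"), ("dave", "erin"),
])

# Who is directly connected to a given user?
print("alice's network:", list(G.neighbors("alice")))

# Degree centrality is a simple measure of how connected each user is;
# such scores could feed decisions about which ads to show to whom.
print(nx.degree_centrality(G))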
Characteristics of Big Data

Volume
 Big data implies enormous volumes of data. It used
to be that employees created data. Now that data is
generated by machines, networks and human
interaction on systems like social media, the volume
of data to be analyzed is massive. Yet, Inderpal
states that the volume of data is not as much of a
problem as other V's like veracity.



Characteristics of Big Data
Variety
 Variety refers to the many sources and types of
data, both structured and unstructured. We used to
store data from sources like spreadsheets and
databases. Now data comes in the form of emails,
photos, videos, monitoring devices, PDFs, audio,
etc. This variety of unstructured data creates
problems for storing, mining and analyzing data.
Jeff Veis, VP Solutions at HP Autonomy, presented
how HP is helping organizations deal with big data
challenges, including data variety.
Characteristics of Big Data
Velocity
 Big Data velocity deals with the pace at which data
flows in from sources like business processes,
machines, networks and human interaction with
things like social media sites, mobile devices, etc.
The flow of data is massive and continuous. This
real-time data can help researchers and
businesses make valuable decisions that provide
strategic competitive advantage and ROI, if they
are able to handle the velocity. Inderpal suggests
that sampling data can help deal with issues like
volume and velocity.
Characteristics of Big Data
Veracity
 Big Data veracity refers to the biases, noise and
abnormality in data. Is the data that is being stored
and mined meaningful to the problem being
analyzed? Veracity in data analysis is the biggest
challenge when compared to things like volume
and velocity. In scoping out your big data strategy
you need to have your team and partners work to
help keep your data clean, and processes in place to
keep 'dirty data' from accumulating in your systems.
Characteristics of Big Data
Validity
 Like veracity, validity is the issue of
whether the data is correct and accurate for the
intended use. Clearly, valid data is key to making
the right decisions. Phil Francisco, VP of Product
Management at IBM, spoke about IBM's big data
strategy and the tools they offer to help with data
veracity and validity.



Characteristics of Big Data
Volatility
 Big data volatility refers to how long data is valid
and how long it should be stored. In this world of
real-time data you need to determine at what point
data is no longer relevant to the current analysis.

 Big data clearly deals with issues beyond volume,
variety and velocity, extending to other concerns like
veracity, validity and volatility. To hear about other big
data trends and presentations, follow the Big Data
Innovation Summit on Twitter: #BIGDBN.
Data Processing



IoT Analytics Lifecycle and Techniques

 1st Phase – IoT Data Collection: As part of this phase,
IoT data are collected and enriched with the proper
contextual metadata, such as location information and
timestamps.
 Moreover, the data are validated in terms of their
format and source of origin. They are also validated in
terms of their integrity, accuracy and consistency.
 Note that IoT data collection presents several
peculiarities when compared to traditional
consolidation of distributed data sources, such as the
need to deal with heterogeneous IoT streams; a small
enrich-and-validate sketch follows below.
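The following is a minimal Python sketch of this phase under our own assumptions (field names such as device_id and temperature, and the validation rules, are invented); it is only meant to make the enrich-then-validate idea concrete.

# Illustrative sketch of IoT data collection: enrich a raw reading with
# contextual metadata, then run basic validation. Field names are invented.
from datetime import datetime, timezone

def enrich(reading, device_location):
    # Attach contextual metadata (timestamp, location) to a raw reading.
    return {
        **reading,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "location": device_location,
    }

def is_valid(reading):
    # Format/integrity checks: required fields present and a plausible range.
    has_fields = {"device_id", "temperature", "timestamp"} <= reading.keys()
    return has_fields and -40.0 <= reading["temperature"] <= 85.0

raw = {"device_id": "sensor-17", "temperature": 23.4}
enriched = enrich(raw, device_location="plant floor 2")
print(enriched, "valid:", is_valid(enriched))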
IoT Analytics Lifecycle and Techniques

 2nd Phase – IoT Data Analysis: This phase deals with
the structuring, storage and ultimate analysis of IoT data
streams.
 The analysis involves the employment of data
mining and machine learning techniques such as
classification, clustering and rule mining.
 These techniques are typically used to transform IoT data
into actionable knowledge; a small classification sketch
follows below.
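As a toy illustration of the classification step only, the sketch below trains a decision tree with scikit-learn (our choice of library) on invented temperature/vibration readings; real IoT analysis pipelines are far richer.

# Toy classification of sensor readings into "ok" / "faulty" machine states.
# The data, thresholds and library choice (scikit-learn) are assumptions.
from sklearn.tree import DecisionTreeClassifier

# Each row: [temperature in degrees C, vibration in mm/s]
readings = [[45, 1.2], [47, 1.0], [90, 6.5], [88, 7.1], [50, 1.5], [95, 8.0]]
states = ["ok", "ok", "faulty", "faulty", "ok", "faulty"]

clf = DecisionTreeClassifier(max_depth=2)
clf.fit(readings, states)

# Classify a new reading arriving from the IoT data stream.
print(clf.predict([[85, 6.0]]))  # likely "faulty"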



IoT Analytics Lifecycle and Techniques

 3rd Phase – IoT Data Deployment, Operationalization
and Reuse:
 As part of this phase, the IoT analytics techniques
identified in the previous steps are actually deployed,
thus becoming operational.
 This phase also ensures the visualization of the IoT
data/knowledge according to the needs of the
application. Moreover, it enables the reuse of IoT
knowledge and datasets across different applications.



Flow of Data

Business Case Evaluation → Data Identification → Data Acquisition and
Filtering → Data Extraction → Data Validation → Data Representation →
Data Analysis → Data Visualization → Utilization of Results


Properties of Big Data
Data Analytic Techniques and Technologies

 The applications of the IoT Big Data Platform can be
classified into four main categories:
i) deep understanding and insight knowledge,
ii) real-time actionable insight,
iii) performance optimization, and
iv) proactive and predictive applications.



Data Analytic Techniques and Technologies

Batch Processing
 Batch processing assumes that the data to be processed
is already present in a database.
 The most widely used tool for this case is Hadoop
MapReduce.
 MapReduce is a programming model and Hadoop an
implementation of it, allowing large data sets to be
processed with a parallel, distributed algorithm on a cluster.
 It can run on inexpensive hardware, lowering the cost of
a computing cluster; a minimal sketch of the MapReduce
model follows below.
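To make the programming model concrete, here is a minimal word-count sketch written in plain Python in the MapReduce style (map emits key/value pairs, reduce aggregates per key); it is not actual Hadoop code, only an illustration of the model.

# MapReduce-style word count, sketched in plain Python (not Hadoop code).
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in the input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce step: sum the counts per key; Hadoop distributes this per key
    # across the cluster, here it is done in a single process.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data needs big clusters", "hadoop processes big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))  # e.g. {'big': 3, 'data': 2, ...}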



Data Analytic Techniques and Technologies

Stream Processing
 Stream processing is a computer programming
paradigm, equivalent to dataflow programming
and reactive programming, which allows some
applications to exploit a limited form of parallel
processing more easily.
 Flink is a streaming dataflow engine that provides
data distribution, communication and fault
tolerance; a small sketch of the streaming idea
follows below.
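The sketch below illustrates the streaming idea in plain Python (it is not Flink code): events are processed one by one as they arrive, here with a simple sliding-window average over invented temperature readings.

# Streaming-style processing sketched in plain Python (not Flink code):
# each reading is handled as it arrives and a sliding-window average is updated.
from collections import deque

def sliding_average(stream, window_size=3):
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        # Act on every event immediately instead of batching everything first.
        yield sum(window) / len(window)

sensor_stream = [21.0, 22.5, 23.1, 40.2, 22.8]  # invented temperature readings
for avg in sliding_average(sensor_stream):
    print(round(avg, 2))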
Data Analytic Techniques and Technologies

Machine Learning
 Machine learning is the field of study that gives
computers the ability to learn without being explicitly
programmed.
 It is especially useful in the context of IoT when some
properties of the collected data need to be discovered
automatically.
 Apache Spark comes with its own machine learning
library, called MLlib. It consists of common learning
algorithms and utilities, including classification,
regression, clustering, collaborative filtering and
dimensionality reduction; a small clustering example
follows below.
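The example below is a minimal clustering sketch using Spark's DataFrame-based MLlib API (PySpark), assuming a local Spark installation; the sensor values are invented and the parameter k=2 is arbitrary.

# Minimal KMeans clustering of invented IoT readings with Spark MLlib (PySpark).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("iot-clustering").getOrCreate()

# Invented temperature/humidity readings; two obvious groups.
df = spark.createDataFrame(
    [(21.0, 35.0), (22.5, 40.0), (80.0, 10.0), (82.0, 12.0)],
    ["temperature", "humidity"],
)

# MLlib expects the inputs assembled into a single vector column.
assembler = VectorAssembler(inputCols=["temperature", "humidity"],
                            outputCol="features")
features = assembler.transform(df)

model = KMeans(k=2, seed=42).fit(features)
model.transform(features).select("temperature", "humidity", "prediction").show()

spark.stop()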
Data Analytic Techniques and Technologies

Data Visualisation
 Freeboard offers simple dashboards, which are readily
usable sets of widgets able to display data. There is a
direct Orion FIWARE connector.
 Freeboard also offers a REST API that allows the
displays to be controlled.
 Tableau Public is a free service that lets anyone publish
interactive data to the web. Once on the web, anyone
can interact with the data, download it, or create their
own visualizations of it.
 No programming skills are required.
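Where a quick programmatic plot of IoT data is wanted instead of these hosted tools, a minimal matplotlib sketch (our own choice, unrelated to Freeboard or Tableau; the readings are invented) could look like this:

# Quick plot of invented IoT sensor readings with matplotlib (illustrative only).
import matplotlib.pyplot as plt

hours = list(range(24))
temperature = [18 + 0.5 * h if h < 12 else 24 - 0.3 * (h - 12) for h in hours]

plt.plot(hours, temperature, marker="o")
plt.xlabel("Hour of day")
plt.ylabel("Temperature (°C)")
plt.title("Daily temperature profile for one sensor")
plt.show()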



Data Collection Using Low-power,
Long-range Radios



PaaS for IoT using WAZIUP software Platform



iKaaS Software Platform
