
Unit I

Big Data
• Using data to understand customers and business operations to
sustain and foster growth and profitability is an increasingly
challenging task for today’s enterprises.
• As more and more data is available, timely processing of data with
the traditional tools is becoming impractical.
• This phenomenon of having a huge set of data coming in real time
is termed big data.
• Big data has become something of a buzzword, but the real substance
is the analytics behind the scenes.
• ‘Big’ is a relative term: it depends on the organization’s size
and on how the organization interprets the term.
• Big data has become a popular term to describe the exponential
growth, availability and the use of information, both structured and
unstructured.
Sources of Big Data
• Where does big data come from?
• A simple answer is ‘everywhere’.
• The sources we ignored earlier because of technical
limitations are treated as gold mines today.
• Big data may come from web logs, RFIDs, GPS systems,
sensor networks, social networks, IoT, search indices,
call detail records, science experiments like nuclear physics,
medical records, military surveillance, photo archives,
video archives, e-commerce practices, etc.
• Since the advent of data warehouses in the early 1990s,
companies have been storing relevant data in large volumes.
• Many believe that big data is defined not only by the data
itself; variety, velocity, veracity, variability and value
proposition are also important aspects of Big Data.
The Vs that define Big Data
• Big data is typically defined by three “V”s :
– Volume,
– Variety and
– Velocity.

• In addition to these three, leading big data
solution providers added other Vs such as
– Veracity (IBM)
– Variability (SAS)
– Value Proposition
1. Volume
• The most common trait of Big Data.
• Many factors contribute to the exponential increase in data
volume, such as transaction-based data and automatically
generated RFID and GPS data.
• Forget about data warehouses; it’s data lakes now.
• Earlier, limited technology made storing and processing
data challenging. Now, with lower storage costs and
advanced tools, the focus has shifted to creating value out of
volume.
• In 2009, the whole world had about 0.8 zettabytes (1 ZB = 10^21
bytes) of data. The widely quoted projection is 35 ZB by 2020 (IBM).
• In 2016, annual internet traffic was about 1.3 ZB.
• Storing 1 yottabyte (10^24 bytes) would require about 250 trillion
DVDs.
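The DVD figure above can be checked with a little arithmetic. The sketch below assumes decimal (SI) units and two illustrative disc capacities that are not stated on the slide: a round 4 GB, which reproduces the 250 trillion figure, and the nominal 4.7 GB of a single-layer DVD.

```python
# Decimal (SI) storage units, in bytes.
ZETTABYTE = 10**21
YOTTABYTE = 10**24


def discs_needed(total_bytes: int, disc_capacity_bytes: int) -> int:
    """Return the number of discs needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // disc_capacity_bytes)  # ceiling division


# Assumption: a round 4 GB per disc, which matches the slide's estimate.
dvds_4gb = discs_needed(YOTTABYTE, 4 * 10**9)
# Assumption: 4.7 GB, the nominal capacity of a single-layer DVD.
dvds_4_7gb = discs_needed(YOTTABYTE, 47 * 10**8)

print(f"1 YB = {YOTTABYTE // ZETTABYTE} ZB")
print(f"{dvds_4gb:,} discs at 4 GB each")
print(f"{dvds_4_7gb:,} discs at 4.7 GB each")
```

Under the 4.7 GB assumption the count drops to a little under 213 trillion, so the slide's number is best read as an order-of-magnitude estimate.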
2. Variety
• Data today is present in all types of formats,
from text documents to email, sensor-captured
data, video, audio, etc.
• By some estimates, 80-85% of the data in
organizations is in some sort of
unstructured or semi-structured format.
• But the value of the variety of these data sets
cannot be ignored.
3. Velocity
• Velocity refers to how fast the data is being
produced and how fast the data must be processed
(captured, stored and analyzed) to meet the need or
demand.
• Various smart devices are driving an increasing need to
deal with torrents of data in near-real time.
• It is the most overlooked characteristic of Big Data.
• Reacting quickly enough to deal with velocity is quite
challenging.
• Based on this one characteristic alone, another class of
analytics has been emerging, known as “in-motion
analytics” or “data stream analytics”.
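As a rough illustration of what "in-motion" analytics means, the sketch below updates a result as each event arrives instead of storing everything first and analyzing later. It is a minimal single-process example, not a real streaming engine; the function name `events_per_window`, the 2-second window and the click timestamps are all hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def events_per_window(
    timestamps: Iterable[float], window_seconds: float
) -> Iterator[Tuple[float, int]]:
    """Yield (timestamp, events seen in the trailing window) for each event.

    Events are assumed to arrive in non-decreasing timestamp order.
    """
    window: Deque[float] = deque()
    for ts in timestamps:
        window.append(ts)
        # Evict events that have fallen out of the trailing window.
        while window[0] <= ts - window_seconds:
            window.popleft()
        yield ts, len(window)


# Hypothetical click timestamps, in seconds.
clicks = [0.0, 0.5, 1.0, 4.0, 4.2, 4.4, 4.6, 10.0]
counts = list(events_per_window(clicks, window_seconds=2.0))
for ts, count in counts:
    print(f"t={ts:>4}: {count} event(s) in the last 2 s")
```

Only the events inside the current window are held in memory, which is the property that lets this style of analysis keep up with a continuous flow of data.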
4. Veracity
• Veracity is a term coined by IBM that is
being used as the fourth V to describe Big
Data.
• It refers to conformity to facts: the
accuracy, quality, truthfulness and/or
trustworthiness of the data.
• Some tools have been developed by IBM to
check the quality and trustworthiness of data.
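To make the idea of checking data quality concrete, here is a minimal sketch of record-level validation. It is not one of the IBM tools mentioned above; the field names (`customer_id`, `age`, `email`) and the rules are hypothetical and only illustrate the kind of checks involved.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def quality_issues(record: Record) -> List[str]:
    """Return a list of human-readable problems found in one record."""
    issues = []
    if not record.get("customer_id"):
        issues.append("missing customer_id")
    age = record.get("age")
    if age is None or not age.isdigit() or not 0 < int(age) < 120:
        issues.append("implausible age")
    email = record.get("email") or ""
    if "@" not in email:
        issues.append("malformed email")
    return issues


# Hypothetical records: the first is clean, the second fails every rule.
records = [
    {"customer_id": "c1", "age": "34", "email": "a@example.com"},
    {"customer_id": "", "age": "240", "email": "not-an-email"},
]
for record in records:
    print(record["customer_id"] or "<none>", quality_issues(record))
```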
5. Variability
• Data flows can be very inconsistent with
periodic peaks.
• Something big trending on social media will
result in peaks of data that may last only a limited
time but are challenging to manage.
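One simple way to spot such peaks is to compare each interval's volume with a typical baseline. The sketch below flags intervals whose volume exceeds a multiple of the median; the threshold factor of 3 and the posts-per-minute numbers are assumptions chosen for illustration.

```python
from statistics import median
from typing import List, Sequence


def find_peaks(volumes: Sequence[int], factor: float = 3.0) -> List[int]:
    """Return indexes of intervals whose volume exceeds factor x the median."""
    baseline = median(volumes)
    return [i for i, v in enumerate(volumes) if v > factor * baseline]


# Hypothetical posts-per-minute counts; minutes 5 and 6 are a trending burst.
posts_per_minute = [100, 110, 95, 105, 98, 900, 1200, 130, 102, 99]
peak_minutes = find_peaks(posts_per_minute)
print(peak_minutes)
```

The median is used rather than the mean so that the burst itself does not inflate the baseline it is compared against.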
6. Value Proposition
• A preconceived notion of Big Data is that it
contains more patterns and interesting anomalies
than small data.
• Organizations can gain better insight into the
phenomena they study and derive value from it.
• It takes specialized tools and techniques to
analyze big data, and therefore it requires
investment.
• Organizations need to understand the value
proposition before making a decision.
Davenport – Competing on Analytics

“The extensive use of data, statistical
and quantitative analysis, explanatory
and predictive models, and fact-based
management to drive decisions and
actions”.

Re-quoted from Competing on Analytics
Enterprise Analytics. Thomas Davenport. Pearson
Publishing. 2013. Page 9.
Differences We See

• Big Data
• Real-Time use of data
• Selling data
• Decision-Based data

Allen’s Definition of Business Analytics
Utilizing Data to Increase Shareholder Value

Data =
• Big and Small
• Internal and External
• Structured and Non-structured
• Traditional and “New”
• “Free” and Purchased

Analytics Strategic Approaches


Utilizing Data to Increase Shareholder Value

Utilizing =
• Determine Business Needs
• Capture and Store
• Ensure Quality
• Access and Format
• Analyze and Summarize
• Gain Insight and Produce Action
• and … ‘Sell’ It

The Growth of Data

Figure sources:
https://www.domo.com/learn/data-never-sleeps-3-0
http://www.adweek.com/prnewser/how-many-times-do-the-worlds-social-media-users-click-every-minute/117427
The Emergence of Big Data Tools

Figure sources:
http://blogs.forrester.com/category/hadoop
http://solutions.forrester.com/Global/FileLib/webinars/Big_Data_-_Gold_Rush_or_Illusion.pdf
http://hortonworks.com/blog/optimize-your-data-architecture-with-hadoop/
https://www.digitalnewsasia.com/business/forget-data-warehousing-its-data-lakes-now
http://hortonworks.com/blog/big-data-refinery-fuels-next-generation-data-architecture/
http://www.kdnuggets.com/2014/05/big-data-landscape-v30-analyzed.html
http://dataofthings.blogspot.com/2014/04/the-bbbt-sessions-hortonworks-big-data.html
http://www.gartner.com/it-glossary/predictive-analytics
Types of Questions and Analytics
Business Analytics Domain

Questions
– Column 1: What happened? What’s happening? What exactly is the
problem? What actions are needed?
– Column 2: Why is this happening? What will happen next? Why will
it happen?
– Column 3: What should I do? Why should I do it? What’s the best
that can happen? What if we try this?

Enablers
– Column 1: Ad hoc Reports, Dashboards, Data Warehousing, Alerts
– Column 2: Data Mining, Text Mining, Web/Media Mining, Forecasting
– Column 3: Optimization, Simulation, Decision Modeling, Randomized
Testing

Outcomes
– Column 1: Well defined business problems and opportunities
– Column 2: Accurate projections of the future states and conditions
– Column 3: Best possible business decisions and transactions
The Roles of Data Science

Diagram labels: Datafication; Data; Mass Analytic Tools + Visualization;
Machine Learning; Recommender systems; Complex Event Processing;
Data Science Team; Data Scientist; Data Visualization; Data Product;
New Analytic Insights (Information, knowledge, data story)

Sources: DatascienceTh.com; Doing Data Science by O'Neil et al (2013)
Picture: http://www.clipartpanda.com/categories/scientist-clip-art
What would make a good data visualization designer?

Fundamentals of Big Data Analytics
• Big Data by itself is useless unless business users
do something about it which delivers some value
to the organization.
• The traditional means for capturing, storing and
analyzing data are not capable of dealing with Big
Data effectively and efficiently.
• New technologies are required to deal with the
enormous amount of data.
• Before investing in high-end technologies,
organizations need to make decisions about the data’s
use, importance, velocity, etc.
The success of Big Data Analytics depends on a
number of factors. Some critical factors are
Business Problems Addressed by Big Data Analytics
• The top business problems addressed with the help of Big Data are
process efficiency, cost reduction, enhancing customer
experience and risk management.
• Efficiency and cost reduction with Big Data Analytics (BDA) are
mostly addressed in the manufacturing, government, energy,
communication, media, transport and healthcare sectors.
• Enhanced customer experience may be important for
insurance companies and retailers.
• Risk management is useful for the banking sector and for new
product development.
• Other problems such as fraud detection, identifying new
markets and revenue maximization can also be dealt with
using big data analytics.
Big Data Technologies
• Although a number of different technologies
are useful in analyzing Big Data, most of them
share some common characteristics.
• Three Big Data technologies
stand out from the rest:
– MapReduce
– Hadoop
– NoSQL
MapReduce
• MapReduce is a technique popularized by Google
that distributes the processing of very large
multi-structured data files across a large cluster
of machines.
• High performance is achieved by breaking the
processing into small units of work that can be
run in parallel across thousands of machines (nodes)
in the cluster.
• MapReduce helps organizations process and
analyze large volumes of multi-structured data,
for example in graph analysis, text analysis,
machine learning and data transformation.
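The map and reduce steps can be sketched on a single machine with the classic word-count example. This is only an in-memory illustration of the programming model, not Google's implementation or any framework's API; the function names `map_phase`, `shuffle` and `reduce_phase` are hypothetical, and on a real cluster each phase would run on many machines at once.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


docs = ["big data is big", "data lakes hold big data"]
result = word_count(docs)
print(result)
```

Because each call to `map_phase` looks at one document only and each call to `reduce_phase` looks at one key only, the calls are independent of one another, which is what allows the work to be spread across machines.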
Hadoop
• It is an open source framework for processing, storing
and analyzing massive amounts of distributed,
unstructured data.
• Hadoop was inspired by MapReduce and was designed
to handle petabytes and exabytes of data.
• Rather than hammering away at one huge block of data with
a single machine, Hadoop breaks Big Data up into
multiple parts so that each part can be processed and
analyzed at the same time.
• Sources of data may include log files, social media
feeds and internal data sources.
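The "break it into parts and process them at the same time" idea can be imitated on one machine. The sketch below does not use Hadoop or its APIs: worker threads stand in for cluster nodes, and the helper names, block size and log lines are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_into_blocks(lines: List[str], block_size: int) -> List[List[str]]:
    """Split the input into consecutive blocks of at most block_size lines."""
    return [lines[i:i + block_size] for i in range(0, len(lines), block_size)]


def count_errors(block: List[str]) -> int:
    """Process one block independently: count the lines marked ERROR."""
    return sum(1 for line in block if "ERROR" in line)


# Hypothetical log file: every fourth request failed.
log_lines = [
    f"{'ERROR' if i % 4 == 0 else 'INFO'} request {i}" for i in range(10)
]

blocks = split_into_blocks(log_lines, block_size=3)

# Each worker handles one block; the partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_errors, blocks))

total_errors = sum(partial_counts)
print(f"{len(blocks)} blocks, partial counts {partial_counts}, total {total_errors}")
```

The final sum is the only step that needs all the partial results, so adding more blocks and more workers does not change the logic.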
