
Big Data Analysis

By Dr.R.Venkatesh

Agenda

1. Introduction: Explosion in Quantity of Data
2. Big Data Characteristics
3. Cost Problem (example)
4. Importance of Big Data
5. Usage Example in Big Data
6. Introduction: Hadoop
7. Hadoop Distributed File System
8. Examples of Big Data Projects
9. Importance of Big Data
10. Advantages of Big Data

Introduction: Explosion in Quantity of Data

- [Figure: 1946's ENIAC multiplied some 6,000,000 times over by 2012; the Large Hadron Collider generates data at roughly 40 TB/s.]
- Airbus A380: about 1 billion lines of code; each engine generates 10 TB every 30 minutes, around 640 TB per flight.
- Twitter generates approximately 12 TB of data per day.
- The New York Stock Exchange generates 1 TB of data every day.
- Storage capacity has doubled roughly every three years since the 1980s.

What is big data?

Every day, we create 2.5 quintillion bytes (2.5 exabytes) of data. 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals. This data is big data.

Big Data Characteristics

Volume - too big: terabytes and more of credit card transactions, web usage data, system logs.
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
- Convert 350 billion annual meter readings to better predict power consumption.

Big Data Characteristics

Variety - too complex: truly unstructured data such as social media, customer reviews, call-center records. New insights are found when analyzing these data types together.
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
- Exploit the 80% data growth in images, video, and documents to improve customer satisfaction.

Big Data Characteristics

Velocity - too fast: sensor data, live web traffic, mobile phone usage, GPS data. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify potential fraud.
- Analyze 500 million daily call detail records in real time to predict customer churn faster.

Big Data Characteristics

Veracity: the biases, noise, and abnormality in data. Is the data being stored and mined meaningful to the problem being analyzed?

Validity: closely related to veracity; is the data correct and accurate for its intended use? Clearly, valid data is key to making the right decisions.

Volatility: how long does data remain valid, and how long should it be stored?

Where Is This Big Data Coming From?

- 12+ TB of tweet data every day
- 25+ TB of log data every day
- 30 billion RFID tags today (1.3 billion in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters
- 2+ billion people on the Web by end of 2011

With Big Data, We've Moved into a New Era of Analytics

- Volume: 12+ terabytes of Tweets created daily.
- Velocity: 5+ million trade events per second.
- Variety: 100s of different types of data.
- Veracity: only 1 in 3 decision makers trust their information.

Four Characteristics of Big Data

- Volume: cost-efficiently processing the growing volume of data (50x growth from 2010 to 2020, reaching 35 ZB).
- Velocity: responding to the increasing velocity (30 billion RFID sensors and counting).
- Variety: collectively analyzing the broadening variety (80% of the world's data is unstructured).
- Veracity: establishing the veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions).

What to do with this data?

- Aggregation and statistics: data warehousing
- Indexing, searching, and querying: keyword-based search, pattern matching
- Knowledge discovery: data mining, statistical modeling

WHY IS IT BECOMING IMPORTANT NOW?

How BIG is Big Data?

How are people using it?

"Our ambition was to show how we can use Big Data to improve people's lives and day-to-day experiences."

Cost Problem (example)

What is the cost of processing 1 petabyte of data with 1,000 nodes?
- 1 PB = 10^15 bytes = 1 million gigabytes = 1 thousand terabytes.
- At a rate of 15 MB/s, one node processes 500 GB in about 9 hours: 15 MB/s x 3,600 s x 9 h = 486,000 MB, roughly 500 GB.
- A single 9-hour run of 1,000 nodes at $0.34 per node-hour costs 1,000 x 9 x $0.34 = $3,060.
- Processing a full petabyte per node requires 1,000,000 GB / 500 GB = 2,000 such runs.
- The cost for 1,000 cloud nodes each processing 1 PB: 2,000 x $3,060 = $6,120,000.
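The arithmetic above can be checked with a few lines of Python; this is just a back-of-the-envelope sketch using the slide's figures (15 MB/s per node, $0.34 per node-hour):

```python
# Back-of-the-envelope cost model for the example above.
MB_PER_GB = 1_000
GB_PER_PB = 1_000_000

rate_mb_s = 15             # per-node processing rate (MB/s), from the slide
batch_gb = 500             # data processed per batch, from the slide
hours_per_batch = 9
cost_per_node_hour = 0.34  # USD, from the slide
nodes = 1_000

# Sanity check: 15 MB/s for 9 hours is roughly 500 GB.
processed_mb = rate_mb_s * 3_600 * hours_per_batch   # 486,000 MB
assert abs(processed_mb / MB_PER_GB - batch_gb) < 20

cost_per_run = nodes * hours_per_batch * cost_per_node_hour  # $3,060
runs_per_pb = GB_PER_PB // batch_gb                          # 2,000
total = runs_per_pb * cost_per_run                           # $6,120,000

print(f"cost per 9-hour run of {nodes} nodes: ${cost_per_run:,.0f}")
print(f"cost for {nodes} nodes each processing 1 PB: ${total:,.0f}")
```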

Meaningfulness of Analytics

We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day, with 10^9 people being tracked. If everyone behaves randomly (i.e., there are no terrorists), will data mining detect anything suspicious?

Expected number of "suspicious" pairs of people under pure chance: about 2,500. That is far too many candidates to check one by one; we need additional evidence to find truly suspicious pairs of people in some efficient way.
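The slide does not state the parameters behind its figure of 2,500, so the sketch below only shows the shape of the calculation. The values used (number of people, days, hotels, and stay probability) are assumptions taken from the well-known textbook version of this example, and with them the same calculation yields roughly 250,000 pairs; the slide's smaller figure presumably reflects different parameter choices.

```python
from math import comb

# Illustrative parameters (assumptions; not stated on the slide):
people = 10**9   # people being tracked
days = 1_000     # days of observation
p_stay = 0.01    # probability a given person is at a hotel on a given day
hotels = 10**5   # number of hotels

# Probability two specific people are at the SAME hotel on a given day:
p_same_day = p_stay * p_stay * (1 / hotels)

# Probability they do so on two distinct days, approximated by the
# expected number of day-pairs on which it happens:
p_two_days = comb(days, 2) * p_same_day**2

# Expected number of "suspicious" pairs among all pairs of people,
# even though everyone behaves randomly:
expected_pairs = comb(people, 2) * p_two_days
print(f"expected suspicious pairs by pure chance: {expected_pairs:,.0f}")
```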

Sentiment Analysis

Hotel feedback:
"I had a fantastic time on holiday at your resort. The service was excellent and friendly. My family all really enjoyed themselves. The pool was closed, which was kind of terrible though."

Sentiment Analysis

Take a list of positive and negative words:

Positive   | Negative
-----------|---------
Good       | Bad
Great      | Worse
Fantastic  | Rubbish
Excellent  | Nasty
Friendly   | Awful
Awesome    | Terrible
Enjoyed    | Bogus


Sentiment Analysis

Count the matches in the feedback:

Positive   | Negative
-----------|---------
Fantastic  | Terrible
Excellent  |
Friendly   |
Enjoyed    |
Total: 4   | Total: 1

Sentiment Analysis

Subtract negative from positive: 4 - 1 = 3

Overall sentiment: positive
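A minimal sketch of this word-counting approach in Python; the word lists and the feedback text are taken from the slides above:

```python
# Minimal word-list sentiment scorer, as described in the slides.
import re

POSITIVE = {"good", "great", "fantastic", "excellent",
            "friendly", "awesome", "enjoyed"}
NEGATIVE = {"bad", "worse", "rubbish", "nasty",
            "awful", "terrible", "bogus"}

def sentiment_score(text: str) -> int:
    """Return (#positive words) - (#negative words) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

feedback = ("I had a fantastic time on holiday at your resort. "
            "The service was excellent and friendly. My family all "
            "really enjoyed themselves. The pool was closed, "
            "which was kind of terrible though.")

score = sentiment_score(feedback)
print(score)                                    # 4 - 1 = 3
print("positive" if score > 0 else "negative")  # overall sentiment
```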

Top 5 Open-Source Tools for Big Data

Apache Hadoop, a data-processing framework:
- Hadoop is written in Java.
- "Hadoop" is often used as a synonym for big data.
- It supports multiple operating systems: Windows, Linux, OS X.
- Hadoop's MapReduce is based on the MapReduce model developed at Google.

Implementation of Big Data: MapReduce Overview

MapReduce:
- A data-parallel programming model with an associated parallel and distributed implementation for commodity clusters.
- Pioneered by Google, which processes 20 PB of data per day with it.
- Popularized by open-source Hadoop; used by Yahoo!, Facebook, Amazon, and a growing list of others.

Parallel DBMS technologies:
- Popularly used for more than two decades.
- Research projects: Gamma, Grace, and others.
- Commercial: a multi-billion-dollar industry, but accessible to only a privileged few.
- Relational data model, indexing, familiar SQL interface, advanced query optimization; well understood and studied.

MapReduce

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.

"Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

"Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
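A minimal in-memory sketch of the map/shuffle/reduce flow just described, using word count as the running example. This is plain Python, not the Hadoop API; a real Hadoop job distributes these phases across the nodes of a cluster:

```python
# Toy in-memory MapReduce word count.
import re
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate <word, 1> pair for every word."""
    for word in re.findall(r"\w+", document.lower()):
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize one key's values; here, sum the counts."""
    return key, sum(values)

documents = ["I like parallel computing.",
             "I also took courses on parallel computing"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'i': 2, 'like': 1, 'parallel': 2, 'computing': 2, ...}
```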

Implementation of Big Data: MapReduce

Dataflow: raw input as <K1, V1> pairs -> MAP -> intermediate <K2, V2> pairs -> REDUCE -> output <K3, V3> pairs.

Hadoop Distributed File System

- The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
- HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size.
- Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
- Files in HDFS are "write once" and have strictly one writer at any time.

HDFS goals:
- Store large data sets
- Cope with hardware failure
- Emphasize streaming data access
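A minimal sketch of the block-and-replica idea described above. This is plain Python, not the HDFS API; the 128 MB block size, factor-3 replication, and 10-node cluster are illustrative values standing in for the per-file configurable settings:

```python
# Toy illustration of HDFS-style storage: split a file into fixed-size
# blocks and place several replicas of each block on distinct nodes.
import itertools

def place_blocks(file_size_mb: int, block_mb: int = 128,
                 replication: int = 3, nodes: int = 10):
    """Return {block_id: [ids of nodes holding a replica]}."""
    num_blocks = -(-file_size_mb // block_mb)  # ceiling division
    node_cycle = itertools.cycle(range(nodes))
    placement = {}
    for block_id in range(num_blocks):
        # Each of the `replication` copies goes on a distinct node,
        # so losing any one node never loses a block.
        placement[block_id] = list(itertools.islice(node_cycle, replication))
    return placement

# A 1 GB file becomes 8 blocks of 128 MB, each stored on 3 of 10 nodes.
for block, holders in place_blocks(1024).items():
    print(f"block {block}: replicas on nodes {holders}")
```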

Example of Hadoop Programming

Word count: "I like parallel computing. I also took courses on parallel computing"

- parallel: 2
- computing: 2
- I: 2
- like: 1

Example of Hadoop Programming

Intuition: design the <key, value> pairs. Assume each node will process one paragraph.

Map:
- What is the key?
- What is the value?

Reduce:
- What to collect?
- What to reduce?

One possible design is sketched below.
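One common answer: the mapper emits <word, 1> for every word, and the reducer sums the values collected for each word. The sketch below follows the Hadoop Streaming convention, where mappers and reducers are ordinary programs that read stdin and write tab-separated key/value lines to stdout; the file names mapper.py and reducer.py are my own, not from the slides.

```python
# mapper.py -- read input text from stdin, emit "word<TAB>1" per word.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"\w+", line.lower()):
        print(f"{word}\t1")  # key = word, value = 1
```

```python
# reducer.py -- read "word<TAB>count" lines (grouped by word, as the
# framework's sort/shuffle guarantees) and sum the counts per word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")  # emit the finished key
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The same pair can be tested locally, with sort standing in for the shuffle: cat input.txt | python mapper.py | sort | python reducer.py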

Nowadays

Who uses Hadoop?
- Amazon/A9
- AOL
- Facebook
- Fox Interactive Media
- Google
- IBM
- New York Times
- PowerSet (now Microsoft)
- Quantcast
- Rackspace/Mailtrust
- Veoh
- Yahoo!

More at http://wiki.apache.org/hadoop/PoweredBy


Nowadays

When you visit Yahoo!, you are interacting with data processed with Hadoop:
- Content optimization
- Search index
- Machine learning (e.g., spam filters)
- Ads optimization
- Content feed processing

Nowadays

- Yahoo! has ~20,000 machines running Hadoop.
- The largest clusters are currently 2,000 nodes.
- Several petabytes of user data (compressed, unreplicated).
- Yahoo! runs hundreds of thousands of jobs every month.

Hadoop in Yahoo!

The database for Search Assist is built using Hadoop:
- 3 years of log data
- 20 steps of MapReduce

                  | Before Hadoop | After Hadoop
Language          | C++           | Python
Development time  | 2-3 weeks     | 2-3 days

Examples of Big Data Projects

Here's another way to capture what a Big Data project could mean for your company or project: study how others have applied the idea. Here are some real-world examples of Big Data in action:

- Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.
- Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain it. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.
- Manufacturers are also monitoring social networks, but with a different goal than marketers: they are using them to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
- Financial services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.

Examples of Big Data Projects

- Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising media.
- Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed and which ones need a validating in-person visit from an agent.
- By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.
- Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
- Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
- Governments are making data public at the national, state, and city levels so that users can develop new applications that generate public good.
- Sports teams are using data for tracking ticket sales and even for tracking team strategies.

Importance of Big Data

Jobs:
- The U.S. could face a shortage by 2015 of 140,000 to 190,000 people with "deep analytical talent," and of 1.9 million people capable of analyzing data in ways that enable business decisions (McKinsey & Co).
- The Big Data industry is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business.

Advantages

- Real-time rerouting of transportation fleets based on weather patterns
- Customer sentiment analysis based on social postings
- Targeted disease therapies based on genomic data
- Allocation of disaster relief supplies based on mobile and social messages from victims
- Cars driving themselves

Thank You!
