
Big Data Analysis

By Dr.R.Venkatesh

Agenda

1. Introduction: Explosion in Quantity of Data
2. Big Data Characteristics
3. Cost Problem (example)
4. Importance of Big Data
5. Usage Example in Big Data
6. Introduction: Hadoop
7. Hadoop Distributed File System
8. Examples of Big Data Projects
9. Importance of Big Data
10. Advantages of Big Data

Introduction: Explosion in Quantity of Data

- [Figure: 1946's ENIAC multiplied some 6,000,000 times over by 2012; the Large Hadron Collider generates data at roughly 40 TB/s.]
- Airbus A380: about 1 billion lines of code; each engine generates 10 TB every 30 minutes, around 640 TB per flight.
- Twitter generates approximately 12 TB of data per day.
- The New York Stock Exchange generates 1 TB of data every day.
- Storage capacity has doubled roughly every three years since the 1980s.

What is big data?

Every day, we create 2.5 quintillion bytes (2.5 exabytes) of data. 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals. This data is big data.

Big Data Characteristics

Volume - too big: terabytes and more of credit card transactions, web usage data, system logs.
- Turn 12 terabytes of Tweets created each day into improved product sentiment analysis.
- Convert 350 billion annual meter readings to better predict power consumption.

Big Data Characteristics

Variety - too complex: truly unstructured data such as social media, customer reviews, call-center records. New insights are found when analyzing these data types together.
- Monitor hundreds of live video feeds from surveillance cameras to target points of interest.
- Exploit the 80% data growth in images, video, and documents to improve customer satisfaction.

Big Data Characteristics

Velocity - too fast: sensor data, live web traffic, mobile phone usage, GPS data. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
- Scrutinize 5 million trade events created each day to identify potential fraud.
- Analyze 500 million daily call detail records in real time to predict customer churn faster.

Big Data Characteristics

Veracity: the biases, noise, and abnormality in data. Is the data being stored and mined meaningful to the problem being analyzed?

Validity: closely related to veracity; is the data correct and accurate for its intended use? Clearly, valid data is key to making the right decisions.

Volatility: how long does data remain valid, and how long should it be stored?

Where Is This Big Data Coming From?

- 12+ TB of tweet data every day
- 25+ TB of log data every day
- 30 billion RFID tags today (1.3 billion in 2005)
- 4.6 billion camera phones worldwide
- 100s of millions of GPS-enabled devices sold annually
- 76 million smart meters
- 2+ billion people on the Web by end of 2011

With Big Data, We've Moved into a New Era of Analytics

- Volume: 12+ terabytes of Tweets created daily.
- Velocity: 5+ million trade events per second.
- Variety: 100s of different types of data.
- Veracity: only 1 in 3 decision makers trust their information.

Four Characteristics of Big Data

- Volume: cost-efficiently processing the growing volume of data (50x growth from 2010 to 2020, reaching 35 ZB).
- Velocity: responding to the increasing velocity (30 billion RFID sensors and counting).
- Variety: collectively analyzing the broadening variety (80% of the world's data is unstructured).
- Veracity: establishing the veracity of big data sources (1 in 3 business leaders don't trust the information they use to make decisions).

What to do with this data?

- Aggregation and statistics: data warehousing
- Indexing, searching, and querying: keyword-based search, pattern matching
- Knowledge discovery: data mining, statistical modeling

WHY IS IT BECOMING IMPORTANT NOW?

How BIG is Big Data?

How are people using it?

"Our ambition was to show how we can use Big Data to improve people's lives and day-to-day experiences."

Cost Problem (example)

What is the cost of processing 1 petabyte of data with 1,000 nodes?
- 1 PB = 10^15 bytes = 1 million gigabytes = 1 thousand terabytes.
- At a rate of 15 MB/s, one node processes 500 GB in about 9 hours: 15 MB/s x 3,600 s x 9 h = 486,000 MB, roughly 500 GB.
- A single 9-hour run of 1,000 nodes at $0.34 per node-hour costs 1,000 x 9 x $0.34 = $3,060.
- Processing a full petabyte per node requires 1,000,000 GB / 500 GB = 2,000 such runs.
- The cost for 1,000 cloud nodes each processing 1 PB: 2,000 x $3,060 = $6,120,000.
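The arithmetic above can be checked with a few lines of Python; this is just a back-of-the-envelope sketch using the slide's figures (15 MB/s per node, $0.34 per node-hour):

```python
# Back-of-the-envelope cost model for the example above.
MB_PER_GB = 1_000
GB_PER_PB = 1_000_000

rate_mb_s = 15             # per-node processing rate (MB/s), from the slide
batch_gb = 500             # data processed per batch, from the slide
hours_per_batch = 9
cost_per_node_hour = 0.34  # USD, from the slide
nodes = 1_000

# Sanity check: 15 MB/s for 9 hours is roughly 500 GB.
processed_mb = rate_mb_s * 3_600 * hours_per_batch   # 486,000 MB
assert abs(processed_mb / MB_PER_GB - batch_gb) < 20

cost_per_run = nodes * hours_per_batch * cost_per_node_hour  # $3,060
runs_per_pb = GB_PER_PB // batch_gb                          # 2,000
total = runs_per_pb * cost_per_run                           # $6,120,000

print(f"cost per 9-hour run of {nodes} nodes: ${cost_per_run:,.0f}")
print(f"cost for {nodes} nodes each processing 1 PB: ${total:,.0f}")
```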

Meaningfulness of Analytics

We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day, with 10^9 people being tracked. If everyone behaves randomly (i.e., there are no terrorists), will data mining detect anything suspicious?

Expected number of "suspicious" pairs of people under pure chance: about 2,500. That is far too many candidates to check one by one; we need additional evidence to find truly suspicious pairs of people in some efficient way.
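The slide does not state the parameters behind its figure of 2,500, so the sketch below only shows the shape of the calculation. The values used (number of people, days, hotels, and stay probability) are assumptions taken from the well-known textbook version of this example, and with them the same calculation yields roughly 250,000 pairs; the slide's smaller figure presumably reflects different parameter choices.

```python
from math import comb

# Illustrative parameters (assumptions; not stated on the slide):
people = 10**9   # people being tracked
days = 1_000     # days of observation
p_stay = 0.01    # probability a given person is at a hotel on a given day
hotels = 10**5   # number of hotels

# Probability two specific people are at the SAME hotel on a given day:
p_same_day = p_stay * p_stay * (1 / hotels)

# Probability they do so on two distinct days, approximated by the
# expected number of day-pairs on which it happens:
p_two_days = comb(days, 2) * p_same_day**2

# Expected number of "suspicious" pairs among all pairs of people,
# even though everyone behaves randomly:
expected_pairs = comb(people, 2) * p_two_days
print(f"expected suspicious pairs by pure chance: {expected_pairs:,.0f}")
```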

Sentiment Analysis

Hotel feedback:
"I had a fantastic time on holiday at your resort. The service was excellent and friendly. My family all really enjoyed themselves. The pool was closed, which was kind of terrible though."

Sentiment Analysis

Take a list of positive and negative words:

Positive   | Negative
-----------|---------
Good       | Bad
Great      | Worse
Fantastic  | Rubbish
Excellent  | Nasty
Friendly   | Awful
Awesome    | Terrible
Enjoyed    | Bogus


Sentiment Analysis

Count the matches in the feedback:

Positive   | Negative
-----------|---------
Fantastic  | Terrible
Excellent  |
Friendly   |
Enjoyed    |
Total: 4   | Total: 1

Sentiment Analysis

Subtract negative from positive: 4 - 1 = 3

Overall sentiment: positive
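A minimal sketch of this word-counting approach in Python; the word lists and the feedback text are taken from the slides above:

```python
# Minimal word-list sentiment scorer, as described in the slides.
import re

POSITIVE = {"good", "great", "fantastic", "excellent",
            "friendly", "awesome", "enjoyed"}
NEGATIVE = {"bad", "worse", "rubbish", "nasty",
            "awful", "terrible", "bogus"}

def sentiment_score(text: str) -> int:
    """Return (#positive words) - (#negative words) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

feedback = ("I had a fantastic time on holiday at your resort. "
            "The service was excellent and friendly. My family all "
            "really enjoyed themselves. The pool was closed, "
            "which was kind of terrible though.")

score = sentiment_score(feedback)
print(score)                                    # 4 - 1 = 3
print("positive" if score > 0 else "negative")  # overall sentiment
```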

Top 5 Open-Source Tools for Big Data

Apache Hadoop, a data-processing framework:
- Hadoop is written in Java.
- "Hadoop" is often used as a synonym for big data.
- It supports multiple operating systems: Windows, Linux, OS X.
- Hadoop's MapReduce is based on the MapReduce model developed at Google.

Implementation of Big Data: MapReduce Overview

MapReduce:
- A data-parallel programming model with an associated parallel and distributed implementation for commodity clusters.
- Pioneered by Google, which processes 20 PB of data per day with it.
- Popularized by open-source Hadoop; used by Yahoo!, Facebook, Amazon, and a growing list of others.

Parallel DBMS technologies:
- Popularly used for more than two decades.
- Research projects: Gamma, Grace, and others.
- Commercial: a multi-billion-dollar industry, but accessible to only a privileged few.
- Relational data model, indexing, familiar SQL interface, advanced query optimization; well understood and studied.

MapReduce

MapReduce is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

A MapReduce program is composed of a Map() procedure that performs filtering and sorting and a Reduce() procedure that performs a summary operation.

"Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem and passes the answer back to its master node.

"Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the output: the answer to the problem it was originally trying to solve.
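A minimal in-memory sketch of the map/shuffle/reduce flow just described, using word count as the running example. This is plain Python, not the Hadoop API; a real Hadoop job distributes these phases across the nodes of a cluster:

```python
# Toy in-memory MapReduce word count.
import re
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate <word, 1> pair for every word."""
    for word in re.findall(r"\w+", document.lower()):
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group intermediate values by key (done by the framework)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: summarize one key's values; here, sum the counts."""
    return key, sum(values)

documents = ["I like parallel computing.",
             "I also took courses on parallel computing"]
intermediate = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(counts)  # {'i': 2, 'like': 1, 'parallel': 2, 'computing': 2, ...}
```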

Implementation of Big Data: MapReduce

Dataflow: raw input as <K1, V1> pairs -> MAP -> intermediate <K2, V2> pairs -> REDUCE -> output <K3, V3> pairs.

Hadoop Distributed File System

- The Hadoop Distributed File System (HDFS) is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System.
- HDFS stores each file as a sequence of blocks; all blocks in a file except the last are the same size.
- Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file.
- Files in HDFS are "write once" and have strictly one writer at any time.

HDFS goals:
- Store large data sets
- Cope with hardware failure
- Emphasize streaming data access
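A minimal sketch of the block-and-replica idea described above. This is plain Python, not the HDFS API; the 128 MB block size, factor-3 replication, and 10-node cluster are illustrative values standing in for the per-file configurable settings:

```python
# Toy illustration of HDFS-style storage: split a file into fixed-size
# blocks and place several replicas of each block on distinct nodes.
import itertools

def place_blocks(file_size_mb: int, block_mb: int = 128,
                 replication: int = 3, nodes: int = 10):
    """Return {block_id: [ids of nodes holding a replica]}."""
    num_blocks = -(-file_size_mb // block_mb)  # ceiling division
    node_cycle = itertools.cycle(range(nodes))
    placement = {}
    for block_id in range(num_blocks):
        # Each of the `replication` copies goes on a distinct node,
        # so losing any one node never loses a block.
        placement[block_id] = list(itertools.islice(node_cycle, replication))
    return placement

# A 1 GB file becomes 8 blocks of 128 MB, each stored on 3 of 10 nodes.
for block, holders in place_blocks(1024).items():
    print(f"block {block}: replicas on nodes {holders}")
```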

Example of Hadoop Programming

Word count: "I like parallel computing. I also took courses on parallel computing"

- parallel: 2
- computing: 2
- I: 2
- like: 1

Example of Hadoop Programming

Intuition: design the <key, value> pairs. Assume each node will process one paragraph.

Map:
- What is the key?
- What is the value?

Reduce:
- What to collect?
- What to reduce?

One possible design is sketched below.
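One common answer: the mapper emits <word, 1> for every word, and the reducer sums the values collected for each word. The sketch below follows the Hadoop Streaming convention, where mappers and reducers are ordinary programs that read stdin and write tab-separated key/value lines to stdout; the file names mapper.py and reducer.py are my own, not from the slides.

```python
# mapper.py -- read input text from stdin, emit "word<TAB>1" per word.
import re
import sys

for line in sys.stdin:
    for word in re.findall(r"\w+", line.lower()):
        print(f"{word}\t1")  # key = word, value = 1
```

```python
# reducer.py -- read "word<TAB>count" lines (grouped by word, as the
# framework's sort/shuffle guarantees) and sum the counts per word.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")  # emit the finished key
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```

The same pair can be tested locally, with sort standing in for the shuffle: cat input.txt | python mapper.py | sort | python reducer.py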

Nowadays

Who uses Hadoop?
- Amazon/A9
- AOL
- Facebook
- Fox Interactive Media
- Google
- IBM
- New York Times
- PowerSet (now Microsoft)
- Quantcast
- Rackspace/Mailtrust
- Veoh
- Yahoo!

More at http://wiki.apache.org/hadoop/PoweredBy


Nowadays

When you visit Yahoo!, you are interacting with data processed with Hadoop:
- Content optimization
- Search index
- Machine learning (e.g., spam filters)
- Ads optimization
- Content feed processing

Nowadays

- Yahoo! has ~20,000 machines running Hadoop.
- The largest clusters are currently 2,000 nodes.
- Several petabytes of user data (compressed, unreplicated).
- Yahoo! runs hundreds of thousands of jobs every month.

Hadoop in Yahoo!

The database for Search Assist is built using Hadoop:
- 3 years of log data
- 20 steps of MapReduce

                  | Before Hadoop | After Hadoop
Language          | C++           | Python
Development time  | 2-3 weeks     | 2-3 days

Examples of Big Data Projects

Here's another way to capture what a Big Data project could mean for your company or project: study how others have applied the idea. Here are some real-world examples of Big Data in action:

- Consumer product companies and retail organizations are monitoring social media like Facebook and Twitter to get an unprecedented view into customer behavior, preferences, and product perception.
- Manufacturers are monitoring minute vibration data from their equipment, which changes slightly as it wears down, to predict the optimal time to replace or maintain it. Replacing it too soon wastes money; replacing it too late triggers an expensive work stoppage.
- Manufacturers are also monitoring social networks, but with a different goal than marketers: they are using them to detect aftermarket support issues before a warranty failure becomes publicly detrimental.
- Financial services organizations are using data mined from customer interactions to slice and dice their users into finely tuned segments. This enables these financial institutions to create increasingly relevant and sophisticated offers.

Examples of Big Data Projects

- Advertising and marketing agencies are tracking social media to understand responsiveness to campaigns, promotions, and other advertising media.
- Insurance companies are using Big Data analysis to see which home insurance applications can be immediately processed and which ones need a validating in-person visit from an agent.
- By embracing social media, retail organizations are engaging brand advocates, changing the perception of brand antagonists, and even enabling enthusiastic customers to sell their products.
- Hospitals are analyzing medical data and patient records to predict which patients are likely to seek readmission within a few months of discharge. The hospital can then intervene in hopes of preventing another costly hospital stay.
- Web-based businesses are developing information products that combine data gathered from customers to offer more appealing recommendations and more successful coupon programs.
- Governments are making data public at the national, state, and city levels so that users can develop new applications that generate public good.
- Sports teams are using data for tracking ticket sales and even for tracking team strategies.

Importance of Big Data

Jobs:
- The U.S. could face a shortage by 2015 of 140,000 to 190,000 people with "deep analytical talent," and of 1.9 million people capable of analyzing data in ways that enable business decisions (McKinsey & Co).
- The Big Data industry is worth more than $100 billion and is growing at almost 10% a year, roughly twice as fast as the software business.

Advantages

- Real-time rerouting of transportation fleets based on weather patterns
- Customer sentiment analysis based on social postings
- Targeted disease therapies based on genomic data
- Allocation of disaster relief supplies based on mobile and social messages from victims
- Cars driving themselves

Thank You!
