Professional Documents
Culture Documents
By Dr.R.Venkatesh
Agenda
1
55
Contents
1
6
Introduction : HADOOP
2
7
3
8
4
9
10
5
2012
=
LHC
1 (40 TB/S)
640TB per
Flight
bytes
of
This
data
comes
from
everywhere: sensors used to
gather climate information,
posts to social media sites,
digital pictures and videos,
purchase transaction records,
and cell phone GPS signals.
This data is big data
5
Validity
Like big data veracity is the issue of validity meaning is the data correct and accurate
for the intended use. Clearly valid data is key to making the right decisions.
Volatility
Big data volatility refers to how long is data valid and how long should it be stored.
10
4.6
billion
camera
phones
world
wide
100s of
millions
of GPS
enabled
data every
day
7 TBs of
of tweet data
every day
30 billion RFID
tags today
(1.3B in 2005)
devices
sold
annually
25+ TBs
of
log data
every day
2+
billion
76 million
smart meters
people
on the
Web by
end 2011
Era of Analytics
12+ terabytes
5+ million
of Tweets
create daily.
100s
of different
types of data.
trade events
per second.
Volume
Velocity
Variety
Veracity
Only
1 in 3
12
2010
35
ZB
2020
Establishing the
Veracity of big
data sources
Responding to the
increasing Velocity
30
Billion
RFID
sensors
and
counting
Collectively
Analyzing the
broadening Variety
80% of the
worlds data is
unstructured
13
Knowledge discovery
Data Mining
Statistical Modeling
14
15
16
17
18
19
Meaningfulness of Analytics
We want to find (unrelated) people who at least twice
have stayed at the same hotel on the same day
109 people being tracked.
If everyone behaves randomly (ie., no terrorist) will
data mining detect anything suspicious?
20
Sentiment Analysis
Hotel Feedback
I had a fantastic time on holiday at your
resort. The service was excellent and friendly.
My family all really enjoyed themselves.
The pool was closed, which kind of Terrible
though.
21
Sentimental Analysis
Take a list of Positive and Negative
words
Positive
Negative
Good
Bad
Great
Worse
Fantastic
Rubbish
Excellent
Nasty
Friendly
Awful
Awesome
Terrible
Enjoyed
Bogus
22
Hotel Feedback
I had a fantastic time on holiday at your
resort. The service was excellent and
friendly. My family all really enjoyed
themselves.
The pool was closed, which kind of Terrible
though.
23
Sentiment Analysis
Count them
Positive
Negative
Fantastic
Terrible
Excellent
Friendly
Enjoyed
1
24
Sentiment Analysis
4 - 1= 3
Overall sentiment:
Positive
25
18
27
MapReduce
'MapReduce' is a framework for processing Parellelisable problems
19
MAP
<K2,V2>
<K3,V3>
REDUCE
<K1, V1>
29
Example of Hadoop
Programming
Word Count:
I ike parallel computing. I also took
courses on parallel computing
Parallel: 2
Computing: 2
I: 2
Like: 1
31
Example of Hadoop
Programming
Intuition: design <key, value>
Assume each node will process a
paragraph
Map:
What is the key?
What is the value?
Reduce:
What to collect?
What to reduce?
32
Nowadays
Who use Hadoop?
Amazon/A9
AOL
Facebook
Fox interactive media
Google
IBM
New York Times
PowerSet (now Microsoft)
Quantcast
Rackspace/Mailtrust
Veoh
Yahoo!
More at http://wiki.apache.org/hadoop/PoweredBy
33
Nowadays
When you
visit yahoo,
you are
interacting
with data
processed
with Hadoop!
34
Nowadays
When you
Content
Optimization
visit yahoo,
you are
interacting
with data
processed
with Hadoop!
Search
Index
Ads
Optimization
Content Feed
Processing
35
Nowadays
When you
Content
Optimization
visit yahoo,
you are
interacting
with data
processed
with Hadoop!
Search
Index
Machine
Learning
(e.g. Spam filters)
Ads
Optimization
Content Feed
Processing
36
Nowadays
37
Hadoop in Yahoo!
Before Hadoop
After Hadoop
Language
C++
Python
Development Time
2-3 weeks
2-3 days
38
38
Heres another way to capture what a Big Data project could mean
for your company or project: study how others have applied the
idea.
Here are some real-world examples of Big Data in action:
Consumer product companies and retail organizations are
monitoring social media like Facebook and Twitter to get an
unprecedented view into customer behavior, preferences, and
product perception.
Manufacturers are monitoring minute vibration data from their
equipment, which changes slightly as it wears down, to predict the
optimal time to replace or maintain. Replacing it too soon wastes
money; replacing it too late triggers an expensive work stoppage
Manufacturers are also monitoring social networks, but with a
different goal than marketers: They are using it to detect aftermarket
support issues before a warranty failure becomes publicly
detrimental.
Financial Services organizations are using data mined from
customer interactions to slice and dice their users into finely tuned
segments. This enables these financial institutions to create
increasingly relevant and sophisticated offers.
39
10
Job
The U.S. could face a shortage by 2015 of
140,000 to 190,000 people with "deep
analytical talent" and of 1.9 million people
capable of analyzing data in ways that enable
business decisions. (McKinsey & Co)
Big Data industry is worth more than $100
billion
growing at almost 10% a year (roughly twice as
fast as the software business)
41
Advantages
42
Thank You !