
Big Data Analytics

What, Why and How

Dr. Bharatheesh Jaysimha
CTO, Abiba Systems
Bengaluru
www.abibasystems.com
Data trends
Personal data
It's coming (2M$ today, 2K$ in 10 years)
Today the pack rats have ~10-100 GB
1-10 GB in text (eMail, PDF, PPT, OCR)
10 GB - 50 GB in tiff, mpeg, jpeg
Some have 1 TB (voice + video)
Video can drive it to 1 PB
Online PB affordable in 10 years
Get ready: tools to capture, manage, organize, search, display will be the big app.
Relational Database (SQL)

10 TB to 100 TB DBs.
Mostly DSS and data warehouses.
Some media managers
SQL performance is better than
CIFS/NFS
Most bytes are in files
No DBMSs beyond 100TB
Internet scale data

8,000 no-name PCs


Each 1/3U, 2 x 80 GB disks, 2 CPUs, 256 MB RAM
1.4 PB online.
2 TB ram online
8 TeraOps
Slice-price is 1K$ so 8M$.
15 admins (!) (= 1 per 100 TB).
Unusual data, but good examples
Particle Physics Hunting the Higgs and
Dark Matter
April 2006: First pp collisions at TeV energies at the
Large Hadron Collider in Geneva
ATLAS/CMS Experiments involve 2000 physicists from
200 organizations in US, EU, Asia
Need to store, access, process, analyse 10 PB/yr with
200 TFlop/s distributed computation
Building hierarchical Grid infrastructure to distribute
data and computation
ExaBytes and PetaFlop/s by 2015
Business Questions
Which are our lowest/highest margin customers?
Who are my customers, and what products are they buying?
What is the most effective distribution channel?
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Current Approach
EDW and Analytics
Data Sources → Data Warehouse → Data Marts (diagram)
Sources: billing system(s) supplying daily/monthly subscriber change/snapshot records
Warehouse/marts: subscriber profile, daily activity, customer contact, revenue/usage, subscriber cost, commissions, subscriber profitability
Outputs: canned reports, customer lists, welcome letters, subscriber revenue, revenue/usage, activity plan/actuals/forecast, data mining
Front ends: relational queries, EIS/dashboards, Excel templates, with actual/plan information shared between cubes
Problems with Current Approach
Trends in Analytics
RDBMS
Size: up to 100 TB
Mostly structured data
Costly hardware
Costly software

Hadoop
Size: up to EB
Any data
Commodity hardware
Free software

Data load is doubling almost every year - the law of compounding!
Hadoop
Distributed computing framework
For clusters of computers
Thousands of Compute Nodes
Petabytes of data
Open source, Java
Google's MapReduce inspired Yahoo!'s Hadoop.
Now part of Apache group
Hadoop
Hadoop workflow
Hadoop sequence diagram
Hadoop MapReduce example
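The example on the original slide is an image; as a stand-in, here is a minimal word-count job, the usual "hello world" of MapReduce. It is a sketch written against Hadoop's newer org.apache.hadoop.mapreduce API (the long Java listing later in the deck uses the older mapred API); the class name, job name and input/output paths are illustrative, not taken from the slide.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the 1s emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combiner reuses the reducer
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}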
Pig - ETL
Pig
Started at Yahoo! Research
Features:
Expresses sequences of MapReduce jobs
Data model: nested bags of items
Provides relational (SQL) operators (JOIN,
GROUP BY, etc)
Easy to plug in Java functions
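As a hedged sketch of that last point, here is a user-defined function written against Pig's Java EvalFunc API; the class name and the upper-casing logic are invented for illustration.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Upper-cases the first field of the input tuple.
public class UpperCase extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Return null on empty/bad input so Pig skips the record instead of failing.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return input.get(0).toString().toUpperCase();
  }
}

From Pig Latin it would be registered and called roughly as: REGISTER myudfs.jar; B = foreach A generate UpperCase(name);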
Pig - an example of a job

Load Users Load Pages

Filter by age

Join on name

Group on url

Count clicks

Order by clicks

Take top 5
Pig - just look at the Java code
(Full Java listing, roughly 170 lines: a single MRExample class with mappers LoadPages, LoadAndFilterUsers, LoadJoined and LoadClicks, reducers Join, ReduceUrls and LimitClicks, and a main() that chains five JobConf jobs - load pages, load and filter users, join, group and count, top sites for users aged 18 to 25 - together via JobControl.)
Pig - now look at the Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';


Pig Latin
Notice how naturally the components of the job translate into Pig Latin:
Load Users → Users = load
Load Pages → Pages = load
Filter by age → Filtered = filter
Join on name → Joined = join
Group on url → Grouped = group
Count clicks → Summed = ... count()
Order by clicks → Sorted = order
Take top 5 → Top5 = limit
Hive - Data warehouse
Hive
Developed at Facebook
Used for majority of Facebook jobs
Relational database built on
Hadoop
Maintains list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts
from HQL
Supports table partitioning,
clustering, complex data types, some
optimizations
Hive
Find top 5 pages visited by users aged 18-25:
SELECT p.url, COUNT(1) as clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
Mahout Data mining
Data mining
Clustering

Perfect candidate for MapReduce


Mahout k-means
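The k-means slides were shown as images. As a rough sketch of why clustering is such a good candidate for MapReduce, here is one k-means iteration written as a Hadoop mapper/reducer pair: the mapper assigns each point to its nearest current centroid, the reducer averages each cluster's points into a new centroid, and a driver (not shown) would repeat the job until the centroids stop moving. This illustrates the idea behind Mahout's k-means, not Mahout's actual code; the hard-coded centroids, the comma-separated input format and the class names are assumptions made for the example.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansIteration {

  public static class AssignMapper extends Mapper<Object, Text, IntWritable, Text> {
    private final List<double[]> centroids = new ArrayList<>();

    @Override
    protected void setup(Context context) {
      // In a real job the current centroids would be read from HDFS or the
      // distributed cache; they are hard-coded here purely for illustration.
      centroids.add(new double[] {0.0, 0.0});
      centroids.add(new double[] {5.0, 5.0});
    }

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Each input line is assumed to be a comma-separated 2-D point, e.g. "1.2,3.4".
      String[] parts = value.toString().split(",");
      double x = Double.parseDouble(parts[0]);
      double y = Double.parseDouble(parts[1]);

      int best = 0;
      double bestDist = Double.MAX_VALUE;
      for (int i = 0; i < centroids.size(); i++) {
        double dx = x - centroids.get(i)[0];
        double dy = y - centroids.get(i)[1];
        double dist = dx * dx + dy * dy;
        if (dist < bestDist) {
          bestDist = dist;
          best = i;
        }
      }
      // Emit (cluster id, point) so all points of one cluster meet at one reducer.
      context.write(new IntWritable(best), value);
    }
  }

  public static class RecomputeReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> points, Context context)
        throws IOException, InterruptedException {
      double sumX = 0, sumY = 0;
      long n = 0;
      for (Text p : points) {
        String[] parts = p.toString().split(",");
        sumX += Double.parseDouble(parts[0]);
        sumY += Double.parseDouble(parts[1]);
        n++;
      }
      // The new centroid is the mean of the cluster's points; the driver feeds
      // it back into the next iteration until convergence.
      context.write(clusterId, new Text((sumX / n) + "," + (sumY / n)));
    }
  }
}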
Mahout Clustering
Big Data Analytics
Acknowledgements

Thanks to all the authors who left their slides on the Web.
I own the errors, of course.
