Dr. Bharatheesh Jaysimha
CTO, Abiba Systems
Bengaluru
www.abibasystems.com
Data trends
Personal data
It's coming ($2M today, ~$2K in 10 years).
Today the pack rats have ~10-100 GB:
1-10 GB in text (email, PDF, PPT, OCR)
10-50 GB in TIFF, MPEG, JPEG
Some have 1 TB (voice + video).
Video can drive it to 1 PB.
An online PB will be affordable in 10 years.
Get ready: tools to capture, manage, organize, search, and display will be the big app.
(Sidebar label, recovered from rotated text: "The Personal Petabyte (some day)")
Relational Database (SQL)
10 TB to 100TB DBs.
Mostly DSS and data warehouses.
Some media managers
SQL performance is better than CIFS/NFS
Most bytes are in files
No DBMSs beyond 100TB
Internet scale data
What product promotions have the biggest impact on revenue?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
Current Approach
EDW and Analytics
[Architecture diagram: billing systems feed daily/monthly subscriber change/snapshot records into the data warehouse and data marts, which drive canned reports, customer lists, and cubes for subscriber profile, revenue/usage, activity, plan vs. actuals, and forecast, plus data mining; actual/plan information is shared between the cubes.]
Data volumes are doubling almost every year: the law of compounding! (A doubling every year compounds to roughly a 1,000x increase in a decade, since 2^10 = 1024.)
Hadoop vs. RDBMS
Data load
Size: up to EB
Any data
Commodity hardware
Free software
Hadoop
Distributed computing framework
For clusters of computers
Thousands of compute nodes
Petabytes of data
Open source, Java
Google's MapReduce inspired Yahoo's Hadoop
Now part of the Apache group
Hadoop
Hadoop workflow
Hadoop sequence diagram
Hadoop MapReduce example
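The workflow, sequence-diagram, and MapReduce-example figures above did not survive extraction. As a stand-in for the example, here is the canonical word-count job written against the same classic org.apache.hadoop.mapred API used later in this deck; the class names are illustrative and the input/output paths come from the command line.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCount {
    // Map phase: emit (word, 1) for every token in the input line.
    public static class TokenMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            StringTokenizer tok = new StringTokenizer(value.toString());
            while (tok.hasMoreTokens()) {
                word.set(tok.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Reduce phase (also used as combiner): sum the counts per word.
    public static class SumReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output,
                Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) sum += values.next().get();
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(TokenMapper.class);
        conf.setCombinerClass(SumReducer.class);
        conf.setReducerClass(SumReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Submit it with something like "hadoop jar wordcount.jar WordCount in out" (the jar name and paths are hypothetical).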
Pig - ETL
Pig
Started at Yahoo! Research
Features:
Expresses sequences of MapReduce jobs
Data model: nested bags of items
Provides relational (SQL) operators (JOIN, GROUP BY, etc.)
Easy to plug in Java functions (see the UDF sketch below)
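To make that last point concrete, a Pig user-defined function is just a Java class extending Pig's EvalFunc. A minimal sketch (the class name and behavior are invented for illustration) that upper-cases its single string argument:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: returns its single string argument upper-cased.
public class UpperCase extends EvalFunc<String> {
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return null;
        String s = (String) input.get(0);
        return s == null ? null : s.toUpperCase();
    }
}

In a script you would REGISTER the jar containing the class and then call UpperCase(name) inside a foreach ... generate.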
Pig: an example job
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
Pig: first, look at the equivalent Java MapReduce code
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class MRExample {
    public static class LoadPages extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String key = line.substring(0, firstComma);
            String value = line.substring(firstComma + 1);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("1" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class LoadAndFilterUsers extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        public void map(LongWritable k, Text val,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // Pull the key out
            String line = val.toString();
            int firstComma = line.indexOf(',');
            String value = line.substring(firstComma + 1);
            int age = Integer.parseInt(value);
            if (age < 18 || age > 25) return;
            String key = line.substring(0, firstComma);
            Text outKey = new Text(key);
            // Prepend an index to the value so we know which file
            // it came from.
            Text outVal = new Text("2" + value);
            oc.collect(outKey, outVal);
        }
    }

    public static class Join extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key,
                Iterator<Text> iter,
                OutputCollector<Text, Text> oc,
                Reporter reporter) throws IOException {
            // For each value, figure out which file it's from and
            // store it accordingly.
            List<String> first = new ArrayList<String>();
            List<String> second = new ArrayList<String>();

            while (iter.hasNext()) {
                Text t = iter.next();
                String value = t.toString();
                if (value.charAt(0) == '1')
                    first.add(value.substring(1));
                else second.add(value.substring(1));
                reporter.setStatus("OK");
            }

            // Do the cross product and collect the values
            for (String s1 : first) {
                for (String s2 : second) {
                    String outval = key + "," + s1 + "," + s2;
                    oc.collect(null, new Text(outval));
                    reporter.setStatus("OK");
                }
            }
        }
    }

    public static class LoadJoined extends MapReduceBase
            implements Mapper<Text, Text, Text, LongWritable> {

        public void map(
                Text k,
                Text val,
                OutputCollector<Text, LongWritable> oc,
                Reporter reporter) throws IOException {
            // Find the url
            String line = val.toString();
            int firstComma = line.indexOf(',');
            int secondComma = line.indexOf(',', firstComma + 1);
            String key = line.substring(firstComma + 1, secondComma);
            // drop the rest of the record, I don't need it anymore,
            // just pass a 1 for the combiner/reducer to sum instead.
            Text outKey = new Text(key);
            oc.collect(outKey, new LongWritable(1L));
        }
    }

    public static class ReduceUrls extends MapReduceBase
            implements Reducer<Text, LongWritable, WritableComparable, Writable> {

        public void reduce(
                Text key,
                Iterator<LongWritable> iter,
                OutputCollector<WritableComparable, Writable> oc,
                Reporter reporter) throws IOException {
            // Add up all the values we see
            long sum = 0;
            while (iter.hasNext()) {
                sum += iter.next().get();
                reporter.setStatus("OK");
            }
            oc.collect(key, new LongWritable(sum));
        }
    }

    public static class LoadClicks extends MapReduceBase
            implements Mapper<WritableComparable, Writable, LongWritable, Text> {

        public void map(
                WritableComparable key,
                Writable val,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            oc.collect((LongWritable) val, (Text) key);
        }
    }

    public static class LimitClicks extends MapReduceBase
            implements Reducer<LongWritable, Text, LongWritable, Text> {

        int count = 0;

        public void reduce(
                LongWritable key,
                Iterator<Text> iter,
                OutputCollector<LongWritable, Text> oc,
                Reporter reporter) throws IOException {
            // Only output the first 100 records
            while (count < 100 && iter.hasNext()) {
                oc.collect(key, iter.next());
                count++;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf lp = new JobConf(MRExample.class);
        lp.setJobName("Load Pages");
        lp.setInputFormat(TextInputFormat.class);
        lp.setOutputKeyClass(Text.class);
        lp.setOutputValueClass(Text.class);
        lp.setMapperClass(LoadPages.class);
        FileInputFormat.addInputPath(lp, new Path("/user/gates/pages"));
        FileOutputFormat.setOutputPath(lp,
                new Path("/user/gates/tmp/indexed_pages"));
        lp.setNumReduceTasks(0);
        Job loadPages = new Job(lp);

        JobConf lfu = new JobConf(MRExample.class);
        lfu.setJobName("Load and Filter Users");
        lfu.setInputFormat(TextInputFormat.class);
        lfu.setOutputKeyClass(Text.class);
        lfu.setOutputValueClass(Text.class);
        lfu.setMapperClass(LoadAndFilterUsers.class);
        FileInputFormat.addInputPath(lfu, new Path("/user/gates/users"));
        FileOutputFormat.setOutputPath(lfu,
                new Path("/user/gates/tmp/filtered_users"));
        lfu.setNumReduceTasks(0);
        Job loadUsers = new Job(lfu);

        JobConf join = new JobConf(MRExample.class);
        join.setJobName("Join Users and Pages");
        join.setInputFormat(KeyValueTextInputFormat.class);
        join.setOutputKeyClass(Text.class);
        join.setOutputValueClass(Text.class);
        join.setMapperClass(IdentityMapper.class);
        join.setReducerClass(Join.class);
        FileInputFormat.addInputPath(join,
                new Path("/user/gates/tmp/indexed_pages"));
        FileInputFormat.addInputPath(join,
                new Path("/user/gates/tmp/filtered_users"));
        FileOutputFormat.setOutputPath(join,
                new Path("/user/gates/tmp/joined"));
        join.setNumReduceTasks(50);
        Job joinJob = new Job(join);
        joinJob.addDependingJob(loadPages);
        joinJob.addDependingJob(loadUsers);

        JobConf group = new JobConf(MRExample.class);
        group.setJobName("Group URLs");
        group.setInputFormat(KeyValueTextInputFormat.class);
        group.setOutputKeyClass(Text.class);
        group.setOutputValueClass(LongWritable.class);
        group.setOutputFormat(SequenceFileOutputFormat.class);
        group.setMapperClass(LoadJoined.class);
        group.setCombinerClass(ReduceUrls.class);
        group.setReducerClass(ReduceUrls.class);
        FileInputFormat.addInputPath(group,
                new Path("/user/gates/tmp/joined"));
        FileOutputFormat.setOutputPath(group,
                new Path("/user/gates/tmp/grouped"));
        group.setNumReduceTasks(50);
        Job groupJob = new Job(group);
        groupJob.addDependingJob(joinJob);

        JobConf top100 = new JobConf(MRExample.class);
        top100.setJobName("Top 100 sites");
        top100.setInputFormat(SequenceFileInputFormat.class);
        top100.setOutputKeyClass(LongWritable.class);
        top100.setOutputValueClass(Text.class);
        top100.setOutputFormat(SequenceFileOutputFormat.class);
        top100.setMapperClass(LoadClicks.class);
        top100.setCombinerClass(LimitClicks.class);
        top100.setReducerClass(LimitClicks.class);
        FileInputFormat.addInputPath(top100,
                new Path("/user/gates/tmp/grouped"));
        FileOutputFormat.setOutputPath(top100,
                new Path("/user/gates/top100sitesforusers18to25"));
        top100.setNumReduceTasks(1);
        Job limit = new Job(top100);
        limit.addDependingJob(groupJob);

        JobControl jc = new JobControl("Find top 100 sites for users 18 to 25");
        jc.addJob(loadPages);
        jc.addJob(loadUsers);
        jc.addJob(joinJob);
        jc.addJob(groupJob);
        jc.addJob(limit);
        jc.run();
    }
}
Pig: now look at the same job in Pig Latin
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group, count(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
The same script, step by step:
Load users → Users = load
Filter by age → Filtered = filter
Load pages → Pages = load
Join on name → Joined = join
Group on url → Grouped = group
Count clicks → Summed = ... count()
Order by clicks → Sorted = order
Take top 5 → Top5 = limit
Hive - Data warehouse
Hive
Developed at Facebook
Used for the majority of Facebook jobs
A relational-style data warehouse built on Hadoop
Maintains a list of table schemas
SQL-like query language (HQL)
Can call Hadoop Streaming scripts from HQL
Supports table partitioning, clustering, complex data types, and some optimizations
Hive
Find top 5 pages visited by users aged 18-25:
SELECT p.url, COUNT(1) AS clicks
FROM users u JOIN page_views p ON (u.name = p.user)
WHERE u.age >= 18 AND u.age <= 25
GROUP BY p.url
ORDER BY clicks DESC
LIMIT 5;
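For completeness, a sketch of issuing the same query from Java over Hive's original (HiveServer1-era) JDBC driver, matching the Hive generation this deck describes; the host, port, and empty credentials are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveTop5 {
    public static void main(String[] args) throws Exception {
        // Classic HiveServer1 JDBC driver; localhost:10000 is an assumption.
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
                "SELECT p.url, COUNT(1) AS clicks "
              + "FROM users u JOIN page_views p ON (u.name = p.user) "
              + "WHERE u.age >= 18 AND u.age <= 25 "
              + "GROUP BY p.url ORDER BY clicks DESC LIMIT 5");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
        con.close();
    }
}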
Mahout - Data mining
Data mining
Clustering
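Mahout packages clustering algorithms such as k-means as MapReduce jobs over Hadoop. Rather than guess at Mahout's driver API, here is a plain-Java sketch of the k-means iteration itself; the points, k, and iteration count are made up for illustration.

import java.util.Arrays;
import java.util.Random;

// Plain-Java k-means on 2-D points: the algorithm behind Mahout's
// distributed clustering, shown without the Hadoop machinery.
public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1.5, 2}, {3, 4}, {5, 7},
                              {3.5, 5}, {4.5, 5}, {3.5, 4.5} };
        int k = 2;
        Random rnd = new Random(42);

        // Initialize centroids to randomly chosen input points.
        double[][] centroids = new double[k][];
        for (int i = 0; i < k; i++)
            centroids[i] = points[rnd.nextInt(points.length)].clone();

        int[] assign = new int[points.length];
        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each point joins its nearest centroid
            // (squared Euclidean distance).
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centroids[c][0];
                    double dy = points[p][1] - centroids[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < best) { best = d; assign[p] = c; }
                }
            }
            // Update step: each centroid moves to the mean of its cluster.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                int n = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assign[p] == c) {
                        sx += points[p][0];
                        sy += points[p][1];
                        n++;
                    }
                }
                if (n > 0) {
                    centroids[c][0] = sx / n;
                    centroids[c][1] = sy / n;
                }
            }
        }
        for (double[] c : centroids) System.out.println(Arrays.toString(c));
    }
}

Each iteration alternates the assignment and update steps; Mahout's distributed k-means parallelizes these same two steps across a Hadoop cluster.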