5.2K views

Uploaded by Mikio Braun

My talk given at BerlinBuzzwords 2012.
Video for the talk can be found here: http://vimeo.com/44023460

- Online Learning with Stream Mining
- 120 Interview Questions
- The Data Science Handbook - Pre Release
- Data Science Boot Camp Survival Manual
- Data Science Hiring Guide
- Hadoop Installation Guide
- What is Data Science (Slides)
- Data Science
- Hadoop
- Big Data in Healthcare Hype and Hope
- Data Science With R by Jigsaw Academy
- Hadoop Training #2: MapReduce & HDFS
- Data Science With R Text Mining by Graham Williams
- Hadoop Training #4: Programming with Hadoop
- Data Science Interview Question
- Security Data Visualization
- Big Data
- Hadoop
- Machine Learning with Large Networks of People and Places
- Edx-Microsoft Data Science Timetable

You are on page 1of 33

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Let's cluster our user profiles classify our documents compute some nifty graph statistics

DB

(>2TB)

but how?

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Add a dash of

http://cassandra.apache.org

http://mongodb.org

(sharded, of course)

http://wiki.basho.com/ http://www.mysql.com

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Add

http://activemq.apache.org

http://www.zeromq.org/ http://akka.io

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

MapReduce

Big Data

http://discoproject.org http://hadoop.apache.org

Map

Reduce

Result

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Showed how to adapt ML algorithms to MapReduce Locally Weighted Linear Regression, Naive Bayes, Gaussian Discriminative Analysis, kMeans, Logistic Regression, Neural Networks, Principal Component Analysis, Independent Component Analysis, Expectation Maximization, Support Vector Machines

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Input: points X1,...Xn, number k Output: centroids 1,...,k Initialize k centroids 1,...,k at random repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster

end

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Input: points X1,...Xn, number k Output: centroids 1,...,k Initialize k centroids 1,...,k at random repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster

end

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Input: points X1,...Xn, number k Output: centroids 1,...,k Initialize k centroids 1,...,k at random repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster

end

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Input: points X1,...Xn, number k Output: centroids 1,...,k Initialize k centroids 1,...,k at random repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster

end

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Input: points X1,...Xn, number k Output: centroids 1,...,k Initialize k centroids 1,...,k at random repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster

end

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

centroids

data data

data

data

Map

1|2|1|5|2|3|4|1|2|0|1|2|3

5|2|3|1|1|3|2|1|2|1|2|5

3|1|4|0|1|3|4|2|1|3|1|2|5

centroids

centroids

centroids

Reduce centroids

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

repeat until converged compute all distances between points and centroids map each point to closest centroid update centroids by computing average of all points in cluster end

job k-means map: compute all distances to centroid map each point to closest centroid compute new cluster center reduce: average cluster centers end job repeat until converged send centroids to cluster run job k-means get centroids end

Additional housekeeping

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

How to train your SVM/vowpal wabbit/Naive bayes/k-nearest neighbors on 2TB of data? You probably don't have to. But if, how do you train on heaps of data?

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Large-Scale learning.

First of all, there is no such thing as Data Science. There is no scientific discipline called data science. You cant go to an university to study data science. On the other hand, I agree that there is such a thing as a data scientist. Whenever I see someone calling himself a data scientist, I think that my own profile would probably also match that description. But what is it a data scientist does?

a:4 agree:1 all:1 also:1 an:1 as:2 but:1 called:1 calling:1 can:1 data:6 description:1 discipline:1 does:1 first:1 go:1 hand:1 himself:1 i:3 is:4 it:1 match:1 my:1 no:2 of:1 on:1 other:1 own:1 probably:1 profile:1 science:3 scientific:1 scientist:3 see:1 someone:1 study:1 such:2 t:1 that:3 the:1 there:3 thing:2 think:1 to:2 university:1 what:1 whenever:1 would:1 you:1

Document

Then, learn weights for each of the words to predict between usually two classes.

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

SVM-Training

Stochastic Gradient Descent (one example at a time) Other more complex methods (bundle methods, etc.)

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Model fits into memory, essentially IO bound Examples: vowpal wabbit http://hunch.net/~vw/ Even the MapReduce paper only made micro-batches

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Superstep 1 Actions: send messages read inbox change graph structure vote to halt

Superstep 2

Superstep 3

Malewicz, Austern, Bik, Dehnert, Horn, Leiser, Czajkowski, Pregel: A System for Large-Scale Graph Processing, SIGMOD'10

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Stream Mining

Large scale processing of event streams Very large domains (e.g. IP adresses, all users on Twitter) Thousands of events per second.

Continuous Stream of Data

Bounded Resource Analyzer for example: what are the most active objects? summarize histogram of data range queries

Stream Queries

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Heavy Hitters

Count activities over large item sets (millions, even more, e.g. IP addresses, Twitter users) Interested in most active elements only.

Case 1: element already in data base 132 142 432 553 712 023 15 12 8 5 3 2 713 3 Case 2: new element 713 023 2 142 142 12 13

Metwally, Agrawal, Abbadi, Efficient computation of Frequent and Top-k Elements in Data Streams, Internation Conference on Database Theory, 2005

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Time

Keep quite a big log (a month?) Constant write/erase in database Alternative: Exponential decay

DB

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Hashing

Compress large feature sets to smaller sets at random. On average, you make a very small error.

-5:description,probably,profile -3:someone,called 0:scientist,agree,data,as,i,go,my,no 1:would,also,university,does,other,of,the 2:hand,thing,such,that,science,on,an 3:is,own,all 4:calling,t,it,to,there,can,see 5:but,whenever,first,discipline,study 6:scientific,himself,what,a,match,you,think

a:4 agree:1 all:1 also:1 an:1 as:2 but:1 called:1 calling:1 can:1 data:6 description:1 discipline:1 does:1 first:1 go:1 hand:1 himself:1 i:3 is:4 it:1 match:1 my:1 no:2 of:1 on:1 other:1 own:1 probably:1 profile:1 science:3 scientific:1 scientist:3 see:1 someone:1 study:1 such:2 t:1 that:3 the:1 there:3 thing:2 think:1 to:2 university:1 what:1 whenever:1 would:1 you:1

Weinberger, Dasgupta, Attenberg, Langford und Smola, Feature Hashing for Large Scale Multitask Learning, ICML, 2009

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Count-Min Sketches

Summarize histograms over large feature sets Like hashing, but better

m bins 0 1 0 2 0 1 5 4 3 0 3 5 0 2 2 0 0 0 1 0 2 3 3 2 0 5 7 0 0 2 3 8 n different hash functions

Query result: 1

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Online clustering

0 1 0 2

0 1 5 4

3 0 3 5

0 2 2 0

0 0 1 0

2 3 3 2

0 5 7 0

0 2 3 8

Aggarwal, A Framework for Clustering Massive-Domain Data Streams, IEEE International Conference on Data Engineering , 2009

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Real-Time Requirements

Guaranteed constant processing time per event. Resilience against volume peaks. Stream-mining method (heavy hitters, etc.) Keep hot data in memory Add scalable technology as needed for persistence, etc.

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

2011 in Retweets

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

JSON parsing language prediction sentiment analysis Actor 1 Tweets Retweet Matching & Retweet Trends Actor k

Snapshots

Trends

Scalable DB

Stream processing/ Actor based concurrency

Day n

MapReduce

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

2011 in Retweets

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

Summary

Big Data Science is not just a scaling problem. To scale, you need to scale data & computation Roll Your Own, or use an existing framework Computation models might be unnatural Large-scale learning: linear models & one example at a time Stream mining: heavy hitters, hashing, count-min sketches You don't scale into real-time.

Scalability Challenges for Big Data Science BerlinBuzzwords, June 4, 2012 2012 by Mikio L. Braun

- Online Learning with Stream MiningUploaded byMikio Braun
- 120 Interview QuestionsUploaded bysjbladen
- The Data Science Handbook - Pre ReleaseUploaded bytintojames
- Data Science Boot Camp Survival ManualUploaded byJoanna Reed
- Data Science Hiring GuideUploaded byramesh158
- Hadoop Installation GuideUploaded bySri Ram
- What is Data Science (Slides)Uploaded byHola2
- Data ScienceUploaded byaruhn
- HadoopUploaded bySummit Sharma
- Big Data in Healthcare Hype and HopeUploaded byDrBonnie360
- Data Science With R by Jigsaw AcademyUploaded byAjitabhKumar
- Hadoop Training #2: MapReduce & HDFSUploaded byDmytro Shteflyuk
- Data Science With R Text Mining by Graham WilliamsUploaded byAnda Nenu
- Hadoop Training #4: Programming with HadoopUploaded byDmytro Shteflyuk
- Data Science Interview QuestionUploaded byRahulsinghoooo
- Security Data VisualizationUploaded byShane Hartman
- Big DataUploaded byDigital Lab
- HadoopUploaded byKartik Mahadevan
- Machine Learning with Large Networks of People and PlacesUploaded byfoursquareHQ
- Edx-Microsoft Data Science TimetableUploaded byNvb Subba Reddy Peddapudi
- Hadoop World: Enabling ad-hoc Analytics at Web ScaleUploaded byOleksiy Kovyrin
- Realtime Personalization and Recommendation with Stream MiningUploaded byMikio Braun
- Data Science Book v 3Uploaded bySandeep Saju
- Beyond Scaling Real Time Even Processing With Stream MiningUploaded byMikio Braun
- Art of Data ScienceUploaded bypratikshr1
- 2015 Data Science Salary SurveyUploaded byali07
- Big Data Beer Apr2014Uploaded byMikio Braun
- DataScienceWeekly DataScientistInterviews Vol1 April2014Uploaded byclungaho7109
- UltimateGuidetoDataScienceInterviews-2Uploaded byAnonymous wt1Miztt3F
- Mike Krieger, Instagram at the Airbnb tech talk, on Scaling InstagramUploaded byingrid8775

- Realtime Personalization and Recommendation with Stream MiningUploaded byMikio Braun
- Beyond Scaling Real Time Even Processing With Stream MiningUploaded byMikio Braun
- jblas - Fast matrix computations for JavaUploaded byMikio Braun
- Big Data Beer Apr2014Uploaded byMikio Braun
- Online Learning with StreamdrillUploaded byMikio Braun
- Quantifying Spatiotemporal Dynamics of Twitter Replies to News FeedsUploaded byMikio Braun
- On Real-Time Twitter Analysis.Uploaded byMikio Braun
- Some Introductory Remarks on Bayesian InferenceUploaded byMikio Braun
- Fast Cross Validation Via Sequential Analysis - TalkUploaded byMikio Braun
- Fast Cross Validation Via Sequential Analysis - AppendixUploaded byMikio Braun
- Cassandra - An IntroductionUploaded byMikio Braun
- Fast Cross-Validation Via Sequential Analysis - PaperUploaded byMikio Braun

- GIAN Lecture 1Uploaded bySurya Ganesh Penaganti
- Rac 2Uploaded byahsumonbd
- Samyak's ResumeUploaded bySamyak Jain
- CCNA Security Questions AnswersUploaded bymathpalsonu
- 3. E-government and ManagerialismUploaded byLucila
- Tim Sweeney Archive - News ArchiveUploaded byAndrew
- 286074586-Ccna-Voice-Voip.pdfUploaded byakganesh14
- Design Rules and Basic Linux CommandsUploaded byYew Wooi Jing
- economics of ehealth reportUploaded byashmitashrivas
- FAN UL(2012.01.13)Uploaded byOscar Ojeda
- Swot AnalysisUploaded byjjrojasr
- Starter Plus Mailbox Setup GuideUploaded byAndy Gee
- Big DataUploaded byIDG_Enterprise
- Java EE and HTML5 Enterprise Application DevelopmentUploaded byPaulo José Junior
- Forecasting of Indian Rupee (INR) US DollarUploaded bychatuusumitava
- MOSHELLUploaded byAnonymous a5yWHm
- Marketing Management Information SystemUploaded bytrisshh
- CoreJava Imp PointUploaded byVasuReddyLb
- Rfc3540 Robust ECNUploaded byFahad Ahmad Khan
- Underground Piping Stress Analysis Procedure Using Caesar IIUploaded byFandy Sipata
- Igpet ManualUploaded byBladimir Jhon Palacios Hurtado
- En Sy210nt ControllerUploaded byJitender Rajvansh
- oracle Enhanced RetroPayUploaded byhamdy2001
- company-profile1Uploaded bybabymel6070
- CFturbo EnUploaded byNoelCano
- UWBUploaded byHitika Pathak Phene
- The Art of Scrum && Quintuple-Constraint - PresentationUploaded byAlex Singleton
- Pads WhitepaperUploaded byagmnm1962
- Cisco 2960X/XR Switch VS Cisco 3650 Switch VS Cisco 3850 SwitchUploaded byFrancine Johnson
- Tu - Oasis User ManualUploaded byakkisantosh7444