
A Brief Introduction to Existing Big Data Tools

Bingjing Zhang

Outline
• The world map of big data tools
• Layered architecture
• Big data tools for HPC and supercomputing
• MPI
• Big data tools on clouds
• MapReduce model
• Iterative MapReduce model
• DAG model
• Graph model
• Collective model
• Machine learning on big data
• Query on big data
• Stream data processing

The World of Big Data Tools (figure)
• Programming models: MapReduce Model, DAG Model, Graph Model, BSP/Collective Model
• MapReduce: Hadoop
• For iterations / learning: MPI, HaLoop, Twister, Spark, Harp, Stratosphere, Dryad/DryadLINQ, REEF, Giraph, Hama, GraphLab, GraphX
• For query: Pig/Pig Latin, Hive, Tez, Drill, Shark, MRQL
• For streaming: S4, Storm, Spark Streaming, Samza

Layered Architecture (Upper) (figure)
• Cross-cutting capabilities: message protocols, distributed coordination, security & privacy, monitoring
• Layers: workflow/orchestration; machine learning and data analytics libraries; high-level (integrated) systems for data processing; parallel, horizontally scalable data processing; ABDS and HPC inter-process communication (pub/sub messaging, MPI, Harp collectives)
• The figure of the layered architecture is from Prof. Geoffrey Fox
• Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers; NA marks non-Apache projects

Layered Architecture (Lower) (figure)
• Cross-cutting capabilities: message protocols, distributed coordination, security & privacy, monitoring
• Layers: extraction tools and object-relational mapping; in-memory caches; SQL and NoSQL stores (column, document, key-value, graph, triple store); file management and data transport; ABDS and HPC file systems; cluster resource management; DevOps / cloud deployment; IaaS system managers and interoperability
• The figure of the layered architecture is from Prof. Geoffrey Fox
• Green layers are Apache/Commercial Cloud (light) to HPC (darker) integration layers; NA marks non-Apache projects

Big Data Tools for HPC and Supercomputing
• MPI (Message Passing Interface, 1992)
  • Provides standardized function interfaces for communication between parallel processes
  • Collective communication operations: Broadcast, Scatter, Gather, Reduce, Allgather, Allreduce, Reduce-scatter
  • Popular implementations: MPICH (2001), OpenMPI (2004)
    • http://www.open-mpi.org/
  • A minimal collective example is sketched below
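To make the collective operations concrete, here is a minimal Allreduce sketch in Java against the mpiJava-style bindings implemented by MPJ Express (which later slides compare against Twister). The class name and the assumption that MPJ Express is installed and the program is launched through its job runner are mine, not part of the original slides.

import mpi.MPI;

// Minimal sketch, assuming MPJ Express on the classpath and launch via its runtime.
public class AllreduceSketch {
  public static void main(String[] args) throws Exception {
    MPI.Init(args);                               // start the MPI runtime
    int rank = MPI.COMM_WORLD.Rank();             // id of this parallel process
    int size = MPI.COMM_WORLD.Size();             // total number of processes

    double[] local = { rank + 1.0 };              // each process contributes one value
    double[] global = new double[1];

    // Allreduce: every process ends up with the sum of all local contributions.
    MPI.COMM_WORLD.Allreduce(local, 0, global, 0, 1, MPI.DOUBLE, MPI.SUM);

    System.out.println("Rank " + rank + "/" + size + " sees sum = " + global[0]);
    MPI.Finalize();
  }
}

Broadcast, Scatter, Gather, Allgather and Reduce-scatter follow the same pattern: one call per process, with the library choosing the underlying communication algorithm.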

MapReduce Model
• Google MapReduce (2004)
  • Jeffrey Dean et al. MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004.
• Apache Hadoop (2005)
  • http://hadoop.apache.org/
  • http://developer.yahoo.com/hadoop/tutorial/
• Apache Hadoop 2.0 (2012)
  • Vinod Kumar Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. SOCC 2013.
  • Separation between resource management and the computation model

Key Features of MapReduce Model
• Designed for clouds
  • Large clusters of commodity machines
• Designed for big data
  • Supported by a local-disk-based distributed file system (GFS / HDFS)
  • Disk-based intermediate data transfer in shuffling
• MapReduce programming model (a minimal word-count sketch follows)
  • Computation pattern: Map tasks and Reduce tasks
  • Data abstraction: KeyValue pairs
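To illustrate the Map/Reduce computation pattern and the KeyValue abstraction, here is a minimal word-count sketch against the standard Hadoop mapreduce API; it is not taken from the original slides, and the job-driver boilerplate is omitted.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map task: emit <word, 1> for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);          // intermediate KeyValue pair
        }
      }
    }
  }

  // Reduce task: sum the counts shuffled to it for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));   // final KeyValue pair
    }
  }
}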

Google MapReduce (figure)
• Mapper: split, read, emit intermediate KeyValue pairs; Reducer: repartition, emit final output
• The user program forks a master and workers; the master assigns map and reduce tasks
• Map workers read input splits and write intermediate files to local disk
• Reduce workers remotely read the intermediate files and write the output files

Iterative MapReduce Model
• Twister (2010)
  • Jaliya Ekanayake et al. Twister: A Runtime for Iterative MapReduce. HPDC workshop 2010.
  • http://www.iterativemapreduce.org/
  • Simple collectives: broadcasting and aggregation
• HaLoop (2010)
  • Yingyi Bu et al. HaLoop: Efficient Iterative Data Processing on Large Clusters. VLDB 2010.
  • http://code.google.com/p/haloop/
  • Programming model
  • Loop-aware task scheduling
  • Caching and indexing for loop-invariant data on local disk

Twister Programming Model (figure)
• Main program's process space: configureMaps(…), configureReduce(…), then while(condition){ runMapReduce(…); … updateCondition(); } //end while, then close()
• The main program may contain many MapReduce invocations or iterative MapReduce invocations
• Worker nodes hold cacheable map/reduce tasks with local disks
• May scatter/broadcast <Key, Value> pairs directly to Map; data may be merged in shuffling with Combine()
• Communications/data transfers via the pub-sub broker network and direct TCP
• A generic driver-loop sketch follows
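To make the loop structure above concrete, here is a small framework-neutral sketch of the iterative driver pattern shown in the figure. The IterativeJob interface, its method names, and the example path and parameters are hypothetical illustrations of the pattern, not Twister's actual API.

// Hypothetical, framework-neutral sketch of the iterative-MapReduce driver pattern.
interface IterativeJob<M> {
  void configureMaps(String inputPath);   // load and cache static data in map tasks
  void configureReduce(int numReducers);
  M runMapReduce(M broadcastData);        // one iteration: map -> shuffle/combine -> reduce
  void close();
}

final class IterativeDriver {
  static <M> M run(IterativeJob<M> job, M initialModel, int maxIterations, double tolerance) {
    job.configureMaps("/data/input");     // hypothetical input path
    job.configureReduce(8);
    M model = initialModel;
    double change = Double.MAX_VALUE;
    for (int i = 0; i < maxIterations && change > tolerance; i++) {
      M updated = job.runMapReduce(model);   // broadcast the current model, aggregate a new one
      change = difference(model, updated);   // the updateCondition() step in the figure
      model = updated;
    }
    job.close();
    return model;
  }

  // Placeholder convergence test; a real job would compare model parameters.
  private static <M> double difference(M before, M after) {
    return before.equals(after) ? 0.0 : 1.0;
  }
}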

DAG (Directed Acyclic Graph) Model
• Dryad and DryadLINQ (2007)
  • Michael Isard et al. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys 2007.
  • http://research.microsoft.com/en-us/collaboration/tools/dryad.aspx

Model Composition
• Apache Spark (2010)
  • Matei Zaharia et al. Spark: Cluster Computing with Working Sets. HotCloud 2010.
  • Matei Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012.
  • http://spark.apache.org/
  • Resilient Distributed Dataset (RDD)
  • RDD operations: MapReduce-like parallel operations
  • DAG of execution stages and pipelined transformations
  • Simple collectives: broadcasting and aggregation (see the sketch below)
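The RDD abstraction and the broadcast collective can be illustrated with a minimal word-count sketch in Spark's Java API (Spark 2.x-style lambdas). The application name, the input/output paths and the broadcast stop-word set are made-up examples, not content from the slides.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
import scala.Tuple2;

public class SparkSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rdd-sketch"));

    // Broadcast a small read-only set to all workers (a simple collective).
    Broadcast<Set<String>> stopWords =
        sc.broadcast(new HashSet<>(Arrays.asList("the", "a", "of")));

    // Pipelined transformations build a DAG of stages that is executed lazily.
    JavaRDD<String> lines = sc.textFile("hdfs:///data/input.txt");   // hypothetical path
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
        .filter(word -> !stopWords.value().contains(word))
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);             // MapReduce-like parallel operation

    counts.saveAsTextFile("hdfs:///data/output");  // hypothetical path
    sc.stop();
  }
}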

Graph Processing with BSP Model
• Pregel (2010)
  • Grzegorz Malewicz et al. Pregel: A System for Large-Scale Graph Processing. SIGMOD 2010.
• Apache Hama (2010)
  • https://hama.apache.org/
• Apache Giraph (2012)
  • https://giraph.apache.org/
  • Scaling Apache Giraph to a trillion edges
    • https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

Pregel & Apache Giraph Vote to Active • Computation Model halt • Superstep as iteration • Vertex state machine: 3 6 2 1 Superste Active and Inactive. vote to halt p0 • Message passing between vertices • Combiners 6 6 2 6 Superste • Aggregators p1 • Topology mutation • Master/worker model 6 6 6 6 Superste p2 • Graph partition: hashing • Fault tolerance: checkpointing and 6 6 6 6 Superste confined recovery p3 Maximum Value Example .

Giraph Page Rank Code Example

import java.io.IOException;
import org.apache.giraph.graph.BasicComputation;
import org.apache.giraph.graph.Vertex;
import org.apache.hadoop.io.FloatWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;

public class PageRankComputation
    extends BasicComputation<IntWritable, FloatWritable, NullWritable, FloatWritable> {
  /** Number of supersteps */
  public static final String SUPERSTEP_COUNT = "giraph.pageRank.superstepCount";

  @Override
  public void compute(Vertex<IntWritable, FloatWritable, NullWritable> vertex,
      Iterable<FloatWritable> messages) throws IOException {
    if (getSuperstep() >= 1) {
      // Sum the PageRank contributions received from in-neighbors.
      float sum = 0;
      for (FloatWritable message : messages) {
        sum += message.get();
      }
      vertex.getValue().set((0.15f / getTotalNumVertices()) + 0.85f * sum);
    }
    if (getSuperstep() < getConf().getInt(SUPERSTEP_COUNT, 0)) {
      // Distribute this vertex's rank evenly over its outgoing edges.
      sendMessageToAllEdges(vertex,
          new FloatWritable(vertex.getValue().get() / vertex.getNumEdges()));
    } else {
      vertex.voteToHalt();
    }
  }
}

GraphLab (2010)
• Yucheng Low et al. GraphLab: A New Parallel Framework for Machine Learning. UAI 2010.
• Yucheng Low et al. Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud. PVLDB 2012.
• http://graphlab.org/projects/index.html
• http://graphlab.org/resources/publications.html
• Data graph
• Update functions and the scope
• Sync operation (similar to aggregation in Pregel)

Data Graph (figure)

Vertex-cut vs. Edge-cut
• Edge-cut (Giraph model) vs. vertex-cut (GAS model)
• PowerGraph (2012)
  • Joseph E. Gonzalez et al. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012.
  • Gather, Apply, Scatter (GAS) model
• GraphX (2013)
  • Reynold Xin et al. GraphX: A Resilient Distributed Graph System on Spark. GRADES (SIGMOD workshop) 2013.
  • https://amplab.cs.berkeley.edu/publication/graphx-grades/

To reduce communication overhead…
• Option 1
  • Algorithmic message reduction
  • Fixed point-to-point communication pattern
• Option 2
  • Collective communication optimization
  • Not considered by the previous BSP model but well developed in MPI
  • Initial attempts in Twister and Spark on clouds
    • Mosharaf Chowdhury et al. Managing Data Transfers in Computer Clusters with Orchestra. SIGCOMM 2011.
    • Bingjing Zhang, Judy Qiu. High Performance Clustering of Social Images in a Map-Collective Programming Model. SOCC Poster 2013.

Collective Model
• Harp (2013)
  • https://github.com/jessezbj/harp-project
  • Hadoop plugin (on Hadoop 1.2.1 and Hadoop 2.2.0)
  • Hierarchical data abstraction on arrays, key-values and graphs for easy programming expressiveness
  • Collective communication model to support various communication operations on the data abstractions
  • Caching with buffer management for the memory allocation required by computation and communication
  • BSP-style parallelism
  • Fault tolerance with checkpointing

Harp Design (figure)
• Parallelism model: MapReduce model (Map tasks -> Shuffle -> Reduce tasks) vs. Map-Collective model (Map tasks coordinate through collective communication)
• Architecture: MapReduce applications and Map-Collective applications run on MapReduce V2 and the Harp framework, with YARN as the resource manager

Hierarchical Data Abstraction and Collective Communication (figure)
• Basic types (Long, Int, Double, Byte) build arrays (Long Array, Int Array, Double Array, Byte Array) and structured objects (vertices, edges, messages, key-values)
• Arrays and objects are grouped into partitions (Array, Edge, Message, Vertex, KeyValue partitions), and partitions into tables (Array, Edge, Message, Vertex, KeyValue tables)
• Collective operations on tables: Broadcast, Allgather, Allreduce, Regroup (combine/reduce), Send, Gather, Message-to-Vertex, Edge-to-Vertex; reductions require commutable operations

Harp Bcast Code Example

protected void mapCollective(KeyValReader reader, Context context)
    throws IOException, InterruptedException {
  // Table holding the centroid partitions to be broadcast to all workers.
  ArrTable<DoubleArray, DoubleArrPlus> table =
      new ArrTable<DoubleArray, DoubleArrPlus>(0, DoubleArray.class, DoubleArrPlus.class);
  if (this.isMaster()) {
    // Only the master loads the centroid file and fills the table.
    String cFile = conf.get(KMeansConstants.CFILE);
    Map<Integer, DoubleArray> cenDataMap =
        createCenDataMap(cParSize, rest, numCenPartitions, vectorSize, this.getResourcePool());
    loadCentroids(cenDataMap, vectorSize, cFile, conf);
    addPartitionMapToTable(cenDataMap, table);
  }
  // Collective broadcast of the table from the master to all map tasks.
  arrTableBcast(table);
}

Broadcasting with Twister vs. MPI, MPJ, and Spark (figures)
• Twister Bcast vs. MPI Bcast on 0.5 GB, 1 GB, and 2 GB data; Twister vs. MPJ on 0.5-2 GB data; Twister vs. Spark on 0.5 GB data
• Twister chain broadcasting with and without topology-awareness (0.5 GB and 1 GB)
• 1-150 nodes; 1 receiver, #receivers = #nodes, or #receivers = #cores (#nodes * 8)
• Tested on IU Polar Grid with 1 Gbps Ethernet connection

K-means Clustering Performance on Madrid Cluster (8 nodes) (figure)
• Harp vs. Hadoop execution time (s) on 24, 48, and 96 cores
• Problem sizes: 100m points / 500 centroids, 10m / 5k, 1m / 50k

K-means Clustering Parallel Efficiency (figure)
• Shantenu Jha et al. A Tale of Two Data-Intensive Paradigms: Applications, Abstractions, and Architectures. 2014.

WDA-MDS Performance on Big Red II
• WDA-MDS
  • Yang Ruan, Geoffrey Fox. A Robust and Scalable Solution for Interpolative Multidimensional Scaling with Weighting. IEEE e-Science 2013.
• Big Red II
  • http://kb.iu.edu/data/bcqt.html
• Allgather: bucket algorithm (a small simulation sketch follows)
• Allreduce: bidirectional exchange algorithm
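The bucket (ring) allgather can be illustrated with a small single-process simulation of its communication schedule: with p processes arranged in a ring, each process forwards one block per step to its right neighbor, so after p - 1 steps every process holds all p blocks. The code below is an added illustrative sketch (the process count, block contents and class name are arbitrary), not the WDA-MDS implementation.

import java.util.Arrays;

public class RingAllgatherSim {
  public static void main(String[] args) {
    int p = 4;                                   // number of simulated processes
    String[][] buffers = new String[p][p];
    for (int rank = 0; rank < p; rank++) {
      buffers[rank][rank] = "block" + rank;      // each rank starts with only its own block
    }
    for (int step = 0; step < p - 1; step++) {
      String[] inFlight = new String[p];
      // In step s, rank r sends the block it obtained s steps ago to (r + 1) mod p.
      for (int rank = 0; rank < p; rank++) {
        int sendIndex = ((rank - step) % p + p) % p;
        inFlight[(rank + 1) % p] = buffers[rank][sendIndex];
      }
      // Each rank receives the block that originated at (rank - 1 - s) mod p.
      for (int rank = 0; rank < p; rank++) {
        int recvIndex = ((rank - 1 - step) % p + p) % p;
        buffers[rank][recvIndex] = inFlight[rank];
      }
    }
    for (int rank = 0; rank < p; rank++) {
      System.out.println("rank " + rank + ": " + Arrays.toString(buffers[rank]));
    }
  }
}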

Execution Time of the 100k Problem (figure)
• Execution time (seconds) vs. number of nodes (8, 16, 32, 64, 128 nodes; 32 cores per node)

Parallel Efficiency Based on 8 Nodes and 256 Cores (figure)
• Parallel efficiency vs. number of nodes (8, 16, 32, 64, 128 nodes), 4096 partitions (32 cores per node)

Scaling Problem Size (100k, 200k, 300k) (figure)
• On 128 nodes with 4096 cores: 368.39 s for 100k, 1643.08 s for 200k, 2877.76 s for 300k

Machine Learning on Big Data
• Mahout on Hadoop
  • https://mahout.apache.org/
• MLlib on Spark (a minimal usage sketch follows)
  • http://spark.apache.org/mllib/
• GraphLab Toolkits
  • http://graphlab.org/projects/toolkits.html
  • GraphLab Computer Vision Toolkit
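As a concrete illustration of calling one of these libraries, here is a minimal k-means sketch against Spark MLlib's Java API; the input path and the parameter values are made-up examples rather than content from the slides.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

public class MLlibKMeansSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("kmeans-sketch"));

    // Parse whitespace-separated feature vectors, one point per line (hypothetical path).
    JavaRDD<Vector> points = sc.textFile("hdfs:///data/points.txt")
        .map(line -> {
          String[] tokens = line.trim().split("\\s+");
          double[] values = new double[tokens.length];
          for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
          }
          return Vectors.dense(values);
        })
        .cache();   // cache the points, since k-means iterates over them

    // Train with k = 10 clusters and at most 20 iterations (example values).
    KMeansModel model = KMeans.train(points.rdd(), 10, 20);
    for (Vector center : model.clusterCenters()) {
      System.out.println("cluster center: " + center);
    }
    sc.stop();
  }
}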

Query on Big Data
• Query with procedural language
• Google Sawzall (2003)
  • Rob Pike et al. Interpreting the Data: Parallel Analysis with Sawzall. Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 2003.
• Apache Pig (2006)
  • Christopher Olston et al. Pig Latin: A Not-So-Foreign Language for Data Processing. SIGMOD 2008.
  • https://pig.apache.org/

SQL-like Query
• Apache Hive (2007)
  • Facebook Data Infrastructure Team. Hive - A Warehousing Solution Over a Map-Reduce Framework. VLDB 2009.
  • https://hive.apache.org/
  • On top of Apache Hadoop
• Shark (2012)
  • Reynold Xin et al. Shark: SQL and Rich Analytics at Scale. Technical Report, UCB/EECS 2012.
  • http://shark.cs.berkeley.edu/
  • On top of Apache Spark
• Apache MRQL (2013)
  • http://mrql.incubator.apache.org/
  • On top of Apache Hadoop, Apache Hama, and Apache Spark

Other Tools for Query
• Apache Tez (2013)
  • http://tez.incubator.apache.org/
  • To build complex DAGs of tasks for Apache Pig and Apache Hive
  • On top of YARN
• Dremel (2010) / Apache Drill (2012)
  • Sergey Melnik et al. Dremel: Interactive Analysis of Web-Scale Datasets. VLDB 2010.
  • http://incubator.apache.org/drill/index.html
  • System for interactive query

Stream Data Processing
• Apache S4 (2011)
  • http://incubator.apache.org/s4/
• Apache Storm (2011)
  • http://storm.incubator.apache.org/
• Spark Streaming (2012)
  • https://spark.incubator.apache.org/streaming/
• Apache Samza (2013)
  • http://samza.incubator.apache.org/
• A small Storm topology sketch follows
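A minimal Storm topology sketch using the classic backtype.storm Java API: a built-in test spout streams words and a bolt keeps running counts. The topology name, parallelism and the choice of TestWordSpout are illustrative assumptions, not content from the slides.

import java.util.HashMap;
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.testing.TestWordSpout;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class WordCountTopologySketch {
  // Bolt that keeps a running count per word and emits <word, count>.
  public static class CountBolt extends BaseBasicBolt {
    private final Map<String, Integer> counts = new HashMap<>();

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
      String word = tuple.getString(0);
      int count = counts.merge(word, 1, Integer::sum);
      collector.emit(new Values(word, count));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("word", "count"));
    }
  }

  public static void main(String[] args) throws Exception {
    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("words", new TestWordSpout(), 2);           // built-in test spout
    builder.setBolt("count", new CountBolt(), 4)
           .fieldsGrouping("words", new Fields("word"));         // same word -> same bolt task

    LocalCluster cluster = new LocalCluster();                   // in-process cluster for testing
    cluster.submitTopology("word-count-sketch", new Config(), builder.createTopology());
  }
}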

REEF
• Retainable Evaluator Execution Framework
• http://www.reef-project.org/
• Provides system authors with a centralized (pluggable) control flow
  • Embeds a user-defined system controller called the Job Driver
  • Event-driven control
• Packages a variety of data-processing libraries (e.g., high-bandwidth shuffle, relational operators, low-latency group communication, etc.) in a reusable form
• Aims to cover different models such as MapReduce, query, graph processing and stream data processing

Thank You! Questions?