
Table of contents

Apache Spark vs Hadoop: Introduction to Hadoop..................................................................12

HDFS...............................................................................................................................12

NameNode..............................................................................................................12

DataNode................................................................................................................13

YARN..............................................................................................................................13

ResourceManager...................................................................................................13

NodeManager.........................................................................................................13

Apache Spark vs Hadoop: Introduction to Apache Spark..............................................................14

Apache Spark vs Hadoop: Parameters to Compare.......................................................................16

Performance.....................................................................................................................16

Ease of Use......................................................................................................................16

Costs................................................................................................................................17

Data Processing...............................................................................................17

Batch Processing vs Stream Processing.................................................................17

Fault Tolerance................................................................................................................18

Security............................................................................................................................19

Use-cases where Hadoop fits best:............................................................................................19

Use-cases where Spark fits best:...............................................................................................19

Real-Time Big Data Analysis:.........................................................................................19

Graph Processing:............................................................................................................20

Iterative Machine Learning Algorithms:.........................................................................21

Mesos features........................................................................................................26

DevOps tooling.........................................................................................................................26

Long Running Services..........................................................................................26

Big Data Processing...............................................................................................26

Batch Scheduling....................................................................................................27

Data Storage...........................................................................................................27

The Aurora Mesos Framework.................................................................................................27

Cloud Watcher....................................................................................................................................27

What is Singularity....................................................................................................................28

How it Works............................................................................................................................29

Singularity Components............................................................................................................31

Singularity Scheduler.......................................................................................................32

Slave Placement.....................................................................................................34

Singularity Scheduler Dependencies......................................................................34

Singularity UI.........................................................................................................35

Optional Slave Components............................................................................................35

Singularity Executor...............................................................................................35

S3 Uploader............................................................................................................35

S3 Downloader.......................................................................................................35

Singularity Executor Cleanup................................................................................35

Log Watcher...........................................................................................................36

OOM Killer............................................................................................................36

Chapel: Productive Parallel Programming.........................................................................................36

Installation.................................................................................................................................38

Example....................................................................................................................................38

Configuration............................................................................................................................38

UI..............................................................................................................................................39

UI when running..............................................................................................39

UI after running...............................................................................................39

UI examples for features..................................................................................................39

Exelixi.................................................................................................................................................40

Quick Start.......................................................................................................................40

Flink....................................................................................................................................................41

Low latency on minimal resources..................................................................................42

Variety of Sources and Sinks...........................................................................................42

Fault tolerance.................................................................................................................42

High level API.................................................................................................................42

Stateful processing...........................................................................................................43

Exactly once processing..................................................................................................43

SQL Support....................................................................................................................43

Environment Support.......................................................................................................43

Awesome community......................................................................................................43

Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza: Choose Your Stream Processing Framework..........................................................................................................................44

System Properties Comparison Cassandra vs. HBase vs. MongoDB................................................58

Top 5 reasons to use the Apache Cassandra Database........................................................................67

Cassandra helps solve complicated tasks with ease........................................................67

Cassandra has a short learning curve...............................................................................68

Cassandra lowers admin overhead and costs DevOps engineer......................................68

Cassandra offers rapid writing and lightning-fast reading..............................................68

Cassandra provides extreme resilience and fault tolerance.............................................68

Final thoughts..................................................................................................................68

Who Uses These Databases?.....................................................................................................73

What About Database Structure?..............................................................................................73

Are Indexes Needed?................................................................................................................75

How Are Their Queries Different?...........................................................................................75

Where (And How) Are These Databases Deployed?...............................................................76

What Types Of Replication / Clustering Are Available?..........................................................77

Who's Currently Behind The Databases?..................................................................................77

Who Provides Support?............................................................................................................77

Who Maintains The Documentation?.......................................................................................77

Is There An Active Community?..............................................................................................77

Which Database Is Right For Your Business?..........................................................................78

Redis:..............................................................................................................................78

MongoDB:......................................................................................................................79

System Properties Comparison GraphDB vs. Neo4j............................................................................80

Related products and services...................................................................................................83

Related products and services...................................................................................................89

Hadoop 2.x Architecture..................................................................................................91

Hadoop 2.x Major Components.......................................................................................92

How Hadoop 2.x Major Components Work.....................................................................93

Installation and Configuration..................................................................................................96

Running HiveServer2 and Beeline.........................................................................96

Requirements...................................................................................................................96

Installing Hive from a Stable Release.............................................................................97

Building Hive from Source..............................................................................................97

Compile Hive on master.........................................................................................98

Compile Hive on branch-1.....................................................................................99

Compile Hive Prior to 0.13 on Hadoop 0.20..........................................................99

Compile Hive Prior to 0.13 on Hadoop 0.23........................................................101

Running Hive.................................................................................................................101

Running Hive CLI................................................................................................102

Running HiveServer2 and Beeline.......................................................................102

Running HCatalog................................................................................................103

Running WebHCat (Templeton)...........................................................................103

Configuration Management Overview..........................................................................104

Runtime Configuration..................................................................................................105

Hive, Map-Reduce and Local-Mode.............................................................................106

Hive Logging.................................................................................................................108

HiveServer2 Logs.................................................................................................111

Audit Logs............................................................................................................111

Perf Logger...........................................................................................................111

DDL Operations......................................................................................................................112

Creating Hive Tables......................................................................................................112

Browsing through Tables...............................................................................................113

Altering and Dropping Tables........................................................................................113

Metadata Store...............................................................................................................114

DML Operations......................................................................................................................115

SQL Operations.......................................................................................................................116

Example Queries............................................................................................................116

SELECTS and FILTERS......................................................................................117

GROUP BY..........................................................................................................118

JOIN.....................................................................................................................119

MULTITABLE INSERT......................................................................................119

STREAMING.......................................................................................................119

Simple Example Use Cases.....................................................................................................120

MovieLens User Ratings...............................................................................................120

Apache Weblog Data.....................................................................................................123

  • 6.1. Using Command Aliases.........................................................................................125

  • 6.2. Controlling the Hadoop Installation.......................................................................125

  • 6.3. Using Generic and Specific Arguments..................................................................126

  • 6.4. Using Options Files to Pass Arguments..................................................................128

  • 6.5. Using Tools.............................................................................................................129

7. sqoop-import.......................................................................................................................129

  • 7.1. Purpose...................................................................................................................130

  • 7.2. Syntax.....................................................................................................................130

    • 7.2.1. Connecting to a Database Server...............................................................131

    • 7.2.2. Selecting the Data to Import.......................................................................136

    • 7.2.3. Free-form Query Imports...........................................................................136

    • 7.2.4. Controlling Parallelism..............................................................................137

    • 7.2.5. Controlling Distributed Cache...................................................................139

    • 7.2.6. Controlling the Import Process..................................................................139

    • 7.2.7. Controlling transaction isolation................................................................140

    • 7.2.8. Controlling type mapping...........................................................................141

    • 7.2.9. Schema name handling...............................................................................141

      • 7.2.10. Incremental Imports.................................................................................142

      • 7.2.11. File Formats..............................................................................................143

      • 7.2.12. Large Objects...........................................................................................144

      • 7.2.13. Importing Data Into Hive.........................................................................148

      • 7.2.14. Importing Data Into HBase......................................................................150

      • 7.2.16. Additional Import Configuration Properties............................................154

  • 7.3. Example Invocations..............................................................................................155

  • 8. sqoop-import-all-tables.......................................................................................................156

    • 8.1. Purpose...................................................................................................................157

    • 8.2. Syntax.....................................................................................................................157

    • 8.3. Example Invocations..............................................................................................159

  • 9. sqoop-import-mainframe....................................................................................................160

    • 9.1. Purpose...................................................................................................................160

    • 9.2. Syntax.....................................................................................................................160

      • 9.2.1. Connecting to a Mainframe........................................................................161

      • 9.2.2. Selecting the Files to Import......................................................................162

      • 9.2.3. Controlling Parallelism..............................................................................162

      • 9.2.4. Controlling Distributed Cache...................................................................163

      • 9.2.5. Controlling the Import Process..................................................................163

      • 9.2.6. File Formats................................................................................................164

      • 9.2.7. Importing Data Into Hive...........................................................................166

      • 9.2.8. Importing Data Into HBase........................................................................168

      • 9.2.9. Importing Data Into Accumulo..................................................................170

        • 9.2.10. Additional Import Configuration Properties............................................172

  • 9.3. Example Invocations..............................................................................................172

    • 10. sqoop-export......................................................................................................................172

      • 10.1. Purpose.................................................................................................................173

      • 10.2. Syntax...................................................................................................................173

      • 10.3. Inserts vs. Updates................................................................................................176

      • 10.4. Exports and Transactions......................................................................................178

      • 10.5. Failed Exports.......................................................................................................179

      • 10.6. Example Invocations............................................................................................179

  • 11. validation...........................................................................................................................180

    • 11.1. Purpose..................................................................................................................180

    • 11.2. Introduction...........................................................................................................180

    • 11.3. Syntax...................................................................................................................181

    • 11.4. Configuration........................................................................................................181

    • 11.5. Limitations............................................................................................................182

    • 11.6. Example Invocations.............................................................................................182

  • 12. Saved Jobs.........................................................................................................................183

  • 13. sqoop-job...........................................................................................................................183

    • 13.1. Purpose.................................................................................................................183

    • 13.2. Syntax...................................................................................................................183

    • 13.3. Saved jobs and passwords....................................................................................185

    • 13.4. Saved jobs and incremental imports.....................................................................186

  • 14. sqoop-metastore................................................................................................................186

    • 14.1. Purpose.................................................................................................................186

    • 14.2. Syntax...................................................................................................................186

  • 15. sqoop-merge......................................................................................................................187

    • 15.1. Purpose.................................................................................................................187

    • 15.2. Syntax...................................................................................................................188

  • 16. sqoop-codegen..................................................................................................................189

    • 16.1. Purpose.................................................................................................................189

    • 16.2. Syntax...................................................................................................................189

    • 16.3. Example Invocations............................................................................................191

  • 17. sqoop-create-hive-table.....................................................................................................191

    • 17.2. Syntax...................................................................................................192

    • 17.3. Example Invocations............................................................................................193

    • 18. sqoop-eval.........................................................................................................................193

      • 18.1. Purpose.................................................................................................................193

      • 18.2. Syntax...................................................................................................................194

      • 18.3. Example Invocations............................................................................................194

  • 19. sqoop-list-databases..........................................................................................................194

    • 19.1. Purpose.................................................................................................................195

    • 19.2. Syntax...................................................................................................................195

  • Flume vs. Kafka vs. Kinesis - A Detailed Guide on Hadoop Ingestion Tools..................................196

    Flume vs. Kafka vs. Kinesis:..................................................................................................197

    Apache Flume:...............................................................................................................197

    Apache Kafka:...............................................................................................................197

    AWS Kinesis:.................................................................................................................198

    So Which to Choose - Flume or Kafka or Kinesis:.................................................................198

    There are other players as well like:.......................................................................................198

    System Requirements....................................................................................................199

    Architecture...................................................................................................................199

    Data flow model...................................................................................................199

    Complex flows.....................................................................................................200

    Reliability.............................................................................................................200

    Recoverability......................................................................................................200

    Setup........................................................................................................................................201

    Setting up an agent.........................................................................................................201

    Configuring individual components.....................................................................201

    Wiring the pieces together....................................................................................201

    Starting an agent...................................................................................................201

    A simple example.................................................................................................201

    Using environment variables in configuration files.............................................203

    Logging raw data..................................................................................................203

    Zookeeper based Configuration...........................................................................204

    Installing third-party plugins................................................................................205

    The plugins.d directory...............................................................................205

    Directory layout for plugins........................................................................205

    Data ingestion................................................................................................................205

    RPC......................................................................................................................205

    Executing commands...........................................................................................206

    Network streams...................................................................................................206

    Setting multi-agent flow................................................................................................206

    Consolidation.................................................................................................................206

    Multiplexing the flow....................................................................................................207

    Configuration..........................................................................................................................208

    Defining the flow...........................................................................................................208

    Configuring individual components..............................................................................209

    Adding multiple flows in an agent.................................................................................210

    Configuring a multi agent flow......................................................................................211

    Fan out flow...................................................................................................................212

    SSL/TLS support...........................................................................................................214

    Source and sink batch sizes and channel transaction capacities....................................217

    Flume Sources...............................................................................................................217

    Avro Source..........................................................................................................217

    Thrift Source........................................................................................................220

    Exec Source..........................................................................................................222

    JMS Source..........................................................................................................224

    JMS message converter..............................................................................226

    SSL and JMS Source..................................................................................226

    Spooling Directory Source...................................................................................227

    Event Deserializers.....................................................................................230

    LINE..................................................................................................230

    AVRO................................................................................................230

    BlobDeserializer................................................................................231

    Taildir Source.......................................................................................................231

    Twitter 1% firehose Source (experimental).........................................................235

    Kafka Source........................................................................................................235

    NetCat TCP Source..............................................................................................241

    NetCat UDP Source..............................................................................................242

    Sequence Generator Source.................................................................................242

    Syslog Sources.....................................................................................................243

    Syslog TCP Source.....................................................................................243

    Multiport Syslog TCP Source.....................................................................246

    Syslog UDP Source.....................................................................................249

    HTTP Source........................................................................................................251

    JSONHandler..............................................................................................253

    BlobHandler................................................................................................254

    Stress Source........................................................................................................254

    Legacy Sources....................................................................................................254

    Avro Legacy Source....................................................................................255

    Thrift Legacy Source..................................................................................255

    Custom Source.....................................................................................................256

    Scribe Source........................................................................................................256

    HBase MapReduce Examples.................................................................................................257

    • 55.1. HBase MapReduce Read Example.......................................................................257

    • 55.2. HBase MapReduce Read/Write Example.............................................................258

    • 55.3. HBase MapReduce Read/Write Example With Multi-Table Output....................259

    • 55.4. HBase MapReduce Summary to HBase Example................................................259

    ZooKeeper: A Distributed Coordination Service for Distributed Applications......................260

    Design Goals..................................................................................................................260

    Data model and the hierarchical namespace..................................................................261

    Nodes and ephemeral nodes..........................................................................................261

    Conditional updates and watches...................................................................................262

    Guarantees.....................................................................................................................262

    Simple API.....................................................................................................................262

    Implementation..............................................................................................................262

    Uses...............................................................................................................................263

    Performance...................................................................................................................263

    Reliability......................................................................................................................264

    Apache Hadoop YARN.....................................................................................................................265

    Overview.................................................................................................................................267

    User Commands......................................................................................................................267

    application......................................................................................................................267

    applicationattempt..........................................................................................................268

    classpath.........................................................................................................................268

    container........................................................................................................................268

    jar...................................................................................................................................268

    logs.................................................................................................................................269

    node...............................................................................................................................269

    queue..............................................................................................................................269

    version...........................................................................................................................269

    envvars...........................................................................................................................269

    Administration Commands.....................................................................................................269

    daemonlog......................................................................................................................270

    nodemanager..................................................................................................................270

    proxyserver....................................................................................................................270

    resourcemanager............................................................................................................270

    rmadmin.........................................................................................................................270

    schedulerconf.................................................................................................................272

    scmadmin.......................................................................................................................272

    sharedcachemanager......................................................................................................273

    Apache Spark vs Hadoop: Introduction to Hadoop

Hadoop is a framework that allows you to first store Big Data in a distributed environment so that you can process it in parallel. There are basically two components in Hadoop:

    HDFS

HDFS creates an abstraction of resources; let me simplify it for you. Similar to virtualization, you can see HDFS logically as a single unit for storing Big Data, but you are actually storing your data across multiple nodes in a distributed fashion. HDFS follows a master-slave architecture: the NameNode is the master node and the DataNodes are slaves.

    NameNode

It is the master daemon that maintains and manages the DataNodes (slave nodes). It records the metadata of all the files stored in the cluster, e.g. the location of stored blocks, file sizes, permissions, hierarchy, etc. It records each and every change that takes place to the file system metadata.

    For example, if a file is deleted in HDFS, the NameNode will immediately record this in the EditLog. It regularly receives a Heartbeat and a block report from all the DataNodes in the cluster to ensure that the DataNodes are live. It keeps a record of all the blocks in HDFS and in which nodes these blocks are stored.
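To make this concrete, here is a minimal Scala sketch (not from the original tutorial) that uses the standard Hadoop FileSystem API to ask the NameNode where the blocks of a file live; the path /data/example.txt is a hypothetical placeholder.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object BlockLocations {
      def main(args: Array[String]): Unit = {
        // Assumes core-site.xml/hdfs-site.xml are on the classpath so fs.defaultFS points at the cluster.
        val fs = FileSystem.get(new Configuration())
        val status = fs.getFileStatus(new Path("/data/example.txt")) // hypothetical file
        // The NameNode answers this from its metadata: which DataNodes hold each block.
        val blocks = fs.getFileBlockLocations(status, 0L, status.getLen)
        blocks.foreach { b =>
          println(s"offset=${b.getOffset} length=${b.getLength} hosts=${b.getHosts.mkString(",")}")
        }
        fs.close()
      }
    }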

    DataNode

These are the slave daemons that run on each slave machine. The actual data is stored on the DataNodes. They are responsible for serving read and write requests from clients. They are also responsible for creating, deleting and replicating blocks based on the decisions taken by the NameNode.

YARN

YARN performs all your processing activities by allocating resources and scheduling tasks. It has two major daemons, i.e. the ResourceManager and the NodeManager.

    ResourceManager

It is a cluster-level component (one per cluster) and runs on the master machine. It manages resources and schedules applications running on top of YARN.

    NodeManager

It is a node-level component (one on each node) and runs on each slave machine. It is responsible for managing containers and monitoring resource utilization in each container. It also takes care of node health and log management. It continuously communicates with the ResourceManager to remain up to date. With HDFS and YARN in place, you can perform parallel processing on HDFS using MapReduce.
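As an illustrative aside (not part of the original tutorial), the Scala sketch below uses the standard YarnClient API to ask the ResourceManager for the node reports that the NodeManagers keep sending it; it assumes yarn-site.xml is available on the classpath.

    import org.apache.hadoop.yarn.api.records.NodeState
    import org.apache.hadoop.yarn.client.api.YarnClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration
    import scala.collection.JavaConverters._

    object ClusterNodes {
      def main(args: Array[String]): Unit = {
        val yarn = YarnClient.createYarnClient()
        yarn.init(new YarnConfiguration()) // picks up yarn-site.xml from the classpath
        yarn.start()
        // The ResourceManager aggregates what every NodeManager reports about its node.
        yarn.getNodeReports(NodeState.RUNNING).asScala.foreach { node =>
          println(s"${node.getNodeId} capability=${node.getCapability} used=${node.getUsed}")
        }
        yarn.stop()
      }
    }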


To learn more about Hadoop, you can go through this Hadoop Tutorial blog. Now that we are all set with the Hadoop introduction, let's move on to the Spark introduction.

Apache Spark vs Hadoop: Introduction to Apache Spark

Apache Spark is a framework for real-time data analytics in a distributed computing environment. It executes in-memory computations to increase the speed of data processing. It is faster for processing large-scale data as it exploits in-memory computations and other optimizations. Therefore, it requires high processing power.

Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. It is an immutable distributed collection of objects. Each dataset in an RDD is divided into logical partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Spark's components make it fast and reliable. Apache Spark has the following components:


1. Spark Core – Spark Core is the base engine for large-scale parallel and distributed data processing. Additional libraries built atop the core allow diverse workloads such as streaming, SQL, and machine learning. It is responsible for memory management and fault recovery, scheduling, distributing and monitoring jobs on a cluster, and interacting with storage systems.

2. Spark Streaming – Spark Streaming is the component of Spark which is used to process real-time streaming data. Thus, it is a useful addition to the core Spark API. It enables high-throughput and fault-tolerant stream processing of live data streams.

3. Spark SQL – Spark SQL is a module in Spark which integrates relational processing with Spark's functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those of you familiar with RDBMSs, Spark SQL will be an easy transition from your earlier tools, and it lets you extend the boundaries of traditional relational data processing.

4. GraphX – GraphX is the Spark API for graphs and graph-parallel computation. It extends the Spark RDD abstraction with the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.

5. MLlib (Machine Learning) – MLlib stands for Machine Learning Library. Spark MLlib is used to perform machine learning in Apache Spark.

As you can see, Spark comes packed with high-level libraries, including support for R, SQL, Python, Scala, Java, etc. These standard libraries make it easier to build complex workflows that integrate seamlessly. On top of this, Spark lets services such as MLlib, GraphX, SQL + DataFrames and Streaming work together, which further increases its capabilities.
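To make the RDD abstraction described above concrete, here is a minimal Scala sketch (illustrative only, run in local mode); the data and partition count are arbitrary.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))
        // An RDD: an immutable collection split into logical partitions across the cluster.
        val numbers = sc.parallelize(1 to 1000000, numSlices = 8)
        println(s"partitions = ${numbers.getNumPartitions}")
        // Transformations are lazy; nothing executes until an action such as count() runs.
        val evenSquares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n)
        println(s"count = ${evenSquares.count()}")
        sc.stop()
      }
    }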

    To learn more about Apache Spark, you can go through this Spark Tutorial blog. Now the ground is all set for Apache Spark vs Hadoop. Let’s move ahead and compare Apache Spark with Hadoop on different parameters to understand their strengths.

    Apache Spark vs Hadoop: Parameters to Compare

    Performance

Spark is fast because it has in-memory processing. It can also use disk for data that doesn't all fit into memory. Spark's in-memory processing delivers near real-time analytics. This makes Spark suitable for credit card processing systems, machine learning, security analytics and Internet of Things sensors.

Hadoop was originally set up to continuously gather data from multiple sources, without worrying about the type of data, and to store it across a distributed environment. MapReduce uses batch processing. MapReduce was never built for real-time processing; the main idea behind YARN is parallel processing over a distributed dataset.

    The problem with comparing the two is that they perform processing differently.

    Ease of Use

Spark comes with user-friendly APIs for Scala, Java, Python, and Spark SQL. Spark SQL is very similar to SQL, so it is easy for SQL developers to learn. Spark also provides an interactive shell where developers can run queries and other actions and get immediate feedback.
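For example, a short Spark SQL sketch in Scala (the dataset and column names are made up for illustration) showing how data can be queried with plain SQL:

    import org.apache.spark.sql.SparkSession

    object SqlSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
        import spark.implicits._
        // Hypothetical data set of (user, amount) purchases.
        val purchases = Seq(("alice", 40.0), ("bob", 15.5), ("alice", 12.0)).toDF("user", "amount")
        purchases.createOrReplaceTempView("purchases")
        // Familiar SQL over a distributed DataFrame.
        spark.sql("SELECT user, SUM(amount) AS total FROM purchases GROUP BY user").show()
        spark.stop()
      }
    }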

You can ingest data into Hadoop easily, either by using the shell or by integrating it with tools like Sqoop and Flume. YARN is just a processing framework, and it can be integrated with multiple tools like Hive and Pig. Hive is a data warehousing component that reads, writes and manages large data sets in a distributed environment using a SQL-like interface. You can go through this Hadoop ecosystem blog to learn about the various tools that can be integrated with Hadoop.

    Costs

Hadoop and Spark are both Apache open source projects, so there's no cost for the software. Cost is only associated with the infrastructure. Both products are designed so that they can run on commodity hardware with a low TCO.

Now you may be wondering how they differ. Storage and processing in Hadoop are disk-based, and Hadoop uses standard amounts of memory. So, with Hadoop we need a lot of disk space as well as faster disks. Hadoop also requires multiple systems to distribute the disk I/O.

Due to Apache Spark's in-memory processing, it requires a lot of memory, but it can deal with a standard speed and amount of disk. Disk space is a relatively inexpensive commodity; since Spark does not use disk I/O for processing, it instead requires large amounts of RAM to execute everything in memory. Thus, a Spark system incurs more cost.

    But yes, one important thing to keep in mind is that Spark’s technology reduces the number of required systems. It needs significantly fewer systems that cost more. So, there will be a point at which Spark reduces the costs per unit of computation even with the additional RAM requirement.

    Data Processing

    There are two types of data processing: Batch Processing & Stream Processing.

    Batch Processing vs Stream Processing

Batch Processing: Batch processing has been crucial to the big data world. In the simplest terms, batch processing means working with high data volumes collected over a period of time. In batch processing, data is first collected, then processed, and the results are produced at a later stage.

Batch processing is an efficient way of processing large, static data sets. Generally, we perform batch processing on archived data sets, for example, calculating the average income of a country or evaluating the change in e-commerce over the last decade.

Stream processing: Stream processing is the current trend in the big data world. The need of the hour is speed and real-time information, which is exactly what stream processing delivers. Because batch processing does not allow businesses to quickly react to changing business needs in real time, stream processing has seen rapid growth in demand.
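As a concrete illustration of stream processing with the stack this guide covers, here is a minimal Spark Streaming word count in Scala; the host, port and 5-second batch interval are placeholders.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StreamSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("stream-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // micro-batches every 5 seconds
        // Hypothetical source: a TCP socket emitting one event per line (e.g. started with `nc -lk 9999`).
        val lines = ssc.socketTextStream("localhost", 9999)
        val counts = lines.flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }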

Now coming back to Apache Spark vs Hadoop, YARN is basically a batch-processing framework. When we submit a job to YARN, it reads data from the cluster, performs the operation and writes the results back to the cluster. Then it again reads the updated data, performs the next operation, writes the results back to the cluster, and so on.

    Spark performs similar operations, but it uses in-memory processing and optimizes the steps. GraphX allows users to view the same data as graphs and as collections. Users can also transform and join graphs with Resilient Distributed Datasets (RDDs).

    Fault Tolerance

Hadoop and Spark both provide fault tolerance, but they take different approaches. For both HDFS and YARN, the master daemons (i.e. the NameNode and the ResourceManager, respectively) check the heartbeats of the slave daemons (i.e. the DataNodes and NodeManagers, respectively). If any slave daemon fails, the master daemon reschedules all pending and in-progress operations to another slave. This method is effective, but even a single failure can significantly increase completion times. Since Hadoop uses commodity hardware, another way in which HDFS ensures fault tolerance is by replicating data.

As we discussed above, RDDs are the building blocks of Apache Spark, and they provide Spark's fault tolerance. They can refer to any dataset present in an external storage system such as HDFS, HBase or a shared filesystem, and they can be operated on in parallel.

RDDs can persist a dataset in memory across operations, which makes future actions up to 10 times faster. If an RDD partition is lost, it is automatically recomputed by replaying the original transformations. This is how Spark provides fault tolerance.
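A small sketch of this behaviour, assuming an existing SparkContext named sc and a hypothetical HDFS path:

    // Assumes `sc` is an existing SparkContext; the path is a placeholder.
    val clicks = sc.textFile("hdfs:///logs/clicks")
    val parsed = clicks.map(_.split(",")).filter(_.length == 3)
    parsed.cache() // keep the dataset in memory so later actions reuse it
    println(parsed.count())
    // The lineage (chain of transformations) is what Spark replays if a partition is lost.
    println(parsed.toDebugString)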

    Security

Hadoop supports Kerberos for authentication, but it is difficult to manage. It also supports third-party providers such as LDAP (Lightweight Directory Access Protocol) for authentication, and encryption is available as well. HDFS supports traditional file permissions as well as access control lists (ACLs). Hadoop provides Service Level Authorization, which guarantees that clients have the right permissions for job submission.

Spark currently supports authentication via a shared secret. Spark can integrate with HDFS and use HDFS ACLs and file-level permissions. Spark can also run on YARN, leveraging Kerberos.
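As a sketch of what the shared-secret setup looks like in application code (the values are illustrative; on YARN the secret is generated automatically):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("secured-app")
      .set("spark.authenticate", "true")              // enable authentication between Spark processes
      .set("spark.authenticate.secret", "change-me")  // shared secret for non-YARN deployments
      .set("spark.network.crypto.enabled", "true")    // optional encryption of internal traffic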

    Use-cases where Hadoop fits best:

Analysing archived data. YARN allows parallel processing of huge amounts of data. Parts of the data are processed in parallel and separately on different DataNodes, and the results are gathered from each NodeManager.

If instant results are not required, Hadoop MapReduce is a good and economical solution for batch processing.

    Use-cases where Spark fits best:

    Real-Time Big Data Analysis:

Real-time data analysis means processing data generated by real-time event streams coming in at the rate of millions of events per second, Twitter data for instance. The strength of Spark lies in its ability to support streaming of data along with distributed processing, a useful combination that delivers near real-time processing of data. MapReduce lacks such an advantage, as it was designed to perform batch, distributed processing on large amounts of data. Real-time data can still be processed with MapReduce, but its speed is nowhere close to that of Spark.

Spark claims to process data 100x faster than MapReduce in memory, and 10x faster when working with disks.


    Graph Processing:

Most graph processing algorithms, like PageRank, perform multiple iterations over the same data, and this requires a message-passing mechanism. We need to program MapReduce explicitly to handle such multiple iterations over the same data. Roughly, it works like this: read data from disk, and after a particular iteration, write the results to HDFS and then read the data back from HDFS for the next iteration. This is very inefficient, since it involves reading and writing data to disk, which means heavy I/O operations and data replication across the cluster for fault tolerance. Also, each MapReduce iteration has very high latency, and the next iteration can begin only after the previous job has completely finished.

Also, message passing requires the scores of neighboring nodes in order to evaluate the score of a particular node. These computations need messages from a node's neighbors (or data across multiple stages of the job), a mechanism that MapReduce lacks. Graph processing tools such as Pregel and GraphLab were designed to address the need for an efficient platform for graph processing algorithms. These tools are fast and scalable, but they are not efficient for the creation and post-processing of these complex multi-stage algorithms.


The introduction of Apache Spark solved these problems to a great extent. Spark contains a graph computation library called GraphX which simplifies our life. In-memory computation along with built-in graph support improves the performance of the algorithm by one or two orders of magnitude over traditional MapReduce programs. Spark uses a combination of Netty and Akka for distributing messages throughout the executors. Let's look at some statistics that depict the performance of the PageRank algorithm using Hadoop and Spark.
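A minimal GraphX PageRank sketch in Scala, assuming an existing SparkContext named sc and a hypothetical edge-list file:

    import org.apache.spark.graphx.GraphLoader

    // Each line of the (hypothetical) input file is "srcId dstId".
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///graphs/followers.txt")
    // Iterate until the per-vertex change drops below the tolerance.
    val ranks = graph.pageRank(tol = 0.0001).vertices
    ranks.sortBy(_._2, ascending = false).take(5).foreach {
      case (vertexId, rank) => println(s"$vertexId -> $rank")
    }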


    Iterative Machine Learning Algorithms:

Almost all machine learning algorithms work iteratively. As we have seen earlier, iterative algorithms involve I/O bottlenecks in MapReduce implementations. MapReduce uses coarse-grained tasks (task-level parallelism) that are too heavy for iterative algorithms. Spark, with the help of Mesos (a distributed system kernel), caches the intermediate dataset after each iteration and runs multiple iterations on this cached dataset, which reduces I/O and helps run the algorithm faster in a fault-tolerant manner.

Spark has a built-in scalable machine learning library called MLlib which contains high-quality algorithms that leverage iteration and yield better results than the one-pass approximations sometimes used on MapReduce.


Fast data processing. As we know, Spark allows in-memory processing. As a result, Spark is up to 100 times faster for data in RAM and up to 10 times faster for data in storage.

Iterative processing. Spark's RDDs allow performing several map operations in memory, with no need to write interim data sets to disk.

Near real-time processing. Spark is an excellent tool to provide immediate business insights. This is the reason why Spark is used in credit card streaming systems.

    Open source datacenter computing with Apache Mesos

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications or frameworks. Mesos is open source software originally developed at the University of California at Berkeley. It sits between the application layer and the operating system and makes it easier to deploy and manage applications in large-scale clustered environments. It can run many applications on a dynamically shared pool of nodes. Prominent users of Mesos include Twitter, Airbnb, MediaCrossing, Xogito and Categorize.

Mesos leverages features of the modern kernel—"cgroups" in Linux, "zones" in Solaris—to provide isolation for CPU, memory, I/O, file system, rack locality, etc. The big idea is to pool a large collection of heterogeneous resources. Mesos introduces a distributed two-level scheduling mechanism called resource offers. Mesos decides how many resources to offer each framework, while frameworks decide which resources to accept and which computations to run on them. It is a thin resource sharing layer that enables fine-grained sharing across diverse cluster computing frameworks, by giving frameworks a common interface for accessing cluster resources. The idea is to deploy multiple distributed systems to a shared pool of nodes in order to increase resource utilization. A lot of modern workloads and frameworks can run on Mesos, including Hadoop, Memcached, Ruby on Rails, Storm, JBoss Data Grid, MPI, Spark and Node.js, as well as various web servers, databases and application servers.


    Node abstraction in Apache Mesos (source)

In a similar way that a PC operating system manages access to the resources on a desktop computer, Mesos ensures applications have access to the resources they need in a cluster. Instead of setting up numerous server clusters for different parts of an application, Mesos allows you to share a pool of servers that can all run different parts of your application without them interfering with each other, and with the ability to dynamically allocate resources across the cluster as needed. That means it could easily switch resources away from framework1 (for example, doing big-data analysis) and allocate them to framework2 (for example, a web server) if there is heavy network traffic. It also reduces a lot of the manual steps in deploying applications and can shift workloads around automatically to provide fault tolerance and keep utilization rates high.


    Resource sharing across the cluster increases throughput and utilization (source)

Mesos is essentially a data center kernel—which means it's the software that actually isolates the running workloads from each other. It still needs additional tooling to let engineers get their workloads running on the system and to manage when those jobs actually run. Otherwise, some workloads might consume all the resources, or important workloads might get bumped by less-important workloads that happen to require more resources. Hence Mesos needs more than just a kernel—the Chronos scheduler, a cron replacement for automatically starting and stopping services (and handling failures) that runs on top of Mesos. The other part of the Mesos ecosystem is Marathon, which provides an API for starting, stopping and scaling services (and Chronos could be one of those services).


Workloads in Chronos and Marathon (source)

Architecture

    Mesos consists of a master process that manages slave daemons running on each cluster node, and frameworks that run tasks on these slaves. The master implements fine-grained sharing across frameworks using resource offers. Each resource offer is a list of free resources on multiple slaves. The master decides how many resources to offer to each framework according to an organizational policy, such as fair sharing or priority. To support a diverse set of inter-framework allocation policies, Mesos lets organizations define their own policies via a pluggable allocation module.


    Mesos architecture with two running frameworks (source)

    Each framework running on Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks. While the master determines how many resources to offer to each framework, the frameworks' schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them.

    Mesos architecture with two running frameworks <a href=( source ) Each framework running on Mesos consists of two components: a scheduler that registers with the master to be offered resources, and an executor process that is launched on slave nodes to run the framework's tasks. While the master determines how many resources to offer to each framework, the frameworks' schedulers select which of the offered resources to use. When a framework accepts offered resources, it passes Mesos a description of the tasks it wants to launch on them. Framework scheduling in Mesos ( source ) The figure above shows an example of how a framework gets scheduled to run tasks. In step one, slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered all available resources. In step two, the master sends a resource offer describing these resources to framework 1. In step three, the framework's scheduler replies to the master with information about two tasks to run on the slave, using 2 CPUs; 1 GB RAM for the first task, and 1 CPUs; 2 GB RAM for the second task. Finally, in step four, the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with dotted borders). Because 1 CPU and 1 GB of RAM are still free, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free. While the thin interface provided by Mesos allows it to scale and allows the frameworks to evolve independently. A framework will reject the offers that do not satisfy its constraints and accept the ones that do. In particular, we have found that a simple policy called delay scheduling, in which frameworks wait for a limited time to acquire nodes storing the input data, yields nearly optimal data locality. Mesos features  Fault-tolerant replicated master using ZooKeeper  Scalability to thousands of nodes " id="pdf-obj-20-9" src="pdf-obj-20-9.jpg">

    Framework scheduling in Mesos (source)

The figure above shows an example of how a framework gets scheduled to run tasks. In step one, slave 1 reports to the master that it has 4 CPUs and 4 GB of memory free. The master then invokes the allocation module, which tells it that framework 1 should be offered all available resources. In step two, the master sends a resource offer describing these resources to framework 1. In step three, the framework's scheduler replies to the master with information about two tasks to run on the slave, using 2 CPUs and 1 GB RAM for the first task, and 1 CPU and 2 GB RAM for the second task. Finally, in step four, the master sends the tasks to the slave, which allocates appropriate resources to the framework's executor, which in turn launches the two tasks (depicted with dotted borders). Because 1 CPU and 1 GB of RAM are still free, the allocation module may now offer them to framework 2. In addition, this resource offer process repeats when tasks finish and new resources become free.

The thin interface provided by Mesos allows it to scale and allows the frameworks to evolve independently. A framework will reject the offers that do not satisfy its constraints and accept the ones that do. In particular, a simple policy called delay scheduling, in which frameworks wait for a limited time to acquire nodes storing their input data, has been found to yield nearly optimal data locality.
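To make the offer cycle concrete, here is a condensed, illustrative sketch of a framework scheduler in Python. It assumes the classic (now deprecated) mesos.interface/mesos.native bindings; the scheduler name, task command and master address are made up for illustration, and details vary by Mesos version.

# Illustrative sketch of the resource-offer cycle (classic Python bindings, not a complete framework).
from mesos.interface import Scheduler, mesos_pb2
from mesos.native import MesosSchedulerDriver

class EchoScheduler(Scheduler):
    def resourceOffers(self, driver, offers):
        for offer in offers:
            # Inspect the offered resources before deciding whether to accept.
            cpus = sum(r.scalar.value for r in offer.resources if r.name == "cpus")
            if cpus < 1:
                driver.declineOffer(offer.id)      # not enough for our task
                continue

            task = mesos_pb2.TaskInfo()
            task.task_id.value = "echo-task-1"
            task.slave_id.value = offer.slave_id.value
            task.name = "echo"
            task.command.value = "echo hello from mesos"

            res = task.resources.add()
            res.name = "cpus"
            res.type = mesos_pb2.Value.SCALAR
            res.scalar.value = 1

            driver.launchTasks(offer.id, [task])   # accept the offer with one task

if __name__ == "__main__":
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""                            # let Mesos fill in the current user
    framework.name = "echo-framework"
    MesosSchedulerDriver(EchoScheduler(), framework, "zk://localhost:2181/mesos").run()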

    Mesos features

    Fault-tolerant replicated master using ZooKeeper

    Scalability to thousands of nodes

    Isolation between tasks with Linux containers

    Multi-resource scheduling (memory and CPU aware)

    Java, Python and C++ APIs for developing new parallel applications

    Web UI for viewing cluster state

    There are a number of software projects built on top of Apache Mesos:

    DevOps tooling

Vamp is a deployment and workflow tool for container orchestration systems, including Mesos/Marathon. It brings canary releasing, A/B testing, auto scaling and self healing through a web UI, CLI and REST API.

    Long Running Services

    Aurora is a service scheduler that runs on top of Mesos, enabling you to run long-running services that take advantage of Mesos' scalability, fault-tolerance, and resource isolation.

    Marathon is a private PaaS built on Mesos. It automatically handles hardware or software failures and ensures that an app is "always on."

    Singularity is a scheduler (HTTP API and web interface) for running Mesos tasks: long running processes, one-off tasks, and scheduled jobs.

    SSSP is a simple web application that provides a white-label "Megaupload" for storing and sharing files in S3.

    Big Data Processing

    Cray Chapel is a productive parallel programming language. The Chapel Mesos scheduler lets you run Chapel programs on Mesos.

    Dpark is a Python clone of Spark, a MapReduce-like framework written in Python, running on Mesos.

    Exelixi is a distributed framework for running genetic algorithms at scale.

    Flink is an open source platform for distributed stream and batch data processing.

Hadoop: Running Hadoop on Mesos distributes MapReduce jobs efficiently across an entire cluster.

    Hama is a distributed computing framework based on Bulk Synchronous Parallel computing techniques for massive scientific computations e.g., matrix, graph and network algorithms.

    MPI is a message-passing system designed to function on a wide variety of parallel computers.

    Spark is a fast and general-purpose cluster computing system which makes parallel jobs easy to write.

    Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing.

    Batch Scheduling

    Chronos is a distributed job scheduler that supports complex job topologies. It can be used as a more fault-tolerant replacement for cron.

    Jenkins is a continuous integration server. The mesos-jenkins plugin allows it to dynamically launch workers on a Mesos cluster depending on the workload.

JobServer is a distributed job scheduler and processor which allows developers to build custom batch processing Tasklets using a point-and-click web UI.

    Torque is a distributed resource manager providing control over batch jobs and distributed compute nodes.

    Data Storage

Cassandra is a highly available distributed database. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.

ElasticSearch is a distributed search engine. Mesos makes it easy to run and scale.

    Hypertable is a high performance, scalable, distributed storage and processing system for structured and unstructured data.

    Aurora

    The Aurora Mesos Framework

    Cloud Watcher

    Article from ADMIN 28/2015

    By Udo Seidel

    Apache Aurora is a service daemon built for the data center.

    Apache's Mesos project is an important building block for a new generation of cloud applications. The goal of the Mesos project is to let the developer "program against the datacenter, like it's a single pool of resources. Mesos abstracts CPU, memory, storage, and other compute resources away from machines (physical or virtual), enabling fault-tolerant and elastic distributed systems to easily be built and run effectively" [1].

An important tool that has evolved out of the Mesos environment is Aurora, which recently graduated from the Apache Incubator and is now a full Apache project (Figure 1). According to the project website, "Aurora runs applications and services across a shared pool of machines, and is responsible for keeping them running, forever. When machines experience failure, Aurora intelligently reschedules those jobs onto healthy machines" [2]. In other words, Aurora is a little like an init tool for data centers and cloud-based virtual environments.


Figure 1: Aurora is a Mesos framework; Mesos is in turn an Apache project.

The Aurora project has many fathers: In addition to its kinship with Apache and Mesos, Aurora was initially supported by Twitter, and Google was at least indirectly an inspiration for the project. The beginnings of Aurora date back to 2010. Bill Farner, a member of the research team at Twitter, launched a project to facilitate the operation of Twitter's infrastructure. The IT landscape of the short message service had grown considerably at that time. The operations team was faced with thousands of computers and hundreds of applications. Added to this was the constant rollout of new software versions.

    Bill Farner had previously worked at Google and had some experience working with Google's Borg cluster manager [3]. In the early years, development took place only within Twitter and behind closed doors. However, more and more employees contributed to the development, and Aurora became increasingly important for the various Twitter services. Eventually, the opening of the project in the direction of the open source community was a natural step to maintain such a fast-growing software project. Aurora has been part of the Apache family since 2013.

    What is Singularity

Singularity is a platform that enables deploying and running services and scheduled jobs in the cloud or in data centers. Combined with Apache Mesos, it provides efficient management of the underlying process life cycles and effective use of cluster resources.


    Singularity is an essential part of the HubSpot Platform and is ideal for deploying micro-services. It is optimized to manage thousands of concurrently running processes in hundreds of servers.

    How it Works

    Singularity is an Apache Mesos framework. It runs as a task scheduler on top of Mesos Clusters taking advantage of Apache Mesos' scalability, fault-tolerance, and resource isolation. Apache Mesos is a cluster manager that simplifies the complexity of running different types of applications on a shared pool of servers. In Mesos terminology, Mesos applications that use the Mesos APIs to schedule tasks in a cluster are called frameworks.


There are different types of frameworks, and most frameworks concentrate on a specific type of task (e.g. long-running vs. scheduled cron-type jobs) or on supporting a specific domain and relevant technology (e.g. data processing with Hadoop jobs vs. data processing with Spark).

Singularity tries to be more generic by combining long-running tasks and job scheduling functionality in one framework to support many of the common process types that developers need to deploy every day to build modern web applications and services. While Mesos allows multiple frameworks to run in parallel, having a consistent and uniform set of abstractions and APIs for handling deployments across the organization greatly simplifies the PaaS architecture. It also reduces the amount of framework boilerplate that must be supported, as all Mesos frameworks must keep state, handle failures, and properly interact with the Mesos APIs. These are the main reasons HubSpot engineers initiated the development of a new framework. As of this moment, Singularity supports the following process types:

Web Services. These are long-running processes which expose an API and may run with multiple load-balanced instances. Singularity supports automatic, configurable health checking of the instances at the process and API endpoint level, as well as load balancing. Singularity will automatically restart these tasks when they fail or exit.

Workers. These are long-running processes, similar to web services, but they do not expose an API. Queue consumers are a common type of worker process. Singularity does automatic health checking, cool-down and restart of worker instances.

Scheduled (CRON-type) Jobs. These are tasks that run periodically according to a provided CRON schedule. Scheduled jobs will not be restarted when they fail unless instructed to do so; Singularity will run them again on the next scheduling cycle.

On-Demand Processes. These are manually run processes that will be deployed and ready to run, but Singularity will not run them automatically. Users can start them through an API call or using the Singularity Web UI, which allows them to pass command-line parameters on demand.

    Singularity Components

Mesos frameworks have two major components: a scheduler component that registers with the Mesos master to be offered resources, and an executor component that is launched on cluster slave nodes by the Mesos slave process to run the framework's tasks.

    The Mesos master determines how many resources are offered to each framework and the framework scheduler selects which of the offered resources to use to run the required tasks. Mesos slaves do not directly run the tasks but delegate the running to the appropriate executor that has knowledge about the nature of the allocated task and the special handling that might be required.


    As depicted in the figure, Singularity implements the two basic framework components as well as a few more to solve common complex / tedious problems such as task cleanup and log tailing / archiving without requiring developers to implement it for each task they want to run:

    Singularity Scheduler

    The scheduler is the core of Singularity: a DropWizard API that implements the Mesos Scheduler Driver. The scheduler matches client deploy requests to Mesos resource offers and acts as a web service offering a JSON REST API for accepting deploy requests.

    Clients use the Singularity API to register the type of deployable item that they want to run (web service, worker, cron job) and the corresponding runtime settings (cron schedule, # of instances, whether instances are load balanced, rack awareness, etc.).

    After a deployable item (a request, in API terms) has been registered, clients can post Deploy requests for that item. Deploy requests contain information about the command to run, the executor to use, executor specific data, required cpu, memory and port resources, health check URLs and a variety of other runtime configuration options. The Singularity scheduler will then attempt to match Mesos offers (which in turn include resources as well as rack information and what else is running on slave hosts) with its list of Deploy requests that have yet to be fulfilled.
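As a rough illustration of that flow (the endpoint paths and field names here are indicative only and should be checked against the Singularity API documentation for your version; the host is hypothetical), a request and a deploy might be posted like this:

# Illustrative only: register a deployable item, then post a deploy for it.
import requests

SINGULARITY = "http://singularity.example.com/singularity"   # hypothetical base URL

# 1. Register the deployable item (a "request" in API terms).
requests.post(SINGULARITY + "/api/requests", json={
    "id": "my-web-service",
    "requestType": "SERVICE",        # e.g. SERVICE, WORKER, SCHEDULED, ON_DEMAND
    "instances": 2,
})

# 2. Post a deploy for that request.
requests.post(SINGULARITY + "/api/deploys", json={
    "deploy": {
        "requestId": "my-web-service",
        "id": "deploy-1",
        "command": "java -jar service.jar",
        "resources": {"cpus": 1, "memoryMb": 512, "numPorts": 1},
        "healthcheckUri": "/health",
    },
})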

Rollback of failed deploys, health checking and load balancing are also part of the advanced functionality the Singularity scheduler offers. A new deploy for a long-running service will run as shown in the diagram below.


When a service or worker instance fails in a new deploy, the Singularity scheduler will roll back all instances to the version running before the deploy, keeping the deploys always consistent. After the scheduler makes sure that a Mesos task (corresponding to a service instance) has entered the TASK_RUNNING state, it will use the provided health check URL and the specified health check timeout settings to perform health checks. If health checks go well, the next step is to perform load balancing of service instances. Load balancing is attempted only if the corresponding deployable item has been defined to be loadBalanced. To perform load balancing between service instances, Singularity supports a rich integration with a specific Load Balancer API. Singularity will post requests to the Load Balancer API to add the newly deployed service instances and to remove those that were previously running. Check Integration with Load Balancers to learn more. Singularity also provides generic webhooks which allow third party integrations, which can be registered to follow request, deploy, or task updates.

    Slave Placement

    When matching a Mesos resource offer to a deploy, Singularity can use one of several strategies to determine if the host in the offer is appropriate for the task in question, or SlavePlacement in Singularity terms. Available placement strategies are:

GREEDY: uses whatever slaves are available.

SEPARATE_BY_DEPLOY/SEPARATE: ensures no two instances/tasks of the same request and deploy id are ever placed on the same slave.

SEPARATE_BY_REQUEST: ensures no two tasks belonging to the same request (regardless of deploy id) are placed on the same host.

OPTIMISTIC: attempts to spread out tasks, but may schedule some on the same slave.

SPREAD_ALL_SLAVES: ensures the task is running on every slave. Same behaviour as SEPARATE_BY_DEPLOY, but with autoscaling of the Request to keep the number of instances equal to the number of slaves.

    Slave placement can also be impacted by slave attributes. There are three scenarios that Singularity supports:

    • 1. Specific Slaves -> For a certain request, only run it on slaves with matching attributes - In this case, you would specify requiredSlaveAttributes in the json for your request, and the tasks for that request would only be scheduled on slaves that have all of those attributes.

    • 2. Reserved Slaves -> Reserve a slave for specific requests, only run those requests on those slaves - In your Singularity config, specify the reserveSlavesWithAttributes field. Singularity will then only schedule tasks on slaves with those attributes if the request's required attributes also match those.

    • 3. Test Group of Slaves -> Reserve a slave for specific requests, but don't restrict the requests to that slave - In your Singularity config, specify the reserveSlavesWithAttributes field as in the previous example. But, in the request json, specify the allowedSlaveAttributes field. Then, the request will be allowed to run elsewhere in the cluster, but will also have the matching attributes to run on the reserved slave.
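The three scenarios above could look roughly like the following sketch (shapes inferred from the field names mentioned in the text; the exact schema may differ between Singularity versions):

# Illustrative only: attribute-based placement for a hypothetical "gpu" slave attribute.
request_json = {
    "id": "gpu-worker",
    "requestType": "WORKER",
    "requiredSlaveAttributes": {"hardware": "gpu"},    # scenario 1: only matching slaves
    # "allowedSlaveAttributes": {"hardware": "gpu"},   # scenario 3: may also run elsewhere
}

# Scheduler-side Singularity config (scenario 2): reserve matching slaves for such requests.
singularity_config = {
    "reserveSlavesWithAttributes": {"hardware": "gpu"},
}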

    Singularity Scheduler Dependencies

    The Singularity scheduler uses ZooKeeper as a distributed replication log to maintain state and keep track of registered deployable items, the active deploys for these items and the running tasks that fulfill the deploys. As shown in the drawing, the same ZooKeeper quorum utilized by Mesos masters and slaves can be reused for Singularity.

Since ZooKeeper is not meant to handle large quantities of data, Singularity can optionally (and this is recommended for any real usage) utilize a database (MySQL or PostgreSQL) to periodically offload historical data from ZooKeeper and keep records of deployable item changes, deploy request history, as well as the history of all launched tasks.

In production environments Singularity should be run in high-availability mode by running multiple instances of the Singularity Scheduler component. As depicted in the drawing, only one instance is active at any time, with all the other instances waiting in stand-by mode. While only one instance is registered for receiving resource offers, all instances can process API requests. Singularity uses ZooKeeper to perform leader election and maintain a single leader. Because all instances have the ability to change state, Singularity internally uses queues which are consumed by the Singularity leader to make calls to Mesos.

    Singularity UI

    The Singularity UI is a single page static web application served from the Singularity Scheduler that uses the Singularity API to present information about deployed items.

    It is a fully-featured application which provides historical as well as active task information. It allows users to view task logs and interact directly with tasks and deploy requests.

    Optional Slave Components

    Singularity Executor

    Users can opt for the default Mesos executor, the Docker container executor, or the Singularity executor. Like the other executors, the Singularity executor is executed directly by the Mesos slave process for each task that executes on a slave. The requests sent to the executor contain all the required data for setting up the running environment like the command to execute, environment variables, executable artifact URLs, application configuration files, etc. The Singularity executor provides some advanced (configurable) features:

Custom Fetcher. Downloads and extracts artifacts over HTTP, directly from S3, or using the S3 Downloader component.

Log Rotation. Sets up logrotate for specified log files inside the task directory.

Task Sandbox Cleanup. Can clean up large (uninteresting) application files but leave important logs and debugging files.

Graceful Task Killing. Can send SIGTERM and escalate to SIGKILL for graceful shutdown of tasks.

Environment Setup and Runner Script. Provides for setup of environment variables and a corresponding bash script to run the command.

    S3 Uploader

    The S3 uploader reliably uploads rotated task log files to S3 for archiving. These logs can then be downloaded directly from the Singularity UI.

    S3 Downloader

The S3 downloader downloads and extracts artifacts from S3 outside of the context of an executor. This is useful to avoid using the memory (page cache) of the executor process, and it also downloads from S3 without pre-generating expiring URIs (a bad idea inside Mesos).

    Singularity Executor Cleanup

While the Mesos slave has the ability to garbage collect tasks, the cleanup process maintains consistent state with other Singularity services (like the uploader and log watcher). This is a utility that is meant to run on each slave via cron (e.g. once per hour) and will clean the sandbox of finished or failed tasks that the Singularity executor failed to clean.

    Log Watcher

    The log watcher is an experimental service that provides log tailing and streaming / forwarding of executor task log lines to third party services like fluentd or logstash to support real-time log viewing and searching.

    OOM Killer

    The Out of Memory process Killer is an experimental service that replaces the default memory limit checking supported by Mesos and Linux Kernel CGROUPS. The intention of the OOM Killer is to provide more consistent task notification when tasks are killed. It is also an attempt to workaround Linux Kernel issues with CGROUP OOMs and also prevents the CGROUP OOM killer from killing tasks due to page cache overages.

    Chapel: Productive Parallel Programming

    Parallel computing has resulted in numerous significant advances in science and technology over the past several decades. However, in spite of these successes, the fact remains that only a small fraction of the world’s programmers are capable of effectively using the parallel processing languages and programming models employed within HPC and mainstream computing. Chapel is an emerging parallel language being developed at Cray Inc. with the goal of addressing this issue and making parallel programming far more productive and generally accessible.

    Chapel originated from the DARPA High Productivity Computing Systems (HPCS) program, which challenged vendors like Cray to improve the productivity of high-end computing systems. Engineers at Cray noted that the HPC community was hungry for alternative parallel processing languages and developed Chapel as part of our response. The reaction from HPC users so far has been very encouraging—most would be excited to have the opportunity to use Chapel once it becomes production-grade.

    Chapel Overview

    Though it would be impossible to give a thorough introduction to Chapel in the space of this article, the following characterizations of the language should serve to give an idea of what we are pursuing:

General Parallelism: Chapel has the goal of supporting any parallel algorithm you can conceive of, on any parallel hardware you want to target. In particular, you should never hit a point where you think "Well, that was fun while it lasted, but now that I want to do x, I'd better go back to MPI."

Separation of Parallelism and Locality: Chapel supports distinct concepts for describing parallelism ("These things should run concurrently") and locality ("This should be placed here; that should be placed over there"). This is in sharp contrast to conventional approaches that either conflate the two concepts or ignore locality altogether.

Multiresolution Design: Chapel is designed to support programming at higher or lower levels, as required by the programmer. Moreover, higher-level features—like data distributions or parallel loop schedules—may be specified by advanced programmers within the language.

Productivity Features: In addition to all of its features designed for supercomputers, Chapel also includes a number of sequential language features designed for productive programming. Examples include type inference, iterator functions, object-oriented programming, and a rich set of array types. The result combines productivity features as in Python™, Matlab®, or Java™ software with optimization opportunities as in Fortran or C.

    Chapel’s implementation is also worth characterizing:

Open Source: Since its outset, Chapel has been developed in an open-source manner, with collaboration from academics, computing labs, and industry. Chapel is released under a BSD license in order to minimize barriers to its use.

Portable: While Cray machines are an obvious target for Chapel, the language was designed to be very portable. Today, Chapel runs on virtually any architecture supporting a C compiler, a UNIX-like environment, POSIX threads, and MPI or UDP.

Optimized for Crays: Though designed for portability, the Chapel implementation has also been optimized to take advantage of Cray-specific features.

    Chapel: Today and Tomorrow

    While the HPCS project that spawned Chapel concluded successfully at the end of 2012, the Chapel project remains active and ongoing. The Chapel prototype and demonstrations developed under HPCS were considered compelling enough to users that Cray plans to continue the project over the next several years. Current priorities include:

Performance Optimizations: To date, the implementation effort has focused primarily on correctness over performance. Improving performance is typically considered the number one priority for growing the Chapel community.

Support for Accelerators: Emerging compute nodes are increasingly likely to contain accelerators like GPUs or Intel® MIC chips. We are currently working on extending our locality abstractions to better handle such architectures.

Interoperability: Beefing up Chapel's current interoperability features is a priority, to permit users to reuse existing libraries or gradually transition applications to Chapel.

Feature Improvements: Having completed HPCS, we now have the opportunity to go back and refine features that have not received sufficient attention to date. In many cases, these improvements have been motivated by feedback from early users.

Outreach and Evangelism: While improving Chapel, we are seeking out ways to grow Chapel's user base, particularly outside of the traditional HPC sphere.

Research Efforts: In addition to hardening the implementation, a number of interesting research directions remain for Chapel, including resilience mechanisms, applicability to "big data" analytic computations, energy-aware computing, and support for domain-specific languages.

    Dpark

DPark is a Python clone of Spark: a MapReduce-like computing framework supporting iterative computation.

    Installation

    ## Due to the use of C extensions, some libraries need to be installed first.

$ sudo apt-get install libtool pkg-config build-essential autoconf automake
$ sudo apt-get install python-dev
$ sudo apt-get install libzmq-dev

    ## Then just pip install dpark (``sudo`` maybe needed if you encounter permission problem).

    $ pip install dpark

    Example

    for word counting (wc.py):

from dpark import DparkContext

ctx = DparkContext()
# Read the input file and split each line into (word, 1) pairs.
file = ctx.textFile("/tmp/words.txt")
words = file.flatMap(lambda x: x.split()).map(lambda x: (x, 1))
# Sum the counts per word and collect the result as a dict.
wc = words.reduceByKey(lambda x, y: x + y).collectAsMap()
print wc

    This script can run locally or on a Mesos cluster without any modification, just using different command-line arguments:

$ python wc.py
$ python wc.py -m process
$ python wc.py -m host[:port]

    See examples/ for more use cases.

    Configuration

    DPark can run with Mesos 0.9 or higher.

    If a $MESOS_MASTER environment variable is set, you can use a shortcut and run DPark with Mesos just by typing

    $ python wc.py -m mesos

    $MESOS_MASTER can be any scheme of Mesos master, such as

    $ export MESOS_MASTER=zk://zk1:2181,zk2:2181,zk3:2181/mesos_master

In order to speed up shuffling, you should deploy Nginx on port 5055 for accessing data in DPARK_WORK_DIR (default is /tmp/dpark), such as:

server {
    listen 5055;
    server_name localhost;
    root /tmp/dpark/;
}

    UI

The web UI shows two DAGs:

1. Stage graph: a stage is a running unit containing a set of tasks, each of which runs the same operations on one split of an RDD.
2. API call-site graph.

    UI when running

Just open the URL from the log, e.g. "start listening on Web UI http://server_01:40812".

    UI after running

1. Before running, configure LOGHUB & LOGHUB_PATH_FORMAT in dpark.conf and pre-create LOGHUB_DIR.
2. Get the loghub directory from a log line like "logging/prof to LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8754", which includes the Mesos framework id.
3. Run dpark_web.py -p 9999 -l LOGHUB_DIR/2018/09/27/16/b2e3349b-9858-4153-b491-80699c757485-8728/ (dpark_web.py is in tools/).

    UI examples for features

Showing shared shuffle map output:

rdd = DparkContext().makeRDD([(1, 1)]).map(m).groupByKey()
rdd.map(m).collect()
rdd.map(m).collect()


Nodes are combined iff they have the same lineage, forming a logical tree inside the stage; each node then contains a PIPELINE of RDDs.

rdd1 = get_rdd()
rdd2 = dc.union([get_rdd() for i in range(2)])
rdd3 = get_rdd().groupByKey()
dc.union([rdd1, rdd2, rdd3]).collect()

    Exelixi

Exelixi is a distributed framework based on Apache Mesos, mostly implemented in Python using gevent for high-performance concurrency. It is intended to run cluster computing jobs (partitioned batch jobs, which include some messaging) in pure Python. By default, it runs genetic algorithms at scale. However, it can handle a broad range of other problem domains by using the --uow command line option to override the UnitOfWorkFactory class definition. Please see the project wiki for more details, including a tutorial on how to build Mesos-based frameworks.

    Quick Start

To check out the GA on a laptop (with Python 2.7 installed), simply run:

./src/ga.py

Otherwise, to run at scale, the following steps will help you get Exelixi running on Apache Mesos. For help in general with command line options:

./src/exelixi.py -h

    The following instructions are based on using the Elastic Mesos service, which uses Ubuntu Linux servers running on Amazon AWS. Even so, the basic outline of steps shown here apply in general.

First, launch an Apache Mesos cluster. Once you have confirmation that your cluster is running (e.g., Elastic Mesos sends you an email message with a list of masters and slaves), use ssh to log in on any of the masters:

    ssh -A -l ubuntu <master-public-ip>

You must install the Python bindings for Apache Mesos. The default version of Mesos used in this code changes as there are updates to Elastic Mesos, since the tutorials are based on that service. You can check http://mesosphere.io/downloads/ for the latest. If you run Mesos in a different environment, simply make a one-line change to the EGG environment variable in the bin/local_install.sh script. You also need to install the Exelixi source. On the Mesos master, download the master branch of the Exelixi code repo on GitHub and install the required libraries:

wget https://github.com/ceteri/exelixi/archive/master.zip ; \
unzip master.zip ; \
cd exelixi-master ; \
./bin/local_install.sh

    If you've customized the code by forking your own GitHub code repo, then substitute that download URL instead. Alternatively, if you've customized by subclassing the uow.UnitOfWorkFactory default GA, then place that Python source file into the src/ subdirectory. Next, run the installation command on the master, to set up each of the slaves:

    ./src/exelixi.py -n localhost:5050 | ./bin/install.sh

    Now launch the Framework, which in turn launches the worker services remotely on slave nodes. In the following case, it runs workers on two slave nodes:

    ./src/exelixi.py -m localhost:5050 -w 2

Once everything has been set up successfully, the log file exelixi.log will show a line:

    all worker services launched and init tasks completed

    Flink

    Apache Flink is an open source streaming platform which provides you tremendous capabilities to run real-time data processing pipelines in a fault-tolerant way at a scale of millions of events per second.

    The key point is that it does all this using the minimum possible resources at single millisecond latencies. So how does it manage that and what makes it better than other solutions in the same domain?

    Low latency on minimal resources

Flink is based on the DataFlow model, i.e. processing elements as and when they come, rather than processing them in micro-batches (which is what Spark Streaming does).

Micro-batches can contain a huge number of elements, and the resources needed to process those elements at once can be substantial. In the case of a sparse data stream (in which you get only bursts of data at irregular intervals), this becomes a major pain point.

You also don't need to go through the trial and error of configuring the micro-batch size so that the processing time of a batch doesn't exceed its accumulation time. If that happens, the batches start to queue up and eventually all processing comes to a halt.

The dataflow model allows Flink to process millions of records per minute at millisecond latencies on a single machine (this is also due to Flink's managed memory and custom serialisation, but more on that in the next article). Here are some benchmarks.

    Variety of Sources and Sinks

    Flink provides seamless connectivity to a variety of data sources and sinks. Some of these include:

    Apache Cassandra

    Elasticsearch

    Kafka

    RabbitMQ

    Hive

    Fault tolerance

    Flink provides robust fault-tolerance using checkpointing (periodically saving internal state to external sources such as HDFS).

    However, Flink’s checkpointing mechanism can be made incremental (save only the changes and not the whole state) which really reduces the amount of data in HDFS and the I/O duration. The checkpointing overhead is almost negligible which enables users to have large states inside Flink applications.

Flink also provides a high availability setup through ZooKeeper. This is for re-spawning the job in cases where the driver (known as the JobManager in Flink) crashes due to some error.

    High level API

Unlike Apache Storm (which also follows a dataflow model), Flink provides an extremely simple high-level API in the form of Map/Reduce, Filters, Window, GroupBy, Sort and Joins. This gives developers a lot of flexibility and speeds up development when writing new jobs.

    Stateful processing

Sometimes an operation requires some config or data from another source to perform its work. A simple example would be counting the number of records of type Y in a stream X. This counter is known as the state of the operation.

Flink provides a simple API to interact with state, much like you would interact with a Java object. States can be backed by memory, the filesystem or RocksDB, and they are checkpointed and thus fault tolerant. With respect to the above example: in case your application restarts, your counter value will still be preserved.
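As a minimal sketch of that idea (assuming PyFlink's DataStream API; exact imports and signatures vary by Flink version, and the inline data is only for illustration), a checkpointed per-key counter might look like this:

# Illustrative sketch: count records per key with keyed ValueState, checkpointed by Flink.
from pyflink.common import Types
from pyflink.datastream import StreamExecutionEnvironment, KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor

class CountPerKey(KeyedProcessFunction):
    def open(self, runtime_context):
        # Backed by the configured state backend (memory, filesystem or RocksDB).
        self.count = runtime_context.get_state(ValueStateDescriptor("count", Types.LONG()))

    def process_element(self, value, ctx):
        current = (self.count.value() or 0) + 1
        self.count.update(current)
        yield value[0], current

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(10000)   # checkpoint state every 10 seconds

env.from_collection([("Y", 1), ("X", 1), ("Y", 1)]) \
   .key_by(lambda rec: rec[0]) \
   .process(CountPerKey(), output_type=Types.TUPLE([Types.STRING(), Types.LONG()])) \
   .print()

env.execute("stateful-count-sketch")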

    Exactly once processing

Apache Flink provides exactly-once processing, like Kafka 0.11 and above, with minimal overhead and zero dev effort. This is not trivial to do in other streaming solutions such as Spark Streaming and Storm, and is not supported in Apache Samza.

    SQL Support

Like Spark Streaming, Flink also provides a SQL API interface which makes writing a job easier for people from a non-programming background. Flink SQL is maturing day by day and is already being used by companies such as Uber and Alibaba to do analytics on real-time data.

    Environment Support

A Flink job can be run in a distributed system or on a local machine. The program can run on Mesos, YARN and Kubernetes, as well as in standalone mode (e.g. in Docker containers). Since Flink 1.4, Hadoop is not a prerequisite, which opens up a number of possibilities for places to run a Flink job.

    Awesome community

    Flink has a great dev community which allows for frequent new features and bug fixes as well as great tools to ease the developer effort further. Some of these tools are:

    Flink Tensorflow  — Run Tensorflow graphs as a Flink process

    Flink HTM —Anomaly detection in a stream in Flink

Tink  — A temporal graph library built on top of Flink

Flink SQL and Complex Event Processing (CEP) were also initially developed by Alibaba and contributed back to Flink.

    Apache Hama

    Apache Hama is a BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.

This release is the first release as a top-level project. It contains two significant new features (a message compressor and a complete clone of Google's Pregel) and many improvements to computing system performance and durability.

    MPI

    The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners. Open MPI is therefore able to combine the expertise, technologies, and resources from all across the High Performance Computing community in order to build the best MPI library available. Open MPI offers advantages for system and software vendors, application developers and computer science researchers.

    Features implemented or in short-term development for Open MPI include:

Full MPI-3.1 standards conformance
Thread safety and concurrency
Dynamic process spawning
Network and process fault tolerance
Support for network heterogeneity
Single library supports all networks
Run-time instrumentation
Many job schedulers supported
Many OS's supported (32 and 64 bit)
Production quality software
High performance on all platforms
Portable and maintainable
Tunable by installers and end-users
Component-based design, documented APIs
Active, responsive mailing list
Open source license based on the BSD license

Spark vs Other Tools:

    Spark Streaming vs Flink vs Storm vs Kafka Streams vs Samza : Choose Your Stream Processing Framework

According to a recent report by IBM Marketing Cloud, "90 percent of the data in the world today has been created in the last two years alone, creating 2.5 quintillion bytes of data every day — and with new devices, sensors and technologies emerging, the data growth rate will likely accelerate even more". Technically this means our big data processing world is going to be more complex and more challenging. And a lot of use cases (e.g. mobile app ads, fraud detection, cab booking, patient monitoring, etc.) need data processing in real time, as and when data arrives, in order to make quick, actionable decisions. This is why distributed stream processing has become very popular in the big data world.

Today there are a number of open source streaming frameworks available. Interestingly, almost all of them are quite new and have been developed in the last few years. So it is quite easy for a newcomer to get confused in understanding and differentiating among streaming frameworks. In this post I will first talk about types and aspects of stream processing in general, and then compare the most popular open source streaming frameworks: Flink, Spark Streaming, Storm and Kafka Streams. I will try to explain how they work (briefly), their use cases, strengths, limitations, similarities and differences.

What is Streaming/Stream Processing:

The most elegant definition I found is: a type of data processing engine that is designed with infinite data sets in mind. Nothing more.

Unlike batch processing, where data is bounded with a start and an end in a job and the job finishes after processing that finite data, streaming is meant for processing unbounded data coming in continuously in real time for days, months, years and, effectively, forever. Since such an application is meant to be always up and running, it is hard to implement and harder to maintain.

    Important Aspects of Stream Processing:

    There are some important characteristics and terms associated with Stream processing which we should be aware of in order to understand strengths and limitations of any Streaming framework :

Delivery Guarantees:

This refers to the guarantee that a particular incoming record in a streaming engine will be processed, no matter what. It can be either At-least-once (the record will be processed at least one time, even in case of failures), At-most-once (the record may not be processed in case of failures) or Exactly-once (the record will be processed exactly one time, even in case of failures). Obviously, exactly-once is desirable, but it is hard to achieve in distributed systems and comes with performance trade-offs.

Fault Tolerance:

In case of failures like node failures, network failures, etc., the framework should be able to recover and should start processing again from the point where it left off. This is achieved by checkpointing the streaming state to some persistent storage from time to time, e.g. checkpointing Kafka offsets to ZooKeeper after getting a record from Kafka and processing it.
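A minimal sketch of that "process first, then checkpoint the offset" pattern, using the kafka-python client (the topic name and the handle() function are hypothetical; modern clients commit offsets to Kafka itself rather than to ZooKeeper, but the recovery idea is the same):

# Illustrative sketch: commit the offset only after the record has been processed,
# so that after a crash processing resumes from the last committed offset (at-least-once).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                           # hypothetical topic
    bootstrap_servers="localhost:9092",
    group_id="stream-processor",
    enable_auto_commit=False,           # we checkpoint manually, after processing
)

for record in consumer:
    handle(record.value)                # hypothetical processing function
    consumer.commit()                   # checkpoint the offset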

State Management:

In the case of stateful processing requirements, where we need to maintain some state (e.g. counts of each distinct word seen in records), the framework should provide some mechanism to preserve and update state information.

Performance:

This includes latency (how soon a record can be processed), throughput (records processed per second) and scalability. Latency should be as low as possible while throughput should be as high as possible. It is difficult to get both at the same time.

Advanced Features: Event Time Processing, Watermarks, Windowing

These features are needed if stream processing requirements are complex, for example, processing records based on the time when they were generated at the source (event time processing). To learn more, please read these must-read posts by Tyler Akidau of Google: part1 and part2.

Maturity:

Important from an adoption point of view: it is nice if the framework is already proven and battle-tested at scale by big companies, as you are then more likely to get good community support and help on Stack Overflow.

    Two Types of Stream Processing:

Now, being aware of the terms we just discussed, it is easy to understand that there are two approaches to implementing a streaming framework:

Native Streaming:

Every incoming record is processed as soon as it arrives, without waiting for others. There are some continuously running processes (which we call operators/tasks/bolts depending upon the framework) which run forever, and every record passes through these processes to get processed. Examples: Storm, Flink, Kafka Streams, Samza.

Micro-batching:

Also known as fast batching. Incoming records arriving within every few seconds are batched together and then processed in a single mini-batch, with a delay of a few seconds. Examples: Spark Streaming, Storm-Trident.

    Both approaches have some advantages and disadvantages. Native streaming feels natural, as every record is processed as soon as it arrives, allowing the framework to achieve the minimum latency possible. But it also means that it is hard to achieve fault tolerance without compromising throughput, as for each record we need to track and checkpoint it once processed. On the other hand, state management is easy, as there are long-running processes which can maintain the required state easily.

    Micro-batching is quite the opposite. Fault tolerance comes for free, since each mini batch is essentially a batch job, and throughput is also high because processing and checkpointing are done in one shot for a group of records. But this comes at some cost in latency, and it does not feel like natural streaming. Efficient state management is also a challenge to maintain.

    Streaming Frameworks One By One:

    Storm:

    Storm is the Hadoop of the streaming world. It is the oldest open source streaming framework and one of the most mature and reliable ones. It is true streaming and is good for simple event-based use cases. I have shared details about Storm at length in these posts: part1 and part2.

    Advantages:

    Very low latency, true streaming, mature and high throughput

    Excellent for non-complicated streaming use cases

    Disadvantages:

    No state management

    No advanced features like event time processing, aggregation, windowing, sessions, watermarks, etc.

    Atleast-once guarantee

    Spark Streaming:

    Spark has emerged as the true successor of Hadoop in batch processing and the first framework to fully support the Lambda Architecture (where both batch and streaming are implemented; batch for correctness, streaming for speed). It is immensely popular, mature and widely adopted. Spark Streaming comes for free with Spark, and it uses micro-batching for streaming. Before the 2.0 release, Spark Streaming had some serious performance limitations, but with the 2.0+ releases it is called Structured Streaming and is equipped with many good features like custom memory management (like Flink's), called Tungsten, plus watermarks, event time processing support, etc. Structured Streaming is also much more abstract, and the 2.3.0 release adds an option to switch between micro-batching and a continuous streaming mode. Continuous streaming mode promises to give very low latency like Storm and Flink, but it is still in its infancy, with many operational limitations.
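    To give a flavour of the API, here is a minimal Structured Streaming sketch; the broker address and topic name are placeholders. The same DataFrame-style operations used for batch are executed incrementally, micro-batch by micro-batch, with Spark maintaining the aggregation state.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.StreamingQuery;

    public class StructuredStreamingSketch {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                    .appName("structured-streaming-sketch")
                    .getOrCreate();

            // Read an unbounded stream of records from Kafka (servers/topic are placeholders).
            Dataset<Row> events = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "events")
                    .load();

            // Continuously count records per key; Spark maintains the aggregation state.
            Dataset<Row> counts = events
                    .selectExpr("CAST(key AS STRING) AS key")
                    .groupBy("key")
                    .count();

            // Emit updated counts to the console as each micro-batch completes.
            StreamingQuery query = counts.writeStream()
                    .outputMode("update")
                    .format("console")
                    .start();

            query.awaitTermination();
        }
    }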

    Advantages:

    Supports Lambda architecture, comes free with Spark

    High throughput, good for many use cases where sub-second latency is not required

    Fault tolerance by default due to micro-batch nature

    Simple to use higher level APIs

    Big community and aggressive improvements

    Exactly Once

    Disadvantages

    Not true streaming, not suitable for low latency requirements

    Too many parameters to tune; hard to get right. I have written a post on my personal experience while tuning Spark Streaming

    Stateless by nature

    Lags behind Flink in many advanced features

    Flink:

    Flink comes from a similar academic background to Spark. While Spark came from UC Berkeley, Flink came from TU Berlin. Like Spark, it also supports the Lambda architecture, but the implementation is quite the opposite of Spark's. While Spark is essentially a batch engine, with Spark Streaming as micro-batching and a special case of Spark batch, Flink is essentially a true streaming engine that treats batch as a special case of streaming with bounded data. Though the APIs in both frameworks are similar, they don't have any similarity in implementation. In Flink, each function like map, filter, reduce, etc. is implemented as a long-running operator (similar to a Bolt in Storm). Flink looks like a true successor to Storm, just as Spark succeeded Hadoop in batch.
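    As a minimal illustration of that point, consider the following DataStream sketch (the host and port are placeholders): each of the chained calls below becomes a long-running operator in the deployed job graph, processing records one by one as they arrive.

    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class FlinkOperatorsSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Unbounded source: lines arriving on a socket (host/port are placeholders).
            DataStream<String> lines = env.socketTextStream("localhost", 9999);

            // filter and map each become long-running operators that handle every record
            // as soon as it arrives (true streaming).
            lines.filter(line -> !line.isEmpty())
                 .map(line -> line.trim())
                 .print();

            env.execute("flink-operators-sketch");
        }
    }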

    Advantages:

    Leader of innovation in the open source streaming landscape

    First true streaming framework with all advanced features like event time processing, watermarks, etc.

    Low latency with high throughput, configurable according to requirements

    Auto-adjusting, not too many parameters to tune

    Exactly Once

    Getting widely accepted by big companies at scale, like Uber and Alibaba

    Disadvantages

    A little late to the game; there was a lack of adoption initially

    Community is not as big as Spark's, but it is growing at a fast pace now

    No known adoption of Flink Batch as of now; it is only popular for streaming

    Kafka Streams:

    Kafka Streams, unlike other streaming frameworks, is a lightweight library. It is useful for streaming data from Kafka, doing transformations and then sending the results back to Kafka. We can understand it as a library similar to a Java ExecutorService thread pool, but with inbuilt support for Kafka. It can be integrated well with any application and will work out of the box.

    Due to its lightweight nature, it can be used in microservice-type architectures. It is no match for Flink in terms of performance, but it also does not need a separate cluster to run, is very handy and is easy to deploy and start working with. Internally it uses Kafka consumer groups and works on the Kafka log philosophy. This post thoroughly explains the use cases of Kafka Streams vs Flink Streaming.

    One major advantage of Kafka Streams is that its processing is exactly once end to end. This is possible because both the source and the destination are Kafka, and from the Kafka 0.11 release (around June 2017), exactly-once is supported. To enable this feature, we just need to set a flag and it will work out of the box. More details are shared here and here.
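    The flag in question is the processing-guarantee setting in the Streams configuration; a minimal sketch (the application id and broker address are placeholders):

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    public class ExactlyOnceConfig {
        static Properties streamsConfig() {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");      // placeholder
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
            // Switch the delivery guarantee from the default (at_least_once) to exactly_once.
            // Requires Kafka brokers at version 0.11 or newer.
            props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE);
            return props;
        }
    }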

    Advantages:

    Very lightweight library, good for microservices and IoT applications

    Does not need dedicated cluster

    Inherits all of Kafka's good characteristics

    Supports stream joins; internally uses RocksDB for maintaining state

    Exactly Once (Kafka 0.11 onwards)

    Disadvantages

    Tightly coupled with Kafka; cannot be used without Kafka in the picture

    Quite new (still in its infancy); yet to be battle tested at big companies

    Not for heavy-lifting work like Spark Streaming or Flink

    Samza:

    I will cover Samza in short. From 100 feet, Samza looks similar to Kafka Streams in approach; there are many similarities. Both of these frameworks were developed by the same developers, who implemented Samza at LinkedIn and then founded Confluent, where they wrote Kafka Streams. Both technologies are tightly coupled with Kafka: they take raw data from Kafka, put processed data back into Kafka, and use the same Kafka log philosophy. Samza is a kind of scaled-up version of Kafka Streams. While Kafka Streams is a library intended for microservices, Samza is a full-fledged cluster processing framework which runs on YARN.

    Advantages:

    Very good at maintaining large states of information (good for the use case of joining streams), using RocksDB and the Kafka log

    Fault tolerant and highly performant, using Kafka's properties

    One of the options to consider if YARN and Kafka are already in the processing pipeline; a good YARN citizen

    Low latency, high throughput, mature and tested at scale

    Disadvantages:

    Tightly coupled with Kafka and YARN; not easy to use if either of these is not in your processing pipeline

    Atleast-once processing guarantee (I am not sure if it supports exactly-once now, like Kafka Streams does after Kafka 0.11)

    Lack of advanced streaming features like watermarks, sessions, triggers, etc.

    Comparison of Streaming Frameworks:

    We can compare technologies only with similar offerings. While Storm, Kafka Streams and Samza now look useful for simpler use cases, the real competition is clearly between the heavyweights with the latest features: Spark vs Flink. When we talk about comparison, we generally tend to ask: show me the numbers :) Benchmarking is a good way to compare only when it has been done by third parties.

    For example, one of the older benchmarks was this one. But it was done before Spark Streaming 2.0, when Spark still had limitations with RDDs and Project Tungsten was not in place. Now, with Structured Streaming post the 2.0 release, Spark Streaming is trying hard to catch up, and it seems like there is going to be a tough fight ahead.

    Recently, benchmarking has become something of an open cat fight between Spark and Flink. Spark recently published a benchmarking comparison with Flink, to which the Flink developers responded with another benchmark, after which the Spark folks edited the post.

    It is better not to put too much faith in benchmarks these days, because even a small tweak can completely change the numbers. Nothing is better than trying and testing ourselves before deciding. As of today, it is quite obvious that Flink is leading the streaming analytics space, with most of the desired aspects like exactly-once, throughput, latency, state management, fault tolerance, advanced features, etc.

    These have been possible because of some of Flink's true innovations, like lightweight snapshots and off-heap custom memory management. One important concern with Flink until some time back was maturity and adoption level, but now companies like Uber, Alibaba and CapitalOne are using Flink streaming at massive scale, certifying the potential of Flink Streaming.

    Recently, Uber open sourced their latest streaming analytics framework, called AthenaX, which is built on top of the Flink engine. In this post, they discuss how they moved their streaming analytics from Storm to Apache Samza and now to Flink.

    One important point to note, if you have already noticed, is that all native streaming frameworks which support state management, like Flink, Kafka Streams and Samza, use RocksDB internally. RocksDB is unique in the sense that it maintains persistent state locally on each node and is highly performant. It has become a crucial part of new streaming systems. I have shared detailed info on RocksDB in one of my previous posts.

    How to Choose the Best Streaming Framework:

    This is the most important part. And the honest answer is: it depends :) It is important to keep in mind that no single processing framework can be a silver bullet for every use case. Every framework has some strengths and some limitations too. Still, with some experience, I will share a few pointers to help in taking the decision:

    • 1. It depends on the use case: If the use case is simple, there is no need to go for the latest and greatest framework if it is complicated to learn and implement. A lot depends on how much we are willing to invest for how much we want in return. For example, for a simple IoT-style event-based alerting system, Storm or Kafka Streams is perfectly fine to work with.

    • 2. Future considerations: At the same time, we also need to give conscious consideration to possible future use cases. Is it possible that demands for advanced features like event time processing, aggregation, stream joins, etc. will come in the future? If the answer is yes, or maybe, then it is better to go ahead with advanced streaming frameworks like Spark Streaming or Flink. Once invested and implemented in one technology, it is difficult and costly to change later. For example, in a previous company we had a Storm pipeline up and running for two years, and it worked perfectly fine until a requirement came for de-duplicating incoming events and reporting only unique events. This demanded state management, which is not inherently supported by Storm. Although I implemented it using a time-based in-memory hashmap, it had the limitation that the state would go away on restart. It also gave us issues during such changes, which I have shared in one of my previous posts. The point I am trying to make is: if we try to implement something on our own which the framework does not explicitly provide, we are bound to hit unknown issues.
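    Purely for illustration, here is a rough sketch of the kind of time-based in-memory map described above; the class and method names are made up and not taken from any framework. As noted, all of this state is lost the moment the process restarts.

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical helper: remembers event ids for a fixed time window so duplicates can be dropped.
    // Being purely in-memory, everything here disappears on restart -- the limitation noted above.
    public class SeenEventsCache {
        private final Map<String, Long> seenAt = new ConcurrentHashMap<>();
        private final long ttlMillis;

        public SeenEventsCache(long ttlMillis) {
            this.ttlMillis = ttlMillis;
        }

        /** Returns true the first time an event id is seen inside the TTL window. */
        public boolean firstTimeSeen(String eventId) {
            long now = System.currentTimeMillis();
            evictOlderThan(now - ttlMillis);
            return seenAt.putIfAbsent(eventId, now) == null;
        }

        private void evictOlderThan(long cutoff) {
            for (Iterator<Map.Entry<String, Long>> it = seenAt.entrySet().iterator(); it.hasNext(); ) {
                if (it.next().getValue() < cutoff) {
                    it.remove();
                }
            }
        }
    }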

    • 3. Existing tech stack: One more important point is to consider the existing tech stack. If the existing stack has Kafka in place end to end, then Kafka Streams or Samza might be an easier fit. Similarly, if the processing pipeline is based on the Lambda architecture and Spark Batch or Flink Batch is already in place, then it makes sense to consider Spark Streaming or Flink Streaming. For example, in one of my previous projects I already had Spark Batch in the pipeline, so when the streaming requirement came it was quite easy to pick Spark Streaming, which required almost the same skill set and code base.

    In short, if we understand the strengths and limitations of the frameworks, along with our own use cases, well, then it is easier to pick, or at least to filter down, the available options. Lastly, it is always good to do a POC once a couple of options have been selected. Everyone has different taste buds, after all.

    Data Storage for Big Data:

    Types of NoSQL Databases

    There are 4 basic types of NoSQL databases:

    • 1. Key-Value Store – It has a big hash table of keys & values {Example- Riak, Amazon DynamoDB}

    • 2. Document-based Store- It stores documents made up of tagged elements. {Example- CouchDB}

    • 3. Column-based Store- Each storage block contains data from only one column. {Example- HBase, Cassandra}

    • 4. Graph-based-A network database that uses edges and nodes to represent and store data. {Example- Neo4J}

    • 1. Key Value Store NoSQL Database

    The schema-less format of a key-value database like Riak is just about what you need for your storage needs. The key can be synthetic or auto-generated, while the value can be a String, JSON, BLOB (basic large object), etc.

    The key-value type basically uses a hash table in which there exists a unique key and a pointer to a particular item of data. A bucket is a logical group of keys – but buckets don't physically group the data. There can be identical keys in different buckets.

    Performance is enhanced to a great degree because of the cache mechanisms that accompany the mappings. To read a value you need to know both the key and the bucket, because the real key is a hash of (Bucket + Key).

    There is no complexity around the key-value store database model, as it can be implemented in a breeze. It is not an ideal method, however, if you only want to update part of a value or query the database.

    When we reflect back on the CAP theorem, it becomes quite clear that key-value stores are great on the Availability and Partition tolerance aspects but definitely lack in Consistency.

    Example: Consider the data subset represented in the following table. Here the key is the 3Pillar country name, while the value is a list of addresses of 3Pillar centers in that country.

    Key        Value

    India      {"B-25, Sector-58, Noida, India – 201301"}

    Romania    {"IMPS Moara Business Center, Buftea No. 1, Cluj-Napoca, 400606", "City Business Center, Coriolan Brediceanu No. 10, Building B, Timisoara, 300011"}

    US         {"3975 Fair Ridge Drive. Suite 200 South, Fairfax, VA 22033"}


    This key/value type of database allows clients to read and write values using a key, as follows (a minimal sketch of such an interface appears after the list):

    • Get(key), returns the value associated with the provided key.

    • Put(key, value), associates the value with the key.

    • Multi-get(key1, key2, ..., keyN), returns the list of values associated with the list of keys.

    • Delete(key), removes the entry for the key from the data store.
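    The operations above map naturally onto a small client interface; the following is a hypothetical sketch (the names are illustrative, not any particular product's API):

    import java.util.List;
    import java.util.Map;

    // Hypothetical client interface for a key-value store, mirroring the operations listed above.
    public interface KeyValueStore<K, V> {

        /** Returns the value associated with the provided key, or null if absent. */
        V get(K key);

        /** Associates the value with the key, overwriting any previous value. */
        void put(K key, V value);

        /** Returns the values associated with each of the given keys. */
        Map<K, V> multiGet(List<K> keys);

        /** Removes the entry for the key from the data store. */
        void delete(K key);
    }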

    While a key/value type database seems helpful in some cases, it has some weaknesses as well. One is that the model will not provide any kind of traditional database capabilities (such as atomicity of transactions, or consistency when multiple transactions are executed simultaneously). Such capabilities must be provided by the application itself.

    Secondly, as the volume of data increases, maintaining unique values as keys may become more difficult; addressing this issue requires the introduction of some complexity in generating character strings that will remain unique among an extremely large set of keys.

    • Riak and Amazon’s Dynamo are the most popular key-value store NoSQL databases.

      • 2. Document Store NoSQL Database

    A document store is quite similar to a key-value store in that the data is a collection of key-value pairs; the difference is that the values stored (referred to as "documents") provide some structure and encoding of the managed data. XML, JSON (JavaScript Object Notation) and BSON (a binary encoding of JSON objects) are some common standard encodings.

    The following example shows data values collected as a “document” representing the names of specific retail stores. Note that while the three examples all represent locations, the representative models are different.

    { officeName: "3Pillar Noida",
      { Street: "B-25", City: "Noida", State: "UP", Pincode: "201301" } }

    { officeName: "3Pillar Timisoara",
      { Boulevard: "Coriolan Brediceanu No. 10", Block: "B, 1st Floor", City: "Timisoara", Pincode: "300011" } }

    { officeName: "3Pillar Cluj",
      { Latitude: "40.748328", Longitude: "-73.985560" } }

    One key difference between a key-value store and a document store is that the latter embeds attribute metadata associated with the stored content, which essentially provides a way to query the data based on its contents. For example, in the above example, one could search for all documents in which "City" is "Noida"; that would deliver a result set containing all documents associated with any "3Pillar Office" that is in that particular city.
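    As a concrete illustration, this is roughly how such a content-based query looks with the MongoDB Java driver; the connection string, database, collection and field names simply mirror the example above and are placeholders.

    import static com.mongodb.client.model.Filters.eq;

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    public class DocumentQuerySketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {  // placeholder
                MongoCollection<Document> offices = client
                        .getDatabase("company")        // placeholder database
                        .getCollection("offices");     // placeholder collection

                // Find every office document whose City attribute is "Noida".
                for (Document office : offices.find(eq("City", "Noida"))) {
                    System.out.println(office.toJson());
                }
            }
        }
    }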

    Apache CouchDB is an example of a document store. CouchDB uses JSON to store data, JavaScript as its query language (using MapReduce) and HTTP for an API. Data and relationships are not stored in tables, as is the norm with conventional relational databases; instead they are a collection of independent documents.

    The fact that document style databases are schema-less makes adding fields to JSON documents a simple task without having to define changes first.

    • Couchbase and MongoDB are the most popular document based databases.

      • 3. Column Store NoSQL Database

    In a column-oriented NoSQL database, data is stored in cells grouped in columns of data rather than as rows of data. Columns are logically grouped into column families. Column families can contain a virtually unlimited number of columns, which can be created at runtime or while defining the schema. Reads and writes are done using columns rather than rows.

    In comparison, most relational DBMSs store data in rows. The benefit of storing data in columns is fast search/access and data aggregation. Relational databases store a single row as a continuous disk entry, and different rows are stored in different places on disk, while columnar databases store all the cells corresponding to a column as a continuous disk entry, which makes search/access faster.

    For example: querying the titles of a million articles would be a painstaking task with a relational database, as it would have to go over each row's location to get the titles. With a columnar database, on the other hand, the titles of all the items can be obtained with just one disk access.

    Data Model

    • ColumnFamily: ColumnFamily is a single structure that can group Columns and SuperColumns with ease.

    • Key: the permanent name of the record. Keys have different numbers of columns, so the database can scale in an irregular way.

    • Keyspace: This defines the outermost level of an organization, typically the name of the application. For example, ‘3PillarDataBase’ (database name).

    • Column: An ordered list of elements, aka a tuple, with a name and a value defined.

    The best known examples are Google's BigTable, and HBase and Cassandra, which were inspired by BigTable.

    BigTable, for instance, is a high-performance, compressed and proprietary data storage system owned by Google. It has the following attributes:

    • Sparse – some cells can be empty

    • Distributed – data is partitioned across many hosts

    • Persistent – stored to disk

    • Multidimensional – more than 1 dimension

    • Map – key and value

    • Sorted – maps are generally not sorted but this one is

    A 2-dimensional table comprising rows and columns is part of the relational database system.

    City        Pincode     Strength   Project

    Noida       201301      250        20

    Cluj        400606      200        15

    Timisoara   300011      150        10

    Fairfax     VA 22033    100        5

    For above RDBMS table a BigTable map can be visualized as shown below.

    {
      3PillarNoida: {
        address: { city: Noida, pincode: 201301 },
        details: { strength: 250, projects: 20 }
      },
      3PillarCluj: {
        address: { city: Cluj, pincode: 400606 },
        details: { strength: 200, projects: 15 }
      },
      3PillarTimisoara: {
        address: { city: Timisoara, pincode: 300011 },
        details: { strength: 150, projects: 10 }
      },
      3PillarFairfax: {
        address: { city: Fairfax, pincode: VA 22033 },
        details: { strength: 100, projects: 5 }
      }
    }

    • The outermost keys 3PillarNoida, 3PillarCluj, 3PillarTimisoara and 3PillarFairfax are analogous to rows.

    • 'address' and 'details' are called column families.

    • The column family 'address' has columns 'city' and 'pincode'.

    • The column family 'details' has columns 'strength' and 'projects'. Columns can be referenced using ColumnFamily.

    • Google’s BigTable, HBase and Cassandra are the most popular column store based databases.

    • 4. Graph-Based NoSQL Database

    In a graph-based NoSQL database, you will not find the rigid format of SQL or the tables-and-columns representation; a flexible graphical representation is used instead, which is perfect for addressing scalability concerns. Graph structures are used, with edges, nodes and properties, which provides index-free adjacency. Data can be easily transformed from one model to the other using a graph-based NoSQL database.

    • These databases use edges and nodes to represent and store data.

    • The nodes are organised by some relationships with one another, which are represented by edges between the nodes.

    • Both the nodes and the relationships have some defined properties.

    The following are some of the features of graph-based databases, explained on the basis of the example below:

    Labeled, directed, attributed multi-graph: The graph contains nodes which are labelled properly with some properties, and these nodes have some relationships with one another, which are shown by the directed edges. For example, "Alice knows Bob" is shown by an edge that also has some properties.
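    A tiny, purely illustrative sketch of the "Alice knows Bob" example as a labeled, directed, attributed graph; the classes are hypothetical and do not belong to any real graph database API.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical in-memory model of a labeled, directed, attributed graph.
    public class PropertyGraphSketch {

        static class Node {
            final String label;
            final Map<String, Object> properties = new HashMap<>();
            Node(String label) { this.label = label; }
        }

        static class Edge {
            final Node from;
            final Node to;
            final String label;
            final Map<String, Object> properties = new HashMap<>();
            Edge(Node from, String label, Node to) {
                this.from = from; this.label = label; this.to = to;
            }
        }

        public static void main(String[] args) {
            Node alice = new Node("Person");
            alice.properties.put("name", "Alice");

            Node bob = new Node("Person");
            bob.properties.put("name", "Bob");

            // The relationship itself is directed and carries properties.
            Edge knows = new Edge(alice, "KNOWS", bob);
            knows.properties.put("since", 2001);

            System.out.println(alice.properties.get("name") + " " + knows.label.toLowerCase() + " "
                    + bob.properties.get("name") + " since " + knows.properties.get("since"));
        }
    }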

    While relational database models can replicate graph models, an edge would require a join, which is a costly proposition.

    Use Case

    Any 'Recommended for You' rating you see on e-commerce websites (book/video rental sites) is often derived by taking into account how other users have rated the product in question. Arriving at such recommendations is made easy using graph databases.

    InfoGrid and InfiniteGraph are the most popular graph-based databases. InfoGrid allows the connection of as many edges (Relationships) and nodes (MeshObjects) as needed, making it easier to represent hyperlinked and complex sets of information.

    There are two kinds of graph database offered by InfoGrid:

    MeshBase – It is a perfect option where a standalone deployment is required.

    NetMeshBase – It is ideally suited for large distributed graphs and has additional capabilities to communicate with other similar NetMeshBases.

    System Properties Comparison Cassandra vs. HBase vs. MongoDB

    Editorial information provided by DB-Engines

     

    Name – Cassandra | HBase | MongoDB

    Description – Cassandra: Wide-column store based on ideas of BigTable and DynamoDB, optimized for write access | HBase: Wide-column store based on Apache Hadoop and on concepts of BigTable | MongoDB: One of the most popular document stores

    Primary database model – Cassandra: Wide column store | HBase: Wide column store | MongoDB: Document store

    DB-Engines Ranking (measures the popularity of database management systems) – Cassandra: score 123.37, overall rank #11, #1 among wide column stores | HBase: score 60.28, overall rank #17, #2 among wide column stores | MongoDB: score 395.09, overall rank #5, #1 among document stores

    Website – Cassandra: cassandra.apache.org | HBase: hbase.apache.org | MongoDB: www.mongodb.com

    Technical documentation – Cassandra: cassandra.apache.org/doc/latest | HBase: hbase.apache.org | MongoDB: docs.mongodb.com/manual

    Developer – Cassandra: Apache Software Foundation (Apache top-level project, originally developed by Facebook) | HBase: Apache Software Foundation (Apache top-level project, originally developed by Powerset) | MongoDB: MongoDB, Inc

    Initial release – Cassandra: 2008 | HBase: 2008 | MongoDB: 2009

    Current release – Cassandra: 3.11.4, February 2019 | HBase: 1.4.8, October 2018 | MongoDB: 4.0.6, February 2019

    License (commercial or open source) – Cassandra: Open Source, Apache version 2 | HBase: Open Source, Apache version 2 | MongoDB: Open Source, MongoDB Inc.'s Server Side Public License v1; prior versions were published under GNU AGPL v3.0; commercial licenses are also available

    Cloud-based only – Cassandra: no | HBase: no | MongoDB: no; also available as a cloud service (MongoDB Atlas)

    Implementation language – Cassandra: Java | HBase: Java | MongoDB: C++

    Server operating systems – Cassandra: BSD, Linux, OS X, Windows | HBase: Linux, Unix, Windows (using Cygwin) | MongoDB: Linux, OS X, Solaris, Windows

    Data scheme – Cassandra: schema-free | HBase: schema-free | MongoDB: schema-free; although schema-free, documents of the same collection often follow the same structure; optionally, all or part of a schema can be imposed by defining a JSON schema

    Typing (predefined data types such as float or date) – Cassandra: yes | HBase: no | MongoDB: yes (string, integer, double, decimal, boolean, date, object_id, geospatial)

    XML support (some form of processing data in XML format, e.g. support for XML data structures, and/or support for XPath, XQuery or XSLT) – Cassandra: no | HBase: no | MongoDB: no

    Secondary indexes – Cassandra: restricted (only equality queries, not always the best performing solution) | HBase: no | MongoDB: yes

    SQL (support of SQL) – Cassandra: SQL-like SELECT, DML and DDL statements (CQL) | HBase: no | MongoDB: read-only SQL queries via the MongoDB Connector for BI

    APIs and other access methods – Cassandra: proprietary protocol, CQL (Cassandra Query Language, an SQL-like language), Thrift | HBase: Java API, RESTful HTTP API, Thrift | MongoDB: proprietary protocol using JSON

    Supported programming languages – Cassandra: C#, C++, Erlang, Go, Haskell, Java, JavaScript, Node.js, Perl, PHP, Python, Ruby, Scala | HBase: C, C#, C++, Groovy, Java, PHP, Python, Scala | MongoDB: Actionscript, C, C#, C++, Clojure, ColdFusion, D, Dart, Delphi, Erlang, Go, Groovy, Haskell, Java, JavaScript, Lisp, Lua, MatLab, Perl, PHP, PowerShell, Prolog, Python, R, Ruby, Scala, Smalltalk (many of these via inofficial drivers)

    Server-side scripts – Cassandra: no | HBase: yes (Coprocessors in Java) | MongoDB: JavaScript

    Triggers – Cassandra: yes | HBase: yes | MongoDB: no

    Partitioning methods (methods for storing different data on different nodes) – Cassandra: Sharding | HBase: Sharding | MongoDB: Sharding

    Replication methods (methods for redundantly storing data on multiple nodes) – Cassandra: selectable replication factor; representation of geographical distribution of servers is possible | HBase: selectable replication factor | MongoDB: Master-slave replication

    MapReduce (offers an API for user-defined Map/Reduce methods) – Cassandra: yes | HBase: yes | MongoDB: yes

    Consistency concepts (methods to ensure consistency in a distributed system) – Cassandra: Eventual Consistency or Immediate Consistency, can be decided individually for each write operation | HBase: Immediate Consistency | MongoDB: Eventual Consistency or Immediate Consistency, can be decided individually for each write operation

    Foreign keys (referential integrity) – Cassandra: no | HBase: no | MongoDB: no; typically not used, however similar functionality is possible with DBRef

    Transaction concepts (support to ensure data integrity after non-atomic manipulations of data) – Cassandra: no (atomicity and isolation are supported for single operations) | HBase: no (atomicity and isolation are supported for single operations) | MongoDB: multi-document ACID transactions with snapshot isolation

    Concurrency (support for concurrent manipulation of data) – Cassandra: yes | HBase: yes | MongoDB: yes

    Durability (support for making data persistent) – Cassandra: yes | HBase: yes | MongoDB: yes

    In-memory capabilities (is there an option to define some or all structures to be held in memory only?) – Cassandra: no | HBase: no | MongoDB: yes (in-memory storage engine introduced with MongoDB version 3.2)

    User concepts (access control) – Cassandra: access rights for users can be defined per object | HBase: Access Control Lists (ACL) | MongoDB: access rights for users and roles

    (HBase note: implementation based on Hadoop and ZooKeeper.)

    More information provided by the system vendor

    Specific characteristics – Cassandra: "Apache Cassandra is the leading NoSQL, distributed database management system, well ..." | MongoDB: "MongoDB is the next-generation database that helps businesses transform their industries ..."

    Competitive advantages – Cassandra: "No single point of failure ensures 100% availability. Operational simplicity for ..." | MongoDB: "By offering the best of traditional databases as well as the flexibility, scale, ..."

    Typical application scenarios – Cassandra: Internet of Things (IOT), fraud detection applications, recommendation engines, product ... | MongoDB: Internet of Things (Bosch, Silver Spring Networks), Mobile (The Weather Channel, ADP), ...

    Key customers – Cassandra: Apple, Netflix, Uber, ING, Intuit, Fidelity, NY Times, Outbrain, BazaarVoice, Best ... | MongoDB: ADP, Adobe, Amadeus, AstraZeneca, Barclays, BBVA, Bond, Bosch, Cisco, CERN, City ...

    Market metrics – Cassandra: used by 40% of the Fortune 100 | MongoDB: 40 million downloads, growing at 30 thousand downloads per ...; ... customers ...

    Licensing and pricing models – Cassandra: Apache license; pricing for commercial distributions provided by DataStax and ... available ... | MongoDB: database server under the Free Software Foundation's GNU AGPL v3.0; commercial licenses ...


    Benchmarking NoSQL Databases: Cassandra vs. MongoDB vs. HBase vs. Couchbase

    Understanding the performance behavior of a NoSQL database like Apache Cassandra™ under various conditions is critical. Conducting a formal proof of concept (POC) in the environment in which the database will run is the best way to evaluate platforms. POC processes that include the right benchmarks, such as production configurations, parameters and anticipated data and concurrent user workloads, give both IT and business stakeholders powerful insight about the platforms under consideration and a view of how business applications will perform in production.

    Independent benchmark analyses and testing of various NoSQL platforms under big data, production-level workloads have been performed over the years and have consistently identified Apache Cassandra as the platform of choice for businesses interested in adopting NoSQL as the database for modern Web, mobile and IoT applications.

    One benchmark analysis (Solving Big Data Challenges for Enterprise Application Performance Management) by engineers at the University of Toronto, which evaluated six different data stores, found Apache Cassandra the "clear winner throughout our experiments". Also, End Point Corporation, a database and open source consulting company, benchmarked the top NoSQL databases, including Apache Cassandra, Apache HBase, Couchbase and MongoDB, using a variety of different workloads on AWS EC2.

    The databases involved were:

    Apache Cassandra: Highly scalable, high-performance distributed database designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

    Apache HBase: Open source, non-relational, distributed database modeled after Google's BigTable and written in Java. It is developed as part of the Apache Software Foundation's Apache Hadoop project and runs on top of HDFS (the Hadoop Distributed File System), providing BigTable-like capabilities for Hadoop.

    MongoDB: Cross-platform document-oriented database system that eschews the traditional table-based relational database structure in favor of JSON-like documents with dynamic schemas, making the integration of data in certain types of applications easier and faster.

    Couchbase: Distributed NoSQL document-oriented database that is optimized for interactive applications.

    End Point conducted the benchmark of these NoSQL database options on Amazon Web Services EC2 instances, an industry-standard platform for hosting horizontally scalable services. In order to minimize the effect of AWS CPU and I/O variability, End Point performed each test 3 times on 3 different days. New EC2 instances were used for each test run to further reduce the impact of any "lame instance" or "noisy neighbor" effects sometimes experienced in cloud environments on any one test.

    NoSQL Database Performance Testing Results

    When it comes to performance, it should be noted that there is (to date) no single "winner takes all" among the top NoSQL databases, or any other NoSQL engine for that matter. Depending on the use case and deployment conditions, it is almost always possible for one NoSQL database to outperform another and yet lag its competitor when the rules of engagement change. Here are a couple of snapshots of the performance benchmark to give you a sense of how each NoSQL database stacks up.

    Throughput by Workload

    Each workload appears below with the throughput/operations-per-second (more is better) graphed vertically, the number of nodes used for the workload displayed horizontally, and a table with the result numbers following each graph.

    Load process

    For the load process, Couchbase, HBase, and MongoDB all had to be configured for non-durable writes to complete in a reasonable amount of time, with Cassandra being the only database performing durable write operations. Therefore, the numbers below for Couchbase, HBase, and MongoDB represent non-durable write metrics.

    Nodes   Cassandra    HBase        MongoDB      Couchbase

    1       18,683.43    15,617.98    8,368.44     13,761.12

    2       31,144.24    23,373.93    13,462.51    26,140.82

    4       53,067.62    38,991.82    18,038.49    40,063.34

    8       86,924.94    74,405.64    34,305.30    76,504.40

    16      173,001.20   143,553.41   73,335.62    131,887.99

    32      326,427.07   296,857.36   134,968.87   192,204.94

    Mixed Operational and Analytical Workload

    Note that Couchbase was eliminated from this test because it does not support scan operations (producing the error: "Range scan is not supported").


    Nodes   Cassandra    HBase      MongoDB

    1       4,690.41     269.30     939.01

    2       10,386.08    333.12     30.96

    4       18,720.50    1,228.61   10.55

    8       36,773.58    2,151.74   39.28

    16      78,894.24    5,986.65   377.04

    32      128,994.91   8,936.18   227.80

    For a comprehensive analysis, please download the complete report: Benchmarking Top NoSQL Databases.

    NoSQL Database Performance Conclusion

    These performance metrics are just a few of the many that have solidified Apache Cassandra as the NoSQL database of choice for businesses needing a modern, distributed database for their Web, mobile and IoT applications. Each database option (Cassandra, HBase, Couchbase and MongoDB) will certainly shine in particular use cases, so it's important to test your specific use cases to ensure your selected database meets your performance SLA. Whether you are primarily concerned with throughput or latency, or more interested in architectural benefits such as having no single point of failure or being able to scale elastically across multiple data centers and the cloud, much of an application's success comes down to its ability to deliver the response times Web, mobile and IoT customers expect.

    As the benchmarks referenced here showcase, Cassandra's reputation for fast write and read performance, and for delivering true linear-scale performance in a masterless, scale-out design, bests its top NoSQL database rivals in many use cases.

    Apache HBase: Why We Use It and Believe In It


    What is HBase?

    Apache HBase is an open-source, distributed, versioned, non-relational database modeled after Google’s Bigtable: A Distributed Storage System for Structured Data.

    HBase supports random, real-time read/write access, with a goal of hosting very large tables atop clusters of commodity hardware.

    HBase features include:

    • Consistent reads and writes

    • Automatic and configurable sharding of tables

    • Automatic failover support

    How HBase Works:

    HBase uses ZooKeeper for coordination of "truth" across the cluster. As region servers come online, they register themselves with ZooKeeper as members of the cluster. Region servers host shards of data (partitions of a database table) called "regions".
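    As a small illustration, here is a minimal sketch using the standard HBase Java client; the table, column family, column and row key are placeholders that loosely reuse the 3Pillar example from earlier. Each write is handled by the region server owning the row, which records it in the WAL and Memstore as described below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml (ZooKeeper quorum, etc.)

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("offices"))) {   // placeholder table

                // Random real-time write: sent to the owning region server, which logs it
                // in the WAL and applies it to the Memstore.
                Put put = new Put(Bytes.toBytes("3PillarNoida"));
                put.addColumn(Bytes.toBytes("address"), Bytes.toBytes("city"), Bytes.toBytes("Noida"));
                table.put(put);

                // Random real-time read of the same row.
                Result result = table.get(new Get(Bytes.toBytes("3PillarNoida")));
                byte[] city = result.getValue(Bytes.toBytes("address"), Bytes.toBytes("city"));
                System.out.println("city = " + Bytes.toString(city));
            }
        }
    }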

    When a change is made to a row, it is updated in a persistent Write-Ahead-Log (WAL) file and Memstore, the sorted memory cache for HBase. Once Memstore fills, its changes are “flushed” to HFiles in HDFS. The WAL