
Apache Spark - Introduction

Apache Spark
Apache Spark is a lightning-fast cluster computing technology designed for fast computation. It is based on Hadoop MapReduce and extends the MapReduce model to use it efficiently for more types of computation, including interactive queries and stream processing. Spark's key feature is in-memory cluster computing, which increases the processing speed of an application.

Spark is designed to cover a wide variety of workloads such as batch applications, iterative algorithms, interactive queries, and streaming. In addition to supporting all these workloads in a single system, it reduces the administrative burden of maintaining separate tools.

Apache Spark Evolution


Spark is one of Hadoop's subprojects, developed in 2009 at UC Berkeley's AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license. It was donated to the Apache Software Foundation in 2013, and Apache Spark has been a top-level Apache project since February 2014.
Apache Spark Features
Apache Spark has the following features.

• Speed: Spark helps run an application on a Hadoop cluster up to 100 times faster in memory and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk, since Spark stores intermediate processing data in memory (a minimal caching sketch in Scala follows this feature list).

• Supports multiple languages: Spark provides built-in APIs in Java, Scala, and Python, so you can write applications in different languages. Spark also comes with 80 high-level operators for interactive querying.

• Advanced analytics: Spark supports not only "Map" and "Reduce" but also SQL queries, streaming data, machine learning (ML), and graph algorithms.
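
The features above can be illustrated with a short Scala sketch. The snippet below is only a sketch: it assumes a local Spark installation, and the application name, local[*] master, and data.txt input path are placeholders. It shows how cache() keeps intermediate data in memory so repeated actions avoid re-reading from disk, using Spark's high-level operators.

import org.apache.spark.{SparkConf, SparkContext}

object CachingExample {
  def main(args: Array[String]): Unit = {
    // Run locally with all available cores; a cluster URL works the same way.
    val conf = new SparkConf().setAppName("CachingExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // "data.txt" is a placeholder path; any text file accessible to the cluster works.
    val lines = sc.textFile("data.txt")

    // cache() keeps the RDD in memory, so repeated actions avoid extra disk reads.
    val words = lines.flatMap(_.split(" ")).cache()

    // Two separate actions reuse the in-memory data instead of recomputing it.
    println(s"Total words: ${words.count()}")
    println(s"Distinct words: ${words.distinct().count()}")

    sc.stop()
  }
}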

Components of Apache Spark


The following sections describe the different components of Spark.

Apache Spark Core

Spark Core is the underlying general execution engine for the Spark platform on which all other functionality is built. It provides in-memory computing and the ability to reference datasets in external storage systems.
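
A minimal Spark Core sketch in Scala, assuming a local Spark setup; the application name and local[*] master are placeholders. It shows the basic workflow of distributing a collection as an RDD, applying a lazy transformation, and triggering the computation with an action.

import org.apache.spark.{SparkConf, SparkContext}

object CoreExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CoreExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Distribute a local collection across the cluster as an RDD.
    val numbers = sc.parallelize(1 to 100)

    // Transformations are lazy; the reduce action triggers the computation.
    val sumOfSquares = numbers.map(n => n * n).reduce(_ + _)
    println(s"Sum of squares 1..100 = $sumOfSquares")

    sc.stop()
  }
}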

Spark SQL

Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
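
A small Spark SQL sketch in Scala. Note that the SchemaRDD abstraction mentioned above later evolved into the DataFrame API, which the snippet uses; the view name and column names are made up for illustration.

import org.apache.spark.sql.SparkSession

object SqlExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SqlExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A small structured dataset; the "name" and "age" columns are illustrative.
    val people = Seq(("Alice", 34), ("Bob", 45), ("Carol", 29)).toDF("name", "age")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age > 30").show()

    spark.stop()
  }
}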

Spark Streaming

Spark Streaming takes advantage of Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs Resilient Distributed Dataset (RDD) transformations on these mini-batches of data.
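
A Spark Streaming sketch in Scala, assuming text arrives on a local TCP socket (for example, one fed with nc -lk 9999); the host, port, and 5-second batch interval are illustrative. It shows data being grouped into mini-batches and processed with RDD-style transformations.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingExample {
  def main(args: Array[String]): Unit = {
    // At least two local threads are needed: one to receive data, one to process it.
    val conf = new SparkConf().setAppName("StreamingExample").setMaster("local[2]")

    // Group incoming data into 5-second mini-batches.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Listen on a local TCP socket; host and port are placeholders.
    val lines = ssc.socketTextStream("localhost", 9999)

    // RDD-style transformations are applied to each mini-batch.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}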

MLlib (Machine Learning Library)


MLlib is a distributed machine learning framework on top of Spark, thanks to Spark's distributed memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the disk-based Hadoop version of Apache Mahout (before Mahout gained a Spark interface).
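
A minimal MLlib sketch in Scala using the ALS implementation mentioned above; the (user, product, rating) triples are invented purely for illustration, as are the rank, iteration count, and regularization values.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AlsExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A tiny, made-up set of (user, product, rating) triples.
    val ratings = sc.parallelize(Seq(
      Rating(1, 10, 4.0), Rating(1, 20, 1.0),
      Rating(2, 10, 5.0), Rating(2, 30, 3.0),
      Rating(3, 20, 2.0), Rating(3, 30, 4.0)
    ))

    // Train a collaborative-filtering model: 5 latent factors, 10 iterations, lambda 0.01.
    val model = ALS.train(ratings, 5, 10, 0.01)

    // Predict how user 1 would rate product 30.
    println(model.predict(1, 30))

    sc.stop()
  }
}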

GraphX

GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs using the Pregel abstraction API. It also provides an optimized runtime for this abstraction.
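
A small GraphX sketch in Scala; the vertex names and edge labels are made up. It builds a property graph and runs connectedComponents, one of the built-in algorithms implemented on top of the Pregel abstraction.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GraphExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Vertices carry a name attribute; vertex IDs must be Longs.
    val vertices = sc.parallelize(Seq(
      (1L, "Alice"), (2L, "Bob"), (3L, "Carol"), (4L, "Dave")
    ))

    // Edges carry a relationship label.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows")
    ))

    // "unknown" is the attribute given to vertices referenced by edges but missing above.
    val graph = Graph(vertices, edges, "unknown")

    // connectedComponents is a built-in algorithm implemented on top of Pregel.
    graph.connectedComponents().vertices.collect().foreach(println)

    sc.stop()
  }
}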
