a) Introduction to SPARK
b) Need for SPARK
c) Key Features of SPARK
d) Use Cases of SPARK
SPARK
• Started as a research project at the UC
Berkeley AMPLab in 2009
• Matei Zaharia, a Romanian-Canadian
computer scientist, is the creator of
Apache Spark
• Open-sourced in 2010 and donated to the
Apache Software Foundation in 2013
• Apache Spark became a top-level
project in 2014
• Zaharia is a co-founder and CTO of
Databricks
• Databricks set a new world record in
large-scale sorting using Spark in 2014
SPARK
• Apache Spark is an open-source,
distributed, general-purpose
cluster-computing framework.
• Spark provides an interface for
programming entire clusters with
implicit data parallelism and fault
tolerance.
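The idea of data parallelism can be sketched in a few lines. The example below is plain Python, not the Spark API: it assumes a hypothetical helper, split_into_partitions, and uses a local thread pool in place of a cluster. The caller hands over the whole dataset and a function; splitting the data, running the work on each partition, and combining the partial results happen behind the call. Fault tolerance is not modelled here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a list into contiguous chunks of roughly equal size (hypothetical helper)."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, num_partitions=4):
    """Sum the squares of all items, processing one partition per worker."""
    partitions = split_into_partitions(data, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        # Each worker computes a partial result for its own partition.
        partial_sums = list(
            pool.map(lambda part: sum(x * x for x in part), partitions)
        )
    # Combine the partial results into the final answer.
    return sum(partial_sums)


result = parallel_sum_of_squares(list(range(1, 11)))
print(result)
```

In Spark itself the partitions would live on different machines of the cluster, but the calling code would look similarly compact.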
Why SPARK
• Speed - Speed is the major reason for
Spark's popularity. It can process data up to
100 times faster than Hadoop MapReduce when
working in memory, and up to 10 times faster
on disk. It achieves this by reducing the
number of disk reads and writes and by
keeping intermediate results in memory.
• Cost - It is cost-effective because it
needs fewer resources for the same workload.
• Compatibility - Spark is compatible with
YARN, the Hadoop resource manager, and runs
on a Hadoop cluster just like MapReduce. It
also works with other resource managers
such as Apache Mesos.
Why SPARK
• Real-time Processing - Another reason for
Spark's popularity is near-real-time
processing: streaming data is handled as a
series of small batches. Its in-memory
processing keeps it in high demand.
• Supports multiple languages - Spark
provides built-in APIs in Java, Scala, and
Python, so you can write applications in
different languages. Spark also comes with
more than 80 high-level operators for
interactive querying.
• Advanced Analytics - Spark supports more
than just 'map' and 'reduce'. It also supports
SQL queries, streaming data, machine
learning (ML), and graph algorithms.
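For readers new to the 'map' and 'reduce' operations mentioned above, here is a minimal word-count sketch. It uses only the Python standard library, not the Spark API, and the input lines are made-up sample data. The map step turns each line into its own table of word counts; the reduce step merges those tables into one.

```python
from collections import Counter
from functools import reduce

# Made-up sample input; in practice this would be a large distributed dataset.
lines = ["spark is fast", "spark is general", "hadoop is batch"]

# Map: transform every line independently into a per-line word count.
mapped = map(lambda line: Counter(line.split()), lines)

# Reduce: merge the per-line counts pairwise into a single result.
word_counts = reduce(lambda left, right: left + right, mapped, Counter())

print(word_counts["spark"], word_counts["is"])
```

Because each line is mapped independently, the map step can be spread across many machines, which is the property that frameworks such as MapReduce and Spark build on.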
Spark on Hadoop
[Figure: Spark on Hadoop. Labels shown: Spark alongside MapReduce on YARN; Spark components: Spark SQL, Spark Streaming, MLlib (Machine Learning), GraphX.]