
Difference between Spark & Hadoop frameworks

Apache Spark and Apache Hadoop are both open-source frameworks designed for distributed computing and big data processing, but they differ in several key ways:

• Data Processing Model:
• Hadoop: Hadoop MapReduce, the processing model associated with the Hadoop ecosystem, processes data in a batch-oriented fashion. It divides large datasets into smaller chunks and processes them in parallel across a distributed cluster.
• Spark: Spark supports batch processing as well as streaming and interactive workloads, and it keeps intermediate data in memory rather than spilling it to disk between steps (see the sketch below).
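
As a rough illustration, here is a minimal PySpark batch job; the input and output paths are hypothetical. Like MapReduce, it splits the dataset into partitions and processes them in parallel across the cluster, but without a separate hand-written map/reduce program:

```python
from pyspark.sql import SparkSession

# A minimal sketch of a Spark batch job; paths are hypothetical.
spark = SparkSession.builder.appName("BatchWordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("hdfs:///data/input/logs.txt")    # split into partitions
counts = (lines.flatMap(lambda line: line.split())    # processed in parallel
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("hdfs:///data/output/word_counts")
spark.stop()
```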
• Data Processing Libraries:
• Hadoop: While Hadoop has a wide range of tools and libraries (e.g., HBase, Hive, Pig), it relies on separate projects for functionality beyond basic batch processing.
• Spark: Spark includes built-in libraries for diverse tasks, such as Spark SQL for SQL queries, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for real-time data processing. This integration simplifies the development of end-to-end data processing pipelines (see the sketch below).
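
To show what "built-in" means in practice, here is a small Spark SQL sketch; the file name and its columns are assumptions. The same SparkSession also gives access to MLlib and the streaming APIs without installing a separate project:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("BuiltInLibraries").getOrCreate()

# Hypothetical JSON file of user event records.
events = spark.read.json("events.json")
events.createOrReplaceTempView("events")

# Spark SQL ships with Spark itself; no separate project (e.g., Hive) is required.
top_users = spark.sql("""
    SELECT user, COUNT(*) AS n_events
    FROM events
    GROUP BY user
    ORDER BY n_events DESC
    LIMIT 10
""")
top_users.show()
```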
• Fault Tolerance:
• Hadoop: Hadoop provides fault tolerance through data replication. Data is stored in multiple copies across the cluster, and in case of node failure, processing can be rerouted to other nodes holding copies of the data.
• Spark: Spark uses Resilient Distributed Datasets (RDDs) for fault tolerance. RDDs store lineage information, allowing lost partitions to be recomputed when needed. Spark can also rely on Hadoop's underlying HDFS (Hadoop Distributed File System) for storage-level fault tolerance (see the sketch below).
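
The lineage an RDD records can be inspected directly. This sketch (the input path is hypothetical) prints the chain of transformations Spark would replay to rebuild a lost partition:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LineageDemo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input; each transformation extends the RDD's lineage graph.
rdd = (sc.textFile("hdfs:///data/input/logs.txt")
         .map(lambda line: line.lower())
         .filter(lambda line: "error" in line))

# toDebugString() returns the recorded lineage (as bytes in PySpark).
print(rdd.toDebugString().decode())
```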
• Speed:
• Hadoop: Hadoop MapReduce is disk-based, which can result in slower processing times for iterative algorithms, as data is written to and read from disk in each iteration.
• Spark: Spark's in-memory processing speeds up data processing, especially for iterative algorithms, by keeping intermediate data in memory between stages. This can significantly improve performance compared to Hadoop MapReduce (see the caching sketch below).
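
A sketch of why in-memory caching helps iterative work: persisting the dataset once means each pass of the loop reads partitions from memory instead of re-reading the source from disk. The data and the iteration here are synthetic:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IterativeCaching").getOrCreate()
sc = spark.sparkContext

# Synthetic numeric dataset, cached so repeated passes stay in memory.
points = sc.parallelize(range(1_000_000)).map(float).cache()

estimate = 0.0
for _ in range(10):
    # Each pass reuses the cached partitions; without cache(), Spark would
    # recompute them (and MapReduce would re-read from disk) every iteration.
    estimate = 0.5 * (estimate + points.mean())
print(estimate)
```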
• Ease of Use:
• Hadoop: Writing programs in Hadoop MapReduce typically involves low-level coding in Java, which can be complex and time-consuming for developers.
• Spark: Spark provides high-level APIs in multiple programming languages, including Java, Scala, Python, and R, which keeps common tasks concise (see the example below).
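
For comparison with the many lines of Java a typical MapReduce job needs, a grouped aggregation in PySpark's DataFrame API takes only a few lines; the file name and column names are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EaseOfUse").getOrCreate()

# Hypothetical CSV with 'region' and 'amount' columns.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# One line of high-level API instead of a hand-written Mapper/Reducer pair.
sales.groupBy("region").sum("amount").show()
```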
• Use Cases:
• Hadoop: Hadoop is well suited to batch processing of large datasets where latency is not critical. It is commonly used for tasks like log processing, data warehousing, and ETL (Extract, Transform, Load) jobs.
• Spark: Spark is versatile and can handle batch processing, interactive queries, machine learning, graph processing, and streaming data. Its in-memory processing makes it suitable for applications that require low-latency processing (see the streaming sketch below).
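
As a sketch of the low-latency side, a minimal Structured Streaming job reads lines from a socket and prints running results to the console; the host and port are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingSketch").getOrCreate()

# Placeholder source: a text socket on localhost:9999 (e.g., fed by `nc -lk 9999`).
lines = (spark.readStream
              .format("socket")
              .option("host", "localhost")
              .option("port", 9999)
              .load())

# Print each micro-batch to the console as it arrives.
query = lines.writeStream.format("console").start()
query.awaitTermination()
```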
Apache Spark
• Contrary to common belief, Spark is not a modified version of Hadoop, and it is not really dependent on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
• Spark can use Hadoop in two ways: one is storage and the second is processing. Since Spark has its own cluster-management computation, it typically uses Hadoop for storage only (see the sketch below).
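
Reading from and writing to HDFS illustrates this storage-only relationship; the NameNode address and paths below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFSStorage").getOrCreate()

# Spark does the computation; HDFS only stores the bytes.
raw = spark.read.text("hdfs://namenode:8020/data/raw/events.txt")
raw.write.mode("overwrite").parquet("hdfs://namenode:8020/data/curated/events")
```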
Spark Built on Hadoop

• There are three ways of deploying Spark, as explained below (a configuration sketch follows the list).

• Standalone − A Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System), with space allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
• Hadoop YARN − A Hadoop YARN deployment means, simply, that Spark runs on YARN without any pre-installation or root access required. It helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.
• Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.
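
Which deployment a job targets usually comes down to the master URL passed at session creation. A sketch of the two cluster-manager choices, with hypothetical host names (running on YARN additionally assumes Hadoop's configuration is on the classpath):

```python
from pyspark.sql import SparkSession

# Standalone: point at Spark's own master process (hypothetical host).
spark_standalone = (SparkSession.builder
                    .appName("StandaloneApp")
                    .master("spark://master-host:7077")
                    .getOrCreate())
spark_standalone.stop()

# Hadoop YARN: let YARN manage the cluster's resources instead.
spark_on_yarn = (SparkSession.builder
                 .appName("YarnApp")
                 .master("yarn")
                 .getOrCreate())
spark_on_yarn.stop()
```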
Spark Use Cases
Spark Architecture
Spark Buzzwords
