Introduction
Apache Spark is an integrated, fast, in-memory, general-purpose engine for large-scale data processing.
Spark is ideal for iterative and interactive processing tasks on large data sets and streams.
Spark achieves 10–100x performance over Hadoop by operating with an in-memory data construct called
Resilient Distributed Datasets (RDDs).
It helps avoid the latencies involved in disk reads and writes. Spark also offers built-in libraries for
Machine Learning, Graph Processing, Stream Processing, and SQL to deliver fast data processing along
with high programmer productivity.
Spark is compatible with Hadoop file systems and tools.
Spark enables applications in Hadoop clusters to run up to 100 times faster in memory and 10 times faster
even when running on disk.
History
Apache Spark was developed in 2009 in UC Berkeley’s AMPLab, open sourced in 2010, and later became
an Apache project.
It can process data from a variety of data repositories, including the Hadoop Distributed File System
(HDFS), and NoSQL databases such as HBase and Cassandra.
It also supports in-memory data sharing across DAGs, so that different jobs can work with the same data.
On top of the core engine, Spark provides higher-level libraries: a SQL query engine, a library of machine
learning algorithms, a graph processing system, and streaming data processing software.
Resilient Distributed Datasets (RDDs)
RDDs (Resilient Distributed Datasets) are a distributed memory abstraction,
motivated by two types of applications that current computing frameworks handle inefficiently:
o Iterative algorithms, and interactive data mining tools.
o In both cases, keeping data in memory can improve performance by an order of magnitude.
RDDs are immutable, partitioned collections of records.
RDDs can only be created by:
(a) Reading data from stable storage such as HDFS, or
(b) Transformations on existing RDDs.
The two major types of operations on RDDs are:
o Transformations: return a new, modified RDD based on the original.
Examples of transformations in the Spark API include map(), filter(), sample(), and union().
o Actions: return a value based on some computation performed on an RDD.
Examples of actions supported by the Spark API include reduce(), count(), first(), and foreach();
both kinds of operations are illustrated in the short sketch below.
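A minimal PySpark sketch of the difference (it assumes an existing SparkContext named sc, for example
from the pyspark shell): transformations are lazy and only describe a new RDD, while actions actually run
the computation and return a value to the driver.

    # Transformations are lazy: each call only builds a new RDD, nothing runs yet
    numbers = sc.parallelize([1, 2, 3, 4, 5, 6])
    doubled = numbers.map(lambda x: x * 2)           # map() - transformation
    evens = doubled.filter(lambda x: x % 4 == 0)     # filter() - transformation

    # Actions trigger the computation and return a value to the driver
    print(evens.count())                             # count() - action, prints 3
    print(evens.first())                             # first() - action, prints 4
    print(doubled.reduce(lambda a, b: a + b))        # reduce() - action, prints 42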
Apache Spark master-slave architecture
Apache Spark works in a master-slave architecture where the master is called “Driver” and slaves are
called “Workers”.
When you run a Spark application, the Spark Driver creates a context that serves as the entry point to your
application; all operations (transformations and actions) are executed on worker nodes, while the
resources are managed by the Cluster Manager.
Cluster Manager Types
o Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
o Apache Mesos – a cluster manager that can also run Hadoop MapReduce and PySpark
applications.
o Hadoop YARN – the resource manager in Hadoop 2; this is the most commonly used cluster manager.
o Kubernetes – an open-source system for automating deployment, scaling, and management of
containerized applications.
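For illustration, the cluster manager listed above is chosen through the master URL when the application
is created or submitted; the hosts and ports below are placeholders, not real endpoints.

    from pyspark.sql import SparkSession

    # The master URL decides which cluster manager runs the application:
    #   "local[*]"               - run locally, no cluster manager (useful for testing)
    #   "spark://host:7077"      - Spark Standalone master
    #   "yarn"                   - Hadoop YARN (resolved via the Hadoop configuration)
    #   "mesos://host:5050"      - Apache Mesos
    #   "k8s://https://host:443" - Kubernetes API server
    spark = (SparkSession.builder
             .appName("cluster-manager-demo")
             .master("local[*]")   # swap in one of the URLs above for a real cluster
             .getOrCreate())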
Working of the Cluster
o The Spark architecture involves several steps that enable distributed data processing
and computation on large-scale datasets.
o Below are the key components of the Spark architecture and how they work together:
Driver Program:
o The Spark application starts with a Driver Program, the main entry point of the application,
which runs on the master node.
o The Driver Program is responsible for creating the SparkContext, which serves as the entry
point to any Spark functionality.
o Responsible for defining the application's logic, transformations, and actions.
o Manages communication with the SparkContext to coordinate task execution.
o Submits Spark applications to the cluster manager for execution.
o Handles task aggregation and final result processing.
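A minimal sketch of a driver program that exercises these responsibilities (the file name and input path are
hypothetical): it creates the entry point, defines transformations and actions, and receives the final result
back at the driver.

    # wordcount_driver.py -- submitted to the cluster with: spark-submit wordcount_driver.py
    from operator import add
    from pyspark.sql import SparkSession

    if __name__ == "__main__":
        # The driver creates the entry point (SparkSession / SparkContext)
        spark = SparkSession.builder.appName("WordCount").getOrCreate()
        sc = spark.sparkContext

        # The driver only defines the logic; the work itself runs on the executors
        lines = sc.textFile("hdfs:///data/input.txt")        # placeholder path
        counts = (lines.flatMap(lambda line: line.split())
                       .map(lambda word: (word, 1))
                       .reduceByKey(add))

        # The action pulls the aggregated result back to the driver
        for word, n in counts.collect():
            print(word, n)

        spark.stop()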
SparkContext
o It acts as the connection to the Spark cluster and coordinates the execution of tasks on the cluster.
o The SparkContext splits the data into partitions and schedules tasks to be executed on worker
nodes.
o SparkContext is the heart of Apache Spark, serving as the primary interface to the cluster; it
coordinates the execution of tasks and manages distributed data processing.
o Responsible for data partitioning, resource allocation, and fault tolerance.
o Provides essential methods for creating RDDs, which are the fundamental data structure in
Spark.
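A short sketch of SparkContext in use (the paths are placeholders): it creates RDDs and controls how the
data is partitioned across the cluster.

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("sparkcontext-demo").setMaster("local[4]")
    sc = SparkContext(conf=conf)

    # Create an RDD from an in-memory collection, split into 8 partitions
    rdd = sc.parallelize(range(1000), 8)
    print(rdd.getNumPartitions())   # -> 8

    # Create an RDD from stable storage (HDFS path is a placeholder)
    logs = sc.textFile("hdfs:///data/logs/*.log", minPartitions=16)

    sc.stop()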
Cluster Manager
o The Cluster Manager is responsible for resource management and job scheduling in Spark
applications; it allocates and manages resources across the cluster.
o The Spark application requests resources from the cluster manager, which allocates the
necessary resources (CPU cores, memory) to the different applications running on the cluster.
o Manages the distribution of tasks to worker nodes based on available resources and data locality.
o Monitors and ensures fault tolerance by restarting failed tasks on other nodes.
o Common cluster managers used with Spark are Apache Mesos, Hadoop YARN, Kubernetes, and
Spark Standalone mode.
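How much CPU and memory the cluster manager hands to an application is driven by a few standard
configuration properties; a small sketch (the values are purely illustrative):

    from pyspark.sql import SparkSession

    # Resource requests the cluster manager uses when allocating executors
    spark = (SparkSession.builder
             .appName("resource-demo")
             .config("spark.executor.instances", "4")   # number of executors
             .config("spark.executor.cores", "2")       # CPU cores per executor
             .config("spark.executor.memory", "4g")     # memory per executor
             .config("spark.driver.memory", "2g")       # memory for the driver
             .getOrCreate())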
Worker Nodes:
o Worker Nodes are the machines (physical or virtual) in the cluster where tasks are executed.
o Each worker node runs a Spark executor process, which is responsible for executing tasks and
storing data in memory or disk.
o The number of worker nodes and the resources allocated to each node depend on the cluster
configuration.
DAG Scheduler:
o The Directed Acyclic Graph (DAG) Scheduler breaks down the logical execution plan (sequence
of transformations) into stages of tasks based on data dependencies and narrow transformations.
o It optimizes the execution plan to minimize data shuffling and improve performance.
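One way to observe the DAG Scheduler's work (assuming an existing SparkContext sc): narrow
transformations such as map() stay inside one stage, while a wide transformation such as reduceByKey()
introduces a shuffle and therefore a new stage.

    pairs = (sc.parallelize(["a b a", "b c"])          # stage 1 starts here
               .flatMap(lambda line: line.split())     # narrow: stays in the same stage
               .map(lambda word: (word, 1)))           # narrow: stays in the same stage

    counts = pairs.reduceByKey(lambda a, b: a + b)     # wide: shuffle -> stage 2

    # toDebugString() shows the RDD lineage; the shuffle marks the stage boundary
    print(counts.toDebugString())
    print(counts.collect())   # e.g. [('a', 2), ('b', 2), ('c', 1)]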
Task Scheduler:
o The Task Scheduler is responsible for launching tasks on worker nodes in a distributed manner.
o It ensures that tasks are scheduled efficiently across the cluster based on data locality, resource
availability, and task dependencies.
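Scheduling behaviour can be tuned through ordinary Spark configuration; for example (the values shown
are the usual defaults), how long the Task Scheduler waits for a data-local slot before falling back to a less
local one, and how many times a task is retried:

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("scheduling-demo")
             .config("spark.locality.wait", "3s")     # wait for a data-local slot
             .config("spark.task.maxFailures", "4")   # task retries before the stage fails
             .getOrCreate())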
Executors:
o Executors are worker processes running on worker nodes.
o Each executor runs tasks in separate threads and holds data in memory or disk storage.
o Executors communicate with the Driver Program and the Cluster Manager to execute tasks and
manage resources.
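The fact that executors hold data in memory or on disk is what caching exposes to the programmer; a small
sketch (assuming an existing SparkContext sc; the path is a placeholder):

    from pyspark import StorageLevel

    lines = sc.textFile("hdfs:///data/events.log")        # placeholder path
    parsed = lines.map(lambda line: line.split(","))

    # Ask the executors to keep the partitions in memory, spilling to disk if needed
    parsed.persist(StorageLevel.MEMORY_AND_DISK)

    parsed.count()   # first action materialises and caches the data on the executors
    parsed.count()   # later actions reuse the cached partitions instead of re-reading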
SPARK ECOSYSTEM
Spark is an integrated stack of tools responsible for scheduling, distributing, and monitoring
applications consisting of many computational tasks across many worker machines, or a computing
cluster.
Spark is written primarily in Scala, but provides APIs in Python, Java, R, and other languages.
Spark comes with a set of integrated tools that reduce learning time and deliver higher user productivity.
The Spark ecosystem includes resource managers such as Mesos, along with the libraries described below.
MLlib
o MLlib is Spark's machine learning library.
o It offers basic ML algorithms: classification, regression, clustering, collaborative filtering, and
dimensionality reduction.
o Provides lower-level optimization primitives and higher-level pipeline APIs for streamlined ML
workflows.
o RDDs help Spark excel at iterative computation, thus enabling MLlib to run fast.
o In addition, Spark MLlib is easy to use and supports APIs in Scala, Java, Python, and R (via SparkR).
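A compact sketch of the higher-level pipeline API mentioned above; the column names and toy data are
made up for illustration.

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

    # Toy training data: two numeric features and a binary label
    train = spark.createDataFrame(
        [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.5, 3.0, 1.0), (0.5, 0.3, 0.0)],
        ["f1", "f2", "label"])

    # Pipeline: assemble the feature vector, then fit a classifier
    assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
    lr = LogisticRegression(maxIter=10)

    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.transform(train).select("f1", "f2", "prediction").show()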
Spark GraphX
o GraphX is Spark’s component for graphs and graph-parallel computation.
o It extends Spark RDD by introducing a new Graph abstraction, representing a directed multi-graph
with properties on vertices and edges.
o GraphX offers fundamental graph operators like subgraph, joinVertices, and aggregateMessages,
based on an optimized variant of the Pregel API (Google’s large-scale graph processing system).
o It includes a collection of graph algorithms and builders to simplify graph analytics tasks.
o GraphX enables efficient and scalable graph processing within the Spark ecosystem.
SparkR
o Facilitates distributed data processing in R.
o It uses Spark’s distributed computation engine to run large-scale data analysis from the R shell.
Spark SQL
o Spark SQL is Spark’s module for working with structured data.
o It allows running queries on data to obtain meaningful results.
o Supports SQL and HiveQL for querying and processing data.
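A minimal example (the table and column names are made up) of registering structured data and querying
it with both SQL and the DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sql-demo").getOrCreate()

    people = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"])

    # Register the DataFrame as a temporary view and query it with SQL
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 30").show()

    # The same query through the DataFrame API
    people.filter(people.age > 30).select("name", "age").show()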
Spark Streaming
o Spark Streaming processes real-time data streams from input sources in a cluster.
o Data streams are chopped into small batches, each representing a few seconds of data.
o Spark treats each batch as an RDD and applies RDD operations to process and analyze the data, then
pushes the results out in batches to databases or dashboards.
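A small DStream sketch of the micro-batch model described above (the host and port are placeholders, and
an existing SparkContext sc is assumed):

    from pyspark.streaming import StreamingContext

    # Each micro-batch covers 2 seconds of data
    ssc = StreamingContext(sc, batchDuration=2)

    # Input source: a text stream on a socket (host/port are placeholders)
    lines = ssc.socketTextStream("localhost", 9999)

    # Each batch is processed with ordinary RDD-style operations
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))

    counts.pprint()     # push each batch of results to the console

    ssc.start()
    ssc.awaitTermination()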
SPARK APPLICATIONS
Some real use cases that are solved well by a tool like Apache Spark include:
1. Real-time Log Data monitoring.
2. Massive Natural Language Processing.
3. Large Scale Online Recommendation Systems.